
extract-pdf

Overview

Extract content from one or more PDF files and convert it to compliant markdown with YAML frontmatter. Supports OCR for scanned documents, password-protected files, template-based field extraction, and batch processing of entire folders.

Usage

python main.py extract-pdf --source SOURCE [OPTIONS]

Options

| Option | Description | Default |
|---|---|---|
| --source | Path to PDF file or folder of PDFs (required) | -- |
| --output | Output directory | _output |
| --title | Custom document title | From PDF metadata |
| --tags | Comma-separated tags for frontmatter | pdf,extracted |
| --category | Document category | KB Article |
| --no-images | Disable image extraction | false |
| --ocr / --no-ocr | Enable/disable OCR | Auto-detect |
| --ocr-language | OCR language code (e.g., eng, deu) | eng |
| --password | Password for encrypted PDF | -- |
| --password-prompt | Prompt for password securely | false |
| --template | Extraction template YAML for field extraction | -- |
| --template-repo | Template repository directory | -- |
| --auto-classify | Auto-classify document and select template | false |
| --format | Output format: markdown, json | markdown |
| --aggregate | Aggregate batch results into single JSON | false |

Prerequisites

  • Repo: content-conductor
  • Install: pip install -r requirements.txt from repo root
  • OCR (optional): pip install -r requirements-ocr.txt, plus the Tesseract engine installed on the system
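
Before forcing OCR, it can help to confirm that the Tesseract engine is actually installed. The following is a minimal sketch, assuming the engine is exposed as an executable named `tesseract` on the PATH; the helper name `tesseract_available` is hypothetical and not part of content-conductor.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH.

    Assumption: the OCR engine is installed as a `tesseract` binary.
    """
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found; --ocr should be usable.")
else:
    print("Tesseract not found; install it before passing --ocr.")
```

This only checks that the binary is discoverable. It does not verify that the language data for a given --ocr-language code is installed.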

Examples

Extract a single PDF

python main.py extract-pdf --source document.pdf

Batch extract all PDFs in a folder

python main.py extract-pdf --source ./pdf_folder/ --output ./converted/

Force OCR on a scanned document

python main.py extract-pdf --source scanned.pdf --ocr

Template-based extraction with auto-classification

python main.py extract-pdf --source ./invoices/ --template-repo ./templates/ --auto-classify
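
When the command is driven from a script, the invocations above can be assembled programmatically. The sketch below is illustrative only: `build_extract_command` is a hypothetical helper, and it uses only flags listed in the Options table. It returns an argument list and does not run anything.

```python
def build_extract_command(source, output=None, ocr=None, template_repo=None,
                          auto_classify=False, fmt=None, aggregate=False):
    """Build the argument list for `python main.py extract-pdf`.

    ocr=None leaves OCR on auto-detect; True adds --ocr, False adds --no-ocr.
    """
    cmd = ["python", "main.py", "extract-pdf", "--source", str(source)]
    if output is not None:
        cmd += ["--output", str(output)]
    if ocr is True:
        cmd.append("--ocr")
    elif ocr is False:
        cmd.append("--no-ocr")
    if template_repo is not None:
        cmd += ["--template-repo", str(template_repo)]
    if auto_classify:
        cmd.append("--auto-classify")
    if fmt is not None:
        cmd += ["--format", fmt]
    if aggregate:
        cmd.append("--aggregate")
    return cmd


# Equivalent to the template-based example above.
print(build_extract_command("./invoices/", template_repo="./templates/",
                            auto_classify=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.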

Output

Creates a timestamped subdirectory with markdown file(s) and extracted images. Each file includes YAML frontmatter with title, tags, category, and extraction metadata.
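
To inspect the frontmatter of a generated file, something like the following could be used. It is a rough sketch under two assumptions: the frontmatter is delimited by `---` lines (the usual YAML frontmatter convention), and the fields of interest are simple top-level `key: value` pairs. The sample document and its values are made up for illustration; the real layout of the extraction metadata may differ, and a full YAML parser is the better choice for nested fields.

```python
def split_frontmatter(text):
    """Split markdown text into (frontmatter_lines, body).

    Returns ([], text) if no complete `---` delimited block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text  # opening delimiter without a closing one


def top_level_fields(frontmatter_lines):
    """Collect simple top-level `key: value` pairs; skip nested or list lines."""
    fields = {}
    for line in frontmatter_lines:
        if line[:1] in (" ", "\t", "#", "-") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample; real output may carry additional metadata keys.
sample = """---
title: Example Document
tags: pdf,extracted
category: KB Article
---
# Example Document

Body text.
"""

frontmatter, body = split_frontmatter(sample)
print(top_level_fields(frontmatter))
```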