extract-pdf
Overview
Extract content from one or more PDF files and convert it to compliant markdown with YAML frontmatter. Supports OCR for scanned documents, password-protected files, template-based field extraction, and batch processing of entire folders.
Usage
python main.py extract-pdf --source SOURCE [OPTIONS]
Options
| Option | Description | Default |
|---|---|---|
| --source | Path to PDF file or folder of PDFs (required) | -- |
| --output | Output directory | _output |
| --title | Custom document title | From PDF metadata |
| --tags | Comma-separated tags for frontmatter | pdf,extracted |
| --category | Document category | KB Article |
| --no-images | Disable image extraction | false |
| --ocr / --no-ocr | Enable/disable OCR | Auto-detect |
| --ocr-language | OCR language code (e.g., eng, deu) | eng |
| --password | Password for encrypted PDF | -- |
| --password-prompt | Prompt for password securely | false |
| --template | Extraction template YAML for field extraction | -- |
| --template-repo | Template repository directory | -- |
| --auto-classify | Auto-classify document and select template | false |
| --format | Output format: markdown, json | markdown |
| --aggregate | Aggregate batch results into single JSON | false |
Prerequisites
- Repo: content-conductor
- Install: pip install -r requirements.txt from repo root
- OCR (optional): pip install -r requirements-ocr.txt + Tesseract engine
Examples
Extract a single PDF
python main.py extract-pdf --source document.pdf
Batch extract all PDFs in a folder
python main.py extract-pdf --source ./pdf_folder/ --output ./converted/
Force OCR on a scanned document
python main.py extract-pdf --source scanned.pdf --ocr
Template-based extraction with auto-classification
python main.py extract-pdf --source ./invoices/ --template-repo ./templates/ --auto-classify
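If the command needs to be driven from another script, the invocations above can be assembled programmatically. The following is a minimal sketch, not part of the tool itself: the helper names `build_extract_command` and `run_extract` are hypothetical, only flags listed in the Options table are used, and it assumes the command is run from the repo root and exits with a non-zero status on failure.

```python
import subprocess
import sys


def build_extract_command(source, output=None, ocr=None,
                          template_repo=None, auto_classify=False):
    """Build the argv list for `python main.py extract-pdf` (hypothetical helper).

    Only options documented in the Options table are emitted.
    """
    cmd = [sys.executable, "main.py", "extract-pdf", "--source", str(source)]
    if output is not None:
        cmd += ["--output", str(output)]
    # ocr=None leaves the documented default (auto-detect) in place.
    if ocr is True:
        cmd.append("--ocr")
    elif ocr is False:
        cmd.append("--no-ocr")
    if template_repo is not None:
        cmd += ["--template-repo", str(template_repo)]
    if auto_classify:
        cmd.append("--auto-classify")
    return cmd


def run_extract(repo_root, **options):
    """Run the command from the repo root (hypothetical helper).

    Assumption: a failed extraction produces a non-zero exit code,
    which check=True turns into CalledProcessError.
    """
    return subprocess.run(build_extract_command(**options),
                          cwd=repo_root, check=True)
```

For example, `build_extract_command("./invoices/", template_repo="./templates/", auto_classify=True)` produces the same arguments as the template-based example above.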
Output
Creates a timestamped subdirectory in the output directory (_output by default) containing the markdown file(s) and any extracted images. Each markdown file includes YAML frontmatter with title, tags, category, and extraction metadata.
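To check what ended up in a generated file, the frontmatter can be separated from the body with a few lines of Python. This is a rough sketch under stated assumptions: the frontmatter is taken to be the block between a leading `---` line and the next `---` line, the sample document and its exact key layout are hypothetical, and only top-level `key: value` pairs are read as raw strings. For anything beyond a quick inspection, a real YAML parser such as PyYAML's `yaml.safe_load` is the better choice.

```python
def split_frontmatter(text):
    """Split markdown text into (frontmatter_lines, body).

    Assumes frontmatter sits between a leading '---' line and the next
    '---' line; returns ([], text) if no such block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text


def read_simple_fields(frontmatter_lines):
    """Collect top-level 'key: value' pairs as raw strings.

    Indented lines, list items, and comments are skipped; values are
    not interpreted as YAML types.
    """
    fields = {}
    for line in frontmatter_lines:
        if line.startswith((" ", "\t", "-", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


# Hypothetical sample; the real files may lay out these keys differently.
sample = """---
title: Quarterly Report
category: KB Article
tags: [pdf, extracted]
---
# Quarterly Report
"""

frontmatter, body = split_frontmatter(sample)
fields = read_simple_fields(frontmatter)
print(fields["title"], "|", fields["category"])
```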
Related Commands
- cc extract-web -- extract web pages to markdown
- cc extract-youtube-auto -- extract YouTube video data