`extract-pdf`

Overview

Extract content from one or more PDF files and convert it to compliant markdown with YAML frontmatter. Supports OCR for scanned documents, password-protected files, template-based field extraction, and batch processing of entire folders.

Usage

python main.py extract-pdf --source SOURCE [OPTIONS]

Options

Option	Description	Default
`--source`	Path to PDF file or folder of PDFs (required)	--
`--output`	Output directory	`_output`
`--title`	Custom document title	From PDF metadata
`--tags`	Comma-separated tags for frontmatter	`pdf,extracted`
`--category`	Document category	`KB Article`
`--no-images`	Disable image extraction	`false`
`--ocr / --no-ocr`	Enable/disable OCR	Auto-detect
`--ocr-language`	OCR language code (e.g., `eng`, `deu`)	`eng`
`--password`	Password for encrypted PDF	--
`--password-prompt`	Prompt for password securely	`false`
`--template`	Extraction template YAML for field extraction	--
`--template-repo`	Template repository directory	--
`--auto-classify`	Auto-classify document and select template	`false`
`--format`	Output format: `markdown`, `json`	`markdown`
`--aggregate`	Aggregate batch results into single JSON	`false`
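
When the command is driven from another script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of the tool itself: the helper name `build_extract_pdf_command` and its parameters are hypothetical, and it only covers a handful of the documented flags.

```python
import shlex


def build_extract_pdf_command(source, output=None, tags=None, ocr=None,
                              ocr_language=None, python="python"):
    """Build the argv list for `python main.py extract-pdf` (hypothetical helper).

    ocr=True adds --ocr, ocr=False adds --no-ocr, ocr=None leaves auto-detect.
    """
    cmd = [python, "main.py", "extract-pdf", "--source", str(source)]
    if output is not None:
        cmd += ["--output", str(output)]
    if tags:
        # The --tags option takes a single comma-separated string.
        cmd += ["--tags", ",".join(tags)]
    if ocr is True:
        cmd.append("--ocr")
    elif ocr is False:
        cmd.append("--no-ocr")
    if ocr_language is not None:
        cmd += ["--ocr-language", ocr_language]
    return cmd


cmd = build_extract_pdf_command(
    "./pdf_folder/", output="./converted/", ocr=True, ocr_language="deu"
)
# Print a shell-safe form of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root. Building a list instead of a single string avoids shell-quoting problems with paths that contain spaces.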

Prerequisites

Repo: `content-conductor`
Install: run `pip install -r requirements.txt` from the repo root
OCR (optional): run `pip install -r requirements-ocr.txt` and install the Tesseract engine
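
Before forcing OCR, it can help to confirm that the Tesseract engine is installed. The sketch below assumes the engine is located through `PATH` under its usual binary name `tesseract`; how `extract-pdf` itself finds the engine is not documented here, and the function name `ocr_engine_available` is hypothetical.

```python
import shutil


def ocr_engine_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not ocr_engine_available():
    print("Tesseract not found on PATH; OCR will likely be unavailable.")
```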

Examples

Extract a single PDF

python main.py extract-pdf --source document.pdf

Batch extract all PDFs in a folder

python main.py extract-pdf --source ./pdf_folder/ --output ./converted/

Force OCR on a scanned document

python main.py extract-pdf --source scanned.pdf --ocr

Template-based extraction with auto-classification

python main.py extract-pdf --source ./invoices/ --template-repo ./templates/ --auto-classify

Output

Creates a timestamped subdirectory with markdown file(s) and extracted images. Each file includes YAML frontmatter with title, tags, category, and extraction metadata.
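
To consume the generated files downstream, the frontmatter can be separated from the markdown body. This is a minimal sketch assuming the frontmatter is delimited by `---` lines and holds flat `key: value` pairs; the sample document and its field values are hypothetical, and nested YAML would need a real YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown string into (frontmatter fields, body).

    Only flat `key: value` lines are handled. Text without a complete
    frontmatter block is returned unchanged with an empty dict.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat as plain markdown
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return fields, body


# Hypothetical sample; real field names and layout may differ.
SAMPLE = """---
title: Quarterly Report
tags: pdf,extracted
category: KB Article
---
# Quarterly Report

Body text.
"""

fields, body = split_frontmatter(SAMPLE)
print(fields["title"])
```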

Related Commands

`cc extract-web` -- extract web pages to markdown
`cc extract-youtube-auto` -- extract YouTube video data
