extract-web
Overview
Fetch web pages with a headless Playwright browser and convert them to Markdown. Optionally extract HTML tables to CSV. Supports concurrent fetching, URL lists from files, config profiles, and SC compliance formatting.
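The HTML-table-to-CSV step mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual implementation: the real extractor runs against the rendered page and may handle nested tables, colspans, and multiple tables differently.

```python
# Sketch: pull cell text out of the first <table> in an HTML string
# and serialize it as CSV, roughly what --csv does per table.
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects cell text from <tr>/<td>/<th> tags into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def table_to_csv(html: str) -> str:
    parser = TableExtractor()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()

html = "<table><tr><th>name</th><th>qty</th></tr><tr><td>apple</td><td>3</td></tr></table>"
print(table_to_csv(html))
# name,qty
# apple,3
```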
Usage
python main.py extract-web [OPTIONS]
Options
| Option | Description | Default |
|---|---|---|
| `--url` | Single URL to extract (one of `--url` / `--urls` is required) | -- |
| `--urls` | File with URLs, one per line | -- |
| `--output`, `-o` | Output directory | `_output` |
| `--profile` | Config profile from `config/web/profiles/` | -- |
| `--csv` | Extract HTML tables to CSV files | `false` |
| `--concurrency` | Max concurrent page fetches | `3` |
| `--sc-compliance` | Apply SC compliance rules | `false` |
| `--timeout` | Page load timeout in milliseconds | `30000` |
| `--wait-for` | Wait strategy: `networkidle`, `load`, `domcontentloaded` | `networkidle` |
| `--follow-links` | Follow child links sharing the URL prefix (depth 1) | `false` |
| `--consolidate` | Merge all pages into a single markdown file with a TOC | `false` |
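How `--concurrency` bounds parallel fetches can be sketched with an `asyncio.Semaphore`: at most N pages are in flight at once. Here `fetch_page` is a stand-in for the real Playwright call (`page.goto` plus content extraction), so this runs without a browser; the real tool's internals may differ.

```python
# Sketch: cap concurrent page fetches with a semaphore, as the
# --concurrency option (default 3) suggests the tool does.
import asyncio

async def fetch_page(url: str) -> str:
    # Placeholder for the real headless-Chromium fetch + markdown
    # conversion; here it just returns a labeled string.
    await asyncio.sleep(0)
    return f"# Markdown for {url}"

async def fetch_all(urls, concurrency=3):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:          # at most `concurrency` in flight
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
print(results)
```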
Prerequisites
- Repo: content-conductor
- Install:
- Install: pip install -r requirements.txt from the repo root
- Browser: pip install playwright && python -m playwright install chromium
Examples
Extract a single URL
python main.py extract-web --url https://example.com/page
Multiple URLs with CSV table extraction
python main.py extract-web --urls urls.txt --csv --output docs/extracted/
With SC compliance and config profile
python main.py extract-web --url https://example.com --profile ghl-help --sc-compliance
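Profiles in `config/web/profiles/` bundle per-site settings so they don't have to be repeated on the command line. The schema is not documented here; the fragment below is a purely hypothetical sketch, and every key name in it is an assumption, not the tool's actual format.

```yaml
# Hypothetical profile sketch (e.g. config/web/profiles/ghl-help.yaml).
# Key names are illustrative only -- check the real profiles for the
# actual schema.
timeout: 45000
wait_for: networkidle
concurrency: 2
csv: true
```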
Related Commands
cc extract-pdf -- extract PDF files to markdown