extract-web
Overview
Fetch web pages with a headless Playwright browser and convert them to Markdown. Optionally extract HTML tables to CSV. Supports concurrent fetching, URL lists from files, config profiles, and SC compliance formatting.
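The HTML-table-to-CSV step mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual implementation: the real extractor runs against the rendered page and may handle nested tables, colspans, and multiple tables differently.

```python
# Sketch: pull cell text out of the first <table> in an HTML string
# and serialize it as CSV, roughly what --csv does per table.
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects cell text from <tr>/<td>/<th> tags into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def table_to_csv(html: str) -> str:
    parser = TableExtractor()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()

html = "<table><tr><th>name</th><th>qty</th></tr><tr><td>apple</td><td>3</td></tr></table>"
print(table_to_csv(html))
# name,qty
# apple,3
```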
Usage
python main.py extract-web [OPTIONS]
Options
| Option | Description | Default |
|---|---|---|
| `--url` | Single URL to extract (one of `--url` / `--urls` is required) | -- |
| `--urls` | File with URLs, one per line | -- |
| `--output`, `-o` | Output directory | `_output` |
| `--profile` | Config profile from `config/web/profiles/` | -- |
| `--csv` | Extract HTML tables to CSV files | `false` |
| `--concurrency` | Max concurrent page fetches | `3` |
| `--sc-compliance` | Apply SC compliance rules | `false` |
| `--timeout` | Page load timeout in milliseconds | `30000` |
| `--wait-for` | Wait strategy: `networkidle`, `load`, `domcontentloaded` | `networkidle` |
| `--follow-links` | Follow child links sharing the URL prefix (depth 1) | `false` |
| `--consolidate` | Merge all pages into a single markdown file with a TOC | `false` |
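How `--concurrency` bounds parallel fetches can be sketched with an `asyncio.Semaphore`: at most N pages are in flight at once. Here `fetch_page` is a stand-in for the real Playwright call (`page.goto` plus content extraction), so this runs without a browser; the real tool's internals may differ.

```python
# Sketch: cap concurrent page fetches with a semaphore, as the
# --concurrency option (default 3) suggests the tool does.
import asyncio

async def fetch_page(url: str) -> str:
    # Placeholder for the real headless-Chromium fetch + markdown
    # conversion; here it just returns a labeled string.
    await asyncio.sleep(0)
    return f"# Markdown for {url}"

async def fetch_all(urls, concurrency=3):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:          # at most `concurrency` in flight
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
print(results)
```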
Prerequisites
- Repo: content-conductor
- Install:
- Install: pip install -r requirements.txt from the repo root
- Browser: pip install playwright && python -m playwright install chromium
Examples
Extract a single URL
python main.py extract-web --url https://example.com/page
Multiple URLs with CSV table extraction
python main.py extract-web --urls urls.txt --csv --output docs/extracted/
With SC compliance and config profile
python main.py extract-web --url https://example.com --profile ghl-help --sc-compliance
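Profiles in `config/web/profiles/` bundle per-site settings so they don't have to be repeated on the command line. The schema is not documented here; the fragment below is a purely hypothetical sketch, and every key name in it is an assumption, not the tool's actual format.

```yaml
# Hypothetical profile sketch (e.g. config/web/profiles/ghl-help.yaml).
# Key names are illustrative only -- check the real profiles for the
# actual schema.
timeout: 45000
wait_for: networkidle
concurrency: 2
csv: true
```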
Related Commands
cc extract-pdf -- extract PDF files to markdown