extract-web

Overview

Fetches web pages with a headless Playwright browser and converts them to Markdown. Optionally extracts HTML tables to CSV files. Supports concurrent fetching, URL lists read from files, config profiles, and SC compliance formatting.
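The combination of a URL list file and a concurrency cap can be pictured with a minimal sketch. This is not the command's actual implementation: `read_urls`, `fetch_all`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real Playwright page fetch so the example runs without a browser or network access.

```python
import asyncio
from pathlib import Path


def read_urls(path):
    """Return the non-empty, stripped lines of a URL list file (one URL per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


async def fetch_all(urls, fetch, concurrency=3):
    """Run fetch(url) for every URL with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await fetch(url)

    # asyncio.gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real page fetch; performs no network I/O."""
    await asyncio.sleep(0.01)
    return f"# {url}"
```

A call such as `asyncio.run(fetch_all(read_urls("urls.txt"), fake_fetch, concurrency=3))` would then process the whole list three URLs at a time. Whether the real command skips blank lines in the same way is an assumption.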

Usage

python main.py extract-web [OPTIONS]

Options

Option           Description                                             Default
--url            Single URL to extract (one of --url / --urls required)  --
--urls           File with URLs, one per line                            --
--output, -o     Output directory                                        _output
--profile        Config profile from config/web/profiles/                --
--csv            Extract HTML tables to CSV files                        false
--concurrency    Max concurrent page fetches                             3
--sc-compliance  Apply SC compliance rules                               false
--timeout        Page load timeout in milliseconds                       30000
--wait-for       Wait strategy: networkidle, load, domcontentloaded      networkidle
--follow-links   Follow child links sharing URL prefix (depth=1)         false
--consolidate    Merge all pages into single markdown with TOC           false
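To make the --follow-links description concrete, here is a minimal sketch of selecting child links that share a URL prefix. It assumes "sharing URL prefix" means a plain string-prefix match against the starting URL, and that depth=1 means only links found on the starting page are followed, with no further recursion. The function name `child_links` is hypothetical, and the tool's real matching rule may differ.

```python
from urllib.parse import urldefrag, urljoin


def child_links(page_url, hrefs, prefix=None):
    """Return absolute links from `hrefs` that start with `prefix`.

    Assumption: the prefix defaults to the starting page URL and is
    compared as a plain string; the real tool may use a different rule.
    """
    prefix = prefix or page_url
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute == page_url or absolute in seen:
            continue  # skip self-links and duplicates
        if absolute.startswith(prefix):
            seen.add(absolute)
            result.append(absolute)
    return result
```

For a starting page of `https://example.com/docs/`, a relative link `intro` and a root-relative link `/docs/setup` would both be kept, while `https://example.com/blog/post` would be dropped because it falls outside the prefix.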

Prerequisites

  • Repo: content-conductor
  • Install: run pip install -r requirements.txt from the repo root
  • Browser: pip install playwright && python -m playwright install chromium

Examples

Extract a single URL

python main.py extract-web --url https://example.com/page

Multiple URLs with CSV table extraction

python main.py extract-web --urls urls.txt --csv --output docs/extracted/
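The urls.txt file passed to --urls holds one URL per line. A small helper like the following could generate it from a list. This is only a sketch: `write_url_list` is a hypothetical name and is not part of the tool, and the de-duplication step is a convenience added here, not a documented requirement.

```python
from pathlib import Path


def write_url_list(urls, path="urls.txt"):
    """Write unique, non-empty URLs to `path`, one per line, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

Calling `write_url_list(["https://example.com/a", "https://example.com/b"])` would produce a two-line urls.txt in the current directory, ready for the command above.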

With SC compliance and config profile

python main.py extract-web --url https://example.com --profile ghl-help --sc-compliance