Fetch one or more public webpages with Scrapling, extract the main content, and convert the HTML into Markdown using html2text. Supports static HTTP fetching, concurrent async batches, stealth browsing for protected sites, and dynamic rendering for JavaScript-heavy apps.
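Under the hood, the HTML-to-Markdown step is a thin wrapper around html2text. A minimal sketch of that conversion, assuming html2text's standard options (the function name and option mapping here are illustrative, not the script's actual code):

```python
# Illustrative sketch only: assumes html2text's standard options; the real
# script's configuration may differ.
import html2text

def to_markdown(html: str, preserve_links: bool = False) -> str:
    h = html2text.HTML2Text()
    h.ignore_links = not preserve_links  # --preserve-links keeps hyperlinks
    h.body_width = 0                     # disable hard line wrapping
    return h.handle(html)                # HTML in, Markdown out
```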
Use this skill when the user wants to:

- Fetch one or more public webpages and convert them to Markdown
- Extract the main content area of a page (optionally via a CSS selector)
- Scrape sites behind anti-bot protection or heavy client-side rendering
- Save scraped pages as .md files

| Mode | Fetcher Class | Best For |
|---|---|---|
| `http` (default) | `Fetcher` | Fast static pages, RSS, APIs |
| `async` | `AsyncFetcher` | Batches of 5+ static URLs in parallel |
| `stealth` | `StealthyFetcher` | Anti-bot sites, Cloudflare, fingerprint checks |
| `dynamic` | `PlayWrightFetcher` | Heavy SPAs, React/Vue/Angular apps |
Decision rule: Start with `http`. If you get a 403, a CAPTCHA, or an empty body, switch to `stealth`. If the content is rendered client-side (empty on first load), use `dynamic`. Use `async` when scraping many static URLs at once to save time.
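The escalation can also be scripted. A minimal sketch, assuming the script is invoked from its own directory and prints the JSON schema documented below (the `scrape` helper is hypothetical):

```python
# Sketch: apply the decision rule automatically. Assumes scrape_to_markdown.py
# is in the working directory and emits the JSON documented below.
import json
import subprocess

def scrape(url: str, mode: str = "http") -> dict:
    proc = subprocess.run(
        ["python3", "scrape_to_markdown.py", "--url", url, "--mode", mode],
        capture_output=True, text=True,
    )
    return json.loads(proc.stdout)

result = scrape("https://example.com")
if not result["ok"]:
    # 403, CAPTCHA, or empty body: retry with the stealth fetcher
    result = scrape("https://example.com", mode="stealth")
```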
- `--url URL` — one target URL (repeat the flag for multiple: `--url A --url B`)
- `--url-file FILE` — plain-text file with one URL per line
- `--mode http|async|stealth|dynamic` — fetcher backend (default: `http`)
- `--selector CSS` — CSS selector for the main content area (omit = full page)
- `--preserve-links` — keep hyperlinks in the Markdown output
- `--output-dir DIR` — save per-page `.md` files and a master `index.json` here
- `--auto-save` — fingerprint and persist selected elements to the local DB on first run
- `--auto-match` — on subsequent runs, find elements by fingerprint even if the site layout has changed (no need to update the CSS selector)
- `--headless true|false|virtual` — headless mode; `virtual` uses Xvfb (default: `true`)
- `--network-idle` — wait until no network activity for ≥500 ms before capturing
- `--block-images` — block image loading (saves bandwidth and proxy quota)
- `--disable-resources` — drop fonts/images/media/stylesheets for ~25% faster loads
- `--wait-selector CSS` — pause until this element appears in the DOM
- `--wait-selector-state attached|visible|detached|hidden` — element state to wait for (default: `attached`)
- `--timeout MS` — global timeout in ms (default: 30000)
- `--wait MS` — extra idle wait after page load, in ms
- `--humanize SECONDS` — simulate human-like cursor movement (max duration in seconds)
- `--geoip` — spoof browser timezone, locale, language, and WebRTC IP from proxy geolocation
- `--block-webrtc` — prevent real-IP leaks via WebRTC
- `--disable-ads` — install uBlock Origin in the browser session
- `--proxy URL` — HTTP/SOCKS proxy as a URL string, or JSON: `'{"server":"host:port","username":"u","password":"p"}'`
- `--retry N` — retry failed requests up to N times with exponential backoff (max 30 s)

Guidelines:

- Only fetch `http://` or `https://` pages.
- Use `--selector` to target the content area.
- When `--auto-save` is used, always also pass `--selector` so Scrapling knows which element fingerprint to record.
- On subsequent runs, use `--auto-match` instead of `--auto-save`. Do not use both flags at once.
- Use `--mode async` for batch jobs with 5+ static URLs for parallel execution.
- Combine `--disable-resources` with `--block-images` in stealth/dynamic mode when you only need text content — this can cut load times by up to 40%.
- Check the top-level `ok` field and per-result `ok` fields before using content. If `ok` is false, report the exact error string — do not invent or guess content.
- If `--network-idle` is insufficient, use `--wait-selector` for a specific DOM element to guarantee the content has loaded before capture.
Basic fetch:

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"

Extract only the main content area:

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"

Stealth mode with network-idle wait:

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--proxy "http://user:pass@host:port" \
--humanize 2.0 \
--geoip \
--block-webrtc \
--network-idle
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode dynamic \
--wait-selector ".product-list" \
--network-idle \
--disable-resources
python3 "{baseDir}/scrape_to_markdown.py" \
--mode async \
--url "<URL1>" --url "<URL2>" --url "<URL3>"
python3 "{baseDir}/scrape_to_markdown.py" \
--url-file urls.txt \
--mode stealth \
--disable-resources \
--output-dir outputs
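For larger jobs, urls.txt can be generated programmatically before running the command above. A small sketch, with placeholder URLs and the same flags as the example:

```python
# Sketch: write one URL per line, then run the documented batch command.
import subprocess

urls = ["https://example.com/page1", "https://example.com/page2"]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

subprocess.run([
    "python3", "scrape_to_markdown.py",
    "--url-file", "urls.txt",
    "--mode", "stealth",
    "--disable-resources",
    "--output-dir", "outputs",
])
```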
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-save \
--output-dir outputs
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-match \
--output-dir outputs
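The two runs above form a single workflow: fingerprint once, then match on every later run. A sketch of that sequence, with a placeholder URL and selector:

```python
# Sketch: run 1 records the element fingerprint (--auto-save); later runs
# relocate it by fingerprint (--auto-match), even after a site redesign.
import subprocess

base = [
    "python3", "scrape_to_markdown.py",
    "--url", "https://example.com/post",
    "--selector", ".article-body",
    "--output-dir", "outputs",
]

subprocess.run(base + ["--auto-save"])   # first run: fingerprint and persist
subprocess.run(base + ["--auto-match"])  # later runs: match by fingerprint
```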
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--selector "main article" \
--auto-match \
--preserve-links \
--network-idle \
--disable-resources \
--timeout 60000 \
--retry 3 \
--output-dir outputs
JSON is printed to stdout. Always check `ok` before using content.
Top-level fields:
- `ok` — true only if every URL succeeded
- `total` / `succeeded` / `failed` — count summary
- `results` — array of per-URL result objects
- `output_index_file` — path to the saved index.json (if `--output-dir` used)

Per-URL result fields (when `ok: true`):

- `url` — the requested URL
- `status` — HTTP status code (e.g. 200)
- `title` — page `<title>` text
- `markdown` — extracted content as Markdown ← use this as the main content
- `markdown_length` — character count (useful for quality checks)
- `output_markdown_file` — path to the saved `.md` file (if `--output-dir` used)

On failure (`ok: false` in a result):
- `error` — exact error message; report this verbatim, do not invent content
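Putting the schema together, a sketch of a consumer that checks `ok` at both levels before touching content (the URL is a placeholder):

```python
# Sketch: parse stdout JSON and honor both ok fields, per the schema above.
import json
import subprocess

proc = subprocess.run(
    ["python3", "scrape_to_markdown.py", "--url", "https://example.com"],
    capture_output=True, text=True,
)
data = json.loads(proc.stdout)

if not data["ok"]:
    print(f"{data['failed']} of {data['total']} URLs failed")

for r in data["results"]:
    if r.get("ok"):
        print(f"{r['url']} ({r['status']}): {r['markdown_length']} chars")
        markdown = r["markdown"]  # the extracted main content
    else:
        # Report the exact error string verbatim; never invent content.
        print(f"FAILED {r['url']}: {r['error']}")
```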