Agent notes for local-page-archiver

Project overview

This tool renders web pages in Chromium (via Playwright) and saves them as fully self-contained HTML files. All external assets (images, fonts, stylesheets) are inlined as data URIs so the resulting file works offline.

The pipeline is:

URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML

Source layout

  • src/cli.mjs — CLI entrypoint. Supports archive and help. Accepts --archive-path, --id, and --headful flags.
  • src/archiver.mjs — Core archiving logic. Loads privacy filters, steers the browser, injects adblockers/userscripts, and calls the inliner.
  • src/asset-inliner.mjs — Fetches and inlines external resources (images, CSS, iframes). Also strips <script> and <noscript> tags for a static archive.
  • privacy-filters/ — Third-party filter lists and userscripts used to strip paywalls, trackers, and ad banners before the snapshot is taken.
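As a rough illustration of what "inlining" means for a single asset, the sketch below turns fetched bytes into a data URI. The helper name toDataUri is hypothetical and is not taken from src/asset-inliner.mjs; the real inliner may handle MIME detection and text assets differently.

```javascript
// Hypothetical helper: encode already-fetched asset bytes as a base64 data URI.
// `bytes` is a Node.js Buffer; `mimeType` is whatever the response reported.
function toDataUri(bytes, mimeType) {
  const type = mimeType || 'application/octet-stream';
  return `data:${type};base64,${bytes.toString('base64')}`;
}

// Example: a tiny text payload.
const uri = toDataUri(Buffer.from('hi'), 'text/plain');
console.log(uri);
```

An <img src="..."> pointing at a remote URL would then have its src replaced by the returned string.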

Privacy filters (privacy-filters/)

bpc-paywall-filter.txt

An AdBlock Plus / uBlock Origin filter list. It contains three kinds of rules:

  1. Network rules (||tracker.com^, /regex/) — block specific third-party paywall / tracking scripts.
  2. Exception rules (@@||example.com^) — whitelist requests that a global block rule would otherwise hit.
  3. Cosmetic rules (example.com##.paywall) — inject CSS to hide DOM elements (e.g. subscription banners, blurred overlays).

At module load time archiver.mjs parses this file into three arrays (blockRules, allowRules, cosmeticRules). Network rules are enforced at the Playwright level with page.route(...). Cosmetic rules are injected as a <style> tag after the page reaches domcontentloaded.
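A minimal sketch of the three-way split described above, assuming a simple line-by-line parser. The function name classifyFilterLines and the shape of the cosmetic rule objects are assumptions for illustration; only the array names come from archiver.mjs as described here.

```javascript
// Hypothetical sketch: sort filter-list lines into the three buckets named above.
function classifyFilterLines(text) {
  const blockRules = [];
  const allowRules = [];
  const cosmeticRules = [];

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blanks, comments ("! ..."), preprocessor directives ("!#if ...")
    // and list headers such as "[Adblock Plus 2.0]".
    if (!line || line.startsWith('!') || line.startsWith('[')) continue;

    // Cosmetic rules: "domains##selector" or "domains#?#selector".
    const cosmetic = line.match(/^([^#]*)(##|#\?#)(.+)$/);
    if (cosmetic) {
      // Scriptlet injections ("##+js(...)") are not CSS, so they are dropped.
      if (cosmetic[3].startsWith('+js(')) continue;
      const domains = cosmetic[1] ? cosmetic[1].split(',') : [];
      cosmeticRules.push({ domains, selector: cosmetic[3] });
      continue;
    }
    // Cosmetic exceptions ("#@#") are ignored in this sketch.
    if (line.includes('#@#')) continue;

    if (line.startsWith('@@')) {
      allowRules.push(line.slice(2)); // exception rule, stored without "@@"
    } else {
      blockRules.push(line); // plain network rule
    }
  }
  return { blockRules, allowRules, cosmeticRules };
}
```

A page.route(...) handler could then abort a request when it matches a block rule and no allow rule; the matching itself is out of scope for this sketch.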

Cosmetic rule caveats

Some advanced cosmetic syntax is not fully supported. Depending on the operator, it is converted, downgraded, or silently discarded:

  • Terminal :remove() — converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
  • :style(...) — converted to real CSS using the style content from the filter rule.
  • :xpath(...), :upward(...), :matches-css(...), :matches-media(...), :matches-path(...) — discarded during parsing.
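The three cases above could be handled roughly as follows. This is a sketch under assumptions: the function name cosmeticSelectorToCss is hypothetical, and the real parser may recognize more operators or emit different CSS.

```javascript
// Operators that the notes above list as discarded during parsing.
const DISCARDED_OPERATORS = [
  ':xpath(', ':upward(', ':matches-css(', ':matches-media(', ':matches-path(',
];

// Hypothetical sketch: turn one cosmetic selector into a CSS rule, or null if unsupported.
function cosmeticSelectorToCss(selector) {
  if (DISCARDED_OPERATORS.some((op) => selector.includes(op))) return null;

  // Terminal :remove() is downgraded to hiding the element.
  const removeSuffix = ':remove()';
  if (selector.endsWith(removeSuffix)) {
    const base = selector.slice(0, -removeSuffix.length);
    return `${base} { display: none !important; }`;
  }

  // Terminal :style(...) becomes a real CSS declaration block.
  const styled = selector.match(/^(.*):style\((.*)\)$/);
  if (styled) return `${styled[1]} { ${styled[2]} }`;

  // Plain selectors are hidden.
  return `${selector} { display: none !important; }`;
}
```

The non-null results can be joined with newlines and injected as the content of a single <style> tag.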

userscript/ directory

Contains Greasemonkey-style userscripts (bpc.*.user.js) plus a shared library bpc_func.js. They do heavy lifting: decrypt paywalls, reconstruct article text from JSON data embedded in the page, remove blur overlays, etc.

Selective injection (important)

Each userscript declares @match and @exclude metadata. Only matching scripts are injected. For example, on bloomberg.com only bpc.en.user.js is injected. The shared bpc_func.js helper is injected first, then the matching userscript files.

Malformed asset fetches such as quoted Stripe or Google Pay script URLs usually mean escaped markup inside srcdoc or another HTML attribute is being parsed as top-level HTML. The inliner should only read attributes from real opening tags, and it sanitizes srcdoc iframe HTML recursively.

The matching logic is a simple glob parser for userscript @match patterns:

  • *://*.com/* matches any .com domain
  • *://example.com/path/* matches that path prefix
  • @exclude patterns take precedence and skip the script
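The rules above can be sketched as a glob-to-regex conversion plus an exclude-first check. The function names and the example patterns are hypothetical, not copied from archiver.mjs or from any real userscript header.

```javascript
// Convert a userscript glob ("*://*.com/*") into an anchored RegExp.
function matchPatternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters except "*"
    .replace(/\*/g, '.*');                 // "*" matches any run of characters
  return new RegExp(`^${escaped}$`);
}

// Decide whether a script with the given metadata should be injected for `url`.
function shouldInject(url, meta) {
  const hits = (patterns) => patterns.some((p) => matchPatternToRegExp(p).test(url));
  if (hits(meta.exclude || [])) return false; // @exclude takes precedence
  return hits(meta.match || []);
}

// Hypothetical metadata for illustration only.
const meta = {
  match: ['*://*.bloomberg.com/*'],
  exclude: ['*://*.bloomberg.com/live/*'],
};
console.log(shouldInject('https://www.bloomberg.com/news/a', meta));
```

Because "*" is treated as "any characters, including slashes", this kind of glob is looser than a strict WebExtension match pattern; that trade-off is acceptable only if the patterns come from a trusted script set.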

GM.xmlHttpRequest mock

The userscripts rely on GM.xmlHttpRequest to fetch article text from archive mirrors or API endpoints. In a Playwright context this doesn't exist, so we inject a tiny mock that wraps the browser's native fetch() and presents the same callback interface (onload, onerror).
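A minimal sketch of such a mock, assuming the scripts only need method, url, headers, data, onload, and onerror. The factory name createGmMock and the injectable fetchImpl parameter are illustrative additions; the mock actually injected by archiver.mjs may differ.

```javascript
// Hypothetical sketch: a GM.xmlHttpRequest stand-in built on fetch().
// `fetchImpl` defaults to the global fetch and can be swapped out for testing.
function createGmMock(fetchImpl = globalThis.fetch) {
  return {
    xmlHttpRequest(details) {
      const { method = 'GET', url, headers, data, onload, onerror } = details;
      fetchImpl(url, { method, headers, body: data })
        .then((response) =>
          response.text().then((responseText) => ({ response, responseText })))
        .then(
          ({ response, responseText }) =>
            onload?.({
              status: response.status,
              statusText: response.statusText,
              responseText,
              finalUrl: response.url,
            }),
          (error) => onerror?.(error), // network or body-read failure
        );
    },
  };
}
```

In the page this would be exposed as something like window.GM = createGmMock() before the userscripts run. Note that, unlike a real userscript manager, a fetch-based mock is still subject to the page's CORS rules unless web security is relaxed.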

Timing

Userscripts are injected after domcontentloaded but before networkidle. We then wait an extra 2 s (page.waitForTimeout) so any setTimeout(..., 1000) callbacks inside the scripts have time to fire before we snapshot the DOM.
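The ordering can be sketched with standard Playwright Page methods. The function name injectAndSnapshot, the use of page.addScriptTag for injection, and the best-effort networkidle wait are assumptions; archiver.mjs may inject scripts differently.

```javascript
// Hypothetical sketch of the injection timing. `page` is a Playwright Page and
// `scripts` is an array of JavaScript source strings with bpc_func.js first.
async function injectAndSnapshot(page, url, scripts, settleMs = 2000) {
  // 1. Navigate, returning as soon as the DOM is parsed.
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // 2. Inject the shared helper and the matching userscripts, in order.
  for (const content of scripts) {
    await page.addScriptTag({ content });
  }

  // 3. Let the network settle (assumed best effort: a timeout is not fatal).
  await page.waitForLoadState('networkidle').catch(() => {});

  // 4. Extra grace period so delayed setTimeout callbacks can fire.
  await page.waitForTimeout(settleMs);

  // 5. Snapshot the resulting DOM.
  return page.content();
}
```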

Stealth / anti-detection

Stealth package naming trap

Several npm packages whose names look like Playwright-specific stealth plugins are placeholders. For example, this install looks plausible but does not work:

npm install playwright-extra playwright-extra-plugin-stealth

playwright-extra-plugin-stealth, playwright-extra-stealth, and playwright-stealth are placeholder packages (version 0.0.1) that literally throw on require():

Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454

If we revisit package-based stealth, the working route to evaluate is playwright-extra with puppeteer-extra-plugin-stealth. This project currently uses manual evasions instead, keeping package.json limited to plain Playwright.

Manual stealth evasions

We apply the same core evasions manually via context.addInitScript() and browser launch flags.

Launch flags:

  • --disable-blink-features=AutomationControlled
  • --disable-infobars
  • --disable-web-security
  • --no-sandbox, --disable-setuid-sandbox
  • --disable-dev-shm-usage
  • --enable-automation is removed from the default arguments via ignoreDefaultArgs
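Expressed as Playwright launch options, the flags above might look like the sketch below. The helper name buildLaunchOptions is hypothetical; args, ignoreDefaultArgs, and headless are standard chromium.launch() options.

```javascript
// Hypothetical helper: collect the launch flags listed above into one options object.
function buildLaunchOptions(headless) {
  return {
    headless,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-infobars',
      '--disable-web-security',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
    ],
    // Drop the default flag that advertises automation.
    ignoreDefaultArgs: ['--enable-automation'],
  };
}

// Usage (requires the playwright package):
//   const { chromium } = await import('playwright');
//   const browser = await chromium.launch(buildLaunchOptions(true));
```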

Init script (injected into every page before any scripts run):

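// Report navigator.webdriver as undefined by overriding only the getter.
Object.defineProperty(navigator, 'webdriver',
  { get: () => undefined, configurable: true, enumerable: true });
// Provide a minimal window.chrome object if the page does not already have one.
window.chrome = window.chrome || { runtime: {} };
// navigator.permissions.query is also patched for the 'notifications' permission (patch body omitted here).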

CRITICAL: Avoid delete navigator.webdriver + iframe trick

An earlier version used a more elaborate stealth snippet that did delete navigator.webdriver and then created an <iframe> to steal the real navigator descriptor. This crashed the Chromium renderer process on tab creation with:

Protocol error (Page.addScriptToEvaluateOnNewDocument): Target crashed

The current init script is minimal and safe — it only overrides the getter via Object.defineProperty and avoids DOM mutation during page init.

Browser context & headful mode

renderPage() auto-detects whether a display is available ($DISPLAY / $WAYLAND_DISPLAY). If neither is set it defaults to headless. The caller can override via options.headless.

  • Viewport: 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
  • Locale: en-US
  • Timezone: America/New_York
  • User-Agent: macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
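The display detection and context settings above could be sketched like this. The name resolveHeadless and the way options are passed are assumptions about renderPage(); viewport, locale, and timezoneId are standard browser.newContext() options. The pinned User-Agent string is deliberately left out because its exact value is not given here.

```javascript
// Hypothetical sketch: an explicit options.headless wins, otherwise look for a display.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === 'boolean') return options.headless;
  const hasDisplay = Boolean(env.DISPLAY || env.WAYLAND_DISPLAY);
  return !hasDisplay; // no display available: default to headless
}

// Context options matching the list above (userAgent intentionally omitted).
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
};

console.log(resolveHeadless({}, { DISPLAY: ':99' }), contextOptions.viewport);
```

On macOS neither variable is normally set, so this logic defaults to headless there unless the caller overrides it, which is consistent with --headful being an explicit flag.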

Docker / Podman support

Dockerfile

  • Base: mcr.microsoft.com/playwright:v1.60.0-noble (must stay in sync with the playwright npm version)
  • Installs only the worker runtime helpers that are not part of the Playwright image: dumb-init, xvfb, and x11vnc
  • Uses /app/scripts/archive-worker-entrypoint.sh as the entrypoint. The entrypoint starts Xvfb on $DISPLAY and then runs node src/cli.mjs ... for archive/help commands.
  • The worker is intended to be ephemeral: one container per archive job, with /archives mounted from the host.

Host-to-worker contract

src/container-runner.mjs is the host/backend-facing boundary. It:

  1. Picks podman or docker.
  2. Starts local-page-archiver:latest with /archives mounted from the host.
  3. Calls the in-container CLI as archive <input> --json.
  4. Parses the JSON result and rewrites /archives/... paths back to host paths.

This is the integration point a future backend should use instead of shelling out to podman run directly.
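Steps 2 to 4 can be illustrated with two small pure functions. Everything here is a sketch: the function names, the --rm flag, and the JSON field names in the usage example are assumptions, not the actual contract of src/container-runner.mjs.

```javascript
// Hypothetical: argv for `podman run` / `docker run` with /archives mounted from the host.
function buildWorkerArgs(hostArchiveDir, input, image = 'local-page-archiver:latest') {
  return ['run', '--rm', '-v', `${hostArchiveDir}:/archives`, image, 'archive', input, '--json'];
}

// Hypothetical: recursively map "/archives/..." strings in the worker's JSON
// result back to host paths. `hostArchiveDir` must not end with a slash.
function rewriteArchivePaths(value, hostArchiveDir, containerDir = '/archives') {
  if (typeof value === 'string') {
    if (value === containerDir) return hostArchiveDir;
    if (value.startsWith(`${containerDir}/`)) {
      return hostArchiveDir + value.slice(containerDir.length);
    }
    return value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => rewriteArchivePaths(item, hostArchiveDir, containerDir));
  }
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, item]) =>
        [key, rewriteArchivePaths(item, hostArchiveDir, containerDir)]),
    );
  }
  return value; // numbers, booleans, null
}
```

For example, a worker result parsed from '{"htmlPath":"/archives/abc/index.html"}' (field name invented for illustration) would come back with htmlPath under the host archive directory.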

podman-run.sh

Helper for local Podman runs. It delegates to src/container-runner.mjs.

  1. ./podman-run.sh build — build local-page-archiver:latest
  2. ./podman-run.sh archive <URL> — run one ephemeral Xvfb/Chromium worker and write to ./archives
  3. ./podman-run.sh vnc-archive <URL> — same worker with x11vnc exposed on vnc://localhost:5901

The helper builds the image if it is missing. Override with:

ARCHIVE_WORKER_IMAGE=local-page-archiver:dev ARCHIVE_DIR=/tmp/archives ./podman-run.sh archive https://example.com

docker-compose.yml

Compose is mainly a direct worker smoke test. It runs the same image and command shape as the host runner:

URL=https://example.com docker compose up --build archive-worker

For visual debugging:

URL=https://example.com docker compose --profile debug up --build archive-worker-vnc

Unlike podman-run.sh, Compose maps VNC to host port 5900.

Known limitations

Site-specific blocking

Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site/network-sensitive failures and reproduce with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg pages, for example, archived successfully in recent local verification.

Unsupported adblock syntax

  • Advanced procedural cosmetic filters (:upward(), :xpath(), :matches-css(), etc.) are silently ignored.
  • Terminal :remove() cosmetic filters are downgraded to CSS hiding.
  • Scriptlet injection (##+js(...)) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
  • Preprocessor directives (!#if) are skipped.

Adding a new site to privacy filters

If you add a new filter rule or userscript:

  1. Filter rules: Edit privacy-filters/bpc-paywall-filter.txt. archiver.mjs reloads the file on every process start, so no code changes are needed.
  2. Userscripts: Drop a new .user.js into privacy-filters/userscript/ and add its filename to the userScriptFiles array inside loadPrivacyFilters() in archiver.mjs.
  3. Test with node src/cli.mjs archive <URL> and inspect the generated HTML.

Rebuilding the Docker image

The Playwright npm version and the Playwright image tag in the Dockerfile must match:

# Check what npm installed
node -e "console.log(require('playwright/package.json').version)"

# Update the FROM line in Dockerfile if needed
# Then rebuild
podman build -t local-page-archiver .

Development quick reference

# Install deps
npm install

# Install browser binaries
npm run install-browsers

# Archive a page (headless)
node src/cli.mjs archive https://example.com

# Archive a page (headful on macOS)
node src/cli.mjs archive https://example.com --headful

# Build worker image
./podman-run.sh build

# Archive inside an ephemeral Xvfb/Chromium worker
./podman-run.sh archive https://example.com

# Archive inside worker + expose VNC for debugging
./podman-run.sh vnc-archive https://example.com
# Then open vnc://localhost:5901