sigilbox/AGENTS.md
2026-05-15 09:25:19 -07:00

# Agent notes for local-page-archiver
## Project overview
This tool renders web pages in Chromium (via Playwright) and saves them as fully self-contained HTML files. All external assets (images, fonts, stylesheets) are inlined as data URIs so the resulting file works offline.
The pipeline is:
```
URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
```
## Source layout
- `src/cli.mjs` — CLI entrypoint. Supports `archive` and `help`. Accepts `--archive-path`, `--id`, and `--headful` flags.
- `src/archiver.mjs` — Core archiving logic. Loads privacy filters, steers the browser, injects adblockers/userscripts, and calls the inliner.
- `src/asset-inliner.mjs` — Fetches and inlines external resources (images, CSS, iframes). Also strips `<script>` and `<noscript>` tags for a static archive.
- `privacy-filters/` — Third-party filter lists and userscripts used to strip paywalls, trackers, and ad banners before the snapshot is taken.
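
The inlining step in `src/asset-inliner.mjs` can be pictured with a minimal sketch. The helper names `toDataUri` and `stripScripts` are hypothetical and not taken from the source; the regex-based stripping is a simplification, and the real inliner may well use a DOM parser instead.

```javascript
// Hypothetical helper: encode fetched bytes as a data URI so the asset
// can live inside the HTML file itself.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: drop <script> and <noscript> elements from an HTML
// string. A regex is enough for a sketch; it will not handle every edge case.
function stripScripts(html) {
  return html.replace(/<(script|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '');
}
```

An `<img src="https://…">` would then be rewritten to `src="data:image/png;base64,…"` once the image bytes have been fetched.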
## Privacy filters (`privacy-filters/`)
### `bpc-paywall-filter.txt`
An Adblock Plus / uBlock Origin filter list. It contains three kinds of rules:
1. **Network rules** (`||tracker.com^`, `/regex/`) — block specific third-party paywall / tracking scripts.
2. **Exception rules** (`@@||example.com^`) — whitelist requests that a global block rule would otherwise hit.
3. **Cosmetic rules** (`example.com##.paywall`) — inject CSS to hide DOM elements (e.g. subscription banners, blurred overlays).
At module load time `archiver.mjs` parses this file into three arrays (`blockRules`, `allowRules`, `cosmeticRules`). Network rules are enforced at the Playwright level with `page.route(...)`. Cosmetic rules are injected as a `<style>` tag after the page reaches `domcontentloaded`.
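
As a rough illustration of the three-way split described above, the sketch below classifies each line of a filter list. The function name `parseFilterList` and the shape of the cosmetic rule objects are assumptions made for this example; only the array names `blockRules`, `allowRules`, and `cosmeticRules` come from the notes.

```javascript
// Sketch: split an ABP/uBO filter list into the three rule groups.
function parseFilterList(text) {
  const blockRules = [];
  const allowRules = [];
  const cosmeticRules = [];

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blanks, comments and preprocessor directives ("!", "!#if"),
    // and the "[Adblock Plus ...]" header line.
    if (line === '' || line.startsWith('!') || line.startsWith('[')) continue;
    // Scriptlet injection is not supported, so drop those rules.
    if (line.includes('##+js(')) continue;

    const cosmeticAt = line.indexOf('##');
    if (cosmeticAt !== -1) {
      // "example.com,foo.com##.paywall" -> domains + selector.
      // An empty domain list means the rule applies everywhere.
      const domains = line.slice(0, cosmeticAt).split(',').filter(Boolean);
      cosmeticRules.push({ domains, selector: line.slice(cosmeticAt + 2) });
    } else if (line.startsWith('@@')) {
      allowRules.push(line.slice(2)); // exception rule
    } else {
      blockRules.push(line); // network rule
    }
  }
  return { blockRules, allowRules, cosmeticRules };
}
```

The block and allow arrays would then be consulted inside a `page.route(...)` handler, with an allow match taking priority over a block match.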
#### Cosmetic rule caveats
Advanced cosmetic syntax is only partially supported. Rules that use the following operators are downgraded or dropped without a warning:
- `:remove()`: discarded. Injected CSS can only hide DOM nodes, not remove them.
- `:style(...)`: not applied as written; the rule is converted to a plain CSS `display: none`.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`: discarded during parsing.
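
One way to picture these caveats is a small converter from a cosmetic selector to the CSS text that ends up in the injected `<style>` tag. This is a sketch under assumptions: `cosmeticRuleToCss` is a hypothetical name, and the `!important` suffix is added here for illustration rather than confirmed by the source.

```javascript
// Procedural operators that cannot be expressed in plain CSS.
const DISCARDED_OPERATORS = [':remove(', ':xpath(', ':upward(', ':matches-css('];

// Returns a CSS rule string, or null when the selector must be dropped.
function cosmeticRuleToCss(selector) {
  if (DISCARDED_OPERATORS.some((token) => selector.includes(token))) {
    return null;
  }
  // ":style(...)" is reduced to a plain hide rule on the base selector.
  const styleAt = selector.indexOf(':style(');
  const plain = styleAt === -1 ? selector : selector.slice(0, styleAt);
  // Assumption: "!important" so the page's own styles do not win.
  return `${plain} { display: none !important; }`;
}
```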
### `userscript/` directory
Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library `bpc_func.js`. They do the heavy lifting: decrypt paywalls, reconstruct article text from JSON data embedded in the page, remove blur overlays, etc.
#### Selective injection (important)
Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing spurious `HTTP 403` warnings.
The matching logic is a simple glob parser for userscript `@match` patterns:
- `*://*.com/*` matches any `.com` domain
- `*://example.com/path/*` matches that path prefix
- `@exclude` patterns take precedence and skip the script
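
The matching rules above could be implemented roughly as follows. This is a minimal sketch, not the project's actual code: `globToRegExp` and `shouldInject` are hypothetical names, and the `{ match, exclude }` object shape is assumed.

```javascript
// Convert a userscript glob such as "*://*.example.com/*" into a RegExp.
// Every "*" becomes ".*"; all other regex metacharacters are escaped.
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}

// Decide whether a script with the given @match / @exclude lists applies.
function shouldInject(url, script) {
  const hit = (patterns) => patterns.some((p) => globToRegExp(p).test(url));
  if (hit(script.exclude)) return false; // @exclude takes precedence
  return hit(script.match);
}
```

Because `*` is translated to `.*`, a wildcard in the host position can also span slashes, so this is looser than a strict match-pattern implementation. That is acceptable for choosing which scripts to inject but would not be suitable as a security boundary.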
#### `GM.xmlHttpRequest` mock
The userscripts rely on `GM.xmlHttpRequest` to fetch article text from archive mirrors or API endpoints. In a Playwright context this doesn't exist, so we inject a tiny mock that wraps the browser's native `fetch()` and presents the same callback interface (`onload`, `onerror`).
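
A mock of this kind might look like the sketch below. The factory name `createGmXmlHttpRequest` and the exact fields on the response object are assumptions; the real mock may expose more or fewer of the Greasemonkey response properties.

```javascript
// Sketch: build a GM.xmlHttpRequest-compatible function on top of fetch().
// fetchImpl is injectable so the function can be exercised outside a browser.
function createGmXmlHttpRequest(fetchImpl = globalThis.fetch) {
  return async function xmlHttpRequest(details) {
    const { url, method = 'GET', headers = {}, data, onload, onerror } = details;
    let response;
    let responseText;
    try {
      response = await fetchImpl(url, { method, headers, body: data });
      responseText = await response.text();
    } catch (error) {
      // Network failures are reported through the callback, not thrown.
      if (onerror) onerror(error);
      return;
    }
    if (onload) {
      onload({
        status: response.status,
        statusText: response.statusText,
        responseText,
        finalUrl: response.url,
      });
    }
  };
}

// In the page context one would then expose it as:
//   window.GM = { xmlHttpRequest: createGmXmlHttpRequest() };
```

Note that a `fetch()`-based mock remains subject to the page's CORS rules, unlike a real userscript manager's privileged request.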
#### Timing
Userscripts are injected **after** `domcontentloaded` but **before** `networkidle`. We then wait an extra 2 s (`page.waitForTimeout`) so any `setTimeout(..., 1000)` callbacks inside the scripts have time to fire before we snapshot the DOM.
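
The ordering can be sketched as a single function over a Playwright `page`. The name `injectAndSettle` is hypothetical, and using `page.addScriptTag` is an assumption; the actual code may inject the scripts through a different call.

```javascript
// Sketch of the injection timing: navigate, inject, let the network settle,
// then give delayed callbacks inside the userscripts time to fire.
async function injectAndSettle(page, url, scriptSources, settleMs = 2000) {
  // 1. Stop waiting as soon as the DOM is parsed.
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // 2. Inject the matching userscripts while the page is still loading.
  for (const content of scriptSources) {
    await page.addScriptTag({ content });
  }

  // 3. Wait for network activity to die down.
  await page.waitForLoadState('networkidle');

  // 4. Extra grace period for setTimeout(..., 1000) style callbacks.
  await page.waitForTimeout(settleMs);

  // 5. Snapshot the resulting DOM.
  return page.content();
}
```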
## Stealth / anti-detection
### The `playwright-extra` stealth packages are broken
The common recommendation is:
```bash
npm install playwright-extra playwright-extra-plugin-stealth
```
**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
```
Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454
```
No functional Playwright stealth plugin exists under those package names. We therefore removed them from `package.json`.
### Manual stealth evasions
Instead we apply the same core evasions manually via `context.addInitScript()` and browser launch flags.
**Launch flags:**
- `--disable-blink-features=AutomationControlled`
- `--disable-infobars`
- `--disable-web-security`
- `--no-sandbox`, `--disable-setuid-sandbox`
- `--disable-dev-shm-usage`
- `--enable-automation` is removed from Playwright's default arguments via `ignoreDefaultArgs`
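
Put together, the flags above correspond to a launch options object along these lines. The function name `buildLaunchOptions` is hypothetical; the option keys (`headless`, `args`, `ignoreDefaultArgs`) are standard Playwright launch options.

```javascript
// Sketch: launch options matching the flag list above.
function buildLaunchOptions(headless) {
  return {
    headless,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-infobars',
      '--disable-web-security',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
    ],
    // Drop the default flag that advertises automation to the page.
    ignoreDefaultArgs: ['--enable-automation'],
  };
}

// Usage (requires the `playwright` package):
//   const { chromium } = await import('playwright');
//   const browser = await chromium.launch(buildLaunchOptions(true));
```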
**Init script (injected into every page before any scripts run):**
```js
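// Hide the automation flag exposed on navigator.
Object.defineProperty(navigator, 'webdriver',
  { get: () => undefined, configurable: true, enumerable: true });
// Provide a minimal window.chrome object if the page does not already have one.
window.chrome = window.chrome || { runtime: {} };
// window.navigator.permissions.query is also patched for notifications (body elided here).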
```
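
The init script above mentions patching `navigator.permissions.query` for notifications without showing the body. The sketch below follows a commonly circulated form of this evasion and is an assumption about what the patch looks like, not a copy of the project's code. The wrapper `patchPermissionsQuery` and its `win` parameter are hypothetical and exist only so the logic can be exercised outside a browser.

```javascript
// Sketch: make permissions.query({ name: 'notifications' }) agree with
// Notification.permission, and defer to the original for everything else.
function patchPermissionsQuery(win) {
  const permissions = win.navigator.permissions;
  const originalQuery = permissions.query.bind(permissions);
  permissions.query = (descriptor) =>
    descriptor && descriptor.name === 'notifications'
      ? Promise.resolve({ state: win.Notification.permission })
      : originalQuery(descriptor);
}

// In an init script this would be invoked as: patchPermissionsQuery(window);
```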
#### CRITICAL: Avoid `delete navigator.webdriver` + iframe trick
An earlier version used a more elaborate stealth snippet that did `delete navigator.webdriver` and then created an `<iframe>` to steal the real navigator descriptor. **This crashed the Chromium renderer process on tab creation** with:
```
Protocol error (Page.addScriptToEvaluateOnNewDocument): Target crashed
```
The current init script is minimal and safe — it only overrides the getter via `Object.defineProperty` and avoids DOM mutation during page init.
## Browser context & headful mode
`renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.
- **Viewport:** 1366×768 (a common laptop resolution, not Playwright's default 1280×720)
- **Locale:** `en-US`
- **Timezone:** `America/New_York`
- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
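
The display detection and context defaults can be sketched as two small helpers. Both names (`resolveHeadless`, `buildContextOptions`) are hypothetical. The sketch assumes that a detected display means headful by default, which the notes imply but do not state, and the full User-Agent string is an illustrative guess at a macOS Chrome 130 value.

```javascript
// Sketch: an explicit options.headless wins; otherwise fall back to
// headless only when no X11 or Wayland display is advertised.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === 'boolean') return options.headless;
  const hasDisplay = Boolean(env.DISPLAY || env.WAYLAND_DISPLAY);
  return !hasDisplay;
}

// Sketch: context options matching the defaults listed above.
function buildContextOptions() {
  return {
    viewport: { width: 1366, height: 768 },
    locale: 'en-US',
    timezoneId: 'America/New_York',
    // Assumed UA string; the real value lives in the project source.
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
  };
}

// Usage with Playwright:
//   const context = await browser.newContext(buildContextOptions());
```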
## Docker / Podman support
### Dockerfile
- Base: `mcr.microsoft.com/playwright:v1.60.0` (must stay in sync with the `playwright` npm version)
- Installs Node 22 (the base image may ship an older Node)
- Runs `npx playwright install chromium` so the browser binary is baked into the image
### `podman-run.sh`
Helper for local runs. Two modes:
1. **`./podman-run.sh archive <URL>`** — headless, mounts `./archives`
2. **`./podman-run.sh headful-archive <URL>`** — headful with internal VNC
**Headful mode details:**
The container's `ENTRYPOINT` is `node src/cli.mjs`. To run a shell command inside the container (setting up Xvfb + x11vnc) we must override the entrypoint:
```bash
podman run --rm --entrypoint sh <image> -c "...setup Xvfb... && node src/cli.mjs archive <URL>"
```
Port `5900` inside the container maps to `5901` on the host to avoid conflicts with macOS's built-in VNC.
### `docker-compose.yml`
Includes a `headful` profile that can be run with:
```bash
docker compose --profile headful up archiver-headful
```
## Known limitations
### Bloomberg bot wall
Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
### Unsupported adblock syntax
- Advanced procedural cosmetic filters (`:remove()`, `:upward()`, `:xpath()`) are silently ignored.
- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
- Preprocessor directives (`!#if`) are skipped.
## Adding a new site to privacy filters
If you add a new filter rule or userscript:
1. **Filter rules:** Edit `privacy-filters/bpc-paywall-filter.txt`. `archiver.mjs` reloads the file on every process start, so no code changes are needed.
2. **Userscripts:** Drop a new `.user.js` into `privacy-filters/userscript/` and add its filename to the `userScriptFiles` array inside `loadPrivacyFilters()` in `archiver.mjs`.
3. Test with `node src/cli.mjs archive <URL>` and inspect the generated HTML.
## Rebuilding the Docker image
The `playwright` npm version and the Docker image tag must match:
```bash
# Check what npm installed
node -e "console.log(require('playwright/package.json').version)"
# Update the FROM line in Dockerfile if needed
# Then rebuild
podman build -t local-page-archiver .
```
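
The manual check above could also be automated. The sketch below compares the tag on the Dockerfile's `FROM` line with a given npm version; the helper names `extractImageTag` and `isInSync` are hypothetical and no such script is known to exist in the repository.

```javascript
// Sketch: pull the Playwright version out of the Dockerfile's FROM line.
// Returns null when no matching FROM line is found.
function extractImageTag(dockerfileText) {
  const match = dockerfileText.match(
    /^FROM\s+mcr\.microsoft\.com\/playwright:v(\d+\.\d+\.\d+)/m
  );
  return match ? match[1] : null;
}

// True when the image tag and the installed npm version agree.
function isInSync(dockerfileText, npmVersion) {
  return extractImageTag(dockerfileText) === npmVersion;
}

// Usage idea: read ./Dockerfile with fs.readFileSync and pass the version
// printed by the node -e command above.
```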
## Development quick reference
```bash
# Install deps
npm install
# Install browser binaries
npm run install-browsers
# Archive a page (headless)
node src/cli.mjs archive https://example.com
# Archive a page (headful on macOS)
node src/cli.mjs archive https://example.com --headful
# Archive inside container (headless)
./podman-run.sh archive https://example.com
# Archive inside container (headful + VNC)
./podman-run.sh headful-archive https://example.com
# Then open vnc://localhost:5901
```