2026-05-15 09:25:19 -07:00
# Agent notes for local-page-archiver
## Project overview
This tool renders web pages in Chromium (via Playwright) and saves them as fully self-contained HTML files. All external assets (images, fonts, stylesheets) are inlined as data URIs so the resulting file works offline.
The pipeline is:
```
URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML
```
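The final inlining stage can be pictured with a minimal sketch. `toDataUri` and `inlineAsset` are hypothetical names used only for illustration; the real `src/asset-inliner.mjs` may work quite differently.

```js
// Hypothetical helper: encode fetched bytes as a base64 data URI.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: replace every reference to one asset URL with its inlined form.
function inlineAsset(html, assetUrl, bytes, mimeType) {
  return html.split(assetUrl).join(toDataUri(bytes, mimeType));
}

const sampleHtml = '<img src="https://example.com/dot.gif">';
const inlined = inlineAsset(sampleHtml, 'https://example.com/dot.gif', Buffer.from('GIF89a'), 'image/gif');
console.log(inlined); // <img src="data:image/gif;base64,R0lGODlh">
```

Once every reference is rewritten this way, the file no longer needs the network to render.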
## Source layout
- `src/cli.mjs` — CLI entrypoint. Supports `archive` and `help`. Accepts `--archive-path`, `--id`, and `--headful` flags.
- `src/archiver.mjs` — Core archiving logic. Loads privacy filters, steers the browser, injects adblockers/userscripts, and calls the inliner.
- `src/asset-inliner.mjs` — Fetches and inlines external resources (images, CSS, iframes). Also strips `<script>` and `<noscript>` tags for a static archive.
- `privacy-filters/` — Third-party filter lists and userscripts used to strip paywalls, trackers, and ad banners before the snapshot is taken.
## Privacy filters (`privacy-filters/`)
### `bpc-paywall-filter.txt`
An Adblock Plus / uBlock Origin filter list. It contains three kinds of rules:
1. **Network rules** (`||tracker.com^`, `/regex/`) — block specific third-party paywall / tracking scripts.
2. **Exception rules** (`@@||example.com^`) — whitelist requests that a global block rule would otherwise hit.
3. **Cosmetic rules** (`example.com##.paywall`) — inject CSS to hide DOM elements (e.g. subscription banners, blurred overlays).
At module load time `archiver.mjs` parses this file into three arrays (`blockRules`, `allowRules`, `cosmeticRules`). Network rules are enforced at the Playwright level with `page.route(...)`. Cosmetic rules are injected as a `<style>` tag after the page reaches `domcontentloaded`.
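The split into three arrays could look roughly like the sketch below. `parseFilterList` is a hypothetical name, and the skip rules (comments, `!#if` directives, scriptlets, cosmetic exceptions) are assumptions about a reasonable parser, not a copy of the code in `archiver.mjs`.

```js
// Hypothetical sketch: sort filter-list lines into block, allow, and cosmetic rules.
function parseFilterList(text) {
  const blockRules = [];
  const allowRules = [];
  const cosmeticRules = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    // Skip blanks, "!" comments (this also covers "!#if" directives), and "[...]" headers.
    if (!line || line.startsWith('!') || line.startsWith('[')) continue;
    // Scriptlets, cosmetic exceptions, and "#?#" rules are outside this sketch.
    if (/##\+js\(|#@#|#\?#/.test(line)) continue;
    const cosmeticAt = line.indexOf('##');
    if (cosmeticAt !== -1) {
      cosmeticRules.push({
        domains: line.slice(0, cosmeticAt).split(',').filter(Boolean),
        selector: line.slice(cosmeticAt + 2),
      });
    } else if (line.startsWith('@@')) {
      allowRules.push(line.slice(2)); // exception rule, stored without the "@@" prefix
    } else {
      blockRules.push(line); // plain network rule
    }
  }
  return { blockRules, allowRules, cosmeticRules };
}
```

A `page.route(...)` handler would then consult `allowRules` before `blockRules`, so an exception can override a broader block.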
#### Cosmetic rule caveats
Advanced cosmetic syntax gets limited handling. Some forms are converted to plain CSS; the rest are silently discarded:
- Terminal `:remove()` — converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
- `:style(...)` — converted to real CSS using the style content from the filter rule.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`, `:matches-media(...)`, `:matches-path(...)` — discarded during parsing.
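The three cases above can be sketched as a single conversion function. `cosmeticToCss` is a hypothetical name, and the regular expressions are an assumed, simplified take on the syntax rather than the parser's actual code.

```js
// Procedural operators that cannot be expressed in plain CSS.
const DISCARDED_OPERATORS = /:(xpath|upward|matches-css|matches-media|matches-path)\(/;

// Hypothetical sketch: turn one cosmetic selector into a CSS rule, or null if discarded.
function cosmeticToCss(selector) {
  if (DISCARDED_OPERATORS.test(selector)) return null;
  // ":style(...)" carries its own declarations, so emit them as real CSS.
  const styled = selector.match(/^(.*):style\((.*)\)$/);
  if (styled) return `${styled[1]} { ${styled[2]} }`;
  // A terminal ":remove()" is downgraded to hiding, like an ordinary cosmetic rule.
  const base = selector.replace(/:remove\(\)$/, '');
  return `${base} { display: none !important; }`;
}

console.log(cosmeticToCss('.overlay:remove()')); // .overlay { display: none !important; }
```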
### `userscript/` directory
Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library `bpc_func.js`. They do heavy lifting: decrypt paywalls, reconstruct article text from JSON data embedded in the page, remove blur overlays, etc.
#### Selective injection (important)
Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. The shared `bpc_func.js` helper is injected first, then the matching userscript files.
Malformed asset fetches such as quoted Stripe or Google Pay script URLs usually mean escaped markup inside `srcdoc` or another HTML attribute is being parsed as top-level HTML. The inliner should only read attributes from real opening tags, and it sanitizes `srcdoc` iframe HTML recursively.
The matching logic is a simple glob parser for userscript `@match` patterns:
- `*://*.com/*` matches any `.com` domain
- `*://example.com/path/*` matches that path prefix
- `@exclude` patterns take precedence and skip the script
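A glob matcher with these three properties fits in a few lines. This is a sketch under the assumption that `*` is the only wildcard; `globToRegExp` and `shouldInject` are hypothetical names, and the metadata object shape is invented for illustration.

```js
// Convert a userscript glob into an anchored RegExp; "*" matches any run of characters.
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}

// Decide whether a script applies to a URL; @exclude is checked first and wins.
function shouldInject(url, meta) {
  const hits = (patterns) => patterns.some((p) => globToRegExp(p).test(url));
  if (hits(meta.exclude)) return false;
  return hits(meta.match);
}

const meta = { match: ['*://*.com/*'], exclude: ['*://example.com/*'] };
console.log(shouldInject('https://www.bloomberg.com/news/x', meta)); // true
console.log(shouldInject('https://example.com/a', meta)); // false
```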
#### `GM.xmlHttpRequest` mock
The userscripts rely on `GM.xmlHttpRequest` to fetch article text from archive mirrors or API endpoints. In a Playwright context this doesn't exist, so we inject a tiny mock that wraps the browser's native `fetch()` and presents the same callback interface (`onload`, `onerror`).
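A minimal sketch of such a mock follows. `createGmXmlHttpRequest` is a hypothetical factory name, and the set of response fields is an assumption based on the usual `GM.xmlHttpRequest` response object; the mock in this project may expose more or fewer.

```js
// Build a GM.xmlHttpRequest-style function on top of a fetch implementation.
function createGmXmlHttpRequest(fetchImpl) {
  return function gmXmlHttpRequest(details) {
    const { method = 'GET', url, headers, data, onload, onerror } = details;
    fetchImpl(url, { method, headers, body: data })
      .then(async (response) => ({
        status: response.status,
        statusText: response.statusText,
        finalUrl: response.url,
        responseText: await response.text(),
      }))
      .then(
        (result) => { if (onload) onload(result); },
        (error) => { if (onerror) onerror(error); },
      );
  };
}

// In the page, the mock would be installed roughly like this:
//   window.GM = { xmlHttpRequest: createGmXmlHttpRequest(window.fetch.bind(window)) };
```

Taking `fetchImpl` as a parameter keeps the wrapper testable outside a browser. Note that page-level `fetch()` is subject to CORS, unlike a real userscript manager's request.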
#### Timing
Userscripts are injected **after** `domcontentloaded` but **before** `networkidle`. We then wait an extra 2 s (`page.waitForTimeout`) so any `setTimeout(..., 1000)` callbacks inside the scripts have time to fire before we snapshot the DOM.
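The ordering described above can be sketched as follows. `injectAndSettle` is a hypothetical helper, and both the use of `page.addScriptTag` as the injection mechanism and the explicit `networkidle` wait are assumptions about how `archiver.mjs` sequences these steps.

```js
// Hypothetical sketch: navigate, inject scripts early, then give their timers time to fire.
async function injectAndSettle(page, url, scriptSources, settleMs = 2000) {
  // Return as soon as the DOM is parsed, without waiting for every subresource.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // Inject in order; the shared helper library would come first in the list.
  for (const content of scriptSources) {
    await page.addScriptTag({ content });
  }
  await page.waitForLoadState('networkidle');
  // Extra pause so delayed callbacks inside the userscripts can run.
  await page.waitForTimeout(settleMs);
  return page.content();
}
```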
## Stealth / anti-detection
### Stealth package naming trap
The package names that look like Playwright-specific stealth plugins are placeholders. An install like this one looks plausible but does not give you a working plugin:
```bash
npm install playwright-extra playwright-extra-plugin-stealth
```
`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, and `playwright-stealth` are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
```
Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454
```
If we revisit package-based stealth, the working route to evaluate is `playwright-extra` with `puppeteer-extra-plugin-stealth`. This project currently uses manual evasions instead, keeping `package.json` limited to plain Playwright.
### Manual stealth evasions
We apply the same core evasions manually via `context.addInitScript()` and browser launch flags.
**Launch flags:**
- `--disable-blink-features=AutomationControlled`
- `--disable-infobars`
- `--disable-web-security`
- `--no-sandbox`, `--disable-setuid-sandbox`
- `--disable-dev-shm-usage`
- Removed `--enable-automation` via `ignoreDefaultArgs`
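As a sketch, the flags above map onto Playwright launch options like this. `buildLaunchOptions` is a hypothetical helper name; `args` and `ignoreDefaultArgs` are standard `chromium.launch()` options.

```js
// Hypothetical helper: collect the launch flags listed above into Playwright launch options.
function buildLaunchOptions(headless) {
  return {
    headless,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-infobars',
      '--disable-web-security',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
    ],
    // Drop the default flag that marks the browser as automation-controlled.
    ignoreDefaultArgs: ['--enable-automation'],
  };
}

// Usage, assuming `chromium` is imported from 'playwright':
//   const browser = await chromium.launch(buildLaunchOptions(true));
```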
**Init script (injected into every page before any scripts run):**
```js
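// Report `navigator.webdriver` as undefined instead of true.
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined, configurable: true, enumerable: true,
});
// Provide a minimal `window.chrome` stub when the page has none.
window.chrome = window.chrome || { runtime: {} };
// window.navigator.permissions.query is also patched for notifications (body elided here).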
```
#### CRITICAL: Avoid `delete navigator.webdriver` + iframe trick
An earlier version used a more elaborate stealth snippet that did `delete navigator.webdriver` and then created an `<iframe>` to steal the real navigator descriptor. **This crashed the Chromium renderer process on tab creation** with:
```
Protocol error (Page.addScriptToEvaluateOnNewDocument): Target crashed
```
The current init script is minimal and safe — it only overrides the getter via `Object.defineProperty` and avoids DOM mutation during page init.
## Browser context & headful mode
`renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.
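The detection rule can be sketched as a small pure function. `resolveHeadless` is a hypothetical name; the real logic lives inside `renderPage()` and may differ in detail.

```js
// Hypothetical sketch: an explicit option wins, otherwise go headless when no display is set.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === 'boolean') return options.headless;
  const hasDisplay = Boolean(env.DISPLAY || env.WAYLAND_DISPLAY);
  return !hasDisplay;
}

console.log(resolveHeadless({}, {})); // true
console.log(resolveHeadless({}, { DISPLAY: ':99' })); // false
console.log(resolveHeadless({ headless: true }, { DISPLAY: ':0' })); // true
```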
- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
- **Locale:** `en-US`
- **Timezone:** `America/New_York`
- **User-Agent:** macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
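Expressed as Playwright context options, the settings above would look roughly like this. The exact User-Agent string is an assumption (a typical macOS Chrome 130 string), not the value pinned in the project.

```js
// Sketch of the context settings listed above; the userAgent value is an assumed example.
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  userAgent:
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
};

// Usage, assuming `browser` came from chromium.launch():
//   const context = await browser.newContext(contextOptions);
```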
## Docker / Podman support
### Dockerfile
- Base: `mcr.microsoft.com/playwright:v1.60.0-noble` (must stay in sync with the `playwright` npm version)
- Installs only the worker runtime helpers that are not part of the Playwright image: `dumb-init`, `xvfb`, and `x11vnc`
- Uses `/app/scripts/archive-worker-entrypoint.sh` as the entrypoint. The entrypoint starts Xvfb on `$DISPLAY` and then runs `node src/cli.mjs ...` for `archive`/`help` commands.
- The worker is intended to be ephemeral: one container per archive job, with `/archives` mounted from the host.
### Host-to-worker contract
`src/container-runner.mjs` is the host/backend-facing boundary. It:
1. Picks `podman` or `docker`.
2. Starts `local-page-archiver:latest` with `/archives` mounted from the host.
3. Calls the in-container CLI as `archive <input> --json`.
4. Parses the JSON result and rewrites `/archives/...` paths back to host paths.
This is the integration point a future backend should use instead of shelling out to `podman run` directly.
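Steps 2 to 4 can be sketched as below. All function names and the JSON field names in the usage example are hypothetical, and the `run --rm -v` argument shape is an assumption about how the runner invokes the container; check `src/container-runner.mjs` for the real contract.

```js
// Hypothetical sketch: build the container CLI arguments for one archive job.
function buildRunArgs(image, hostArchiveDir, input) {
  return ['run', '--rm', '-v', `${hostArchiveDir}:/archives`, image, 'archive', input, '--json'];
}

// Recursively rewrite container paths under /archives back to host paths.
function rewriteArchivePaths(value, hostArchiveDir, containerDir = '/archives') {
  if (typeof value === 'string') {
    if (value === containerDir) return hostArchiveDir;
    return value.startsWith(`${containerDir}/`)
      ? hostArchiveDir + value.slice(containerDir.length)
      : value;
  }
  if (Array.isArray(value)) {
    return value.map((v) => rewriteArchivePaths(v, hostArchiveDir, containerDir));
  }
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, rewriteArchivePaths(v, hostArchiveDir, containerDir)]),
    );
  }
  return value;
}

// Parse the worker's JSON stdout and map its paths onto the host.
function parseWorkerResult(stdout, hostArchiveDir) {
  return rewriteArchivePaths(JSON.parse(stdout), hostArchiveDir);
}

const result = parseWorkerResult('{"htmlPath":"/archives/abc/index.html"}', '/home/me/archives');
console.log(result.htmlPath); // /home/me/archives/abc/index.html
```

Only strings that start with the container mount point are rewritten, so URLs that merely contain `/archives/` are left alone.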
### `podman-run.sh`
Helper for local Podman runs. It delegates to `src/container-runner.mjs`.
1. **`./podman-run.sh build`** — build `local-page-archiver:latest`
2. **`./podman-run.sh archive <URL>`** — run one ephemeral Xvfb/Chromium worker and write to `./archives`
3. **`./podman-run.sh vnc-archive <URL>`** — same worker with x11vnc exposed on `vnc://localhost:5901`
The helper builds the image if it is missing. Override with:
```sh
ARCHIVE_WORKER_IMAGE=local-page-archiver:dev ARCHIVE_DIR=/tmp/archives ./podman-run.sh archive https://example.com
```
### `docker-compose.yml`
Compose is mainly a direct worker smoke test. It runs the same image and command shape as the host runner:
```bash
URL=https://example.com docker compose up --build archive-worker
```
For visual debugging:
```bash
URL=https://example.com docker compose --profile debug up --build archive-worker-vnc
```
Unlike `podman-run.sh`, Compose maps VNC to host port `5900`.
## Known limitations
### Site-specific blocking
Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site/network-sensitive failures and reproduce with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg has recently archived successfully in local verification.
### Unsupported adblock syntax
- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`, etc.) are silently ignored.
- Terminal `:remove()` cosmetic filters are downgraded to CSS hiding.
- Scriptlet injection (`##+js(...)`) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
- Preprocessor directives (`!#if`) are skipped.
## Adding a new site to privacy filters
If you add a new filter rule or userscript:
1. **Filter rules:** Edit `privacy-filters/bpc-paywall-filter.txt`. `archiver.mjs` reloads the file on every process start, so no code changes are needed.
2. **Userscripts:** Drop a new `.user.js` into `privacy-filters/userscript/` and add its filename to the `userScriptFiles` array inside `loadPrivacyFilters()` in `archiver.mjs`.
3. Test with `node src/cli.mjs archive <URL>` and inspect the generated HTML.
## Rebuilding the Docker image
The `playwright` npm version and the Playwright image tag in the `Dockerfile` must match:
```bash
# Check what npm installed
node -e "console.log(require('playwright/package.json').version)"
# Update the FROM line in Dockerfile if needed
# Then rebuild
podman build -t local-page-archiver .
```
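The manual check above could be automated with a small guard. This is a sketch only: `imageTagMatchesNpm` is a hypothetical helper, and it assumes the `FROM` line uses the `mcr.microsoft.com/playwright:v<version>-<distro>` form shown earlier.

```js
// Hypothetical guard: compare the Playwright version in the Dockerfile FROM line with npm's.
function imageTagMatchesNpm(dockerfileText, npmVersion) {
  const found = dockerfileText.match(/^FROM\s+mcr\.microsoft\.com\/playwright:v(\d+\.\d+\.\d+)/m);
  return found !== null && found[1] === npmVersion;
}

const dockerfile = 'FROM mcr.microsoft.com/playwright:v1.60.0-noble\nWORKDIR /app\n';
console.log(imageTagMatchesNpm(dockerfile, '1.60.0')); // true
console.log(imageTagMatchesNpm(dockerfile, '1.59.0')); // false
```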
## Development quick reference
```bash
# Install deps
npm install
# Install browser binaries
npm run install-browsers
# Archive a page (headless)
node src/cli.mjs archive https://example.com
# Archive a page (headful on macOS)
node src/cli.mjs archive https://example.com --headful
# Build worker image
./podman-run.sh build
# Archive inside an ephemeral Xvfb/Chromium worker
./podman-run.sh archive https://example.com
# Archive inside worker + expose VNC for debugging
./podman-run.sh vnc-archive https://example.com
# Then open vnc://localhost:5901
```