fix agents.md

2026-05-15 09:29:22 -07:00
parent bc6d9893a1
commit 187d65cca7


@@ -7,7 +7,7 @@ This tool renders web pages in Chromium (via Playwright) and saves them as fully
The pipeline is:
```
URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML
```
## Source layout
@@ -31,11 +31,11 @@ At module load time `archiver.mjs` parses this file into three arrays (`blockRul
#### Cosmetic rule caveats
Some advanced cosmetic syntax is unsupported and is silently discarded:
Some advanced cosmetic syntax is unsupported and is silently discarded or downgraded:
- `:remove()` — we can't actually remove DOM nodes from the CSS layer; only hide them.
- `:style(...)` — converted to real CSS `display: none`.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)` — discarded during parsing.
- Terminal `:remove()` is converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
- `:style(...)` — converted to real CSS using the style content from the filter rule.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`, `:matches-media(...)`, `:matches-path(...)` — discarded during parsing.
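The caveats above can be sketched as a single downgrade step. This is a minimal illustration, not the actual parser in `archiver.mjs`: the function name `cosmeticToCss`, the `DISCARDED` list, and the use of `display: none !important` for hiding are all assumptions made for the example.

```javascript
// Hypothetical helper; names are illustrative and not taken from archiver.mjs.
// Procedural operators with no plain-CSS equivalent.
const DISCARDED = [
  ":xpath(", ":upward(", ":matches-css(", ":matches-media(", ":matches-path(",
];

function cosmeticToCss(selector) {
  // Rules using unsupported procedural operators are dropped entirely.
  if (DISCARDED.some((op) => selector.includes(op))) return null;

  // Terminal :remove() is downgraded to hiding (assumed hiding declaration).
  if (selector.endsWith(":remove()")) {
    const base = selector.slice(0, -":remove()".length);
    return `${base} { display: none !important; }`;
  }

  // :style(...) carries its own declarations, which are emitted as real CSS.
  const styled = selector.match(/^(.*):style\((.*)\)$/);
  if (styled) return `${styled[1]} { ${styled[2]} }`;

  // Plain selectors are hidden.
  return `${selector} { display: none !important; }`;
}
```

Returning `null` for discarded rules lets a caller filter them out before building the injected stylesheet.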
### `userscript/` directory
@@ -43,7 +43,9 @@ Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library
#### Selective injection (important)
Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.
Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. The shared `bpc_func.js` helper is injected first, then the matching userscript files.
Malformed asset fetches such as quoted Stripe or Google Pay script URLs usually mean escaped markup inside `srcdoc` or another HTML attribute is being parsed as top-level HTML. The inliner should only read attributes from real opening tags, and it sanitizes `srcdoc` iframe HTML recursively.
The matching logic is a simple glob parser for userscript `@match` patterns:
- `*://*.com/*` matches any `.com` domain
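A glob matcher of this kind could look roughly like the sketch below. It assumes the pattern is matched against the full URL and that `*` is the only wildcard; the names `matchPatternToRegExp` and `shouldInject`, and the shape of the `meta` object, are hypothetical and not taken from the project.

```javascript
// Hypothetical sketch of @match / @exclude glob handling.
function matchPatternToRegExp(pattern) {
  // Escape regex metacharacters except "*", then turn "*" into ".*".
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// meta is assumed to look like { match: string[], exclude: string[] }.
function shouldInject(meta, url) {
  const hits = (patterns) =>
    (patterns || []).some((p) => matchPatternToRegExp(p).test(url));
  // A script is injected only if some @match hits and no @exclude hits.
  return hits(meta.match) && !hits(meta.exclude);
}
```

Because the glob is applied to the whole URL, a pattern such as `*://*.com/*` can also match a URL whose path merely contains `.com/`; a stricter matcher would parse the host separately.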
@@ -60,22 +62,22 @@ Userscripts are injected **after** `domcontentloaded` but **before** `networkidl
## Stealth / anti-detection
### The `playwright-extra` stealth packages are broken
### Stealth package naming trap
The common recommendation is:
The package names that look like Playwright-specific stealth plugins are placeholders:
```bash
npm install playwright-extra playwright-extra-plugin-stealth
```
**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, and `playwright-stealth` are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
```
Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454
```
No functional Playwright stealth plugin exists in those package names. We therefore removed them from `package.json`.
If we revisit package-based stealth, the working route to evaluate is `playwright-extra` with `puppeteer-extra-plugin-stealth`. This project currently uses manual evasions instead, keeping `package.json` limited to plain Playwright.
### Manual stealth evasions
@@ -111,10 +113,10 @@ The current init script is minimal and safe — it only overrides the getter via
`renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.
- **ViewPort:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
- **Locale:** `en-US`
- **Timezone:** `America/New_York`
- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
- **User-Agent:** macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
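As a rough illustration of the settings above, the headless decision and the context options might be expressed as follows. `resolveHeadless` and `contextOptions` are hypothetical names for this sketch; the option keys (`viewport`, `locale`, `timezoneId`) are standard Playwright `browser.newContext()` options, and the pinned User-Agent string is deliberately left out.

```javascript
// Hypothetical helper: an explicit options.headless wins; otherwise go headless
// only when neither $DISPLAY nor $WAYLAND_DISPLAY is set.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === "boolean") return options.headless;
  return !(env.DISPLAY || env.WAYLAND_DISPLAY);
}

// Context options mirroring the list above (passed to browser.newContext()).
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: "en-US",
  timezoneId: "America/New_York",
  // userAgent: the pinned macOS Chrome 130 string would be set here.
};
```

Passing `env` as a parameter keeps the detection logic testable without touching the real environment.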
## Docker / Podman support
@@ -145,19 +147,22 @@ Port `5900` inside the container maps to `5901` on the host to avoid conflicts w
Includes a `headful` profile that can be run with:
```bash
docker compose --profile headful up archiver-headful
URL=https://example.com docker compose --profile headful up archiver-headful
```
Unlike `podman-run.sh`, Compose currently maps VNC to host port `5900`.
## Known limitations
### Bloomberg bot wall
### Site-specific blocking
Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site/network-sensitive failures and reproduce with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg has recently archived successfully in local verification.
### Unsupported adblock syntax
- Advanced procedural cosmetic filters (`:remove()`, `:upward()`, `:xpath()`) are silently ignored.
- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`, etc.) are silently ignored.
- Terminal `:remove()` cosmetic filters are downgraded to CSS hiding.
- Scriptlet injection (`##+js(...)`) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
- Preprocessor directives (`!#if`) are skipped.
## Adding a new site to privacy filters