fix agents.md

2026-05-15 09:29:22 -07:00
parent bc6d9893a1
commit 187d65cca7
1 changed file with 22 additions and 17 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,7 +7,7 @@ This tool renders web pages in Chromium (via Playwright) and saves them as fully
 The pipeline is:

 ```
-URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
+URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML
 ```

 ## Source layout
@@ -31,11 +31,11 @@ At module load time `archiver.mjs` parses this file into three arrays (`blockRul

 #### Cosmetic rule caveats

-Some advanced cosmetic syntax is unsupported and is silently discarded:
+Some advanced cosmetic syntax is unsupported and is silently discarded or downgraded:

-- `:remove()` — we can't actually remove DOM nodes from the CSS layer; only hide them.
-- `:style(...)` — converted to real CSS `display: none`.
-- `:xpath(...)`, `:upward(...)`, `:matches-css(...)` — discarded during parsing.
+- Terminal `:remove()` — converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
+- `:style(...)` — converted to real CSS using the style content from the filter rule.
+- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`, `:matches-media(...)`, `:matches-path(...)` — discarded during parsing.
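
As a rough illustration of the caveats in the added lines above, a parser could map one cosmetic selector to CSS along these lines. This is a sketch under assumptions, not the code in `archiver.mjs`: `cosmeticToCss` and `UNSUPPORTED` are hypothetical names, and the `display: none !important` fallback is assumed.

```javascript
// Hypothetical sketch: turn one cosmetic selector into a CSS rule, or null if unsupported.
const UNSUPPORTED = /:(xpath|upward|matches-css|matches-media|matches-path)\(/;

function cosmeticToCss(selector) {
  // Procedural operators with no CSS equivalent are discarded.
  if (UNSUPPORTED.test(selector)) return null;

  // `:style(...)` keeps the declarations written in the filter rule.
  const style = /^(.*):style\((.*)\)$/.exec(selector);
  if (style) return `${style[1]} { ${style[2]} }`;

  // A terminal `:remove()` is downgraded to hiding the element.
  const remove = /^(.*):remove\(\)$/.exec(selector);
  if (remove) return `${remove[1]} { display: none !important; }`;

  // Plain selectors are simply hidden.
  return `${selector} { display: none !important; }`;
}
```

Returning `null` for unsupported syntax lets the caller decide whether to log the dropped rule instead of discarding it silently.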

 ### `userscript/` directory

@@ -43,7 +43,9 @@ Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library

 #### Selective injection (important)

-Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.
+Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. The shared `bpc_func.js` helper is injected first, then the matching userscript files.
+
+Malformed asset fetches, such as quoted Stripe or Google Pay script URLs, usually mean that escaped markup inside `srcdoc` or another HTML attribute is being parsed as top-level HTML. The inliner should read attributes only from real opening tags, and it sanitizes `srcdoc` iframe HTML recursively.

 The matching logic is a simple glob parser for userscript `@match` patterns:
 - `*://*.com/*` matches any `.com` domain
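
The glob matching described above can be sketched roughly as follows. This is an illustrative reading of `@match` semantics, not the project's parser: `matchPatternToRegExp` and `shouldInject` are hypothetical names, and treating a `*` scheme as `http` or `https` is an assumption.

```javascript
// Hypothetical sketch: compile a userscript @match pattern into a RegExp.
function matchPatternToRegExp(pattern) {
  // Escape regex metacharacters except `*`, which is the glob wildcard.
  const escape = (s) => s.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const parts = /^([^:]+):\/\/([^/]*)(\/.*)$/.exec(pattern);
  if (!parts) throw new Error(`Unsupported @match pattern: ${pattern}`);
  const [, scheme, host, path] = parts;
  const schemeRe = scheme === "*" ? "https?" : escape(scheme);
  // A host wildcard must not cross a `/`, so it cannot swallow part of the path.
  const hostRe = escape(host).replace(/\*/g, "[^/]*");
  const pathRe = escape(path).replace(/\*/g, ".*");
  return new RegExp(`^${schemeRe}://${hostRe}${pathRe}$`);
}

// Inject only when some @match pattern hits and no @exclude pattern does.
function shouldInject(url, { matches = [], excludes = [] }) {
  const hit = (p) => matchPatternToRegExp(p).test(url);
  return matches.some(hit) && !excludes.some(hit);
}
```

For example, `shouldInject("https://www.bloomberg.com/news", { matches: ["*://*.com/*"] })` evaluates to `true` under these assumptions.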
@@ -60,22 +62,22 @@ Userscripts are injected **after** `domcontentloaded` but **before** `networkidl

 ## Stealth / anti-detection

-### The `playwright-extra` stealth packages are broken
+### Stealth package naming trap

-The common recommendation is:
+The package names that look like Playwright-specific stealth plugins are placeholders:

 ```bash
 npm install playwright-extra playwright-extra-plugin-stealth
 ```

-**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
+`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, and `playwright-stealth` are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:

 ```
 Error: Wrong package, please see this:
 https://github.com/berstend/puppeteer-extra/issues/454
 ```

-No functional Playwright stealth plugin exists in those package names. We therefore removed them from `package.json`.
+If we revisit package-based stealth, the working route to evaluate is `playwright-extra` with `puppeteer-extra-plugin-stealth`. This project currently uses manual evasions instead, keeping `package.json` limited to plain Playwright.

 ### Manual stealth evasions

@@ -111,10 +113,10 @@ The current init script is minimal and safe — it only overrides the getter via

 `renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.

-- **ViewPort:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
+- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
 - **Locale:** `en-US`
 - **Timezone:** `America/New_York`
-- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
+- **User-Agent:** macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
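
The display auto-detection and context settings listed above might look roughly like this. It is a sketch, not the actual `renderPage()` code: `resolveHeadless` and `contextOptions` are hypothetical names, and the pinned user-agent string is left out because the diff does not give it.

```javascript
// Hypothetical sketch: an explicit options.headless wins; otherwise go
// headless only when neither $DISPLAY nor $WAYLAND_DISPLAY is set.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === "boolean") return options.headless;
  return !(env.DISPLAY || env.WAYLAND_DISPLAY);
}

// Context options mirroring the list above, in the shape accepted by
// Playwright's browser.newContext(). A userAgent field would carry the
// pinned macOS Chrome 130 string.
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: "en-US",
  timezoneId: "America/New_York",
};
```

Passing `env` as a parameter keeps the detection testable without touching the real process environment.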

 ## Docker / Podman support

@@ -145,19 +147,22 @@ Port `5900` inside the container maps to `5901` on the host to avoid conflicts w
 Includes a `headful` profile that can be run with:

 ```bash
-docker compose --profile headful up archiver-headful
+URL=https://example.com docker compose --profile headful up archiver-headful
 ```

+Unlike `podman-run.sh`, Compose currently maps VNC to host port `5900`.
+
 ## Known limitations

-### Bloomberg bot wall
+### Site-specific blocking

-Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
+Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site- or network-sensitive failures and reproduce them with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg pages have recently been archived successfully in local verification.

 ### Unsupported adblock syntax

-- Advanced procedural cosmetic filters (`:remove()`, `:upward()`, `:xpath()`) are silently ignored.
-- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
+- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`, etc.) are silently ignored.
+- Terminal `:remove()` cosmetic filters are downgraded to CSS hiding.
+- Scriptlet injection (`##+js(...)`) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
 - Preprocessor directives (`!#if`) are skipped.

 ## Adding a new site to privacy filters