diff --git a/AGENTS.md b/AGENTS.md
index 0ded477..6239381 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,7 +7,7 @@ This tool renders web pages in Chromium (via Playwright) and saves them as fully
 The pipeline is:
 
 ```
-URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
+URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML
 ```
 
 ## Source layout
@@ -31,11 +31,11 @@ At module load time `archiver.mjs` parses this file into three arrays (`blockRul
 
 #### Cosmetic rule caveats
 
-Some advanced cosmetic syntax is unsupported and is silently discarded:
+Some advanced cosmetic syntax is unsupported and is silently discarded or downgraded:
 
-- `:remove()` — we can't actually remove DOM nodes from the CSS layer; only hide them.
-- `:style(...)` — converted to real CSS `display: none`.
-- `:xpath(...)`, `:upward(...)`, `:matches-css(...)` — discarded during parsing.
+- Terminal `:remove()` — converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
+- `:style(...)` — converted to real CSS using the style content from the filter rule.
+- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`, `:matches-media(...)`, `:matches-path(...)` — discarded during parsing.
 
 ### `userscript/` directory
 
@@ -43,7 +43,9 @@ Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library
 
 #### Selective injection (important)
 
-Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.
+Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. The shared `bpc_func.js` helper is injected first, then the matching userscript files.
+
+Malformed asset fetches such as quoted Stripe or Google Pay script URLs usually mean escaped markup inside `srcdoc` or another HTML attribute is being parsed as top-level HTML. The inliner should only read attributes from real opening tags, and it sanitizes `srcdoc` iframe HTML recursively.
 
 The matching logic is a simple glob parser for userscript `@match` patterns:
 - `*://*.com/*` matches any `.com` domain
@@ -60,22 +62,22 @@ Userscripts are injected **after** `domcontentloaded` but **before** `networkidl
 
 ## Stealth / anti-detection
 
-### The `playwright-extra` stealth packages are broken
+### Stealth package naming trap
 
-The common recommendation is:
+The package names that look like Playwright-specific stealth plugins are placeholders:
 
 ```bash
 npm install playwright-extra playwright-extra-plugin-stealth
 ```
 
-**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
+`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, and `playwright-stealth` are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
 
 ```
 Error: Wrong package, please see this: https://github.com/berstend/puppeteer-extra/issues/454
 ```
 
-No functional Playwright stealth plugin exists in those package names. We therefore removed them from `package.json`.
+If we revisit package-based stealth, the working route to evaluate is `playwright-extra` with `puppeteer-extra-plugin-stealth`. This project currently uses manual evasions instead, keeping `package.json` limited to plain Playwright.
 
 ### Manual stealth evasions
 
@@ -111,10 +113,10 @@ The current init script is minimal and safe — it only overrides the getter via
 
 `renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.
 
-- **ViewPort:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
+- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
 - **Locale:** `en-US`
 - **Timezone:** `America/New_York`
-- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
+- **User-Agent:** macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
 
 ## Docker / Podman support
 
@@ -145,19 +147,22 @@ Port `5900` inside the container maps to `5901` on the host to avoid conflicts w
 Includes a `headful` profile that can be run with:
 
 ```bash
-docker compose --profile headful up archiver-headful
+URL=https://example.com docker compose --profile headful up archiver-headful
 ```
 
+Unlike `podman-run.sh`, Compose currently maps VNC to host port `5900`.
+
 ## Known limitations
 
-### Bloomberg bot wall
+### Site-specific blocking
 
-Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
+Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site/network-sensitive failures and reproduce with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg has recently archived successfully in local verification.
 
 ### Unsupported adblock syntax
 
-- Advanced procedural cosmetic filters (`:remove()`, `:upward()`, `:xpath()`) are silently ignored.
-- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
+- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`, etc.) are silently ignored.
+- Terminal `:remove()` cosmetic filters are downgraded to CSS hiding.
+- Scriptlet injection (`##+js(...)`) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
 - Preprocessor directives (`!#if`) are skipped.
 
 ## Adding a new site to privacy filters