fix agents.md
39
AGENTS.md
@@ -7,7 +7,7 @@ This tool renders web pages in Chromium (via Playwright) and saves them as fully
 The pipeline is:

 ```
-URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
+URL ──► set request filters ──► Playwright render ──► inject cosmetic filters/userscripts ──► inline assets ──► write HTML
 ```

 ## Source layout
@@ -31,11 +31,11 @@ At module load time `archiver.mjs` parses this file into three arrays (`blockRul

 #### Cosmetic rule caveats

-Some advanced cosmetic syntax is unsupported and is silently discarded:
+Some advanced cosmetic syntax is unsupported and is silently discarded or downgraded:

-- `:remove()` — we can't actually remove DOM nodes from the CSS layer; only hide them.
-- `:style(...)` — converted to real CSS `display: none`.
-- `:xpath(...)`, `:upward(...)`, `:matches-css(...)` — discarded during parsing.
+- Terminal `:remove()` — converted to CSS hiding; we can't actually remove DOM nodes from the CSS layer.
+- `:style(...)` — converted to real CSS using the style content from the filter rule.
+- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`, `:matches-media(...)`, `:matches-path(...)` — discarded during parsing.
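
The downgrade rules in the updated list can be sketched as a small classifier. This is an illustration only, not the parser in `archiver.mjs`: the function name `cosmeticToCss`, its return shape, and the use of `!important` are assumptions made for the example.

```javascript
// Hypothetical helper: turn one cosmetic selector from a filter list into a CSS
// rule, or return null when the selector uses syntax that is discarded.
const DISCARDED_OPERATORS = [
  ':xpath(', ':upward(', ':matches-css(', ':matches-media(', ':matches-path(',
];

function cosmeticToCss(selector) {
  // Procedural operators that plain CSS cannot express are dropped.
  if (DISCARDED_OPERATORS.some((op) => selector.includes(op))) return null;

  // A terminal :remove() is downgraded to hiding the element.
  if (selector.endsWith(':remove()')) {
    const base = selector.slice(0, -':remove()'.length);
    return `${base} { display: none !important; }`;
  }

  // :style(...) keeps the declarations written in the filter rule.
  const style = selector.match(/^(.*):style\((.*)\)$/);
  if (style) return `${style[1]} { ${style[2]} }`;

  // Plain selectors are hidden (assumed default for cosmetic rules).
  return `${selector} { display: none !important; }`;
}
```

Returning `null` for discarded rules keeps the caller's filtering step explicit. Whether the real parser signals a discard the same way is not stated in this file.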

 ### `userscript/` directory
@@ -43,7 +43,9 @@ Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library

 #### Selective injection (important)

-Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.
+Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. The shared `bpc_func.js` helper is injected first, then the matching userscript files.
+
+Malformed asset fetches such as quoted Stripe or Google Pay script URLs usually mean escaped markup inside `srcdoc` or another HTML attribute is being parsed as top-level HTML. The inliner should only read attributes from real opening tags, and it sanitizes `srcdoc` iframe HTML recursively.

 The matching logic is a simple glob parser for userscript `@match` patterns:
 - `*://*.com/*` matches any `.com` domain
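
A glob matcher of the kind described for `@match` / `@exclude` could look roughly like the sketch below. The names `matchPatternToRegExp` and `shouldInject` are hypothetical, and the real parser may treat scheme, host, and path separately. Here every `*` simply becomes "any characters", so a `*` in the host position can also match across `/`.

```javascript
// Hypothetical helper: convert a userscript @match glob into a RegExp.
function matchPatternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters except `*`
    .replace(/\*/g, '.*');                 // each `*` matches any run of characters
  return new RegExp(`^${escaped}$`);
}

// A script qualifies when some @match pattern hits and no @exclude pattern does.
function shouldInject(url, { match = [], exclude = [] }) {
  const hits = (patterns) => patterns.some((p) => matchPatternToRegExp(p).test(url));
  return hits(match) && !hits(exclude);
}

// Usage with made-up metadata:
const meta = { match: ['*://*.com/*'], exclude: ['*://*.example.com/*'] };
console.log(shouldInject('https://www.bloomberg.com/news/x', meta)); // true
console.log(shouldInject('https://www.example.com/page', meta));     // false
```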
@@ -60,22 +62,22 @@ Userscripts are injected **after** `domcontentloaded` but **before** `networkidl

 ## Stealth / anti-detection

-### The `playwright-extra` stealth packages are broken
+### Stealth package naming trap

-The common recommendation is:
+The package names that look like Playwright-specific stealth plugins are placeholders:

 ```bash
 npm install playwright-extra playwright-extra-plugin-stealth
 ```

-**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
+`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, and `playwright-stealth` are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:

 ```
 Error: Wrong package, please see this:
 https://github.com/berstend/puppeteer-extra/issues/454
 ```

-No functional Playwright stealth plugin exists in those package names. We therefore removed them from `package.json`.
+If we revisit package-based stealth, the working route to evaluate is `playwright-extra` with `puppeteer-extra-plugin-stealth`. This project currently uses manual evasions instead, keeping `package.json` limited to plain Playwright.

 ### Manual stealth evasions
@@ -111,10 +113,10 @@ The current init script is minimal and safe — it only overrides the getter via

 `renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.

-- **ViewPort:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
+- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
 - **Locale:** `en-US`
 - **Timezone:** `America/New_York`
-- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
+- **User-Agent:** macOS Chrome 130. This is pinned for site compatibility and is not automatically synchronized with the installed Chromium version.
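
The headless auto-detection and the context settings listed here can be sketched as follows. `resolveHeadless` is a hypothetical name and this is not the actual body of `renderPage()`. The option keys follow the shape accepted by Playwright's `browser.newContext()`, and the User-Agent string is omitted because this file does not give its exact value.

```javascript
// Hypothetical helper mirroring the behavior described for renderPage().
function resolveHeadless(options = {}, env = process.env) {
  // An explicit caller choice always wins.
  if (typeof options.headless === 'boolean') return options.headless;
  // Otherwise run headful only when an X11 or Wayland display is present.
  return !(env.DISPLAY || env.WAYLAND_DISPLAY);
}

// Context settings from the list, ready to pass to browser.newContext().
// A pinned `userAgent` string would be added alongside these keys.
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
};
```

Passing `env` as a parameter keeps the detection testable without touching the real environment.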

 ## Docker / Podman support
@@ -145,19 +147,22 @@ Port `5900` inside the container maps to `5901` on the host to avoid conflicts w
 Includes a `headful` profile that can be run with:

 ```bash
-docker compose --profile headful up archiver-headful
+URL=https://example.com docker compose --profile headful up archiver-headful
 ```

+Unlike `podman-run.sh`, Compose currently maps VNC to host port `5900`.
+
 ## Known limitations

-### Bloomberg bot wall
+### Site-specific blocking

-Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
+Some publishers can still return bot walls, consent walls, or region-specific variants depending on IP reputation and timing. Treat these as site/network-sensitive failures and reproduce with the exact URL, mode, and environment before assuming the browser stealth layer is the root cause. Bloomberg has recently archived successfully in local verification.

 ### Unsupported adblock syntax

-- Advanced procedural cosmetic filters (`:remove()`, `:upward()`, `:xpath()`) are silently ignored.
-- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
+- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`, etc.) are silently ignored.
+- Terminal `:remove()` cosmetic filters are downgraded to CSS hiding.
+- Scriptlet injection (`##+js(...)`) is not supported by the filter parser. The BPC userscripts still run separately when their metadata matches the page.
 - Preprocessor directives (`!#if`) are skipped.

 ## Adding a new site to privacy filters