fixes
205
AGENTS.md
Normal file
@@ -0,0 +1,205 @@
# Agent notes for local-page-archiver

## Project overview

This tool renders web pages in Chromium (via Playwright) and saves them as fully self-contained HTML files. All external assets (images, fonts, stylesheets) are inlined as data URIs so the resulting file works offline.

The pipeline is:

```
URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
```

## Source layout

- `src/cli.mjs`: CLI entrypoint. Supports the `archive` and `help` commands and accepts the `--archive-path`, `--id`, and `--headful` flags.
- `src/archiver.mjs`: Core archiving logic. Loads privacy filters, steers the browser, injects adblock rules and userscripts, and calls the inliner.
- `src/asset-inliner.mjs`: Fetches and inlines external resources (images, CSS, iframes). Also strips `<script>` and `<noscript>` tags so the archive is static.
- `privacy-filters/`: Third-party filter lists and userscripts used to strip paywalls, trackers, and ad banners before the snapshot is taken.

## Privacy filters (`privacy-filters/`)

### `bpc-paywall-filter.txt`

An AdBlock Plus / uBlock Origin filter list. It contains three kinds of rules:

1. **Network rules** (`||tracker.com^`, `/regex/`): block specific third-party paywall and tracking scripts.
2. **Exception rules** (`@@||example.com^`): allow requests that a broader block rule would otherwise hit.
3. **Cosmetic rules** (`example.com##.paywall`): inject CSS to hide DOM elements (e.g. subscription banners, blurred overlays).

At module load time `archiver.mjs` parses this file into three arrays (`blockRules`, `allowRules`, `cosmeticRules`). Network rules are enforced at the Playwright level with `page.route(...)`. Cosmetic rules are injected as a `<style>` tag after the page reaches `domcontentloaded`.
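
To make the three buckets concrete, here is a minimal sketch of how filter lines could be sorted. It is not the real `parseFilterRules()`; the function name `classifyFilterLines` and the shape of the returned objects are assumptions for illustration, and it ignores scriptlets, `#@#` exceptions, and `$options`.

```js
// Hypothetical sketch: sort raw filter lines into the three arrays.
function classifyFilterLines(text) {
  const rules = { blockRules: [], allowRules: [], cosmeticRules: [] };
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    // Skip blanks, comments ("! ..."), and list headers ("[Adblock Plus 2.0]").
    if (!line || line.startsWith("!") || line.startsWith("[")) continue;

    const cosmeticIdx = line.indexOf("##");
    if (cosmeticIdx >= 0) {
      // Cosmetic rule: "domain1,domain2##selector" (the domain part may be empty).
      rules.cosmeticRules.push({
        domains: line.slice(0, cosmeticIdx).split(",").filter(Boolean),
        selector: line.slice(cosmeticIdx + 2),
      });
    } else if (line.startsWith("@@")) {
      // Exception rule: keep the pattern without the "@@" prefix.
      rules.allowRules.push(line.slice(2));
    } else {
      // Everything else is treated as a network block rule.
      rules.blockRules.push(line);
    }
  }
  return rules;
}
```

Keeping exceptions in their own array lets the request handler check `allowRules` first and fall through to `blockRules` only when nothing allows the request.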

#### Cosmetic rule caveats

Advanced cosmetic syntax is only partly supported. Rules that cannot be expressed as plain CSS are approximated or silently discarded:

- `:remove()`: CSS cannot remove DOM nodes, so the matched elements are only hidden (`display: none !important`).
- `:style(...)`: converted to real CSS `display: none`.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`: discarded during parsing.
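
The caveats above can be sketched as a small conversion function. The `:remove()` branch mirrors what this commit adds to `cosmeticSelectorToCss()`; the unsupported-operator check and the plain-selector fallback are assumptions, and `:style(...)` is left out because its output is not shown here.

```js
// Sketch only: the function name and the fallbacks are illustrative.
const UNSUPPORTED_OPERATORS = [":xpath(", ":upward(", ":matches-css("];

function cosmeticSelectorToCssSketch(selector) {
  // Procedural operators with no CSS equivalent are dropped.
  if (UNSUPPORTED_OPERATORS.some((op) => selector.includes(op))) return null;

  // ":remove()" cannot delete nodes from a stylesheet, so hide them instead.
  if (selector.endsWith(":remove()")) {
    const base = selector.slice(0, -":remove()".length);
    return base ? `${base} { display: none !important; }` : null;
  }

  // Assumption: plain selectors are hidden as-is.
  return `${selector} { display: none !important; }`;
}
```

Returning `null` for unsupported rules lets the caller filter them out before building the injected `<style>` tag.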

### `userscript/` directory

Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library `bpc_func.js`. They do the heavy lifting: decrypt paywalls, reconstruct article text from JSON data embedded in the page, remove blur overlays, etc.

#### Selective injection (important)

Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.

The matching logic is a simple glob parser for userscript `@match` patterns:

- `*://*.com/*` matches any `.com` domain
- `*://example.com/path/*` matches that path prefix
- `@exclude` patterns take precedence and skip the script
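
A minimal sketch of the glob matching described above, assuming each script record carries `matches` and `excludes` arrays (as in `userScriptData`). The helper names `globToRegExp` and `shouldInject` are hypothetical, and `*` is treated as "any characters", which is looser than the full userscript match-pattern rules.

```js
// Hypothetical helpers; not taken from archiver.mjs.
function globToRegExp(glob) {
  // Escape regex metacharacters, then turn each "*" into ".*".
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i");
}

function shouldInject(url, { matches = [], excludes = [] }) {
  // @exclude takes precedence over @match.
  if (excludes.some((glob) => globToRegExp(glob).test(url))) return false;
  return matches.some((glob) => globToRegExp(glob).test(url));
}
```

Checking `excludes` first keeps the precedence rule in one place, so a script with a broad `@match` can still opt out of specific paths.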

#### `GM.xmlHttpRequest` mock

The userscripts rely on `GM.xmlHttpRequest` to fetch article text from archive mirrors or API endpoints. That API does not exist in a Playwright context, so we inject a tiny mock that wraps the browser's native `fetch()` and exposes the same callback interface (`onload`, `onerror`).
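
For orientation, a fetch-backed mock might look like the sketch below. This is not the actual `GM_MOCK` string; the factory name `createGmMock` and the injectable `fetchImpl` parameter are assumptions that make the sketch testable outside a browser.

```js
// Sketch of a fetch-backed GM.xmlHttpRequest mock; the real GM_MOCK may differ.
function createGmMock(fetchImpl = globalThis.fetch) {
  return {
    xmlHttpRequest({ url, method = "GET", headers = {}, data, onload, onerror }) {
      return fetchImpl(url, { method, headers, body: data })
        .then(async (res) => {
          // Present the response fields userscripts typically read.
          const responseText = await res.text();
          if (onload) onload({ status: res.status, responseText, finalUrl: res.url });
        })
        .catch((err) => {
          // Network failures are routed to the error callback.
          if (onerror) onerror(err);
        });
    },
  };
}
```

Inside the page this would be exposed as `globalThis.GM = createGmMock()`. Note that requests made this way are still subject to the page's CORS rules unless the browser was launched with web security disabled.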

#### Timing

Userscripts are injected **after** `domcontentloaded` but **before** `networkidle`. We then wait an extra 2 s (`page.waitForTimeout`) so that any `setTimeout(..., 1000)` callbacks inside the scripts have time to fire before we snapshot the DOM.
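
The ordering can be sketched as follows. `renderWithUserscripts` and the `injectUserscripts` callback are hypothetical names, and placing the 2 s wait after `networkidle` is an assumption; only the Playwright `page` methods themselves are real API.

```js
// Sketch of the ordering only; `injectUserscripts` is a hypothetical callback.
async function renderWithUserscripts(page, url, injectUserscripts) {
  // Stop waiting as soon as the DOM is parsed, not when the page is idle.
  await page.goto(url, { waitUntil: "domcontentloaded" });
  // Userscripts go in here, before the network has settled.
  await injectUserscripts(page);
  await page.waitForLoadState("networkidle");
  // Extra grace period so delayed callbacks in the scripts can fire.
  await page.waitForTimeout(2000);
  // Snapshot the DOM.
  return page.content();
}
```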

## Stealth / anti-detection

### The `playwright-extra` stealth packages are broken

The common recommendation is:

```bash
npm install playwright-extra playwright-extra-plugin-stealth
```

**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that throw on `require()`:

```
Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454
```

No functional Playwright stealth plugin exists under those package names. We therefore removed them from `package.json`.

### Manual stealth evasions

Instead we apply the same core evasions manually via `context.addInitScript()` and browser launch flags.

**Launch flags:**
- `--disable-blink-features=AutomationControlled`
- `--disable-infobars`
- `--disable-web-security`
- `--no-sandbox`, `--disable-setuid-sandbox`
- `--disable-dev-shm-usage`
- Removed `--enable-automation` via `ignoreDefaultArgs`
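
As a rough sketch, the flags above map onto a Playwright launch-options object like this. The builder name `buildLaunchOptions` is an assumption; `args` and `ignoreDefaultArgs` are standard `chromium.launch()` options.

```js
// Sketch: builds the options object that would be passed to chromium.launch().
function buildLaunchOptions({ headless = true } = {}) {
  return {
    headless,
    args: [
      "--disable-blink-features=AutomationControlled",
      "--disable-infobars",
      "--disable-web-security",
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-dev-shm-usage",
    ],
    // Drops Playwright's default automation switch rather than adding a flag.
    ignoreDefaultArgs: ["--enable-automation"],
  };
}
```

Usage would be `chromium.launch(buildLaunchOptions({ headless: false }))` with `chromium` imported from `playwright`.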

**Init script (injected into every page before any page scripts run):**
```js
Object.defineProperty(navigator, 'webdriver',
  { get: () => undefined, configurable: true, enumerable: true });
window.chrome = window.chrome || { runtime: {} };
window.navigator.permissions.query = (/* patched for notifications */);
```

#### CRITICAL: Avoid `delete navigator.webdriver` + iframe trick

An earlier version used a more elaborate stealth snippet that did `delete navigator.webdriver` and then created an `<iframe>` to steal the real navigator descriptor. **This crashed the Chromium renderer process on tab creation** with:

```
Protocol error (Page.addScriptToEvaluateOnNewDocument): Target crashed
```

The current init script is minimal and safe: it only overrides the getter via `Object.defineProperty` and avoids DOM mutation during page init.

## Browser context & headful mode

`renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set, it defaults to headless. The caller can override this via `options.headless`.

- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
- **Locale:** `en-US`
- **Timezone:** `America/New_York`
- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
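
A small sketch of the display detection and context settings, under stated assumptions: `resolveHeadless` and `buildContextOptions` are illustrative names, and running headful whenever a display is present is inferred from the description above. The option keys follow Playwright's `browser.newContext()`.

```js
// Sketch: illustrative helpers, not the actual renderPage() internals.
function resolveHeadless(options = {}, env = process.env) {
  // An explicit caller choice always wins.
  if (typeof options.headless === "boolean") return options.headless;
  // Otherwise run headful only when an X11 or Wayland display is present.
  return !(env.DISPLAY || env.WAYLAND_DISPLAY);
}

function buildContextOptions(userAgent) {
  return {
    viewport: { width: 1366, height: 768 },
    locale: "en-US",
    timezoneId: "America/New_York",
    userAgent, // pass in the macOS Chrome 130 string the project uses
  };
}
```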

## Docker / Podman support

### Dockerfile

- Base: `mcr.microsoft.com/playwright:v1.60.0` (must stay in sync with the `playwright` npm version)
- Installs Node 22 (the base image may ship an older Node)
- Runs `npx playwright install chromium` so the browser binary is baked into the image

### `podman-run.sh`

Helper for local runs. Two modes:

1. **`./podman-run.sh archive <URL>`**: headless, mounts `./archives`
2. **`./podman-run.sh headful-archive <URL>`**: headful with internal VNC

**Headful mode details:**

The container's `ENTRYPOINT` is `node src/cli.mjs`. To run a shell command inside the container (setting up Xvfb + x11vnc) we must override the entrypoint:

```bash
podman run --rm --entrypoint sh <image> -c "...setup Xvfb... && node src/cli.mjs archive <URL>"
```

Port `5900` inside the container maps to `5901` on the host to avoid conflicts with macOS's built-in VNC.

### `docker-compose.yml`

Includes a `headful` profile that can be run with:

```bash
docker compose --profile headful up archiver-headful
```


## Known limitations

### Bloomberg bot wall

Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified that `curl` with identical headers returns the same page text, which indicates the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or an archive mirror service.

### Unsupported adblock syntax

- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`, `:matches-css()`) are silently ignored; `:remove()` is approximated by hiding the matched elements.
- Scriptlet injection (`##+js(...)`) is not supported; only cosmetic CSS injection works.
- Preprocessor directives (`!#if`) are skipped.


## Adding a new site to privacy filters

If you add a new filter rule or userscript:

1. **Filter rules:** Edit `privacy-filters/bpc-paywall-filter.txt`. `archiver.mjs` reloads the file on every process start, so no code changes are needed.
2. **Userscripts:** Drop a new `.user.js` into `privacy-filters/userscript/` and add its filename to the `userScriptFiles` array inside `loadPrivacyFilters()` in `archiver.mjs`.
3. Test with `node src/cli.mjs archive <URL>` and inspect the generated HTML.


## Rebuilding the Docker image

The Playwright npm version and the base image tag must match:

```bash
# Check what npm installed
node -e "console.log(require('playwright/package.json').version)"

# Update the FROM line in Dockerfile if needed
# Then rebuild
podman build -t local-page-archiver .
```
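
If this check ever needs automating, a pure comparison function is enough. The sketch below is hypothetical (the repository has no such helper as far as these notes show) and assumes the `FROM` line uses the `mcr.microsoft.com/playwright:v<version>` form, optionally followed by a suffix.

```js
// Sketch: compares a Dockerfile's Playwright image tag with an npm version string.
function playwrightTagMatches(dockerfileText, npmVersion) {
  // Captures the x.y.z part of "FROM mcr.microsoft.com/playwright:vx.y.z".
  const m = dockerfileText.match(/^FROM\s+mcr\.microsoft\.com\/playwright:v(\d+\.\d+\.\d+)/im);
  return Boolean(m) && m[1] === npmVersion;
}
```

A caller would read `Dockerfile` and `node_modules/playwright/package.json` from disk and pass the file text and the `version` field in.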

## Development quick reference

```bash
# Install deps
npm install

# Install browser binaries
npm run install-browsers

# Archive a page (headless)
node src/cli.mjs archive https://example.com

# Archive a page (headful on macOS)
node src/cli.mjs archive https://example.com --headful

# Archive inside container (headless)
./podman-run.sh archive https://example.com

# Archive inside container (headful + VNC)
./podman-run.sh headful-archive https://example.com
# Then open vnc://localhost:5901
```
package.json
@@ -9,6 +9,7 @@
   },
   "scripts": {
     "archive": "node src/cli.mjs archive",
+    "test": "node --test test/*.test.mjs",
     "install-browsers": "playwright install chromium"
   },
   "dependencies": {
255
src/archiver.mjs
@@ -32,6 +32,7 @@ const PRIVACY_FILTERS_DIR = path.join(__dirname, "..", "privacy-filters");
 let privacyFiltersAvailable = false;
 let filterRules = { blockRules: [], allowRules: [], cosmeticRules: [] };
 let userScriptData = []; // { file, content, matches, excludes }
+let userScriptRequireContent = "";
 
 async function loadPrivacyFilters() {
   try {
@@ -40,6 +41,7 @@ async function loadPrivacyFilters() {
     filterRules = parseFilterRules(filterContent);
 
     const userscriptDir = path.join(PRIVACY_FILTERS_DIR, "userscript");
+    userScriptRequireContent = await fs.readFile(path.join(userscriptDir, "bpc_func.js"), "utf8");
     const userScriptFiles = [
       "bpc.en.user.js",
       "bpc.de.user.js",
@@ -126,7 +128,7 @@ function parseNetworkRule(line) {
   const lastDollar = line.lastIndexOf("$");
   if (lastDollar > 0) {
     const optsStr = line.slice(lastDollar + 1);
-    if (/^[a-z,=~\-|0-9]+$/i.test(optsStr)) {
+    if (/^[a-z,=~_.\-|0-9]+$/i.test(optsStr)) {
       options = optsStr.split(",");
       pattern = line.slice(0, lastDollar);
     }
@@ -134,8 +136,20 @@ function parseNetworkRule(line) {
 
   if (!pattern) return null;
 
-  const type = options.find((o) =>
-    ["script", "stylesheet", "image", "media", "xmlhttprequest", "other", "inline-script"].includes(o)
+  const types = options.filter((o) =>
+    [
+      "document",
+      "font",
+      "image",
+      "inline-script",
+      "media",
+      "object",
+      "other",
+      "script",
+      "stylesheet",
+      "subdocument",
+      "xmlhttprequest"
+    ].includes(o)
   );
   const isThirdParty = options.includes("third-party");
   const isFirstParty = options.includes("~third-party");
@@ -162,7 +176,7 @@ function parseNetworkRule(line) {
       kind: "domain",
       domain,
       path,
-      type,
+      types,
       isThirdParty,
       isFirstParty,
       includeDomains,
@@ -171,27 +185,38 @@ function parseNetworkRule(line) {
     };
   }
 
-  if (pattern.startsWith("/")) {
-    const lastSlash = pattern.lastIndexOf("/");
-    if (lastSlash > 0) {
-      const regex = pattern.slice(1, lastSlash);
-      return {
-        kind: "regex",
-        regex,
-        type,
-        isThirdParty,
-        isFirstParty,
-        includeDomains,
-        excludeDomains,
-        important
-      };
-    }
+  if (pattern.startsWith("/") && pattern.endsWith("/") && pattern.length > 1) {
+    const regex = pattern.slice(1, -1);
+    return {
+      kind: "regex",
+      regex,
+      types,
+      isThirdParty,
+      isFirstParty,
+      includeDomains,
+      excludeDomains,
+      important
+    };
   }
 
-  return null;
+  return {
+    kind: "pattern",
+    regex: adblockPatternToRegex(pattern),
+    types,
+    isThirdParty,
+    isFirstParty,
+    includeDomains,
+    excludeDomains,
+    important
+  };
 }
 
 function cosmeticSelectorToCss(selector) {
+  if (selector.endsWith(":remove()")) {
+    const baseSelector = selector.slice(0, -":remove()".length);
+    return baseSelector ? `${baseSelector} { display: none !important; }` : null;
+  }
+
   const styleMatch = selector.match(/:style\((.+)\)$/);
   if (styleMatch) {
     const baseSelector = selector.slice(0, selector.lastIndexOf(":style("));
@@ -246,17 +271,8 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
     if (blocked) return false;
   }
 
-  if (rule.type) {
-    const typeMap = {
-      script: "script",
-      stylesheet: "stylesheet",
-      image: "image",
-      media: "media",
-      xmlhttprequest: "xhr",
-      other: "other",
-      "inline-script": "script"
-    };
-    if (typeMap[rule.type] && resourceType !== typeMap[rule.type]) {
+  if (rule.types.length > 0) {
+    if (!rule.types.some((type) => resourceTypeMatches(type, resourceType))) {
       return false;
     }
   }
@@ -271,18 +287,11 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
   }
 
   if (rule.kind === "domain") {
-    const domainRe = new RegExp(
-      "^" + rule.domain.replace(/\./g, "\\.").replace(/\*/g, "[^.]*") + "$",
-      "i"
-    );
-    if (!domainRe.test(hostname)) return false;
+    if (!domainPatternMatches(hostname, rule.domain)) return false;
 
     if (rule.path) {
-      const pathRe = new RegExp(
-        "^" + rule.path.replace(/\./g, "\\.").replace(/\*/g, ".*").replace(/\?/g, "\\?").replace(/\^/g, ""),
-        "i"
-      );
-      if (!pathRe.test(urlObj.pathname)) return false;
+      const pathRe = new RegExp("^" + adblockPatternToRegex(rule.path), "i");
+      if (!pathRe.test(urlObj.pathname + urlObj.search)) return false;
     }
     return true;
   }
@@ -296,9 +305,84 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
     }
   }
 
+  if (rule.kind === "pattern") {
+    try {
+      const re = new RegExp(rule.regex, "i");
+      return re.test(url);
+    } catch {
+      return false;
+    }
+  }
+
   return false;
 }
 
+function resourceTypeMatches(filterType, resourceType) {
+  const typeMap = {
+    document: ["document"],
+    font: ["font"],
+    image: ["image"],
+    "inline-script": ["script"],
+    media: ["media"],
+    object: ["object"],
+    other: ["other"],
+    script: ["script"],
+    stylesheet: ["stylesheet"],
+    subdocument: ["document"],
+    xmlhttprequest: ["fetch", "xhr"]
+  };
+  const mapped = typeMap[filterType];
+  return mapped ? mapped.includes(resourceType) : false;
+}
+
+function domainPatternMatches(hostname, pattern) {
+  const normalized = pattern.replace(/\^$/, "").toLowerCase();
+  if (!normalized) return false;
+
+  if (!normalized.includes("*")) {
+    return hostname === normalized || hostname.endsWith("." + normalized);
+  }
+
+  const re = new RegExp(
+    "^" +
+      normalized
+        .split("*")
+        .map((part) => part.replace(/[|\\{}()[\]^$+?.]/g, "\\$&"))
+        .join("[^.]*") +
+      "$",
+    "i"
+  );
+  return re.test(hostname);
+}
+
+function adblockPatternToRegex(pattern) {
+  let source = "";
+  let remaining = pattern;
+  let anchoredStart = false;
+  let anchoredEnd = false;
+
+  if (remaining.startsWith("|")) {
+    anchoredStart = true;
+    remaining = remaining.slice(1);
+  }
+  if (remaining.endsWith("|")) {
+    anchoredEnd = true;
+    remaining = remaining.slice(0, -1);
+  }
+
+  for (const ch of remaining) {
+    if (ch === "*") {
+      source += ".*";
+    } else if (ch === "^") {
+      source += "(?:[^A-Za-z0-9_.%-]|$)";
+    } else {
+      source += ch.replace(/[|\\{}()[\]^$+?.]/g, "\\$&");
+    }
+  }
+
+  return `${anchoredStart ? "^" : ""}${source}${anchoredEnd ? "$" : ""}`;
+}
+
 function shouldBlockRequest(url, resourceType, sourceHostname) {
   if (url === sourceHostname || url.startsWith(sourceHostname + "/")) {
     return false;
@@ -512,7 +596,7 @@ async function setupRequestBlocking(page, sourceHostname) {
   await page.route("**/*", (route) => {
     try {
       const request = route.request();
-      if (request.isNavigationRequest()) {
+      if (request.isNavigationRequest() && request.frame() === page.mainFrame()) {
         route.continue();
         return;
       }
@@ -584,6 +668,9 @@ async function injectPrivacyUserScripts(page, sourceUrl) {
   // Inject GM API mock first.
   try {
     await page.addScriptTag({ content: GM_MOCK });
+    if (userScriptRequireContent) {
+      await page.addScriptTag({ content: userScriptRequireContent });
+    }
   } catch {
     return;
   }
@@ -731,15 +818,19 @@ function addArchiveComment(html, sourceUrl) {
 
 export function findExternalAssetRefs(html) {
   const refs = new Set();
-  const attrPattern = /\s(?:src|srcset|poster|data)\s*=\s*(["'])([\s\S]*?)\1/gi;
-  for (const match of html.matchAll(attrPattern)) {
-    if (isSelfContainedAssetRef(match[2])) {
-      continue;
-    }
-    for (const part of match[2].split(",")) {
-      const candidate = part.trim().split(/\s+/)[0];
-      if (candidate && !isSelfContainedAssetRef(candidate)) {
-        refs.add(candidate);
+  const assetTagPattern = /<(?:img|source|audio|video|track|embed|object|input|iframe)\b[^>]*>/gi;
+  for (const match of html.matchAll(assetTagPattern)) {
+    const tag = match[0];
+    for (const attr of ["src", "srcset", "poster", "data"]) {
+      const value = readAttribute(tag, attr);
+      if (!value || isSelfContainedAssetRef(value)) {
+        continue;
+      }
+      for (const part of value.split(",")) {
+        const candidate = part.trim().split(/\s+/)[0];
+        if (candidate && !isSelfContainedAssetRef(candidate)) {
+          refs.add(candidate);
+        }
       }
     }
   }
@@ -779,8 +870,8 @@ function isSelfContainedAssetRef(value) {
 }
 
 function readAttribute(tag, attr) {
-  const match = tag.match(new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i"));
-  return match ? match[2] ?? match[3] ?? match[4] ?? "" : "";
+  const match = findAttribute(tag, attr);
+  return match ? match.value : "";
 }
 
 function cleanCssUrl(value) {
@@ -796,3 +887,61 @@ function cleanCssUrl(value) {
   }
   return decoded;
 }
+
+function findAttribute(openingTag, attr) {
+  const attrLower = attr.toLowerCase();
+  const nameMatch = openingTag.match(/^<[^\s/>]+/);
+  let index = nameMatch ? nameMatch[0].length : 1;
+
+  while (index < openingTag.length) {
+    while (index < openingTag.length && /\s/.test(openingTag[index])) {
+      index += 1;
+    }
+    if (index >= openingTag.length || openingTag[index] === ">" || openingTag[index] === "/") {
+      return null;
+    }
+
+    const start = index;
+    while (index < openingTag.length && !/[\s=/>]/.test(openingTag[index])) {
+      index += 1;
+    }
+    const name = openingTag.slice(start, index);
+
+    while (index < openingTag.length && /\s/.test(openingTag[index])) {
+      index += 1;
+    }
+
+    let value = "";
+    if (openingTag[index] === "=") {
+      index += 1;
+      while (index < openingTag.length && /\s/.test(openingTag[index])) {
+        index += 1;
+      }
+
+      const quote = openingTag[index];
+      if (quote === '"' || quote === "'") {
+        index += 1;
+        const valueStart = index;
+        while (index < openingTag.length && openingTag[index] !== quote) {
+          index += 1;
+        }
+        value = openingTag.slice(valueStart, index);
+        if (openingTag[index] === quote) {
+          index += 1;
+        }
+      } else {
+        const valueStart = index;
+        while (index < openingTag.length && !/[\s>]/.test(openingTag[index])) {
+          index += 1;
+        }
+        value = openingTag.slice(valueStart, index);
+      }
+    }
+
+    if (name.toLowerCase() === attrLower) {
+      return { start, end: index, value };
+    }
+  }
+
+  return null;
+}
src/asset-inliner.mjs
@@ -197,11 +197,6 @@ export class AssetInliner {
       async (match) => this.rewriteMediaAttributes(match[0], effectiveBase)
     );
 
-    output = await replaceAsync(output, /srcset=(["'])([\s\S]*?)\1/gi, async (match) => {
-      const rewritten = await this.inlineSrcset(match[2], effectiveBase);
-      return `srcset=${match[1]}${htmlEscape(rewritten)}${match[1]}`;
-    });
-
     return output;
   }
 
@@ -259,12 +254,28 @@ export class AssetInliner {
         output = replaceMissingMediaAttribute(output, attr);
       }
     }
+    const srcset = getAttribute(output, "srcset");
+    if (srcset) {
+      const rewritten = await this.inlineSrcset(srcset, baseUrl);
+      output = setAttribute(output, "srcset", rewritten);
+    }
     return output;
   }
 
   async rewriteIframeTag(tag, baseUrl, depth) {
+    const srcdoc = getAttribute(tag, "srcdoc");
+    if (srcdoc) {
+      let rewritten = removeAttribute(tag, "src");
+      if (depth >= 2) {
+        return rewritten;
+      }
+      const inlined = await this.inlineHtml(srcdoc, baseUrl, { depth: depth + 1 });
+      rewritten = setAttribute(rewritten, "srcdoc", inlined);
+      return rewritten;
+    }
+
     const src = getAttribute(tag, "src");
-    if (!src || getAttribute(tag, "srcdoc")) {
+    if (!src) {
       return this.rewriteMediaAttributes(tag, baseUrl);
     }
     const absolute = resolveUrl(src, baseUrl);
@@ -425,24 +436,42 @@ function mimeFromUrl(rawUrl) {
 }
 
 function getAttribute(tag, attr) {
-  const match = tag.match(new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i"));
-  if (!match) {
+  const openingTag = getOpeningTag(tag);
+  if (!openingTag) {
     return null;
   }
-  return htmlDecode(match[2] ?? match[3] ?? match[4] ?? "");
+  const match = findAttribute(openingTag, attr);
+  return match ? htmlDecode(match.value) : null;
 }
 
 function setAttribute(tag, attr, value) {
   const escaped = htmlEscape(value);
-  const attrRegex = new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i");
-  if (attrRegex.test(tag)) {
-    return tag.replace(attrRegex, `${attr}="${escaped}"`);
-  }
-  return tag.replace(/^<[^>]*>/, (openingTag) => openingTag.replace(/\/?>$/, (end) => ` ${attr}="${escaped}"${end}`));
+  return replaceOpeningTag(tag, (openingTag) => {
+    const match = findAttribute(openingTag, attr);
+    if (match) {
+      return `${openingTag.slice(0, match.start)}${attr}="${escaped}"${openingTag.slice(match.end)}`;
+    }
+
+    const selfClosing = /\/\s*>$/.test(openingTag);
+    const closeIndex = openingTag.lastIndexOf(">");
+    const beforeClose = openingTag.slice(0, closeIndex).replace(/\s*\/\s*$/, "");
+    return `${beforeClose} ${attr}="${escaped}"${selfClosing ? " /" : ""}>`;
+  });
 }
 
 function removeAttribute(tag, attr) {
-  return tag.replace(new RegExp(`\\s+${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i"), "");
+  return replaceOpeningTag(tag, (openingTag) => {
+    const match = findAttribute(openingTag, attr);
+    if (!match) {
+      return openingTag;
+    }
+
+    let start = match.start;
+    while (start > 0 && /\s/.test(openingTag[start - 1])) {
+      start -= 1;
+    }
+    return `${openingTag.slice(0, start)}${openingTag.slice(match.end)}`;
+  });
 }
 
 function replaceMissingMediaAttribute(tag, attr) {
@@ -474,3 +503,95 @@ function cleanCssUrl(value) {
   }
   return decoded;
 }
+
+function getOpeningTag(markup) {
+  const end = openingTagEndIndex(markup);
+  return end >= 0 ? markup.slice(0, end + 1) : null;
+}
+
+function replaceOpeningTag(markup, replacer) {
+  const end = openingTagEndIndex(markup);
+  if (end < 0) {
+    return markup;
+  }
+  return `${replacer(markup.slice(0, end + 1))}${markup.slice(end + 1)}`;
+}
+
+function openingTagEndIndex(markup) {
+  let quote = "";
+  for (let i = 0; i < markup.length; i += 1) {
+    const ch = markup[i];
+    if (quote) {
+      if (ch === quote) {
+        quote = "";
+      }
+      continue;
+    }
+    if (ch === '"' || ch === "'") {
+      quote = ch;
+      continue;
+    }
+    if (ch === ">") {
+      return i;
+    }
+  }
+  return -1;
+}
+
+function findAttribute(openingTag, attr) {
+  const attrLower = attr.toLowerCase();
+  const nameMatch = openingTag.match(/^<[^\s/>]+/);
+  let index = nameMatch ? nameMatch[0].length : 1;
+
+  while (index < openingTag.length) {
+    while (index < openingTag.length && /\s/.test(openingTag[index])) {
+      index += 1;
+    }
+    if (index >= openingTag.length || openingTag[index] === ">" || openingTag[index] === "/") {
+      return null;
+    }
+
+    const start = index;
+    while (index < openingTag.length && !/[\s=/>]/.test(openingTag[index])) {
+      index += 1;
+    }
+    const name = openingTag.slice(start, index);
+
+    while (index < openingTag.length && /\s/.test(openingTag[index])) {
+      index += 1;
+    }
+
+    let value = "";
+    if (openingTag[index] === "=") {
+      index += 1;
+      while (index < openingTag.length && /\s/.test(openingTag[index])) {
+        index += 1;
+      }
+
+      const quote = openingTag[index];
+      if (quote === '"' || quote === "'") {
+        index += 1;
+        const valueStart = index;
+        while (index < openingTag.length && openingTag[index] !== quote) {
+          index += 1;
+        }
+        value = openingTag.slice(valueStart, index);
+        if (openingTag[index] === quote) {
+          index += 1;
+        }
+      } else {
+        const valueStart = index;
+        while (index < openingTag.length && !/[\s>]/.test(openingTag[index])) {
+          index += 1;
+        }
+        value = openingTag.slice(valueStart, index);
+      }
+    }
+
+    if (name.toLowerCase() === attrLower) {
+      return { start, end: index, value };
+    }
+  }
+
+  return null;
+}