2026-05-15 09:25:19 -07:00
parent 2e350ce3dc
commit bc6d9893a1
4 changed files with 544 additions and 68 deletions

AGENTS.md (new file, 205 lines)

@@ -0,0 +1,205 @@
# Agent notes for local-page-archiver
## Project overview
This tool renders web pages in Chromium (via Playwright) and saves them as fully self-contained HTML files. All external assets (images, fonts, stylesheets) are inlined as data URIs so the resulting file works offline.
The pipeline is:
```
URL ──► Playwright render ──► inject privacy filters ──► inline assets ──► write HTML
```
## Source layout
- `src/cli.mjs` — CLI entrypoint. Supports `archive` and `help`. Accepts `--archive-path`, `--id`, and `--headful` flags.
- `src/archiver.mjs` — Core archiving logic. Loads privacy filters, steers the browser, injects adblockers/userscripts, and calls the inliner.
- `src/asset-inliner.mjs` — Fetches and inlines external resources (images, CSS, iframes). Also strips `<script>` and `<noscript>` tags for a static archive.
- `privacy-filters/` — Third-party filter lists and userscripts used to strip paywalls, trackers, and ad banners before the snapshot is taken.
## Privacy filters (`privacy-filters/`)
### `bpc-paywall-filter.txt`
An AdBlock Plus / uBlock Origin filter list. It contains three kinds of rules:
1. **Network rules** (`||tracker.com^`, `/regex/`) — block specific third-party paywall / tracking scripts.
2. **Exception rules** (`@@||example.com^`) — whitelist requests that a global block rule would otherwise hit.
3. **Cosmetic rules** (`example.com##.paywall`) — inject CSS to hide DOM elements (e.g. subscription banners, blurred overlays).
At module load time `archiver.mjs` parses this file into three arrays (`blockRules`, `allowRules`, `cosmeticRules`). Network rules are enforced at the Playwright level with `page.route(...)`. Cosmetic rules are injected as a `<style>` tag after the page reaches `domcontentloaded`.
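A minimal sketch of the three-way split described above, assuming one rule per line and `!` comment lines. The helper name `classifyFilterLines` and the shape of the returned objects are hypothetical; the real parser in `archiver.mjs` (`parseFilterRules`) also handles rule options and more syntax than this.

```js
// Hypothetical helper: splits filter-list text into the three rule kinds.
// Simplification: scriptlets (##+js) and cosmetic exceptions (#@#) are not
// treated specially here.
function classifyFilterLines(text) {
  const blockRules = [];
  const allowRules = [];
  const cosmeticRules = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    // Skip blanks, comments (!...) and list headers such as [Adblock Plus 2.0].
    if (!line || line.startsWith("!") || line.startsWith("[")) continue;
    const cosmeticAt = line.indexOf("##");
    if (cosmeticAt >= 0) {
      // "example.com##.paywall": domains before "##", selector after it.
      cosmeticRules.push({
        domains: line.slice(0, cosmeticAt).split(",").filter(Boolean),
        selector: line.slice(cosmeticAt + 2)
      });
    } else if (line.startsWith("@@")) {
      // Exception rule: store the pattern without the "@@" prefix.
      allowRules.push(line.slice(2));
    } else {
      blockRules.push(line);
    }
  }
  return { blockRules, allowRules, cosmeticRules };
}
```

Keeping the three kinds in separate arrays lets the request handler check exceptions independently of block rules, while the cosmetic rules are only needed once, when the `<style>` tag is built.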
#### Cosmetic rule caveats
Some advanced cosmetic syntax has no direct CSS equivalent and is either approximated or silently discarded:
- `:remove()`: we can't actually remove DOM nodes from the CSS layer, so the base selector is hidden with `display: none !important` instead.
- `:style(...)`: converted to real CSS `display: none`.
- `:xpath(...)`, `:upward(...)`, `:matches-css(...)`: discarded during parsing.
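A rough sketch of how a cosmetic selector might be turned into a CSS rule. The `:remove()` branch mirrors the one added to `cosmeticSelectorToCss` in this commit; the function name `toCssRule` is hypothetical, and `:style(...)` handling is deliberately left out of this sketch.

```js
// Hypothetical helper: returns a CSS rule string, or null when the selector
// uses procedural syntax that plain CSS cannot express.
function toCssRule(selector) {
  // Procedural operators with no CSS equivalent: drop the rule entirely.
  if (/:(?:xpath|upward|matches-css)\(/.test(selector)) return null;
  if (selector.endsWith(":remove()")) {
    // Approximate "remove the node" by hiding whatever the base selector matches.
    const base = selector.slice(0, -":remove()".length);
    return base ? `${base} { display: none !important; }` : null;
  }
  // Plain selector: hide the matched elements.
  return `${selector} { display: none !important; }`;
}
```

Rules that come back as `null` are skipped, so a discarded filter costs nothing at render time; the trade-off is that nothing reports which filters were dropped.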
### `userscript/` directory
Contains Greasemonkey-style userscripts (`bpc.*.user.js`) plus a shared library `bpc_func.js`. They do the heavy lifting: decrypting paywalls, reconstructing article text from JSON data embedded in the page, removing blur overlays, and so on.
#### Selective injection (important)
Each userscript declares `@match` and `@exclude` metadata. **Only matching scripts are injected.** For example, on `bloomberg.com` only `bpc.en.user.js` is injected. Injecting all scripts into every page caused literal JavaScript source expressions to leak into DOM attributes, which the asset inliner then tried to fetch as URLs, producing garbage `HTTP 403` warnings.
The matching logic is a simple glob parser for userscript `@match` patterns:
- `*://*.com/*` matches any `.com` domain
- `*://example.com/path/*` matches that path prefix
- `@exclude` patterns take precedence and skip the script
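The matching rules above could be sketched as follows. This is an illustration only: the names `matchPatternToRegExp` and `shouldInject` are hypothetical, and the `{ matches, excludes }` shape is assumed from the `userScriptData` comment in `archiver.mjs`. Treating `*` as "any run of characters" is looser than a strict match-pattern implementation, which handles scheme, host, and path separately.

```js
// Hypothetical helper: converts a userscript @match glob into a RegExp.
function matchPatternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, ".*"); // "*" matches any run of characters
  return new RegExp(`^${escaped}$`, "i");
}

// Decides whether a script should be injected for a given page URL.
function shouldInject(url, script) {
  const hit = (patterns) => patterns.some((p) => matchPatternToRegExp(p).test(url));
  // @exclude takes precedence over @match.
  if (hit(script.excludes)) return false;
  return hit(script.matches);
}
```

Checking `excludes` first keeps the precedence rule in one place, so a script with a broad `@match` can still be kept off specific paths.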
#### `GM.xmlHttpRequest` mock
The userscripts rely on `GM.xmlHttpRequest` to fetch article text from archive mirrors or API endpoints. In a Playwright context this doesn't exist, so we inject a tiny mock that wraps the browser's native `fetch()` and presents the same callback interface (`onload`, `onerror`).
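For illustration, a shim of this kind might look like the sketch below. It is not the project's actual `GM_MOCK`: the response fields shown (`status`, `statusText`, `responseText`, `finalUrl`) are an assumed subset of what the userscripts read, and returning the promise is a convenience of this sketch rather than part of the Greasemonkey API.

```js
// Sketch of a GM.xmlHttpRequest shim built on fetch().
globalThis.GM = {
  xmlHttpRequest({ url, method = "GET", headers, data, onload, onerror }) {
    return fetch(url, { method, headers, body: data }).then(
      async (res) => {
        // Read the body once and hand it over in the callback style userscripts expect.
        const responseText = await res.text();
        if (onload) {
          onload({ status: res.status, statusText: res.statusText, responseText, finalUrl: res.url });
        }
      },
      (error) => {
        // Only failures of the request itself (e.g. network errors) reach onerror.
        if (onerror) onerror({ error });
      }
    );
  }
};
```

Because `fetch()` runs in the page context, requests made this way are subject to the page's CORS rules unless the browser was launched with web security disabled.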
#### Timing
Userscripts are injected **after** `domcontentloaded` but **before** `networkidle`. We then wait an extra 2 s (`page.waitForTimeout`) so any `setTimeout(..., 1000)` callbacks inside the scripts have time to fire before we snapshot the DOM.
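The ordering described above can be sketched with standard Playwright page methods. The helper name `renderWithUserscripts`, the `scripts` parameter, and the 15 s `networkidle` timeout are assumptions made for this illustration, not values taken from `archiver.mjs`.

```js
// Hypothetical helper; `page` is a Playwright Page and `scripts` is an array
// of userscript source strings that already passed the @match check.
async function renderWithUserscripts(page, url, scripts) {
  // 1. Stop waiting as soon as the DOM is parsed.
  await page.goto(url, { waitUntil: "domcontentloaded" });
  // 2. Inject userscripts while the page is still loading subresources.
  for (const content of scripts) {
    await page.addScriptTag({ content });
  }
  // 3. Let the network settle; pages that poll may never go idle, so a
  //    timeout here is ignored rather than treated as a failure.
  await page.waitForLoadState("networkidle", { timeout: 15000 }).catch(() => {});
  // 4. Grace period for delayed callbacks inside the userscripts.
  await page.waitForTimeout(2000);
  // 5. Snapshot the resulting DOM.
  return page.content();
}
```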
## Stealth / anti-detection
### The `playwright-extra` stealth packages are broken
The common recommendation is:
```bash
npm install playwright-extra playwright-extra-plugin-stealth
```
**This does not work.** All three packages on npm (`playwright-extra-plugin-stealth`, `playwright-extra-stealth`, `playwright-stealth`) are **placeholder packages** (version `0.0.1`) that literally throw on `require()`:
```
Error: Wrong package, please see this:
https://github.com/berstend/puppeteer-extra/issues/454
```
No functional Playwright stealth plugin exists under those package names. We therefore removed them from `package.json`.
### Manual stealth evasions
Instead we apply the same core evasions manually via `context.addInitScript()` and browser launch flags.
**Launch flags:**
- `--disable-blink-features=AutomationControlled`
- `--disable-infobars`
- `--disable-web-security`
- `--no-sandbox`, `--disable-setuid-sandbox`
- `--disable-dev-shm-usage`
- Removed `--enable-automation` via `ignoreDefaultArgs`
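As a sketch, the flags above map onto Playwright launch options roughly like this. The constant name `stealthLaunchOptions` is hypothetical; `args` and `ignoreDefaultArgs` are standard `chromium.launch()` options.

```js
// Sketch: launch options carrying the flags listed above.
const stealthLaunchOptions = {
  args: [
    "--disable-blink-features=AutomationControlled",
    "--disable-infobars",
    "--disable-web-security",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage"
  ],
  // Drops the --enable-automation switch that Playwright adds by default.
  ignoreDefaultArgs: ["--enable-automation"]
};
// Usage, assuming `chromium` is imported from the `playwright` package:
//   const browser = await chromium.launch({ headless, ...stealthLaunchOptions });
```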
**Init script (injected into every page before any scripts run):**
```js
Object.defineProperty(navigator, 'webdriver',
{ get: () => undefined, configurable: true, enumerable: true });
window.chrome = window.chrome || { runtime: {} };
// window.navigator.permissions.query is also patched for notifications (body omitted here).
```
#### CRITICAL: Avoid `delete navigator.webdriver` + iframe trick
An earlier version used a more elaborate stealth snippet that did `delete navigator.webdriver` and then created an `<iframe>` to steal the real navigator descriptor. **This crashed the Chromium renderer process on tab creation** with:
```
Protocol error (Page.addScriptToEvaluateOnNewDocument): Target crashed
```
The current init script is minimal and safe — it only overrides the getter via `Object.defineProperty` and avoids DOM mutation during page init.
## Browser context & headful mode
`renderPage()` auto-detects whether a display is available (`$DISPLAY` / `$WAYLAND_DISPLAY`). If neither is set it defaults to headless. The caller can override via `options.headless`.
- **Viewport:** 1366×768 (standard laptop resolution, not the default Playwright 1280×720)
- **Locale:** `en-US`
- **Timezone:** `America/New_York`
- **User-Agent:** macOS Chrome 130 (matches the Playwright image's Chromium version)
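A minimal sketch of the display detection and context settings described in this section. The names `resolveHeadless` and `contextOptions` are hypothetical; `viewport`, `locale`, and `timezoneId` are standard `browser.newContext()` options. The user-agent string is not reproduced here.

```js
// Hypothetical helper: an explicit options.headless wins; otherwise run
// headless only when no X11 or Wayland display is advertised.
function resolveHeadless(options = {}, env = process.env) {
  if (typeof options.headless === "boolean") return options.headless;
  return !(env.DISPLAY || env.WAYLAND_DISPLAY);
}

// Context settings from the list above, in the shape browser.newContext() accepts.
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: "en-US",
  timezoneId: "America/New_York"
  // userAgent: the macOS Chrome string used by the project would go here.
};
```

Passing `env` as a parameter keeps the detection easy to exercise without touching the real process environment.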
## Docker / Podman support
### Dockerfile
- Base: `mcr.microsoft.com/playwright:v1.60.0` (must stay in sync with the `playwright` npm version)
- Installs Node 22 (the base image may ship an older Node)
- Runs `npx playwright install chromium` so the browser binary is baked into the image
### `podman-run.sh`
Helper for local runs. Two modes:
1. **`./podman-run.sh archive <URL>`** — headless, mounts `./archives`
2. **`./podman-run.sh headful-archive <URL>`** — headful with internal VNC
**Headful mode details:**
The container's `ENTRYPOINT` is `node src/cli.mjs`. To run a shell command inside the container (setting up Xvfb + x11vnc) we must override the entrypoint:
```bash
podman run --rm --entrypoint sh <image> -c "...setup Xvfb... && node src/cli.mjs archive <URL>"
```
Port `5900` inside the container maps to `5901` on the host to avoid conflicts with macOS's built-in VNC.
### `docker-compose.yml`
Includes a `headful` profile that can be run with:
```bash
docker compose --profile headful up archiver-headful
```
## Known limitations
### Bloomberg bot wall
Bloomberg detects our requests as automated. Both headless and headful mode return **"Are you a robot?"** from this IP. We verified the same page text is returned by `curl` with identical headers, confirming the block is network-level (IP / TLS fingerprint / rate-limit reputation), not browser-fingerprint-level. To archive Bloomberg you currently need a residential proxy or to use an archive mirror service.
### Unsupported adblock syntax
- Advanced procedural cosmetic filters (`:upward()`, `:xpath()`) are silently ignored; `:remove()` is only approximated by hiding the matched elements.
- Scriptlet injection (`##+js(...)`) is not supported — only cosmetic CSS injection works.
- Preprocessor directives (`!#if`) are skipped.
## Adding a new site to privacy filters
If you add a new filter rule or userscript:
1. **Filter rules:** Edit `privacy-filters/bpc-paywall-filter.txt`. `archiver.mjs` reloads the file on every process start, so no code changes are needed.
2. **Userscripts:** Drop a new `.user.js` into `privacy-filters/userscript/` and add its filename to the `userScriptFiles` array inside `loadPrivacyFilters()` in `archiver.mjs`.
3. Test with `node src/cli.mjs archive <URL>` and inspect the generated HTML.
## Rebuilding the Docker image
The Playwright npm version and the base image tag must match:
```bash
# Check what npm installed
node -e "console.log(require('playwright/package.json').version)"
# Update the FROM line in Dockerfile if needed
# Then rebuild
podman build -t local-page-archiver .
```
## Development quick reference
```bash
# Install deps
npm install
# Install browser binaries
npm run install-browsers
# Archive a page (headless)
node src/cli.mjs archive https://example.com
# Archive a page (headful on macOS)
node src/cli.mjs archive https://example.com --headful
# Archive inside container (headless)
./podman-run.sh archive https://example.com
# Archive inside container (headful + VNC)
./podman-run.sh headful-archive https://example.com
# Then open vnc://localhost:5901
```


@@ -9,6 +9,7 @@
}, },
"scripts": { "scripts": {
"archive": "node src/cli.mjs archive", "archive": "node src/cli.mjs archive",
"test": "node --test test/*.test.mjs",
"install-browsers": "playwright install chromium" "install-browsers": "playwright install chromium"
}, },
"dependencies": { "dependencies": {


@@ -32,6 +32,7 @@ const PRIVACY_FILTERS_DIR = path.join(__dirname, "..", "privacy-filters");
let privacyFiltersAvailable = false; let privacyFiltersAvailable = false;
let filterRules = { blockRules: [], allowRules: [], cosmeticRules: [] }; let filterRules = { blockRules: [], allowRules: [], cosmeticRules: [] };
let userScriptData = []; // { file, content, matches, excludes } let userScriptData = []; // { file, content, matches, excludes }
let userScriptRequireContent = "";
async function loadPrivacyFilters() { async function loadPrivacyFilters() {
try { try {
@@ -40,6 +41,7 @@ async function loadPrivacyFilters() {
filterRules = parseFilterRules(filterContent); filterRules = parseFilterRules(filterContent);
const userscriptDir = path.join(PRIVACY_FILTERS_DIR, "userscript"); const userscriptDir = path.join(PRIVACY_FILTERS_DIR, "userscript");
userScriptRequireContent = await fs.readFile(path.join(userscriptDir, "bpc_func.js"), "utf8");
const userScriptFiles = [ const userScriptFiles = [
"bpc.en.user.js", "bpc.en.user.js",
"bpc.de.user.js", "bpc.de.user.js",
@@ -126,7 +128,7 @@ function parseNetworkRule(line) {
const lastDollar = line.lastIndexOf("$"); const lastDollar = line.lastIndexOf("$");
if (lastDollar > 0) { if (lastDollar > 0) {
const optsStr = line.slice(lastDollar + 1); const optsStr = line.slice(lastDollar + 1);
if (/^[a-z,=~\-|0-9]+$/i.test(optsStr)) { if (/^[a-z,=~_.\-|0-9]+$/i.test(optsStr)) {
options = optsStr.split(","); options = optsStr.split(",");
pattern = line.slice(0, lastDollar); pattern = line.slice(0, lastDollar);
} }
@@ -134,8 +136,20 @@ function parseNetworkRule(line) {
if (!pattern) return null; if (!pattern) return null;
const type = options.find((o) => const types = options.filter((o) =>
["script", "stylesheet", "image", "media", "xmlhttprequest", "other", "inline-script"].includes(o) [
"document",
"font",
"image",
"inline-script",
"media",
"object",
"other",
"script",
"stylesheet",
"subdocument",
"xmlhttprequest"
].includes(o)
); );
const isThirdParty = options.includes("third-party"); const isThirdParty = options.includes("third-party");
const isFirstParty = options.includes("~third-party"); const isFirstParty = options.includes("~third-party");
@@ -162,7 +176,7 @@ function parseNetworkRule(line) {
kind: "domain", kind: "domain",
domain, domain,
path, path,
type, types,
isThirdParty, isThirdParty,
isFirstParty, isFirstParty,
includeDomains, includeDomains,
@@ -171,14 +185,12 @@ function parseNetworkRule(line) {
}; };
} }
if (pattern.startsWith("/")) { if (pattern.startsWith("/") && pattern.endsWith("/") && pattern.length > 1) {
const lastSlash = pattern.lastIndexOf("/"); const regex = pattern.slice(1, -1);
if (lastSlash > 0) {
const regex = pattern.slice(1, lastSlash);
return { return {
kind: "regex", kind: "regex",
regex, regex,
type, types,
isThirdParty, isThirdParty,
isFirstParty, isFirstParty,
includeDomains, includeDomains,
@@ -186,12 +198,25 @@ function parseNetworkRule(line) {
important important
}; };
} }
}
return null; return {
kind: "pattern",
regex: adblockPatternToRegex(pattern),
types,
isThirdParty,
isFirstParty,
includeDomains,
excludeDomains,
important
};
} }
function cosmeticSelectorToCss(selector) { function cosmeticSelectorToCss(selector) {
if (selector.endsWith(":remove()")) {
const baseSelector = selector.slice(0, -":remove()".length);
return baseSelector ? `${baseSelector} { display: none !important; }` : null;
}
const styleMatch = selector.match(/:style\((.+)\)$/); const styleMatch = selector.match(/:style\((.+)\)$/);
if (styleMatch) { if (styleMatch) {
const baseSelector = selector.slice(0, selector.lastIndexOf(":style(")); const baseSelector = selector.slice(0, selector.lastIndexOf(":style("));
@@ -246,17 +271,8 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
if (blocked) return false; if (blocked) return false;
} }
if (rule.type) { if (rule.types.length > 0) {
const typeMap = { if (!rule.types.some((type) => resourceTypeMatches(type, resourceType))) {
script: "script",
stylesheet: "stylesheet",
image: "image",
media: "media",
xmlhttprequest: "xhr",
other: "other",
"inline-script": "script"
};
if (typeMap[rule.type] && resourceType !== typeMap[rule.type]) {
return false; return false;
} }
} }
@@ -271,18 +287,11 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
} }
if (rule.kind === "domain") { if (rule.kind === "domain") {
const domainRe = new RegExp( if (!domainPatternMatches(hostname, rule.domain)) return false;
"^" + rule.domain.replace(/\./g, "\\.").replace(/\*/g, "[^.]*") + "$",
"i"
);
if (!domainRe.test(hostname)) return false;
if (rule.path) { if (rule.path) {
const pathRe = new RegExp( const pathRe = new RegExp("^" + adblockPatternToRegex(rule.path), "i");
"^" + rule.path.replace(/\./g, "\\.").replace(/\*/g, ".*").replace(/\?/g, "\\?").replace(/\^/g, ""), if (!pathRe.test(urlObj.pathname + urlObj.search)) return false;
"i"
);
if (!pathRe.test(urlObj.pathname)) return false;
} }
return true; return true;
} }
@@ -296,8 +305,83 @@ function matchesNetworkRule(url, urlObj, hostname, resourceType, sourceHostname,
} }
} }
if (rule.kind === "pattern") {
try {
const re = new RegExp(rule.regex, "i");
return re.test(url);
} catch {
return false; return false;
} }
}
return false;
}
function resourceTypeMatches(filterType, resourceType) {
const typeMap = {
document: ["document"],
font: ["font"],
image: ["image"],
"inline-script": ["script"],
media: ["media"],
object: ["object"],
other: ["other"],
script: ["script"],
stylesheet: ["stylesheet"],
subdocument: ["document"],
xmlhttprequest: ["fetch", "xhr"]
};
const mapped = typeMap[filterType];
return mapped ? mapped.includes(resourceType) : false;
}
function domainPatternMatches(hostname, pattern) {
const normalized = pattern.replace(/\^$/, "").toLowerCase();
if (!normalized) return false;
if (!normalized.includes("*")) {
return hostname === normalized || hostname.endsWith("." + normalized);
}
const re = new RegExp(
"^" +
normalized
.split("*")
.map((part) => part.replace(/[|\\{}()[\]^$+?.]/g, "\\$&"))
.join("[^.]*") +
"$",
"i"
);
return re.test(hostname);
}
function adblockPatternToRegex(pattern) {
let source = "";
let remaining = pattern;
let anchoredStart = false;
let anchoredEnd = false;
if (remaining.startsWith("|")) {
anchoredStart = true;
remaining = remaining.slice(1);
}
if (remaining.endsWith("|")) {
anchoredEnd = true;
remaining = remaining.slice(0, -1);
}
for (const ch of remaining) {
if (ch === "*") {
source += ".*";
} else if (ch === "^") {
source += "(?:[^A-Za-z0-9_.%-]|$)";
} else {
source += ch.replace(/[|\\{}()[\]^$+?.]/g, "\\$&");
}
}
return `${anchoredStart ? "^" : ""}${source}${anchoredEnd ? "$" : ""}`;
}
function shouldBlockRequest(url, resourceType, sourceHostname) { function shouldBlockRequest(url, resourceType, sourceHostname) {
if (url === sourceHostname || url.startsWith(sourceHostname + "/")) { if (url === sourceHostname || url.startsWith(sourceHostname + "/")) {
@@ -512,7 +596,7 @@ async function setupRequestBlocking(page, sourceHostname) {
await page.route("**/*", (route) => { await page.route("**/*", (route) => {
try { try {
const request = route.request(); const request = route.request();
if (request.isNavigationRequest()) { if (request.isNavigationRequest() && request.frame() === page.mainFrame()) {
route.continue(); route.continue();
return; return;
} }
@@ -584,6 +668,9 @@ async function injectPrivacyUserScripts(page, sourceUrl) {
// Inject GM API mock first. // Inject GM API mock first.
try { try {
await page.addScriptTag({ content: GM_MOCK }); await page.addScriptTag({ content: GM_MOCK });
if (userScriptRequireContent) {
await page.addScriptTag({ content: userScriptRequireContent });
}
} catch { } catch {
return; return;
} }
@@ -731,18 +818,22 @@ function addArchiveComment(html, sourceUrl) {
export function findExternalAssetRefs(html) { export function findExternalAssetRefs(html) {
const refs = new Set(); const refs = new Set();
const attrPattern = /\s(?:src|srcset|poster|data)\s*=\s*(["'])([\s\S]*?)\1/gi; const assetTagPattern = /<(?:img|source|audio|video|track|embed|object|input|iframe)\b[^>]*>/gi;
for (const match of html.matchAll(attrPattern)) { for (const match of html.matchAll(assetTagPattern)) {
if (isSelfContainedAssetRef(match[2])) { const tag = match[0];
for (const attr of ["src", "srcset", "poster", "data"]) {
const value = readAttribute(tag, attr);
if (!value || isSelfContainedAssetRef(value)) {
continue; continue;
} }
for (const part of match[2].split(",")) { for (const part of value.split(",")) {
const candidate = part.trim().split(/\s+/)[0]; const candidate = part.trim().split(/\s+/)[0];
if (candidate && !isSelfContainedAssetRef(candidate)) { if (candidate && !isSelfContainedAssetRef(candidate)) {
refs.add(candidate); refs.add(candidate);
} }
} }
} }
}
const linkPattern = /<link\b[^>]*>/gi; const linkPattern = /<link\b[^>]*>/gi;
for (const match of html.matchAll(linkPattern)) { for (const match of html.matchAll(linkPattern)) {
@@ -779,8 +870,8 @@ function isSelfContainedAssetRef(value) {
} }
function readAttribute(tag, attr) { function readAttribute(tag, attr) {
const match = tag.match(new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i")); const match = findAttribute(tag, attr);
return match ? match[2] ?? match[3] ?? match[4] ?? "" : ""; return match ? match.value : "";
} }
function cleanCssUrl(value) { function cleanCssUrl(value) {
@@ -796,3 +887,61 @@ function cleanCssUrl(value) {
} }
return decoded; return decoded;
} }
function findAttribute(openingTag, attr) {
const attrLower = attr.toLowerCase();
const nameMatch = openingTag.match(/^<[^\s/>]+/);
let index = nameMatch ? nameMatch[0].length : 1;
while (index < openingTag.length) {
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
if (index >= openingTag.length || openingTag[index] === ">" || openingTag[index] === "/") {
return null;
}
const start = index;
while (index < openingTag.length && !/[\s=/>]/.test(openingTag[index])) {
index += 1;
}
const name = openingTag.slice(start, index);
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
let value = "";
if (openingTag[index] === "=") {
index += 1;
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
const quote = openingTag[index];
if (quote === '"' || quote === "'") {
index += 1;
const valueStart = index;
while (index < openingTag.length && openingTag[index] !== quote) {
index += 1;
}
value = openingTag.slice(valueStart, index);
if (openingTag[index] === quote) {
index += 1;
}
} else {
const valueStart = index;
while (index < openingTag.length && !/[\s>]/.test(openingTag[index])) {
index += 1;
}
value = openingTag.slice(valueStart, index);
}
}
if (name.toLowerCase() === attrLower) {
return { start, end: index, value };
}
}
return null;
}


@@ -197,11 +197,6 @@ export class AssetInliner {
async (match) => this.rewriteMediaAttributes(match[0], effectiveBase) async (match) => this.rewriteMediaAttributes(match[0], effectiveBase)
); );
output = await replaceAsync(output, /srcset=(["'])([\s\S]*?)\1/gi, async (match) => {
const rewritten = await this.inlineSrcset(match[2], effectiveBase);
return `srcset=${match[1]}${htmlEscape(rewritten)}${match[1]}`;
});
return output; return output;
} }
@@ -259,12 +254,28 @@ export class AssetInliner {
output = replaceMissingMediaAttribute(output, attr); output = replaceMissingMediaAttribute(output, attr);
} }
} }
const srcset = getAttribute(output, "srcset");
if (srcset) {
const rewritten = await this.inlineSrcset(srcset, baseUrl);
output = setAttribute(output, "srcset", rewritten);
}
return output; return output;
} }
async rewriteIframeTag(tag, baseUrl, depth) { async rewriteIframeTag(tag, baseUrl, depth) {
const srcdoc = getAttribute(tag, "srcdoc");
if (srcdoc) {
let rewritten = removeAttribute(tag, "src");
if (depth >= 2) {
return rewritten;
}
const inlined = await this.inlineHtml(srcdoc, baseUrl, { depth: depth + 1 });
rewritten = setAttribute(rewritten, "srcdoc", inlined);
return rewritten;
}
const src = getAttribute(tag, "src"); const src = getAttribute(tag, "src");
if (!src || getAttribute(tag, "srcdoc")) { if (!src) {
return this.rewriteMediaAttributes(tag, baseUrl); return this.rewriteMediaAttributes(tag, baseUrl);
} }
const absolute = resolveUrl(src, baseUrl); const absolute = resolveUrl(src, baseUrl);
@@ -425,24 +436,42 @@ function mimeFromUrl(rawUrl) {
} }
function getAttribute(tag, attr) { function getAttribute(tag, attr) {
const match = tag.match(new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i")); const openingTag = getOpeningTag(tag);
if (!match) { if (!openingTag) {
return null; return null;
} }
return htmlDecode(match[2] ?? match[3] ?? match[4] ?? ""); const match = findAttribute(openingTag, attr);
return match ? htmlDecode(match.value) : null;
} }
function setAttribute(tag, attr, value) { function setAttribute(tag, attr, value) {
const escaped = htmlEscape(value); const escaped = htmlEscape(value);
const attrRegex = new RegExp(`\\b${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i"); return replaceOpeningTag(tag, (openingTag) => {
if (attrRegex.test(tag)) { const match = findAttribute(openingTag, attr);
return tag.replace(attrRegex, `${attr}="${escaped}"`); if (match) {
return `${openingTag.slice(0, match.start)}${attr}="${escaped}"${openingTag.slice(match.end)}`;
} }
return tag.replace(/^<[^>]*>/, (openingTag) => openingTag.replace(/\/?>$/, (end) => ` ${attr}="${escaped}"${end}`));
const selfClosing = /\/\s*>$/.test(openingTag);
const closeIndex = openingTag.lastIndexOf(">");
const beforeClose = openingTag.slice(0, closeIndex).replace(/\s*\/\s*$/, "");
return `${beforeClose} ${attr}="${escaped}"${selfClosing ? " /" : ""}>`;
});
} }
function removeAttribute(tag, attr) { function removeAttribute(tag, attr) {
return tag.replace(new RegExp(`\\s+${attr}\\s*=\\s*("([^"]*)"|'([^']*)'|([^\\s>]+))`, "i"), ""); return replaceOpeningTag(tag, (openingTag) => {
const match = findAttribute(openingTag, attr);
if (!match) {
return openingTag;
}
let start = match.start;
while (start > 0 && /\s/.test(openingTag[start - 1])) {
start -= 1;
}
return `${openingTag.slice(0, start)}${openingTag.slice(match.end)}`;
});
} }
function replaceMissingMediaAttribute(tag, attr) { function replaceMissingMediaAttribute(tag, attr) {
@@ -474,3 +503,95 @@ function cleanCssUrl(value) {
} }
return decoded; return decoded;
} }
function getOpeningTag(markup) {
const end = openingTagEndIndex(markup);
return end >= 0 ? markup.slice(0, end + 1) : null;
}
function replaceOpeningTag(markup, replacer) {
const end = openingTagEndIndex(markup);
if (end < 0) {
return markup;
}
return `${replacer(markup.slice(0, end + 1))}${markup.slice(end + 1)}`;
}
function openingTagEndIndex(markup) {
let quote = "";
for (let i = 0; i < markup.length; i += 1) {
const ch = markup[i];
if (quote) {
if (ch === quote) {
quote = "";
}
continue;
}
if (ch === '"' || ch === "'") {
quote = ch;
continue;
}
if (ch === ">") {
return i;
}
}
return -1;
}
function findAttribute(openingTag, attr) {
const attrLower = attr.toLowerCase();
const nameMatch = openingTag.match(/^<[^\s/>]+/);
let index = nameMatch ? nameMatch[0].length : 1;
while (index < openingTag.length) {
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
if (index >= openingTag.length || openingTag[index] === ">" || openingTag[index] === "/") {
return null;
}
const start = index;
while (index < openingTag.length && !/[\s=/>]/.test(openingTag[index])) {
index += 1;
}
const name = openingTag.slice(start, index);
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
let value = "";
if (openingTag[index] === "=") {
index += 1;
while (index < openingTag.length && /\s/.test(openingTag[index])) {
index += 1;
}
const quote = openingTag[index];
if (quote === '"' || quote === "'") {
index += 1;
const valueStart = index;
while (index < openingTag.length && openingTag[index] !== quote) {
index += 1;
}
value = openingTag.slice(valueStart, index);
if (openingTag[index] === quote) {
index += 1;
}
} else {
const valueStart = index;
while (index < openingTag.length && !/[\s>]/.test(openingTag[index])) {
index += 1;
}
value = openingTag.slice(valueStart, index);
}
}
if (name.toLowerCase() === attrLower) {
return { start, end: index, value };
}
}
return null;
}