This project saves self-contained HTML archives for pages the operator is authorized to access. It sends a real browser user agent, renders web URLs with Playwright, strips ad/tracker-like elements, normalizes the captured DOM, and inlines page requisites as data: URLs.
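
As a rough illustration of the inlining step, the sketch below turns a resource's bytes into a base64 data: URL. The helper name toDataUrl and the default MIME type are assumptions made for this example; they are not taken from the project source, which may encode requisites differently.

```javascript
// Hypothetical helper (not from the project source): encode resource bytes
// as a base64 data: URL so the archive needs no external requests.
function toDataUrl(bytes, mimeType = "application/octet-stream") {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Example: a small stylesheet that could replace a <link href="..."> target.
const css = "body{margin:0}";
console.log(toDataUrl(Buffer.from(css, "utf8"), "text/css"));
```

Base64 grows the payload by roughly a third, which is one reason fully inlined archives can be much larger than the original page.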

It intentionally does not execute paywall-bypass rules. The bundled bypass-paywalls-clean-filters files are treated as reference material only; paywall selectors and scripts are not applied.

CLI

npm install
npm run install-browsers
node src/cli.mjs archive "https://example.com/article"

For an existing HTML file:

node src/cli.mjs archive ./page.html --static

For an archive.ph HTML export where you want the captured page without the archive shell:

node src/cli.mjs archive ./bloomberg-archive.html --static --strip-archive-shell

Local archive.ph HTML inputs with --strip-archive-shell use the static extractor by default because those files already contain the rendered page, so --static is optional in the command above. Add --render only when you explicitly want Chromium to load the local HTML first.

Computed-style freezing is off by default for live web pages because it can inflate modern article pages into very large HTML files. Add --freeze-styles only when stylesheet inlining is not enough to preserve layout.

Archives are written to ARCHIVE_PATH, or to a development directory under the system temp directory when ARCHIVE_PATH is not set.
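
The fallback described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name resolveArchiveDir and the directory name "local-page-archives-dev" are placeholders, since the actual development directory name is not stated here.

```javascript
const os = require("node:os");
const path = require("node:path");

// Hypothetical sketch: prefer ARCHIVE_PATH, otherwise fall back to a
// directory under the system temp directory.
function resolveArchiveDir(env = process.env) {
  if (env.ARCHIVE_PATH) {
    return path.resolve(env.ARCHIVE_PATH);
  }
  // Placeholder name; the real development directory may be named differently.
  return path.join(os.tmpdir(), "local-page-archives-dev");
}

console.log(resolveArchiveDir());
```

Because temp directories may be cleaned by the operating system, setting ARCHIVE_PATH explicitly is the safer choice for archives you intend to keep.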

API

ARCHIVE_PATH=/tmp/local-page-archives npm run serve

Archive a page:

curl -X POST http://127.0.0.1:8787/archive \
  -H 'content-type: application/json' \
  -d '{"url":"https://example.com/article"}'

The response includes the archived file path and a local viewUrl.
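
The same request can be made from Node.js. The sketch below mirrors the curl call above; it assumes Node 18 or later (for the global fetch), that the response body is JSON, and that failures return a non-2xx status. The function names are illustrative only.

```javascript
// Build fetch() arguments equivalent to the curl example.
function buildArchiveRequest(pageUrl, baseUrl = "http://127.0.0.1:8787") {
  return {
    endpoint: new URL("/archive", baseUrl).href,
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ url: pageUrl }),
    },
  };
}

// Send the request and return the parsed response body.
async function archivePage(pageUrl, baseUrl) {
  const { endpoint, init } = buildArchiveRequest(pageUrl, baseUrl);
  const response = await fetch(endpoint, init);
  if (!response.ok) {
    throw new Error(`Archive request failed: HTTP ${response.status}`);
  }
  // Assumed to be JSON carrying the archived file path and a viewUrl.
  return response.json();
}

// Usage (requires the server to be running):
// archivePage("https://example.com/article").then((result) => console.log(result.viewUrl));
```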

Set PORT to choose a port other than the default 8787.
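
One way a server might read this setting is sketched below. The fallback for missing or invalid values is an assumption made for the example; the actual server may handle a malformed PORT differently.

```javascript
// Hypothetical sketch: use PORT when it is a valid TCP port, else 8787.
function resolvePort(env = process.env) {
  const port = Number.parseInt(env.PORT ?? "", 10);
  const isValid = Number.isInteger(port) && port > 0 && port < 65536;
  return isValid ? port : 8787;
}

console.log(resolvePort({ PORT: "9000" }));
```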