Hardening the Test Pipeline

644 tests are useless if they break on every content commit. Hardening isn't adding more tests — it's making the existing ones survivable.

The Problem: Tests That Punish You for Writing Content

This website has a comprehensive test suite — 235 unit tests, 54 E2E tests, 279 visual regression screenshots, 121 accessibility audits, and performance benchmarks. All tied to 20 features through a typed specification system with 112 acceptance criteria at 100% coverage.

Sounds great. Until you add a blog post.

The visual regression tests read public/sitemap.xml at runtime. Every page in the sitemap gets screenshotted across 4 themes (dark, dark-hc, light, light-hc). That's 4 screenshots per page. When you add a new page, the sitemap grows, and the test suite tries to compare a screenshot against a baseline that doesn't exist.

New page = no baseline = test failure.

Most commits to this site are content additions. Blog posts, experience entries, project pages. Every one of them broke the visual regression suite. The "fix" was running --update-snapshots after every content commit, which defeats the purpose of visual regression testing.

The accessibility tests had the same problem — axe scans every page in the sitemap.

The test suite was punishing the most common workflow. That's not a quality system. That's a tax.

Smoke Mode: Test What Matters, Fast

The first fix: don't test everything on every push. Test a curated set of stable pages that represent the site's core behavior.

// test/visual/pages.spec.ts
import * as fs from 'node:fs';
import * as path from 'node:path';

// Stable pages that exercise the site's core rendering paths.
const SMOKE_PAGES = [
  '/',
  '/content/about.html',
  '/content/blog/binary-wrapper.html',
  '/content/skills.html',
];

// Extract the site-relative path from each <loc> entry:
// '/' for the home page, '/content/...' for everything else.
function readSitemap(): string[] {
  const sitemapPath = path.resolve('public/sitemap.xml');
  const sitemapXml = fs.readFileSync(sitemapPath, 'utf8');
  return [...sitemapXml.matchAll(/<loc>[^<]*?(\/(content\/[^<]+|))<\/loc>/g)]
    .map(m => m[1] || '/');
}

// Precedence: PAGE (single page) > SMOKE=1 (curated set) > full sitemap.
const pageFilter = process.env.PAGE;
const smoke = process.env.SMOKE === '1';
const pages = pageFilter ? [pageFilter] : (smoke ? SMOKE_PAGES : readSitemap());

SMOKE=1 switches from "all 57 pages from sitemap" to "4 curated pages." These 4 were chosen because:

Home (/) — tests the TOC sidebar, animations, and overall layout
About (/content/about.html) — tests a content page with photos, tables, and mermaid diagrams
Binary Wrapper (/content/blog/binary-wrapper.html) — tests a long blog post with code blocks, headings, and diagrams
Skills (/content/skills.html) — tests a compact page with different content patterns

These pages are stable. They don't change with new content. They exercise every rendering path the site has.

# Smoke test: unit tests + Playwright on 4 curated pages
npm run test:smoke

The same SMOKE=1 filtering is applied to the accessibility tests — axe scans only the curated pages, and the contrast matrix uses them instead of the full sitemap.
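
One way to share that selection between the visual and accessibility specs is a small helper that takes the environment and a sitemap reader as inputs. This is a sketch of how it could be factored, not the site's actual code: `selectPages`, `Env`, and the callback parameter are hypothetical names.

```typescript
// Hypothetical shared helper: the name and signature are illustrative only.
type Env = Record<string, string | undefined>;

const SMOKE_PAGES: string[] = [
  '/',
  '/content/about.html',
  '/content/blog/binary-wrapper.html',
  '/content/skills.html',
];

// Precedence mirrors the spec file: PAGE (one page) > SMOKE=1 (curated set) > full sitemap.
function selectPages(env: Env, readSitemapPages: () => string[]): string[] {
  if (env.PAGE) return [env.PAGE];
  if (env.SMOKE === '1') return SMOKE_PAGES;
  return readSitemapPages();
}

// The sitemap callback is only invoked when neither variable is set.
const fromSmoke = selectPages({ SMOKE: '1' }, () => ['/content/new-post.html']);
const fromPage = selectPages({ PAGE: '/content/about.html', SMOKE: '1' }, () => []);
console.log(fromSmoke.length, fromPage);
```

Passing the sitemap reader as a callback keeps the helper free of file I/O, so a new page in the sitemap cannot affect a smoke run.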

Per-Page Filtering: Test One Page

Sometimes you edit a specific page and want to verify it looks right. The PAGE environment variable lets you test a single page across all themes:

# Visual regression for one specific page
PAGE=/content/blog/typed-specs/01-why.html npx playwright test test/visual/

# Accessibility scan for one page
PAGE=/content/about.html npx playwright test test/a11y/

This runs 4 screenshots (one per theme) for that single page instead of the 228 a full run produces (57 pages × 4 themes). Useful during development when you're iterating on a specific post.
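
The screenshot count is just pages times themes. A minimal sketch of that arithmetic, assuming a hypothetical `buildMatrix` helper and placeholder page paths (only the four theme names come from this post):

```typescript
const THEMES = ['dark', 'dark-hc', 'light', 'light-hc'] as const;
type Theme = (typeof THEMES)[number];

interface Shot { page: string; theme: Theme; }

// Hypothetical helper: one screenshot per (page, theme) pair.
function buildMatrix(pages: string[]): Shot[] {
  const shots: Shot[] = [];
  for (const page of pages) {
    for (const theme of THEMES) {
      shots.push({ page, theme });
    }
  }
  return shots;
}

// Placeholder paths standing in for the 57 sitemap entries.
const allPages: string[] = [];
for (let i = 0; i < 57; i++) allPages.push(`/content/page-${i}.html`);

const fullRun = buildMatrix(allPages);
const singlePage = buildMatrix(['/content/blog/typed-specs/01-why.html']);
console.log(fullRun.length, singlePage.length); // 228 4
```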

Auto-Baseline Creation: New Pages Don't Fail

The core insight: a new page with no baseline isn't a regression. It's a new thing. It should create its baseline, not fail.

test(`${pageSlug} [${theme.name}]`, async ({ page }, testInfo) => {
  await page.goto(pagePath);
  await page.waitForLoadState('networkidle');
  await applyTheme(page, theme);
  await expandForFullPage(page);

  // Auto-create baseline for new pages instead of failing.
  // fs and path are Node built-ins and must be imported in the spec file.
  const baselinePath = testInfo.snapshotPath(screenshotName);
  if (!fs.existsSync(baselinePath)) {
    const screenshot = await page.screenshot({ fullPage: true });
    fs.mkdirSync(path.dirname(baselinePath), { recursive: true });
    fs.writeFileSync(baselinePath, screenshot);
    // test.skip() with a true condition ends the test here and reports it as skipped.
    test.skip(true, `Created baseline for new page: ${screenshotName}`);
    return;
  }
  await expect(page).toHaveScreenshot(screenshotName, { fullPage: true });
});

When a page has no baseline:

Take the screenshot
Save it as the new baseline
Mark the test as skipped (not failed)

When a page has a baseline:

Take the screenshot
Compare against the baseline (1% pixel threshold, 0.2 color threshold)
Fail if the diff exceeds thresholds
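
Those two thresholds correspond to options that Playwright's `toHaveScreenshot` accepts: `maxDiffPixelRatio` (the share of pixels allowed to differ) and `threshold` (per-pixel color tolerance between 0 and 1). Where this site sets them isn't shown in the post. The fragment below assumes they live in `playwright.config.ts`.

```typescript
// playwright.config.ts (assumed location; the post does not show where the thresholds are set)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      maxDiffPixelRatio: 0.01, // fail when more than 1% of pixels differ
      threshold: 0.2,          // per-pixel color tolerance (0 = strict, 1 = lax)
    },
  },
});
```

Setting them globally keeps individual `toHaveScreenshot` calls short, and a single call can still override either value through its own options.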

The test report shows skipped tests clearly — you can see which pages are new and review their baselines at your leisure. But they don't block the build.
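
The create-or-compare decision can be separated from Playwright and exercised on its own. A minimal sketch, assuming a hypothetical `ensureBaseline` helper that receives the screenshot bytes (in the real test they come from `page.screenshot`):

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

type BaselineResult = 'created' | 'exists';

// Hypothetical helper: writes the baseline only when none exists yet.
function ensureBaseline(baselinePath: string, screenshot: Buffer): BaselineResult {
  if (fs.existsSync(baselinePath)) return 'exists';
  fs.mkdirSync(path.dirname(baselinePath), { recursive: true });
  fs.writeFileSync(baselinePath, screenshot);
  return 'created';
}

// Usage against a throwaway directory, with placeholder bytes standing in for a PNG.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'baseline-'));
const target = path.join(tmpDir, 'snapshots', 'about-dark.png');
const firstRun = ensureBaseline(target, Buffer.from('fake-png'));
const secondRun = ensureBaseline(target, Buffer.from('other-bytes'));
console.log(firstRun, secondRun); // created exists
```

The second call leaves the file untouched, which is the property that matters: an existing baseline is never silently overwritten, so real regressions still surface.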

Compliance Scanner V2: History, Reverse Mapping, and Workflow

The compliance scanner verifies that every feature's acceptance criteria are linked to tests. V1 printed a report and optionally failed the build. V2 adds three capabilities.
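
As a rough illustration of what "linked to tests" means, a feature's coverage is the share of its acceptance criteria that at least one test references. The data shapes and the `coverage` function below are assumptions made for this sketch. The real scanner's types aren't shown in this post.

```typescript
// Hypothetical shapes; the real scanner's types are not shown in this post.
interface Feature { id: string; acs: string[]; }
type Links = Record<string, string[]>; // "FEATURE.ac" -> test files that reference it

// Percentage of a feature's acceptance criteria referenced by at least one test.
function coverage(feature: Feature, links: Links): number {
  if (feature.acs.length === 0) return 100;
  let covered = 0;
  for (const ac of feature.acs) {
    const tests = links[`${feature.id}.${ac}`];
    if (tests !== undefined && tests.length > 0) covered++;
  }
  return Math.round((covered / feature.acs.length) * 100);
}

const nav: Feature = { id: 'NAV', acs: ['tocClickLoadsPage', 'backButtonRestores'] };
const links: Links = { 'NAV.tocClickLoadsPage': ['test/e2e/navigation.spec.ts'] };
const navCoverage = coverage(nav, links);
console.log(navCoverage); // 50
```

A strict mode in this sketch would simply exit non-zero when any feature reports less than 100.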

Historical Tracking (`--save`)

npx tsx scripts/compliance-report.ts --save

Writes a timestamped JSON report to docs/compliance/:

docs/compliance/
├── 2026-03-24T23-45-12.json
├── 2026-03-25T08-30-00.json
└── 2026-03-25T14-15-33.json

Each file contains the full coverage matrix — features, ACs, which tests cover them, percentages. Over time, this directory becomes a trend line: did coverage improve? When were new features added? When were gaps closed?
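
The file names in that listing sort chronologically as plain strings, which is what makes the directory usable as a trend line. A sketch under assumptions: `reportFileName`, `Snapshot`, and `trend` are hypothetical, and the use of UTC is a guess (the post doesn't say which time zone the scanner uses).

```typescript
// Hypothetical helpers; only the file-name pattern is taken from the listing above.
function reportFileName(now: Date): string {
  // "2026-03-24T23:45:12.000Z" -> "2026-03-24T23-45-12.json"
  // (UTC is assumed; colons are replaced because they are invalid in Windows file names)
  return now.toISOString().slice(0, 19).replace(/:/g, '-') + '.json';
}

interface Snapshot { file: string; coveragePercent: number; }

// Sorting by file name orders snapshots by time, oldest first.
function trend(snapshots: Snapshot[]): number[] {
  return snapshots
    .slice()
    .sort((a, b) => (a.file < b.file ? -1 : a.file > b.file ? 1 : 0))
    .map(s => s.coveragePercent);
}

const name = reportFileName(new Date('2026-03-24T23:45:12Z'));
const series = trend([
  { file: '2026-03-25T08-30-00.json', coveragePercent: 100 },
  { file: '2026-03-24T23-45-12.json', coveragePercent: 96 },
]);
console.log(name, series);
```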

Reverse Mapping (`--by-test`)

The standard report answers "which tests cover feature X?" The reverse mapping answers the opposite: "which features does this test file cover?"

npx tsx scripts/compliance-report.ts --by-test

  ── Test → Feature Mapping ──

  test/e2e/navigation.spec.ts
    NAV: tocClickLoadsPage, backButtonRestores, activeItemHighlights,
         anchorScrollsSmoothly, directUrlLoads, deepLinkLoads,
         bookmarkableUrl, f5ReloadPreserves

  test/e2e/theme.spec.ts
    ACCENT: rightClickOpensPalette, swatchChangesAccent, ...
    THEME: darkLightToggle, themePersistsAfterReload, ...

  test/unit/build-static-io.test.ts
    BUILD: cleanDirPreservesGit, mainOrchestrates, buildTemplateStrips, ...

This is useful for impact analysis: "I'm about to refactor navigation.spec.ts — which features will be affected?" Also supports --by-test --json for machine-readable output.
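
Conceptually, the reverse mapping is an inversion of the feature-to-test index. The sketch below assumes a nested-record input shape and a hypothetical `byTest` function. The feature and criterion names are borrowed from the sample output above.

```typescript
// Hypothetical input shape: feature -> acceptance criterion -> test files.
type Forward = Record<string, Record<string, string[]>>;
// Output shape: test file -> feature -> acceptance criteria.
type Reverse = Record<string, Record<string, string[]>>;

function byTest(forward: Forward): Reverse {
  const reverse: Reverse = {};
  for (const feature of Object.keys(forward)) {
    const acs = forward[feature] ?? {};
    for (const ac of Object.keys(acs)) {
      for (const file of acs[ac] ?? []) {
        const features = reverse[file] ?? (reverse[file] = {});
        const list = features[feature] ?? (features[feature] = []);
        list.push(ac);
      }
    }
  }
  return reverse;
}

const forward: Forward = {
  NAV: {
    tocClickLoadsPage: ['test/e2e/navigation.spec.ts'],
    backButtonRestores: ['test/e2e/navigation.spec.ts'],
  },
  THEME: { darkLightToggle: ['test/e2e/theme.spec.ts'] },
  ACCENT: { rightClickOpensPalette: ['test/e2e/theme.spec.ts'] },
};

const reverse = byTest(forward);
console.log(JSON.stringify(reverse, null, 2));
```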

Workflow Integration

The compliance scanner is now wired into the development workflow:

f menu option in workflow.js — runs compliance with --save (persists history)
"All" test type in the test wizard — runs compliance with --strict --save after unit + Playwright tests
npm run test:all — includes compliance as the final step

No separate step to remember. Compliance is part of the standard flow.

Pre-Push Hook: The Lightweight Gate

Everything above comes together in a Husky pre-push hook:

# .husky/pre-push
npx vitest run && cross-env SMOKE=1 npx playwright test && npx tsx scripts/compliance-report.ts --strict

Three checks before your code reaches the remote:

Unit tests — logic is correct
Smoke Playwright — 4 curated pages look right across themes, pass accessibility
Compliance — every critical feature has 100% AC coverage

Why pre-push instead of pre-commit? Because you might commit feature definitions and tests in separate commits during development. The gate should enforce at the push boundary — the moment code is about to leave your machine.

If any check fails, the push is blocked. Fix the issue, push again.
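
The `&&` chain in the hook short-circuits: a failing step prevents the later ones from running. The same behavior can be sketched in TypeScript, for example if the gate ever needs per-step reporting. `runGate` and `Step` are hypothetical, and the commands are stand-ins so the sketch runs without the project installed.

```typescript
import { spawnSync } from 'node:child_process';

interface Step { name: string; cmd: string; args: string[]; }
interface GateResult { ok: boolean; failed?: string; ran: string[]; }

// Runs steps in order and stops at the first non-zero exit, like `a && b && c`.
function runGate(steps: Step[]): GateResult {
  const ran: string[] = [];
  for (const step of steps) {
    ran.push(step.name);
    const result = spawnSync(step.cmd, step.args, { stdio: 'ignore' });
    if (result.status !== 0) return { ok: false, failed: step.name, ran };
  }
  return { ok: true, ran };
}

// Stand-in commands; a real gate would invoke vitest, playwright, and the compliance script.
const node = process.execPath;
const outcome = runGate([
  { name: 'unit', cmd: node, args: ['-e', 'process.exit(0)'] },
  { name: 'smoke', cmd: node, args: ['-e', 'process.exit(1)'] },
  { name: 'compliance', cmd: node, args: ['-e', 'process.exit(0)'] },
]);
console.log(outcome);
```

Here the simulated smoke failure stops the run before the compliance step, mirroring how a failed check blocks the push.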

The New Script Landscape

Before hardening, there were 3 test scripts. Now there are 10:

Script	What It Does
`npm run test`	Unit tests (Vitest)
`npm run test:coverage`	Unit tests with V8 coverage report
`npm run test:e2e`	Full E2E suite (all pages, static target)
`npm run test:smoke`	Unit + Playwright on 4 curated pages
`npm run test:visual`	Visual regression (all pages, all themes)
`npm run test:visual:update`	Update visual baselines
`npm run test:compliance`	Compliance scanner (--strict)
`npm run test:compliance:by-test`	Reverse mapping (test → features)
`npm run test:compliance:save`	Compliance with history persistence
`npm run test:all`	Everything: unit + Playwright + compliance + save

Plus per-page filtering:

PAGE=/content/blog/my-post.html npx playwright test test/visual/
PAGE=/content/blog/my-post.html npx playwright test test/a11y/

The Philosophy: Sustainable Quality

The test suite went from 644 tests that broke on every content commit to a layered system:

Diagram: the hardened pipeline is a three-tier gate, with a 30-second pre-push check on every push, a full-suite run for deliberate refactors, and a per-page mode for iterative work, each with its own cost and cadence.

Every push: lightweight gate (unit + smoke + compliance). Fast, won't break on content.
Deliberate runs: full suite (all pages, all themes). Run when you change layout, theming, or shared behavior.
During development: per-page checks. Run when you're iterating on a specific page.

The key insight: tests that break on every commit get disabled. Tests that break only when something actually regressed get trusted. Hardening isn't about adding more tests. It's about making the tests you have survivable — so they're still running six months from now.
