Hardening the Test Pipeline
644 tests are useless if they break on every content commit. Hardening isn't adding more tests — it's making the existing ones survivable.
The Problem: Tests That Punish You for Writing Content
This website has a comprehensive test suite — 235 unit tests, 54 E2E tests, 279 visual regression screenshots, 121 accessibility audits, and performance benchmarks. All tied to 20 features through a typed specification system with 112 acceptance criteria at 100% coverage.
Sounds great. Until you add a blog post.
The visual regression tests read public/sitemap.xml at runtime. Every page in the sitemap gets screenshotted across 4 themes (dark, dark-hc, light, light-hc). That's 4 screenshots per page. When you add a new page, the sitemap grows, and the test suite tries to compare a screenshot against a baseline that doesn't exist.
New page = no baseline = test failure.
Most commits to this site are content additions. Blog posts, experience entries, project pages. Every one of them broke the visual regression suite. The "fix" was running --update-snapshots after every content commit, which defeats the purpose of visual regression testing.
The accessibility tests had the same problem — axe scans every page in the sitemap.
The test suite was punishing the most common workflow. That's not a quality system. That's a tax.
Smoke Mode: Test What Matters, Fast
The first fix: don't test everything on every push. Test a curated set of stable pages that represent the site's core behavior.
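// test/visual/pages.spec.ts
import * as fs from 'node:fs';
import * as path from 'node:path';
// Curated, stable pages that represent the site's core behavior
const SMOKE_PAGES = [
  '/',
  '/content/about.html',
  '/content/blog/binary-wrapper.html',
  '/content/skills.html',
];
// Extract page paths from the sitemap: '/' for the home page, '/content/...' otherwise
function readSitemap(): string[] {
  const sitemapPath = path.resolve('public/sitemap.xml');
  const sitemapXml = fs.readFileSync(sitemapPath, 'utf8');
  return [...sitemapXml.matchAll(/<loc>[^<]*?(\/(content\/[^<]+|))<\/loc>/g)]
    .map(m => m[1] || '/')
    .map(p => (p === '' ? '/' : p));
}
// PAGE takes precedence over SMOKE; with neither set, the full sitemap is used
const pageFilter = process.env.PAGE;
const smoke = process.env.SMOKE === '1';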
const pages = pageFilter ? [pageFilter] : (smoke ? SMOKE_PAGES : readSitemap());
SMOKE=1 switches from "all 57 pages from sitemap" to "4 curated pages." These 4 were chosen because:
- Home (/): tests the TOC sidebar, animations, and overall layout
- About (/content/about.html): tests a content page with photos, tables, and mermaid diagrams
- Binary Wrapper (/content/blog/binary-wrapper.html): tests a long blog post with code blocks, headings, and diagrams
- Skills (/content/skills.html): tests a compact page with different content patterns
These pages are stable. They don't change with new content. They exercise every rendering path the site has.
# Smoke test: unit tests + Playwright on 4 curated pages
npm run test:smoke
The same SMOKE=1 filtering is applied to the accessibility tests: axe scans only the curated pages, and the contrast matrix uses them instead of the full sitemap.
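One way to keep the visual and accessibility suites in agreement is to move the selection logic into a small shared helper. The sketch below is an assumption about how that could look, not the site's actual code: the names selectPages and PageEnv are hypothetical, and the sitemap reader is passed in as a function so the helper can be exercised without touching the filesystem.

```typescript
// Hypothetical shared helper; selectPages and PageEnv are illustrative names.
const SMOKE_PAGES: readonly string[] = [
  '/',
  '/content/about.html',
  '/content/blog/binary-wrapper.html',
  '/content/skills.html',
];

interface PageEnv {
  PAGE?: string;
  SMOKE?: string;
}

// Precedence mirrors the spec file: PAGE wins, then SMOKE=1, then the full sitemap.
// The sitemap reader is a function so the file is only read when it is needed.
function selectPages(env: PageEnv, readSitemap: () => string[]): string[] {
  if (env.PAGE) return [env.PAGE];
  if (env.SMOKE === '1') return [...SMOKE_PAGES];
  return readSitemap();
}
```

Each spec file could then call it with `{ PAGE: process.env.PAGE, SMOKE: process.env.SMOKE }` and its own sitemap reader, so a change to the precedence rules happens in one place.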
Per-Page Filtering: Test One Page
Sometimes you edit a specific page and want to verify it looks right. The PAGE environment variable lets you test a single page across all themes:
# Visual regression for one specific page
PAGE=/content/blog/typed-specs/01-why.html npx playwright test test/visual/
# Accessibility scan for one page
PAGE=/content/about.html npx playwright test test/a11y/
This runs 4 screenshots (one per theme) for that single page instead of 228. Useful during development when you're iterating on a specific post.
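The 228 versus 4 figures follow from expanding every page into one case per theme. A minimal sketch of that expansion, assuming a hypothetical buildMatrix helper (the theme names come from the article, the helper and the placeholder page paths do not):

```typescript
// Theme names are from the article; buildMatrix itself is an illustrative helper.
const THEMES = ['dark', 'dark-hc', 'light', 'light-hc'] as const;

interface ScreenshotCase {
  page: string;
  theme: (typeof THEMES)[number];
}

// Expand a list of pages into one screenshot case per page and theme.
function buildMatrix(pages: string[]): ScreenshotCase[] {
  return pages.flatMap(page => THEMES.map(theme => ({ page, theme })));
}

// Full sitemap (57 pages, here with placeholder paths) versus a single PAGE filter.
const fullRun = buildMatrix(Array.from({ length: 57 }, (_, i) => `/content/page-${i}.html`));
const singlePage = buildMatrix(['/content/blog/typed-specs/01-why.html']);
console.log(fullRun.length, singlePage.length); // 228 4
```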
Auto-Baseline Creation: New Pages Don't Fail
The core insight: a new page with no baseline isn't a regression. It's a new thing. It should create its baseline, not fail.
test(`${pageSlug} [${theme.name}]`, async ({ page }, testInfo) => {
  await page.goto(pagePath);
  await page.waitForLoadState('networkidle');
  await applyTheme(page, theme);
  await expandForFullPage(page);
  // Auto-create baseline for new pages instead of failing
  const baselinePath = testInfo.snapshotPath(screenshotName);
  if (!fs.existsSync(baselinePath)) {
    const screenshot = await page.screenshot({ fullPage: true });
    fs.mkdirSync(path.dirname(baselinePath), { recursive: true });
    fs.writeFileSync(baselinePath, screenshot);
    test.skip(true, `Created baseline for new page: ${screenshotName}`);
    return;
  }
  await expect(page).toHaveScreenshot(screenshotName, { fullPage: true });
});
When a page has no baseline:
- Take the screenshot
- Save it as the new baseline
- Mark the test as skipped (not failed)
When a page has a baseline:
- Take the screenshot
- Compare against the baseline (1% pixel threshold, 0.2 color threshold)
- Fail if the diff exceeds thresholds
The test report shows skipped tests clearly — you can see which pages are new and review their baselines at your leisure. But they don't block the build.
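The two thresholds in the comparison step correspond to Playwright's maxDiffPixelRatio and threshold options for toHaveScreenshot. The site's actual config is not shown here, so the fragment below is an assumption about where those values would live. It is written as a plain object so it stands alone; in a real project it would be passed as the expect section of playwright.config.ts.

```typescript
// Assumed shape of the `expect` section of playwright.config.ts.
// In a real config this object would be passed as `expect` to defineConfig().
const expectSettings = {
  toHaveScreenshot: {
    // Fail when more than 1% of pixels differ from the baseline.
    maxDiffPixelRatio: 0.01,
    // Per-pixel color tolerance in YIQ space, from 0 (strict) to 1 (lax).
    threshold: 0.2,
  },
};
```

Setting the values once in the config keeps individual toHaveScreenshot calls free of per-test tuning.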
Compliance Scanner V2: History, Reverse Mapping, and Workflow
The compliance scanner verifies that every feature's acceptance criteria are linked to tests. V1 printed a report and optionally failed the build. V2 adds three capabilities.
Historical Tracking (--save)
npx tsx scripts/compliance-report.ts --save
Writes a timestamped JSON report to docs/compliance/:
docs/compliance/
├── 2026-03-24T23-45-12.json
├── 2026-03-25T08-30-00.json
└── 2026-03-25T14-15-33.json
Each file contains the full coverage matrix: features, ACs, which tests cover them, percentages. Over time, this directory becomes a trend line: did coverage improve? When were new features added? When were gaps closed?
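To make the trend-line idea concrete, here is a minimal sketch of how such filenames could be generated and how consecutive reports could be compared. The ComplianceSnapshot shape, the function names, and the use of UTC are all assumptions; the real script's internals and JSON schema are not shown in this article.

```typescript
// Hypothetical summary of one saved report; the real files hold the full matrix.
interface ComplianceSnapshot {
  file: string;            // e.g. "2026-03-24T23-45-12.json"
  coveragePercent: number; // overall AC coverage at that point in time
}

// Build a filename in the format shown above from a Date.
// Assumes UTC; colons are replaced with hyphens to match the listed names.
function snapshotFileName(date: Date): string {
  return date.toISOString().slice(0, 19).replace(/:/g, '-') + '.json';
}

// Order snapshots chronologically and report the change between consecutive runs.
// Sorting by name works because this timestamp format sorts lexicographically.
function coverageTrend(snapshots: ComplianceSnapshot[]): { file: string; delta: number }[] {
  const ordered = [...snapshots].sort((a, b) => (a.file < b.file ? -1 : a.file > b.file ? 1 : 0));
  return ordered.slice(1).map((snap, i) => ({
    file: snap.file,
    delta: snap.coveragePercent - ordered[i].coveragePercent,
  }));
}
```

A negative delta between two adjacent files would point at the window in which a new feature landed without full coverage.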
Reverse Mapping (--by-test)
The standard report answers "which tests cover feature X?" The reverse mapping answers the opposite: "which features does this test file cover?"
npx tsx scripts/compliance-report.ts --by-test
── Test → Feature Mapping ──
test/e2e/navigation.spec.ts
  NAV: tocClickLoadsPage, backButtonRestores, activeItemHighlights,
       anchorScrollsSmoothly, directUrlLoads, deepLinkLoads,
       bookmarkableUrl, f5ReloadPreserves
test/e2e/theme.spec.ts
  ACCENT: rightClickOpensPalette, swatchChangesAccent, ...
  THEME: darkLightToggle, themePersistsAfterReload, ...
test/unit/build-static-io.test.ts
  BUILD: cleanDirPreservesGit, mainOrchestrates, buildTemplateStrips, ...
This is useful for impact analysis: "I'm about to refactor navigation.spec.ts; which features will be affected?" Also supports --by-test --json for machine-readable output.
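Conceptually, the reverse mapping is an inversion of the feature-to-test index. The sketch below shows that inversion under an assumed data shape (feature id, then acceptance criterion, then the test files covering it); the type names and invertCoverage are hypothetical and may differ from what compliance-report.ts does internally.

```typescript
// Hypothetical input: feature id -> acceptance criterion -> covering test files.
type ForwardMap = Record<string, Record<string, string[]>>;
// Output: test file -> feature id -> acceptance criteria covered by that file.
type ReverseMap = Record<string, Record<string, string[]>>;

function invertCoverage(forward: ForwardMap): ReverseMap {
  const reverse: ReverseMap = {};
  for (const [feature, criteria] of Object.entries(forward)) {
    for (const [criterion, testFiles] of Object.entries(criteria)) {
      for (const file of testFiles) {
        // Create the nested entries on first use, then record the criterion.
        const byFeature = (reverse[file] ??= {});
        (byFeature[feature] ??= []).push(criterion);
      }
    }
  }
  return reverse;
}

// Example using feature and criterion names from the report above.
const byTest = invertCoverage({
  NAV: { tocClickLoadsPage: ['test/e2e/navigation.spec.ts'] },
  THEME: { darkLightToggle: ['test/e2e/theme.spec.ts'] },
  ACCENT: { swatchChangesAccent: ['test/e2e/theme.spec.ts'] },
});
console.log(Object.keys(byTest['test/e2e/theme.spec.ts'])); // [ 'THEME', 'ACCENT' ]
```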
Workflow Integration
The compliance scanner is now wired into the development workflow:
- f menu option in workflow.js: runs compliance with --save (persists history)
- "All" test type in the test wizard: runs compliance with --strict --save after unit + Playwright tests
- npm run test:all: includes compliance as the final step
No separate step to remember. Compliance is part of the standard flow.
Pre-Push Hook: The Lightweight Gate
Everything above comes together in a Husky pre-push hook:
# .husky/pre-push
npx vitest run && cross-env SMOKE=1 npx playwright test && npx tsx scripts/compliance-report.ts --strict
Three checks before your code reaches the remote:
- Unit tests — logic is correct
- Smoke Playwright — 4 curated pages look right across themes, pass accessibility
- Compliance — every critical feature has 100% AC coverage
Why pre-push instead of pre-commit? Because you might commit feature definitions and tests in separate commits during development. The gate should enforce at the push boundary — the moment code is about to leave your machine.
If any check fails, the push is blocked. Fix the issue, push again.
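The && chain stops at the first failing command but does not say which stage failed. If that matters, the same gate could be expressed as a small script. This is a sketch under assumptions, not the site's actual hook: GateStep, runInShell, and runGate are hypothetical names, and only the three commands are taken from the hook above.

```typescript
import { spawnSync } from 'node:child_process';

interface GateStep {
  name: string;
  command: string;
}

// The three checks from the hook, in the same order.
const STEPS: GateStep[] = [
  { name: 'unit', command: 'npx vitest run' },
  { name: 'smoke', command: 'cross-env SMOKE=1 npx playwright test' },
  { name: 'compliance', command: 'npx tsx scripts/compliance-report.ts --strict' },
];

// Default executor: run the command through the shell and return its exit code.
function runInShell(command: string): number {
  const result = spawnSync(command, { shell: true, stdio: 'inherit' });
  return result.status ?? 1; // a null status (killed by a signal) counts as failure
}

// Run steps in order; return the name of the first failing step, or null if all pass.
// Later steps are not started after a failure, matching the `&&` chain.
function runGate(
  steps: GateStep[],
  exec: (command: string) => number = runInShell,
): string | null {
  for (const step of steps) {
    if (exec(step.command) !== 0) return step.name;
  }
  return null;
}
```

A script called from the hook could exit non-zero whenever runGate(STEPS) returns a name, which is what makes Git block the push. The injectable exec parameter is there so the ordering logic can be tested without running the real commands.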
The New Script Landscape
Before hardening, there were 3 test scripts. Now there are 10:
| Script | What It Does |
|---|---|
| npm run test | Unit tests (Vitest) |
| npm run test:coverage | Unit tests with V8 coverage report |
| npm run test:e2e | Full E2E suite (all pages, static target) |
| npm run test:smoke | Unit + Playwright on 4 curated pages |
| npm run test:visual | Visual regression (all pages, all themes) |
| npm run test:visual:update | Update visual baselines |
| npm run test:compliance | Compliance scanner (--strict) |
| npm run test:compliance:by-test | Reverse mapping (test → features) |
| npm run test:compliance:save | Compliance with history persistence |
| npm run test:all | Everything: unit + Playwright + compliance + save |
Plus per-page filtering:
PAGE=/content/blog/my-post.html npx playwright test test/visual/
PAGE=/content/blog/my-post.html npx playwright test test/a11y/
The Philosophy: Sustainable Quality
The test suite went from 644 tests that broke on every content commit to a layered system:
- Every push: lightweight gate (unit + smoke + compliance). Fast, won't break on content.
- Deliberate runs: full suite (all pages, all themes). Run when you change layout, theming, or shared behavior.
- During development: per-page checks. Run when you're iterating on a specific page.
The key insight: tests that break on every commit get disabled. Tests that break only when something actually regressed get trusted. Hardening isn't about adding more tests. It's about making the tests you have survivable — so they're still running six months from now.