Part 06 — Unit testing, port-driven

A code generator is among the most test-sensitive things a team can write. Its output is read by a compiler or an editor that does not forgive whitespace drift; every decision it makes is a decision that, if wrong, propagates to every file it emits; its bugs surface not in the generator's own output but in downstream tools that reject it. The naïve approach — run the generator, write the output to a tmp directory, shell out to tsc or vsce to validate it — produces tests that are slow, flaky, and opaque when they fail. This article lays out the alternative the monorepo already practises: port-driven, in-memory, three-layered, with property tests at the base and @FeatureTest/@Verifies throughout.

Why codegen tests tend to rot

Three failure modes recur across code-generator projects.

Side-effect coupling. A generator that writes to node:fs directly has tests that must clean up after themselves, must serialise across workers, and must tolerate CI agents with read-only home directories. Any test that leaks a file into the repo because an assertion threw before the cleanup code ran becomes a source of flake for weeks.

Serialisation sensitivity. A generator that writes JSON has tests whose assertions have to choose between JSON.stringify (where key ordering is insertion-dependent), line-by-line string matching (where trailing newline changes break everything), and AST equivalence (which requires a second parser). The wrong choice produces either false positives (tests pass on outputs that break downstream consumers) or false negatives (tests fail on output changes that are semantically identical).
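A small sketch of the trade-off, under the assumption that the artefact is a flat JSON object (the manifest fields here are invented for illustration). Comparing serialised text reports a difference that no JSON consumer would notice, while comparing parsed fields does not:

```typescript
// Two serialisations of the same (hypothetical) manifest; only key order differs.
const emittedA = JSON.stringify({ name: 'requirements', version: '0.1.0' });
const emittedB = JSON.stringify({ version: '0.1.0', name: 'requirements' });

// Text comparison: sensitive to insertion order.
const sameText = emittedA === emittedB;

// Field-level comparison on the parsed values: insensitive to key order.
const parsedA = JSON.parse(emittedA) as Record<string, unknown>;
const parsedB = JSON.parse(emittedB) as Record<string, unknown>;
const sameFields =
  parsedA.name === parsedB.name && parsedA.version === parsedB.version;
```

Here sameText is false and sameFields is true, which is the false-negative case described above: the text changed, the meaning did not.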

Snapshot opacity. Tests backed by large text snapshots — the simplest way to pin down a generator's output — become, over time, the thing no reviewer reads and no author updates thoughtfully. The snapshot is three hundred lines of machine-written TypeScript; the review says "looks fine"; the bug the snapshot should have caught sails through.

The port-driven strategy answers all three. Port the filesystem, and side-effect coupling disappears. Port the clock and the process spawner, and every other source of flake disappears with it. Choose test granularity deliberately — per-field assertions over per-file snapshots — and the opacity problem does not get a foothold in the first place.
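A minimal sketch of what porting the filesystem and the clock might look like. The InMemoryFileSystem name and its read method appear in this article's examples; the write method, the Clock shape, FixedClock and emitBanner are assumptions made for illustration, and the real port definitions are those of Part 04.

```typescript
// Hypothetical port shapes; the real ones are defined in Part 04.
interface FileSystem {
  read(path: string): string;
  write(path: string, contents: string): void;
}

interface Clock {
  now(): Date;
}

// In-memory fake: state is a plain Map, so tests have nothing to clean up.
class InMemoryFileSystem implements FileSystem {
  readonly files = new Map<string, string>();

  read(path: string): string {
    const contents = this.files.get(path);
    if (contents === undefined) throw new Error(`no such file: ${path}`);
    return contents;
  }

  write(path: string, contents: string): void {
    this.files.set(path, contents);
  }
}

// Fixed clock: every call returns the same instant, so output is reproducible.
class FixedClock implements Clock {
  private readonly instant: Date;

  constructor(instant: Date) {
    this.instant = instant;
  }

  now(): Date {
    return new Date(this.instant.getTime());
  }
}

// Hypothetical emitter that depends only on the two ports.
function emitBanner(fs: FileSystem, clock: Clock): void {
  fs.write('BANNER.txt', `generated at ${clock.now().toISOString()}\n`);
}

const memFs = new InMemoryFileSystem();
emitBanner(memFs, new FixedClock(new Date('2026-04-14T00:00:00.000Z')));
const banner = memFs.read('BANNER.txt');
```

Because the emitter only ever sees the interfaces, the same function can run against a disk-backed implementation in production and against the Map in tests.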

Four ports, three layers

The meta-DSL's test strategy rests on the four ports named in Part 04 (FileSystem, Process, Clock, Logger) and on three layers of test that together cover the pipeline from spec.ts to sibling-project tree.

Layer 1 — Extractor tests. Input: a spec.ts source string (committed as a fixture in test/fixtures/). Output: a LanguageIR value, compared field-by-field against a hand-authored expected value. The extractor is the only place ts-morph enters the test tree; everything downstream works on IR values directly.

Proposal (design-in-public) — extractor test sketch:
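// Illustrative sketch: FeatureTest, Verifies, ACResult, loadFixture and
// extractLanguageIR are assumed to be in scope.
@FeatureTest(ExtractsTokensFeature)
class ExtractorTests {
  @Verifies('extractsAtSignTokensFromDecorator')
  extractsAtSignTokensFromDecorator(): ACResult {
    // Parse the committed fixture into IR; only this layer touches ts-morph.
    const src = loadFixture('minimal-spec.ts');
    const ir = extractLanguageIR(src);
    // Assert on one field and name it in the failure reason.
    const tokenNames = ir.tokens.map(t => t.name);
    return tokenNames.includes('decoratorKeyword')
      ? { ok: true }
      : { ok: false, reason: `missing decoratorKeyword; got ${tokenNames.join(',')}` };
  }
}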

Note the absence of describe/it. Every test file in the monorepo's existing packages uses @FeatureTest and @Verifies instead: the tests are bindings back to a Feature's acceptance criteria, so compliance runs can report which ACs are proved and which are still missing tests. This is not just a stylistic choice; it is REQ-DOG-FOOD, the invariant that the requirements DSL is used by its own test suites, and every sibling package inherits it. The meta-DSL's tests inherit it too. Zero describe, zero it, anywhere.

Layer 2 — Emitter tests. Input: a LanguageIR value, built inline in the test (no fixture parsing). Output: an InMemoryFileSystem state — a Map<string, string> of path to contents. Assertions target specific fields or regions of the emitted files; full-file snapshots are used only for artefacts whose shape is externally controlled (the TextMate grammar JSON, which has a schema; the VSCode snippet JSON, likewise).

Proposal (design-in-public) — emitter test sketch:
// Illustrative sketch: buildIR, InMemoryFileSystem and ManifestEmitter are assumed to be in scope.
@FeatureTest(ManifestEmitterFeature)
class ManifestEmitterTests {
  @Verifies('manifestDeclaresLanguageId')
  manifestDeclaresLanguageId(): ACResult {
    // Build the IR inline; no fixture parsing at this layer.
    const ir = buildIR({
      language: {
        id: 'requirements',
        extensions: ['.req.ts'],
        scopeName: 'source.requirements',
        aliases: [],
        features: [],
      },
    });
    // Emit into memory, then assert on a single manifest field.
    const fs = new InMemoryFileSystem();
    new ManifestEmitter().emit(ir, fs);
    const pkg = JSON.parse(fs.read('package.json'));
    const actualId = pkg.contributes.languages[0].id;
    return actualId === 'requirements'
      ? { ok: true }
      : { ok: false, reason: `wrong language id: ${actualId}` };
  }
}

Two properties of this test matter. First, it runs in milliseconds because nothing is written to disk, no subprocess is spawned, no compiler is loaded. Second, when it fails, the failure message names the exact field that is wrong, not "the snapshot diverged by twenty lines and here is a full diff". The meta-DSL's tests are judged on both properties; every new test asks, how fast is this, and how specific is the failure message?
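One way to keep failure messages that specific without repeating the ternary in every test is a small comparison helper. The sketch below is an assumption, not part of the monorepo: the ACResult shape is inferred from the examples in this article, and expectField is a hypothetical name.

```typescript
// Shape inferred from this article's sketches; the real ACResult may carry more.
type ACResult = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: compares one field and names it in the failure reason.
function expectField(path: string, actual: unknown, expected: unknown): ACResult {
  return actual === expected
    ? { ok: true }
    : {
        ok: false,
        reason: `${path}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
      };
}

// A manifest with a deliberately wrong language id, as an emitter bug might produce.
const emittedManifest = JSON.parse(
  '{"contributes":{"languages":[{"id":"requirement"}]}}',
);
const idCheck = expectField(
  'contributes.languages[0].id',
  emittedManifest.contributes.languages[0].id,
  'requirements',
);
```

The resulting reason names the path, the expected value and the actual value, so a reader of the CI log does not need to open the emitted file to see what drifted.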

Layer 3 — Property tests. Input: a fast-check arbitrary that generates well-formed LanguageIR values. Output: an assertion over the emitters' output that must hold for every IR the arbitrary produces. Property tests catch the bugs extractor and emitter unit tests miss by construction: bugs that only appear on inputs a human tester would not think to write. The monorepo already uses fast-check for sixteen invariants in the requirements package (packages/requirements/CLAUDE.md — "property + fuzzy via fast-check (16 invariants, 200 runs)"); the meta-DSL inherits the library and the discipline.

Figure 1: The test pyramid. Extractor tests depend on ts-morph; emitter tests are pure functions over the IR; property tests synthesise IRs to stress invariants. Nothing above layer 1 touches a compiler.

Coverage gates — ≥95 % per file

The monorepo runs vitest with --coverage on every test invocation (feedback_vitest_always_coverage), and the gates are enforced per file, not aggregated. The requirements package stands at 100 % lines/branches/functions/statements — 778 tests, per its CLAUDE.md. The meta-DSL's gate is set one notch below that ceiling, at 95 %, to leave room for the two classes of code that resist coverage honestly:

Ts-morph glue inside the extractor. A few branches handle malformed input shapes (a decorator with the wrong arity, a property missing its type annotation) that the extractor defends against with fall-through errors. These branches are hit by dedicated negative tests; a few arm-twisting branches that only fire on mutually-inconsistent ASTs are acceptable as dead code if they make the happy path clearer.
Logger side channels. A Logger.error call on an unreachable branch is sometimes the right insurance; coverage should not force a "this cannot happen" branch to be provoked by a contrived test.

95 % per file means: any file where coverage drops below the gate fails CI. The gate is per file, not per package, because the aggregated coverage number is the kind of metric that rewards writing a thousand-line unit test over one complex file to hide coverage gaps in the other nine. Per-file gates force the gap to surface next to the file that caused it.
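The arithmetic behind that argument can be sketched in a few lines. The numbers and file names below are invented, and the FileCoverage shape is an assumption about what a coverage reporter exposes; the point is only that an aggregate gate and a per-file gate can disagree on the same data.

```typescript
// Hypothetical per-file line counts, as a coverage reporter might produce them.
interface FileCoverage {
  file: string;
  lines: number;
  covered: number;
}

const pct = (covered: number, lines: number): number => (100 * covered) / lines;

// Aggregate gate: one number for the whole package.
function aggregatePct(files: FileCoverage[]): number {
  const lines = files.reduce((n, f) => n + f.lines, 0);
  const covered = files.reduce((n, f) => n + f.covered, 0);
  return pct(covered, lines);
}

// Per-file gate: every file must clear the threshold on its own.
function filesBelowGate(files: FileCoverage[], gate: number): string[] {
  return files.filter(f => pct(f.covered, f.lines) < gate).map(f => f.file);
}

// One large, fully covered file next to nine small files at 75 %.
const report: FileCoverage[] = [
  { file: 'big.ts', lines: 1000, covered: 1000 },
  ...Array.from({ length: 9 }, (_, i) => ({
    file: `small-${i + 1}.ts`,
    lines: 20,
    covered: 15,
  })),
];

const aggregate = aggregatePct(report);
const failing = filesBelowGate(report, 95);
```

With this data the aggregate is 1135 covered lines out of 1180, a little over 96 %, so a package-level gate at 95 % passes, while the per-file gate flags all nine small files.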

No describe/it — @FeatureTest and @Verifies throughout

The monorepo's hard rule — zero describe/it in any test file across any package — has one reason behind it: the requirements DSL wires tests back to their Feature's acceptance criteria so that npx requirements compliance --strict can report, per AC, whether it has a bound test. A describe/it test is invisible to that system; a @FeatureTest(FooFeature) class with @Verifies('acName') methods binds every test to a specific AC on a specific Feature, and the compliance runner sees the binding at static-analysis time (not at runtime; the scanner inspects TypeScript source).
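To make "sees the binding at static-analysis time" concrete, here is a deliberately naive sketch that recovers bindings from source text with regular expressions. It is not the compliance runner's implementation, which this series does not show; scanBindings, the Binding shape and the second AC name are hypothetical.

```typescript
// Hypothetical sketch: a line-based scan, not the real scanner.
interface Binding {
  feature: string;
  ac: string;
}

function scanBindings(source: string): Binding[] {
  const bindings: Binding[] = [];
  let feature: string | undefined;
  for (const line of source.split('\n')) {
    // Remember the most recent @FeatureTest(...) class decorator.
    const featureMatch = /@FeatureTest\(\s*(\w+)\s*\)/.exec(line);
    if (featureMatch) feature = featureMatch[1];
    // Attach each @Verifies('...') to that feature.
    const ac = /@Verifies\(\s*['"]([^'"]+)['"]\s*\)/.exec(line)?.[1];
    if (ac !== undefined && feature !== undefined) bindings.push({ feature, ac });
  }
  return bindings;
}

const sampleSource = [
  '@FeatureTest(ManifestEmitterFeature)',
  'class ManifestEmitterTests {',
  "  @Verifies('manifestDeclaresLanguageId')",
  '  manifestDeclaresLanguageId(): ACResult { return { ok: true }; }',
  '}',
].join('\n');

const found = scanBindings(sampleSource);

// ACs declared on the feature (second name invented) versus ACs with a bound test.
const declaredAcs = ['manifestDeclaresLanguageId', 'manifestDeclaresExtensions'];
const unbound = declaredAcs.filter(ac => !found.some(b => b.ac === ac));
```

Nothing in the scan executes the test class, which is why a describe/it block, carrying no decorator to match, would contribute no bindings at all.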

For the meta-DSL's tests, the consequence is that the series' seven articles' acceptance criteria — declared in assets/features.ts — are the targets every meta-DSL test binds to, via the same @Verifies mechanism. A test of the grammar emitter might bind to FEAT-ARCHPAT-05's emittersAsStrategyExplained or FEAT-ARCHPAT-06's portDrivenArchitectureExplained. The compliance report is the tool that will report, article by article, which acceptance criteria have prose coverage and which have test coverage, and where those two diverge.

One practical note: because the meta-DSL is not implemented in this series, the @FeatureTest bindings in this article's examples are illustrative. The series' own article-level ACs are bound by the prose itself — an article "verifies" its own ACs by the paragraphs that address them. When ide-forge v0 ships, the tests it carries will bind to the same ACs with runnable code, and the loop closes.

A worked property test, end to end

Layer 3 is the layer most teams skip when they first test a generator, because designing a good arbitrary feels like more work than writing unit tests. The investment pays off on one specific class of generator bug: the bug where two layers of output must agree on a piece of information, and one layer drifts.

The invariant we take on, as the worked example: every @Token regex declared in the spec appears as a pattern in the emitted TextMate grammar JSON. This is the kind of invariant a unit test can prove for one specific spec.ts fixture; a property test proves it for every well-formed IR the arbitrary can produce.

Proposal (design-in-public) — worked property test:
import fc from 'fast-check';

const tokenArb = fc.record({
  name: fc.string({ minLength: 3, maxLength: 20 })
           .filter(s => /^[a-z][a-zA-Z]*$/.test(s)),
  pattern: fc.constantFrom(
    '\\b[A-Z][A-Z0-9]+\\b',
    '@[A-Z][a-zA-Z]+\\b',
    '\\b[a-z]+\\b',
  ),
  scope: fc.constant('keyword.other'),
});

const irArb = fc.record({
  schemaVersion: fc.constant('2026-04-14'),
  language: fc.record({
    id: fc.string({ minLength: 1, maxLength: 20 })
          .filter(s => /^[a-z][a-z-]*$/.test(s)),
    // Replace non-word characters so each extension is a dot plus [A-Za-z0-9_]+.
    extensions: fc.array(fc.string(), { minLength: 1, maxLength: 3 })
                  .map(xs => xs.map(x => '.' + (x.replace(/\W/g, 'a') || 'x'))),
    scopeName: fc.string().map(s => 'source.' + (s || 'x')),
    aliases: fc.constant([]),
    features: fc.constant([]),
  }),
  tokens: fc.array(tokenArb, { minLength: 1, maxLength: 8 }),
  rules: fc.constant([]),
  snippets: fc.constant([]),
  lspFeatures: fc.constant([]),
  executors: fc.constant([]),
});

@FeatureTest(GrammarEmitterPropertyFeature)
class GrammarEmitterPropertyTests {
  @Verifies('everyTokenPatternInGrammar')
  everyTokenPatternInGrammar(): ACResult {
    // fc.check returns run details for passing and failing runs alike.
    const details = fc.check(fc.property(irArb, (ir) => {
      const fs = new InMemoryFileSystem();
      new GrammarEmitter().emit(ir, fs);
      const grammar = JSON.parse(
        fs.read(`syntaxes/${ir.language.id}.tmLanguage.json`),
      );
      // Collect the `match` regex of every top-level grammar pattern.
      const emittedPatterns: string[] = (grammar.patterns ?? [])
        .map((p: { match?: string }) => p.match ?? '')
        .filter(Boolean);
      return ir.tokens.every(t => emittedPatterns.includes(t.pattern));
    }), { numRuns: 200 });
    // On failure, counterexample holds the shrunk input.
    return details.failed
      ? { ok: false, reason: JSON.stringify(details.counterexample) }
      : { ok: true };
  }
}

Five design points are worth drawing out of that block.

The arbitrary is narrow on purpose. Token names are forced to the /^[a-z][a-zA-Z]*$/ shape a decorator property name must respect; language ids are forced to a lowercase kebab-case shape; token patterns are drawn from a constant set of three regexes. A broader arbitrary would find bugs the emitter does not claim to handle (arbitrary Unicode in identifiers, malformed regex sources); narrowness keeps the failure reports actionable.
200 runs is the package's convention. The requirements package's CLAUDE.md fixes "200 runs" per property; the meta-DSL inherits that constant. Higher is nice; 200 is the threshold the team has agreed is "enough that a bug surviving it is surprising".
The assertion is every, not equal. The property does not claim the grammar contains only the IR's tokens; it claims it contains at least them. The emitter is free to add built-in patterns (for comments, for whitespace) that do not correspond to a user-declared token.
The counterexample is stringified into the reason. When the property fails, the reason field carries the minimised input fast-check shrunk to, which is the difference between "the grammar is broken" and "here is the two-token IR that reproduces the bug in under ten lines".
No filesystem, no compiler, no subprocess. The whole test runs as a pure function from arbitrary → boolean, takes milliseconds per run, and can be parallelised across vitest workers without any contention.

This is the shape the meta-DSL's property tests take. One per crossed-field invariant: manifest activation events ↔ command contributions, LSP feature registrations ↔ server handler methods, executor commands ↔ task-definition types, snippet prefixes ↔ language scope. Each property is a few dozen lines; each catches a class of drift bugs unit tests cannot.
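As an illustration of the first pairing in that list, the predicate inside such a property could be as small as the sketch below. The activationEvents and contributes.commands fields are standard VS Code manifest fields; the direction of the check (every contributed command has an onCommand activation event), the command ids and the function name are assumptions made for this example.

```typescript
// Minimal slice of a VS Code extension manifest (package.json).
interface ManifestSlice {
  activationEvents: string[];
  contributes: { commands: { command: string; title: string }[] };
}

// Returns the command ids that lack a matching `onCommand:<id>` activation event.
function commandsWithoutActivation(manifest: ManifestSlice): string[] {
  const events = new Set(manifest.activationEvents);
  return manifest.contributes.commands
    .map(c => c.command)
    .filter(id => !events.has(`onCommand:${id}`));
}

// A drifted manifest: two commands contributed, one activation event emitted.
const drifted: ManifestSlice = {
  activationEvents: ['onCommand:requirements.compliance'],
  contributes: {
    commands: [
      { command: 'requirements.compliance', title: 'Run compliance' },
      { command: 'requirements.scaffold', title: 'Scaffold feature' },
    ],
  },
};

const orphaned = commandsWithoutActivation(drifted);
```

In a property test the function would run over the manifest emitted for each generated IR, and the property would require the returned list to be empty; returning the offending ids rather than a bare boolean keeps the counterexample readable.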

What the three layers do not cover

An honest test strategy names what it cannot prove. The three-layer pyramid leaves three gaps the meta-DSL must fill with other mechanisms:

End-to-end extension installation. No unit test can assert that the generated .vsix, installed into a real VSCode, actually produces the expected behaviour in the editor. This is an integration concern, and it belongs to a separate test suite that runs against a real VSCode fixture — the monorepo has precedent for this in packages/ssg-site's Playwright suite, which drives a real browser against the built site. The meta-DSL's equivalent is a Playwright-style extension test that installs the generated .vsix into a headless VSCode and asserts on user-visible behaviour. It is future work; this series does not commit to it.
Language-server protocol compliance. A generated LSP server must satisfy the LSP specification — initialise/shutdown lifecycle, capability negotiation, message framing. The unit tests assert the server is wired correctly; they do not assert the wire protocol is correct. LSP compliance is covered by the reference vscode-languageserver-node test suite the generated server depends on, plus a small set of integration tests that send handcrafted JSON-RPC messages and assert on the responses. Also future work.
Visual correctness of highlighting. That a grammar emits the right patterns does not mean the theme renders them correctly. Visual correctness is an editor-side concern that only a human or a screenshot-based visual test can assert. The monorepo's a11y-test-themes.mjs runs four-theme pa11y sweeps for the site; an analogous tool could capture four-theme screenshots of the generated extension's editor. Again, future work.
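For the second gap, the handcrafted JSON-RPC messages need the framing the LSP base protocol prescribes: a Content-Length header counting the body's UTF-8 bytes, a blank line, then the JSON body. A minimal sketch of a framing helper such an integration test might use follows; frameMessage is a hypothetical name, and the initialize params are reduced to the bare minimum.

```typescript
// Frames a JSON-RPC payload for the LSP base protocol:
// "Content-Length: <bytes>\r\n\r\n" followed by the UTF-8 encoded JSON body.
function frameMessage(payload: unknown): Uint8Array {
  const encoder = new TextEncoder();
  const body = encoder.encode(JSON.stringify(payload));
  // The length is counted in bytes, not in JavaScript string characters.
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const framed = new Uint8Array(header.length + body.length);
  framed.set(header, 0);
  framed.set(body, header.length);
  return framed;
}

// A minimal initialize request, the first message of the LSP lifecycle.
const initializeRequest = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: { processId: null, rootUri: null, capabilities: {} },
};

const framedBytes = frameMessage(initializeRequest);
const framedText = new TextDecoder().decode(framedBytes);
```

An integration test would write framedBytes to the server's input stream and parse the framed response the same way in reverse; that part depends on how the generated server is launched and is left out here.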

Listing these gaps is the posture. The meta-DSL commits to covering the three layers completely; it does not pretend the three layers cover everything. Where a concern sits outside the pyramid, it is named, deferred, and tracked.

Coverage, kept honest

A final note, which is also a warning. Coverage is a ceiling on confidence, not a floor. A suite of tests at 100 % coverage can still be wrong, if the assertions are weak; 95 % coverage with sharp assertions is stronger than 100 % coverage with "it runs without throwing" checks. The per-file 95 % gate is the minimum the meta-DSL will not ship below; the working target is that every line the tests touch is touched purposefully, with an assertion that would fail if the behaviour drifted.

This is also the discipline the monorepo enforces via the requirements DSL's compliance command: coverage matters only against a spec. A line covered by a test that does not bind to a Feature's acceptance criterion is a line covered accidentally. The @FeatureTest/@Verifies binding is what turns coverage into intentional coverage — every passing test proves a specific thing the DSL's author claimed the code would do.

Part 07 closes the series. It walks the requirements-ide.spec.ts artefact decorator by decorator, ties each decorator to the article that justified it, names the mise-en-abyme loop explicitly, lists the open questions that remain, and points at the future implementation series. The design, by the end of 07, is as tight as design-in-public can make it before code gets written.

Build counterpart

The test strategy proposed here is exercised article by article in the companion build series, Ide.Dsl — Build. The port-driven posture is honoured from Build 02 — The extractor, where SourceReader hides ts-morph from the extraction logic; every emitter from Build 04 onward takes FileSystem and Logger ports so InMemoryFileSystem is enough to verify output. The three-layer pyramid (extractor → emitters → LSP integration) gets its own end-to-end article, Testing the full stack, later in the build series.
