The AI Agent's Context Problem, Solved
The versus series on AI agents predicted that typed specifications would produce tighter feedback loops for AI agents than document-based specifications. This series provides the evidence.
An AI agent tasked with "add tests for Feature X" needs four things:
- The contract — what to test. In this system, that is the Feature abstract class with its AC methods. The agent reads `requirements/features/trace.ts` and knows: 11 ACs, each with a JSDoc description, all abstract, all typed `ACResult`.
- The conventions — how to write tests. The `test/unit/CLAUDE.md` file teaches the exact anatomy: `@FeatureTest(F)` class decorator, `@Verifies<F>('ac')` method decorator, import `expect` from vitest and nothing else, no `describe`/`it`, helpers as plain functions.
- The dep types — what to inject. `src/lib/external.ts` lists every port: `FileSystem`, `Logger`, `Scheduler`, `WindowLike`. The agent builds fakes from these interfaces — no guessing what `window` looks like.
- The validation gate — how to know it worked. The compliance scanner reports which ACs are covered and which are not. The quality gate fails if critical ACs are missing.
No Jira ticket to parse. No PRD to interpret. No ambiguous natural-language specification. The feature class IS the specification, the decorators ARE the linking mechanism, and the scanner IS the validator.
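To make that concrete, here is a sketch of what such a feature class looks like. The `ACResult` shape and the JSDoc wording for the first method are assumptions for illustration; only the pattern (one abstract method per AC, names doubling as the AC vocabulary) is the point:

```typescript
// Sketch of a feature spec class. ACResult's exact shape is an assumption
// for illustration; the pattern (one abstract method per AC) is what matters.
type ACResult = { status: 'TU' | 'E2E' | 'uncovered' };

abstract class TraceFeature {
  /** `work trace index` builds the trace index from the bindings manifest. */
  abstract indexBuildsFromManifest(): ACResult;

  /** `work trace impact` identifies impacted features and tests to rerun from changed files. */
  abstract impactAnalysis(): ACResult;

  // ...nine more abstract methods, one per AC, all returning ACResult
}

// The class is never instantiated; it is a compile-time contract whose
// method names double as the AC vocabulary.
const acNames: (keyof TraceFeature)[] = ['indexBuildsFromManifest', 'impactAnalysis'];
console.log(acNames.join(', '));
```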
The Self-Implementation Loop
Here is the concrete loop an agent executes when implementing a new feature's tests:
- Agent reads the Feature class — sees 11 abstract methods, each with a JSDoc description.
- Agent writes a test class: `@FeatureTest(TraceFeature)` at the class level, `@Verifies<TraceFeature>('indexBuildsFromManifest')` on each method.
- Agent imports the source functions: `buildTraceIndex`, `queryFileToFeatures`, `normalizePath` — from `scripts/cli/commands/lib/trace-core.ts`.
- TypeScript compiler validates: does `'indexBuildsFromManifest'` exist on `TraceFeature`? If not, compile error. The agent fixes the typo and re-runs. Sub-second feedback.
- Agent runs tests — vitest executes the class (auto-registered by `@FeatureTest`), coverage instruments the imported source files.
- Compliance scanner reports: Total 11, Covered 11, TU 11, E2E 0, 100% — quality gate PASS.
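The compile-time validation step can be sketched with a plain function carrying the same `keyof` constraint as the `@Verifies` decorator. Decorator syntax is omitted here to keep the sketch self-contained, and the names are illustrative:

```typescript
// Stand-in for @Verifies: a plain function with the same keyof-based typing.
// The real decorator also records metadata for the compliance scanner.
type ACResult = { covered: boolean };

abstract class TraceFeature {
  abstract indexBuildsFromManifest(): ACResult;
  abstract impactAnalysis(): ACResult;
}

const registered: string[] = [];

function verifies<F>(ac: keyof F & string): void {
  registered.push(ac); // record the test-to-AC binding
}

verifies<TraceFeature>('indexBuildsFromManifest'); // compiles: the AC exists
// verifies<TraceFeature>('indexBuildsFromManifst'); // compile error: typo caught instantly

console.log(registered);
```

Because the argument type is `keyof F`, any AC name that does not exist on the feature class is rejected before the tests even run.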
The hexagonal architecture makes this loop particularly smooth. The agent does not need to set up jsdom, mock the filesystem, or configure network stubs. It builds fakes from the port interfaces — `{ readFile: async () => '{}', exists: async () => true }` — and passes them to the factory. Tests are pure: construct fakes, call function, assert result.
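A minimal sketch of that fake-building step. The port shapes below are assumed from the interface names; the real `src/lib/external.ts` definitions may differ:

```typescript
// Port interfaces assumed from the names in src/lib/external.ts; shapes are illustrative.
interface FileSystem {
  readFile(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
interface Logger {
  info(msg: string): void;
}

// A factory in the hexagonal style: all IO arrives through injected ports.
function makeManifestReader(fs: FileSystem, log: Logger) {
  return async (path: string): Promise<unknown> => {
    if (!(await fs.exists(path))) return null;
    log.info(`reading ${path}`);
    return JSON.parse(await fs.readFile(path));
  };
}

// Test-side fakes: plain object literals, no jsdom, no real filesystem.
const fakeFs: FileSystem = { readFile: async () => '{"features": 96}', exists: async () => true };
const logged: string[] = [];
const fakeLog: Logger = { info: (msg) => logged.push(msg) };

const read = makeManifestReader(fakeFs, fakeLog);
const manifest = (await read('bindings-manifest.json')) as { features: number };
console.log(manifest.features, logged.length);
```

The factory never touches the real world; swapping the fakes for real adapters is a production concern, not a test concern.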
The TRACE feature was substantially co-authored this way. The agent read the 11 ACs, wrote 57 test methods across 11 test classes, used synthetic fixtures (2 fake features, 3 fake files, 3 fake test refs), and achieved 100% statement coverage on the core.
What the Agent Produces
A concrete example. The agent reads this AC:
```typescript
/** `work trace impact` identifies impacted features and tests to rerun from changed files. */
abstract impactAnalysis(): ACResult;
```

And produces this test class:
```typescript
@FeatureTest(TraceFeature)
class ImpactAnalysisTests {
  @Verifies<TraceFeature>('impactAnalysis')
  'identifies impacted features from changed files'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    expect(result.impactedFeatures).toHaveLength(1);
    expect(result.impactedFeatures[0]!.id).toBe('FOO');
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'sorts impacted features by priority critical first'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts', 'src/lib/bar-state.ts']);
    expect(result.impactedFeatures[0]!.priority).toBe('critical');
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'collects tests to rerun'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    expect(result.testsToRerun.size).toBeGreaterThan(0);
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'returns empty for unknown files'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/nonexistent.ts']);
    expect(result.impactedFeatures).toHaveLength(0);
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'renderer produces output'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    const lines = renderImpact(result, noopFmt);
    expect(lines.length).toBeGreaterThan(0);
  }
}
```

Five test methods, one AC, each verifying a different facet: the happy path, the priority sort, the test collection, the empty case, and the renderer. All importing `queryImpact` and `renderImpact` from the trace core — symbols the AST scanner will resolve to `scripts/cli/commands/lib/trace-core.ts`.
Second Example: FSM Lifecycle Guards
The `impactAnalysis` example above was a pure query function — straightforward. The more interesting case is state machine testing, where the agent must understand transitions, guards, and idempotent events.
The agent reads this AC:
```typescript
/** The FSM lifecycle guards queries: only ready state accepts requests. */
abstract fsmLifecycleGuards(): ACResult;
```

And produces 11 test methods covering the full state machine contract:
```typescript
@FeatureTest(TraceFeature)
class FsmLifecycleTests {
  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'starts in idle state'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    expect(fsm.getState()).toBe('idle');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'transitions idle -> loading -> ready'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    fsm.load();
    expect(fsm.getState()).toBe('loading');
    fsm.ready(buildFixtureIndex());
    expect(fsm.getState()).toBe('ready');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'requireReady throws when not ready'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    expect(() => fsm.requireReady()).toThrow('not ready');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'idle ignores fail (no-op)'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    fsm.fail();
    expect(fsm.getState()).toBe('idle'); // unchanged
  }

  // ... 7 more: error recovery, ready reload, idempotent transitions
}
```

The agent imports `createTraceIndexMachine` from `src/lib/trace-index-state.ts`. The AST scanner resolves it. The AC is bound to the FSM factory. This single example combines state machines, SOLID (the factory takes no deps — it is pure closure), requirements (the AC describes the guard contract), and `@Verifies` (linking the test to the AC via `keyof T`).
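The pure-closure factory pattern the paragraph describes can be sketched like this. It is an illustration of the pattern under assumed state names, not the real `createTraceIndexMachine` implementation:

```typescript
// Sketch of a guard-style index FSM as a pure closure: no injected deps,
// all state captured in the closure. State names are assumptions.
type State = 'idle' | 'loading' | 'ready' | 'error';

function createIndexMachine<T>() {
  let state: State = 'idle';
  let index: T | null = null;
  return {
    getState: () => state,
    load() { if (state === 'idle' || state === 'error') state = 'loading'; },
    ready(i: T) { if (state === 'loading') { index = i; state = 'ready'; } },
    fail() { if (state === 'loading') state = 'error'; }, // no-op from idle
    requireReady(): T {
      if (state !== 'ready' || index === null) throw new Error('index not ready');
      return index; // only the ready state accepts requests
    },
  };
}

const fsm = createIndexMachine<{ features: number }>();
fsm.fail();                  // ignored: idle stays idle
console.log(fsm.getState()); // 'idle'
fsm.load();
fsm.ready({ features: 96 });
console.log(fsm.requireReady().features); // 96
```

Because the factory takes no dependencies, tests need no fakes at all: construct, drive transitions, assert state.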
The pattern scales. Across the project, 96 features with 818 ACs were substantially co-authored with AI agents using this exact loop. The agent reads the contract, writes the test, the system validates completeness. The human reviews correctness.
Skills as Reusable Workflows
The agent does not invent the test convention — it reads it from `test/unit/CLAUDE.md`. This file teaches the exact anatomy of a test file:
- Import only `expect` from vitest, never `describe`/`it`
- `@FeatureTest(F)` on every class, `@Verifies<F>('ac')` on every method
- Helpers as plain functions outside the class, not class members
- Async methods with `async`, DOM tests with the `@vitest-environment jsdom` pragma
- No boilerplate registration loop — `@FeatureTest` auto-registers
Claude Code skills encode this as a repeatable operation. When the agent is asked to "add tests for Feature X," the skill loads the CLAUDE.md conventions, reads the Feature class, generates the test file, and runs the compliance scanner. The same workflow every time. No creative interpretation. No drift.
The conventions file is itself a requirement: if `test/unit/CLAUDE.md` says "no `describe`/`it`", the compliance scanner enforces it. A test file with bare `describe()` calls fails the build. The convention is not a suggestion — it is a gate.
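The enforcement can be pictured with a simplified check. The real scanner presumably works on the TypeScript AST; this regex version only illustrates the gate:

```typescript
// Simplified illustration of the convention gate. The real scanner likely
// inspects the AST; the rule it enforces is the same: no describe/it in tests.
function conventionViolations(testSource: string): string[] {
  const violations: string[] = [];
  if (/\bdescribe\s*\(/.test(testSource)) violations.push('uses describe()');
  if (/\bit\s*\(/.test(testSource)) violations.push('uses it()');
  return violations;
}

console.log(conventionViolations(`describe('trace', () => { it('works', () => {}) })`)); // flags both
console.log(conventionViolations(`@FeatureTest(TraceFeature)\nclass TraceTests {}`));    // clean
```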
The Complete Traceability Chain
With all four parts in place, the complete chain has no manual links:
| Link | Mechanism | Verified by |
|---|---|---|
| Requirement exists | Abstract class in `requirements/features/` | TypeScript compiler (class must export) |
| AC exists | Abstract method on Feature | TypeScript compiler (method must be declared) |
| AC referenced correctly | `@Verifies<F>('acName')` with `keyof T` | TypeScript compiler (compile error on typo) |
| Test exists for AC | `@Verifies` decorator on test method | Compliance scanner (reports uncovered ACs) |
| Test calls source code | `import { fn } from 'src/lib/...'` | AST scanner (resolves imports to source files) |
| Source code is exercised | vitest v8 line instrumentation | Coverage thresholds (98% statements gate) |
| No sync IO regression | AST scan for `*Sync` calls | Sync-usage scanner (zero violations gate) |
| Overall quality | All of the above combined | Quality gate: PASS / FAIL |
Each link is checked by a different mechanism, and each mechanism is itself a tracked feature with its own ACs and tests. The system verifies itself — not as a philosophical curiosity, but as an engineering guarantee.
What AI Cannot Verify
The compliance scanner checks completeness, not correctness. An agent can write a `@Verifies` method that calls the right function but asserts nothing useful:
```typescript
@Verifies<TraceFeature>('impactAnalysis')
'does something'() {
  const result = queryImpact(buildFixtureIndex(), []);
  expect(result).toBeDefined(); // trivially true — useless assertion
}
```

This passes the scanner. The AC is "covered." The binding exists. The coverage is 100%. But the test verifies nothing meaningful — it confirms that calling the function does not throw, which is a low bar.
The scanner cannot judge intent. It cannot know that `expect(result).toBeDefined()` is a weak assertion. It cannot know that the test should have checked `result.impactedFeatures.length` instead. That is a human judgement call.
The system's guarantee is: every AC has at least one test that imports and calls the relevant source code. The guarantee is not: every test is a good test. Completeness is automated. Correctness is human.
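The gap between the two guarantees is easy to demonstrate. In this sketch, a stand-in `queryImpact` with an assumed return shape is deliberately broken, and only the strong assertion notices:

```typescript
// Assumed return shape for queryImpact; the stand-in below has a deliberate
// bug (it never finds anything) to show what each assertion style catches.
interface ImpactResult {
  impactedFeatures: { id: string; priority: string }[];
}

function brokenQueryImpact(_changed: string[]): ImpactResult {
  return { impactedFeatures: [] }; // bug: ignores its input entirely
}

const result = brokenQueryImpact(['src/lib/foo-state.ts']);

// Weak assertion: still passes against the broken implementation.
const weakPasses = result !== undefined;

// Strong assertion: pins the behaviour the AC describes, so it catches the bug.
const strongPasses =
  result.impactedFeatures.length === 1 && result.impactedFeatures[0]?.id === 'FOO';

console.log({ weakPasses, strongPasses });
```

Both tests would count as "covered" in the scanner's report; only the second one protects the AC.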
Mutation testing (Stryker) would add the next layer — verifying that assertions actually detect regressions. That is a future step. For now, the closed loop provides the scaffold: if every AC is tested and every test calls the right code, the probability that a meaningful regression goes undetected is low. Not zero. Low.
What This Makes Possible
The closed loop changes the economics of feature development. Adding a new feature to the project means:
- Write the abstract class in `requirements/features/` — 15 lines, one per AC.
- Export it from `requirements/index.ts` — 1 line.
- Write or generate the tests — the AI agent handles this, constrained by `keyof T` and the compliance scanner.
- Run the scanner — the manifest updates automatically.
- Run the quality gate — PASS or fix.
No binding files to maintain. No `sourceFiles[]` to declare. No regex patterns to update. No manual traceability matrix. The test code is the single source of truth, and the rest is derived.
The Manifest as a Platform
The `BindingsManifest` started as an artefact — a JSON file that the compliance scanner reads to produce a report. It is now a platform that supports derived applications:
The compliance report — the original consumer. Reads the manifest, cross-references features and tests, produces the quality gate. This is where the manifest was born.
The trace engine (Part II) — the first derived application. Consumes the manifest to build a `TraceIndex` with seven `ReadonlyMap` instances, exposing eight query sub-commands. It does not produce bindings — it queries them.

The architecture X-ray (Part III) — the second derived application. Analyses the manifest as a bipartite graph to detect SRP violations, measure coupling, compute isolation scores, and surface encapsulation breaches. Same data, entirely different questions.
The AI agent loop (this part) — the third consumer. The agent reads the manifest to understand which ACs are covered and which are not, then produces tests to fill the gaps. The manifest is the feedback signal that closes the agent's loop.
The pattern is: produce the manifest once (via AST inference), consume it many times (for compliance, tracing, architecture analysis, agent guidance). Each consumer answers a different question from the same data. The manifest is cheap to produce — a single `parseTestFile` pass over the test suite — and the applications it enables are limited only by the questions you think to ask.
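The produce-once, consume-many shape can be sketched with a simplified manifest. The entry fields below are assumptions for illustration, not the real `BindingsManifest` schema:

```typescript
// Simplified manifest entry; the real schema is richer, this is illustrative.
interface Binding {
  feature: string;
  ac: string;
  testFile: string;
  sourceFiles: string[];
}

const manifest: Binding[] = [
  { feature: 'TRACE', ac: 'impactAnalysis', testFile: 'test/unit/trace.test.ts', sourceFiles: ['scripts/cli/commands/lib/trace-core.ts'] },
  { feature: 'TRACE', ac: 'fsmLifecycleGuards', testFile: 'test/unit/trace-fsm.test.ts', sourceFiles: ['src/lib/trace-index-state.ts'] },
];

// Consumer 1 (compliance): which ACs are covered?
const covered = new Set(manifest.map((b) => `${b.feature}.${b.ac}`));

// Consumer 2 (trace): which tests rerun when a file changes?
function testsForFile(file: string): string[] {
  return manifest.filter((b) => b.sourceFiles.includes(file)).map((b) => b.testFile);
}

console.log(covered.size);
console.log(testsForFile('src/lib/trace-index-state.ts'));
```

Same data, two questions; the architecture X-ray and the agent loop are just more functions over the same array.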
The 96-feature, 818-AC, 2,642-test system that produces this blog post was built this way. Not all at once — incrementally, feature by feature, each one traced from requirement to code to test to proof. The loop is closed. The system verifies itself. The agent writes the tests. The human reviews the meaning.
That is the architecture. These are the numbers. The code is the proof.