The AI Agent's Context Problem, Solved
The versus series on AI agents predicted that typed specifications would produce tighter feedback loops for AI agents than document-based specifications. This series provides the evidence.
An AI agent tasked with "add tests for Feature X" needs four things:
- The contract — what to test. In this system, that is the Feature abstract class with its AC methods. The agent reads `requirements/features/trace.ts` and knows: 11 ACs, each with a JSDoc description, all abstract, all typed `ACResult`.
- The conventions — how to write tests. The `test/unit/CLAUDE.md` file teaches the exact anatomy: `@FeatureTest(F)` class decorator, `@Verifies<F>('ac')` method decorator, import `expect` from vitest and nothing else, no `describe`/`it`, helpers as plain functions.
- The dep types — what to inject. `src/lib/external.ts` lists every port: `FileSystem`, `Logger`, `Scheduler`, `WindowLike`. The agent builds fakes from these interfaces — no guessing what `window` looks like.
- The validation gate — how to know it worked. The compliance scanner reports which ACs are covered and which are not. The quality gate fails if critical ACs are missing.
No Jira ticket to parse. No PRD to interpret. No ambiguous natural-language specification. The feature class IS the specification, the decorators ARE the linking mechanism, and the scanner IS the validator.
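To make that concrete, here is a sketch of what such a feature class looks like. The `ACResult` shape and the JSDoc wording for the first method are assumptions for illustration; only the pattern (one abstract method per AC, names doubling as the AC vocabulary) is the point:

```typescript
// Sketch of a feature spec class. ACResult's exact shape is an assumption
// for illustration; the pattern (one abstract method per AC) is what matters.
type ACResult = { status: 'TU' | 'E2E' | 'uncovered' };

abstract class TraceFeature {
  /** `work trace index` builds the trace index from the bindings manifest. */
  abstract indexBuildsFromManifest(): ACResult;

  /** `work trace impact` identifies impacted features and tests to rerun from changed files. */
  abstract impactAnalysis(): ACResult;

  // ...nine more abstract methods, one per AC, all returning ACResult
}

// The class is never instantiated; it is a compile-time contract whose
// method names double as the AC vocabulary.
const acNames: (keyof TraceFeature)[] = ['indexBuildsFromManifest', 'impactAnalysis'];
console.log(acNames.join(', '));
```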
The Self-Implementation Loop
Here is the concrete loop an agent executes when implementing a new feature's tests:
- Agent reads the Feature class — sees 11 abstract methods, each with a JSDoc description.
- Agent writes a test class: `@FeatureTest(TraceFeature)` at the class level, `@Verifies<TraceFeature>('indexBuildsFromManifest')` on each method.
- Agent imports the source functions: `buildTraceIndex`, `queryFileToFeatures`, `normalizePath` — from `scripts/cli/commands/lib/trace-core.ts`.
- TypeScript compiler validates: does `'indexBuildsFromManifest'` exist on `TraceFeature`? If not, compile error. The agent fixes the typo and re-runs. Sub-second feedback.
- Agent runs tests — vitest executes the class (auto-registered by `@FeatureTest`), coverage instruments the imported source files.
- Compliance scanner reports: Total 11, Covered 11, TU 11, E2E 0, 100% — quality gate PASS.
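The compile-time validation step can be sketched with a plain function carrying the same `keyof` constraint as the `@Verifies` decorator. Decorator syntax is omitted here to keep the sketch self-contained, and the names are illustrative:

```typescript
// Stand-in for @Verifies: a plain function with the same keyof-based typing.
// The real decorator also records metadata for the compliance scanner.
type ACResult = { covered: boolean };

abstract class TraceFeature {
  abstract indexBuildsFromManifest(): ACResult;
  abstract impactAnalysis(): ACResult;
}

const registered: string[] = [];

function verifies<F>(ac: keyof F & string): void {
  registered.push(ac); // record the test-to-AC binding
}

verifies<TraceFeature>('indexBuildsFromManifest'); // compiles: the AC exists
// verifies<TraceFeature>('indexBuildsFromManifst'); // compile error: typo caught instantly

console.log(registered);
```

Because the argument type is `keyof F`, any AC name that does not exist on the feature class is rejected before the tests even run.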
The hexagonal architecture makes this loop particularly smooth. The agent does not need to set up jsdom, mock the filesystem, or configure network stubs. It builds fakes from the port interfaces — `{ readFile: async () => '{}', exists: async () => true }` — and passes them to the factory. Tests are pure: construct fakes, call function, assert result.
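A minimal sketch of that fake-building step. The port shapes below are assumed from the interface names; the real `src/lib/external.ts` definitions may differ:

```typescript
// Port interfaces assumed from the names in src/lib/external.ts; shapes are illustrative.
interface FileSystem {
  readFile(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
interface Logger {
  info(msg: string): void;
}

// A factory in the hexagonal style: all IO arrives through injected ports.
function makeManifestReader(fs: FileSystem, log: Logger) {
  return async (path: string): Promise<unknown> => {
    if (!(await fs.exists(path))) return null;
    log.info(`reading ${path}`);
    return JSON.parse(await fs.readFile(path));
  };
}

// Test-side fakes: plain object literals, no jsdom, no real filesystem.
const fakeFs: FileSystem = { readFile: async () => '{"features": 96}', exists: async () => true };
const logged: string[] = [];
const fakeLog: Logger = { info: (msg) => logged.push(msg) };

const read = makeManifestReader(fakeFs, fakeLog);
const manifest = (await read('bindings-manifest.json')) as { features: number };
console.log(manifest.features, logged.length);
```

The factory never touches the real world; swapping the fakes for real adapters is a production concern, not a test concern.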
The TRACE feature was substantially co-authored this way. The agent read the 11 ACs, wrote 57 test methods across 11 test classes, used synthetic fixtures (2 fake features, 3 fake files, 3 fake test refs), and achieved 100% statement coverage on the core.
What the Agent Produces
A concrete example. The agent reads this AC:
```typescript
/** `work trace impact` identifies impacted features and tests to rerun from changed files. */
abstract impactAnalysis(): ACResult;
```

And produces this test class:
```typescript
@FeatureTest(TraceFeature)
class ImpactAnalysisTests {
  @Verifies<TraceFeature>('impactAnalysis')
  'identifies impacted features from changed files'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    expect(result.impactedFeatures).toHaveLength(1);
    expect(result.impactedFeatures[0]!.id).toBe('FOO');
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'sorts impacted features by priority critical first'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts', 'src/lib/bar-state.ts']);
    expect(result.impactedFeatures[0]!.priority).toBe('critical');
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'collects tests to rerun'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    expect(result.testsToRerun.size).toBeGreaterThan(0);
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'returns empty for unknown files'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/nonexistent.ts']);
    expect(result.impactedFeatures).toHaveLength(0);
  }

  @Verifies<TraceFeature>('impactAnalysis')
  'renderer produces output'() {
    const index = buildFixtureIndex();
    const result = queryImpact(index, ['src/lib/foo-state.ts']);
    const lines = renderImpact(result, noopFmt);
    expect(lines.length).toBeGreaterThan(0);
  }
}
```

Five test methods, one AC, each verifying a different facet: the happy path, the priority sort, the test collection, the empty case, and the renderer. All importing `queryImpact` and `renderImpact` from the trace core — symbols the AST scanner will resolve to `scripts/cli/commands/lib/trace-core.ts`.
Second Example: FSM Lifecycle Guards
The `impactAnalysis` example above was a pure query function — straightforward. The more interesting case is state machine testing, where the agent must understand transitions, guards, and idempotent events.
The agent reads this AC:
```typescript
/** The FSM lifecycle guards queries: only ready state accepts requests. */
abstract fsmLifecycleGuards(): ACResult;
```

And produces 11 test methods covering the full state machine contract:
```typescript
@FeatureTest(TraceFeature)
class FsmLifecycleTests {
  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'starts in idle state'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    expect(fsm.getState()).toBe('idle');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'transitions idle -> loading -> ready'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    fsm.load();
    expect(fsm.getState()).toBe('loading');
    fsm.ready(buildFixtureIndex());
    expect(fsm.getState()).toBe('ready');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'requireReady throws when not ready'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    expect(() => fsm.requireReady()).toThrow('not ready');
  }

  @Verifies<TraceFeature>('fsmLifecycleGuards')
  'idle ignores fail (no-op)'() {
    const fsm = createTraceIndexMachine<TraceIndex>();
    fsm.fail();
    expect(fsm.getState()).toBe('idle'); // unchanged
  }

  // ... 7 more: error recovery, ready reload, idempotent transitions
}
```

The agent imports `createTraceIndexMachine` from `src/lib/trace-index-state.ts`. The AST scanner resolves it. The AC is bound to the FSM factory. This single example combines state machines, SOLID (the factory takes no deps — it is pure closure), requirements (the AC describes the guard contract), and `@Verifies` (linking the test to the AC via `keyof T`).
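The pure-closure factory pattern the paragraph describes can be sketched like this. It is an illustration of the pattern under assumed state names, not the real `createTraceIndexMachine` implementation:

```typescript
// Sketch of a guard-style index FSM as a pure closure: no injected deps,
// all state captured in the closure. State names are assumptions.
type State = 'idle' | 'loading' | 'ready' | 'error';

function createIndexMachine<T>() {
  let state: State = 'idle';
  let index: T | null = null;
  return {
    getState: () => state,
    load() { if (state === 'idle' || state === 'error') state = 'loading'; },
    ready(i: T) { if (state === 'loading') { index = i; state = 'ready'; } },
    fail() { if (state === 'loading') state = 'error'; }, // no-op from idle
    requireReady(): T {
      if (state !== 'ready' || index === null) throw new Error('index not ready');
      return index; // only the ready state accepts requests
    },
  };
}

const fsm = createIndexMachine<{ features: number }>();
fsm.fail();                  // ignored: idle stays idle
console.log(fsm.getState()); // 'idle'
fsm.load();
fsm.ready({ features: 96 });
console.log(fsm.requireReady().features); // 96
```

Because the factory takes no dependencies, tests need no fakes at all: construct, drive transitions, assert state.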
The pattern scales. Across the project, 96 features with 818 ACs were substantially co-authored with AI agents using this exact loop. The agent reads the contract, writes the test, the system validates completeness. The human reviews correctness.
Skills as Reusable Workflows
The agent does not invent the test convention — it reads it from `test/unit/CLAUDE.md`. This file teaches the exact anatomy of a test file:
- Import only `expect` from vitest, never `describe`/`it`
- `@FeatureTest(F)` on every class, `@Verifies<F>('ac')` on every method
- Helpers as plain functions outside the class, not class members
- Async methods with `async`, DOM tests with the `@vitest-environment jsdom` pragma
- No boilerplate registration loop — `@FeatureTest` auto-registers
Claude Code skills encode this as a repeatable operation. When the agent is asked to "add tests for Feature X," the skill loads the CLAUDE.md conventions, reads the Feature class, generates the test file, and runs the compliance scanner. The same workflow every time. No creative interpretation. No drift.
The conventions file is itself a requirement: if `test/unit/CLAUDE.md` says "no `describe`/`it`", the compliance scanner enforces it. A test file with bare `describe()` calls fails the build. The convention is not a suggestion — it is a gate.
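The enforcement can be pictured with a simplified check. The real scanner presumably works on the TypeScript AST; this regex version only illustrates the gate:

```typescript
// Simplified illustration of the convention gate. The real scanner likely
// inspects the AST; the rule it enforces is the same: no describe/it in tests.
function conventionViolations(testSource: string): string[] {
  const violations: string[] = [];
  if (/\bdescribe\s*\(/.test(testSource)) violations.push('uses describe()');
  if (/\bit\s*\(/.test(testSource)) violations.push('uses it()');
  return violations;
}

console.log(conventionViolations(`describe('trace', () => { it('works', () => {}) })`)); // flags both
console.log(conventionViolations(`@FeatureTest(TraceFeature)\nclass TraceTests {}`));    // clean
```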
The Complete Traceability Chain
With all four parts in place, the complete chain has no manual links:
| Link | Mechanism | Verified by |
|---|---|---|
| Requirement exists | Abstract class in `requirements/features/` | TypeScript compiler (class must export) |
| AC exists | Abstract method on Feature | TypeScript compiler (method must be declared) |
| AC referenced correctly | `@Verifies<F>('acName')` with `keyof T` | TypeScript compiler (compile error on typo) |
| Test exists for AC | `@Verifies` decorator on test method | Compliance scanner (reports uncovered ACs) |
| Test calls source code | `import { fn } from 'src/lib/...'` | AST scanner (resolves imports to source files) |
| Source code is exercised | vitest v8 line instrumentation | Coverage thresholds (98% statements gate) |
| No sync IO regression | AST scan for `*Sync` calls | Sync-usage scanner (zero violations gate) |
| Overall quality | All of the above combined | Quality gate: PASS / FAIL |
Each link is checked by a different mechanism, and each mechanism is itself a tracked feature with its own ACs and tests. The system verifies itself — not as a philosophical curiosity, but as an engineering guarantee.
What AI Cannot Verify
The compliance scanner checks completeness, not correctness. An agent can write a `@Verifies` method that calls the right function but asserts nothing useful:
```typescript
@Verifies<TraceFeature>('impactAnalysis')
'does something'() {
  const result = queryImpact(buildFixtureIndex(), []);
  expect(result).toBeDefined(); // trivially true — useless assertion
}
```

This passes the scanner. The AC is "covered." The binding exists. The coverage is 100%. But the test verifies nothing meaningful — it confirms that calling the function does not throw, which is a low bar.
The scanner cannot judge intent. It cannot know that `expect(result).toBeDefined()` is a weak assertion. It cannot know that the test should have checked `result.impactedFeatures.length` instead. That is a human judgement call.
The system's guarantee is: every AC has at least one test that imports and calls the relevant source code. The guarantee is not: every test is a good test. Completeness is automated. Correctness is human.
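The gap between the two guarantees is easy to demonstrate. In this sketch, a stand-in `queryImpact` with an assumed return shape is deliberately broken, and only the strong assertion notices:

```typescript
// Assumed return shape for queryImpact; the stand-in below has a deliberate
// bug (it never finds anything) to show what each assertion style catches.
interface ImpactResult {
  impactedFeatures: { id: string; priority: string }[];
}

function brokenQueryImpact(_changed: string[]): ImpactResult {
  return { impactedFeatures: [] }; // bug: ignores its input entirely
}

const result = brokenQueryImpact(['src/lib/foo-state.ts']);

// Weak assertion: still passes against the broken implementation.
const weakPasses = result !== undefined;

// Strong assertion: pins the behaviour the AC describes, so it catches the bug.
const strongPasses =
  result.impactedFeatures.length === 1 && result.impactedFeatures[0]?.id === 'FOO';

console.log({ weakPasses, strongPasses });
```

Both tests would count as "covered" in the scanner's report; only the second one protects the AC.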
Mutation testing (Stryker) would add the next layer — verifying that assertions actually detect regressions. That is a future step. For now, the closed loop provides the scaffold: if every AC is tested and every test calls the right code, the probability that a meaningful regression goes undetected is low. Not zero. Low.
What This Makes Possible
The closed loop changes the economics of feature development. Adding a new feature to the project means:
- Write the abstract class in `requirements/features/` — 15 lines, one per AC.
- Export it from `requirements/index.ts` — 1 line.
- Write or generate the tests — the AI agent handles this, constrained by `keyof T` and the compliance scanner.
- Run the scanner — the manifest updates automatically.
- Run the quality gate — PASS or fix.
No binding files to maintain. No `sourceFiles[]` to declare. No regex patterns to update. No manual traceability matrix. The test code is the single source of truth, and the rest is derived.
The Manifest as a Platform
The `BindingsManifest` started as an artefact — a JSON file that the compliance scanner reads to produce a report. It is now a platform that supports derived applications:
The compliance report — the original consumer. Reads the manifest, cross-references features and tests, produces the quality gate. This is where the manifest was born.
The trace engine (Part II) — the first derived application. Consumes the manifest to build a `TraceIndex` with seven `ReadonlyMap` instances, exposing eight query sub-commands. It does not produce bindings — it queries them.

The architecture X-ray (Part III) — the second derived application. Analyses the manifest as a bipartite graph to detect SRP violations, measure coupling, compute isolation scores, and surface encapsulation breaches. Same data, entirely different questions.
The AI agent loop (this part) — the third consumer. The agent reads the manifest to understand which ACs are covered and which are not, then produces tests to fill the gaps. The manifest is the feedback signal that closes the agent's loop.
The pattern is: produce the manifest once (via AST inference), consume it many times (for compliance, tracing, architecture analysis, agent guidance). Each consumer answers a different question from the same data. The manifest is cheap to produce — a single `parseTestFile` pass over the test suite — and the applications it enables are limited only by the questions you think to ask.
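The produce-once, consume-many shape can be sketched with a simplified manifest. The entry fields below are assumptions for illustration, not the real `BindingsManifest` schema:

```typescript
// Simplified manifest entry; the real schema is richer, this is illustrative.
interface Binding {
  feature: string;
  ac: string;
  testFile: string;
  sourceFiles: string[];
}

const manifest: Binding[] = [
  { feature: 'TRACE', ac: 'impactAnalysis', testFile: 'test/unit/trace.test.ts', sourceFiles: ['scripts/cli/commands/lib/trace-core.ts'] },
  { feature: 'TRACE', ac: 'fsmLifecycleGuards', testFile: 'test/unit/trace-fsm.test.ts', sourceFiles: ['src/lib/trace-index-state.ts'] },
];

// Consumer 1 (compliance): which ACs are covered?
const covered = new Set(manifest.map((b) => `${b.feature}.${b.ac}`));

// Consumer 2 (trace): which tests rerun when a file changes?
function testsForFile(file: string): string[] {
  return manifest.filter((b) => b.sourceFiles.includes(file)).map((b) => b.testFile);
}

console.log(covered.size);
console.log(testsForFile('src/lib/trace-index-state.ts'));
```

Same data, two questions; the architecture X-ray and the agent loop are just more functions over the same array.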
The 96-feature, 818-AC, 2,642-test system that produces this blog post was built this way. Not all at once — incrementally, feature by feature, each one traced from requirement to code to test to proof. The loop is closed. The system verifies itself. The agent writes the tests. The human reviews the meaning.
That is the architecture. These are the numbers. The code is the proof.