
The AI Agent Experience

What does it feel like to be an AI agent working with each approach? This isn't an abstract question. As AI agents become central to development workflows — from Copilot's inline suggestions to Claude Code's autonomous implementation to custom agent pipelines — the interface between the agent and the specification system determines the quality of output.


The Spec-Driven Workflow

1. Orchestrator receives task: "Implement password reset feature"
2. Context Engineering selects relevant documents:
   - PRD → password_reset section
   - Coding Practices → C# language rules
   - Testing → unit + integration strategies
   - Documentation → API documentation rules
3. Context assembled into prompt:
   "You are implementing the password_reset feature.
    Here are the acceptance criteria: [...]
    Here are the coding practices: [...]
    Here are the testing requirements: [...]
    Here is the existing code: [...]"
4. AI generates code
5. Quality gate validates output
6. If gate fails → add more context → regenerate
7. If gate passes → done

What the AI Sees

The AI receives a prompt containing:

FEATURE: password_reset
ACCEPTANCE CRITERIA:
  1. User can request a password reset email
  2. Reset link expires after 24 hours
  3. New password must meet complexity requirements

CODING PRACTICES:
  - Follow SOLID principles
  - Use dependency injection
  - Apply Result pattern for error handling
  - Method names: PascalCase
  - Max method length: 30 lines

TESTING REQUIREMENTS:
  - Write unit tests with NUnit/xUnit
  - Follow Arrange-Act-Assert pattern
  - Naming: MethodName_Scenario_ExpectedResult
  - Coverage target: 80% line, 75% branch

EXISTING CODE:
  [Contents of UserService.cs, AuthController.cs, ...]

What the AI Produces

The AI generates implementation code, tests, and possibly documentation — all based on its interpretation of the natural language specifications. The output quality depends on:

  1. How well the AI understands the ACs. "User can request a password reset email" is vague. Does "request" mean an API call? A UI button? A CLI command? The AI guesses based on context.

  2. How well the context was assembled. If the existing AuthController.cs wasn't included, the AI might create a new controller instead of extending the existing one.

  3. How consistent the AI is. The same prompt can produce different outputs on different runs. The spec-driven approach relies on the quality gate to catch inconsistencies, but the quality gate measures symptoms (coverage, lint), not semantics (does the code match the AC?).

Failure Modes

Failure 1: AC misinterpretation

The AI interprets "reset link expires after 24 hours" as "token expires 24 hours after creation." But the product owner meant "24 hours after the user's last login." The AI's interpretation is a valid reading of the English sentence. The quality gate passes because the implementation is syntactically correct, well-tested, and meets coverage thresholds. The semantic error survives until manual QA testing.

Failure 2: Context overload

The assembled context includes the full Testing-as-Code specification — 15+ strategies, metrics, thresholds. The AI tries to implement all of them: unit tests, integration tests, E2E tests, property-based tests, mutation test configuration. The feature needed 5 tests; the AI generates 47. Most are correct. Some are redundant. A few are wrong. The quality gate passes because coverage is 95%. The developer spends two hours cleaning up the excess.

Failure 3: Context underload

The progressive disclosure strategy starts with minimal context. The AI generates a password reset implementation that doesn't use the existing Result<T> pattern because the coding practices document wasn't included in Round 1. Round 2 adds the coding practices, and the AI rewrites using Result<T>. But it also changes the method signatures, breaking the tests from Round 1. The iterative cycle becomes expensive.

Failure 4: Invisible drift

The PRD was last updated three weeks ago. Since then, the team added a fourth AC ("reset link can only be used once") directly in the code, bypassing the PRD. The AI reads the outdated PRD and generates code for only three ACs. The quality gate passes. The fourth AC remains unimplemented until someone notices.


The Typed Specification Workflow

1. Developer adds new AC to feature record
2. Compiler fires REQ101: "no spec for this AC"
3. AI agent sees compiler diagnostic
4. AI generates specification interface method
5. Compiler fires CS0535: "class doesn't implement method"
6. AI generates implementation
7. Compiler fires REQ301: "no test for this AC"
8. AI generates test
9. Build succeeds — all diagnostics clear

What the AI Sees

The AI sees the type system, not a document. Its context is:

// The requirement (Feature definition)
public abstract record PasswordResetFeature : Feature<UserManagementEpic>
{
    public abstract AcceptanceCriterionResult
        UserCanRequestPasswordResetEmail(Email userEmail);

    public abstract AcceptanceCriterionResult
        ResetLinkExpiresAfter24Hours(TokenId resetToken, DateTime requestedAt);

    public abstract AcceptanceCriterionResult
        NewPasswordMeetsComplexityRequirements(Password newPassword);

    public abstract AcceptanceCriterionResult
        ResetLinkCanOnlyBeUsedOnce(TokenId resetToken);  // ← NEW AC
}

// The compiler diagnostic
error REQ101: PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce has no matching
              spec method with [ForRequirement(typeof(PasswordResetFeature),
              nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
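The [ForRequirement] and [Verifies] markers that appear throughout these examples are ordinary attributes the REQ analyzers read. A minimal sketch of one plausible shape, an assumption rather than the framework's actual code:

```csharp
using System;

// Assumed marker attributes that the REQ analyzers inspect.
// The real framework's definitions may differ.
[AttributeUsage(AttributeTargets.Method)]
public sealed class ForRequirementAttribute : Attribute
{
    public Type Feature { get; }
    public string AcceptanceCriterion { get; }

    public ForRequirementAttribute(Type feature, string acceptanceCriterion)
    {
        Feature = feature;
        AcceptanceCriterion = acceptanceCriterion;
    }
}

[AttributeUsage(AttributeTargets.Method)]
public sealed class VerifiesAttribute : Attribute
{
    public Type Feature { get; }
    public string AcceptanceCriterion { get; }

    public VerifiesAttribute(Type feature, string acceptanceCriterion)
    {
        Feature = feature;
        AcceptanceCriterion = acceptanceCriterion;
    }
}
```

Because the feature type and AC name are constructor arguments, an analyzer can recover the exact requirement a spec method or test claims to cover just by reading the attribute data.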

What the AI Produces

The AI generates code that satisfies the compiler:

// Step 1: Specification method
[ForRequirement(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
Result ValidateTokenNotUsed(TokenId token);

// Step 2: Implementation
public Result ValidateTokenNotUsed(TokenId token)
{
    var stored = _tokenStore.Find(token);
    if (stored is null)
        return Result.Failure("Token not found");
    if (stored.IsUsed)
        return Result.Failure("Token already used");
    return Result.Success();
}

// Step 3: Test
[Verifies(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
public void Used_token_is_rejected()
{
    var token = CreateValidToken();
    _service.ResetPassword(token.Id, ValidPassword());

    var result = _service.ValidateTokenNotUsed(token.Id);

    Assert.That(result.IsSuccess, Is.False);
    Assert.That(result.Error, Does.Contain("already used"));
}
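The three steps above lean on a Result type. A minimal sketch of the shape they assume, for self-containment only, not the framework's actual definition:

```csharp
// Minimal Result sketch — an assumed shape, not the framework's actual type.
public sealed class Result
{
    public bool IsSuccess { get; }
    public string? Error { get; }

    private Result(bool isSuccess, string? error)
    {
        IsSuccess = isSuccess;
        Error = error;
    }

    public static Result Success() => new(true, null);
    public static Result Failure(string error) => new(false, error);
}
```

With this shape, Result.Failure("Token already used").Error carries the message the test's Does.Contain assertion checks.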

Why This Works Better

  1. No ambiguity. The AC is ResetLinkCanOnlyBeUsedOnce(TokenId resetToken) — not "reset link can only be used once." The AI knows the input is a TokenId, not a string, not a URL, not a user ID. The type signature eliminates misinterpretation.

  2. Compiler-guided flow. The AI doesn't guess what to do next. The compiler tells it: "create a spec method" → "implement the method" → "write a test." Each step is driven by a specific diagnostic.

  3. No context assembly. The AI doesn't need an orchestration layer to select documents. The type system IS the context. The compiler diagnostics ARE the task list.

  4. No drift. The AI reads the current type definitions, not a document that might be outdated. The types reflect the current state of the system because they ARE the system.

  5. Verifiable output. The AI's output must compile. If it generates incorrect code, the compiler says so immediately. There's no "generate code → run quality gate → hope it's correct" cycle.
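The "type signature eliminates misinterpretation" point assumes strongly-typed IDs rather than raw strings. A minimal sketch of such wrappers (hypothetical shapes, not the framework's actual types):

```csharp
using System;

// Hypothetical strongly-typed ID wrappers. The point: a TokenId cannot be
// passed where a UserId (or a raw string) is expected; the compiler rejects
// the mix-up at the call site.
public readonly record struct TokenId(Guid Value)
{
    public static TokenId New() => new(Guid.NewGuid());
}

public readonly record struct UserId(Guid Value);
```

With these in place, calling an AC like ResetLinkCanOnlyBeUsedOnce with a UserId is a compile error, which is exactly the misinterpretation the first point rules out.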

Failure Modes

Failure 1: Semantic correctness

The AI generates a test that compiles and passes but doesn't test the right thing:

[Verifies(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
public void Token_is_used_once()
{
    Assert.Pass(); // This compiles. This "passes." This tests nothing.
}

The Roslyn analyzer is satisfied — a [Verifies] test exists for this AC. But the test is meaningless. The type system enforces structure (test exists, test references correct AC) but not semantics (test actually verifies the AC's behavior).

This is a genuine weakness — but not an unsolvable one. The solution is a three-layer defense that progressively closes the semantic gap.

Layer 1: Executable ACs — The AC Is Not Just a Name

The key insight from Requirements as Code Part IX: acceptance criteria are not just abstract method signatures. They are executable static methods on the feature record that encode the business rule:

public abstract partial record PasswordResetFeature : Feature<UserManagementEpic>
{
    // The AC is EXECUTABLE — it validates the business rule
    public static AcceptanceCriterionResult ResetLinkCanOnlyBeUsedOnce(
        TokenId resetToken,
        ITokenStore tokenStore)
    {
        var token = tokenStore.Find(resetToken);
        if (token is null)
            return AcceptanceCriterionResult.Failed("Token not found");
        if (token.IsUsed)
            return AcceptanceCriterionResult.Failed("Token already used");
        return AcceptanceCriterionResult.Satisfied();
    }

    // ... other ACs as executable methods
}

This same method runs in two places:

// In production — enforcing the business rule
public Result ResetPassword(TokenId token, Password newPassword)
{
    var acResult = PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce(token, _tokenStore);
    if (!acResult.IsSatisfied)
        return Result.Failure(acResult.FailureReason!);

    // ... proceed with password reset
}

// In tests — verifying the business rule
[Verifies(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
public void Used_token_is_rejected()
{
    var token = CreateAndUseToken();

    var result = PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce(
        token.Id, _tokenStore);

    Assert.That(result.IsSatisfied, Is.False);
    Assert.That(result.FailureReason, Does.Contain("already used"));
}

The test calls the AC method directly. It doesn't test "something related to the AC" — it tests the AC itself. One definition, two uses: production enforcement and test verification. If a test doesn't call the AC method, it isn't verifying the AC.
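The Layer 1 code relies on an AcceptanceCriterionResult type. A minimal sketch of one possible shape (an assumption, not the framework's actual definition):

```csharp
// Minimal AcceptanceCriterionResult sketch — an assumed shape that makes
// the Layer 1 examples self-contained, not the framework's actual type.
public sealed class AcceptanceCriterionResult
{
    public bool IsSatisfied { get; }
    public string? FailureReason { get; }

    private AcceptanceCriterionResult(bool isSatisfied, string? failureReason)
    {
        IsSatisfied = isSatisfied;
        FailureReason = failureReason;
    }

    public static AcceptanceCriterionResult Satisfied() => new(true, null);

    public static AcceptanceCriterionResult Failed(string reason) => new(false, reason);
}
```

The production path branches on IsSatisfied; the test asserts on IsSatisfied and FailureReason. One result type serves both uses.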

Layer 2: REQ305 Analyzer — The Compiler Checks the Test Body

A new Roslyn analyzer — REQ305 — inspects the body of every [Verifies] test and checks whether it actually invokes the referenced AC method:

// REQ305: Verifies test must invoke referenced AC method
//
// The analyzer uses the Roslyn semantic model to:
// 1. Read the [Verifies] attribute → extract the AC method name
// 2. Scan the test method body for InvocationExpressions
// 3. Check if any invocation resolves to the AC method
// 4. If not → emit diagnostic

Build output when a test lies:

warning REQ305: Test 'Token_is_used_once' has
                [Verifies(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce)]
                but never invokes PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce().
                The test may not actually verify this acceptance criterion.
                → Call the AC method in the test body to verify the business rule.

The Assert.Pass() test from above triggers REQ305 instantly — it has no invocation of the AC method. The lying test is caught at compile time.

What about indirect invocation? The test might call a service method that internally calls the AC. REQ305 can support this with a configurable depth:

# .editorconfig — configure invocation depth
dotnet_diagnostic.REQ305.invocation_depth = 2   # Check up to 2 levels deep

At depth 2, calling _service.ResetPassword(token, password) — which internally calls PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce(...) — satisfies REQ305. The analyzer traces the call graph through the semantic model.

Layer 3: Mutation Testing — The Test Must Kill Mutants

Even with REQ305, a test could call the AC method and then not assert on the result:

[Verifies(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
public void Token_invoked_but_not_checked()
{
    var token = CreateAndUseToken();

    // Calls the AC — passes REQ305 ✓
    var result = PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce(
        token.Id, _tokenStore);

    // But never asserts on result — the test always passes
    // Mutation testing catches this: mutants survive
}

The [MutationTarget] DSL (from Part V) closes this final gap:

[MutationTarget(typeof(PasswordResetFeature),
    nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce),
    MinimumMutationScore = 0.8)]

Stryker mutates the AC method body (flips token.IsUsed to !token.IsUsed, removes the null check, etc.). If the [Verifies] test doesn't kill at least 80% of these mutants, the build fails. A test that calls the AC but doesn't assert on the result kills zero mutants — caught.
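A toy illustration of the mutant-killing mechanics, using hypothetical stand-in names rather than the real AC. Stryker-style mutants negate conditions; only a test that asserts on the result can tell the original from the mutant:

```csharp
public static class TokenRules
{
    // Original rule: a used token is rejected.
    public static bool UsedTokenIsRejected(bool isUsed) => isUsed;

    // What a Stryker-style negation mutant would produce.
    public static bool UsedTokenIsRejected_Mutant(bool isUsed) => !isUsed;
}
```

A test that calls UsedTokenIsRejected(true) and never asserts passes against both versions, so the mutant survives. Asserting that the result is true fails against the mutant and kills it, which is what MinimumMutationScore measures.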

The Three Layers Combined

Layer 1: Executable ACs
  │  The AC is code, not just a name.
  │  Tests call the AC directly.
  │  Same method runs in prod and tests.
  │
  ▼
Layer 2: REQ305 Analyzer (compile-time)
  │  "Does the test invoke the AC method?"
  │  If not → compiler warning.
  │  Catches: Assert.Pass(), unrelated assertions.
  │
  ▼
Layer 3: Mutation Testing (post-test)
  │  "Does the test kill mutants in the AC method?"
  │  If not → build failure.
  │  Catches: invocation without assertion, weak assertions.
  │
  ▼
Result: Structure ✓ + Invocation ✓ + Semantics ✓

The AC is not just a name — it's an executable method. The test must call it. The analyzer verifies the call. Mutation testing verifies the assertion. Three layers, zero semantic gap.

This is the complete answer to the "lying test" problem. The spec-driven approach has no equivalent — a document-based AC ("Reset link can only be used once") cannot be invoked, cannot be analyzed for invocation, and cannot be mutation-tested. It's text. You can write a test that says "I test this AC" and nobody can verify the claim programmatically. The typed approach can — and does.

Failure 2: Over-engineering

The AI sees the full type chain and generates overly complex code — a specification interface, an abstract base class, a concrete implementation, a decorator, and a factory — when a simple method would suffice. The type system's ceremony can encourage over-engineering in AI agents that interpret "follow the pattern" too literally.

Failure 3: Bootstrap confusion

The AI encounters Feature<PlatformScalabilityEpic> for the first time and doesn't understand the typed specification conventions. It treats Feature<T> as a regular generic class, ignores the abstract AC methods, and generates code that doesn't interact with the requirements chain at all.

This is the onboarding problem: the AI needs to learn the typed specification system before it can work within it. A good CLAUDE.md or system prompt mitigates this, but it's an extra setup cost.
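What might such a CLAUDE.md contain? A hypothetical excerpt, assuming the diagnostic IDs used throughout this post:

```markdown
# Typed specification conventions (hypothetical CLAUDE.md excerpt)

- Requirements live in the type system: `Feature<TEpic>` records declare
  acceptance criteria as methods. Do not look for a PRD document.
- Every AC needs a spec method tagged `[ForRequirement(...)]`, an
  implementation, and a test tagged `[Verifies(...)]` that invokes the AC.
- Treat REQ1xx/REQ3xx compiler diagnostics as the task list; keep working
  until the build is clean.
```

A few dozen lines like these are the entire bootstrap cost, paid once per agent rather than once per task.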


Spec-Driven Feedback Loop

Duration: minutes to hours (depending on CI pipeline)

Write prompt ──→ AI generates code ──→ Run tests ──→ Check coverage
                                                          │
                                                    Pass? │ Fail?
                                                    │     │
                                                    Done  Update prompt
                                                          │
                                                          └──→ AI regenerates
                                                               │
                                                               └──→ Repeat

Loop characteristics:

  • Feedback granularity: coarse (pass/fail on the whole quality gate)
  • Feedback speed: slow (must run tests, coverage tools, quality gates)
  • Feedback specificity: low ("coverage < 80%", not "AC X is untested")
  • Number of iterations: unpredictable (depends on AI output quality)

Typed Specification Feedback Loop

Duration: seconds (compile time)

Change type ──→ REQ101 ──→ Fix ──→ CS0535 ──→ Fix ──→ REQ301 ──→ Fix ──→ Done
               "no spec"        "not implemented"     "no test"

Loop characteristics:

  • Feedback granularity: fine (specific diagnostic per AC)
  • Feedback speed: fast (compile time, IDE real-time)
  • Feedback specificity: high ("PasswordResetFeature.UsedOnce has no test")
  • Number of iterations: predictable (always three: spec → impl → test)

The Agent Autonomy Question

A crucial question for AI-assisted development: how autonomous can the agent be?

Spec-Driven Autonomy

The spec-driven agent has high autonomy, loose safety rails. It reads the specifications, generates code freely, and the quality gate checks afterward. Between "generate" and "check," the agent can produce anything — correct or incorrect, elegant or messy, complete or partial.

The quality gate is the safety rail. But it's a coarse rail: it checks proxies (coverage, lint, test pass rate), not semantics (does the code match the AC?). The agent can produce code that passes all gates but misimplements a requirement.

This makes spec-driven well-suited for: experienced AI agents with good track records, well-calibrated context assembly, projects with strong QA processes, and situations where the cost of a bad generation is low (easy to regenerate).

Typed Specification Autonomy

The typed agent has constrained autonomy, tight safety rails. The type system limits what the agent can produce. The compiler diagnostic tells the agent exactly what to do next. The agent can't deviate because the compiler won't let it.

The compiler is the safety rail. It's a precise rail: it checks structure (does the spec method match the AC? does the class implement the interface? does the test reference the AC?). The agent must produce code that satisfies the type system — which is a higher bar than "passes the quality gate."

This makes typed specifications well-suited for: less experienced AI agents, situations where correctness is critical, projects with compliance requirements, and teams where the cost of a bad generation is high (hard to audit and fix).


Real-World Example: Claude Code

Claude Code (the tool generating this blog post) works with both approaches in practice. Here's what the experience looks like:

Claude Code with Spec-Driven Files

Claude Code reads the specification files and uses them as context. But spec files are large (5,000+ lines each), and context windows are finite. The agent must:

  1. Decide which sections are relevant (a mini "context engineering" problem within the agent itself)
  2. Parse the structured text format
  3. Map specification text to code actions
  4. Hope that its interpretation matches the specification author's intent

The experience works but is fragile. A slightly different task framing can lead Claude Code to select different spec sections, producing different output.

Claude Code with Typed Specifications

Claude Code reads the C# types and compiler diagnostics. The experience is:

  1. Open the feature record → see all ACs with typed parameters
  2. Read compiler diagnostics → know exactly what's missing
  3. Generate spec method → compiler validates immediately
  4. Generate implementation → compiler validates immediately
  5. Generate test → compiler validates immediately

Each step has instant, specific feedback. Claude Code doesn't need to "interpret" anything — the types are unambiguous, and the compiler is the arbiter.

The experience is significantly more reliable because the feedback loop is tighter, the diagnostics are specific, and there's no interpretation gap between "what the spec says" and "what the code should do."


The Future: AI Agents as First-Class Users

Both approaches position AI agents as users of the specification system. But they position them differently:

Spec-driven positions AI agents as document readers. The agent reads specifications, interprets them, and generates code. The specification is a communication medium between humans (who write specs) and AI (who reads specs). The bottleneck is interpretation accuracy.

Typed specifications position AI agents as type system participants. The agent writes code within the type system, receives compiler feedback, and iterates. The specification is not a communication medium — it's a constraint system. The bottleneck is not interpretation but compliance: can the agent produce code that satisfies the compiler?

As AI models improve, interpretation accuracy improves. But interpretation is inherently probabilistic — even a perfect model can misinterpret ambiguous natural language. Compiler compliance is binary: the code compiles or it doesn't. The typed approach bets that binary feedback is more reliable than probabilistic interpretation. That bet gets more compelling as AI agents become more capable — because more capable agents can satisfy more complex type constraints.


The Text-to-Code Translation Tax

There's a cost hidden in the spec-driven approach that deserves explicit attention: the translation tax.

When an AI agent reads a spec-driven document, it performs a translation:

Text specification → [AI interpretation] → Code implementation

This translation is lossy. The text says "User can request a password reset email". The AI must translate this into:

  • A method signature (what parameters? what return type?)
  • An implementation (what's the algorithm? what are the edge cases?)
  • Error handling (what if the user doesn't exist? what if the email service is down?)
  • Tests (what scenarios to cover? what assertions to make?)

Every one of these decisions is a translation from natural language to code. Every translation introduces ambiguity. Every ambiguity is a potential defect.

The spec-driven framework's response is quality gates: generate code, then validate it. But quality gates operate only on the code's observable behavior (tests pass, coverage met, lint clean) — not on translation accuracy. A method that does the wrong thing with correct syntax passes every gate.

The typed approach eliminates the translation tax for the structural dimension:

Type specification → [no translation needed] → Code implementation that satisfies types

The method signature is already defined: ResetLinkExpiresAfter24Hours(TokenId resetToken, DateTime requestedAt). The AI doesn't translate a sentence into a signature — it implements a signature that already exists. The parameters, return type, and name are given. The AI's job is to fill in the body, not to invent the shape.

This is why typed specifications disproportionately benefit AI agents. The translation tax is highest for the structural decisions (signatures, types, relationships) and lowest for the behavioral decisions (algorithm, edge cases, error handling). Types eliminate the high-tax translations and leave the low-tax ones to the AI.
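As a concrete illustration of "fill in the body", here is a hedged sketch of what an agent might produce for the expiry rule. The method name, the boolean return, and the extra clock parameter (added so the rule is testable) are assumptions for this sketch, not the framework's code:

```csharp
using System;

// Hypothetical typed ID, as assumed elsewhere in this post.
public readonly record struct TokenId(Guid Value);

public static class PasswordResetRules
{
    // The shape is dictated by the spec; only this body is the AI's job.
    // Assumed rule: the link expires once 24 hours have elapsed since the
    // reset was requested. 'now' is injected so tests control the clock.
    public static bool ResetLinkIsExpired(
        TokenId resetToken, DateTime requestedAt, DateTime now) =>
        now - requestedAt >= TimeSpan.FromHours(24);
}
```

The parameters already told the agent everything structural: which token, anchored to which timestamp. The only decision left is the comparison itself.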


The Inertness Tax on AI Agents

The spec-driven framework's text files are inert (as discussed in Part II). For AI agents, this inertness has a specific consequence: the AI must be the runtime for the specification.

A text specification like "All commands must have a corresponding validator" is a rule. But it's a rule written in English, stored in a .txt file. Who enforces it? In the spec-driven approach, the AI is expected to both understand the rule and follow it. The specification is dead text until the AI reads it and decides to comply.

This means the AI agent serves two roles simultaneously:

  1. Interpreter: understand what the specification means
  2. Enforcer: ensure the generated code follows the specification

These are conflicting roles. An interpreter tries to extract meaning. An enforcer checks compliance. Asking the same agent to do both is like asking a student to both take the exam and grade it.

The typed approach separates these roles:

  • The developer (or AI) writes code (the interpreter role)
  • The compiler checks compliance (the enforcer role)

The compiler is a better enforcer than the AI because:

  • It's deterministic (same input → same output, every time)
  • It's exhaustive (checks every type constraint, not a sampled subset)
  • It's instant (feedback in seconds, not after a pipeline run)
  • It can't be persuaded, tired, or distracted

When an AI agent works within a typed specification system, it only needs to be a good interpreter — the compiler handles enforcement. When an AI agent works within a spec-driven system, it must be both a good interpreter AND a good enforcer. That's a harder job, and it fails more often.


Summary

Dimension             | AI with Spec-Driven                       | AI with Typed Specifications
----------------------|-------------------------------------------|---------------------------------------------
Input                 | Natural language specifications           | Type definitions + compiler diagnostics
Output quality        | Probabilistic (depends on interpretation) | Structural (must compile)
Feedback loop         | Minutes to hours (quality gate)           | Seconds (compiler)
Feedback specificity  | Coarse (pass/fail)                        | Fine (per-AC diagnostic)
Drift risk            | High (specs can be outdated)              | Zero (types are current)
Ambiguity             | Possible (natural language)               | Impossible (typed signatures)
Semantic verification | Weak (quality gate checks proxies)        | Weak from the compiler alone; REQ305 + mutation testing close the gap
Agent autonomy        | High (few constraints during generation)  | Constrained (compiler limits output)
Bootstrap             | Read the spec docs                        | Learn the type system conventions
Best for              | Broad, exploratory tasks                  | Specific implementation tasks

Part IX examines the long game: what happens to each approach over months and years?


The Prompt Engineering Paradox

In the spec-driven world, you engineer prompts. In the typed specification world, you engineer types. Both are engineering. But one targets a probabilistic interpreter (the AI model), and the other targets a deterministic one (the compiler). The consequences are profound.

A Concrete Comparison: Implementing Order Cancellation

Let's watch two Claude Code sessions implement the same feature. The product owner wants: "Customer can cancel an order before shipment, refund is processed, confirmation email is sent."

Session A: Spec-Driven (Prompt-Engineered)

────────────────────────────────────────────────────────
Claude Code Session — Spec-Driven
────────────────────────────────────────────────────────

[Context assembled by orchestrator — 4,200 tokens]

SYSTEM PROMPT:
You are implementing the order_cancellation feature.

PRD CONTEXT:
  DEFINE_FEATURE(order_cancellation)
    description: "Allow customers to cancel orders before shipment"
    acceptance_criteria:
      - "Customer can cancel an order that has not yet shipped"
      - "Cancellation triggers a full refund to original payment method"
      - "Confirmation email sent after successful cancellation"
    priority: High
    complexity: Medium

CODING PRACTICES:
  - Language: C#
  - Patterns: SOLID, DI, Result pattern
  - Error handling: Result<T>, never throw
  - Naming: PascalCase methods, camelCase params
  - Max method length: 30 lines

TESTING PRACTICES:
  - Framework: NUnit
  - Pattern: Arrange-Act-Assert
  - Naming: MethodName_Scenario_ExpectedResult
  - Coverage target: 80%

EXISTING CODE:
  [OrderService.cs — 120 lines]
  [PaymentService.cs — 85 lines]
  [EmailService.cs — 60 lines]
  [Order.cs — 45 lines]

────────────────────────────────────────────────────────

> Claude generates OrderCancellationService.cs
> Claude generates OrderCancellationTests.cs
> Claude generates IOrderCancellationService.cs

[Quality gate runs — 45 seconds]

GATE RESULTS:
  ✓ Compilation: passed
  ✓ Tests: 6/6 passed
  ✓ Coverage: 84% (target: 80%)
  ✓ Lint: no violations
  ✗ Review note: "refund" AC — no explicit test for
    refund amount matching order total

[Prompt enriched with gate feedback — Round 2]

> Claude adds RefundAmountMatchesOrderTotal test
> Claude adjusts RefundService mock setup

[Quality gate runs — 45 seconds]

GATE RESULTS:
  ✓ All gates passed

Total time: ~4 minutes
Total prompt tokens consumed: ~8,400
Total rounds: 2
────────────────────────────────────────────────────────

Notice what happened. The AI consumed 4,200 tokens of context (PRD, practices, existing code) just to understand what to build. That's 4,200 tokens NOT available for reasoning about the implementation. The prompt engineering targeted the AI's interpretation layer — "please understand what I mean by these English sentences."

The quality gate flagged the gap in round 1, but as a free-form review note ("no explicit test for refund amount matching order total"), not as a machine-checkable link to the refund AC. That's a proxy signal rather than a requirement signal: the AI still had to map the note back to the right AC on its own.

Session B: Typed Specifications (Type-Engineered)

────────────────────────────────────────────────────────
Claude Code Session — Typed Specifications
────────────────────────────────────────────────────────

[Context: CLAUDE.md (48 lines) + compiler diagnostics]

> Developer saved OrderCancellationFeature.cs with 3 ACs

COMPILER OUTPUT (instant):
  error REQ101: OrderCancellationFeature.CustomerCanCancelUnshippedOrder
                has no matching spec method.
  error REQ101: OrderCancellationFeature.CancellationTriggersFullRefund
                has no matching spec method.
  error REQ101: OrderCancellationFeature.ConfirmationEmailSentAfterCancellation
                has no matching spec method.

> Claude reads diagnostics
> Claude generates IOrderCancellationSpec.cs (3 methods)

COMPILER OUTPUT (instant):
  error CS0535: OrderCancellationService does not implement
               IOrderCancellationSpec.CancelOrder(OrderId, CustomerId)
  error CS0535: OrderCancellationService does not implement
               IOrderCancellationSpec.InitiateRefund(OrderId, PaymentMethodId)
  error CS0535: OrderCancellationService does not implement
               IOrderCancellationSpec.SendCancellationConfirmation(OrderId, Email)

> Claude reads diagnostics
> Claude implements all 3 methods in OrderCancellationService.cs

COMPILER OUTPUT (instant):
  warning REQ301: OrderCancellationFeature.CustomerCanCancelUnshippedOrder
                  has no test.
  warning REQ301: OrderCancellationFeature.CancellationTriggersFullRefund
                  has no test.
  warning REQ301: OrderCancellationFeature.ConfirmationEmailSentAfterCancellation
                  has no test.

> Claude reads diagnostics
> Claude generates OrderCancellationTests.cs
>   (3 [Verifies] test methods, one per AC)

COMPILER OUTPUT (instant):
  Build succeeded. 0 errors. 0 warnings.

Total time: ~2 minutes
Total prompt tokens consumed: ~1,800
Total rounds: 4 (but each round is seconds, not minutes)
────────────────────────────────────────────────────────

The contrast is stark. The typed session consumed 1,800 tokens, less than a quarter of the spec-driven session's 8,400. The AI didn't read a PRD, coding practices, or testing specifications. It read the feature record (30 lines), the compiler diagnostics (5 lines per round), and the CLAUDE.md (48 lines). The rest of its context window was available for reasoning about the actual implementation.

More importantly: the AI never interpreted English. It read types. CancelOrder(OrderId orderId, CustomerId customerId) is unambiguous. The AI doesn't decide what parameters the method should take — the spec interface already defines them. The AI doesn't decide which ACs need tests — the compiler tells it.

The Paradox

Here is the paradox: in spec-driven, you engineer prompts to tell the AI what to do. In typed specifications, you engineer types, and the compiler tells the AI what to do.

Prompt engineering is human-to-AI communication. Type engineering is human-to-compiler communication, and then compiler-to-AI communication. The compiler is a better communicator than a prompt because:

  1. It's unambiguous. error CS0535: does not implement IOrderCancellationSpec.CancelOrder has one meaning. "Implement the order cancellation feature following SOLID principles" has many.

  2. It's incremental. The compiler gives one diagnostic at a time (or a batch, but each is specific). A prompt gives everything at once and hopes the AI absorbs it all.

  3. It's verifiable. The compiler confirms when the AI's output is correct (build succeeds). A quality gate confirms when proxies are met (coverage, lint), not when the output is correct.

  4. It doesn't consume context. Compiler diagnostics are 1-2 lines. A prompt is thousands of tokens. Every token of prompt is a token not available for reasoning.

The prompt engineering paradox: the more context you give the AI, the less room it has to think. Typed specifications resolve this by encoding context in the type system, which is read by the compiler, which produces tiny diagnostics, which the AI reads. The context is compressed from 4,200 tokens to 5 lines — without losing information.


Agentic Loops: Single-Shot vs Compiler-Guided

The previous section showed individual sessions. Now let's examine the loop structure — how the AI iterates to completion. This is where the operational difference becomes a structural difference.

The Spec-Driven Loop: Generate-Validate-Regenerate

┌──────────────────────────────────────────────────────────┐
│                 SPEC-DRIVEN AGENTIC LOOP                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────────┐    ┌──────────────┐    ┌────────────┐  │
│  │  Assemble   │───→│   Generate   │───→│  Quality   │  │
│  │  Context    │    │   Code       │    │  Gate      │  │
│  └─────────────┘    └──────────────┘    └─────┬──────┘  │
│        ↑                                      │         │
│        │                               Pass?  │  Fail?  │
│        │                               ┌──────┴──────┐  │
│        │                               │             │  │
│        │                              Done     Enrich   │
│        │                                      context   │
│        │                                        │       │
│        └────────────────────────────────────────┘       │
│                                                          │
│  TERMINATION: gate passes OR max iterations reached      │
│  ITERATIONS: unpredictable (1 to N)                      │
│  COST PER ITERATION: high (full pipeline run)            │
│  CONVERGENCE GUARANTEE: none                             │
│                                                          │
└──────────────────────────────────────────────────────────┘

Loop characteristics:

  • Unbounded. There is no structural guarantee the loop terminates at a specific iteration count. The AI might produce output that passes the quality gate on round 1 — or round 7. The gate checks proxies (coverage, lint), and the AI can chase proxy improvements without converging on requirement satisfaction.

  • Coarse feedback. Each iteration gets a pass/fail from the quality gate. If it fails, the feedback is "coverage is 72%" or "lint violation on line 47." The AI must infer what to change. This is like navigating with a compass that only tells you "you're not there yet" without showing which direction to go.

  • Expensive iterations. Each round runs the full pipeline: compile, test, coverage analysis, lint. On a real project, this is 30 seconds to 5 minutes per iteration. Five iterations at 2 minutes each is 10 minutes of pipeline time.

  • Context growth. Each iteration adds context — gate feedback, previous attempts, enriched specifications. By round 3, the prompt is significantly larger than round 1. The AI is reasoning with more context but less headroom.
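The loop above can be condensed into a schematic. Every name here (`Context`, `agent`, `gate`, `MaxIterations`) is a hypothetical stand-in for the pipeline pieces just described, not a real API:

```csharp
// Generate-validate-regenerate: termination is probabilistic, cost per
// round is a full pipeline run, and context grows on every failure.
var context = Context.Assemble("password_reset");    // PRD + practices + testing spec
for (var round = 1; round <= MaxIterations; round++)
{
    var code = agent.Generate(context);              // full regeneration each round
    var report = gate.Run(code);                     // compile + test + coverage + lint
    if (report.Passed)
        return code;                                 // gate passed: done
    context = context.Enrich(report.Feedback);       // context grows every round
}
throw new InvalidOperationException("Max iterations reached; agent stalled.");
```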

The Typed Specification Loop: Error-Fix-Error-Fix-Done

┌──────────────────────────────────────────────────────────┐
│             TYPED SPECIFICATION AGENTIC LOOP              │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐             │
│  │ REQ101:  │──→│ Create   │──→│ CS0535:  │             │
│  │ no spec  │   │ spec     │   │ not impl │             │
│  └──────────┘   └──────────┘   └────┬─────┘             │
│                                      │                   │
│                                      ▼                   │
│                                ┌──────────┐              │
│                                │ Implement│              │
│                                │ method   │              │
│                                └────┬─────┘              │
│                                     │                    │
│                                     ▼                    │
│                                ┌──────────┐              │
│                                │ REQ301:  │              │
│                                │ no test  │              │
│                                └────┬─────┘              │
│                                     │                    │
│                                     ▼                    │
│                                ┌──────────┐              │
│                                │ Write    │              │
│                                │ test     │              │
│                                └────┬─────┘              │
│                                     │                    │
│                                     ▼                    │
│                                ┌──────────┐              │
│                                │  Build   │              │
│                                │ succeeds │              │
│                                └──────────┘              │
│                                                          │
│  TERMINATION: build succeeds (zero diagnostics)          │
│  ITERATIONS: exactly 3 per AC (spec → impl → test)      │
│  COST PER ITERATION: low (incremental compile)           │
│  CONVERGENCE GUARANTEE: structural (finite diagnostics)  │
│                                                          │
└──────────────────────────────────────────────────────────┘

Loop characteristics:

  • Bounded. The loop has exactly three phases per acceptance criterion: create spec, implement, write test. The compiler tells the AI when it's done — zero diagnostics means the chain is complete. The iteration count is predictable: 3 * number_of_ACs.

  • Precise feedback. Each compiler diagnostic names the exact type, the exact method, and the exact action required. error REQ101: OrderCancellationFeature.CancellationTriggersFullRefund has no matching spec method is not vague. The AI doesn't infer — it reads the instruction.

  • Cheap iterations. Each phase is an incremental compile — typically 1-5 seconds. The AI doesn't wait for a pipeline. It writes code, saves, reads the compiler output, and continues. Four ACs with three phases each = 12 micro-iterations at 3 seconds each = 36 seconds total.

  • Constant context. The prompt doesn't grow. Each iteration replaces the previous compiler output with new diagnostics. The AI's context is: feature record (stable) + current diagnostics (changing) + CLAUDE.md (stable). The headroom stays constant.
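As a sketch, the whole typed loop is a dispatch on diagnostic IDs. The REQ101/REQ301 rules are the article's hypothetical analyzer and `compiler`/`agent` are stand-ins; CS0535 is a real C# error:

```csharp
// Error-fix-error-fix-done: termination is structural (zero diagnostics),
// each round is an incremental compile, and context never grows.
while (true)
{
    var diagnostics = compiler.BuildIncremental();       // seconds, not minutes
    if (diagnostics.Count == 0)
        break;                                           // zero diagnostics: chain complete

    foreach (var d in diagnostics)
    {
        switch (d.Id)
        {
            case "REQ101": agent.AddSpecMethod(d.Target);     break; // AC has no spec method
            case "CS0535": agent.ImplementMethod(d.Target);   break; // spec not implemented
            case "REQ301": agent.WriteVerifiesTest(d.Target); break; // AC has no test
        }
    }
    // Context each round: feature record (stable) + current diagnostics + CLAUDE.md.
}
```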

Why Bounded Loops Matter for Autonomous Agents

Autonomous AI agents — those running without human oversight — need predictable behavior. The operator must know:

  1. How long will this take? Bounded loops answer this: 3 * ACs * compile_time. Unbounded loops cannot answer this.

  2. How much will this cost? Bounded loops have predictable token consumption. Unbounded loops have unpredictable token consumption because context grows with each iteration.

  3. Will this terminate? Bounded loops terminate when diagnostics reach zero — a structurally guaranteed state. Unbounded loops terminate when a quality gate passes — a probabilistically achieved state.

  4. Can I trust the output? Bounded loops produce output that satisfies the compiler — a deterministic check. Unbounded loops produce output that satisfies quality gates — a heuristic check.

For a team running nightly autonomous agents that implement features from a backlog, predictability is everything. An agent that takes a predictable ~2 minutes per feature (bounded) can commit to clearing a 30-feature backlog in about an hour. An agent that takes anywhere from 2 to 20 minutes per feature (unbounded) might need one hour or ten for the same backlog; you don't know until the morning.
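The bounded estimate from point 1 is simple enough to write down. The method name and the 3-seconds-per-compile figure are assumptions taken from the article's own numbers:

```csharp
using System;

static class BoundedRunEstimate
{
    // 3 phases per AC (spec -> implement -> test), each one incremental compile.
    const int PhasesPerAc = 3;

    public static TimeSpan Estimate(int acCount, double secondsPerCompile = 3.0)
        => TimeSpan.FromSeconds(acCount * PhasesPerAc * secondsPerCompile);
}

// Estimate(4)   -> 36 seconds  (one feature, four ACs)
// Estimate(200) -> 30 minutes  (a 50-feature, 200-AC project; real runs add
//                               per-feature overhead, hence the ~37 min above)
```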

The Iteration Count Comparison

A concrete comparison for a feature with 4 acceptance criteria:

Metric                   Spec-Driven Loop               Typed Specification Loop
────────────────────────────────────────────────────────────────────────────────
Iterations to complete   1-7 (unpredictable)            12 (3 per AC, exactly)
Time per iteration       30 s - 5 min (pipeline)        1-5 s (incremental compile)
Total time (best case)   30 seconds                     36 seconds
Total time (worst case)  35 minutes                     60 seconds
Total time (typical)     4-8 minutes                    40-50 seconds
Tokens consumed          8,000-25,000 (growing)         1,500-2,500 (constant)
Termination guarantee    No (may hit max iterations)    Yes (finite diagnostics)
Feedback specificity     "Coverage 72%"                 "Feature.AC has no test"

The spec-driven best case is competitive. But the worst case is 35x slower. For autonomous agents that process hundreds of tasks, the worst case is what matters — because it's the bottleneck that determines throughput.

The Stalled Agent Problem

In the spec-driven loop, there's a failure mode with no equivalent in the typed loop: the stalled agent.

SPEC-DRIVEN STALL:

Round 1: Generate code → Gate: coverage 68% → Fail
Round 2: Add more tests → Gate: coverage 74% → Fail
Round 3: Add more tests → Gate: coverage 76% → Fail
Round 4: Add more tests → Gate: coverage 78% → Fail
Round 5: Add more tests → Gate: coverage 79% → Fail    ← asymptotic approach
Round 6: Refactor test → Gate: coverage 79% → Fail     ← stuck
Round 7: Max iterations reached → ABORT

The agent is chasing a coverage metric (80%) and each iteration produces diminishing returns. The gate keeps saying "not enough" but doesn't say "test THIS specific method" or "test the error path in THAT branch." The agent generates increasingly desperate tests that target random code paths.

In the typed loop, this can't happen. The compiler diagnostic says warning REQ301: Feature.AC has no test — the AI knows exactly which AC is untested. It writes one test. The diagnostic clears. There's no asymptotic approach to a threshold because the target isn't a percentage — it's a set of specific, named acceptance criteria.
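A sketch of the single test that clears such a diagnostic, assuming the article's hypothetical `[Verifies]` attribute, feature type, and domain types, with NUnit-style assertions:

```csharp
using System;
using NUnit.Framework;

public sealed class OrderCancellationTests
{
    [Test]
    [Verifies(typeof(OrderCancellationFeature),
              nameof(OrderCancellationFeature.CancellationTriggersFullRefund))]
    public void CancelOrder_UnshippedOrder_TriggersFullRefund()
    {
        // Arrange: strongly typed ids make it impossible for the test to
        // disagree with the implementation about what identifies an order.
        var service = new OrderCancellationService();
        var orderId = new OrderId(Guid.NewGuid());
        var customerId = new CustomerId(Guid.NewGuid());

        // Act
        var result = service.CancelOrder(orderId, customerId);

        // Assert: the target is this named AC, not a coverage percentage.
        Assert.That(result.Success, Is.True);
    }
}
```

Once this method exists, the REQ301 warning for `CancellationTriggersFullRefund` clears; there is nothing asymptotic about it.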

The Compound Effect

For a single feature, the difference is minutes. For a project with 50 features and 200 ACs, the compound effect is dramatic:

Scale                         Spec-Driven Total          Typed Specification Total
──────────────────────────────────────────────────────────────────────────────────
1 feature, 4 ACs              ~6 minutes                 ~45 seconds
10 features, 40 ACs           ~60 minutes                ~7.5 minutes
50 features, 200 ACs          ~5 hours                   ~37 minutes
Full project autonomous run   Overnight, unpredictable   1-2 hours, predictable

These numbers assume typical iteration counts and pipeline times. Your mileage will vary. But the structural difference — bounded vs unbounded — is constant regardless of scale.


The Multi-Agent Future

AI-assisted development is evolving from single-agent (one Copilot, one Claude Code session) to multi-agent (orchestrated teams of specialized agents working in parallel). This evolution amplifies the differences between the two approaches. (For an example of how typed DSLs provide structured context that agents can introspect — including self-documenting DSLs via Document<Document<>> — see Auto-Documentation from a Typed System, Part IX.)

Multi-Agent with Spec-Driven

In a multi-agent spec-driven pipeline, different agents handle different concerns:

Orchestrator
├── Agent A: Read PRD → Generate implementation code
├── Agent B: Read Testing spec → Generate tests
├── Agent C: Read Coding Practices → Review code quality
└── Agent D: Read Documentation spec → Generate docs
    │
    ▼
Merge all outputs → Quality gate → Pass/Fail

The problem: agents don't share context. Agent A generates an implementation. Agent B generates tests. But Agent B doesn't know what Agent A generated — it reads the PRD and generates tests independently. The tests might not match the implementation. The quality gate catches this (tests fail), but the fix requires another round of multi-agent coordination.

Cross-agent consistency is hard because each agent reads different documents and interprets them independently. Two agents reading the same PRD section can produce incompatible implementations. The orchestrator must detect conflicts and mediate — adding complexity and latency.

Multi-agent spec-driven failure mode:

Agent A reads PRD: "User can cancel order"
→ Implements: CancelOrder(int orderId)   ← uses int

Agent B reads PRD: "User can cancel order"  
→ Tests: CancelOrder(Guid orderId)        ← uses Guid

Result: Tests fail. Orchestrator reruns.
Root cause: PRD says "order" — doesn't specify the ID type.

Multi-Agent with Typed Specifications

In a multi-agent typed pipeline, the type system is the shared context:

Orchestrator
├── Agent A: See REQ101 → Generate spec method for new AC
├── Agent B: See CS0535 → Implement the spec method
├── Agent C: See REQ301 → Generate test for AC
└── Agent D: See PERF warning → Add performance budget
    │
    ▼
Compile → All diagnostics clear → Done

The difference: agents share context through the type system. Agent A creates a spec method with signature Result RevokeRole(UserId admin, UserId target, RoleId role). Agent B sees this exact signature (it's in the compiled types) and implements it with the same parameter types. Agent C sees the same signature and writes a test that matches. There's no interpretation gap because the types are unambiguous.

Multi-agent typed failure mode:

Agent A adds spec method: RevokeRole(UserId, UserId, RoleId) → Result

Agent B implements: public Result RevokeRole(UserId admin, UserId target, RoleId role)
→ Must match signature exactly — compiler enforces

Agent C tests: [Verifies(typeof(UserRolesFeature), nameof(AdminCanRevokeRoles))]
→ Must reference existing AC — compiler enforces

Result: All agents produce compatible code. Types are the shared contract.

The Type System as a Coordination Protocol

In distributed systems, we use protocols (gRPC, protobuf, OpenAPI) to ensure that different services agree on data shapes. In multi-agent AI development, the type system serves the same role: it's a coordination protocol that ensures all agents produce compatible code.

The spec-driven approach has no coordination protocol beyond the documents themselves. If two agents interpret a document differently, the conflict is detected at the quality gate — late, coarse, and expensive to fix.

The typed approach has a coordination protocol built in: the C# type system. If two agents produce incompatible code, the compiler detects the conflict immediately — early, specific, and cheap to fix. An interface method has one signature. Two implementations must match it exactly. There's no room for interpretation.

This is why typed specifications become more valuable, not less, as AI agents become more numerous and more autonomous. The more agents you have, the more important the coordination protocol becomes. Documents don't scale as coordination protocols — they're ambiguous. Types do — they're precise.


The Human-AI Collaboration Model

A final perspective: how does each approach structure the collaboration between humans and AI?

Spec-Driven: Human Writes, AI Implements

Human                            AI Agent
  │                                │
  ├── Write PRD ──────────────────→│
  ├── Write Testing spec ─────────→│
  ├── Write Coding Practices ─────→│
  │                                ├── Read docs
  │                                ├── Generate code
  │                                ├── Generate tests
  │                                ├── Run quality gate
  │◄────── Deliver output ─────────┤
  │                                │
  ├── Review output                │
  ├── Fix issues                   │
  ├── Update docs if needed        │
  └── Merge                        │

The human is the specification author. The AI is the implementation agent. The human writes documents; the AI reads documents and generates code. The collaboration is sequential: human specifies → AI implements → human reviews.

The bottleneck: the human must write good documents. If the PRD is vague, the AI produces vague code. If the Testing spec is incomplete, the AI misses test cases. The quality of the AI's output is bounded by the quality of the human's documents.

Typed Specifications: Human Types, Compiler Guides AI

Human                     Compiler           AI Agent
  │                          │                  │
  ├── Add feature type ─────→│                  │
  │                          ├── Fire REQ101 ──→│
  │                          │                  ├── Generate spec
  │                          ├── Fire CS0535 ──→│
  │                          │                  ├── Generate impl
  │                          ├── Fire REQ301 ──→│
  │                          │                  ├── Generate test
  │                          ├── Build OK ─────→│
  │◄───────────── Output ────┤                  │
  │                          │                  │
  ├── Review output          │                  │
  └── Merge                  │                  │

The human is the specification definer (adding types). The compiler is the orchestrator (telling the AI what to do next). The AI is the implementation agent (responding to compiler diagnostics). The collaboration is three-party: human defines → compiler guides → AI implements.

The bottleneck: the human must define good types. If the feature record has vague AC method names, the AI produces vague implementations. But the types force precision in ways documents don't — method parameters must be typed, return types must be specified, and the compiler enforces structural completeness.
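A sketch of such a feature type for the password_reset example from the workflow earlier. The `[Feature]` attribute and every domain type here are assumptions; the point is that each AC becomes a signature, and a signature cannot stay vague:

```csharp
using System;

// Hypothetical attribute from the article's DSL, linking the type to a feature.
[Feature("password_reset")]
public interface IPasswordResetSpec
{
    Result RequestResetEmail(EmailAddress address);                  // AC1: request reset email
    Result ValidateResetLink(ResetToken token, DateTimeOffset now);  // AC2: 24-hour expiry
    Result SetNewPassword(ResetToken token, Password candidate);     // AC3: complexity rules
}
```

Even before any implementation exists, the 24-hour expiry must surface as a clock parameter and the complexity rules as a `Password` type; a PRD sentence can omit both and still read as complete.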

The Key Insight

In the spec-driven model, the human talks to the AI through documents. In the typed model, the human talks to the AI through the compiler. The compiler is a better communication channel because:

  1. It's unambiguous. REQ101: Feature.AC has no spec has one interpretation. "Implement the password reset feature" has many.
  2. It's incremental. Each diagnostic is one step. The AI handles one step at a time. Documents present everything at once.
  3. It's verifiable. When the diagnostic clears, the step is done. When an AI "follows" a document, completion is subjective.
  4. It's deterministic. The same code produces the same diagnostics. The same document can produce different AI interpretations.

The compiler is not just a validator — it's a communication protocol between humans and AI agents. The human expresses intent through types. The compiler translates intent into actionable diagnostics. The AI responds to diagnostics with code. The compiler verifies the code against the types. The loop closes.

Documents are a monologue — the human speaks, the AI listens. Types are a dialogue — the human defines, the compiler translates, the AI responds, the compiler verifies.

Part IX examines the long game: what happens to each approach over months and years?
