Testing: Strategy Documents vs Compiler-Enforced Coverage
Testing is where both approaches invest the most effort — and where the difference in philosophy produces the most visible consequences.
The Spec-Driven Testing Specification
The Testing-as-Code specification is the cogeet-io framework's most detailed document. It defines:
Principles (3)
- Layered Testing Strategy — Tests follow the testing pyramid: 70% unit, 20% integration, 10% E2E.
- Test Reliability — Flakiness rate < 0.01, stability > 0.99, execution time variance coefficient < 0.1.
- Comprehensive Coverage — Line coverage > 80%, branch > 75%, function > 90%, requirement coverage = 100%.
Practices (18+)
Unit Testing:
- Test Organization — consistent naming per language (Rust: test_function_behavior_expected, Python: test_when_given_then, JS: should_behavior_when_condition)
- Test Isolation — no shared mutable state, independent test data, proper cleanup
- Dependency Mocking — mock repositories, HTTP clients, file systems, clocks
- Test Doubles Strategy — when to use dummies, fakes, stubs, spies, mocks
Integration Testing:
- API Integration — request/response validation, error handling, auth, rate limiting
- Database Integration — real database instances, transaction isolation, migration testing
- Microservice Integration — sync, async, event-driven, circuit breaker testing
- Third-Party Integration — contract testing, sandbox environments, feature toggles
E2E Testing:
- User Workflow Testing — registration, core flows, error recovery, business processes
- Cross-Browser Testing — desktop (Chrome, Firefox, Safari, Edge), mobile, legacy
- System Reliability — network failures, service degradation, data corruption, resource exhaustion
Specialized Testing:
- Property-Based Testing — round-trip, invariants, commutativity, idempotency
- Mutation Testing — arithmetic, boolean, conditional, statement mutations; score > 80%
- Vulnerability Scanning — dependency audits, code patterns, configuration, secrets
- Penetration Testing — injection, authentication, authorization, data exposure
- Load Testing — baseline, stress, spike, volume, endurance; p95 < 2s
- Chaos Engineering — infrastructure, application, dependency, resource failures
- Fuzz Testing — structured fuzzing, protocol fuzzing
Test Data:
- Synthetic Data Generation — factories, fixtures, generators, anonymization
- Test Data Lifecycle — setup, isolation, cleanup, reset
- Test Environment Strategy — containers for integration, production-like for perf
CI/CD Integration:
- Pipeline integration — pre-commit, commit, pre-deploy, post-deploy stages
- Parallelization — test-level, suite-level, pipeline-level, distributed
- Reporting — execution results, coverage analysis, performance metrics, quality trends
Metrics (3)
- Test Coverage Comprehensive — five dimensions (line, branch, function, requirement, risk)
- Test Quality Score — composite of reliability (0.4), maintainability (0.3), effectiveness (0.3); target 0.85
- Defect Detection Effectiveness — pre-production detection > 85%, regression prevention > 90%, critical detection > 95%
Language-Specific Patterns
- Rust: #[cfg(test)] modules, tests/ directory, doc tests, criterion benchmarks
- Python: pytest, hypothesis, coverage.py, mutmut
- JavaScript: jest, fast-check, nyc, stryker
- Java: JUnit, junit-quickcheck, jacoco, pitest
- C#: xUnit/NUnit, coverlet, stryker.NET
What This Gives You
The Testing-as-Code specification is a comprehensive encyclopedia of testing knowledge. If a team doesn't know about mutation testing, this document introduces it. If a team hasn't considered chaos engineering, this document explains why they should. If a team's naming conventions are inconsistent, this document provides language-specific patterns.
It's a teaching document, a reference document, and a compliance checklist rolled into one.
What It Cannot Give You
Which tests are missing. The document says "requirement coverage = 100%" but doesn't know which requirements exist, which have tests, and which don't. That knowledge lives in the code — and the document has no structural link to the code.
Whether test X covers AC Y. The document says "tests must cover acceptance criteria" but has no mechanism to verify the link. A test named TestPasswordReset might or might not test the "reset link expires after 24 hours" AC. The document can't tell.
Stale test detection. If a feature is deleted, the tests for that feature become stale. The document has no mechanism to detect this. The tests still exist, still pass, still count toward coverage — but they test dead code.
Real-time coverage state. The document defines thresholds (80% line coverage) but doesn't know the current state. You need a separate tool (coverage.py, coverlet, nyc) to measure — and that tool measures lines, not acceptance criteria.
Typed Specification Testing
The typed approach to testing is narrower in scope but deeper in enforcement. It doesn't catalog 18+ testing practices. It does three things:
1. Test-to-Requirement Linking
Every test class is annotated with [TestsFor] to declare which feature it covers. Every test method is annotated with [Verifies] to declare which AC it proves:
[TestsFor(typeof(UserRolesFeature))]
public class UserRolesFeatureTests
{
[Verifies(typeof(UserRolesFeature), nameof(UserRolesFeature.AdminCanAssignRoles))]
public void Admin_with_ManageRoles_can_assign_editor_role()
{
// Arrange
var admin = TestUsers.AdminWithPermission(Permission.ManageRoles);
var target = TestUsers.RegularUser();
var role = Roles.Editor;
// Act
var result = _service.AssignRole(admin, target, role);
// Assert
Assert.That(result.IsSuccess, Is.True);
Assert.That(target.CurrentRole, Is.EqualTo(role));
}
[Verifies(typeof(UserRolesFeature), nameof(UserRolesFeature.AdminCanAssignRoles))]
public void Non_admin_cannot_assign_roles()
{
var user = TestUsers.RegularUser();
var target = TestUsers.AnotherUser();
var role = Roles.Editor;
var result = _service.AssignRole(user, target, role);
Assert.That(result.IsSuccess, Is.False);
Assert.That(result.Error, Is.InstanceOf<UnauthorizedException>());
}
[Verifies(typeof(UserRolesFeature), nameof(UserRolesFeature.ViewerHasReadOnlyAccess))]
public void Viewer_can_read_resources()
{
var viewer = TestUsers.ViewerUser();
var resource = TestResources.Document("doc-123");
var result = _service.VerifyReadAccess(viewer, resource);
Assert.That(result.IsSuccess, Is.True);
}
[Verifies(typeof(UserRolesFeature), nameof(UserRolesFeature.ViewerHasReadOnlyAccess))]
public void Viewer_cannot_modify_resources()
{
var viewer = TestUsers.ViewerUser();
var resource = TestResources.Document("doc-123");
var result = _service.VerifyWriteAccess(viewer, resource);
Assert.That(result.IsSuccess, Is.False);
}
}

Note: multiple tests can verify the same AC. AdminCanAssignRoles has both a happy-path test and an authorization-failure test. The system tracks all of them.
2. Compiler-Enforced Coverage
The REQ3xx analyzer family detects coverage gaps at compile time:
| Diagnostic | Severity | Trigger |
|---|---|---|
| REQ300 | Error | Feature has zero [TestsFor] test classes |
| REQ301 | Warning | AC method has no [Verifies] test |
| REQ302 | Warning | [Verifies] references an AC method that doesn't exist (stale) |
| REQ303 | Info | Feature fully tested — all ACs have [Verifies] tests |
Build output:
error REQ300: JwtRefreshStory has 2 acceptance criteria but no test class
with [TestsFor(typeof(JwtRefreshStory))]
warning REQ301: PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce has no test
with [Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.ResetLinkCanOnlyBeUsedOnce))]
warning REQ302: UserRolesTests.OldTest references
nameof(UserRolesFeature.DeletedAC) which no longer exists

This is the critical difference: the compiler knows which acceptance criteria have tests and which don't. It doesn't measure line coverage — it measures requirement coverage. It doesn't count tests — it maps tests to ACs.
3. Quality Gates Integration
After tests execute, the REQ4xx analyzer family validates that tests don't just exist — they pass, meet coverage thresholds, and satisfy performance budgets:
<!-- MyApp.Tests.csproj -->
<PropertyGroup>
<RequirementMinCoverage>80</RequirementMinCoverage>
<RequirementMinPassRate>100</RequirementMinPassRate>
<RequirementMaxTestDuration>5000</RequirementMaxTestDuration>
</PropertyGroup>

| Diagnostic | Severity | Trigger |
|---|---|---|
| REQ400 | Error | Feature's test pass rate below minimum |
| REQ401 | Warning | Feature's AC coverage below threshold |
| REQ402 | Warning | Test duration exceeds budget |
| REQ403 | Info | Feature passes all quality gates |
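The gate logic the REQ4xx family applies can be sketched as a simple threshold check. This is a hypothetical illustration of the evaluation, not the real analyzer; the threshold values mirror the csproj properties shown above:

```python
# Illustrative sketch of REQ4xx-style gate evaluation (not the actual analyzer).
# Thresholds mirror the csproj properties: RequirementMinCoverage,
# RequirementMinPassRate, RequirementMaxTestDuration.

thresholds = {"min_coverage": 80, "min_pass_rate": 100, "max_test_duration_ms": 5000}

def evaluate_gates(pass_rate: float, ac_coverage: float, duration_ms: int,
                   t: dict = thresholds) -> list[str]:
    """Return the diagnostics a feature's test results would trigger."""
    diagnostics = []
    if pass_rate < t["min_pass_rate"]:
        diagnostics.append("REQ400")  # error: pass rate below minimum
    if ac_coverage < t["min_coverage"]:
        diagnostics.append("REQ401")  # warning: AC coverage below threshold
    if duration_ms > t["max_test_duration_ms"]:
        diagnostics.append("REQ402")  # warning: test duration exceeds budget
    return diagnostics or ["REQ403"]  # info: all quality gates pass

print(evaluate_gates(pass_rate=100, ac_coverage=85, duration_ms=1200))  # ['REQ403']
print(evaluate_gates(pass_rate=98, ac_coverage=70, duration_ms=6000))
```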
Scenario: Team adds property-based testing
Spec-driven: The Testing-as-Code spec already documents property-based testing. The team reads the section, learns about round-trip, invariant, commutativity, and idempotency properties, and implements tests using the recommended framework (proptest for Rust, hypothesis for Python, fast-check for JS).
The spec tells them:
- What property categories exist
- Which frameworks to use per language
- What violation patterns to watch for (untested invariants, missing round-trip tests)
- What auto-fix options exist
This is valuable. It's a teaching moment that produces better tests.
Typed specifications:
The typed approach has nothing to say about property-based testing. It doesn't know what property testing is, doesn't recommend frameworks, doesn't define categories. A developer writes a property test, annotates it with [Verifies], and the system tracks it like any other test.
[Verifies(typeof(OrderProcessingFeature),
nameof(OrderProcessingFeature.OrderTotalMustBePositive))]
[Property] // FsCheck or similar
public Property Order_total_is_always_positive_when_all_lines_have_positive_quantity()
{
return Prop.ForAll(
Arb.From<PositiveInt>().Select(i => i.Get),
Arb.From<PositiveDecimal>(),
(quantity, price) =>
{
var line = new OrderLine(quantity, price);
var order = new Order(new[] { line });
return order.Total > 0;
});
}

The system knows this test covers the OrderTotalMustBePositive AC. But it doesn't know it's a property test, doesn't validate the property category, and doesn't check the mutation score.
Winner: Spec-driven for guidance; typed for enforcement.
Scenario: Developer forgets to test an AC
Spec-driven:
The coverage tool reports 78% line coverage on AuthorizationService.cs. The quality gate fails because the threshold is 80%. But line coverage doesn't tell you WHICH acceptance criteria are untested. The developer adds a few more test cases for well-covered methods until coverage reaches 81%. The quality gate passes. The untested AC remains untested.
Coverage report:
AuthorizationService.cs: 81% ← passes threshold
- AssignRole(): 95% covered
- VerifyReadAccess(): 90% covered
- RevokeRole(): 45% covered ← this is the problem, but not flagged

The spec-driven approach catches low coverage but not missing AC coverage. A developer can game the threshold by testing easy methods more thoroughly while leaving hard methods untested.
Typed specifications: The analyzer fires:
warning REQ301: UserRolesFeature.AdminCanRevokeRoles has no test with
[Verifies(typeof(UserRolesFeature),
nameof(UserRolesFeature.AdminCanRevokeRoles))]

The diagnostic is specific: not "low coverage" but "this specific acceptance criterion has no test." The developer cannot game it by testing other ACs more thoroughly. The build won't pass until this specific AC has a [Verifies] test.
Winner: Typed specifications, clearly. The per-AC granularity is decisive.
Scenario: Feature is deleted, tests become stale
Spec-driven: A feature is removed from the PRD. The implementation code is deleted. But the tests remain. They still pass (the code they test is gone, so they test... nothing? They might test a helper that still exists, or they might test an interface that was repurposed). They still contribute to coverage. Nobody notices they're stale.
Six months later, someone renames a method and 14 "passing" tests break. The team spends two days figuring out that these tests have been testing dead code since the feature was deleted.
Typed specifications: The feature type is deleted. Instantly:
error CS0246: The type or namespace name 'UserRolesFeature' could not be found
→ in UserRolesFeatureTests.cs, line 1: [TestsFor(typeof(UserRolesFeature))]
→ in UserRolesFeatureTests.cs, line 5: [Verifies(typeof(UserRolesFeature), ...)]
→ in UserRolesFeatureTests.cs, line 15: [Verifies(typeof(UserRolesFeature), ...)]

Every test referencing the deleted feature fails to compile. The developer must delete or repurpose the tests. Stale tests are structurally impossible.
Winner: Typed specifications. Dead code elimination is a compiler problem, not a discipline problem.
Scenario: New team member needs to understand testing expectations
Spec-driven: The new team member reads the Testing-as-Code specification. It's comprehensive — 18+ practices, 3 principles, 3 metrics, language-specific patterns. They learn about mutation testing, chaos engineering, property-based testing. They understand the testing pyramid and the coverage thresholds.
They feel informed. They know what to test and how to test.
Typed specifications:
The new team member writes their first test. They forget the [Verifies] attribute. The compiler says:
info REQ303: Test method 'MyTest' in class 'UserRolesTests' is not annotated with
[Verifies]. Consider adding [Verifies(typeof(Feature), nameof(AC))]
to link this test to a specific acceptance criterion.

They add the attribute. They try to reference a nonexistent AC:

error CS0117: 'UserRolesFeature' does not contain a definition for 'NonexistentAC'

They pick the right AC. The compiler is happy. They've learned the system by using it, not by reading a 15,000-word document.
Winner: Typed specifications for enforcement, spec-driven for education. The typed approach teaches by doing. The spec-driven approach teaches by explaining. Both are valuable; they serve different learning styles.
The Coverage Granularity Gap
This is the single most important difference in the testing domain:
| Coverage Type | Spec-Driven | Typed Specifications |
|---|---|---|
| Line coverage | ✓ (via coverage tools) | ✓ (via coverage tools) |
| Branch coverage | ✓ (via coverage tools) | ✓ (via coverage tools) |
| Function coverage | ✓ (via coverage tools) | ✓ (via coverage tools) |
| Requirement coverage | ✗ (stated as goal, not measured) | ✓ (REQ3xx analyzers, per-AC) |
| AC-level coverage | ✗ (no structural link) | ✓ (each test declares which AC it verifies) |
| Stale test detection | ✗ (no mechanism) | ✓ (REQ302 diagnostic) |
| Missing test detection | Only via line coverage thresholds | Specific diagnostic per uncovered AC |
The spec-driven approach measures code coverage — how many lines, branches, and functions are exercised. This is a proxy for test quality, but it's a weak proxy. 100% line coverage doesn't mean all acceptance criteria are tested. 50% line coverage doesn't mean important ACs are untested.
The typed approach measures requirement coverage — how many acceptance criteria have [Verifies] tests. This is a direct measure of test completeness against the specification. It can coexist with line coverage (you can run both), but it adds the dimension that line coverage cannot provide.
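The requirement-coverage computation the REQ3xx analyzers perform reduces to a set comparison between declared ACs and [Verifies] references. A hypothetical sketch (names and data are illustrative, not the real analyzer API):

```python
# Illustrative sketch of requirement coverage: compare a feature's declared ACs
# against the ACs its tests claim to verify via [Verifies]. Hypothetical data.

feature_acs = {"AdminCanAssignRoles", "ViewerHasReadOnlyAccess", "AdminCanRevokeRoles"}

# (test name, AC it claims to verify) pairs, as declared by [Verifies]
verifies = [
    ("Admin_with_ManageRoles_can_assign_editor_role", "AdminCanAssignRoles"),
    ("Viewer_can_read_resources", "ViewerHasReadOnlyAccess"),
    ("OldTest", "DeletedAC"),  # stale reference: the AC no longer exists
]

covered = {ac for _, ac in verifies if ac in feature_acs}
uncovered = feature_acs - covered                                  # -> REQ301 per AC
stale = [(t, ac) for t, ac in verifies if ac not in feature_acs]   # -> REQ302

print(sorted(uncovered))  # ['AdminCanRevokeRoles']
print(stale)              # [('OldTest', 'DeletedAC')]
print(f"requirement coverage: {len(covered)}/{len(feature_acs)}")
```

Note what line coverage never sees: the uncovered AC and the stale reference both fall out of the set arithmetic directly.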
What Typed Specifications Miss
The spec-driven Testing-as-Code specification covers domains that the typed approach ignores entirely:
Chaos Engineering — Testing system resilience under failure injection. The typed approach has no concept of chaos experiments. This is genuinely valuable for distributed systems.
Load Testing — Performance under expected and peak load. The typed approach tracks test duration (REQ402) but doesn't define load testing strategies, thresholds, or scenarios.
Cross-Browser Testing — Browser compatibility matrices. The typed approach is server-side C# and has no opinion on frontend testing.
Fuzz Testing — Input robustness testing with random/malicious inputs. The typed approach doesn't define fuzzing strategies or targets.
Penetration Testing — Security assessment workflows. The typed approach handles authorization testing through ACs but doesn't define pen-test methodologies.
Test Data Management — Synthetic data generation, privacy compliance, data lifecycle. The typed approach uses test factories and fixtures but doesn't define a data management strategy.
CI/CD Pipeline Design — Pre-commit, commit, pre-deploy, post-deploy test stages. The typed approach integrates with MSBuild but doesn't prescribe pipeline architecture.
These are current gaps, not fundamental limits. And this is the crucial distinction: every one of these gaps is a DSL waiting to be written.
Consider chaos engineering. Today, the typed approach has no opinion on it. But there's nothing stopping a Chaos DSL:
[ChaosExperiment("OrderService_NetworkPartition")]
[TargetService(typeof(OrderService))]
[FailureMode(FailureType.NetworkPartition, Duration = "30s")]
[ExpectedBehavior(Degradation.CircuitBreakerOpen, RecoveryTime = "60s")]
[RequiresResilience(typeof(OrderProcessingFeature),
nameof(OrderProcessingFeature.OrderCanBeProcessed))]
public partial class OrderServicePartitionExperiment { }

Or load testing:
[LoadTest("OrderEndpoint_PeakTraffic")]
[TargetEndpoint("POST /api/orders")]
[LoadProfile(Users = 1000, RampUp = "30s", Duration = "5m")]
[PerformanceBudget(P95 = "200ms", P99 = "500ms", ErrorRate = 0.01)]
[VerifiesNonFunctional(typeof(OrderProcessingFeature))]
public partial class OrderEndpointLoadTest { }

The source generator validates that the target service exists, the failure mode is valid, the performance budget is reasonable, and the feature reference is correct. The same compile-time enforcement that works for functional requirements works for non-functional testing strategies.
The spec-driven Testing-as-Code describes these strategies in English. A typed Chaos DSL would enforce them in the compiler. The spec-driven approach documents what should be tested. A typed DSL would make untested scenarios produce compiler warnings.
This is the trajectory: typed specifications start narrow (requirement chain only) and expand by adding DSLs. Each DSL brings another domain under compiler enforcement. The spec-driven approach starts broad (all testing strategies) but stays shallow (descriptions, not enforcement). Over time, the typed approach's coverage grows toward the spec-driven approach's breadth — but with enforcement the spec-driven approach cannot match.
The honest recommendation: for testing strategies you haven't yet built DSLs for, use spec-driven documents as a strategy guide (which testing techniques should we use?). But recognize that this is a transitional state, not the end state. The end state is typed DSLs for every testing domain that matters to your team.
The Testing Inertness Problem
The spec-driven Testing-as-Code specification defines beautiful testing strategies. But there's a problem that echoes Part II's discussion of pillar inertness: the spec describes testing practices; it doesn't create tests.
Consider this entry from the Testing spec:
DEFINE_PRACTICE(property_testing_implementation)
Scope: algorithmic_correctness, invariant_validation
Enforcement: recommended
Validation strategy: property_verification
Rule: "Complex algorithms and business logic must be tested
with property-based testing"
Property Categories:
- Round trip: encode_decode_identity
- Invariants: system_state_consistency
- Commutativity: operation_order_independence
- Idempotency: repeated_operation_same_result
Implementation Tools by Language:
- Rust: proptest
- Python: hypothesis
- JavaScript: fast_check
- Java: junit_quickcheck

This is an excellent description of property-based testing. A developer who reads it will understand what properties to test and which tools to use.
But the description is text. It doesn't know which algorithms in your codebase are "complex." It doesn't know which functions should have round-trip properties. It doesn't generate property tests. It doesn't even know if your project uses Rust or Python.
To make this spec actionable, you need:
- A human (or AI) to read the spec
- That reader to identify which code is "complex"
- That reader to decide which property category applies
- That reader to write the property test
- A quality gate to verify the test exists and passes
Steps 1-4 are interpretation. Step 5 is validation. The spec contributes to step 1 (guidance) but cannot participate in steps 2-5.
The typed approach is narrower but active. It doesn't tell you to write property tests — but if you DO write a property test and annotate it with [Verifies], the system knows exactly which AC it covers, and it can verify that every AC has at least one test. The spec is passive guidance. The type system is active enforcement.
Can the Spec-Driven Testing Spec Become Active?
Only by building tooling. To make "Complex algorithms must be property-tested" enforceable, you'd need:
- A code analyzer that identifies "complex algorithms" (by cyclomatic complexity? by annotation?)
- A test scanner that identifies property tests (by framework? by naming convention?)
- A cross-reference checker that matches algorithms to property tests
- A CI gate that fails if complex algorithms lack property tests
This is Roslyn analyzer territory. You're building the same thing the typed approach provides — just without the type system's help.
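The four pieces of tooling above can be sketched in a few lines, under assumed conventions: "complex" means cyclomatic complexity above a threshold, and a property test is recognized by a naming convention. All names and data here are hypothetical:

```python
# Minimal sketch of the cross-reference checker described above, under assumed
# conventions. The function names, complexity values, and test names are invented.

COMPLEXITY_THRESHOLD = 10

# Steps 1-2: analyzer output — function name -> measured cyclomatic complexity
complexity = {"parse_order": 14, "format_label": 3, "reconcile_ledger": 22}

# Step 3: test scanner output — property tests found by naming convention
property_tests = {"test_property_parse_order_roundtrip"}

def missing_property_tests(complexity: dict, property_tests: set,
                           threshold: int) -> list[str]:
    """Cross-reference: complex functions with no matching property test."""
    complex_fns = [f for f, c in complexity.items() if c > threshold]
    return [f for f in complex_fns
            if not any(f in test for test in property_tests)]

gaps = missing_property_tests(complexity, property_tests, COMPLEXITY_THRESHOLD)
print(gaps)  # ['reconcile_ledger'] -> step 4: fail the CI gate if non-empty
```

Even this toy version shows the cost: every convention (threshold, naming scheme) is a decision the spec leaves to the tool builder, whereas the typed approach gets the cross-reference from typeof()/nameof() for free.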
A Full Test File: Side by Side
Let's see what a complete test file looks like in each approach for the same feature.
Spec-Driven Test File
Following the Testing-as-Code conventions (Arrange-Act-Assert, naming convention MethodName_Scenario_ExpectedResult):
// Following spec-driven conventions
// Feature: password_reset (from PRD)
// No structural link to the PRD — this is a naming convention
public class PasswordResetServiceTests
{
private readonly PasswordResetService _service;
private readonly Mock<IUserRepository> _userRepo;
private readonly Mock<ITokenStore> _tokenStore;
private readonly Mock<IEmailService> _emailService;
private readonly FakeClock _clock;
public PasswordResetServiceTests()
{
_userRepo = new Mock<IUserRepository>();
_tokenStore = new Mock<ITokenStore>();
_emailService = new Mock<IEmailService>();
_clock = new FakeClock(DateTime.UtcNow);
_service = new PasswordResetService(
_userRepo.Object, _tokenStore.Object,
_emailService.Object, _clock);
}
// AC: "User can request a password reset email"
// This comment is the only link to the AC. It can be wrong. It can be stale.
[Fact]
public void RequestReset_ValidEmail_SendsEmail()
{
// Arrange
var email = "user@example.com";
_userRepo.Setup(r => r.FindByEmail(email))
.Returns(new User { Id = Guid.NewGuid(), Email = email });
// Act
var result = _service.RequestPasswordReset(email);
// Assert
Assert.True(result.IsSuccess);
_emailService.Verify(e => e.SendResetLink(email, It.IsAny<string>()), Times.Once);
}
[Fact]
public void RequestReset_UnknownEmail_NoEmailSentButReturnsSuccess()
{
_userRepo.Setup(r => r.FindByEmail(It.IsAny<string>()))
.Returns((User?)null);
var result = _service.RequestPasswordReset("unknown@example.com");
// Success returned (prevent account enumeration) but no email sent
Assert.True(result.IsSuccess);
_emailService.Verify(e => e.SendResetLink(
It.IsAny<string>(), It.IsAny<string>()), Times.Never);
}
// AC: "Reset link expires after 24 hours"
[Fact]
public void ValidateToken_Expired_ReturnsFailure()
{
var token = new ResetToken(Guid.NewGuid(), DateTime.UtcNow.AddHours(-25));
_tokenStore.Setup(t => t.Find(token.Id)).Returns(token);
var result = _service.ValidateResetToken(token.Id);
Assert.False(result.IsSuccess);
Assert.Contains("expired", result.Error);
}
// ... more tests following the same pattern
}

Characteristics:
- Feature link: comment only (// Feature: password_reset)
- AC link: comment only (can be wrong, can be stale)
- No compiler check that these tests cover the right ACs
- No way to generate a coverage matrix from these tests
- If the feature is deleted, these tests still compile and pass
Typed Specification Test File
// Structural link to the feature via typeof() and nameof()
// The compiler verifies every reference
[TestsFor(typeof(PasswordResetFeature))]
public class PasswordResetFeatureTests
{
private readonly PasswordResetService _service;
private readonly InMemoryUserRepository _userRepo;
private readonly InMemoryTokenStore _tokenStore;
private readonly SpyEmailService _emailService;
private readonly FakeClock _clock;
public PasswordResetFeatureTests()
{
_userRepo = new InMemoryUserRepository();
_tokenStore = new InMemoryTokenStore();
_emailService = new SpyEmailService();
_clock = new FakeClock(DateTime.UtcNow);
_service = new PasswordResetService(
_userRepo, _tokenStore, _emailService, _clock);
// Seed test data
_userRepo.Add(TestUsers.Alice);
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.UserCanRequestPasswordResetEmail))]
public void Valid_email_sends_reset_link()
{
var result = _service.RequestPasswordReset(TestUsers.Alice.Email);
Assert.That(result.IsSuccess, Is.True);
Assert.That(_emailService.SentEmails, Has.Count.EqualTo(1));
Assert.That(_emailService.SentEmails[0].To, Is.EqualTo(TestUsers.Alice.Email));
Assert.That(_emailService.SentEmails[0].Body, Does.Contain("reset"));
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.UserCanRequestPasswordResetEmail))]
public void Unknown_email_returns_success_but_sends_no_email()
{
var result = _service.RequestPasswordReset(new Email("unknown@example.com"));
// Success returned to prevent account enumeration
Assert.That(result.IsSuccess, Is.True);
Assert.That(_emailService.SentEmails, Is.Empty);
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.ResetLinkExpiresAfter24Hours))]
public void Token_used_after_24_hours_is_rejected()
{
var token = CreateValidToken();
_clock.AdvanceBy(TimeSpan.FromHours(25));
var result = _service.ValidateResetToken(token.Id);
Assert.That(result.IsSuccess, Is.False);
Assert.That(result.Error, Does.Contain("expired"));
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.ResetLinkExpiresAfter24Hours))]
public void Token_used_within_24_hours_is_accepted()
{
var token = CreateValidToken();
_clock.AdvanceBy(TimeSpan.FromHours(23));
var result = _service.ValidateResetToken(token.Id);
Assert.That(result.IsSuccess, Is.True);
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.NewPasswordMeetsComplexityRequirements))]
public void Weak_password_is_rejected()
{
var token = CreateValidToken();
var weakPassword = new Password("123");
var result = _service.ResetPassword(token.Id, weakPassword);
Assert.That(result.IsSuccess, Is.False);
Assert.That(result.Error, Does.Contain("complexity"));
}
[Verifies(typeof(PasswordResetFeature),
nameof(PasswordResetFeature.NewPasswordMeetsComplexityRequirements))]
public void Strong_password_resets_successfully()
{
var token = CreateValidToken();
var strongPassword = new Password("C0mpl3x!Pass#2026");
var result = _service.ResetPassword(token.Id, strongPassword);
Assert.That(result.IsSuccess, Is.True);
Assert.That(
_userRepo.FindById(TestUsers.Alice.Id).PasswordHash,
Is.Not.EqualTo(TestUsers.Alice.PasswordHash));
}
private ResetToken CreateValidToken()
{
_service.RequestPasswordReset(TestUsers.Alice.Email);
return _tokenStore.GetLatest();
}
}

Characteristics:
- Feature link: `typeof(PasswordResetFeature)` — compiler-checked, Ctrl+Click navigable
- AC link: `nameof(PasswordResetFeature.UserCanRequestPasswordResetEmail)` — refactor-safe
- Compiler verifies every reference (rename AC → all tests update automatically)
- Source-generated traceability matrix includes these tests
- If the feature is deleted, these tests produce compile errors
The difference is not in the test logic — both test the same behavior. The difference is in the metadata: the typed approach's test metadata is compiler-checked, navigable, and participates in the traceability system. The spec-driven approach's test metadata is comments that can lie.
Summary
| Dimension | Spec-Driven Testing | Typed Specification Testing |
|---|---|---|
| Scope | 15+ strategies, comprehensive | Requirement-to-test chain only |
| Guidance | Excellent (principles, practices, patterns) | Minimal (compiler diagnostics) |
| Enforcement | Coverage thresholds (line, branch, function) | Per-AC requirement coverage (REQ3xx) |
| Stale detection | None | REQ302 (stale [Verifies] reference) |
| Missing detection | Low coverage flag (line-level) | Specific AC diagnostic (REQ301) |
| Granularity | File/class/method level | Acceptance criterion level |
| Learning | Read the document | Use the system |
| Language support | Rust, Python, JS, Java, C#, C++, Go | C# (with .NET ecosystem) |
The Testing DSL Vision
The typed approach starts with [Verifies] — a single attribute that links a test to an acceptance criterion. But [Verifies] is just the beginning. Every testing concern that the spec-driven approach describes in English can be expressed as a typed DSL attribute with compiler enforcement. (The Auto-Documentation from a Typed System series applies this same pattern to operational concerns — each DSL follows the attribute-to-generator-to-artifact pipeline described below.)
Here's what a fully typed testing ecosystem looks like.
Property-Based Testing DSL
The spec-driven approach describes property-based testing: round-trip, invariants, commutativity, idempotency. The typed approach enforces it:
[PropertyTest(typeof(OrderProcessingFeature),
nameof(OrderProcessingFeature.OrderTotalMustBePositive))]
[PropertyCategory(PropertyCategory.Invariant)]
[Shrinkable(typeof(OrderLineArbitrary))]
public partial class OrderTotalInvariantProperty
{
/// <summary>
/// The generator produces arbitrary OrderLine collections.
/// The shrinking strategy reduces failing inputs to minimal cases.
/// </summary>
public static Arbitrary<OrderLine[]> Generator => Arb.From(
Gen.ArrayOf(
from qty in Gen.Choose(1, 10000)
from price in Gen.Choose(1, 100000).Select(p => p / 100m)
select new OrderLine(qty, price)));
public bool Property(OrderLine[] lines)
{
var order = new Order(lines);
return order.Total > 0;
}
}

The source generator produces:
// Generated: OrderTotalInvariantProperty.g.cs
public partial class OrderTotalInvariantProperty
{
[Fact]
[Trait("Category", "Property")]
[Trait("Feature", "OrderProcessingFeature")]
[Trait("AC", "OrderTotalMustBePositive")]
public void OrderTotalInvariantProperty_Executes()
{
Prop.ForAll(Generator, Property)
.WithShrink(OrderLineArbitrary.Shrink)
.WithMaxTest(1000)
.QuickCheckThrowOnFailure();
}
}

The analyzer validates:
- `OrderProcessingFeature` exists (compile error if not)
- `OrderTotalMustBePositive` is a valid AC on that feature (compile error if not)
- The `Generator` property returns `Arbitrary<T>` where T matches the `Property` method parameter (compile error if mismatched)
- The `PropertyCategory` matches the actual property shape (warning if `Invariant` is claimed but the property has side effects)
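These property categories are not specific to the C# DSL. As a stdlib-only illustration (the `for_all` runner and generator helpers below are hypothetical, not part of any framework), the same round-trip, invariant, and idempotency checks can be sketched in Python:

```python
import random

def for_all(gen, prop, n=500, seed=42):
    """Minimal property runner: try n random inputs, return the first
    counterexample, or None if the property held for every input."""
    rng = random.Random(seed)
    for _ in range(n):
        x = gen(rng)
        if not prop(x):
            return x
    return None

# Round-trip: parsing the printed form recovers the original value
round_trip = for_all(lambda r: r.randint(-10**9, 10**9),
                     lambda n: int(str(n)) == n)

# Invariant: a total over positive-quantity, positive-price lines is positive
def gen_lines(rng):
    return [(rng.randint(1, 10_000), rng.randint(1, 100_000) / 100)
            for _ in range(rng.randint(1, 20))]

invariant = for_all(gen_lines,
                    lambda lines: sum(q * p for q, p in lines) > 0)

# Idempotency: applying the operation twice equals applying it once
idempotent = for_all(lambda r: [r.randint(0, 99) for _ in range(r.randint(0, 30))],
                     lambda xs: sorted(sorted(xs)) == sorted(xs))

assert round_trip is None and invariant is None and idempotent is None
print("round-trip, invariant, and idempotency properties all held")
```

The typed DSL adds what this sketch lacks: the compiler-checked link from each property back to a specific acceptance criterion.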
Mutation Testing DSL
The spec-driven approach says "mutation score > 80%." The typed approach makes mutation testing a first-class concern:
[MutationTarget(typeof(OrderProcessingFeature))]
[MutationOperators(
MutationOperator.ArithmeticReplacement,
MutationOperator.ConditionalBoundary,
MutationOperator.NegateConditional,
MutationOperator.ReturnValueMutation)]
[MinimumMutationScore(85)]
[ExcludeFromMutation(nameof(Order.ToString), Reason = "Display-only method")]
public partial class OrderProcessingMutationConfig { }

The source generator produces a Stryker.NET configuration:
// Generated: stryker-OrderProcessing.json
{
"stryker-config": {
"project-info": {
"name": "OrderProcessing",
"feature": "OrderProcessingFeature"
},
"mutate": [
"src/MyApp.Domain/Orders/**/*.cs"
],
"mutation-level": "Standard",
"thresholds": {
"high": 85,
"low": 70,
"break": 60
},
"excluded-mutations": [],
"ignore-methods": ["ToString"]
}
}

The analyzer validates:
- The target feature exists and has implementations (compile error if feature is deleted)
- The mutation operators are valid for the implementation language (warning if operator doesn't apply)
- The `MinimumMutationScore` is achievable given the test coverage (info diagnostic with recommendation)
info MUT100: OrderProcessingFeature has 12 [Verifies] tests covering 4/4 ACs.
Mutation testing configured with score threshold 85%.
warning MUT101: OrderProcessingFeature.OrderCanBeSplitAcrossWarehouses has only
1 [Verifies] test. Consider adding edge-case tests to improve
mutation kill rate.

This is also the final piece of the semantic correctness puzzle. The biggest criticism of typed specifications is that a [Verifies] test can lie — it can reference an AC but test something unrelated. Mutation testing closes this gap: a lying test kills zero mutants, and [MutationTarget] catches it. Combined with executable ACs (where the test must call the AC method directly) and the REQ305 analyzer (which verifies the invocation), mutation testing is the third layer that guarantees semantic correctness. See Part VIII for the full three-layer defense.
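The lying-test failure mode is easy to demonstrate outside the DSL. In this illustrative Python sketch (the `shipping_fee` example and helper names are hypothetical, not the C# toolchain), a conditional-boundary mutant survives a test that never exercises the behavior it claims to verify, and dies against one that does:

```python
# Original implementation: free shipping for orders of 100.00 or more
def shipping_fee(total):
    return 0.0 if total >= 100.0 else 5.0

# Conditional-boundary mutant: '>=' replaced with '>'
def shipping_fee_mutant(total):
    return 0.0 if total > 100.0 else 5.0

# A "lying" test: claims to verify the free-shipping AC but never
# touches the boundary, so it passes against both versions.
def lying_test(impl):
    assert impl(10.0) == 5.0

# A genuine test: pins the boundary case, so the mutant fails it.
def boundary_test(impl):
    assert impl(100.0) == 0.0

def kills(test, mutant):
    """A test 'kills' a mutant if it fails when run against the mutant."""
    try:
        test(mutant)
        return False
    except AssertionError:
        return True

lying_test(shipping_fee)                           # passes on the original...
assert not kills(lying_test, shipping_fee_mutant)  # ...and lets the mutant survive
boundary_test(shipping_fee)                        # passes on the original...
assert kills(boundary_test, shipping_fee_mutant)   # ...and kills the mutant
print("mutant survived the lying test and was killed by the boundary test")
```

A surviving mutant is exactly the signal [MutationTarget] surfaces: the referenced AC has a test, but not one that constrains the behavior.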
Fuzz Testing DSL
The spec-driven approach mentions "structured fuzzing" and "protocol fuzzing." The typed approach makes fuzz targets declarative:
[FuzzTarget(typeof(OrderProcessingFeature),
nameof(OrderProcessingFeature.OrderTotalMustBePositive))]
[InputGenerator(typeof(MalformedOrderInputGenerator))]
[FuzzDuration("5m")]
[MaxInputSize(4096)]
[CrashPolicy(CrashPolicy.CollectAndContinue)]
public partial class OrderInputFuzzTest
{
/// <summary>
/// The fuzz engine calls this method with generated byte arrays.
/// The InputGenerator structures the bytes into domain-meaningful inputs.
/// </summary>
public FuzzResult Execute(byte[] input)
{
var order = MalformedOrderInputGenerator.FromBytes(input);
try
{
var result = _service.ProcessOrder(order);
// If we get here, the input was handled gracefully — good
return FuzzResult.Handled;
}
catch (DomainException)
{
// Expected: domain rejects malformed input — good
return FuzzResult.Handled;
}
// Unhandled exceptions = fuzz finding
}
}

The source generator produces:
// Generated: OrderInputFuzzTest.g.cs
public partial class OrderInputFuzzTest
{
[Fact]
[Trait("Category", "Fuzz")]
[Trait("Feature", "OrderProcessingFeature")]
[Trait("AC", "OrderTotalMustBePositive")]
public void OrderInputFuzzTest_Executes()
{
var engine = new FuzzEngine(
target: Execute,
generator: new MalformedOrderInputGenerator(),
maxDuration: TimeSpan.FromMinutes(5),
maxInputSize: 4096,
crashPolicy: CrashPolicy.CollectAndContinue);
var report = engine.Run();
Assert.Empty(report.Crashes);
Assert.Empty(report.UnhandledExceptions);
Assert.True(report.InputsTested > 0,
"Fuzz engine must test at least one input");
}
}

The analyzer validates:
- The target feature and AC exist (compile error if deleted)
- The `InputGenerator` type implements `IFuzzInputGenerator` (compile error if not)
- The `Execute` method has the correct signature (compile error if wrong)
- The `FuzzDuration` is reasonable (warning if > 30 minutes in a CI context)
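Stripped of the DSL, the generated harness runs a conventional fuzz loop. This Python sketch (the toy `parse_order` target and helper names are hypothetical) shows the collect-and-continue policy: expected domain rejections count as handled, anything else is recorded as a finding:

```python
import random

def parse_order(data: bytes):
    """Toy target: expects 'qty:price' as ASCII; raises ValueError on bad input."""
    qty_s, _, price_s = data.decode("ascii").partition(":")
    qty, price = int(qty_s), float(price_s)
    if qty <= 0 or price <= 0:
        raise ValueError("non-positive order line")
    return qty * price

def fuzz(target, iterations=2000, max_len=32, seed=1234):
    """Collect-and-continue: keep fuzzing past failures, report at the end."""
    rng = random.Random(seed)
    handled, findings = 0, []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        try:
            target(data)
            handled += 1      # input accepted gracefully
        except ValueError:    # covers UnicodeDecodeError too (a subclass):
            handled += 1      # the domain rejected malformed input, as expected
        except Exception as exc:
            findings.append((data, exc))  # unhandled exception = fuzz finding
    return handled, findings

handled, findings = fuzz(parse_order)
# A real harness would fail the build if findings is non-empty.
print(f"{handled} inputs handled, {len(findings)} findings")
```

The generated xUnit wrapper adds what the loop alone cannot: the compile-checked link to the feature and AC, plus the assertion that the engine actually exercised inputs.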
Contract Testing DSL
For integration testing against external services, the spec-driven approach describes "contract testing" and "sandbox environments." The typed approach declares contracts as typed interfaces:
[ContractTest(typeof(IPaymentGateway))]
[Provider("Stripe")]
[ConsumerName("OrderService")]
[ProviderState("customer_with_valid_card")]
[VerifiesIntegration(typeof(OrderProcessingFeature),
nameof(OrderProcessingFeature.CancellationTriggersFullRefund))]
public partial class PaymentGatewayRefundContract
{
[ContractInteraction("refund_full_amount")]
public async Task<ContractResult> RefundInteraction()
{
// Arrange: the payment gateway is in state "customer_with_valid_card"
var payment = new PaymentId("pay_test_123");
var amount = new Money(99.99m, Currency.USD);
// Act
var result = await _gateway.RefundAsync(payment, amount);
// Assert: the contract specifies the response shape
return ContractResult.Verify(result)
.HasStatus(RefundStatus.Succeeded)
.HasAmount(amount)
.HasCurrency(Currency.USD);
}
}

The source generator produces a Pact-compatible contract file:
// Generated: OrderService-Stripe-contract.json
{
"consumer": { "name": "OrderService" },
"provider": { "name": "Stripe" },
"interactions": [
{
"description": "refund_full_amount",
"providerState": "customer_with_valid_card",
"request": {
"method": "POST",
"path": "/v1/refunds",
"body": { "payment_intent": "pay_test_123", "amount": 9999 }
},
"response": {
"status": 200,
"body": { "status": "succeeded", "amount": 9999, "currency": "usd" }
}
}
],
"metadata": {
"feature": "OrderProcessingFeature",
"ac": "CancellationTriggersFullRefund"
}
}

The analyzer validates:
- `IPaymentGateway` exists and is a service interface (compile error if not)
- The provider name matches a known configuration (warning if unknown)
- The feature and AC references are valid (compile error if deleted)
- Every public method on `IPaymentGateway` has at least one `[ContractInteraction]` (warning if uncovered)
warning CONTRACT100: IPaymentGateway.ChargeAsync has no [ContractInteraction]
in any contract test class. Consider adding a contract
for this interaction.

Performance Testing DSL
The spec-driven approach defines "p95 < 2s" as a threshold. The typed approach links performance budgets to features:
[PerformanceTest(typeof(OrderProcessingFeature),
P95 = "200ms", P99 = "500ms")]
[Endpoint("POST /api/orders")]
[LoadProfile(ConcurrentUsers = 100, RampUp = "30s", Duration = "5m")]
[DataProfile(OrdersPerUser = 5, AverageLineItems = 3)]
[ResourceBudget(MaxMemoryMB = 512, MaxCpuPercent = 80)]
public partial class OrderProcessingPerformanceTest
{
[PerformanceScenario("happy_path")]
public async Task<PerformanceResult> HappyPath(HttpClient client)
{
var order = TestOrders.Typical();
var response = await client.PostAsJsonAsync("/api/orders", order);
return PerformanceResult.FromResponse(response);
}
[PerformanceScenario("large_order")]
public async Task<PerformanceResult> LargeOrder(HttpClient client)
{
var order = TestOrders.WithLineItems(100);
var response = await client.PostAsJsonAsync("/api/orders", order);
return PerformanceResult.FromResponse(response);
}
}

The source generator produces a k6 load test script:
// Generated: order-processing-perf.k6.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '30s', target: 100 }, // ramp up
{ duration: '5m', target: 100 }, // sustained
{ duration: '30s', target: 0 }, // ramp down
],
thresholds: {
'http_req_duration{scenario:happy_path}': ['p(95)<200', 'p(99)<500'],
'http_req_duration{scenario:large_order}': ['p(95)<200', 'p(99)<500'],
},
};
export default function () {
// Scenario: happy_path (80% weight)
// Scenario: large_order (20% weight)
const scenario = Math.random() < 0.8 ? 'happy_path' : 'large_order';
// ... generated test logic
}

The analyzer validates:
- The target feature exists (compile error if deleted)
- The endpoint matches a real API route (warning if no matching controller action)
- The P95/P99 values parse as valid durations (compile error if "200xs")
- The load profile is reasonable (warning if > 10,000 concurrent users in a test environment)
- The feature has functional tests via `[Verifies]` (warning if performance-tested but not functionally tested)
warning PERF100: OrderProcessingFeature has a [PerformanceTest] but no
[Verifies] test for AC 'OrderCanBeSplitAcrossWarehouses'.
Performance testing without functional coverage is unreliable.

The Complete Testing DSL Analyzer Suite
When all testing DSLs are in place, the analyzer output at build time covers every testing concern:
# Requirement Coverage (REQ3xx)
info REQ303: OrderProcessingFeature — all 4 ACs have [Verifies] tests ✓
info REQ303: PasswordResetFeature — all 3 ACs have [Verifies] tests ✓
warning REQ301: UserRolesFeature.AdminCanRevokeRoles has no [Verifies] test
# Property Testing (PROP1xx)
info PROP100: OrderProcessingFeature.OrderTotalMustBePositive has
[PropertyTest] with Invariant category ✓
warning PROP101: PasswordResetFeature has no [PropertyTest] for any AC.
Consider property-testing token generation and expiry logic.
# Mutation Testing (MUT1xx)
info MUT100: OrderProcessingFeature mutation config: score threshold 85%,
4 operators, 12 covering tests ✓
warning MUT101: PasswordResetFeature has no [MutationTarget] configuration.
# Fuzz Testing (FUZZ1xx)
info FUZZ100: OrderProcessingFeature.OrderTotalMustBePositive has
[FuzzTarget] with 5m duration ✓
warning FUZZ101: PasswordResetFeature has no [FuzzTarget]. Consider fuzzing
password complexity validation and token parsing.
# Contract Testing (CONTRACT1xx)
info CONTRACT100: IPaymentGateway — 3/3 methods have contract interactions ✓
warning CONTRACT101: IEmailService has no [ContractTest]. Consider adding
contracts for email delivery verification.
# Performance Testing (PERF1xx)
info PERF100: OrderProcessingFeature performance budget: P95=200ms,
P99=500ms, 100 concurrent users ✓
warning PERF101: PasswordResetFeature has no [PerformanceTest]. Consider
testing token validation under load.
# Summary
Build succeeded with 5 warnings.
Features fully covered (all DSLs): 1/3
Features with [Verifies] coverage: 2/3
Features with property tests: 1/3
Features with mutation configs: 1/3
Features with fuzz targets: 1/3
Features with performance tests: 1/3
External services with contracts: 1/2

This is the key realization: the spec-driven Testing-as-Code specification describes 15+ testing strategies in English paragraphs. The typed approach can enforce all 15 as compiler-checked DSLs. Each strategy becomes an attribute family, a source generator, and an analyzer. Each produces specific, actionable diagnostics. Each links back to the feature it tests via typeof().
The spec-driven approach tells you "consider property-based testing for complex algorithms." The typed approach tells you "OrderProcessingFeature.OrderTotalMustBePositive has no [PropertyTest] — and here's the analyzer ID so you can configure it as error, warning, or suggestion per project."
The difference is not just enforcement vs description. It's specificity. The spec-driven document says "test complex algorithms." The analyzer says "test THIS algorithm, for THIS feature, covering THIS acceptance criterion." One is a strategy. The other is a work item.
And because every DSL follows the same pattern — attribute → source generator → analyzer — the cost of adding a new testing concern is sublinear. The first DSL (property testing) requires building the test DSL infrastructure. The second DSL (mutation testing) reuses it. By the fifth DSL (performance testing), adding a new testing concern is an afternoon's work, not a week's project.
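The shared pipeline can be modeled abstractly. In this illustrative Python sketch (all names hypothetical), the attribute → generator → analyzer infrastructure is built once, and each new testing concern is a single registration against it:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestingDsl:
    """One testing concern = an attribute family, a validator (analyzer),
    and an artifact emitter (source generator)."""
    attribute: str
    validate: Callable[[dict], list]  # returns diagnostics
    emit: Callable[[dict], str]       # returns a generated artifact

def run_pipeline(dsls, annotations):
    """Shared infrastructure: route each annotation to its DSL,
    collecting diagnostics and generated artifacts."""
    diagnostics, artifacts = [], []
    for ann in annotations:
        for dsl in dsls:
            if ann["attribute"] == dsl.attribute:
                diagnostics += dsl.validate(ann)
                artifacts.append(dsl.emit(ann))
    return diagnostics, artifacts

# Adding a new concern is one registration, not new infrastructure:
property_dsl = TestingDsl(
    "PropertyTest",
    validate=lambda a: [] if a.get("feature") else ["PROP001: missing feature"],
    emit=lambda a: f"generated property runner for {a['feature']}")
mutation_dsl = TestingDsl(
    "MutationTarget",
    validate=lambda a: [] if a.get("score", 0) >= 60 else ["MUT001: threshold too low"],
    emit=lambda a: f"stryker config for {a['feature']}")

diags, artifacts = run_pipeline(
    [property_dsl, mutation_dsl],
    [{"attribute": "PropertyTest", "feature": "OrderProcessingFeature"},
     {"attribute": "MutationTarget", "feature": "OrderProcessingFeature", "score": 85}])
print(diags, artifacts)
```

In the real system the "annotations" come from Roslyn symbol data rather than dictionaries, but the sublinear-cost argument is the same: only the two lambdas per DSL are new work.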
The Coverage Dashboard: Generated, Not Assembled
The spec-driven approach requires assembling coverage information from multiple tools. Each tool has its own report format. Building a unified dashboard requires parsing coverlet XML, Stryker HTML, k6 JSON, and Snyk reports, then correlating them manually or through a custom aggregation layer.
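To make that integration burden concrete, here is a small Python sketch of the correlation layer (the Cobertura `line-rate` attribute is real; the k6 summary shape is approximated, and the feature mapping is hand-written because nothing in either report links a number to a feature):

```python
import json
import xml.etree.ElementTree as ET

def line_rate_from_cobertura(xml_text):
    """Coverlet can emit Cobertura XML; overall line coverage is the
    'line-rate' attribute on the root <coverage> element."""
    return float(ET.fromstring(xml_text).get("line-rate"))

def p95_from_k6_summary(summary_text):
    """Approximated k6 summary-export shape; the real key layout
    varies by k6 version."""
    metrics = json.loads(summary_text)["metrics"]
    return metrics["http_req_duration"]["p(95)"]

# Stand-in report fragments (in reality: files produced by separate tool runs)
cobertura = '<coverage line-rate="0.83" branch-rate="0.71"></coverage>'
k6_summary = '{"metrics": {"http_req_duration": {"p(95)": 187.2}}}'

# The correlation step is manual: a hand-maintained table maps features
# to tool outputs, and nothing verifies it stays correct.
feature_map = {"OrderProcessingFeature": {"coverage": cobertura,
                                          "perf": k6_summary}}
dashboard = {
    feature: {"line_rate": line_rate_from_cobertura(src["coverage"]),
              "p95_ms": p95_from_k6_summary(src["perf"])}
    for feature, src in feature_map.items()
}
print(dashboard)
```

Every tool added to the pipeline adds another parser and another entry in the hand-maintained map; this is the glue code the typed approach generates away.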
The typed approach generates the dashboard from build output. Every analyzer family contributes diagnostics. The source generator aggregates them into a single report:
// Generated: TestCoverageDashboard.g.cs
public static class TestCoverageDashboard
{
public static readonly FeatureCoverage[] Features = new[]
{
new FeatureCoverage(
Feature: "OrderProcessingFeature",
AcCount: 4,
VerifiesTests: 12,
AcsCovered: 4,
PropertyTests: 1,
MutationConfigured: true,
MutationScoreThreshold: 85,
FuzzTargets: 1,
ContractTests: 3,
PerformanceTests: 1,
PerformanceBudget: "P95=200ms",
FullyCovered: true),
new FeatureCoverage(
Feature: "PasswordResetFeature",
AcCount: 3,
VerifiesTests: 6,
AcsCovered: 3,
PropertyTests: 0, // ← PROP101 warning
MutationConfigured: false, // ← MUT101 warning
MutationScoreThreshold: 0,
FuzzTargets: 0, // ← FUZZ101 warning
ContractTests: 0,
PerformanceTests: 1,
PerformanceBudget: "P95=300ms",
FullyCovered: false),
// ... one entry per feature
};
}// Generated: TestCoverageDashboard.g.cs
public static class TestCoverageDashboard
{
public static readonly FeatureCoverage[] Features = new[]
{
new FeatureCoverage(
Feature: "OrderProcessingFeature",
AcCount: 4,
VerifiesTests: 12,
AcsCovered: 4,
PropertyTests: 1,
MutationConfigured: true,
MutationScoreThreshold: 85,
FuzzTargets: 1,
ContractTests: 3,
PerformanceTests: 1,
PerformanceBudget: "P95=200ms",
FullyCovered: true),
new FeatureCoverage(
Feature: "PasswordResetFeature",
AcCount: 3,
VerifiesTests: 6,
AcsCovered: 3,
PropertyTests: 0, // ← PROP101 warning
MutationConfigured: false, // ← MUT101 warning
MutationScoreThreshold: 0,
FuzzTargets: 0, // ← FUZZ101 warning
ContractTests: 0,
PerformanceTests: 1,
PerformanceBudget: "P95=300ms",
FullyCovered: false),
// ... one entry per feature
};
}This generated class is available at compile time. A dashboard UI can read it. A CI gate can query it. An AI agent can inspect it. No parsing. No aggregation. No "which report format does this tool use?" The data model is a C# class — queryable, type-safe, and always current.
The spec-driven approach builds dashboards from tool outputs. The typed approach generates dashboards from the type system. One is integration work that breaks when a tool changes its output format. The other is generated code that's always consistent with the build.
The Test Naming Convention Trap
The spec-driven approach defines naming conventions for tests, carefully tailored per language:
- Rust: `test_function_behavior_expected`
- Python: `test_when_given_then`
- JavaScript: `should_behavior_when_condition`
- Java: `methodName_scenario_expectedResult`
- C#: `MethodName_Scenario_ExpectedResult`
This seems helpful. Consistent naming makes tests scannable, grep-able, and self-documenting. The spec-driven Testing-as-Code specification treats naming conventions as a core practice with explicit violation patterns and auto-fix suggestions.
But naming conventions are a trap. They create a brittle, human-maintained link between tests and the things they test. The typed approach makes naming irrelevant — because the [Verifies] attribute IS the link.
How Convention-Based Naming Breaks
Scenario 1: Renamed acceptance criterion.
The AC was originally "User can reset password." A product owner renames it to "User can request password recovery." In the spec-driven approach:
```csharp
// The test name references the OLD AC wording
[Fact]
public void RequestReset_ValidEmail_SendsEmail() // "Reset" not "Recovery"
{
    // ...
}
```

The test still passes. The name is now wrong — it says "Reset" but the AC says "Recovery." Nobody notices. Over months, half the tests reference old AC names and half reference new ones. The naming convention that was supposed to provide traceability now provides misinformation.
In the typed approach:
```csharp
// The AC method is renamed via IDE refactoring
// Before: nameof(PasswordResetFeature.UserCanResetPassword)
// After:  nameof(PasswordRecoveryFeature.UserCanRequestPasswordRecovery)
[Verifies(typeof(PasswordRecoveryFeature),
          nameof(PasswordRecoveryFeature.UserCanRequestPasswordRecovery))]
public void Valid_email_sends_recovery_link()
{
    // Test name doesn't matter — the attribute is the link
}
```

The IDE rename propagated the change to every [Verifies] attribute automatically. The test name can be anything — it's the attribute that provides traceability.
Scenario 2: Stale naming convention.
Six months ago, the team agreed on MethodName_Scenario_ExpectedResult. Three new developers joined. They write tests with different patterns:
```csharp
// Developer A (original convention)
public void RequestReset_ValidEmail_SendsEmail() { }

// Developer B (BDD-style)
public void Should_send_email_when_valid_email_provided() { }

// Developer C (Given-When-Then)
public void GivenValidEmail_WhenResetRequested_ThenEmailSent() { }

// Developer D (descriptive)
public void A_registered_user_requesting_password_reset_receives_an_email() { }
```

All four tests do the same thing. All four follow a "convention" — just not the same one. The naming convention document says `MethodName_Scenario_ExpectedResult`, but humans are inconsistent. No one enforces it because the convention is text, not code.
In the typed approach, all four tests can have any name they want. The traceability comes from the attribute:
```csharp
[Verifies(typeof(PasswordResetFeature),
          nameof(PasswordResetFeature.UserCanRequestPasswordResetEmail))]
public void Whatever_name_the_developer_prefers() { }
```

The name is irrelevant. The [Verifies] attribute is the single source of truth. It's not a convention to follow — it's a structural link the compiler checks.
Scenario 3: Cross-language inconsistency.
The spec-driven approach defines different naming conventions per language. A polyglot team has:
- C# tests: `MethodName_Scenario_ExpectedResult`
- Python tests: `test_when_given_then`
- JavaScript tests: `should_behavior_when_condition`
Now try to generate a traceability matrix. You need a parser for each naming convention, each language's test discovery mechanism, and a mapper that connects differently-named tests to the same AC. This is fragile, language-specific, and breaks whenever someone doesn't follow the convention perfectly.
The typed approach solves this for C# with attributes. For other languages, the same principle applies with their native mechanisms — Python decorators, JavaScript/TypeScript decorators, Rust proc macros. The point isn't "use C# attributes" — it's "use structural metadata, not naming conventions."
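The "structural metadata, not naming conventions" point can be sketched in a few lines of reflection. The `VerifiesAttribute` below is a hypothetical reconstruction matching the (feature type, AC name) shape used in the examples above, the feature and test classes are illustrative stand-ins:

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical attribute matching the (feature type, AC name) shape shown above.
[AttributeUsage(AttributeTargets.Method)]
public class VerifiesAttribute : Attribute
{
    public Type Feature { get; }
    public string AcName { get; }
    public VerifiesAttribute(Type feature, string acName) { Feature = feature; AcName = acName; }
}

public class PasswordResetFeature
{
    public void ResetLinkExpiresAfter24Hours() { }
}

public class PasswordResetTests
{
    [Verifies(typeof(PasswordResetFeature), nameof(PasswordResetFeature.ResetLinkExpiresAfter24Hours))]
    public void Any_name_at_all() { }
}

public static class TraceabilityMatrix
{
    public static void Main()
    {
        // No name parsing, no per-language regex: the matrix is a reflection query
        // over the attributes themselves.
        var links =
            from m in typeof(PasswordResetTests).GetMethods()
            from v in m.GetCustomAttributes<VerifiesAttribute>()
            select $"{m.Name} -> {v.Feature.Name}.{v.AcName}";
        foreach (var link in links) Console.WriteLine(link);
    }
}
```

The same query works whatever the test methods are named, which is exactly the property naming-convention parsers lack.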
The Comparison
| Dimension | Naming Convention (Spec-Driven) | [Verifies] Attribute (Typed) |
|---|---|---|
| Link type | Implicit (name encodes meaning) | Explicit (attribute declares link) |
| Refactoring | Manual rename across tests | IDE propagation via nameof() |
| Enforcement | Code review (human) | Compiler (automated) |
| Consistency | Depends on team discipline | Structural — always consistent |
| Cross-language | Different convention per language | Same pattern per language's metadata |
| Stale detection | None (stale names compile fine) | Compile error (nameof fails) |
| Traceability matrix | Requires name-parsing heuristics | Exact: attribute → feature → AC |
| New team members | Must read and memorize convention | Must add attribute (compiler reminds) |
| Grep-ability | Good (names are searchable) | Better (attribute is searchable AND precise) |
| Wrong link detection | Impossible (name can lie) | Compile error (wrong AC = CS0117) |
The spec-driven naming convention is a social contract: "we all agree to name tests this way." Social contracts are valuable but fragile. They work when teams are small, stable, and disciplined. They break when teams grow, rotate, and face deadline pressure.
The [Verifies] attribute is a structural contract: "the compiler verifies this test covers this AC." Structural contracts don't depend on discipline. They work regardless of team size, turnover, or deadline pressure. The compiler doesn't get tired. The compiler doesn't forget the convention. The compiler doesn't join the team six months late and use a different pattern.
This is the general principle applied to a specific domain: conventions describe expectations; types enforce them. Naming conventions are conventions. Attributes are types. In a system where correctness matters, types win.
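The "stale detection" mechanism behind this is simply `nameof`. A short sketch, using the illustrative feature class from the earlier examples; the commented-out line shows what stops compiling after a rename:

```csharp
using System;

public class PasswordRecoveryFeature
{
    public void UserCanRequestPasswordRecovery() { }
}

public static class StaleLinkDemo
{
    public static void Main()
    {
        // nameof is resolved by the compiler: either the member exists,
        // or the build fails.
        Console.WriteLine(nameof(PasswordRecoveryFeature.UserCanRequestPasswordRecovery));

        // After the AC method was renamed, a stale reference like the line
        // below would no longer compile (error CS0117: no such member):
        // Console.WriteLine(nameof(PasswordRecoveryFeature.UserCanResetPassword));
    }
}
```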
The Convention Graveyard
Every team that has existed for more than two years has a convention graveyard — a collection of abandoned, contradictory, or partially-followed naming conventions. The graveyard grows because conventions have no deprecation mechanism. When the team switches from MethodName_Scenario_ExpectedResult to BDD-style Should_behavior_when_condition, the old tests keep the old convention. Nobody renames 400 existing tests. The convention document gets updated; the codebase doesn't.
```text
The Convention Graveyard:

Year 1 (3 developers):
  Convention: MethodName_Scenario_ExpectedResult
  Tests following convention: 100% (200 tests)
  Enforcement: Code review (3 people, consistent)

Year 2 (6 developers, 2 new):
  Convention: MethodName_Scenario_ExpectedResult
  Tests following convention: 85% (340 of 400 tests)
  New developers sometimes use: Should_behavior_when, GivenWhenThen
  Enforcement: Code review (inconsistent, reviewers disagree)

Year 3 (10 developers, 4 new):
  Convention: "We use MethodName_Scenario_ExpectedResult"
  Reality: 60% old convention, 25% BDD, 10% Given-When-Then, 5% random
  New convention document: "Use BDD style: Should_behavior_when_condition"
  Old tests: Not renamed (too many, too risky)
  Enforcement: Aspirational

Year 4 (15 developers, 6 new):
  Convention: "Check the wiki" (wiki has 3 conflicting entries)
  Reality: Every developer uses their own style
  Test-to-AC traceability: Impossible (no consistent pattern to parse)
  Enforcement: None
```

The typed approach has no convention graveyard. There's nothing to rename, nothing to migrate, nothing to deprecate. The [Verifies] attribute is the link. It was the link on Day 1 and it's the link on Day 1,000. The test method can be named anything — Test1, ShouldWork, A_very_descriptive_name_that_explains_the_scenario_in_detail — and the traceability is identical.
This isn't a trivial advantage. Test naming conventions are one of the most common sources of technical debt in test suites. Teams spend hours in code reviews debating names. They write linting rules to enforce naming patterns. They build custom tools to extract traceability from test names. All of this effort is eliminated by a single attribute. The convention that requires no convention is the best convention.
When Naming Conventions Actively Mislead
The worst case isn't inconsistent naming — it's consistently wrong naming. A test named RequestReset_ValidEmail_SendsEmail that was later refactored to test token validation, but never renamed:
```csharp
// The name says: tests that a valid email sends an email
// The test actually: validates that expired tokens are rejected
// The disconnect: invisible to conventions, caught by nobody
[Fact]
public void RequestReset_ValidEmail_SendsEmail()
{
    var token = new ResetToken(Guid.NewGuid(), DateTime.UtcNow.AddHours(-25));
    _tokenStore.Setup(t => t.Find(token.Id)).Returns(token);

    var result = _service.ValidateResetToken(token.Id);

    Assert.False(result.IsSuccess);
    Assert.Contains("expired", result.Error);
}
```

The naming convention says this test covers "valid email sends email." The test actually covers "expired token is rejected." A traceability tool that parses names will map this test to the wrong AC. The coverage report will show "valid email sending" as tested and "token expiry" as untested — the exact opposite of reality.
With [Verifies]:
```csharp
[Verifies(typeof(PasswordResetFeature),
          nameof(PasswordResetFeature.ResetLinkExpiresAfter24Hours))]
public void RequestReset_ValidEmail_SendsEmail() // Name is wrong, but irrelevant
{
    // ... same test body
}
```

The name is wrong, but the attribute is right. The traceability system maps this test to ResetLinkExpiresAfter24Hours — the correct AC. The coverage report is accurate. The misleading name is a cosmetic issue, not a structural one.
The typed approach separates the concern of "what does this test verify?" from "what is this test called?" Naming is for humans (readability). Linking is for the system (traceability). Conflating the two — using the name as the link — guarantees that one will be wrong when the other changes.
Summary
| Dimension | Spec-Driven Testing | Typed Specification Testing |
|---|---|---|
| Scope | 15+ strategies, comprehensive | Requirement-to-test chain only |
| Guidance | Excellent (principles, practices, patterns) | Minimal (compiler diagnostics) |
| Enforcement | Coverage thresholds (line, branch, function) | Per-AC requirement coverage (REQ3xx) |
| Stale detection | None | REQ302 (stale [Verifies] reference) |
| Missing detection | Low coverage flag (line-level) | Specific AC diagnostic (REQ301) |
| Granularity | File/class/method level | Acceptance criterion level |
| Learning | Read the document | Use the system |
| Language support | Rust, Python, JS, Java, C#, C++, Go | C# (with .NET ecosystem) |
| Testing DSLs | Described (text) | Enforced (compiler-checked attributes) |
| Naming | Convention-based (social contract) | Attribute-based (structural contract) |
Part VI examines the broader validation question: quality gates vs Roslyn analyzers.