The Loop: Claude Code + Quality Gates as a Self-Reinforcing Test Augmentation Cycle
An AI that writes tests without a quality gate is a random text generator. A quality gate without an AI to feed it is a number nobody reads. Put them together and you get something neither can do alone: a self-reinforcing cycle that grinds coverage upward until the gate says stop.
The Problem: The Coverage Plateau
Every team I've worked with hits the same wall. You adopt a testing culture, write unit tests for your new features, maybe even enforce coverage in CI. Coverage climbs to 60%, then 70%. Then it stops.
Not because anyone decided to stop. Because the remaining 30% is the hard part — the error paths nobody wants to think about, the branch conditions buried in complex state machines, the edge cases that require elaborate setup. Writing tests for the easy 70% is pleasant. Writing tests for the next 20% is tedious. Writing tests for the last 10% feels like punishment.
So the coverage report becomes write-only. Generated every build, read by nobody. The dashboard exists. The number doesn't move.
And here's the cruel part: even when coverage is "high," it can lie. Line coverage tells you "this line was executed during a test." It does not tell you "this test would fail if this line changed." A test that calls a function without asserting anything achieves 100% line coverage and detects exactly zero bugs. That's where mutation testing comes in — and where the gap between "covered" and "tested" becomes visible.
The real problem isn't that teams lack testing tools. It's that someone needs to:
- Read the coverage report
- Understand which lines are uncovered and why
- Read the source code to understand the logic
- Write a test that exercises the missing path
- Run it, check coverage again
- Repeat
That "someone" is expensive, gets bored, and has features to ship.
The Insight: Close the Loop
What if that "someone" isn't a someone?
The ingredients have been sitting on the table for years:
- Quality gates produce machine-readable output — JSON coverage reports, XML Cobertura files, Stryker mutation JSON, structured summaries with per-method metrics
- Claude Code can read files, understand code context, write code, run commands, and read the output — all inside the project, not in a browser window
- Thresholds define an objective target — not "write more tests" but "reach 95% branch coverage and 80% mutation score"
The missing connection was always: who reads the report and acts on it?
The answer: the same AI agent that can read the code. The quality gate is the objective. Claude is the executor. The human is the architect who sets the bar and reviews the output.
┌─────────────────────────────────────────────────────────────────┐
│ THE LOOP │
│ │
│ Human sets threshold │
│ │ │
│ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌──────────────────┐ │
│ │ Claude │────►│ Run tests │───►│ Read coverage + │ │
│ │ writes │ │ + collect │ │ mutation reports │ │
│ │ tests │ │ metrics │ │ (machine- │ │
│ └───────────┘ └───────────┘ │ readable JSON) │ │
│ ▲ └────────┬─────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ Gate passes? │ │
│ │ └───────┬──┬───────┘ │
│ │ NO │ │ YES │
│ │ │ │ │
│ │ ┌─────────────────┐ │ ▼ │
│ │ │ Identify │◄──────┘ ┌────────────┐ │
│ └─────────│ uncovered lines │ │ Done. │ │
│ │ + branches │ │ Human │ │
│ └─────────────────┘ │ reviews + │ │
│ │ ratchets │ │
│ └────────────┘ │
└─────────────────────────────────────────────────────────────────┘
This is The Loop. It's not a framework. It's not a product. It's a workflow pattern I've used hundreds of times across a 57-project .NET monorepo and a TypeScript CV website. It works because each component does what it's best at:
- The human decides what quality means (thresholds, which metrics matter, when to ratchet)
- The quality gate measures objectively (no opinions, no fatigue, no shortcuts)
- Claude does the grinding (reads reports, reads code, writes tests, iterates)
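The cycle above can be sketched as a small driver. This is a minimal sketch, not real tooling: `runGate`, `identifyGaps`, and `writeTests` are hypothetical stand-ins for the quality-gate CLI, its report parser, and the agent respectively.

```typescript
// Minimal sketch of The Loop as a driver. All three callbacks are
// hypothetical stand-ins: runGate wraps a quality-gate command,
// identifyGaps parses its machine-readable report, and writeTests
// is the agent acting on those gaps.
type GateResult = { passed: boolean; score: number; report: unknown };

interface LoopDeps {
  runGate: () => GateResult;                   // e.g. wraps `dotnet quality-gate test`
  identifyGaps: (report: unknown) => string[]; // uncovered lines, surviving mutants
  writeTests: (gaps: string[]) => void;        // the agent writes targeted tests
}

function runLoop(deps: LoopDeps, maxIterations = 10): GateResult {
  let result = deps.runGate();
  let iterations = 0;
  // Iterate until the gate passes or the safety cap is reached.
  while (!result.passed && iterations < maxIterations) {
    const gaps = deps.identifyGaps(result.report);
    deps.writeTests(gaps);
    result = deps.runGate();
    iterations++;
  }
  return result;
}
```

The `maxIterations` cap matters: the objective threshold is the loop's termination condition, and the cap is the fallback when the gap is too large for one session.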
The Setup
In my .NET ecosystem, the quality gate is FrenchExDev.Net.QualityGate — a Roslyn-powered tool that analyzes syntax trees and enforces thresholds for complexity, coupling, cohesion, coverage, and mutation testing. Each project has a quality-gate.yml:
# quality-gate.yml — Injectable project
solution: FrenchExDev.Net.Injectable.slnx
gates:
max-cyclomatic-complexity: 15
max-cognitive-complexity: 20
max-class-coupling: 55
max-inheritance-depth: 5
min-maintainability-index: 55
max-lcom: 25
max-distance-from-main-sequence: 1.0
min-test-quality-score: 0.80 # Combined coverage + mutation
That last line — min-test-quality-score: 0.80 — is the key. It's not just coverage. It's a combined score of line coverage and mutation testing. You can't game it with empty assertions.
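The session transcript below reports 78% line coverage and 64% mutation score combining to 0.71, which is consistent with a plain average. Here is a sketch of that combination — the unweighted mean is an assumption; the real QualityGate formula may weight the two differently:

```typescript
// Hedged sketch: combine line coverage and mutation score into one
// test-quality score. An unweighted mean reproduces the numbers in
// the session transcript (0.78, 0.64 -> 0.71); the actual
// QualityGate weighting is an assumption here.
function testQualityScore(lineCoverage: number, mutationScore: number): number {
  return (lineCoverage + mutationScore) / 2;
}

function gatePasses(score: number, threshold = 0.80): boolean {
  return score >= threshold;
}
```

With coverage at 0.78 and mutation at 0.64 this yields 0.71 and the 0.80 gate fails; lifting the mutation score to 0.86 would bring the combined score to 0.82 and pass.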
A Typical Session
Here's what The Loop looks like in practice. I open Claude Code in the Injectable project directory:
me: Run the quality gate. Let's see where we stand.
claude: Running `dotnet quality-gate test`...
┌─────────────────────────────────────────────────┐
│ QualityGate Report — Injectable │
├─────────────────────────────────────────────────┤
│ Cyclomatic Complexity 12/15 ✓ PASS │
│ Cognitive Complexity 18/20 ✓ PASS │
│ Class Coupling 42/55 ✓ PASS │
│ LCOM4 8/25 ✓ PASS │
│ Test Quality Score 0.71/0.80 ✗ FAIL │
│ Line Coverage 78% │
│ Mutation Score 64% │
└─────────────────────────────────────────────────┘
Gate FAILED. Test quality score is 0.71, threshold is 0.80.
Reading report.json for details...
Lowest coverage areas:
- InjectableEmitter.cs: 62% lines, 48% branches
Missing: lines 87-112 (error handling for malformed attributes)
Missing: lines 145-163 (generic type parameter resolution)
- InterfaceContractModel.cs: 71% lines, 55% branches
Missing: lines 34-52 (nullable reference type handling)
Surviving mutants (12 total):
- InjectableEmitter.cs:94 — `if (x > 0)` mutated to `if (x >= 0)` — SURVIVED
- InjectableEmitter.cs:103 — `return null` mutated to `return ""` — SURVIVED
- InterfaceContractModel.cs:41 — `== null` mutated to `!= null` — SURVIVED
...
me: Fix InjectableEmitter first. Follow the existing test patterns.
claude: Reading InjectableEmitter.cs...
Reading existing test file InjectableEmitterTests.cs...
I see the test pattern: in-memory Roslyn compilations with
fake source text, assertions on emitted code via
`VerifyGeneratedOutput()`. Let me add tests for the missing
paths.
[writes 6 new test methods targeting error handling + generics]
Running `dotnet test`... 47 passed, 0 failed.
Running `dotnet quality-gate test`...
Test Quality Score: 0.76/0.80 — still below threshold.
3 mutants still surviving in InjectableEmitter.cs.
The surviving mutant at line 94 needs a boundary test.
The mutation changes `> 0` to `>= 0`, so I need a test
where the parameter count is exactly 0.
[writes boundary test for parameterCount == 0]
Running `dotnet quality-gate test`...
Test Quality Score: 0.82/0.80 ✓ PASS
Three iterations. Maybe 15 minutes. The quality gate went from 0.71 to 0.82. The human (me) made one decision: "Fix InjectableEmitter first." Claude did the rest.
The .NET Flow
Why It Works for .NET
Roslyn gives Claude precise context. The QualityGate report.json doesn't just say "coverage is low." It says "method EmitRegistration in InjectableEmitter.cs has 48% branch coverage, and mutant at line 94 survived because no test exercises the zero-parameter path." That's actionable. Claude reads the source, understands the branch, and writes the test.
The test infrastructure matters too. Projects in my ecosystem use in-memory Roslyn compilations — you create fake source text, run the generator, and assert on the emitted code. No filesystem, no process spawning, no flakiness. Claude learns this pattern from existing tests and replicates it. The patterns come from the human. The volume comes from Claude.
The TypeScript Flow
The Setup
This website — the one you're reading — has a comprehensive test suite with Vitest coverage gates that act as the quality gate:
// vitest.config.js — the quality gate for this website
thresholds: {
// Pure state machine logic — high bar
'src/lib/**/*.ts': {
statements: 98,
branches: 95,
functions: 98,
lines: 99,
},
// Build pipeline — total coverage via dependency injection
'scripts/build-static.js': {
statements: 100,
branches: 100,
functions: 100,
lines: 100,
},
}
98% statements. 95% branches. 99% lines. For the build pipeline: 100% across the board. These aren't aspirational — they're enforced. npx vitest run returns exit code 1 if any threshold is violated.
A Typical Session
me: I just refactored the scroll spy state machine. Run
the tests, let's see the damage.
claude: Running `npx vitest run --coverage`...
Test Files 144 passed (144)
Tests 2812 passed (2812)
Coverage:
src/lib/scroll-spy.ts
Statements: 94.2% (threshold: 98%) ✗
Branches: 88.1% (threshold: 95%) ✗
Functions: 100% (threshold: 98%) ✓
Lines: 94.8% (threshold: 99%) ✗
Coverage thresholds not met. Reading coverage report...
Uncovered branches in scroll-spy.ts:
- Line 47: `else` branch when headingElements is empty
- Line 63: early return when scrollY < firstHeadingOffset
- Lines 78-82: fallback when no heading is in viewport
- Line 91: `else` branch when activeId === previousId
me: Cover those branches. Check the existing scroll-spy
tests for the pattern.
claude: Reading test/unit/scroll-spy.test.ts...
The existing tests use a mock DOM factory:
`createMockHeadings(count, offsets)` that returns fake
elements with getBoundingClientRect stubs. I'll follow
this pattern.
[writes 4 new test cases]
Running `npx vitest run --coverage`...
src/lib/scroll-spy.ts
Statements: 98.5% ✓
Branches: 95.2% ✓
Functions: 100% ✓
Lines: 99.1% ✓
All thresholds met.
me: Now run compliance.
claude: Running `npx tsx scripts/compliance-report.ts --strict`...
97 features, 829 acceptance criteria, 100% coverage.
Total tests linked to ACs: 2812 (2757 unit + 55 e2e)
Runtime coverage warnings: 0
Unbound features: 0
Orphan source files: 0
Quality gate: PASS
Two iterations for the coverage gate. One pass for compliance. The refactored state machine is tested, the thresholds hold, and the typed specifications confirm that every acceptance criterion is still linked to a test.
Beyond Coverage: The Compliance Scanner
The Vitest thresholds catch line and branch coverage. But coverage doesn't tell you whether you're testing the right things. That's what the compliance scanner does.
This website has 97 typed feature specifications with 829 acceptance criteria, verified by 2812 tests (2757 unit + 55 e2e). Each test is linked to features via @Implements decorators. The compliance scanner reads the features, scans test files for decorator references, and builds a coverage matrix:
ID Title Total Covered TU E2E % src
──────────────────────────────────────────────────────────────────────────────────────────────
✓ NAV SPA Navigation + Deep Links 8 8 4 4 100% src 100% (1 file)
✓ THEME Theme Switching 5 5 5 0 100% src 100% (1 file)
✓ SEARCH Search 5 5 5 0 100% src 100% (1 file)
✓ SPY Scroll Spy 12 12 6 6 100% src 100% (1 file)
✓ HOT-RELOAD WebSocket Hot Reload 43 43 43 0 100% src 100% (7 files)
✓ TEST-BINDINGS-INF Test-driven bindings inference 23 23 23 0 100% src 100% (1 file)
...
Features: 97 active
Acceptance criteria: 829/829 ACs covered (100%)
Total tests linked to ACs: 2812 (2757 unit + 55 e2e)
Orphan source files: 0
Quality gate: PASS
If Claude writes tests that satisfy the coverage gate but miss an acceptance criterion, the compliance scanner catches it. Two gates, two dimensions: coverage measures breadth, compliance measures intent. The Handoff article goes deeper into how the AST scanner infers the traceability graph transitively — proving that each test actually calls the code it claims to verify.
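The linkage mechanism can be sketched as a registry from acceptance-criteria ids to tests. The real @Implements decorator in the site's codebase surely differs; this sketch uses a plain function and illustrative ids (e.g. "SPY-1") to stay self-contained:

```typescript
// Hedged sketch of an @Implements-style link between tests and
// acceptance criteria: a registry mapping AC ids to the tests that
// claim to verify them. Names and ids here are illustrative, not
// the real decorator from the article's codebase.
const acRegistry = new Map<string, string[]>();

// Register that a named test covers the given acceptance criteria.
function implementsAc(testName: string, ...acIds: string[]): void {
  for (const id of acIds) {
    const tests = acRegistry.get(id) ?? [];
    tests.push(testName);
    acRegistry.set(id, tests);
  }
}

// The compliance scan then reduces to: which ACs have no linked test?
function uncoveredAcs(allAcIds: string[]): string[] {
  return allAcIds.filter((id) => !acRegistry.has(id));
}
```

The scanner's "829/829 ACs covered" line is exactly `uncoveredAcs(allAcIds).length === 0`, computed over the full feature set.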
Why This Works: The Three Prerequisites
The Loop isn't magic. It works because three conditions are met simultaneously:
| Prerequisite | .NET Example | TypeScript Example | Why It Matters |
|---|---|---|---|
| Machine-readable reports | report.json with per-method metrics | V8 coverage JSON + compliance JSON | Claude needs structured data, not a dashboard screenshot |
| Code-reading AI agent | Claude reads Roslyn-analyzed source | Claude reads TS modules + test files | Not a chatbot — an agent that works inside the project |
| Objective threshold | min-test-quality-score: 0.80 in YAML | branches: 95 in vitest.config.js | The gate defines "done" — without it, the loop has no termination condition |
Remove any one of these and the system breaks:
- No machine-readable reports? Claude can't know what's missing. It would have to guess, and guessing means writing redundant tests that cover already-covered paths.
- No code-reading agent? The reports exist but nobody acts on them. We're back to write-only dashboards.
- No objective threshold? The loop has no termination condition. "Write more tests" is not a goal — "reach 95% branch coverage" is.
This is why dotnet quality-gate check outputs JSON and why Vitest has a json-summary reporter. Machine-readable output isn't a nice-to-have. It's the interface between the gate and the agent.
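As a concrete example of that interface, here is a sketch of reading Vitest's json-summary output (coverage/coverage-summary.json, the istanbul format with a "total" entry plus per-file percentage fields) to list the files an agent should target next:

```typescript
// Sketch: consume istanbul/Vitest `json-summary` output and list
// the files whose branch coverage sits below a threshold — the
// exact work list an agent needs for the next iteration.
type Metric = { total: number; covered: number; skipped: number; pct: number };
type FileSummary = { lines: Metric; statements: Metric; functions: Metric; branches: Metric };
type CoverageSummary = Record<string, FileSummary>;

function filesBelowBranchThreshold(summary: CoverageSummary, threshold: number): string[] {
  return Object.entries(summary)
    .filter(([file]) => file !== "total") // skip the aggregate entry
    .filter(([, m]) => m.branches.pct < threshold)
    .map(([file]) => file);
}

// Usage, assuming the default Vitest output path:
//   const raw = readFileSync("coverage/coverage-summary.json", "utf8");
//   console.log(filesBelowBranchThreshold(JSON.parse(raw), 95));
```

A summary percentage per file is enough to pick the target; the per-line detail comes from the companion coverage-final.json report.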
The Ratchet: Thresholds Only Tighten
Quality gates are not aspirational goals. They are ratchets. They only move in one direction: up.
The Loop accelerates the ratchet. Before The Loop, tightening a threshold meant a human had to write the missing tests. That's a cost in time and motivation. Now, tightening a threshold means telling Claude "the bar is now 85%." The cost is one sentence.
Here's what the progression looks like on a real project:
| Week | Threshold | Before | After | Iterations | Human Intervention |
|---|---|---|---|---|---|
| Week 1 | 60% | 52% | 63% | 2 | Set initial threshold |
| Week 3 | 75% | 63% | 77% | 3 | Ratcheted to 75%, reviewed new tests |
| Week 5 | 85% | 77% | 87% | 4 | Ratcheted to 85%, rewrote 2 naive tests |
| Week 7 | 95% | 87% | 96% | 5 | Ratcheted to 95%, added property-based tests |
Notice the pattern: as the threshold climbs, the iterations increase. The easy coverage is fast. The hard coverage takes more cycles and more human review. That's expected — and that's where the human's judgment matters most.
The human reviews every ratchet step. Some tests Claude writes at the 60% level are acceptable. At the 95% level, you're in edge-case territory where semantic correctness matters more than structural coverage. That's when I rewrite tests, add property-based invariants with fast-check, or redesign the test strategy entirely.
The Loop doesn't replace the human. It changes what the human does — from writing tests to reviewing and directing them.
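The ratchet itself is mechanical enough to automate. A minimal sketch, under the assumption that the next threshold is the measured coverage minus a small slack margin, floored, and never below the current threshold:

```typescript
// Hedged sketch of a coverage ratchet: the threshold only tightens.
// The candidate is the measured value minus a slack margin (to
// absorb incidental fluctuation), floored to a whole percentage;
// Math.max guarantees the ratchet never loosens.
function ratchet(currentThreshold: number, measured: number, slack = 1): number {
  const candidate = Math.floor(measured - slack);
  return Math.max(currentThreshold, candidate);
}
```

With the Week 1 numbers from the table, ratchet(60, 63) proposes 62; if measurement ever dips below the threshold, the function returns the threshold unchanged rather than lowering the bar.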
Addressing Skepticism
I've heard every objection. Let me address them head-on.
"AI just generates trivial assertions"
Without a quality gate, yes. Claude will happily write expect(result).toBeDefined() and call it a day. That's why the quality gate exists.
But the real answer is mutation testing. A trivial assertion lets mutants survive. Consider:
// Source code
public int CalculateDiscount(int quantity)
{
if (quantity > 10)
return quantity * 2;
return quantity;
}
A naive test:
[Fact]
public void CalculateDiscount_Returns_Value()
{
var result = CalculateDiscount(15);
Assert.True(result > 0); // trivial — always true for positive input
}
This achieves 100% line coverage. But Stryker mutates quantity > 10 to quantity >= 10 — and the test still passes. The mutant survives. The min-test-quality-score gate fails.
Claude reads the Stryker report, sees the surviving mutant, and writes:
[Fact]
public void CalculateDiscount_BoundaryAt10_NoDiscount()
{
var result = CalculateDiscount(10);
Assert.Equal(10, result); // boundary: exactly 10 → no discount
}
[Fact]
public void CalculateDiscount_Above10_DoubleDiscount()
{
var result = CalculateDiscount(11);
Assert.Equal(22, result); // 11 → 11 * 2 = 22
}
Now the mutant dies. The gate passes. Mutation testing is the antidote to trivial assertions, and Claude reads mutation reports as naturally as coverage reports.
"AI doesn't understand the business domain"
Correct. Claude doesn't know that CalculateDiscount is a pricing rule with tax implications. That's why the human designs the architecture, writes the typed feature specifications, and reviews the output.
Claude writes tests that satisfy structural quality gates — coverage and mutation scores. The human ensures semantic quality — that the right behaviors are tested, that the assertions match business intent, and that the typed specs link tests to acceptance criteria.
The division of labor is clear:
BEFORE (All Human) AFTER (The Loop)
═══════════════════ ═════════════════════
Human decides what to test Human decides what to test
Human writes the test Claude writes the test
Human runs the test Claude runs the test
Human reads the report Claude reads the report
Human fixes the gap Claude fixes the gap
Human gets bored at 70% Claude doesn't get bored
Coverage plateaus Coverage meets gate
Human reviews + approves
The first line is identical. The human always decides what to test. The Loop automates the tedious part: writing, running, reading, fixing.
"AI tests are brittle"
Tests that test implementation details are brittle regardless of who writes them. If your project tests mock.Verify(x => x.CallDatabase(), Times.Exactly(3)), those tests will break when you refactor — whether a human or an AI wrote them.
Claude follows the project's existing test patterns. If the project uses fakes and in-memory compilations (as QualityGate does), Claude writes tests that way. If the project uses dependency-injected mock I/O (as this website's build pipeline does), Claude uses that pattern. The conventions come from the human. The volume comes from Claude.
Want non-brittle tests? Write non-brittle patterns. Claude will replicate them.
"You're just gaming coverage numbers"
This objection assumes coverage is the only metric. It's not.
The QualityGate's min-test-quality-score averages coverage and mutation score. The website's compliance scanner checks that every acceptance criterion has a linked test. Together, they form a three-dimensional quality measure:
- Coverage — "was this code executed during tests?"
- Mutation score — "would this test catch a bug in this code?"
- Compliance — "are we testing the right features?"
You cannot game all three simultaneously. High coverage with low mutation score means weak assertions — the gate fails. High mutation score with missing compliance means you're testing the wrong things — the scanner fails. The Loop converges on genuine test quality because the gates measure it from multiple angles.
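The combined verdict can be sketched as a single function over the three dimensions. The averaging of coverage and mutation mirrors the min-test-quality-score gate described earlier; the exact combination rule is illustrative:

```typescript
// Sketch of the three-dimensional gate decision. Inputs are ratios
// in [0, 1] plus acceptance-criteria counts; the coverage+mutation
// average against 0.80 mirrors the article's gate, and compliance
// requires every AC to have a linked test. The combination rule
// itself is an illustrative assumption.
interface QualityInputs {
  lineCoverage: number;  // "was this code executed during tests?"
  mutationScore: number; // "would this test catch a bug in this code?"
  acsCovered: number;    // acceptance criteria with a linked test
  acsTotal: number;      // "are we testing the right features?"
}

function gateVerdict(q: QualityInputs, scoreThreshold = 0.80): string[] {
  const failures: string[] = [];
  if ((q.lineCoverage + q.mutationScore) / 2 < scoreThreshold) {
    failures.push("test-quality-score");
  }
  if (q.acsCovered < q.acsTotal) {
    failures.push("compliance");
  }
  return failures; // empty array => gate passes
}
```

Gaming one dimension shows up in another: perfect line coverage with weak assertions still fails on the averaged score, because the mutation term drags it down.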
The Compound Effect: 57 Projects
On a single project, this is useful. Across a 57-project monorepo, it's transformative.
Each project in FrenchExDev has its own quality-gate.yml. Each has its own thresholds, calibrated to the project's maturity. A new library starts at 60%. A stable library sits at 95%. Claude reads the project structure, finds the existing test patterns, and adapts.
me: Run quality gates across all Injectable projects.
claude: [runs dotnet quality-gate check for each project]
Results:
┌──────────────────────────────────────┬──────────┬───────────┐
│ Project │ Score │ Status │
├──────────────────────────────────────┼──────────┼───────────┤
│ Injectable.Lib │ 0.91 │ ✓ PASS │
│ Injectable.Abstractions │ 0.88 │ ✓ PASS │
│ Injectable.SourceGenerator.Lib │ 0.74 │ ✗ FAIL │
│ Injectable.Microsoft.SourceGenerator │ 0.68 │ ✗ FAIL │
│ Injectable.Tests │ N/A │ test proj │
└──────────────────────────────────────┴──────────┴───────────┘
2 projects below threshold (0.80):
- SourceGenerator.Lib: 0.74 (gap: 0.06)
- Microsoft.SourceGenerator: 0.68 (gap: 0.12)
me: Fix SourceGenerator.Lib first. Smaller gap.
claude: Reading SourceGenerator.Lib source and test files...
Found 23 existing tests in InjectableCollectorTests.cs
and EmitterTests.cs. Pattern: in-memory compilation
with SyntaxFactory, assertions on generated source text.
[iteration 1: writes 8 tests for uncovered branches]
Score: 0.78. Still 2 surviving mutants.
[iteration 2: writes boundary tests for mutant at line 112]
Score: 0.81. Gate passes.
Moving to Microsoft.SourceGenerator...
Reading source and test files...
Found 15 existing tests. Same compilation pattern,
plus Microsoft DI container assertions.
[iteration 1: writes 11 tests]
Score: 0.75.
[iteration 2: writes 6 more tests targeting nullable handling]
Score: 0.79.
[iteration 3: writes property-based test for generic resolution]
Score: 0.83. Gate passes.
me: Good. Review the property-based test before I commit.
Five iterations total across two projects. Claude read the patterns from existing tests, wrote new ones that matched, and iterated until both gates passed. My contribution: two sentences of direction and a final review.
This is what it looks like to expand what a single developer can realistically build and maintain. Not replacing judgment — compressing the distance between deciding what quality means and achieving it.
What Claude Gets Right (and Wrong)
Intellectual honesty demands this section. The Loop isn't perfect.
What Claude Gets Right
- Pattern replication. Give Claude 5 tests as examples and it will write 50 more in the same style. Test patterns are highly regular, which plays to AI's strength.
- Coverage grinding. Claude will patiently write the 47th test for the 47th branch without complaining. Humans check out mentally around test #12.
- Report reading. Claude parses JSON coverage reports and Stryker mutation reports accurately. It maps line numbers to source code, identifies the specific branch or mutant, and targets the gap.
- Boundary detection. Once Claude reads a mutation report showing `> 0` mutated to `>= 0`, it reliably writes boundary tests at 0, -1, 1, `int.MaxValue`. The mutation report teaches it what matters.
What Claude Gets Wrong
- Semantic assertions. Claude will assert that a function returns "something" but may not assert the right "something." At low thresholds (60-75%), this is fine — you're building coverage mass. At high thresholds (90%+), you need to review assertions for business meaning.
- Over-mocking. If existing tests use mocks, Claude will use mocks everywhere — even when an integration test would be more valuable. The human needs to set the pattern correctly.
- Test naming. Claude writes descriptive but sometimes redundant test names. I regularly rename tests during review. This is cosmetic, not structural.
- Complex state setup. For state machines with 15+ states and complex transition guards, Claude sometimes writes tests that achieve coverage through unrealistic state combinations. The quality gate catches this indirectly (surviving mutants), but human review catches it faster.
The pattern is clear: Claude handles structural quality (coverage, mutation killing) well. Semantic quality (does this test make business sense?) requires human review. The Loop doesn't eliminate the human — it changes what the human does.
What You Need to Run The Loop
1. A Machine-Readable Quality Gate
For .NET: FrenchExDev.Net.QualityGate with report.json output. Or any tool that produces Cobertura XML + Stryker JSON.
For JavaScript/TypeScript: Vitest with json-summary reporter and coverage thresholds in vitest.config.js. Or Jest with --coverageReporters=json-summary and coverageThreshold in jest.config.js.
For other stacks: Any tool that (a) produces machine-readable output and (b) returns exit code 1 on failure.
The key requirement: the report must be granular enough for Claude to identify specific uncovered lines and branches. A summary that says "coverage is 72%" is useless. A report that says "line 47 of scroll-spy.ts: uncovered branch in else clause" is actionable.
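That granularity is exactly what istanbul's detailed report (coverage-final.json, as emitted by Vitest and nyc) provides: a branchMap with source locations and a parallel array of hit counts. A sketch of extracting the "line 47: uncovered branch" list from it:

```typescript
// Sketch: walk istanbul's coverage-final.json for one file and list
// branches with a zero hit count, with their line numbers — the
// actionable granularity The Loop needs. Field names follow the
// istanbul format (branchMap/b); types are simplified here.
type Loc = { start: { line: number; column: number }; end: { line: number; column: number } };
type FileCoverage = {
  branchMap: Record<string, { type: string; locations: Loc[] }>;
  b: Record<string, number[]>; // hit count per branch location
};

function uncoveredBranches(cov: FileCoverage): { line: number; type: string }[] {
  const out: { line: number; type: string }[] = [];
  for (const [id, branch] of Object.entries(cov.branchMap)) {
    const hits = cov.b[id] ?? [];
    hits.forEach((count, i) => {
      if (count === 0 && branch.locations[i]) {
        out.push({ line: branch.locations[i].start.line, type: branch.type });
      }
    });
  }
  return out;
}
```

Each entry in the result maps directly to a missing test: an agent reads the source at that line, identifies the unexercised side of the branch, and writes a case for it.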
2. Claude Code with Project Access
Not a chatbot. Not copy-paste from a browser. Claude Code — the CLI agent that reads files, runs commands, writes code, and iterates. The Loop requires an agent that can:
- Run test commands (dotnet test, npx vitest run)
- Read coverage report files (JSON, XML)
- Read source code to understand what needs testing
- Write test files following existing patterns
- Re-run tests to verify
All of this happens inside the project. Claude sees the same files you do.
3. Existing Test Patterns
Claude needs patterns to follow. If you have zero tests, write the first 10 yourself. Establish:
- How test files are organized (one per module? one per feature?)
- What assertion style you use (xUnit? NUnit? Vitest's expect?)
- How dependencies are handled (mocks? fakes? DI?)
- How test data is created (factories? builders? literals?)
Claude will replicate whatever you give it. The quality of The Loop's output is directly proportional to the quality of the patterns you seed it with.
4. Start Low, Ratchet Up
Don't set 95% on day one. If your current coverage is 52%, set the threshold at 60%. Run The Loop. Review the tests. If they're good, ratchet to 70%. Repeat.
Each ratchet step is a conversation:
me: Coverage is at 77%. I'm raising the threshold to 85%.
The failing gate should be in the state machine modules.
Get us there.

claude: [runs tests, reads report, identifies gaps, iterates]

The Loop is most effective when the gap between current and target is 5-15 percentage points. Larger gaps work but produce more tests to review in a single session.
5. Always Review
The Loop is not "fire and forget." The human reviews every batch of tests Claude writes. At low thresholds, the review is quick — "yes, these cover the basic paths." At high thresholds, the review is more careful — "this assertion checks the return type but not the value; rewrite it."
The review is also where you catch structural improvements that Claude won't suggest: "these 6 tests should be a property-based test instead" or "this mock should be a fake."
The Philosophy: Quality as a System Property
This article is really about one idea from Don't Put the Burden on Developers:
Every recurring failure is a structural gap, not a discipline problem.
Low test coverage is not a discipline problem. Developers don't lack the skill to write tests. They lack the time, the motivation, and — honestly — the patience to grind through the last 30% of coverage on a Thursday afternoon when there are features to ship.
The Loop is a structural solution to this. The quality gate defines the floor. Claude does the grinding. The human controls the system — setting thresholds, reviewing output, ratcheting upward.
The AI is not the author. The quality gate is not the judge. Together, they are a feedback loop. And the human — the architect — decides when the loop runs, what it targets, and when to raise the bar.
The future of test writing is not "AI writes all tests." It's this:
The human decides what quality means. The gate measures it. The AI grinds toward it. The human reviews, approves, and ratchets upward. Repeat.
That's The Loop.
Further Reading
- QualityGate: Roslyn-Powered Static Analysis and Quality Metrics for .NET — the .NET quality gate tool that powers the left side of The Loop
- Quality to Its Finest: Testing a Terminal-Styled CV Website — the 8-layer testing architecture for this website, including coverage gates and compliance scanning
- Hardening the Test Pipeline — how smoke filtering, auto-baselines, and pre-push hooks make the test suite sustainable
- Don't Put the Burden on Developers — the philosophy that quality is a structural problem, not a discipline problem
- Building FrenchExDev with Claude AI — how I use Claude Code as a pair programmer across 57 projects
- The Journey: Building This Site with Claude — the complete story of human-AI collaboration that built this website
- Onboarding Typed Specifications — how typed feature specifications create the third quality dimension: compliance
- Handoff — Closing the Requirements → Code → Tests → Proof Loop — the AST scanner, transitive walker, and compliance report that prove every requirement is implemented and tested
- Requirements ARE Types — the .NET side: a Requirements DSL with Roslyn source generators, typed references, and build-time compliance validation