
Generated vs. Written -- The Drift Comparison

Every previous part demonstrated what the DSL ecosystem produces. This part demonstrates why it matters. The argument is simple: generated artifacts cannot drift from their source. Written artifacts always do.


Setup: Two Identical Services

Two teams. Same company. Same OrderService. Same requirements. Same deadline.

Team A uses the 22 Ops DSLs. OrderService has a single OrderServiceOps.cs with ~95 attributes. Every operational artifact -- deployment manifests, docker-compose files, Terraform modules, health checks, alert rules, chaos experiments, RBAC policies, dashboards, runbooks -- is generated from those attributes.
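
A representative slice of such a file might look like the sketch below. The attribute names used elsewhere in this series (DeploymentOrchestrator, CircuitBreaker, ChaosExperiment, TargetService) are real; HealthCheck and AlertRule are hypothetical stand-ins for the rest of the ~95 declarations:

```csharp
// OrderServiceOps.cs -- illustrative sketch of the single source of truth.
// HealthCheck and AlertRule are hypothetical attribute names; the others
// appear elsewhere in this series.
[DeploymentOrchestrator("order-platform",
    DependsOn = new[] { "postgres-15", "redis-7", "rabbitmq-3" })]
[CircuitBreaker(typeof(IPaymentGateway),
    FailureThreshold = 5, BreakDuration = "30s")]
[ChaosExperiment("PaymentTimeout", Tier = OpsExecutionTier.InProcess,
    Hypothesis = "Order placement degrades gracefully when the gateway times out")]
[TargetService(typeof(IPaymentGateway))]
[HealthCheck("/health/db", Interval = "30s")]                        // hypothetical
[AlertRule("HighErrorRate", Expr = "error_rate > 0.05", For = "5m")] // hypothetical
public sealed partial class OrderServiceOps { }
```

Every generated artifact traces back to one of these declarations; nothing operational lives anywhere else.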

Team B does not use the DSLs. Instead, they maintain:

  • A Confluence wiki with 12 pages covering deployment, monitoring, security, and operational procedures
  • Hand-written Kubernetes manifests in a k8s/ directory
  • Hand-written Terraform modules in a terraform/ directory
  • Hand-written docker-compose files for local development
  • A runbook document with incident response procedures
  • A Grafana dashboard built by clicking in the UI and exported to JSON
  • Prometheus alert rules written directly in YAML
  • A compliance spreadsheet mapping SOC2 controls to evidence

Both teams launch on the same day. Both services are identical in function and operational posture. At day zero, the outputs are indistinguishable. The wiki is comprehensive, well-structured, and accurate. The manifests match the running system. The alerts fire correctly. The runbook works.

Day zero is the last time both teams are synchronized.


Month 1: Everything Matches

Side-by-side comparison at the one-month mark:

Aspect                Team A (Generated)    Team B (Written)
Deployment manifest   Matches code          Matches code
Health checks         3, all passing        3, all passing
Alert rules           3, tested             3, tested
Chaos experiments     3, all green          3, documented in wiki
RBAC policies         4, enforced           4, enforced
Runbook               Generated, accurate   Written, accurate
Compliance evidence   Generated, current    Spreadsheet, current
Drift count           0                     0

Team B's wiki is actually prettier. It has diagrams, color-coded sections, and a table of contents. A new hire reads it and understands the system in an hour. Team A's attributes are dense and require familiarity with the DSL vocabulary.

At month 1, Team B's approach looks better.


Month 3: After a Routine Change

The database was upgraded from PostgreSQL 15 to 16. Both teams need to update their operational artifacts.

Team A changes one attribute:

// Before
[DeploymentOrchestrator("order-platform",
    DependsOn = new[] { "postgres-15", "redis-7", "rabbitmq-3" })]

// After
[DeploymentOrchestrator("order-platform",
    DependsOn = new[] { "postgres-16", "redis-7", "rabbitmq-3" })]

The build regenerates everything. The docker-compose file gets postgres:16. The Container chaos experiment gets postgres:16. The Terraform module gets the updated Azure Database for PostgreSQL version. The backup policy script switches to pg16 dump format. The migration plan references the correct pg16 features. The analyzer validates that every reference to the database is consistent.

One line changed. Twelve files regenerated. Zero drift.
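
The regenerated docker-compose file might contain a fragment like this. The file name and service layout are illustrative; only the postgres tag is derived from the attribute change above:

```yaml
# docker-compose.generated.yml -- illustrative; regenerated on every build,
# never hand-edited.
services:
  postgres:
    image: postgres:16   # derived from DependsOn = "postgres-16"
  redis:
    image: redis:7
  rabbitmq:
    image: rabbitmq:3-management
```

The same attribute value flows into the Terraform module, the chaos experiment, and the backup script, which is why no file can disagree with another.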

Team B updates the docker-compose file:

# Before
postgres:
  image: postgres:15

# After
postgres:
  image: postgres:16

Then they update the Kubernetes deployment manifest. They remember to update the staging environment. They forget to update:

  1. The wiki page "Database Infrastructure" that says PostgreSQL 15
  2. The Terraform module that provisions the production database (still references postgres_version = "15")
  3. The backup script that uses pg_dump with a pg15-specific flag

Drift count after 3 months: Team A = 0, Team B = 3.

Nobody notices. The wiki page is wrong, but nobody reads it until the next incident. The Terraform module is wrong, but nobody runs terraform plan until the next infrastructure change. The backup script is wrong, but backups still succeed (pg16 is backward-compatible with pg15 dump format -- until it is not).


Month 6: After a Team Member Leaves

Sarah, the senior engineer who wrote most of Team B's wiki and runbooks, leaves the company. She knew which parts of the documentation were critical, which were aspirational, and which were outdated from day one.

Team A: Nothing changes. The attributes are the documentation. The generated artifacts are always current. Sarah's departure has no impact on operational accuracy. The new hire reads the attributes and the generated report. Everything matches the running system because it is derived from the same source.

Team B: The new hire, James, is assigned to on-call. He follows the runbook for a minor incident (RabbitMQ connection spike). Step 4 says: "Check the RabbitMQ management console at rabbit-mgmt.internal:15672." The endpoint was renamed to rabbitmq-dashboard.internal:15672 two months ago. Sarah knew this. James does not. He spends 45 minutes finding the correct URL, then another 45 minutes doubting the rest of the runbook.

He updates step 4. He does not check the other 23 steps.

Three weeks later, a different incident. Step 11 references a metric name that was renamed in a Prometheus refactoring. Another 30 minutes lost.

The wiki page "Monitoring Architecture" references three dashboards. One was deleted. One was moved to a different Grafana folder. The third is correct.

Drift count after 6 months: Team A = 0, Team B = 7.

The drift is not caused by negligence. Sarah was diligent. James is diligent. The drift is caused by the fact that the runbook, the wiki, and the Grafana dashboard are all secondary artifacts that must be manually synchronized with the primary artifacts (the code and the infrastructure). Every change to a primary artifact creates a synchronization obligation for every secondary artifact. Some obligations are met. Some are not.


Month 9: After a Major Version Bump

OrderService v3 becomes v4. New payment provider (Stripe replaces the legacy gateway). New event schema (CloudEvents replaces the custom format). New deployment strategy (blue-green replaces canary because the new payment provider cannot handle gradual traffic shifting).

Team A updates the attributes:

// Payment provider change
[CircuitBreaker(typeof(IStripeGateway), // was IPaymentGateway
    FailureThreshold = 5, BreakDuration = "30s")]

[ChaosExperiment("StripeTimeout", Tier = OpsExecutionTier.InProcess,
    Hypothesis = "Order placement degrades gracefully when Stripe times out")]
[TargetService(typeof(IStripeGateway))] // was IPaymentGateway

// Deployment strategy change
[BlueGreenStrategy( // was CanaryStrategy
    SwitchAfter = "10m",
    HealthCheckInterval = "30s",
    RollbackOnFailure = true)]

Build. The compiler catches twelve errors:

  1. CHS001: Chaos experiment still references IPaymentGateway -- no circuit breaker on that interface anymore.
  2. PRF001: SLI references metric payment_gateway_request_duration_seconds -- metric renamed to stripe_request_duration_seconds.
  3. OBS002: Alert rule references old metric name.
  4. DEP002: The rollback error-rate threshold still assumes the canary strategy, which has been removed.
  5. Eight more cross-DSL reference mismatches.

Every inconsistency is a compiler error. Not a warning. Not a TODO. An error that prevents the build from succeeding. The developer fixes all twelve. Rebuilds. All 31 validations pass. Every generated artifact is consistent with the new v4 architecture.

Team B updates the code. Updates the Kubernetes manifests. Updates the Terraform module for the new payment provider's IP allowlist. Updates the docker-compose file. Then:

  • The wiki page "Payment Integration" still references the old provider's API documentation.
  • The runbook step "Verify payment gateway health" references the old endpoint.
  • The Grafana dashboard panel "Payment Latency" queries the old metric name.
  • The alert rule for payment errors references the old metric.
  • The chaos experiment documentation references the old interface. (They had chaos experiments documented? James did not know. Sarah set them up before she left. Nobody has run them since month 2.)
  • The compliance spreadsheet maps SOC2 CC9.1 "Risk mitigation" to "Chaos experiments on IPaymentGateway" -- an interface that no longer exists.
  • The RBAC policy document references the old payment operation names.
  • The deployment runbook describes the canary strategy. The service now uses blue-green.

Three weeks later, an external audit. The auditor asks for evidence that SOC2 CC9.1 (risk mitigation) is implemented. Team B produces the compliance spreadsheet. The auditor reads "Chaos experiments on IPaymentGateway." The auditor asks to see the chaos experiment. There is no chaos experiment on IStripeGateway. The finding is documented as a gap.

Drift count after 9 months: Team A = 0, Team B = 15+.


The Drift Curve

Drift items
    ^
 20 |                                              Team B
    |                                           __/
 15 |                                       ___/
    |                                   ___/  <-- v3->v4 upgrade
 12 |                               ___/
    |                           ___/
  9 |                       ___/
    |                   ___/
  6 |               ___/  <-- Sarah leaves
    |           ___/
  3 |       ___/  <-- postgres upgrade
    |   ___/
  0 |--/______________________________________ Team A (flat at zero)
    +--+--+--+--+--+--+--+--+--+--+--+--+--> Months
    0  1  2  3  4  5  6  7  8  9 10 11 12

Team A is a flat line at zero. The compiler makes drift structurally impossible. You cannot ship a build where the deployment manifest contradicts the chaos experiment, or where the alert rule references a metric that does not exist, or where the RBAC policy lists an operation that was renamed.

Team B is an exponential curve. Each change introduces drift. Each departure accelerates it. Each major version multiplies it. The drift compounds because fixing one inconsistency does not fix the others, and the act of fixing often introduces new inconsistencies in the correction itself.


The Cost Analysis

Over 12 months, tracked by both teams:

Team B -- Manual Maintenance:

Activity                               Hours/Month   Annual Total
Wiki updates after code changes        4             48
Runbook updates                        3             36
Manifest synchronization               3             36
Dashboard/alert maintenance            2             24
Compliance evidence gathering          4             48
Debugging due to stale docs            3             36
Onboarding friction (stale info)       2             24
Incident time lost to wrong runbooks   2             24
Total                                  23            276

Team A -- Attribute Maintenance:

Activity                                   Hours/Month   Annual Total
Attribute updates after code changes       3             36
Reviewing generated artifacts              1             12
Learning DSL vocabulary (first 3 months)   4 (then 0)    12
Updating analyzers for new patterns        1             12
Total                                      ~6            72

Team B spends 276 hours per year on operational documentation with increasing drift. Team A spends 72 hours per year on attribute declarations with zero drift. The difference -- 204 hours -- is five weeks of developer time. Per service. Per year.

And Team B's 276 hours does not buy them accuracy. It buys them a best-effort approximation that degrades monotonically.


The Killer Scenario: The 3 AM Incident

Both teams get paged at 3:07 AM. The payment gateway returns 504 for every request. Orders are failing.

Team A's response:

$ dotnet ops report --section resilience

  Resilience
    Circuit Breaker: IStripeGateway (5 failures / 30s break)
      Status: OPEN (tripped at 03:07:12)
      Fallback: Queued for retry via IRabbitMqPublisher
    Retry Policy: IRabbitMqPublisher (3x exponential from 100ms)
    Rollback: Not triggered (error_rate = 12%, threshold = 50%)

  Chaos Experiment: StripeTimeout
    Last run: 2026-04-04 (2 days ago)
    Result: PASSED
    Verified: Circuit breaker trips after 5 failures, fallback queues order
    Recovery: Orders processed from queue when gateway recovers

The on-call engineer reads the report. The circuit breaker is open. Orders are being queued. The chaos experiment proved this path works two days ago. The engineer monitors the queue depth. The payment gateway recovers at 3:19 AM. The queued orders drain in 4 minutes. No orders lost. No manual intervention. Total incident time: 16 minutes, of which 12 were waiting for the upstream provider.

Team B's response:

The on-call engineer (James) opens the runbook. Step 1: "Check the payment gateway health endpoint at https://payments.legacy-provider.com/health." That is the old provider. The current provider is Stripe. James searches his email for the Stripe health endpoint. Three minutes lost.

Step 3: "If the circuit breaker is open, verify the fallback queue." The runbook describes the fallback as a database table pending_orders. The actual fallback was changed to RabbitMQ six months ago. James checks the database table. It is empty. He thinks the fallback is not working. He escalates.

The escalation pages the team lead's phone -- except the escalation policy in PagerDuty still lists the previous team lead, who transferred to another team three rotations ago. The page goes to someone who no longer works on OrderService. Seven minutes pass before the correct team lead responds.

The team lead knows the fallback is RabbitMQ now. She checks the queue. Orders are accumulating. The system is working as designed. But James has already started manual intervention based on the stale runbook -- he is trying to restart the order-worker pod because step 7 says "restart the worker if the fallback table is not draining." The worker restarts. The in-flight queue consumer disconnects. Twelve orders that were being processed are re-queued. Three of them are duplicated because the consumer had acknowledged but not committed.

Total incident time: 47 minutes. Three duplicate orders that require manual reconciliation the next morning. The post-mortem identifies "stale runbook" as a contributing factor. Action item: "Update the runbook." Assignee: James. Due date: next sprint.


The Fundamental Asymmetry

The comparison reveals a structural truth about operational documentation:

Primary artifacts are the things that execute: source code, compiled binaries, deployed containers, running infrastructure. They are always current because they are the system.

Secondary artifacts are things that describe: wiki pages, runbooks, architecture diagrams, compliance spreadsheets, dashboard JSON files that were exported and checked in. They represent a point-in-time snapshot of someone's understanding of the primary artifacts.

Every time a primary artifact changes, every secondary artifact that describes it becomes a candidate for drift. The synchronization is manual, voluntary, and invisible. There is no compiler error when the wiki says PostgreSQL 15 and the docker-compose says PostgreSQL 16. There is no test failure when the runbook references a renamed endpoint. There is no CI gate when the compliance spreadsheet cites evidence that no longer exists.

The DSL ecosystem eliminates secondary artifacts by generating them from the same source as the primary artifacts. The deployment manifest is not a description of the deployment -- it is the deployment, generated from the same attributes that generate the health checks, the alert rules, the chaos experiments, and the compliance evidence. There is no synchronization obligation because there is nothing to synchronize. There is one source, and everything else is derived.
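
The principle fits in a few lines. The record and the renderers below are an illustrative sketch, not the actual generator, but they show why derived artifacts cannot disagree:

```csharp
// Illustrative sketch, not the real generator: one source of truth,
// every artifact rendered from it. Changing PostgresVersion here changes
// the compose file, the Terraform variable, and the docs line in the same
// build -- there is no second copy to forget.
var source = new { Service = "order-platform", PostgresVersion = 16 };

string Compose() =>
    $"services:\n  postgres:\n    image: postgres:{source.PostgresVersion}";

string Terraform() =>
    $"postgres_version = \"{source.PostgresVersion}\"";

string DocsLine() =>
    $"The production database runs PostgreSQL {source.PostgresVersion}.";

Console.WriteLine(Compose());
Console.WriteLine(Terraform());
Console.WriteLine(DocsLine());
```

Drift requires two copies that can diverge; a renderer over one record has nothing to diverge from.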

Documents are a promise. Types are a proof. The drift is not a matter of discipline -- it is a matter of physics. Secondary artifacts always diverge from primary artifacts. The only way to prevent drift is to eliminate secondary artifacts entirely. Generate everything from the single source of truth, and the drift curve stays flat at zero. Forever.
