
The Problem -- Why Ops Is Still Untyped

"Your code is typed from requirement to test. Then it enters production, and everything becomes a string."


The Lifecycle Cliff

Follow a feature through the CMF ecosystem:

Requirement    [Feature("OrderCancellation")]                → typed, tracked
     ↓
Domain Model   [AggregateRoot(BoundedContext = "Ordering")]  → typed, validated
     ↓
API            [ApiEndpoint(Method.POST, "/orders/{id}/cancel")] → typed, documented
     ↓
Test           [FeatureTest(typeof(OrderCancellation))]      → typed, linked
     ↓
Deployment     ???
     ↓
Migration      ???
     ↓
Health Check   ???
     ↓
Alerting       ???
     ↓
Rollback       ???

The first four steps are compiler-checked. Roslyn analyzers verify that every requirement has an implementation. Source generators produce test scaffolding, API clients, and traceability matrices. The type system guarantees that if the code compiles, the chain is complete.
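The typed half of the diagram can be sketched as ordinary C# declarations. The attribute names below come straight from the lifecycle diagram above; the exact signatures are illustrative, not the canonical CMF definitions:

```csharp
// Illustrative sketch of the dev-side chain, using the attributes from the
// diagram above. Exact attribute shapes are assumptions for illustration.
[Feature("OrderCancellation")]
public sealed class OrderCancellation { }

[AggregateRoot(BoundedContext = "Ordering")]
public sealed class Order { /* domain logic elided */ }

[ApiEndpoint(Method.POST, "/orders/{id}/cancel")]
public sealed class CancelOrderEndpoint { }

// The test links back via typeof(), not a string. An analyzer can therefore
// verify at compile time that every [Feature] has at least one [FeatureTest].
[FeatureTest(typeof(OrderCancellation))]
public sealed class OrderCancellationTests { }
```

The key property is that every link in the chain is a type reference, which is what makes the chain mechanically checkable.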

Then the feature ships. And the lifecycle falls off a cliff.


The Anti-Pattern: A Real Deployment Specification

Here is how a deployment is "specified" in most organizations. This is a wiki page, written by a developer who has since left the team:

## Order Service v2.4 Deployment

1. Run DB migration script `047_add_payment_status.sql`
2. Wait for migration to complete (~5 min)
3. Deploy order-service v2.4 to staging
4. Run smoke tests (see Postman collection in Teams)
5. If smoke tests pass, deploy to production (rolling update)
6. Check new payment-status endpoint returns 200
7. Monitor Grafana dashboard "Order Service" for 15 min
8. If errors > 5%, rollback to v2.3 (kubectl rollout undo)
9. Enable feature flag `payment-status-v2` in LaunchDarkly
10. Update Slack channel #deployments
11. Close JIRA ticket ORD-1847

Last updated: 2025-11-03 (by: unknown)

Count the problems:

Step 1: The migration script is referenced by filename. Has it been renamed? Does it still exist in the migrations folder? Does it match the schema version the code expects? Nobody knows at compile time.

Step 4: "See Postman collection in Teams." Where in Teams? Which channel? Which collection version? The Postman collection was last exported 3 months ago and is missing the payment-status endpoint.

Step 6: The endpoint /payment-status was renamed to /payments/status in the v2.4 codebase. This step will fail. Nobody will discover this until deployment day.

Step 7: The Grafana dashboard was renamed from "Order Service" to "Order Service v2" two weeks ago. The wiki was not updated.

Step 8: kubectl rollout undo will roll back to the previous deployment, which may or may not be v2.3. If another hotfix was deployed in between, the rollback target is wrong.

Step 9: The feature flag name in LaunchDarkly is payment_status_v2 (underscores), not payment-status-v2 (hyphens). The wiki has the wrong name.

Step 11: The JIRA ticket was moved to a different project during a reorg. The key is now PAYMENTS-1847.

Every step is a string. Every string can drift. Every drift is a production incident waiting to happen. And the wiki page says "Last updated: 2025-11-03 (by: unknown)" -- which means nobody has reviewed it in 5 months, and the original author is untraceable.

This is not a strawman. This is the median deployment specification in the industry.


The Taxonomy of Untyped Operational Knowledge

The wiki deployment checklist is just one species. Here is the complete taxonomy of operational knowledge that lives outside the type system:

Deployment & Release

Concern               Where It Lives Today                What Goes Wrong
──────────────────────────────────────────────────────────────────────────────
Deployment ordering   Wiki page, mental model             Wrong order → cascading failure
Migration sequencing  Filename conventions (001_, 002_)   Gaps, duplicates, parallel conflicts
Rollback conditions   Slack messages, verbal agreement    Wrong threshold, wrong target version
Canary metrics        Helm values, dashboard JSON         Metric renamed, threshold stale
Feature flag rules    LaunchDarkly UI, spreadsheet        Flag name mismatch, stale rules

Observability & Response

Concern               Where It Lives Today                What Goes Wrong
──────────────────────────────────────────────────────────────────────────────
Health checks         Startup.cs, hardcoded paths         Endpoint removed, path changed
Alerting rules        Prometheus YAML, Grafana UI         Alert references deleted metric
SLA definitions       Contract PDF, spreadsheet           SLO drifts from implementation
Incident runbooks     Confluence, PagerDuty notes         Steps outdated, wrong escalation
Escalation policies   PagerDuty UI, team wiki             Rotation changed, person left

Infrastructure & Security

Concern                   Where It Lives Today              What Goes Wrong
──────────────────────────────────────────────────────────────────────────────
Configuration transforms  Helm templates, env files         Missing key in production, wrong value
Secret rotation           Manual process, ticket reminder   Expired secret → outage
Network policies          Kubernetes YAML, Terraform        Policy blocks legitimate traffic
Certificate management    Calendar reminder, monitoring     Cert expires → TLS failure
Backup schedules          Cron jobs, cloud console          Backup misconfigured, never tested

Performance & Resilience

Concern                  Where It Lives Today             What Goes Wrong
──────────────────────────────────────────────────────────────────────────────
Performance budgets      README, PR review convention     Budget violated silently
Chaos experiments        Gremlin UI, ad-hoc scripts       Experiment scope unknown, no hypothesis
Scaling rules            Cloud console, HPA YAML          Wrong threshold, cost explosion
Circuit breaker configs  appsettings.json, hardcoded      Threshold too aggressive or too lenient
Cost budgets             Spreadsheet, monthly review      Overspend discovered after the fact

That is twenty categories of operational knowledge. Every one of them has the same failure mode: it is a string that drifts from the code it describes.


Why Terraform Does Not Solve This

The immediate objection: "We have Terraform / Pulumi / Helm / CDK. Our infrastructure IS typed."

Terraform solves infrastructure provisioning. It tells the cloud provider: create a VM with 4 CPUs, attach a 100GB disk, put it in subnet X. That is valuable. That is not what we are talking about.

Terraform does not express:

  • "When error rate exceeds 5% for 2 minutes, page the on-call engineer" -- that is an alerting rule, not a resource
  • "Migration 47 must complete before the application starts" -- that is deployment ordering, not provisioning
  • "The canary deployment should measure p99 latency against the baseline" -- that is a canary metric, not infrastructure
  • "This service's p95 response time must stay below 200ms" -- that is a performance budget, not a VM size
  • "If the circuit breaker opens 3 times in 10 minutes, trigger the rollback" -- that is a resilience policy, not a network rule
  • "The backup must be tested quarterly by restoring to a staging environment" -- that is a compliance requirement, not a storage bucket

Terraform/Pulumi type the nouns of infrastructure (VMs, databases, networks). They do not type the verbs of operations (deploy, migrate, alert, rollback, scale, rotate, audit).

Helm templates YAML but does not validate that your canary references a real metric. Pulumi is typed infrastructure-as-code, but the types describe cloud resources, not operational policies. CDK is CloudFormation with better syntax, not operational knowledge with compiler enforcement.

The gap is not "untyped infrastructure." Cloud teams solved that years ago. The gap is untyped operational knowledge -- the rules, policies, thresholds, sequences, and conditions that govern how typed infrastructure behaves in production.
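To make the nouns-versus-verbs distinction concrete, here is what a typed operational verb could look like. The [AlertRule] name appears in the chain table below; the [Metric] attribute, AlertSeverity enum, and parameter shapes are assumptions for illustration:

```csharp
// Hypothetical sketch: typing an operational verb (alerting) rather than a
// noun (a VM). [Metric] and AlertSeverity are illustrative assumptions.
[Metric("order_service_error_rate", Unit = "percent")]
public sealed class OrderErrorRate { }

// The alert references the metric by typeof(), so "alert references a
// deleted metric" becomes a compile error, not a silent Prometheus gap.
[AlertRule(typeof(OrderErrorRate),
    Threshold = 5.0,
    ForMinutes = 2,
    Severity = AlertSeverity.Page)]
public sealed class OrderErrorRateAlert { }
```

This is exactly the "error rate exceeds 5% for 2 minutes, page the on-call engineer" rule from the bullet list above, expressed as something Terraform has no vocabulary for.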


The Typed Requirement Chain

The dev-side DSLs established a pattern:

Policy     → Declaration    → Generated Artifact   → Validation
───────────────────────────────────────────────────────────────
Requirement  [Feature]        RequirementConstants   FeatureComplianceValidator
DDD          [AggregateRoot]  Repository, Events     BoundedContextAnalyzer
API          [ApiEndpoint]    OpenAPI, Client        EndpointCoverageAnalyzer
Test         [FeatureTest]    TestScaffold           TestCoverageAnalyzer

At every level: a human declares policy via an attribute, the source generator produces the artifact, and an analyzer validates the constraints. The chain is typed end-to-end. Violations are compile errors, not production incidents.
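The "Declaration" column of that pattern is nothing exotic: it is a plain attribute that both the source generator and the analyzer can read. A minimal sketch of what such a declaration might look like (this definition is an assumption, not the CMF source):

```csharp
using System;

// Minimal sketch of the Declaration half of the meta-pattern: an ordinary
// attribute carrying the policy. The generator reads it to emit artifacts;
// the analyzer reads it to validate constraints. Shape is illustrative.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
public sealed class FeatureAttribute : Attribute
{
    public FeatureAttribute(string requirementId) => RequirementId = requirementId;

    public string RequirementId { get; }
}
```

Because the policy lives in compilation metadata rather than a wiki, the generator and analyzer always see the same declaration the developer wrote.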

The same chain should work for ops:

Policy        → Declaration          → Generated Artifact  → Validation
──────────────────────────────────────────────────────────────────────────────
Deployment      [DeployOrchestrator]   DAG, Helm values      CircularDependencyAnalyzer
Migration       [MigrationStep]        Ordered scripts       GapDetectionAnalyzer
Observability   [HealthCheck]          Probe registration    StaleMetricAnalyzer
Alerting        [AlertRule]            Prometheus YAML       ThresholdRangeAnalyzer
Resilience      [CircuitBreaker]       Polly policy          FallbackCoverageAnalyzer
Chaos           [ChaosExperiment]      Decorator/Litmus      HypothesisRequiredAnalyzer
Performance     [PerformanceBudget]    k6 script             BaselineRegressionAnalyzer
Security        [SecurityPolicy]       Headers middleware    PolicyCoverageAnalyzer
Cost            [CostBudget]           Alert config          BudgetThresholdAnalyzer

Every row follows the same meta-pattern. Every row produces artifacts that are guaranteed correct by construction. Every row has an analyzer that catches violations before the code leaves the developer's machine.
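Taking the Migration row as an example, a declaration might look like the sketch below. The [MigrationStep] name comes from the table; [DependsOn], IMigration, and IMigrationBuilder are hypothetical names invented for illustration:

```csharp
// Hypothetical sketch of the Migration row: the step number and dependency
// are typed metadata rather than a "047_" filename prefix, so a
// GapDetectionAnalyzer can flag a missing step 46, or two competing step
// 47s on parallel branches, at compile time instead of on deployment day.
[MigrationStep(47, Description = "Add payment_status column")]
[DependsOn(typeof(Migration046_AddPaymentsTable))]
public sealed class Migration047_AddPaymentStatus : IMigration
{
    public void Up(IMigrationBuilder db) =>
        db.AddColumn("orders", "payment_status", DbType.String);
}
```

Contrast this with Step 1 of the wiki checklist, where the same migration was a filename string that nobody could verify at compile time.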


The Cost of Not Typing

Consider what happens in the untyped world when a developer renames a Grafana dashboard:

  1. Developer renames dashboard in Grafana UI
  2. Wiki deployment checklist still references old name
  3. Six weeks later, a deployment follows the checklist
  4. Step 7 says "Monitor Grafana dashboard 'Order Service' for 15 min"
  5. The dashboard is now called "Order Service v2"
  6. The deployer searches, finds nothing, skips the step
  7. The deployment has a latency regression
  8. Nobody notices for 45 minutes because monitoring was skipped
  9. Postmortem: "We should update the wiki"
  10. The wiki gets updated. The cycle restarts.

Now consider the typed world:

[GrafanaDashboard("order-service",
    Title = "Order Service",
    DataSource = "prometheus",
    Panels = [typeof(OrderLatencyPanel), typeof(OrderErrorRatePanel)])]
public sealed class OrderServiceDashboard { }

[DeploymentOrchestrator("2.4")]
[MonitorAfterDeploy(typeof(OrderServiceDashboard), DurationMinutes = 15)]
public sealed class OrderServiceV24Deployment { }

The [MonitorAfterDeploy] attribute takes a typeof(). If the dashboard class is renamed, the compiler emits CS0246. If the dashboard is deleted, the compiler emits CS0246. The deployment cannot reference a dashboard that does not exist. The wiki cannot drift because there is no wiki -- the deployment specification is the code.
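The mechanism is ordinary C#: the attribute constructor takes a System.Type, so the reference participates in compilation like any other type name. A minimal sketch of how such an attribute could be declared (the real definition is an assumption):

```csharp
using System;

// Sketch: because the constructor parameter is System.Type, the typeof()
// argument must name a type that exists in the compilation. Rename or
// delete OrderServiceDashboard without updating the reference, and the
// build fails with CS0246 ("type or namespace name could not be found").
[AttributeUsage(AttributeTargets.Class)]
public sealed class MonitorAfterDeployAttribute : Attribute
{
    public MonitorAfterDeployAttribute(Type dashboard) => Dashboard = dashboard;

    public Type Dashboard { get; }
    public int DurationMinutes { get; set; }
}
```

No analyzer is even needed for this particular check; the C# compiler's name resolution does it for free.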


The Question

The dev-side DSLs proved that C# attributes, source generators, and Roslyn analyzers can type an entire domain -- from requirements to tests. The pattern is known. The infrastructure is known. The meta-metamodel is known.

So why stop at dotnet test?

What if every operational concern were a DSL? What if deployment ordering, migration sequencing, health checks, alerting rules, chaos experiments, performance budgets, compliance matrices, cost budgets, secret rotation, certificate management, backup schedules, scaling rules, SLA definitions, network policies, incident runbooks, feature flags, and data pipeline quality gates were all typed, validated, and generated?

What if dotnet build produced not just code, but a complete operational specification -- typed, validated, cross-referenced, and impossible to drift?

That is the Ops DSL Ecosystem. Twenty-two sub-DSLs. Three execution tiers. One meta-pattern. Every artifact generated. No hand-written YAML. No hand-written Terraform. No hand-written Docker Compose. No hand-written bash scripts. You declare C# attributes; dotnet build produces everything.

The rest of this series builds it.
