The Problem -- Why Ops Is Still Untyped
"Your code is typed from requirement to test. Then it enters production, and everything becomes a string."
The Lifecycle Cliff
Follow a feature through the CMF ecosystem:
Requirement [Feature("OrderCancellation")] → typed, tracked
↓
Domain Model [AggregateRoot(BoundedContext = "Ordering")] → typed, validated
↓
API [ApiEndpoint(Method.POST, "/orders/{id}/cancel")] → typed, documented
↓
Test [FeatureTest(typeof(OrderCancellation))] → typed, linked
↓
Deployment ???
↓
Migration ???
↓
Health Check ???
↓
Alerting ???
↓
Rollback ???

The first four steps are compiler-checked. Roslyn analyzers verify that every requirement has an implementation. Source generators produce test scaffolding, API clients, and traceability matrices. The type system guarantees that if the code compiles, the chain is complete.
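The mechanics behind those checkmarks are plain C#. Here is a minimal, illustrative sketch of the declaration side -- the attribute shape is an assumption for this article, not the real CMF source:

```csharp
using System;

// Assumed shape of the dev-side declaration: the attribute is the policy,
// and a Roslyn analyzer (not shown) verifies at build time that every
// tracked requirement has a class carrying it.
[AttributeUsage(AttributeTargets.Class)]
public sealed class FeatureAttribute : Attribute
{
    public string Name { get; }
    public FeatureAttribute(string name) => Name = name;
}

[Feature("OrderCancellation")]
public sealed class OrderCancellation
{
    // Domain logic lives here. Renaming or deleting this class is a
    // build-time event, visible to every analyzer in the chain.
}
```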
Then the feature ships. And the lifecycle falls off a cliff.
The Anti-Pattern: A Real Deployment Specification
Here is how a deployment is "specified" in most organizations. This is a wiki page, written by a developer who has since left the team:
## Order Service v2.4 Deployment
1. Run DB migration script `047_add_payment_status.sql`
2. Wait for migration to complete (~5 min)
3. Deploy order-service v2.4 to staging
4. Run smoke tests (see Postman collection in Teams)
5. If smoke tests pass, deploy to production (rolling update)
6. Check new payment-status endpoint returns 200
7. Monitor Grafana dashboard "Order Service" for 15 min
8. If errors > 5%, rollback to v2.3 (kubectl rollout undo)
9. Enable feature flag `payment-status-v2` in LaunchDarkly
10. Update Slack channel #deployments
11. Close JIRA ticket ORD-1847
Last updated: 2025-11-03 (by: unknown)

Count the problems:
Step 1: The migration script is referenced by filename. Has it been renamed? Does it still exist in the migrations folder? Does it match the schema version the code expects? Nobody knows at compile time.
Step 4: "See Postman collection in Teams." Where in Teams? Which channel? Which collection version? The Postman collection was last exported 3 months ago and is missing the payment-status endpoint.
Step 6: The endpoint /payment-status was renamed to /payments/status in the v2.4 codebase. This step will fail. Nobody will discover this until deployment day.
Step 7: The Grafana dashboard was renamed from "Order Service" to "Order Service v2" two weeks ago. The wiki was not updated.
Step 8: kubectl rollout undo will roll back to the previous deployment, which may or may not be v2.3. If another hotfix was deployed in between, the rollback target is wrong.
Step 9: The feature flag name in LaunchDarkly is payment_status_v2 (underscores), not payment-status-v2 (hyphens). The wiki has the wrong name.
Step 11: The JIRA ticket was moved to a different project during a reorg. The key is now PAYMENTS-1847.
Every step is a string. Every string can drift. Every drift is a production incident waiting to happen. And the wiki page says "Last updated: 2025-11-03 (by: unknown)" -- which means nobody has reviewed it in 5 months, and the original author is untraceable.
This is not a strawman. This is the median deployment specification in the industry.
The Taxonomy of Untyped Operational Knowledge
The wiki deployment checklist is just one species. Here is the complete taxonomy of operational knowledge that lives outside the type system:
Deployment & Release
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Deployment ordering | Wiki page, mental model | Wrong order → cascading failure |
| Migration sequencing | Filename conventions (001_, 002_) | Gaps, duplicates, parallel conflicts |
| Rollback conditions | Slack messages, verbal agreement | Wrong threshold, wrong target version |
| Canary metrics | Helm values, dashboard JSON | Metric renamed, threshold stale |
| Feature flag rules | LaunchDarkly UI, spreadsheet | Flag name mismatch, stale rules |
Observability & Response
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Health checks | Startup.cs, hardcoded paths | Endpoint removed, path changed |
| Alerting rules | Prometheus YAML, Grafana UI | Alert references deleted metric |
| SLA definitions | Contract PDF, spreadsheet | SLO drifts from implementation |
| Incident runbooks | Confluence, PagerDuty notes | Steps outdated, wrong escalation |
| Escalation policies | PagerDuty UI, team wiki | Rotation changed, person left |
Infrastructure & Security
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Configuration transforms | Helm templates, env files | Missing key in production, wrong value |
| Secret rotation | Manual process, ticket reminder | Expired secret → outage |
| Network policies | Kubernetes YAML, Terraform | Policy blocks legitimate traffic |
| Certificate management | Calendar reminder, monitoring | Cert expires → TLS failure |
| Backup schedules | Cron jobs, cloud console | Backup misconfigured, never tested |
Performance & Resilience
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Performance budgets | README, PR review convention | Budget violated silently |
| Chaos experiments | Gremlin UI, ad-hoc scripts | Experiment scope unknown, no hypothesis |
| Scaling rules | Cloud console, HPA YAML | Wrong threshold, cost explosion |
| Circuit breaker configs | appsettings.json, hardcoded | Threshold too aggressive or too lenient |
| Cost budgets | Spreadsheet, monthly review | Overspend discovered after the fact |
That is twenty categories of operational knowledge. Every one of them has the same failure mode: it is a string that drifts from the code it describes.
Why Terraform Does Not Solve This
The immediate objection: "We have Terraform / Pulumi / Helm / CDK. Our infrastructure IS typed."
Terraform solves infrastructure provisioning. It tells the cloud provider: create a VM with 4 CPUs, attach a 100GB disk, put it in subnet X. That is valuable. That is not what we are talking about.
Terraform does not express:
- "When error rate exceeds 5% for 2 minutes, page the on-call engineer" -- that is an alerting rule, not a resource
- "Migration 47 must complete before the application starts" -- that is deployment ordering, not provisioning
- "The canary deployment should measure p99 latency against the baseline" -- that is a canary metric, not infrastructure
- "This service's p95 response time must stay below 200ms" -- that is a performance budget, not a VM size
- "If the circuit breaker opens 3 times in 10 minutes, trigger the rollback" -- that is a resilience policy, not a network rule
- "The backup must be tested quarterly by restoring to a staging environment" -- that is a compliance requirement, not a storage bucket
Terraform/Pulumi type the nouns of infrastructure (VMs, databases, networks). They do not type the verbs of operations (deploy, migrate, alert, rollback, scale, rotate, audit).
Helm templates YAML but does not validate that your canary references a real metric. Pulumi is typed infrastructure-as-code, but the types describe cloud resources, not operational policies. CDK is CloudFormation with better syntax, not operational knowledge with compiler enforcement.
The gap is not "untyped infrastructure." Cloud teams solved that years ago. The gap is untyped operational knowledge -- the rules, policies, thresholds, sequences, and conditions that govern how typed infrastructure behaves in production.
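To make the noun/verb distinction concrete, here is an illustrative sketch of the first bullet above expressed as a typed declaration. Every name in it is an assumption for this article, not an existing library:

```csharp
using System;

public enum AlertSeverity { Ticket, Page }

// A typed "verb" of operations: a generator could emit Prometheus YAML
// from this, and an analyzer could reject a metric name that no metric
// declaration in the codebase defines.
[AttributeUsage(AttributeTargets.Class)]
public sealed class AlertRuleAttribute : Attribute
{
    public string Metric { get; set; } = "";
    public double Threshold { get; set; }
    public int ForMinutes { get; set; }
    public AlertSeverity Severity { get; set; }
}

// "When error rate exceeds 5% for 2 minutes, page the on-call engineer."
[AlertRule(Metric = "order_service_error_rate",
           Threshold = 0.05,
           ForMinutes = 2,
           Severity = AlertSeverity.Page)]
public sealed class OrderServiceErrorRateAlert { }
```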
The Typed Requirement Chain
The dev-side DSLs established a pattern:
| Policy | Declaration | Generated Artifact | Validation |
|---|---|---|---|
| Requirement | [Feature] | RequirementConstants | FeatureComplianceValidator |
| DDD | [AggregateRoot] | Repository, Events | BoundedContextAnalyzer |
| API | [ApiEndpoint] | OpenAPI, Client | EndpointCoverageAnalyzer |
| Test | [FeatureTest] | TestScaffold | TestCoverageAnalyzer |

At every level: a human declares policy via an attribute, the source generator produces the artifact, and an analyzer validates the constraints. The chain is typed end-to-end. Violations are compile errors, not production incidents.
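The generator half of that loop is ordinary Roslyn machinery. A minimal sketch, assuming the [Feature] attribute from earlier lives in the global namespace (the real CMF generators are more involved):

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;

[Generator]
public sealed class FeatureConstantsGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        // Find every class decorated with [Feature] and pull out its name.
        var features = context.SyntaxProvider.ForAttributeWithMetadataName(
            "FeatureAttribute",
            predicate: static (_, _) => true,
            transform: static (ctx, _) =>
                ctx.Attributes[0].ConstructorArguments[0].Value?.ToString() ?? "");

        // Emit one constants class from all declarations -- the generated
        // artifact of the Requirement row in the table above.
        context.RegisterSourceOutput(features.Collect(), static (spc, names) =>
        {
            var fields = string.Join("\n",
                names.Select(n => $"    public const string {n} = \"{n}\";"));
            spc.AddSource("RequirementConstants.g.cs",
                $"public static class RequirementConstants\n{{\n{fields}\n}}");
        });
    }
}
```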
The same chain should work for ops:
| Policy | Declaration | Generated Artifact | Validation |
|---|---|---|---|
| Deployment | [DeployOrchestrator] | DAG, Helm values | CircularDependencyAnalyzer |
| Migration | [MigrationStep] | Ordered scripts | GapDetectionAnalyzer |
| Observability | [HealthCheck] | Probe registration | StaleMetricAnalyzer |
| Alerting | [AlertRule] | Prometheus YAML | ThresholdRangeAnalyzer |
| Resilience | [CircuitBreaker] | Polly policy | FallbackCoverageAnalyzer |
| Chaos | [ChaosExperiment] | Decorator/Litmus | HypothesisRequiredAnalyzer |
| Performance | [PerformanceBudget] | k6 script | BaselineRegressionAnalyzer |
| Security | [SecurityPolicy] | Headers middleware | PolicyCoverageAnalyzer |
| Cost | [CostBudget] | Alert config | BudgetThresholdAnalyzer |

Every row follows the same meta-pattern. Every row produces artifacts that are guaranteed correct by construction. Every row has an analyzer that catches violations before the code leaves the developer's machine.
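Take the Migration row as an illustrative sketch -- the attribute shape is assumed, but it shows how ordering becomes a typed reference instead of a filename convention:

```csharp
using System;

// Assumed shape: the order is explicit, and the predecessor is a typeof()
// reference, so a GapDetectionAnalyzer can check the sequence and the
// compiler catches a renamed or deleted predecessor as CS0246.
[AttributeUsage(AttributeTargets.Class)]
public sealed class MigrationStepAttribute : Attribute
{
    public int Order { get; }
    public Type DependsOn { get; set; }
    public MigrationStepAttribute(int order) => Order = order;
}

[MigrationStep(46)]
public sealed class AddOrderTable { }

[MigrationStep(47, DependsOn = typeof(AddOrderTable))]
public sealed class AddPaymentStatus { }
```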
The Cost of Not Typing
Consider what happens in the untyped world when a developer renames a Grafana dashboard:
- Developer renames dashboard in Grafana UI
- Wiki deployment checklist still references old name
- Six weeks later, a deployment follows the checklist
- Step 7 says "Monitor Grafana dashboard 'Order Service' for 15 min"
- The dashboard is now called "Order Service v2"
- The deployer searches, finds nothing, skips the step
- The deployment has a latency regression
- Nobody notices for 45 minutes because monitoring was skipped
- Postmortem: "We should update the wiki"
- The wiki gets updated. The cycle restarts.
Now consider the typed world:
[GrafanaDashboard("order-service",
Title = "Order Service",
DataSource = "prometheus",
Panels = [typeof(OrderLatencyPanel), typeof(OrderErrorRatePanel)])]
public sealed class OrderServiceDashboard { }
[DeploymentOrchestrator("2.4")]
[MonitorAfterDeploy(typeof(OrderServiceDashboard), DurationMinutes = 15)]
The [MonitorAfterDeploy] attribute takes a typeof(). If the dashboard class is renamed, the compiler emits CS0246. If the dashboard is deleted, the compiler emits CS0246. The deployment cannot reference a dashboard that does not exist. The wiki cannot drift because there is no wiki -- the deployment specification is the code.
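For the snippet to stand alone, the attributes themselves need definitions somewhere. A minimal sketch, with property shapes inferred from the usage above rather than from a published package:

```csharp
using System;

// Assumed attribute definitions that make the snippet above compile.
[AttributeUsage(AttributeTargets.Class)]
public sealed class GrafanaDashboardAttribute : Attribute
{
    public string Slug { get; }
    public string Title { get; set; } = "";
    public string DataSource { get; set; } = "";
    public Type[] Panels { get; set; } = Type.EmptyTypes;
    public GrafanaDashboardAttribute(string slug) => Slug = slug;
}

[AttributeUsage(AttributeTargets.Class)]
public sealed class DeploymentOrchestratorAttribute : Attribute
{
    public string Version { get; }
    public DeploymentOrchestratorAttribute(string version) => Version = version;
}

[AttributeUsage(AttributeTargets.Class)]
public sealed class MonitorAfterDeployAttribute : Attribute
{
    public Type Dashboard { get; }           // typeof() keeps the reference typed
    public int DurationMinutes { get; set; }
    public MonitorAfterDeployAttribute(Type dashboard) => Dashboard = dashboard;
}
```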
The Question
The dev-side DSLs proved that C# attributes, source generators, and Roslyn analyzers can type an entire domain -- from requirements to tests. The pattern is known. The infrastructure is known. The meta-metamodel is known.
So why stop at `dotnet test`?
What if every operational concern was a DSL? What if deployment ordering, migration sequencing, health checks, alerting rules, chaos experiments, performance budgets, compliance matrices, cost budgets, secret rotation, certificate management, backup schedules, scaling rules, SLA definitions, network policies, incident runbooks, feature flags, and data pipeline quality gates were all typed, validated, and generated?
What if `dotnet build` produced not just code, but a complete operational specification -- typed, validated, cross-referenced, and impossible to drift?
That is the Ops DSL Ecosystem. Twenty-two sub-DSLs. Three execution tiers. One meta-pattern. Every artifact generated. No hand-written YAML. No hand-written Terraform. No hand-written Docker Compose. No hand-written bash scripts. You declare C# attributes; `dotnet build` produces everything.
The rest of this series builds it.