The Problem -- Why Ops Is Still Untyped
"Your code is typed from requirement to test. Then it enters production, and everything becomes a string."
The Lifecycle Cliff
Follow a feature through the CMF ecosystem:
Requirement [Feature("OrderCancellation")] → typed, tracked
↓
Domain Model [AggregateRoot(BoundedContext = "Ordering")] → typed, validated
↓
API [ApiEndpoint(Method.POST, "/orders/{id}/cancel")] → typed, documented
↓
Test [FeatureTest(typeof(OrderCancellation))] → typed, linked
↓
Deployment ???
↓
Migration ???
↓
Health Check ???
↓
Alerting ???
↓
Rollback ???

The first four steps are compiler-checked. Roslyn analyzers verify that every requirement has an implementation. Source generators produce test scaffolding, API clients, and traceability matrices. The type system guarantees that if the code compiles, the chain is complete.
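The mechanics behind those checkmarks are plain C#. Here is a minimal, illustrative sketch of the declaration side -- the attribute shape is an assumption for this article, not the real CMF source:

```csharp
using System;

// Assumed shape of the dev-side declaration: the attribute is the policy,
// and a Roslyn analyzer (not shown) verifies at build time that every
// tracked requirement has a class carrying it.
[AttributeUsage(AttributeTargets.Class)]
public sealed class FeatureAttribute : Attribute
{
    public string Name { get; }
    public FeatureAttribute(string name) => Name = name;
}

[Feature("OrderCancellation")]
public sealed class OrderCancellation
{
    // Domain logic lives here. Renaming or deleting this class is a
    // build-time event, visible to every analyzer in the chain.
}
```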
Then the feature ships. And the lifecycle falls off a cliff.
The Anti-Pattern: A Real Deployment Specification
Here is how a deployment is "specified" in most organizations. This is a wiki page, written by a developer who has since left the team:
## Order Service v2.4 Deployment
1. Run DB migration script `047_add_payment_status.sql`
2. Wait for migration to complete (~5 min)
3. Deploy order-service v2.4 to staging
4. Run smoke tests (see Postman collection in Teams)
5. If smoke tests pass, deploy to production (rolling update)
6. Check new payment-status endpoint returns 200
7. Monitor Grafana dashboard "Order Service" for 15 min
8. If errors > 5%, rollback to v2.3 (kubectl rollout undo)
9. Enable feature flag `payment-status-v2` in LaunchDarkly
10. Update Slack channel #deployments
11. Close JIRA ticket ORD-1847
Last updated: 2025-11-03 (by: unknown)

Count the problems:
Step 1: The migration script is referenced by filename. Has it been renamed? Does it still exist in the migrations folder? Does it match the schema version the code expects? Nobody knows at compile time.
Step 4: "See Postman collection in Teams." Where in Teams? Which channel? Which collection version? The Postman collection was last exported 3 months ago and is missing the payment-status endpoint.
Step 6: The endpoint /payment-status was renamed to /payments/status in the v2.4 codebase. This step will fail. Nobody will discover this until deployment day.
Step 7: The Grafana dashboard was renamed from "Order Service" to "Order Service v2" two weeks ago. The wiki was not updated.
Step 8: kubectl rollout undo will roll back to the previous deployment, which may or may not be v2.3. If another hotfix was deployed in between, the rollback target is wrong.
Step 9: The feature flag name in LaunchDarkly is payment_status_v2 (underscores), not payment-status-v2 (hyphens). The wiki has the wrong name.
Step 11: The JIRA ticket was moved to a different project during a reorg. The key is now PAYMENTS-1847.
Every step is a string. Every string can drift. Every drift is a production incident waiting to happen. And the wiki page says "Last updated: 2025-11-03 (by: unknown)" -- which means nobody has reviewed it in 5 months, and the original author is untraceable.
This is not a strawman. This is the median deployment specification in the industry.
The Taxonomy of Untyped Operational Knowledge
The wiki deployment checklist is just one species. Here is the complete taxonomy of operational knowledge that lives outside the type system:
Deployment & Release
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Deployment ordering | Wiki page, mental model | Wrong order → cascading failure |
| Migration sequencing | Filename conventions (001_, 002_) | Gaps, duplicates, parallel conflicts |
| Rollback conditions | Slack messages, verbal agreement | Wrong threshold, wrong target version |
| Canary metrics | Helm values, dashboard JSON | Metric renamed, threshold stale |
| Feature flag rules | LaunchDarkly UI, spreadsheet | Flag name mismatch, stale rules |
Observability & Response
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Health checks | Startup.cs, hardcoded paths | Endpoint removed, path changed |
| Alerting rules | Prometheus YAML, Grafana UI | Alert references deleted metric |
| SLA definitions | Contract PDF, spreadsheet | SLO drifts from implementation |
| Incident runbooks | Confluence, PagerDuty notes | Steps outdated, wrong escalation |
| Escalation policies | PagerDuty UI, team wiki | Rotation changed, person left |
Infrastructure & Security
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Configuration transforms | Helm templates, env files | Missing key in production, wrong value |
| Secret rotation | Manual process, ticket reminder | Expired secret → outage |
| Network policies | Kubernetes YAML, Terraform | Policy blocks legitimate traffic |
| Certificate management | Calendar reminder, monitoring | Cert expires → TLS failure |
| Backup schedules | Cron jobs, cloud console | Backup misconfigured, never tested |
Performance & Resilience
| Concern | Where It Lives Today | What Goes Wrong |
|---|---|---|
| Performance budgets | README, PR review convention | Budget violated silently |
| Chaos experiments | Gremlin UI, ad-hoc scripts | Experiment scope unknown, no hypothesis |
| Scaling rules | Cloud console, HPA YAML | Wrong threshold, cost explosion |
| Circuit breaker configs | appsettings.json, hardcoded | Threshold too aggressive or too lenient |
| Cost budgets | Spreadsheet, monthly review | Overspend discovered after the fact |
That is twenty categories of operational knowledge. Every one of them has the same failure mode: it is a string that drifts from the code it describes.
Why Terraform Does Not Solve This
The immediate objection: "We have Terraform / Pulumi / Helm / CDK. Our infrastructure IS typed."
Terraform solves infrastructure provisioning. It tells the cloud provider: create a VM with 4 CPUs, attach a 100GB disk, put it in subnet X. That is valuable. That is not what we are talking about.
Terraform does not express:
- "When error rate exceeds 5% for 2 minutes, page the on-call engineer" -- that is an alerting rule, not a resource
- "Migration 47 must complete before the application starts" -- that is deployment ordering, not provisioning
- "The canary deployment should measure p99 latency against the baseline" -- that is a canary metric, not infrastructure
- "This service's p95 response time must stay below 200ms" -- that is a performance budget, not a VM size
- "If the circuit breaker opens 3 times in 10 minutes, trigger the rollback" -- that is a resilience policy, not a network rule
- "The backup must be tested quarterly by restoring to a staging environment" -- that is a compliance requirement, not a storage bucket
Terraform/Pulumi type the nouns of infrastructure (VMs, databases, networks). They do not type the verbs of operations (deploy, migrate, alert, rollback, scale, rotate, audit).
Helm templates YAML but does not validate that your canary references a real metric. Pulumi is typed infrastructure-as-code, but the types describe cloud resources, not operational policies. CDK is CloudFormation with better syntax, not operational knowledge with compiler enforcement.
The gap is not "untyped infrastructure." Cloud teams solved that years ago. The gap is untyped operational knowledge -- the rules, policies, thresholds, sequences, and conditions that govern how typed infrastructure behaves in production.
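To make the noun/verb distinction concrete, here is an illustrative sketch of the first bullet above expressed as a typed declaration. Every name in it is an assumption for this article, not an existing library:

```csharp
using System;

public enum AlertSeverity { Ticket, Page }

// A typed "verb" of operations: a generator could emit Prometheus YAML
// from this, and an analyzer could reject a metric name that no metric
// declaration in the codebase defines.
[AttributeUsage(AttributeTargets.Class)]
public sealed class AlertRuleAttribute : Attribute
{
    public string Metric { get; set; } = "";
    public double Threshold { get; set; }
    public int ForMinutes { get; set; }
    public AlertSeverity Severity { get; set; }
}

// "When error rate exceeds 5% for 2 minutes, page the on-call engineer."
[AlertRule(Metric = "order_service_error_rate",
           Threshold = 0.05,
           ForMinutes = 2,
           Severity = AlertSeverity.Page)]
public sealed class OrderServiceErrorRateAlert { }
```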
The Typed Requirement Chain
The dev-side DSLs established a pattern:
| Policy | Declaration | Generated Artifact | Validation |
|---|---|---|---|
| Requirement | [Feature] | RequirementConstants | FeatureComplianceValidator |
| DDD | [AggregateRoot] | Repository, Events | BoundedContextAnalyzer |
| API | [ApiEndpoint] | OpenAPI, Client | EndpointCoverageAnalyzer |
| Test | [FeatureTest] | TestScaffold | TestCoverageAnalyzer |

At every level: a human declares policy via an attribute, the source generator produces the artifact, and an analyzer validates the constraints. The chain is typed end-to-end. Violations are compile errors, not production incidents.
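The generator half of that loop is ordinary Roslyn machinery. A minimal sketch, assuming the [Feature] attribute from earlier lives in the global namespace (the real CMF generators are more involved):

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;

[Generator]
public sealed class FeatureConstantsGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        // Find every class decorated with [Feature] and pull out its name.
        var features = context.SyntaxProvider.ForAttributeWithMetadataName(
            "FeatureAttribute",
            predicate: static (_, _) => true,
            transform: static (ctx, _) =>
                ctx.Attributes[0].ConstructorArguments[0].Value?.ToString() ?? "");

        // Emit one constants class from all declarations -- the generated
        // artifact of the Requirement row in the table above.
        context.RegisterSourceOutput(features.Collect(), static (spc, names) =>
        {
            var fields = string.Join("\n",
                names.Select(n => $"    public const string {n} = \"{n}\";"));
            spc.AddSource("RequirementConstants.g.cs",
                $"public static class RequirementConstants\n{{\n{fields}\n}}");
        });
    }
}
```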
The same chain should work for ops:
| Policy | Declaration | Generated Artifact | Validation |
|---|---|---|---|
| Deployment | [DeployOrchestrator] | DAG, Helm values | CircularDependencyAnalyzer |
| Migration | [MigrationStep] | Ordered scripts | GapDetectionAnalyzer |
| Observability | [HealthCheck] | Probe registration | StaleMetricAnalyzer |
| Alerting | [AlertRule] | Prometheus YAML | ThresholdRangeAnalyzer |
| Resilience | [CircuitBreaker] | Polly policy | FallbackCoverageAnalyzer |
| Chaos | [ChaosExperiment] | Decorator/Litmus | HypothesisRequiredAnalyzer |
| Performance | [PerformanceBudget] | k6 script | BaselineRegressionAnalyzer |
| Security | [SecurityPolicy] | Headers middleware | PolicyCoverageAnalyzer |
| Cost | [CostBudget] | Alert config | BudgetThresholdAnalyzer |

Every row follows the same meta-pattern. Every row produces artifacts that are guaranteed correct by construction. Every row has an analyzer that catches violations before the code leaves the developer's machine.
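Take the Migration row as an illustrative sketch -- the attribute shape is assumed, but it shows how ordering becomes a typed reference instead of a filename convention:

```csharp
using System;

// Assumed shape: the order is explicit, and the predecessor is a typeof()
// reference, so a GapDetectionAnalyzer can check the sequence and the
// compiler catches a renamed or deleted predecessor as CS0246.
[AttributeUsage(AttributeTargets.Class)]
public sealed class MigrationStepAttribute : Attribute
{
    public int Order { get; }
    public Type DependsOn { get; set; }
    public MigrationStepAttribute(int order) => Order = order;
}

[MigrationStep(46)]
public sealed class AddOrderTable { }

[MigrationStep(47, DependsOn = typeof(AddOrderTable))]
public sealed class AddPaymentStatus { }
```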
The Cost of Not Typing
Consider what happens in the untyped world when a developer renames a Grafana dashboard:
- Developer renames dashboard in Grafana UI
- Wiki deployment checklist still references old name
- Six weeks later, a deployment follows the checklist
- Step 7 says "Monitor Grafana dashboard 'Order Service' for 15 min"
- The dashboard is now called "Order Service v2"
- The deployer searches, finds nothing, skips the step
- The deployment has a latency regression
- Nobody notices for 45 minutes because monitoring was skipped
- Postmortem: "We should update the wiki"
- The wiki gets updated. The cycle restarts.
Now consider the typed world:
[GrafanaDashboard("order-service",
Title = "Order Service",
DataSource = "prometheus",
Panels = [typeof(OrderLatencyPanel), typeof(OrderErrorRatePanel)])]
public sealed class OrderServiceDashboard { }
[DeploymentOrchestrator("2.4")]
[MonitorAfterDeploy(typeof(OrderServiceDashboard), DurationMinutes = 15)]
The [MonitorAfterDeploy] attribute takes a typeof(). If the dashboard class is renamed, the compiler emits CS0246. If the dashboard is deleted, the compiler emits CS0246. The deployment cannot reference a dashboard that does not exist. The wiki cannot drift because there is no wiki -- the deployment specification is the code.
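For the snippet to stand alone, the attributes themselves need definitions somewhere. A minimal sketch, with property shapes inferred from the usage above rather than from a published package:

```csharp
using System;

// Assumed attribute definitions that make the snippet above compile.
[AttributeUsage(AttributeTargets.Class)]
public sealed class GrafanaDashboardAttribute : Attribute
{
    public string Slug { get; }
    public string Title { get; set; } = "";
    public string DataSource { get; set; } = "";
    public Type[] Panels { get; set; } = Type.EmptyTypes;
    public GrafanaDashboardAttribute(string slug) => Slug = slug;
}

[AttributeUsage(AttributeTargets.Class)]
public sealed class DeploymentOrchestratorAttribute : Attribute
{
    public string Version { get; }
    public DeploymentOrchestratorAttribute(string version) => Version = version;
}

[AttributeUsage(AttributeTargets.Class)]
public sealed class MonitorAfterDeployAttribute : Attribute
{
    public Type Dashboard { get; }           // typeof() keeps the reference typed
    public int DurationMinutes { get; set; }
    public MonitorAfterDeployAttribute(Type dashboard) => Dashboard = dashboard;
}
```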
The Question
The dev-side DSLs proved that C# attributes, source generators, and Roslyn analyzers can type an entire domain -- from requirements to tests. The pattern is known. The infrastructure is known. The meta-metamodel is known.
So why stop at `dotnet test`?
What if every operational concern was a DSL? What if deployment ordering, migration sequencing, health checks, alerting rules, chaos experiments, performance budgets, compliance matrices, cost budgets, secret rotation, certificate management, backup schedules, scaling rules, SLA definitions, network policies, incident runbooks, feature flags, and data pipeline quality gates were all typed, validated, and generated?
What if `dotnet build` produced not just code, but a complete operational specification -- typed, validated, cross-referenced, and impossible to drift?
That is the Ops DSL Ecosystem. Twenty-two sub-DSLs. Three execution tiers. One meta-pattern. Every artifact generated. No hand-written YAML. No hand-written Terraform. No hand-written Docker Compose. No hand-written bash scripts. You declare C# attributes; `dotnet build` produces everything.
The rest of this series builds it.