
Shared Primitives -- The Ops Kernel

"Eight concepts. Every Ops DSL composes them. No sub-DSL invents its own threshold, its own severity, its own environment model."


Why a Shared Kernel

The 22 sub-DSLs must share a common vocabulary. Without it, each DSL invents its own:

  • Chaos has ChaosSeverity.High, Observability has AlertSeverity.Critical, Incident has IncidentPriority.P1 -- three names for the same concept
  • Performance uses EnvironmentName = "prod", Deployment uses TargetEnv.Production, Configuration uses Tier = "production" -- three representations of the same environment
  • SLA defines thresholds as (metric, operator, value), Observability defines them as (name, condition, threshold), Cost defines them as (budget, limit, alert) -- three shapes for the same pattern

The shared kernel normalizes these into 8 primitives that every DSL reuses. A Warning in Chaos is the same OpsSeverity.Warning as in Observability. A Production in Deployment is the same EnvironmentTier.Production as in Configuration. A threshold is always (metric, condition, value, severity).
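
To make the kernel concrete, here is a sketch of what those shared types could look like. The member names below are the ones this article itself uses; anything beyond them is an assumption, not the library's actual source.

```csharp
// Sketch only: the shared kernel enums as used throughout this article.
// Member order, and any members beyond those named here, are assumptions.
public enum OpsSeverity { Info, Warning, Critical, PageNow }

public enum EnvironmentTier
{
    Development, Testing, Staging, Production, DisasterRecovery
}

public enum ThresholdCondition { GreaterThan, LessThan }

// The canonical threshold shape: (metric, condition, value, severity).
public readonly record struct OpsThresholdShape(
    string Metric,
    ThresholdCondition Condition,
    double Value,
    OpsSeverity Severity);
```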


Primitive 1: OpsTarget -- What You Are Operating

Every operational concern applies to something. That something is an OpsTarget.

[OpsProbe("order-db-health", Target = OpsTarget.Database)]
[OpsProbe("order-api-health", Target = OpsTarget.Application)]
[OpsProbe("order-cache-health", Target = OpsTarget.Cache)]
public sealed class OrderServiceHealthChecks { }

The target determines what kind of probe, threshold, and artifact makes sense:

Target        Valid Probe Kinds   Generated Artifact
Application   Http, Grpc          Kubernetes readiness/liveness probe
Database      Sql, Tcp            Connection pool check, migration status
Queue         Tcp, Http           Queue depth metric, consumer lag alert
Cache         Tcp, Command        Hit rate metric, eviction alert
Gateway       Http, Tcp           Upstream health aggregation
Storage       Http, Command       Bucket accessibility, quota check
Network       Tcp                 Connectivity probe, latency measurement
Certificate   Command             Expiry check, chain validation
Dns           Command             Resolution check, TTL validation
Cdn           Http                Cache hit rate, purge verification

The analyzer validates target-probe compatibility:

// OPS020: Sql probe on Application target
[OpsProbe("bad-probe", Target = OpsTarget.Application, Kind = ProbeKind.Sql)]
//                                                            ^^^^^^^^^^^^
// Error: ProbeKind.Sql is only valid for OpsTarget.Database

Target Hierarchies

Sub-DSLs extend targets with domain-specific detail. The Deployment DSL adds application-level properties:

[DeploymentOrchestrator("2.4")]
[DeploymentApp("order-service",
    Target = OpsTarget.Application,
    Runtime = "dotnet",
    Replicas = 3)]
[DeploymentApp("order-db",
    Target = OpsTarget.Database,
    Engine = "postgres",
    Version = "16")]
public sealed class OrderServiceV24Deployment { }

The Chaos DSL uses targets to determine fault injection scope:

[ChaosExperiment("CacheFailure", Tier = OpsExecutionTier.InProcess)]
[TargetService(typeof(ICacheService), Target = OpsTarget.Cache)]
[FaultInjection(FaultKind.Exception,
    ExceptionType = typeof(RedisConnectionException))]
public sealed class CacheFailureExperiment { }

The OpsTarget flows through the generated artifacts. A health check with Target = OpsTarget.Database generates a Kubernetes liveness probe that checks the database connection. A health check with Target = OpsTarget.Application generates a readiness probe that calls the HTTP health endpoint.


Primitive 2: OpsProbe -- Checking the Target

A probe is the atomic unit of operational observation. Every monitoring system, every health check, every readiness gate reduces to: "call something, check the result, report the status."
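
That reduction can be stated as a minimal interface. This is an illustrative shape, not the library's actual API:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Sketch: every probe is "call something, check the result, report the status".
public enum ProbeStatus { Healthy, Degraded, Unhealthy }

public interface IOpsProbe
{
    string Name { get; }

    // One observation: call the target and report its status.
    Task<ProbeStatus> ExecuteAsync(CancellationToken ct);
}
```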

Full Usage Example

[OpsProbe("order-db-connectivity",
    Target = OpsTarget.Database,
    Kind = ProbeKind.Sql,
    IntervalSeconds = 15,
    TimeoutSeconds = 3,
    FailureThreshold = 3,
    Endpoint = "SELECT 1")]
[OpsProbe("order-api-ready",
    Target = OpsTarget.Application,
    Kind = ProbeKind.Http,
    IntervalSeconds = 10,
    TimeoutSeconds = 5,
    FailureThreshold = 2,
    Endpoint = "/health/ready")]
[OpsProbe("order-cache-alive",
    Target = OpsTarget.Cache,
    Kind = ProbeKind.Tcp,
    IntervalSeconds = 30,
    TimeoutSeconds = 2,
    FailureThreshold = 5)]
public sealed class OrderServiceProbes { }

Generated: Health Check Registration

// <auto-generated by Ops.Observability.Generators />
namespace Ops.Observability.Generated;

public static class OrderServiceProbesRegistration
{
    public static IHealthChecksBuilder AddOrderServiceProbes(
        this IHealthChecksBuilder builder)
    {
        builder.AddCheck("order-db-connectivity",
            new SqlHealthCheck(
                query: "SELECT 1",
                timeout: TimeSpan.FromSeconds(3)),
            failureStatus: HealthStatus.Unhealthy,
            tags: ["db", "readiness"]);

        builder.AddCheck("order-api-ready",
            new HttpHealthCheck(
                endpoint: "/health/ready",
                timeout: TimeSpan.FromSeconds(5)),
            failureStatus: HealthStatus.Unhealthy,
            tags: ["api", "readiness"]);

        builder.AddCheck("order-cache-alive",
            new TcpHealthCheck(
                timeout: TimeSpan.FromSeconds(2)),
            failureStatus: HealthStatus.Degraded,
            tags: ["cache", "liveness"]);

        return builder;
    }
}

Generated: Kubernetes Probes

# <auto-generated by Ops.Observability.Generators />
# Probes for: OrderServiceProbes
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: order-service
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 2
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 30
        timeoutSeconds: 2
        failureThreshold: 5

One set of attributes. Two generated artifacts (C# health check registration + Kubernetes YAML). Both always in sync because they are generated from the same source.


Primitive 3: OpsThreshold -- Severity Escalation

Thresholds define when a metric value becomes a problem. The severity determines the response.

The Escalation Pattern

[OpsThreshold("order.error.rate",
    ThresholdCondition.GreaterThan, 0.01,
    Severity = OpsSeverity.Info,
    Unit = "ratio",
    Description = "Error rate above 1% — log for investigation")]
[OpsThreshold("order.error.rate",
    ThresholdCondition.GreaterThan, 0.05,
    Severity = OpsSeverity.Warning,
    Description = "Error rate above 5% — Slack notification")]
[OpsThreshold("order.error.rate",
    ThresholdCondition.GreaterThan, 0.10,
    Severity = OpsSeverity.Critical,
    Description = "Error rate above 10% — PagerDuty alert")]
[OpsThreshold("order.error.rate",
    ThresholdCondition.GreaterThan, 0.25,
    Severity = OpsSeverity.PageNow,
    Description = "Error rate above 25% — immediate page, likely outage")]
public sealed class OrderServiceThresholds { }

Four thresholds on the same metric, escalating from Info to PageNow. The Source Generator produces a Prometheus alerting rule with severity labels:

# <auto-generated by Ops.Observability.Generators />
groups:
  - name: OrderServiceThresholds
    rules:
      - alert: OrderErrorRateInfo
        expr: rate(order_errors_total[5m]) / rate(order_requests_total[5m]) > 0.01
        labels:
          severity: info
        annotations:
          summary: "Error rate above 1% — log for investigation"

      - alert: OrderErrorRateWarning
        expr: rate(order_errors_total[5m]) / rate(order_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% — Slack notification"

      - alert: OrderErrorRateCritical
        expr: rate(order_errors_total[5m]) / rate(order_requests_total[5m]) > 0.10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10% — PagerDuty alert"

      - alert: OrderErrorRatePageNow
        expr: rate(order_errors_total[5m]) / rate(order_requests_total[5m]) > 0.25
        labels:
          severity: pagenow
        annotations:
          summary: "Error rate above 25% — immediate page, likely outage"

Cross-DSL Usage

The same OpsThreshold primitive is used by multiple DSLs with different semantics:

DSL             Metric                    Meaning
Observability   order.error.rate          Alert when error rate exceeds boundary
Resilience      order.canary.error.rate   Rollback canary when metric exceeds threshold
Performance     order.p95.latency         Fail build when latency regresses
Cost            order.monthly.cost        Alert when cloud spend exceeds budget
SLA             order.availability        Burn error budget when availability drops
Chaos           order.completion.rate     Validate steady-state hypothesis during experiment

Same attribute shape. Same severity enum. Same condition operators. Different DSL, different generated artifact.
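
That shared shape implies an attribute definition along these lines. This is a sketch reconstructed from the usage above; the real source may differ:

```csharp
using System;

// Sketch: the (metric, condition, value, severity) shape as an attribute.
// AllowMultiple = true is what permits the escalation pattern above.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class OpsThresholdAttribute : Attribute
{
    public OpsThresholdAttribute(
        string metric, ThresholdCondition condition, double value)
    {
        Metric = metric;
        Condition = condition;
        Value = value;
    }

    public string Metric { get; }
    public ThresholdCondition Condition { get; }
    public double Value { get; }

    // Named arguments seen throughout this section.
    public OpsSeverity Severity { get; set; }
    public string? Unit { get; set; }
    public string? Description { get; set; }
}
```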


Primitive 4: OpsPolicy -- Governance Enforcement

Policies are rules about rules. They govern how the other primitives may be used.

Enforcement Modes

// CompileTime: Roslyn analyzer emits error. Build fails.
[OpsPolicy("AllHealthChecksRequired",
    Enforcement = PolicyEnforcement.CompileTime,
    Description = "Every deployment must declare at least one health check",
    AppliesTo = new[] { "DeploymentOrchestrator" })]

// RuntimeWarning: ILogger.Warning at startup. Build succeeds.
[OpsPolicy("RecommendChaosTests",
    Enforcement = PolicyEnforcement.RuntimeWarning,
    Description = "Services with > 3 external dependencies should have chaos tests",
    Rationale = "High dependency count increases failure surface")]

// RuntimeBlock: Throws at startup. Build succeeds, runtime fails.
[OpsPolicy("RequireEncryptionAtRest",
    Enforcement = PolicyEnforcement.RuntimeBlock,
    Description = "All database connections must use TLS in production",
    AppliesTo = new[] { "Database" })]
public sealed class OrderServicePolicies { }

CompileTime enforcement is the strongest: the analyzer scans the compilation for policy violations and emits diagnostic errors. The project does not build until the policy is satisfied. This is appropriate for rules that must never be violated (every deployment has a health check, every canary has a metric).

RuntimeWarning enforcement is for recommendations: the generated IHostedService checks the policy at startup and logs a warning. This is appropriate for best practices that teams should follow but are not hard requirements (chaos tests for high-dependency services).

RuntimeBlock enforcement is for environment-specific rules: the generated startup check throws an InvalidOperationException if the policy is violated in the target environment. This is appropriate for security rules that must be enforced in production but can be relaxed in development (TLS requirement, secret rotation).
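
The three modes collapse into a single enum; a sketch consistent with the examples above:

```csharp
// Sketch: one enum, three places where a violation can surface.
public enum PolicyEnforcement
{
    CompileTime,    // analyzer diagnostic; build fails
    RuntimeWarning, // ILogger warning at startup; build and run succeed
    RuntimeBlock    // InvalidOperationException at startup; run fails
}
```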

Generated Policy Validator

// <auto-generated by Ops.Primitives.Generators />
namespace Ops.Primitives.Generated;

public sealed class OpsPolicyValidator : IHostedService
{
    private readonly ILogger<OpsPolicyValidator> _logger;
    private readonly IHostEnvironment _env;

    public OpsPolicyValidator(
        ILogger<OpsPolicyValidator> logger,
        IHostEnvironment env)
    {
        _logger = logger;
        _env = env;
    }

    public Task StartAsync(CancellationToken ct)
    {
        // RuntimeWarning: RecommendChaosTests
        if (ExternalDependencyCount > 3 && !HasChaosTests)
        {
            _logger.LogWarning(
                "Policy 'RecommendChaosTests': Service has {Count} external " +
                "dependencies but no chaos tests declared",
                ExternalDependencyCount);
        }

        // RuntimeBlock: RequireEncryptionAtRest (production only)
        if (_env.IsProduction() && !AllDatabaseConnectionsUseTls)
        {
            throw new InvalidOperationException(
                "Policy 'RequireEncryptionAtRest' violated: " +
                "Database connection 'OrderDb' does not use TLS. " +
                "All database connections must use TLS in production.");
        }

        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}

Primitive 5: OpsEnvironment -- Scoping to Tiers

Not every declaration applies to every environment. A chaos experiment that kills database connections should not run in production (unless explicitly configured). A backup schedule that runs every hour is overkill for development.

[ChaosExperiment("DbPartition", Tier = OpsExecutionTier.Container)]
[OpsEnvironment(EnvironmentTier.Testing)]
[OpsEnvironment(EnvironmentTier.Staging)]
// Intentionally NOT Production — this experiment is too destructive
public sealed class DbPartitionExperiment { }

[BackupSchedule("order-db-backup",
    CronExpression = "0 */6 * * *")]
[OpsEnvironment(EnvironmentTier.Production)]
[OpsEnvironment(EnvironmentTier.DisasterRecovery)]
// No backup in dev/test — waste of resources
public sealed class OrderDbBackup { }

The Source Generator filters declarations by environment when generating artifacts:

# <auto-generated by Ops.Storage.Generators />
# Environment: Production
# Backup: order-db-backup — every 6 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: order-db-backup
spec:
  schedule: "0 */6 * * *"
  # ...

The analyzer validates environment constraints:

OPS022: ChaosExperiment 'DbPartition' has no OpsEnvironment declaration.
        Add [OpsEnvironment] to scope this experiment to specific environments,
        or add [OpsEnvironment(EnvironmentTier.Production)] to confirm
        it should run everywhere.

Primitive 6: OpsSchedule -- Cron-Based Operations

Many operational concerns are time-based: backup schedules, secret rotation, compliance audits, cost reports, certificate renewal checks.

[OpsSchedule("0 2 * * *",
    Timezone = "Europe/Paris",
    Description = "Nightly backup at 2 AM Paris time")]
[OpsEnvironment(EnvironmentTier.Production)]
public sealed class OrderDbNightlyBackup { }

[OpsSchedule("0 0 1 * *",
    Timezone = "UTC",
    Description = "Monthly cost report on the 1st")]
public sealed class OrderServiceCostReport { }

[OpsSchedule("0 8 * * 1",
    Timezone = "UTC",
    Description = "Weekly certificate expiry check on Monday 8 AM")]
public sealed class CertificateExpiryCheck { }

The analyzer validates cron expressions at compile time:

OPS024: Invalid cron expression '0 25 * * *' on CertificateExpiryCheck.
        Hour field '25' is out of range (0-23).
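
In Roslyn terms, a diagnostic like OPS024 is declared once as a DiagnosticDescriptor. The sketch below is hypothetical: the id matches this article, but the title, category, and message wording are assumptions.

```csharp
using Microsoft.CodeAnalysis;

// Hypothetical declaration of OPS024; only the id comes from this article.
internal static class OpsScheduleDiagnostics
{
    public static readonly DiagnosticDescriptor InvalidCronExpression = new(
        id: "OPS024",
        title: "Invalid cron expression",
        messageFormat: "Invalid cron expression '{0}' on {1}: {2}",
        category: "Ops.Schedule",
        defaultSeverity: DiagnosticSeverity.Error,
        isEnabledByDefault: true);
}
```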

Primitive 7: OpsExecutionTier -- Tier Constraint Enforcement

Covered in depth in Part 2. Here we focus on the constraint enforcement implementation.

The analyzer maintains a classification of every Ops attribute by its minimum required tier:

internal static class TierClassification
{
    // Attributes that are valid at InProcess (Tier 0+)
    private static readonly HashSet<string> InProcessValid =
    [
        "ChaosExperimentAttribute",     // with TargetService
        "FaultInjectionAttribute",
        "SteadyStateHypothesisAttribute",
        "PerformanceBudgetAttribute",
        "CircuitBreakerAttribute",
        "RetryPolicyAttribute",
        "FallbackAttribute",
        "RateLimitAttribute",
    ];

    // Attributes that require Container (Tier 1+)
    private static readonly HashSet<string> ContainerRequired =
    [
        "ContainerAttribute",
        "ToxiProxyAttribute",
        "NetworkFaultAttribute",
        "DockerVolumeAttribute",
    ];

    // Attributes that require Cloud (Tier 2)
    private static readonly HashSet<string> CloudRequired =
    [
        "CloudProviderAttribute",
        "CloudRegionAttribute",
        "AzFailureAttribute",
        "DistributedFromAttribute",
        "TerraformResourceAttribute",
    ];
}

When a class declares Tier = OpsExecutionTier.InProcess but uses a ContainerRequired attribute, the analyzer emits OPS014. The constraint matrix is clear:

Declared Tier   InProcess attrs   Container attrs   Cloud attrs
InProcess       allowed           OPS014 error      OPS014 error
Container       allowed           allowed           OPS016 error
Cloud           allowed           allowed           allowed
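
A violation looks like this in practice. The experiment name and the ToxiProxy argument below are hypothetical; the diagnostic follows the matrix above:

```csharp
// OPS014: ToxiProxyAttribute requires OpsExecutionTier.Container or higher,
// but the declared tier is InProcess. (Names here are illustrative.)
[ChaosExperiment("BadTierMix", Tier = OpsExecutionTier.InProcess)]
[ToxiProxy("redis-proxy")]
public sealed class BadTierMixExperiment { }
```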

Primitive 8: OpsRequirementLink -- Linking Ops to Requirements

The Requirements DSL declares features and acceptance criteria. The Ops DSLs implement the operational aspects of those requirements. The OpsRequirementLink primitive bridges them.

// In Requirements DSL:
[Feature("OrderCancellation",
    Description = "Customer can cancel an order within 30 minutes")]
public sealed class OrderCancellationFeature
{
    // Criteria are const members so Ops DSLs can reference them via nameof()
    [AcceptanceCriterion]
    public const string CancellationTriggersFullRefund = "Cancellation triggers full refund";

    [AcceptanceCriterion]
    public const string CancelledOrderStatusUpdatedWithin5Seconds = "Cancelled order status updated within 5 seconds";
}

// In Ops DSLs:
[PerformanceBudget("order.cancel.latency",
    P95Ms = 5000,
    Description = "Cancellation must complete within 5 seconds")]
[OpsRequirementLink(
    typeof(OrderCancellationFeature),
    AcceptanceCriterion = nameof(
        OrderCancellationFeature.CancelledOrderStatusUpdatedWithin5Seconds),
    Rationale = "AC requires status update within 5 seconds; " +
                "performance budget enforces this at the ops level")]
public sealed class CancellationLatencyBudget { }

[ChaosExperiment("RefundGatewayTimeout",
    Tier = OpsExecutionTier.InProcess)]
[TargetService(typeof(IRefundGateway))]
[FaultInjection(FaultKind.Timeout, Probability = 0.3)]
[OpsRequirementLink(
    typeof(OrderCancellationFeature),
    AcceptanceCriterion = nameof(
        OrderCancellationFeature.CancellationTriggersFullRefund),
    Rationale = "Verify refund completes even when payment gateway is slow")]
public sealed class RefundGatewayTimeoutExperiment { }

The typeof() is a compile-time reference. If OrderCancellationFeature is renamed or deleted, the compiler emits CS0246. The nameof() is a compile-time reference to the acceptance criterion. If the criterion is renamed, the compiler catches it.

The Source Generator produces a traceability report linking requirements to their operational coverage:

// <auto-generated> ops-requirement-traceability.g.json
{
  "requirements": [
    {
      "type": "OrderCancellationFeature",
      "feature": "OrderCancellation",
      "acceptanceCriteria": [
        {
          "name": "CancellationTriggersFullRefund",
          "opsLinks": [
            {
              "dsl": "Ops.Chaos",
              "declaration": "RefundGatewayTimeoutExperiment",
              "kind": "ChaosExperiment",
              "tier": "InProcess",
              "rationale": "Verify refund completes even when payment gateway is slow"
            }
          ]
        },
        {
          "name": "CancelledOrderStatusUpdatedWithin5Seconds",
          "opsLinks": [
            {
              "dsl": "Ops.Performance",
              "declaration": "CancellationLatencyBudget",
              "kind": "PerformanceBudget",
              "tier": "InProcess",
              "rationale": "AC requires status update within 5 seconds"
            }
          ]
        }
      ]
    }
  ]
}

The Generated Ops Manifest

Every Ops declaration from every sub-DSL is collected into a single manifest file: ops-manifest.g.json. This is the single source of truth for the operational posture of a service.

// <auto-generated> ops-manifest.g.json
{
  "generatedAt": "2026-04-06T14:32:00Z",
  "service": "OrderService",
  "version": "2.4.0",
  "summary": {
    "totalDeclarations": 47,
    "byDsl": {
      "Deployment": 3,
      "Migration": 5,
      "Observability": 12,
      "Configuration": 4,
      "Resilience": 6,
      "Chaos": 8,
      "Performance": 4,
      "Security": 2,
      "SLA": 2,
      "Compliance": 1
    },
    "byTier": {
      "InProcess": 31,
      "Container": 12,
      "Cloud": 4
    },
    "byEnvironment": {
      "Development": 15,
      "Testing": 31,
      "Staging": 40,
      "Production": 47,
      "DisasterRecovery": 8
    }
  },
  "declarations": [
    {
      "dsl": "Ops.Observability",
      "kind": "HealthCheck",
      "name": "order-db-connectivity",
      "target": "Database",
      "tier": "InProcess",
      "environments": ["Testing", "Staging", "Production"],
      "probe": {
        "kind": "Sql",
        "endpoint": "SELECT 1",
        "intervalSeconds": 15,
        "timeoutSeconds": 3,
        "failureThreshold": 3
      }
    },
    {
      "dsl": "Ops.Chaos",
      "kind": "ChaosExperiment",
      "name": "PaymentTimeout",
      "target": "Application",
      "tier": "InProcess",
      "environments": ["Testing", "Staging"],
      "fault": {
        "kind": "Timeout",
        "probability": 0.3,
        "timeoutMs": 5000
      },
      "hypothesis": {
        "metric": "order.completion.rate",
        "condition": "GreaterThan",
        "value": 0.95
      },
      "requirementLinks": [
        {
          "requirement": "OrderCancellationFeature",
          "criterion": "CancellationTriggersFullRefund"
        }
      ]
    }
    // ... 45 more declarations
  ]
}

The dotnet ops report Command

A CLI tool reads ops-manifest.g.json and produces a human-readable operational posture report:

$ dotnet ops report

OrderService v2.4.0 — Operational Posture
═════════════════════════════════════════════

Declarations:  47 total (31 InProcess, 12 Container, 4 Cloud)
Health Checks: 6 probes across 4 targets
Chaos Tests:   8 experiments (5 InProcess, 2 Container, 1 Cloud)
Perf Budgets:  4 budgets (p95 < 200ms, p99 < 500ms, error < 0.1%)
SLA:           99.9% availability, 43.8 min/month error budget
Compliance:    SOC2 Type II — 12/14 controls evidenced

Coverage by Environment:
  Development     15/47  (32%)   ← expected: not all ops concerns apply to dev
  Testing         31/47  (66%)
  Staging         40/47  (85%)
  Production      47/47  (100%)  ← full coverage
  DR               8/47  (17%)   ← expected: only backup/failover relevant

Cross-DSL Validation: 14/14 rules passed ✓
Requirement Traceability: 4 features linked, 11/14 ACs covered (79%)

The Generated DI Registration

// <auto-generated by Ops.Primitives.Generators />
namespace Ops.Generated;

public static class OpsRegistryExtensions
{
    /// Registers all Ops infrastructure discovered in this compilation:
    /// health checks, metrics, middleware, policies, validators.
    public static IServiceCollection AddOpsInfrastructure(
        this IServiceCollection services)
    {
        // Observability: health checks
        services.AddHealthChecks()
            .AddOrderServiceProbes();

        // Observability: metrics middleware
        services.AddSingleton<OrderServiceMetricsMiddleware>();

        // Resilience: circuit breaker policies
        services.AddOrderServiceCircuitBreakers();

        // Chaos: experiment decorators (when chaos is active)
        services.AddChaosDecorators();

        // Primitives: policy validator
        services.AddHostedService<OpsPolicyValidator>();

        return services;
    }
}

One call: services.AddOpsInfrastructure(). Every operational concern registered. Generated from the same attributes that produce the Kubernetes YAML, the Prometheus alerts, and the Terraform modules.


The Primitive Composition Pattern

The 8 primitives compose. A typical Ops declaration uses 3-5 of them; the certificate check below composes six:

// Composes: OpsTarget + OpsProbe + OpsThreshold + OpsEnvironment + OpsSchedule + OpsRequirementLink
[OpsProbe("cert-expiry-check",
    Target = OpsTarget.Certificate,             // Primitive 1: target
    Kind = ProbeKind.Command,
    IntervalSeconds = 86400)]                    // Primitive 2: probe
[OpsThreshold("cert.days.remaining",
    ThresholdCondition.LessThan, 30,
    Severity = OpsSeverity.Warning,
    Description = "Certificate expires in < 30 days")]
[OpsThreshold("cert.days.remaining",
    ThresholdCondition.LessThan, 7,
    Severity = OpsSeverity.PageNow,
    Description = "Certificate expires in < 7 days")]  // Primitive 3: thresholds
[OpsEnvironment(EnvironmentTier.Staging)]
[OpsEnvironment(EnvironmentTier.Production)]            // Primitive 5: environments
[OpsSchedule("0 8 * * 1",
    Timezone = "UTC",
    Description = "Weekly check on Monday 8 AM")]       // Primitive 6: schedule
[OpsRequirementLink(typeof(SecurityComplianceFeature),
    AcceptanceCriterion = nameof(
        SecurityComplianceFeature.TlsCertificatesNeverExpire))]  // Primitive 8: req link
public sealed class TlsCertificateExpiryCheck { }

Seven attributes. Six primitives composed. One declaration. The Source Generator reads all of them and produces:

  1. A Kubernetes CronJob that runs weekly
  2. A Prometheus alert rule with two severity levels
  3. An entry in ops-manifest.g.json
  4. A traceability link to the SecurityComplianceFeature requirement
  5. An environment filter (staging + production only)

All from one class with attributes. No YAML. No wiki. No manual process. The primitives are the kernel. The sub-DSLs are the vocabulary. The generators are the compilers. The analyzers are the type checkers. The manifest is the output.
