
Ops.Configuration + Ops.Resilience -- Environment Correctness and Recovery

Two DSLs, one post. They are complementary: Configuration ensures the system starts correctly in every environment. Resilience ensures the system recovers when something fails at runtime.


Part A: Ops.Configuration -- Environment Correctness

"It works on staging." -- every developer, minutes before production goes down because PaymentGateway:ApiKey is missing.


The Problem

Configuration management in most systems is a collection of appsettings.json files, environment variables, and secrets scattered across three or four systems. The failure mode is always the same:

  1. A developer adds a new config key: PaymentGateway:WebhookSecret.
  2. They add it to appsettings.Development.json and appsettings.Staging.json.
  3. They forget appsettings.Production.json.
  4. The PR is reviewed. Nobody catches the missing production config because reviewers look at code, not JSON files.
  5. The deployment succeeds. The first webhook from Stripe fails signature validation. Payment processing is broken for 45 minutes.

Or the secret rotation scenario:

  1. The PaymentGateway:ApiKey was rotated in the vault three months ago.
  2. The old key is still cached in the application's environment variable.
  3. The vault has the new key, but the application never reads it because the rotation was manual.
  4. One day the old key expires. The payment service returns 401. The on-call engineer spends 90 minutes discovering that the issue is a stale secret.

What is missing:

  • Environment completeness validation. If you declare that production is a target environment, every required config key must have a production value. The compiler should reject incomplete matrices.
  • Secret rotation enforcement. If a secret has a rotation period of 90 days, the build should flag a missing policy and a scheduled check should warn when the rotation is overdue.
  • Config value validation. A connection string has a format. An API key has a minimum length. A URL must be well-formed. These constraints should be declared next to the key and checked at startup, before the first request arrives.

Attribute Definitions

// =================================================================
// Ops.Configuration.Lib -- Environment Configuration DSL
// =================================================================

/// Declare a configuration key with per-environment values.
/// The generator verifies that every declared environment has a value.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class ConfigTransformAttribute : Attribute
{
    public string Key { get; }
    public string[] Environments { get; init; } = [];
    public string DefaultValue { get; init; } = "";
    public bool Required { get; init; } = true;
    public string Description { get; init; } = "";

    public ConfigTransformAttribute(string key) => Key = key;
}

/// Declare a secret with vault path and rotation policy.
/// The generator verifies rotation schedules and vault path format.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class SecretAttribute : Attribute
{
    public string Name { get; }
    public string Vault { get; init; } = "default";
    public string VaultPath { get; init; } = "";
    public int RotationDays { get; init; } = 90;
    public string[] Environments { get; init; } = [];
    public SecretKind Kind { get; init; } = SecretKind.ApiKey;

    public SecretAttribute(string name) => Name = name;
}

public enum SecretKind
{
    ApiKey,            // third-party API key
    ConnectionString,  // database connection string
    Certificate,       // TLS certificate
    EncryptionKey,     // symmetric encryption key
    OAuthSecret,       // OAuth client secret
    WebhookSecret      // webhook signature secret
}

/// Declare the environments that must be complete.
/// The generator checks that every [ConfigTransform] and [Secret]
/// has a value for every declared environment.
[AttributeUsage(AttributeTargets.Class)]
public sealed class EnvironmentMatrixAttribute : Attribute
{
    public string[] Environments { get; }

    public EnvironmentMatrixAttribute(params string[] environments)
        => Environments = environments;
}

/// Declare a validation rule for a configuration value.
/// The generator emits startup validation code.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class ConfigValidationAttribute : Attribute
{
    public string Key { get; }
    public string Pattern { get; init; } = "";           // regex
    public ConfigValidationKind Kind { get; init; } = ConfigValidationKind.Regex;
    public string ErrorMessage { get; init; } = "";
    public int MinLength { get; init; } = 0;
    public int MaxLength { get; init; } = int.MaxValue;

    public ConfigValidationAttribute(string key) => Key = key;
}

public enum ConfigValidationKind
{
    Regex,             // value must match pattern
    Url,               // value must be a valid URI
    ConnectionString,  // value must parse as a connection string
    Integer,           // value must be an integer
    Boolean,           // value must be true/false
    Enum               // value must be one of the allowed values
}

Usage Example

// -- OrderServiceConfig.cs ------------------------------------------

[EnvironmentMatrix("development", "staging", "production", "dr-region")]
[ConfigTransform("ConnectionStrings:OrderDb",
    Environments = ["development", "staging", "production", "dr-region"],
    Required = true,
    Description = "Primary database connection string")]
[ConfigTransform("PaymentGateway:BaseUrl",
    Environments = ["development", "staging", "production", "dr-region"],
    Required = true,
    Description = "Payment gateway API base URL")]
[ConfigTransform("PaymentGateway:TimeoutSeconds",
    Environments = ["development", "staging", "production", "dr-region"],
    DefaultValue = "30",
    Required = false,
    Description = "HTTP timeout for payment gateway calls")]
[ConfigTransform("FeatureFlags:NewCheckoutEnabled",
    Environments = ["development", "staging"],
    DefaultValue = "false",
    Required = false,
    Description = "Enable new checkout flow (not yet production-ready)")]
[ConfigValidation("ConnectionStrings:OrderDb",
    Kind = ConfigValidationKind.ConnectionString,
    ErrorMessage = "OrderDb connection string is not valid")]
[ConfigValidation("PaymentGateway:BaseUrl",
    Kind = ConfigValidationKind.Url,
    ErrorMessage = "PaymentGateway:BaseUrl must be a valid HTTPS URL")]
[ConfigValidation("PaymentGateway:TimeoutSeconds",
    Kind = ConfigValidationKind.Integer,
    ErrorMessage = "PaymentGateway:TimeoutSeconds must be a positive integer")]
[Secret("PaymentGateway:ApiKey",
    Vault = "azure-keyvault",
    VaultPath = "secrets/payment-gateway-api-key",
    RotationDays = 90,
    Environments = ["staging", "production", "dr-region"],
    Kind = SecretKind.ApiKey)]
[Secret("PaymentGateway:WebhookSecret",
    Vault = "azure-keyvault",
    VaultPath = "secrets/payment-gateway-webhook-secret",
    RotationDays = 180,
    Environments = ["staging", "production", "dr-region"],
    Kind = SecretKind.WebhookSecret)]
[Secret("ConnectionStrings:OrderDb",
    Vault = "azure-keyvault",
    VaultPath = "secrets/order-db-connection-string",
    RotationDays = 365,
    Environments = ["production", "dr-region"],
    Kind = SecretKind.ConnectionString)]
public sealed class OrderServiceConfig { }

One class declares the entire configuration surface of the Order Service: which keys exist, which environments they apply to, how to validate them, and where the secrets live.


Three-tier projection

The same [ConfigTransform] and [Secret] declarations fan out into one set of artifacts per execution tier. Local runs use ASP.NET configuration files; container runs use compose env files plus a secrets overlay; cloud runs use Kubernetes ConfigMaps plus External Secrets Operator references back to the same vault.

Diagram: the same ConfigTransform and Secret declarations fan out to ASP.NET appsettings locally, env files and compose overlays in containers, and ConfigMaps plus ExternalSecret references in the cloud tier.

Local Tier: appsettings.{env}.g.json -- Per-Environment Transforms

// <auto-generated by Ops.Configuration.Generators />
// appsettings.production.g.json

{
  "ConnectionStrings": {
    "OrderDb": "${SECRET:azure-keyvault:secrets/order-db-connection-string}"
  },
  "PaymentGateway": {
    "BaseUrl": "${CONFIG:PaymentGateway:BaseUrl:production}",
    "TimeoutSeconds": "30",
    "ApiKey": "${SECRET:azure-keyvault:secrets/payment-gateway-api-key}",
    "WebhookSecret": "${SECRET:azure-keyvault:secrets/payment-gateway-webhook-secret}"
  }
}

The ${SECRET:...} placeholders are resolved at deployment time by the secret provider. The ${CONFIG:...} placeholders are resolved from the environment-specific value store. The generated file is a template, not a finished config -- it declares what must exist without embedding actual values in source control.
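
To make the placeholder format concrete, here is a minimal sketch of how a deployment-time resolver could recognize a ${SECRET:vault:path} value. The post does not show the resolver, so PlaceholderParser, SecretPlaceholder, and the exact grammar (vault name without colons, path up to the closing brace) are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical types: not part of Ops.Configuration as shown in the post.
public sealed record SecretPlaceholder(string Vault, string Path);

public static class PlaceholderParser
{
    // Assumed grammar: ${SECRET:<vault>:<path>}
    // The vault name may not contain ':' or '}', the path may not contain '}'.
    private static readonly Regex SecretPattern = new(
        @"^\$\{SECRET:(?<vault>[^:}]+):(?<path>[^}]+)\}$",
        RegexOptions.CultureInvariant);

    // Returns true and fills 'placeholder' when the value is a secret reference;
    // returns false for literal values such as "30".
    public static bool TryParseSecret(string value, out SecretPlaceholder? placeholder)
    {
        var match = SecretPattern.Match(value);
        placeholder = match.Success
            ? new SecretPlaceholder(match.Groups["vault"].Value, match.Groups["path"].Value)
            : null;
        return match.Success;
    }
}
```

A resolver built on this would walk the generated template, leave literal values alone, and hand each parsed (vault, path) pair to whatever secret provider the deployment uses.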

Container Tier: .env.{env}.g

For local Docker Compose runs the same [ConfigTransform] projection produces a flat shell-style env file consumed by docker-compose --env-file. Non-secret values are inlined; secret values stay as ${SECRET:...} references that the compose overlay below resolves at container start via a Vault sidecar.

# <auto-generated by Ops.Configuration.Generators />
# .env.staging.g
# Source: [ConfigTransform] attributes on OrderServiceConfig

PaymentGateway__BaseUrl=https://staging.payments.example.com
PaymentGateway__TimeoutSeconds=30
Database__MaxPoolSize=50
FeatureFlags__NewCheckoutEnabled=true

# Secrets resolved by docker-compose.config.yaml overlay
PaymentGateway__ApiKey=${SECRET:azure-keyvault:secrets/payment-gateway-api-key}
PaymentGateway__WebhookSecret=${SECRET:azure-keyvault:secrets/payment-gateway-webhook-secret}
ConnectionStrings__OrderDb=${SECRET:azure-keyvault:secrets/order-db-connection-string}

The __ separator is the ASP.NET Core convention for nested configuration sections: the environment variable provider replaces __ with : and matches keys case-insensitively, so PaymentGateway__BaseUrl binds to the same PaymentGateway:BaseUrl key as the appsettings file. Underscores inside a segment are not stripped, which is why the variables keep the key's own segment names rather than a PAYMENT_GATEWAY__BASE_URL shell style. One source of truth, two file formats.
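
The mapping between the two spellings is mechanical. The sketch below shows it in both directions; EnvVarNames is a hypothetical helper, not something the generator is known to contain, and it mirrors only the __ to : replacement that the ASP.NET Core environment variable provider performs.

```csharp
using System;

// Hypothetical helper for illustration.
public static class EnvVarNames
{
    // "PaymentGateway:BaseUrl" -> "PaymentGateway__BaseUrl"
    public static string FromConfigKey(string key) => key.Replace(":", "__");

    // "PaymentGateway__BaseUrl" -> "PaymentGateway:BaseUrl"
    // Single underscores are left untouched, so they stay part of the segment name.
    public static string ToConfigKey(string envVar) => envVar.Replace("__", ":");

    // Configuration keys compare case-insensitively.
    public static bool SameKey(string a, string b)
        => string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

Because only the double underscore is translated, a generator that wants the env file and the appsettings file to bind to the same key has to derive the variable name from the key itself rather than from a separate naming style.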

Container Tier: docker-compose.config.yaml

A Compose overlay that wires the env file in and mounts vault-fetched secrets via the secrets: block. Pairs with the existing docker-compose.ops.yaml from the Deployment chapter.

# <auto-generated by Ops.Configuration.Generators />
# docker-compose.config.yaml -- Configuration overlay
# Usage: docker compose -f docker-compose.ops.yaml -f docker-compose.config.yaml --env-file .env.staging.g up

services:
  order-api:
    env_file:
      - .env.staging.g
    secrets:
      - payment_gateway_api_key
      - payment_gateway_webhook_secret
      - order_db_connection_string
    environment:
      # Paths to the file-mounted secrets. ASP.NET Core does not read *_FILE
      # variables on its own; an entrypoint script or a configuration provider
      # has to load each file's contents into the matching key.
      PaymentGateway__ApiKey_FILE: /run/secrets/payment_gateway_api_key
      PaymentGateway__WebhookSecret_FILE: /run/secrets/payment_gateway_webhook_secret
      ConnectionStrings__OrderDb_FILE: /run/secrets/order_db_connection_string

secrets:
  payment_gateway_api_key:
    # Resolved at compose-up time by a Vault sidecar; NOT committed to git
    external: true
    name: payment-gateway-api-key
  payment_gateway_webhook_secret:
    external: true
    name: payment-gateway-webhook-secret
  order_db_connection_string:
    external: true
    name: order-db-connection-string

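The external: true flag tells Compose the secret is provisioned out-of-band (typically by a vault-agent sidecar that pulls from Azure Key Vault using the developer's local credentials). Secrets never enter the working tree. External secrets depend on the runtime: Swarm mode supports them, while a plain docker compose up may reject them, in which case a file: source pointing at a git-ignored path written by the vault agent is the usual substitute.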

Cloud Tier: k8s/configmap.yaml

For Kubernetes the same [ConfigTransform] non-secret values become a v1 ConfigMap mounted into the Order API pods. The keys match the ${CONFIG:...} placeholders from the Local-tier appsettings file -- single source of truth, three projections.

# <auto-generated by Ops.Configuration.Generators />
# k8s/configmap.yaml -- Non-secret configuration for OrderService
# Source: [ConfigTransform] attributes (production environment)

apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
  namespace: orders
  labels:
    app.kubernetes.io/name: order-service
    app.kubernetes.io/managed-by: ops.configuration.generators
    ops.configuration/environment: production
data:
  PaymentGateway__BaseUrl: "https://api.payments.example.com"
  PaymentGateway__TimeoutSeconds: "30"
  Database__MaxPoolSize: "200"
  # FeatureFlags:NewCheckoutEnabled is not emitted: production is not in its Environments list
  # Secret keys live in the ExternalSecret manifest, not here

A volumeMount or envFrom: configMapRef in the Deployment manifest (Part 5) consumes this. The label ops.configuration/environment lets the analyzer correlate which [EnvironmentMatrix] value drove the emission.

Cloud Tier: k8s/external-secret.yaml

For secrets the generator emits an ExternalSecret (External Secrets Operator CRD) rather than a raw v1 Secret. The chapter's narrative is "secrets live in Azure Key Vault, never in git" -- writing the resolved secret material into a YAML file would violate that. Instead the CRD declares a binding: the operator, running in the cluster, pulls the value from Azure Key Vault and materializes a regular v1 Secret there.

# <auto-generated by Ops.Configuration.Generators />
# k8s/external-secret.yaml -- Vault-backed secrets for OrderService
# Source: [Secret] attributes on OrderServiceConfig

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: order-service-secrets
  namespace: orders
  labels:
    app.kubernetes.io/name: order-service
    app.kubernetes.io/managed-by: ops.configuration.generators
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault-store
    kind: ClusterSecretStore
  target:
    name: order-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: ConnectionStrings__OrderDb
      remoteRef:
        key: secrets/order-db-connection-string
        # rotationPolicy mirrors [Secret(RotationDays = 365)]
    - secretKey: PaymentGateway__ApiKey
      remoteRef:
        key: secrets/payment-gateway-api-key
        # rotationPolicy mirrors [Secret(RotationDays = 90)]
    - secretKey: PaymentGateway__WebhookSecret
      remoteRef:
        key: secrets/payment-gateway-webhook-secret
        # rotationPolicy mirrors [Secret(RotationDays = 180)]

The remoteRef.key values are the exact VaultPath values declared in the [Secret] attributes on OrderServiceConfig. Rename the path in C# and the next build emits a different ExternalSecret -- the audit trail follows the typed declaration. The same RotationDays values that drive SecretRotationSchedule.g.cs (below) also annotate the ExternalSecret refresh policy when External Secrets Operator v0.10+ is in use.

ConfigValidation.g.cs -- Startup Validation

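// <auto-generated by Ops.Configuration.Generators />
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Configuration;

namespace Ops.Configuration.Generated;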

public static class ConfigValidation
{
    public static void ValidateConfiguration(IConfiguration configuration)
    {
        var errors = new List<string>();

        // Required keys
        ValidateRequired(configuration, "ConnectionStrings:OrderDb", errors);
        ValidateRequired(configuration, "PaymentGateway:BaseUrl", errors);
        ValidateRequired(configuration, "PaymentGateway:ApiKey", errors);
        ValidateRequired(configuration, "PaymentGateway:WebhookSecret", errors);

        // Format validations
        ValidateConnectionString(configuration, "ConnectionStrings:OrderDb",
            "OrderDb connection string is not valid", errors);
        ValidateUrl(configuration, "PaymentGateway:BaseUrl",
            "PaymentGateway:BaseUrl must be a valid HTTPS URL", errors);
        ValidateInteger(configuration, "PaymentGateway:TimeoutSeconds",
            "PaymentGateway:TimeoutSeconds must be a positive integer", errors);

        if (errors.Count > 0)
            throw new ConfigurationValidationException(
                $"Configuration validation failed with {errors.Count} error(s):\n" +
                string.Join("\n", errors.Select((e, i) => $"  {i + 1}. {e}")));
    }

    private static void ValidateRequired(
        IConfiguration config, string key, List<string> errors)
    {
        if (string.IsNullOrWhiteSpace(config[key]))
            errors.Add($"Required configuration key '{key}' is missing or empty");
    }

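    private static void ValidateConnectionString(
        IConfiguration config, string key, string message, List<string> errors)
    {
        var value = config[key];
        if (value is null) return; // handled by ValidateRequired

        // DbConnectionStringBuilder throws ArgumentException on malformed input
        try { _ = new System.Data.Common.DbConnectionStringBuilder { ConnectionString = value }; }
        catch (ArgumentException) { errors.Add(message); }
    }

    private static void ValidateUrl(
        IConfiguration config, string key, string message, List<string> errors)
    {
        var value = config[key];
        if (value is null) return;
        // Kind = Url accepts both http and https. The caller-supplied message
        // is reported as-is, so an HTTPS-only rule needs a Regex validation.
        if (!Uri.TryCreate(value, UriKind.Absolute, out var uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            errors.Add(message);
    }

    private static void ValidateInteger(
        IConfiguration config, string key, string message, List<string> errors)
    {
        var value = config[key];
        if (value is null) return; // optional key without a value
        // Kind = Integer checks the format only; zero and negative values pass.
        if (!int.TryParse(value, System.Globalization.NumberStyles.Integer,
                System.Globalization.CultureInfo.InvariantCulture, out _))
            errors.Add(message);
    }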
}

This validation runs at application startup. It catches the "missing production config" scenario before the first request arrives, not 45 minutes later when a webhook fails.

SecretRotationSchedule.g.cs -- Rotation Enforcement

// <auto-generated by Ops.Configuration.Generators />
namespace Ops.Configuration.Generated;

public static class SecretRotationSchedule
{
    public static readonly IReadOnlyList<SecretRotationEntry> Entries =
    [
        new("PaymentGateway:ApiKey",
            Vault: "azure-keyvault",
            Path: "secrets/payment-gateway-api-key",
            RotationDays: 90,
            Kind: SecretKind.ApiKey,
            Environments: ["staging", "production", "dr-region"]),

        new("PaymentGateway:WebhookSecret",
            Vault: "azure-keyvault",
            Path: "secrets/payment-gateway-webhook-secret",
            RotationDays: 180,
            Kind: SecretKind.WebhookSecret,
            Environments: ["staging", "production", "dr-region"]),

        new("ConnectionStrings:OrderDb",
            Vault: "azure-keyvault",
            Path: "secrets/order-db-connection-string",
            RotationDays: 365,
            Kind: SecretKind.ConnectionString,
            Environments: ["production", "dr-region"]),
    ];

    /// Check for overdue rotations. Call from a scheduled job or health check.
    public static IReadOnlyList<RotationWarning> CheckRotations(
        ISecretMetadataProvider metadataProvider)
    {
        var warnings = new List<RotationWarning>();
        foreach (var entry in Entries)
        {
            // RotationDays <= 0 means rotation is disabled (reported by OPS012);
            // without this guard such a secret would always count as overdue.
            if (entry.RotationDays <= 0) continue;

            var lastRotated = metadataProvider.GetLastRotatedDate(
                entry.Vault, entry.Path);
            var daysSinceRotation = (DateTimeOffset.UtcNow - lastRotated).TotalDays;

            if (daysSinceRotation > entry.RotationDays)
                warnings.Add(new(entry.Name, entry.RotationDays,
                    (int)daysSinceRotation, RotationStatus.Overdue));
            // Warn once 80% of the rotation period has elapsed
            else if (daysSinceRotation > entry.RotationDays * 0.8)
                warnings.Add(new(entry.Name, entry.RotationDays,
                    (int)daysSinceRotation, RotationStatus.DueSoon));
        }
        return warnings;
    }
}

public sealed record SecretRotationEntry(
    string Name, string Vault, string Path,
    int RotationDays, SecretKind Kind, IReadOnlyList<string> Environments);

public sealed record RotationWarning(
    string SecretName, int RotationPolicyDays,
    int DaysSinceRotation, RotationStatus Status);

public enum RotationStatus { Current, DueSoon, Overdue }
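
CheckRotations depends on ISecretMetadataProvider, which the listing above does not define. The following is a sketch of one shape that would satisfy the call site, plus an in-memory implementation for unit tests. The interface members are inferred from the single call metadataProvider.GetLastRotatedDate(vault, path); InMemorySecretMetadataProvider and its "never rotated" default are assumptions, and a real provider would query the vault's metadata instead.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape, inferred from the call site in CheckRotations.
public interface ISecretMetadataProvider
{
    DateTimeOffset GetLastRotatedDate(string vault, string path);
}

// Hypothetical test double: keeps rotation dates in a dictionary.
public sealed class InMemorySecretMetadataProvider : ISecretMetadataProvider
{
    private readonly Dictionary<(string Vault, string Path), DateTimeOffset> _lastRotated = new();

    public void SetLastRotated(string vault, string path, DateTimeOffset when)
        => _lastRotated[(vault, path)] = when;

    // Unknown secrets are treated as never rotated, so they surface as overdue
    // instead of silently passing the check.
    public DateTimeOffset GetLastRotatedDate(string vault, string path)
        => _lastRotated.TryGetValue((vault, path), out var when)
            ? when
            : DateTimeOffset.MinValue;
}
```

In a test, seeding the provider with DateTimeOffset.UtcNow.AddDays(-100) for the API key path would be one way to exercise the overdue branch for a 90-day policy.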

OPS011: Missing Config for Declared Environment

[EnvironmentMatrix("development", "staging", "production")]
[ConfigTransform("NewFeature:Endpoint",
    Environments = ["development", "staging"])]
    // production is missing
public sealed class OrderServiceConfig { }

// error OPS011: ConfigTransform 'NewFeature:Endpoint' is marked Required
//   but has no value for environment 'production'. The EnvironmentMatrix
//   declares ["development", "staging", "production"]. Add 'production'
//   to the Environments array or set Required = false.

This is the core value proposition. The environment matrix is the source of truth. If you declare three environments, every required config must cover all three. The compiler rejects incomplete matrices.
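
The rule behind OPS011 is a set difference: matrix environments minus the environments a required key covers. A minimal sketch follows, assuming the declarations have already been extracted into plain records. TransformDeclaration and EnvironmentMatrixCheck are hypothetical names, and the real analyzer would work on Roslyn symbols rather than on these types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of one [ConfigTransform] declaration.
public sealed record TransformDeclaration(
    string Key, bool Required, IReadOnlyList<string> Environments);

public static class EnvironmentMatrixCheck
{
    // Returns one (Key, Environment) pair per gap in the matrix.
    // Keys with Required = false are skipped, matching the OPS011 message.
    public static IReadOnlyList<(string Key, string Environment)> FindMissing(
        IReadOnlyList<string> matrix,
        IEnumerable<TransformDeclaration> transforms)
        => transforms
            .Where(t => t.Required)
            .SelectMany(t => matrix
                .Except(t.Environments, StringComparer.OrdinalIgnoreCase)
                .Select(env => (Key: t.Key, Environment: env)))
            .ToList();
}
```

For the example above, the matrix is development, staging, production and the key covers only the first two, so the check yields the single pair (NewFeature:Endpoint, production).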

OPS012: Secret Without Rotation Policy

[Secret("ExternalApi:Token",
    Vault = "azure-keyvault",
    VaultPath = "secrets/external-api-token",
    RotationDays = 0)]   // <-- rotation disabled
public sealed class OrderServiceConfig { }

// warning OPS012: Secret 'ExternalApi:Token' has RotationDays = 0
//   (rotation disabled). Secrets should have a rotation policy.
//   Set RotationDays to a positive value or suppress this warning
//   if the secret is intentionally long-lived.

OPS013: Transform Without Validation Pattern

[ConfigTransform("Database:MaxPoolSize",
    Environments = ["development", "staging", "production"],
    Required = true)]
// No [ConfigValidation] for this key
public sealed class OrderServiceConfig { }

// warning OPS013: ConfigTransform 'Database:MaxPoolSize' has no
//   [ConfigValidation]. Consider adding validation to catch
//   misconfiguration at startup rather than at runtime.

Part B: Ops.Resilience -- Recovery and Fault Tolerance

"The rollback steps are in the wiki. The circuit breaker config is in a Slack thread from November."


The Problem

Resilience patterns are everywhere in modern distributed systems: circuit breakers, retries, bulkheads, canary deployments, blue-green switches, rollback plans. The configuration for these patterns lives in:

  • A Polly policy buried in Startup.cs, configured with magic numbers that nobody remembers choosing
  • An Argo Rollout YAML that a platform engineer wrote and nobody else understands
  • A wiki page titled "Rollback Procedure" with 12 steps, 4 of which reference services that were renamed
  • A Slack thread where someone decided the circuit breaker threshold should be 5 failures in 30 seconds

The Resilience DSL makes all of these declarative. Attributes on classes. Source generators produce Polly registrations, Argo Rollout configs, and Terraform failover resources. Analyzers verify that canary strategies reference real metrics and circuit breakers have fallback methods.


Attribute Definitions

// =================================================================
// Ops.Resilience.Lib -- Fault Tolerance & Recovery DSL
// =================================================================

/// Declare a rollback plan -- the reverse-ordered steps to undo a deployment.
/// The generator validates step names and ordering.
[AttributeUsage(AttributeTargets.Class)]
public sealed class RollbackPlanAttribute : Attribute
{
    public string[] Steps { get; }
    public string AutomaticThreshold { get; init; } = "";
    public string Timeout { get; init; } = "00:30:00";
    public bool RequiresApproval { get; init; } = false;

    public RollbackPlanAttribute(params string[] steps) => Steps = steps;
}

/// Declare a canary deployment strategy with gradual traffic shifting.
/// The generator emits Argo Rollout config (Container) and
/// weighted routing (Cloud).
[AttributeUsage(AttributeTargets.Class)]
public sealed class CanaryStrategyAttribute : Attribute
{
    public int InitialPercentage { get; init; } = 5;
    public int StepPercentage { get; init; } = 15;
    public string StepDuration { get; init; } = "00:05:00";
    public string MetricThreshold { get; init; } = "";
    public string SuccessMetric { get; init; } = "";
    public int MaxPercentage { get; init; } = 100;
    public bool AutoPromote { get; init; } = true;

    public CanaryStrategyAttribute() { }
}

/// Declare a circuit breaker for a downstream service call.
/// The generator produces a Polly v8 circuit breaker strategy registration.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method, AllowMultiple = true)]
public sealed class CircuitBreakerAttribute : Attribute
{
    public string ServiceName { get; }
    public int FailureThreshold { get; init; } = 5;
    public string SamplingDuration { get; init; } = "00:00:30";
    public double FailureRatio { get; init; } = 0.5;
    public string RecoveryTimeout { get; init; } = "00:00:30";
    public string FallbackMethod { get; init; } = "";
    public Type? FallbackType { get; init; }

    public CircuitBreakerAttribute(string serviceName) => ServiceName = serviceName;
}

/// Declare a blue-green deployment switch.
/// The generator produces health-verified traffic switching logic.
[AttributeUsage(AttributeTargets.Class)]
public sealed class BlueGreenSwitchAttribute : Attribute
{
    public string HealthCheckEndpoint { get; init; } = "/health/ready";
    public string SwitchTimeout { get; init; } = "00:05:00";
    public int HealthCheckRetries { get; init; } = 5;
    public string HealthCheckInterval { get; init; } = "00:00:10";
    public bool AutoSwitch { get; init; } = false;

    public BlueGreenSwitchAttribute() { }
}

/// Declare a retry policy for a downstream call.
/// The generator produces a Polly v8 retry strategy registration.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method, AllowMultiple = true)]
public sealed class RetryPolicyAttribute : Attribute
{
    public string ServiceName { get; }
    public int MaxRetries { get; init; } = 3;
    public BackoffStrategy Backoff { get; init; } = BackoffStrategy.ExponentialWithJitter;
    public double BackoffBaseSeconds { get; init; } = 1.0;
    public double MaxDelaySeconds { get; init; } = 30.0;
    public string[] RetryOn { get; init; } = [];           // exception types
    public int[] RetryOnStatusCodes { get; init; } = [];   // HTTP status codes

    public RetryPolicyAttribute(string serviceName) => ServiceName = serviceName;
}

public enum BackoffStrategy
{
    Constant,                // fixed delay between retries
    Linear,                  // delay increases linearly
    Exponential,             // delay doubles each retry
    ExponentialWithJitter    // exponential + random jitter (recommended)
}

/// Declare a bulkhead (concurrency limiter) for a downstream call.
/// Prevents one failing service from exhausting the thread pool.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method, AllowMultiple = true)]
public sealed class BulkheadAttribute : Attribute
{
    public string ServiceName { get; }
    public int MaxConcurrency { get; init; } = 10;
    public int QueueSize { get; init; } = 20;
    public string Timeout { get; init; } = "00:00:30";

    public BulkheadAttribute(string serviceName) => ServiceName = serviceName;
}
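
To illustrate what the BackoffStrategy values mean in terms of actual delays, here is a small sketch that computes the wait before each retry from BackoffBaseSeconds and MaxDelaySeconds. BackoffKind and BackoffDelays are hypothetical names used to keep the example self-contained. Jitter is deliberately left out, and the formulas are the textbook ones, not necessarily the exact arithmetic Polly applies.

```csharp
using System;

// Hypothetical mirror of the non-jittered BackoffStrategy values.
public enum BackoffKind { Constant, Linear, Exponential }

public static class BackoffDelays
{
    // 'attempt' is zero-based: 0 is the wait before the first retry.
    public static TimeSpan Compute(
        BackoffKind kind, int attempt, double baseSeconds, double maxDelaySeconds)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        double seconds = kind switch
        {
            BackoffKind.Constant => baseSeconds,                        // 1s, 1s, 1s ...
            BackoffKind.Linear => baseSeconds * (attempt + 1),           // 1s, 2s, 3s ...
            BackoffKind.Exponential => baseSeconds * Math.Pow(2, attempt), // 1s, 2s, 4s ...
            _ => throw new ArgumentOutOfRangeException(nameof(kind)),
        };

        // MaxDelaySeconds caps the result so long retry chains stay bounded.
        return TimeSpan.FromSeconds(Math.Min(seconds, maxDelaySeconds));
    }
}
```

With BackoffBaseSeconds = 0.5 and MaxDelaySeconds = 10.0, exponential backoff waits 0.5 s, 1 s, and 2 s before the three retries; the cap only matters for longer chains.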

Usage Example

Complete resilience configuration for the Order Service: circuit breakers on downstream services, retry policies, a bulkhead, a canary strategy, and a rollback plan.

// -- OrderServiceResilience.cs --------------------------------------

[RollbackPlan(
    "StopOrderWorker",
    "RevertOrderApiToV23",
    "RollbackMigration47",
    "RestartPaymentHealthCheck",
    AutomaticThreshold = "error_rate > 5% for 3m",
    Timeout = "00:15:00",
    RequiresApproval = false)]
[CanaryStrategy(
    InitialPercentage = 5,
    StepPercentage = 15,
    StepDuration = "00:05:00",
    SuccessMetric = "order_api_request_duration_seconds",
    MetricThreshold = "histogram_quantile(0.95, rate(order_api_request_duration_seconds_bucket[5m])) < 0.5",
    AutoPromote = true)]
[BlueGreenSwitch(
    HealthCheckEndpoint = "/health/ready",
    SwitchTimeout = "00:05:00",
    HealthCheckRetries = 5,
    HealthCheckInterval = "00:00:10",
    AutoSwitch = false)]
public sealed class OrderServiceResilience
{
    [CircuitBreaker("payment-service",
        FailureThreshold = 5,
        SamplingDuration = "00:00:30",
        FailureRatio = 0.5,
        RecoveryTimeout = "00:01:00",
        FallbackMethod = "GetCachedPaymentStatus",
        FallbackType = typeof(PaymentFallbacks))]
    [RetryPolicy("payment-service",
        MaxRetries = 3,
        Backoff = BackoffStrategy.ExponentialWithJitter,
        BackoffBaseSeconds = 0.5,
        MaxDelaySeconds = 10.0,
        RetryOnStatusCodes = [408, 429, 500, 502, 503, 504])]
    [Bulkhead("payment-service",
        MaxConcurrency = 25,
        QueueSize = 50,
        Timeout = "00:00:15")]
    public void PaymentServicePolicy() { }

    [CircuitBreaker("inventory-service",
        FailureThreshold = 10,
        SamplingDuration = "00:01:00",
        FailureRatio = 0.3,
        RecoveryTimeout = "00:02:00",
        FallbackMethod = "GetEstimatedInventory",
        FallbackType = typeof(InventoryFallbacks))]
    [RetryPolicy("inventory-service",
        MaxRetries = 2,
        Backoff = BackoffStrategy.Exponential,
        BackoffBaseSeconds = 1.0,
        RetryOnStatusCodes = [500, 502, 503])]
    public void InventoryServicePolicy() { }

    [CircuitBreaker("notification-service",
        FailureThreshold = 20,
        SamplingDuration = "00:02:00",
        FailureRatio = 0.5,
        RecoveryTimeout = "00:05:00")]
        // No fallback -- notifications are fire-and-forget
    [RetryPolicy("notification-service",
        MaxRetries = 1,
        Backoff = BackoffStrategy.Constant,
        BackoffBaseSeconds = 2.0)]
    public void NotificationServicePolicy() { }
}

public static class PaymentFallbacks
{
    public static PaymentStatus GetCachedPaymentStatus(string orderId)
        => PaymentCache.GetLastKnown(orderId);
}

public static class InventoryFallbacks
{
    public static int GetEstimatedInventory(string sku)
        => InventoryCache.GetEstimate(sku);
}

Three downstream services, each with its own circuit breaker, retry policy, and (where appropriate) bulkhead. A canary strategy for gradual rollout. A rollback plan with an automatic trigger. All typed. All validated.
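
The canary attribute is easiest to read as a schedule of traffic weights. The sketch below expands InitialPercentage, StepPercentage, and MaxPercentage into that schedule. CanarySchedule is a hypothetical helper, and clamping the final step to MaxPercentage is an assumption about how the generator treats a step size that does not divide the range evenly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration.
public static class CanarySchedule
{
    // Expands the attribute values into the list of traffic percentages,
    // one entry per StepDuration interval.
    public static IReadOnlyList<int> TrafficSteps(int initial, int step, int max)
    {
        if (initial <= 0 || step <= 0 || max < initial || max > 100)
            throw new ArgumentException("Percentages must satisfy 0 < initial <= max <= 100 and step > 0.");

        var steps = new List<int>();
        for (var pct = initial; pct < max; pct += step)
            steps.Add(pct);

        steps.Add(max); // the final step always lands exactly on max
        return steps;
    }
}
```

For the Order Service values (5, 15, 100) this gives 5, 20, 35, 50, 65, 80, 95, 100: eight steps, so roughly 40 minutes to full traffic at a five-minute StepDuration if every metric check passes.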


Three-tier projection

[CircuitBreaker], [CanaryStrategy], and [RetryPolicy] declarations project across all three tiers. Local runs get Polly v8 pipelines and registered fallbacks; container runs get a Traefik-weighted compose overlay with the same SLI rules a local Prometheus can evaluate; cloud runs get Argo Rollout plus AnalysisTemplate driving the production canary.

Diagram: the resilience attributes project into Polly v8 pipelines locally, Traefik weighted compose plus Prometheus canary rules in containers, and Argo Rollout with AnalysisTemplate plus Route53 failover in the cloud tier.

Local Tier: PollyPolicies.g.cs

The source generator reads [CircuitBreaker], [RetryPolicy], and [Bulkhead] attributes and emits Polly v8 pipeline registrations:

// <auto-generated by Ops.Resilience.Generators />
namespace Ops.Resilience.Generated;

public static class PollyPolicies
{
    public static IServiceCollection AddGeneratedResiliencePolicies(
        this IServiceCollection services)
    {
        // ---- payment-service ----
        services.AddResiliencePipeline<string, HttpResponseMessage>("payment-service", builder =>
        {
            builder
                // [Bulkhead]: Polly v8 has no bulkhead strategy; the concurrency
                // limiter from Polly.RateLimiting takes its place
                // (permit limit 25, queue limit 50).
                .AddConcurrencyLimiter(25, 50)
                .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
                {
                    FailureRatio = 0.5,
                    SamplingDuration = TimeSpan.FromSeconds(30),
                    // From FailureThreshold. Polly reads this as the minimum number
                    // of calls in the sampling window, not as a failure count.
                    MinimumThroughput = 5,
                    BreakDuration = TimeSpan.FromMinutes(1),
                })
                .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
                {
                    MaxRetryAttempts = 3,
                    BackoffType = DelayBackoffType.Exponential,
                    Delay = TimeSpan.FromMilliseconds(500),
                    MaxDelay = TimeSpan.FromSeconds(10),
                    UseJitter = true,
                    ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                        .HandleResult(r =>
                            new[] { 408, 429, 500, 502, 503, 504 }
                                .Contains((int)r.StatusCode)),
                })
                // [Bulkhead] Timeout, applied innermost, so it bounds each attempt
                .AddTimeout(TimeSpan.FromSeconds(15));
        });

        // ---- inventory-service ----
        // Typed pipeline: the retry predicate inspects HttpResponseMessage results.
        services.AddResiliencePipeline<string, HttpResponseMessage>("inventory-service", builder =>
        {
            builder
                .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
                {
                    FailureRatio = 0.3,
                    SamplingDuration = TimeSpan.FromMinutes(1),
                    MinimumThroughput = 10,
                    BreakDuration = TimeSpan.FromMinutes(2),
                })
                .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
                {
                    MaxRetryAttempts = 2,
                    BackoffType = DelayBackoffType.Exponential,
                    Delay = TimeSpan.FromSeconds(1),
                    // Retry only on server-side failures.
                    ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                        .HandleResult(r =>
                            new[] { 500, 502, 503 }
                                .Contains((int)r.StatusCode)),
                });
        });

        // ---- notification-service ----
        services.AddResiliencePipeline("notification-service", builder =>
        {
            builder
                .AddCircuitBreaker(new CircuitBreakerStrategyOptions
                {
                    FailureRatio = 0.5,
                    SamplingDuration = TimeSpan.FromMinutes(2),
                    MinimumThroughput = 20,
                    BreakDuration = TimeSpan.FromMinutes(5),
                })
                .AddRetry(new RetryStrategyOptions
                {
                    MaxRetryAttempts = 1,
                    BackoffType = DelayBackoffType.Constant,
                    Delay = TimeSpan.FromSeconds(2),
                });
        });

        return services;
    }
}
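To make the numbers in these options concrete, here is a minimal sketch of the two decisions they drive. It is a simplified reading, not Polly's implementation: the class and method names are hypothetical, the breaker check covers a single sampling window, and the backoff ignores jitter.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Polly or Ops.Resilience.
public static class ResilienceMath
{
    // Simplified breaker decision for one sampling window: open once enough
    // calls were observed and the share of failures reaches the ratio.
    public static bool ShouldOpen(
        int failures, int total, double failureRatio, int minimumThroughput)
        => total >= minimumThroughput
           && (double)failures / total >= failureRatio;

    // Exponential backoff without jitter: baseDelay * 2^attempt, capped at
    // maxDelay. The attempt index is zero-based (0 = first retry).
    public static TimeSpan BackoffDelay(TimeSpan baseDelay, int attempt, TimeSpan maxDelay)
    {
        var delay = TimeSpan.FromMilliseconds(
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
        return delay < maxDelay ? delay : maxDelay;
    }
}
```

With the payment-service values (ratio 0.5, minimum throughput 5, base delay 500 ms, cap 10 s), three failures out of five calls open the breaker, and the three retries wait roughly 500 ms, 1 s, and 2 s before jitter is applied.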

InProcess Tier: FallbackRegistration.g.cs

// <auto-generated by Ops.Resilience.Generators />
namespace Ops.Resilience.Generated;

public static class FallbackRegistration
{
    public static IServiceCollection AddGeneratedFallbacks(
        this IServiceCollection services)
    {
        // payment-service fallback
        services.AddKeyedSingleton<Func<string, PaymentStatus>>(
            "payment-service:fallback",
            (_, _) => PaymentFallbacks.GetCachedPaymentStatus);

        // inventory-service fallback
        services.AddKeyedSingleton<Func<string, int>>(
            "inventory-service:fallback",
            (_, _) => InventoryFallbacks.GetEstimatedInventory);

        // notification-service: no fallback registered (fire-and-forget)

        return services;
    }
}

Container Tier: docker-compose.canary.yaml

For local Docker Compose runs, the same [CanaryStrategy] declaration projects to a stable and a canary replica of the order service plus a Traefik instance doing weighted routing. Developers can reproduce the canary behavior on a laptop without an Argo Rollouts controller.

# <auto-generated by Ops.Resilience.Generators />
# docker-compose.canary.yaml -- Canary overlay for OrderServiceResilience
# Usage: docker compose -f docker-compose.ops.yaml -f docker-compose.canary.yaml up

services:
  order-api-stable:
    image: order-api:${STABLE_VERSION:-latest}
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.order-stable.rule=Host(`order-api.local`)"
      - "traefik.http.services.order-stable.loadbalancer.server.port=8080"
      - "traefik.http.services.order-stable.weight=95"   # 95% baseline
      - "ops.resilience/role=stable"

  order-api-canary:
    image: order-api:${CANARY_VERSION:-latest}
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.order-canary.rule=Host(`order-api.local`)"
      - "traefik.http.services.order-canary.loadbalancer.server.port=8080"
      - "traefik.http.services.order-canary.weight=5"    # 5% canary -- mirrors first Argo step
      - "ops.resilience/role=canary"

  traefik:
    image: traefik:v3.0
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.metrics.address=:8082"   # entry point referenced by the Prometheus metrics below
      - "--metrics.prometheus=true"
      - "--metrics.prometheus.entryPoint=metrics"
    ports:
      - "80:80"
      - "8082:8082"   # Prometheus scrape endpoint
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

The weight values match the first step (setWeight: 5) of the Argo Rollout shown below. To advance the canary locally, edit the weights and rerun docker compose up -d. The same SLI rules drive the decision in both tiers.

Container Tier: prometheus-canary-rules.yaml

This file packages the same Prometheus query that the Cloud-tier AnalysisTemplate evaluates as a recording rule plus an alert rule that a local Prometheus can ingest. Developers running the canary overlay get the same go/no-go signal that the production controller uses.

# <auto-generated by Ops.Resilience.Generators />
# prometheus-canary-rules.yaml -- Local SLI rules mirroring AnalysisTemplate

groups:
  - name: order-service-canary-local
    interval: 60s
    rules:
      # Recording rule: latency p95 of canary instance only
      - record: order_service:canary_latency_p95:5m
        expr: >-
          histogram_quantile(0.95,
            rate(order_api_request_duration_seconds_bucket{ops_resilience_role="canary"}[5m])
          )

      # Alert: canary fails the same successCondition (< 0.5s) the AnalysisTemplate uses
      - alert: CanaryLatencyRegression
        expr: order_service:canary_latency_p95:5m >= 0.5
        for: 5m
        labels:
          severity: critical
          ops_resilience_strategy: canary
        annotations:
          summary: "Local canary p95 latency above 500ms"
          description: >-
            The canary replica's p95 latency has exceeded 500ms for 5 minutes.
            This is the same threshold the production AnalysisTemplate enforces
            (failureLimit: 2). Roll back the canary before promoting.
          mirrors: "k8s/analysistemplate.yaml -- order-service-canary-analysis"

The mirrors: annotation is a deliberate cross-link: the alert rule explicitly names the Cloud-tier file it shadows. If a future change tightens successCondition in the AnalysisTemplate, the same change must land in this file -- and an analyzer (OPS017, future) can enforce the parity.
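Since OPS017 is described as future work, the following is only a sketch of what such a parity check might compare. It assumes both expressions end in a simple comparison against a numeric literal, as the two files above do; the CanaryParity class and its method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical parity check; the real OPS017 analyzer does not exist yet.
public static class CanaryParity
{
    // Matches a trailing comparison such as "< 0.5" or ">= 0.5".
    private static readonly Regex Comparison =
        new(@"(<=|>=|<|>)\s*([0-9]+(?:\.[0-9]+)?)\s*$", RegexOptions.Compiled);

    private static (string Op, double Threshold) Parse(string expression)
    {
        var match = Comparison.Match(expression);
        if (!match.Success)
            throw new FormatException($"No trailing comparison in '{expression}'.");
        return (match.Groups[1].Value,
                double.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture));
    }

    // True when the alert fires exactly where the success condition fails:
    // same threshold, logically negated operator.
    public static bool IsMirrored(string successCondition, string alertExpr)
    {
        var success = Parse(successCondition);
        var alert = Parse(alertExpr);
        var negated = success.Op switch
        {
            "<" => ">=",
            "<=" => ">",
            ">" => "<=",
            ">=" => "<",
            _ => throw new FormatException(success.Op),
        };
        return alert.Op == negated && alert.Threshold == success.Threshold;
    }
}
```

Calling IsMirrored with "result[0] < 0.5" and "order_service:canary_latency_p95:5m >= 0.5" reports parity; tightening only one side to 0.4 makes it report drift.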

Cloud Tier: Argo Rollout + AnalysisTemplate

# <auto-generated by Ops.Resilience.Generators />
# Canary strategy for OrderServiceResilience

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service-canary
  labels:
    ops.resilience/strategy: canary
spec:
  replicas: 3
  strategy:
    canary:
      canaryService: order-service-canary-svc
      stableService: order-service-stable-svc
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: order-service-canary-analysis
            # The template's query references args.canary-hash, so pass it in.
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: order-service-canary-analysis
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 35
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 65
        - pause: { duration: 5m }
        - setWeight: 80
        - pause: { duration: 5m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: order-service-canary-analysis
        startingStep: 2
        args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: order-service-canary-analysis
spec:
  # Declared here so {{args.canary-hash}} in the query can be resolved.
  args:
    - name: canary-hash
  metrics:
    - name: latency-p95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: >-
            histogram_quantile(0.95,
              rate(order_api_request_duration_seconds_bucket{rollouts_pod_template_hash="{{args.canary-hash}}"}[5m])
            )
      successCondition: result[0] < 0.5
      interval: 60s
      count: 5
      failureLimit: 2

Cloud Tier: Multi-Region Failover Terraform

# <auto-generated by Ops.Resilience.Generators />
# Multi-region failover for Order Service

resource "aws_route53_health_check" "order_service_primary" {
  fqdn              = "order-api.primary.internal"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health/ready"
  failure_threshold = 5
  request_interval  = 10
}

resource "aws_route53_health_check" "order_service_dr" {
  fqdn              = "order-api.dr-region.internal"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health/ready"
  failure_threshold = 5
  request_interval  = 10
}

resource "aws_route53_record" "order_service_failover_primary" {
  zone_id = var.zone_id
  name    = "order-api.internal"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.order_service_primary.id
  set_identifier  = "primary"

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "order_service_failover_dr" {
  zone_id = var.zone_id
  name    = "order-api.internal"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  health_check_id = aws_route53_health_check.order_service_dr.id
  set_identifier  = "dr-region"

  alias {
    name                   = var.dr_alb_dns
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }
}

OPS014: Canary Without Metric Reference

[CanaryStrategy(
    InitialPercentage = 10,
    StepPercentage = 20,
    StepDuration = "00:05:00")]
    // No SuccessMetric or MetricThreshold specified
public sealed class InventoryServiceResilience { }

// error OPS014: CanaryStrategy on 'InventoryServiceResilience' has no
//   SuccessMetric. A canary deployment without a metric gate is a
//   blind rollout -- there is no way to detect regressions.
//   Set SuccessMetric to a [Metric] name from the Observability DSL.

A canary without a metric is just a slow rollout. The analyzer requires at least a SuccessMetric so the canary can automatically roll back if the metric degrades. This is a cross-DSL diagnostic: the metric must exist in the Observability DSL.
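The rule can be approximated at runtime with reflection, which may help when reasoning about what the analyzer checks. This is a sketch under stated assumptions: the attribute below is a stand-in that carries only the property names mentioned in this post, MetricThreshold is assumed to be a string, the metric name and threshold text are illustrative, and the real diagnostic runs at compile time and also verifies the metric against the Observability DSL.

```csharp
using System;
using System.Reflection;

// Stand-in for the DSL attribute; the real definition is not shown in this post.
[AttributeUsage(AttributeTargets.Class)]
public sealed class CanaryStrategyAttribute : Attribute
{
    public int InitialPercentage { get; set; }
    public int StepPercentage { get; set; }
    public string StepDuration { get; set; } = "";
    public string SuccessMetric { get; set; } = "";
    public string MetricThreshold { get; set; } = ""; // type is an assumption
}

// Would trigger OPS014: no metric gate.
[CanaryStrategy(InitialPercentage = 10, StepPercentage = 20, StepDuration = "00:05:00")]
public sealed class BlindCanary { }

// Passes: the rollout is gated on a named metric (name is illustrative).
[CanaryStrategy(InitialPercentage = 10, StepPercentage = 20, StepDuration = "00:05:00",
    SuccessMetric = "order_api_request_duration_seconds",
    MetricThreshold = "p95 < 0.5")]
public sealed class GatedCanary { }

public static class Ops014
{
    // A type without [CanaryStrategy] is out of scope and therefore passes.
    public static bool HasMetricGate(Type type)
    {
        var canary = type.GetCustomAttribute<CanaryStrategyAttribute>();
        return canary is null || !string.IsNullOrWhiteSpace(canary.SuccessMetric);
    }
}
```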

OPS015: Circuit Breaker Without Fallback Method

[CircuitBreaker("critical-payment-service",
    FailureThreshold = 3,
    RecoveryTimeout = "00:00:30")]
    // No FallbackMethod
public sealed class PaymentResilience { }

// warning OPS015: CircuitBreaker for 'critical-payment-service' has no
//   FallbackMethod. When the circuit opens, calls will throw
//   BrokenCircuitException with no graceful degradation.
//   Set FallbackMethod = "..." and FallbackType = typeof(...)
//   or suppress if exceptions are handled by the caller.

Not every circuit breaker needs a fallback (the notification service does not). But the analyzer warns, and you either provide a fallback or suppress. The decision is explicit.
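One way to record the "suppress" decision next to the declaration is a standard C# pragma. The sketch below assumes OPS015 behaves like any other analyzer diagnostic; the attribute is a stand-in so the example compiles alone, and the values on NotificationResilience are illustrative rather than taken from the generated output.

```csharp
using System;

// Stand-in for the DSL attribute; the real definition is not shown in this post.
[AttributeUsage(AttributeTargets.Class)]
public sealed class CircuitBreakerAttribute : Attribute
{
    public CircuitBreakerAttribute(string service) => Service = service;
    public string Service { get; }
    public int FailureThreshold { get; set; }
    public string RecoveryTimeout { get; set; } = "";
}

// Notifications are fire-and-forget, so an open circuit is acceptable here.
// The pragma keeps that decision visible in the source.
#pragma warning disable OPS015
[CircuitBreaker("notification-service",
    FailureThreshold = 20,
    RecoveryTimeout = "00:05:00")]
public sealed class NotificationResilience { }
#pragma warning restore OPS015
```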

OPS016: Rollback Plan Not Tested

[RollbackPlan("RevertApi", "RollbackDb", "NotifyTeam")]
public sealed class OrderServiceResilience { }

// No [RollbackTest] attribute found for any step

// warning OPS016: RollbackPlan on 'OrderServiceResilience' references
//   3 steps but none have a corresponding [RollbackTest] in the
//   Testing DSL. Untested rollback plans may not work when needed.
//   Add [RollbackTest("RevertApi")] to a test method.

This is a cross-DSL diagnostic with the Testing sub-DSL. A rollback plan that has never been tested is a plan that might not work at 3 AM when you need it most.
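At its core the check is a set difference between the steps a plan names and the steps some test claims to cover. A minimal sketch, assuming both sides are available as plain step names (the Ops016 class is hypothetical, and the real analyzer works on attributes at compile time):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not the analyzer's implementation.
public static class Ops016
{
    // Returns the rollback steps that no test claims to cover, in plan order.
    public static IReadOnlyList<string> UntestedSteps(
        IEnumerable<string> planSteps, IEnumerable<string> testedSteps)
    {
        var tested = new HashSet<string>(testedSteps, StringComparer.Ordinal);
        return planSteps.Where(step => !tested.Contains(step)).ToList();
    }
}
```

For the plan above, a single [RollbackTest("RevertApi")] would leave RollbackDb and NotifyTeam in the result.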


Resilience to Observability

The canary strategy's SuccessMetric references a metric from the Observability DSL. The analyzer verifies the metric exists and has a compatible type. The generated Argo Rollout analysis template embeds the Prometheus query directly from the [Metric] attribute's metadata.

When the circuit breaker opens, the generated code emits a metric:

// Generated inside the circuit breaker pipeline
.AddCircuitBreaker(new CircuitBreakerStrategyOptions
{
    OnOpened = args =>
    {
        // The tag type is spelled out so the Counter.Add overload is unambiguous.
        CircuitBreakerMetrics.CircuitOpened
            .Add(1, new KeyValuePair<string, object?>("service", "payment-service"));
        return default;
    },
    OnClosed = args =>
    {
        CircuitBreakerMetrics.CircuitClosed
            .Add(1, new KeyValuePair<string, object?>("service", "payment-service"));
        return default;
    },
})

The Observability DSL picks up circuit_breaker_opened_total and circuit_breaker_closed_total and wires them into dashboards and alerts.
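The CircuitBreakerMetrics class itself is not shown in this post. A plausible shape, using the System.Diagnostics.Metrics API from the base class library, might look like the sketch below; the meter name "Ops.Resilience" is an assumption, while the two counter names come from the paragraph above.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch of the counters the generated callbacks write to.
public static class CircuitBreakerMetrics
{
    // Meter name is an assumption made for this example.
    private static readonly Meter OpsMeter = new("Ops.Resilience");

    public static readonly Counter<long> CircuitOpened =
        OpsMeter.CreateCounter<long>("circuit_breaker_opened_total");

    public static readonly Counter<long> CircuitClosed =
        OpsMeter.CreateCounter<long>("circuit_breaker_closed_total");
}
```

Any listener or exporter subscribed to that meter then receives one measurement per state change, tagged with the service name.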

Resilience to Chaos

The Chaos DSL (steady-state hypothesis testing) needs to know which circuit breakers exist and what their thresholds are. It uses this to design experiments:

// Chaos DSL reads the Resilience DSL's attributes:
[ChaosExperiment("payment-service-failure",
    Target = "payment-service",
    Action = ChaosAction.NetworkBlackhole,
    Duration = "00:02:00",
    SteadyStateMetric = "order_api_http_errors_total",
    SteadyStateThreshold = "rate < 0.01")]

The Chaos analyzer verifies that the circuit breaker for payment-service exists, that its RecoveryTimeout is shorter than the experiment Duration, and that the fallback method is defined.
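The timing part of that verification reduces to comparing two durations. A minimal sketch, assuming both values use the hh:mm:ss string form seen in the attributes (the ChaosPreconditions class is hypothetical):

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not the Chaos analyzer itself.
public static class ChaosPreconditions
{
    // True when the breaker's recovery timeout fits inside the experiment,
    // so the experiment can observe the circuit attempting to recover.
    public static bool BreakerCanRecoverWithin(
        string recoveryTimeout, string experimentDuration)
    {
        var recovery = TimeSpan.Parse(recoveryTimeout, CultureInfo.InvariantCulture);
        var duration = TimeSpan.Parse(experimentDuration, CultureInfo.InvariantCulture);
        return recovery < duration;
    }
}
```

A RecoveryTimeout of "00:00:30" fits inside the two-minute experiment above; a five-minute timeout would not.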

Resilience to Deployment

The [RollbackPlan] steps reference deployment actions. The Deployment DSL's [DeploymentGate] can require that the rollback plan exists:

[DeploymentGate("rollback-plan-exists",
    Kind = GateKind.HealthCheck,
    Target = "resilience:OrderServiceResilience:rollback")]

If the rollback plan is removed from the Resilience DSL, the Deployment build fails.


Two DSLs, One Pattern

Configuration and Resilience are the bookends of operational correctness:

  • Configuration answers: "Will the system start correctly in this environment?"
  • Resilience answers: "Will the system recover when something fails at runtime?"

Both follow the same pattern: attributes declare intent, source generators produce infrastructure, analyzers validate cross-DSL references.

The config that was missing in production? The compiler catches it. The circuit breaker threshold that was in a Slack thread? It is an attribute. The rollback plan that was in a wiki? It is code. The canary strategy that was an Argo Rollout YAML nobody understood? It is five named properties on a C# attribute.

Environment correctness and fault tolerance are compiler outputs. They compile, or they do not ship.
