
Ops.Observability -- Health, Metrics, Alerts, Dashboards

"The dashboard was renamed last week. The alert still references the old metric name. The runbook link returns 404. Everything is green."


The Problem

Observability infrastructure has a drift problem. It is not a code problem -- the code works. It is a metadata problem: the monitoring layer references names, endpoints, and thresholds that no longer match reality.

Concrete failures:

  1. A Grafana dashboard was created six months ago. Three of its twenty panels reference metrics that were renamed in a refactor. Those three panels show "No data." Nobody notices, because the broken panels sit at the bottom of a crowded dashboard.

  2. An alert rule fires on order_api_error_rate > 5%. The metric was renamed to order_api_http_errors_total two sprints ago. The alert never fires. The team thinks the error rate is zero.

  3. A critical alert has Severity = Critical and Runbook = "https://wiki.internal/runbooks/order-api-errors". The wiki page was moved to a new URL structure. The on-call engineer gets paged at 3 AM, clicks the runbook link, and sees a 404.

  4. A new service is deployed without any health checks. No readiness probe is defined, so Kubernetes marks the pod Ready the moment the container starts. The application inside has not finished initializing. Traffic arrives. The service returns 503 for 45 seconds.

All four problems have the same root cause: the observability configuration is not typed. It is YAML files, JSON dashboards, and Terraform that reference string names. When those names change in code, nothing breaks in the monitoring stack. Until production.


Attribute Definitions

// =================================================================
// Ops.Observability.Lib -- Health, Metrics, Alerts, Dashboards DSL
// =================================================================

/// Declare a health check endpoint for a service.
/// The generator registers it with ASP.NET health checks (InProcess),
/// Docker HEALTHCHECK (Container), and monitoring probes (Cloud).
[AttributeUsage(AttributeTargets.Method)]
public sealed class HealthCheckAttribute : Attribute
{
    public string Name { get; }
    public string Endpoint { get; init; } = "/health";
    public string Timeout { get; init; } = "00:00:05";
    public int Retries { get; init; } = 3;
    public string[] Tags { get; init; } = [];
    public HealthCheckKind Kind { get; init; } = HealthCheckKind.Readiness;

    public HealthCheckAttribute(string name) => Name = name;
}

public enum HealthCheckKind
{
    Readiness,     // can accept traffic
    Liveness,      // process is alive (not deadlocked)
    Startup        // initialization complete
}

/// Declare a metric that the application exposes.
/// The generator creates OpenTelemetry meter registrations (InProcess),
/// Prometheus scrape targets (Container), and CloudWatch/Datadog
/// metric definitions (Cloud).
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class, AllowMultiple = true)]
public sealed class MetricAttribute : Attribute
{
    public string Name { get; }
    public MetricKind Kind { get; init; } = MetricKind.Counter;
    public string Unit { get; init; } = "";
    public string Description { get; init; } = "";
    public string[] Labels { get; init; } = [];
    public double[] HistogramBuckets { get; init; } = [];

    public MetricAttribute(string name) => Name = name;
}

public enum MetricKind
{
    Counter,       // monotonically increasing (request count, error count)
    Gauge,         // current value (queue depth, active connections)
    Histogram,     // distribution (latency, request size)
    Summary        // quantiles (p50, p95, p99)
}

/// Declare an alert rule that fires when a metric crosses a threshold.
/// The generator emits Prometheus alerting rules (Container) and
/// CloudWatch alarms (Cloud).
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class AlertRuleAttribute : Attribute
{
    public string Name { get; }
    public string Metric { get; }
    public string Condition { get; }
    public AlertSeverity Severity { get; init; } = AlertSeverity.Warning;
    public string[] NotifyChannels { get; init; } = [];
    public string Runbook { get; init; } = "";
    public string For { get; init; } = "5m";           // duration before firing
    public string Summary { get; init; } = "";
    public string Description { get; init; } = "";

    public AlertRuleAttribute(string name, string metric, string condition)
    {
        Name = name;
        Metric = metric;
        Condition = condition;
    }
}

public enum AlertSeverity
{
    Info,          // informational, no page
    Warning,       // needs attention during business hours
    Critical,      // pages on-call immediately
    Fatal          // all-hands incident
}

/// Declare a Grafana dashboard. Panels are declared with [DashboardPanel].
[AttributeUsage(AttributeTargets.Class)]
public sealed class DashboardAttribute : Attribute
{
    public string Name { get; }
    public string Folder { get; init; } = "Generated";
    public string[] Tags { get; init; } = [];
    public string RefreshInterval { get; init; } = "30s";

    public DashboardAttribute(string name) => Name = name;
}

/// Declare a panel within a dashboard.
/// The generator produces the Grafana JSON panel definition.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class DashboardPanelAttribute : Attribute
{
    public string Title { get; }
    public string Metric { get; }
    public PanelVisualization Visualization { get; init; } = PanelVisualization.TimeSeries;
    public int GridX { get; init; } = 0;
    public int GridY { get; init; } = 0;
    public int Width { get; init; } = 12;
    public int Height { get; init; } = 8;
    public string Legend { get; init; } = "";
    public string[] GroupBy { get; init; } = [];

    public DashboardPanelAttribute(string title, string metric)
    {
        Title = title;
        Metric = metric;
    }
}

public enum PanelVisualization
{
    TimeSeries,    // line chart over time
    Gauge,         // single value with thresholds
    BarChart,      // categorical comparison
    Stat,          // big number
    Table,         // tabular data
    Heatmap        // 2D histogram
}
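The enum comments gloss why Histogram is its own kind: percentiles such as the p95 used later in the alert rules are not stored anywhere -- they are estimated from cumulative bucket counts. A language-agnostic sketch in Python (illustrative only, not part of the DSL or of Prometheus itself) of histogram_quantile-style linear interpolation:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets,
    interpolating linearly inside the bucket that contains the target
    rank -- the same idea as Prometheus's histogram_quantile()."""
    buckets = sorted(buckets)            # ascending upper bounds, +Inf last
    total = buckets[-1][1]               # count in the +Inf bucket
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            width = le - prev_le         # bucket bounds
            inside = count - prev_count  # observations in this bucket
            return prev_le + width * (rank - prev_count) / inside
        prev_le, prev_count = le, count
    return buckets[-1][0]

# cumulative counts for upper bounds 0.1s, 0.25s, 0.5s, +Inf
samples = [(0.1, 60), (0.25, 90), (0.5, 99), (float("inf"), 100)]
p95 = histogram_quantile(0.95, samples)   # falls in the (0.25, 0.5] bucket
```

A Counter or Gauge cannot answer this question, which is why the latency metric below is declared with explicit HistogramBuckets.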

Usage Example

Complete observability for the Order Service: a latency histogram, an error rate counter, a queue depth gauge, health checks, alerts, and a dashboard.

// -- OrderServiceObservability.cs -----------------------------------

[Dashboard("Order Service",
    Folder = "Services",
    Tags = ["orders", "payments"],
    RefreshInterval = "15s")]
[DashboardPanel("Request Latency (p95)",
    "order_api_request_duration_seconds",
    Visualization = PanelVisualization.TimeSeries,
    GridX = 0, GridY = 0, Width = 12, Height = 8,
    GroupBy = ["endpoint", "status_code"])]
[DashboardPanel("Error Rate",
    "order_api_http_errors_total",
    Visualization = PanelVisualization.TimeSeries,
    GridX = 12, GridY = 0, Width = 12, Height = 8)]
[DashboardPanel("Active Orders in Queue",
    "order_queue_depth",
    Visualization = PanelVisualization.Gauge,
    GridX = 0, GridY = 8, Width = 8, Height = 6)]
[DashboardPanel("Payment Success Rate",
    "order_payment_success_ratio",
    Visualization = PanelVisualization.Stat,
    GridX = 8, GridY = 8, Width = 8, Height = 6)]
[DashboardPanel("Latency Heatmap",
    "order_api_request_duration_seconds",
    Visualization = PanelVisualization.Heatmap,
    GridX = 16, GridY = 8, Width = 8, Height = 6)]
[AlertRule("OrderHighLatency",
    "order_api_request_duration_seconds",
    "histogram_quantile(0.95, rate(order_api_request_duration_seconds_bucket[5m])) > 0.5",
    Severity = AlertSeverity.Warning,
    For = "5m",
    NotifyChannels = ["#order-alerts", "oncall-orders"],
    Runbook = "https://docs.internal/runbooks/order-high-latency",
    Summary = "Order API p95 latency above 500ms",
    Description = "The 95th percentile latency for the Order API has exceeded " +
                  "500ms for 5 minutes. Check for database connection pool exhaustion " +
                  "or downstream payment service degradation.")]
[AlertRule("OrderHighErrorRate",
    "order_api_http_errors_total",
    "rate(order_api_http_errors_total[5m]) / rate(order_api_http_requests_total[5m]) > 0.05",
    Severity = AlertSeverity.Critical,
    For = "3m",
    NotifyChannels = ["#order-alerts", "oncall-orders", "engineering-leads"],
    Runbook = "https://docs.internal/runbooks/order-error-rate",
    Summary = "Order API error rate above 5%",
    Description = "More than 5% of Order API requests are failing. " +
                  "Immediate investigation required.")]
public sealed class OrderServiceObservability
{
    [HealthCheck("order-api-ready",
        Endpoint = "/health/ready",
        Timeout = "00:00:05",
        Retries = 3,
        Tags = ["order", "api"],
        Kind = HealthCheckKind.Readiness)]
    public void OrderApiReadiness() { }

    [HealthCheck("order-api-live",
        Endpoint = "/health/live",
        Timeout = "00:00:03",
        Retries = 1,
        Kind = HealthCheckKind.Liveness)]
    public void OrderApiLiveness() { }

    [HealthCheck("order-payment-check",
        Endpoint = "/health/payment",
        Timeout = "00:00:10",
        Retries = 2,
        Tags = ["order", "payment", "downstream"],
        Kind = HealthCheckKind.Readiness)]
    public void PaymentDownstreamHealth() { }

    [Metric("order_api_request_duration_seconds",
        Kind = MetricKind.Histogram,
        Unit = "seconds",
        Description = "HTTP request duration for Order API endpoints",
        Labels = ["endpoint", "method", "status_code"],
        HistogramBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0])]
    public void RequestLatency() { }

    [Metric("order_api_http_errors_total",
        Kind = MetricKind.Counter,
        Unit = "requests",
        Description = "Total HTTP error responses (4xx + 5xx)",
        Labels = ["endpoint", "status_code"])]
    public void ErrorCount() { }

    [Metric("order_api_http_requests_total",
        Kind = MetricKind.Counter,
        Unit = "requests",
        Description = "Total HTTP requests processed",
        Labels = ["endpoint", "method", "status_code"])]
    public void RequestCount() { }

    [Metric("order_queue_depth",
        Kind = MetricKind.Gauge,
        Unit = "messages",
        Description = "Number of orders waiting in the processing queue")]
    public void QueueDepth() { }

    [Metric("order_payment_success_ratio",
        Kind = MetricKind.Gauge,
        Unit = "ratio",
        Description = "Ratio of successful payment transactions (0.0 to 1.0)")]
    public void PaymentSuccessRate() { }
}

One class. Five panels. Two alerts. Three health checks. Five metrics. All typed, all cross-referenced, all validated at compile time.


Three-tier projection

[Metric] and [AlertRule] declarations project across all three tiers. Local runs get OpenTelemetry meters and registered health checks. Container runs get a Helm-friendly prometheus-rules.yaml plus a complete grafana-dashboard.json. Cloud runs add the Prometheus-Operator-native CRDs (ServiceMonitor for scraping, PrometheusRule for alerts), so Operator-based clusters consume the same metric and alert definitions without going through Helm.

[Diagram: the same attribute surface projects into three tiers -- OTel meters and health checks for local runs, Prometheus rules and Grafana dashboards for containers, and ServiceMonitor/PrometheusRule CRDs for cloud clusters.]

Generated Artifacts

The Observability DSL is the most artifact-heavy of all 22 sub-DSLs. A single dotnet build produces C# registrations, YAML alert rules, a complete Grafana dashboard JSON, and -- on Operator-based clusters -- two Prometheus Operator CRDs.

InProcess Tier: HealthCheckRegistration.g.cs

// <auto-generated by Ops.Observability.Generators />
namespace Ops.Observability.Generated;

public static class HealthCheckRegistration
{
    public static IHealthChecksBuilder AddGeneratedHealthChecks(
        this IHealthChecksBuilder builder)
    {
        builder.AddCheck("order-api-ready",
            new HttpHealthCheck(
                endpoint: "/health/ready",
                timeout: TimeSpan.FromSeconds(5),
                retries: 3),
            tags: ["order", "api"],
            failureStatus: HealthStatus.Unhealthy);

        builder.AddCheck("order-api-live",
            new HttpHealthCheck(
                endpoint: "/health/live",
                timeout: TimeSpan.FromSeconds(3),
                retries: 1),
            tags: [],
            failureStatus: HealthStatus.Unhealthy);

        builder.AddCheck("order-payment-check",
            new HttpHealthCheck(
                endpoint: "/health/payment",
                timeout: TimeSpan.FromSeconds(10),
                retries: 2),
            tags: ["order", "payment", "downstream"],
            failureStatus: HealthStatus.Degraded);

        return builder;
    }
}
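The Timeout and Retries attribute values reduce to a simple probe loop: a check is healthy if any of N attempts succeeds within its timeout budget. A sketch of those semantics in Python (illustrative only -- the names run_health_check and flaky_probe are hypothetical, and the real HttpHealthCheck is the generated C# above):

```python
def run_health_check(probe, retries, timeout_s):
    """Return True if any of `retries` attempts succeeds.
    Each attempt gets its own timeout budget; a timed-out
    attempt counts the same as a failed one."""
    for _attempt in range(retries):
        try:
            if probe(timeout_s):
                return True
        except TimeoutError:
            pass
    return False

# hypothetical probe that fails twice, then succeeds
calls = {"n": 0}
def flaky_probe(timeout_s):
    calls["n"] += 1
    return calls["n"] >= 3

healthy = run_health_check(flaky_probe, retries=3, timeout_s=5.0)
```

With Retries = 3 the flaky probe passes on its final attempt; with Retries = 1 (the liveness check above) it would report unhealthy.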

InProcess Tier: MetricsRegistration.g.cs

// <auto-generated by Ops.Observability.Generators />
namespace Ops.Observability.Generated;

public static class MetricsRegistration
{
    private static readonly Meter OrderServiceMeter = new("OrderService", "2.4.0");

    public static readonly Histogram<double> RequestDuration =
        OrderServiceMeter.CreateHistogram<double>(
            "order_api_request_duration_seconds",
            unit: "seconds",
            description: "HTTP request duration for Order API endpoints");

    public static readonly Counter<long> HttpErrors =
        OrderServiceMeter.CreateCounter<long>(
            "order_api_http_errors_total",
            unit: "requests",
            description: "Total HTTP error responses (4xx + 5xx)");

    public static readonly Counter<long> HttpRequests =
        OrderServiceMeter.CreateCounter<long>(
            "order_api_http_requests_total",
            unit: "requests",
            description: "Total HTTP requests processed");

    public static readonly ObservableGauge<int> QueueDepth =
        OrderServiceMeter.CreateObservableGauge<int>(
            "order_queue_depth",
            observeValue: () => OrderQueueMetrics.CurrentDepth,
            unit: "messages",
            description: "Number of orders waiting in the processing queue");

    public static readonly ObservableGauge<double> PaymentSuccessRatio =
        OrderServiceMeter.CreateObservableGauge<double>(
            "order_payment_success_ratio",
            observeValue: () => PaymentMetrics.SuccessRatio,
            unit: "ratio",
            description: "Ratio of successful payment transactions (0.0 to 1.0)");

    public static IServiceCollection AddGeneratedMetrics(
        this IServiceCollection services)
    {
        services.AddOpenTelemetry()
            .WithMetrics(builder => builder
                .AddMeter("OrderService")
                .AddPrometheusExporter());
        return services;
    }
}
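The order_api_request_duration_seconds_bucket series queried by the alert rules below does not exist as code anywhere: it is what a histogram instrument exports. A minimal sketch in Python (Prometheus exposition-style cumulative buckets; illustrative, not the OTel implementation) of what one recorded observation does to those series:

```python
class HistogramSeries:
    """Cumulative histogram in the Prometheus exposition style:
    each observation increments _count, _sum, and every bucket
    whose upper bound (le) is >= the observed value."""
    def __init__(self, name, bounds):
        self.name = name
        self.bounds = list(bounds) + [float("inf")]   # +Inf bucket is implicit
        self.bucket = {le: 0 for le in self.bounds}
        self.count = 0
        self.sum = 0.0

    def record(self, value):
        self.count += 1
        self.sum += value
        for le in self.bounds:
            if value <= le:
                self.bucket[le] += 1   # cumulative: all wider buckets too

h = HistogramSeries("order_api_request_duration_seconds", [0.1, 0.25, 0.5])
for v in (0.03, 0.2, 0.4, 1.2):
    h.record(v)
# _bucket{le="0.25"} now counts the 0.03 and 0.2 observations
```

These cumulative counts are exactly what histogram_quantile() and the heatmap panel consume.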

Container Tier: prometheus-rules.yaml

# <auto-generated by Ops.Observability.Generators />
# Alert rules for: OrderServiceObservability

groups:
  - name: order-service-alerts
    rules:
      - alert: OrderHighLatency
        expr: >-
          histogram_quantile(0.95,
            rate(order_api_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          service: order-service
          team: orders
        annotations:
          summary: "Order API p95 latency above 500ms"
          description: >-
            The 95th percentile latency for the Order API has exceeded
            500ms for 5 minutes. Check for database connection pool
            exhaustion or downstream payment service degradation.
          runbook_url: "https://docs.internal/runbooks/order-high-latency"
          notify_channels: "#order-alerts,oncall-orders"

      - alert: OrderHighErrorRate
        expr: >-
          rate(order_api_http_errors_total[5m])
          / rate(order_api_http_requests_total[5m])
          > 0.05
        for: 3m
        labels:
          severity: critical
          service: order-service
          team: orders
        annotations:
          summary: "Order API error rate above 5%"
          description: >-
            More than 5% of Order API requests are failing.
            Immediate investigation required.
          runbook_url: "https://docs.internal/runbooks/order-error-rate"
          notify_channels: "#order-alerts,oncall-orders,engineering-leads"
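Unpacking the OrderHighErrorRate expression: rate() over a counter is just the increase across the window divided by the window length, and the alert condition is the ratio of two such rates. A sketch in Python (function names are hypothetical; illustrative only) of evaluating the > 0.05 condition from raw counter samples:

```python
def rate(start, end, window_s):
    """Per-second rate of a monotonically increasing counter
    over a window: increase divided by window length."""
    return (end - start) / window_s

def order_high_error_rate(err0, err1, req0, req1, window_s=300):
    """Mirror of: rate(errors[5m]) / rate(requests[5m]) > 0.05."""
    return rate(err0, err1, window_s) / rate(req0, req1, window_s) > 0.05

# 80 new errors out of 1,000 new requests over 5 minutes -> 8% error rate
firing = order_high_error_rate(500, 580, 40_000, 41_000)
# 20 out of 1,000 -> 2%, below the 5% threshold
quiet = order_high_error_rate(500, 520, 40_000, 41_000)
```

The `for: 3m` clause then requires the condition to hold continuously for three minutes before the alert actually fires.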

Container Tier: grafana-dashboard.json

The full Grafana dashboard definition, ready to import:

{
  "dashboard": {
    "title": "Order Service",
    "uid": "order-service-generated",
    "tags": ["orders", "payments", "generated"],
    "timezone": "utc",
    "refresh": "15s",
    "schemaVersion": 39,
    "panels": [
      {
        "id": 1,
        "title": "Request Latency (p95)",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(order_api_request_duration_seconds_bucket[5m])) by (le, endpoint, status_code))",
            "legendFormat": "{{endpoint}} ({{status_code}})"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 0.25 },
                { "color": "red", "value": 0.5 }
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "timeseries",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "rate(order_api_http_errors_total[5m])",
            "legendFormat": "{{endpoint}} ({{status_code}})"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "id": 3,
        "title": "Active Orders in Queue",
        "type": "gauge",
        "gridPos": { "x": 0, "y": 8, "w": 8, "h": 6 },
        "targets": [
          {
            "expr": "order_queue_depth",
            "legendFormat": "queue depth"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 100 },
                { "color": "red", "value": 500 }
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Payment Success Rate",
        "type": "stat",
        "gridPos": { "x": 8, "y": 8, "w": 8, "h": 6 },
        "targets": [
          {
            "expr": "order_payment_success_ratio",
            "legendFormat": "success ratio"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "color": "red", "value": null },
                { "color": "yellow", "value": 0.95 },
                { "color": "green", "value": 0.99 }
              ]
            }
          }
        }
      },
      {
        "id": 5,
        "title": "Latency Heatmap",
        "type": "heatmap",
        "gridPos": { "x": 16, "y": 8, "w": 8, "h": 6 },
        "targets": [
          {
            "expr": "sum(increase(order_api_request_duration_seconds_bucket[5m])) by (le)",
            "format": "heatmap",
            "legendFormat": "{{le}}"
          }
        ]
      }
    ]
  }
}

That JSON is not hand-written. It is generated from the five [DashboardPanel] attributes. When a metric is renamed in C#, the panel expression updates on the next build. When a metric is deleted, the analyzer reports the broken panel reference before the dashboard can drift.
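The GridX/GridY/Width/Height values can also be checked before the JSON is emitted: Grafana lays panels out on a 24-column grid, so overflow and overlap are detectable statically. A sketch of such a check in Python (validate_layout is a hypothetical name; the article does not claim the generator performs exactly this check):

```python
GRID_COLUMNS = 24   # Grafana dashboards use a 24-column grid

def validate_layout(panels):
    """panels: (title, x, y, w, h) tuples. Returns a list of problems:
    panels that overflow the grid or overlap an earlier panel."""
    problems, occupied = [], {}
    for title, x, y, w, h in panels:
        if x + w > GRID_COLUMNS:
            problems.append(f"{title}: overflows the {GRID_COLUMNS}-column grid")
        rect = {(cx, cy) for cx in range(x, x + w) for cy in range(y, y + h)}
        for other, cells in occupied.items():
            if rect & cells:
                problems.append(f"{title}: overlaps {other}")
        occupied[title] = rect
    return problems

# the five Order Service panels: two rows, no overlap, no overflow
panels = [
    ("Request Latency (p95)",   0, 0, 12, 8),
    ("Error Rate",             12, 0, 12, 8),
    ("Active Orders in Queue",  0, 8,  8, 6),
    ("Payment Success Rate",    8, 8,  8, 6),
    ("Latency Heatmap",        16, 8,  8, 6),
]
problems = validate_layout(panels)   # empty list: layout is valid
```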

Cloud Tier: monitoring/servicemonitor.yaml

For clusters running the Prometheus Operator, the generator emits a ServiceMonitor CRD directly. The label selector matches the deployment label conventions used in Part 5 (app.kubernetes.io/name). This is the Prometheus-Operator-native path; the Helm-friendly prometheus-rules.yaml above is the alternative path for clusters that don't run the Operator. Both are emitted from the same [Metric] declarations -- pick whichever path matches the cluster.

# <auto-generated by Ops.Observability.Generators />
# monitoring/servicemonitor.yaml -- Prometheus Operator scrape target
# Source: [Metric] attributes on OrderServiceObservability

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-metrics
  namespace: orders
  labels:
    app.kubernetes.io/name: order-service
    app.kubernetes.io/managed-by: ops.observability.generators
    # Prometheus Operator selector — matches the operator's serviceMonitorSelector
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: order-service
  namespaceSelector:
    matchNames:
      - orders
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      honorLabels: true
      # Relabeling to surface the [Metric] attribute names as Prometheus labels
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_version]
          targetLabel: service_version
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

Cloud Tier: monitoring/prometheusrule.yaml

The same alert rules shown in prometheus-rules.yaml above, wrapped in a Prometheus Operator PrometheusRule CRD envelope. Same rules, two formats -- the rule content is identical; only the metadata wrapper (and the extra level of indentation it brings) changes. Operator-based clusters apply the CRD; chart-based clusters Helm-wrap the bare rules file. The duplication is deliberate: it is what lets the same source declarations land in either deployment model without a lossy conversion.

# <auto-generated by Ops.Observability.Generators />
# monitoring/prometheusrule.yaml -- Prometheus Operator alert rules
# Source: [AlertRule] attributes on OrderServiceObservability
# Mirrors: prometheus-rules.yaml (same rules, CRD-wrapped for Operator clusters)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: order-service-alerts
  namespace: orders
  labels:
    app.kubernetes.io/name: order-service
    app.kubernetes.io/managed-by: ops.observability.generators
    release: kube-prometheus-stack
    prometheus: kube-prometheus
spec:
  groups:
    - name: order-service-alerts
      rules:
        - alert: OrderHighLatency
          expr: >-
            histogram_quantile(0.95,
              rate(order_api_request_duration_seconds_bucket[5m])
            ) > 0.5
          for: 5m
          labels:
            severity: warning
            service: order-service
            team: orders
          annotations:
            summary: "Order API p95 latency above 500ms"
            description: >-
              The 95th percentile latency for the Order API has exceeded
              500ms for 5 minutes. Check for database connection pool
              exhaustion or downstream payment service degradation.
            runbook_url: "https://docs.internal/runbooks/order-high-latency"
            notify_channels: "#order-alerts,oncall-orders"

        - alert: OrderHighErrorRate
          expr: >-
            rate(order_api_http_errors_total[5m])
            / rate(order_api_http_requests_total[5m])
            > 0.05
          for: 3m
          labels:
            severity: critical
            service: order-service
            team: orders
          annotations:
            summary: "Order API error rate above 5%"
            description: >-
              More than 5% of Order API requests are failing.
              Immediate investigation required.
            runbook_url: "https://docs.internal/runbooks/order-error-rate"
            notify_channels: "#order-alerts,oncall-orders,engineering-leads"
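The CRD really is mechanical to derive from the bare rules file: nest the rule groups under spec and attach the Operator metadata. A sketch in Python of that wrapping step (wrap_as_prometheus_rule is a hypothetical name; illustrative only):

```python
def wrap_as_prometheus_rule(bare_rules, name, namespace):
    """Wrap a bare {'groups': [...]} rules document in a
    PrometheusRule CRD envelope. The rule body is reused unchanged --
    only the metadata wrapper is added."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": bare_rules,   # same groups as prometheus-rules.yaml
    }

bare = {"groups": [{"name": "order-service-alerts", "rules": []}]}
crd = wrap_as_prometheus_rule(bare, "order-service-alerts", "orders")
```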

Cloud Tier: Production Monitoring Stack (Terraform wrapper)

For clusters that don't run the Prometheus Operator, the generator falls back to a Terraform Helm release that wraps the bare prometheus-rules.yaml file shown earlier in the chapter. Same alert content, three deployment paths.

# <auto-generated by Ops.Observability.Generators />

# Prometheus monitoring stack
resource "helm_release" "prometheus_rules" {
  name       = "order-service-rules"
  chart      = "prometheus-rules"
  namespace  = var.monitoring_namespace

  values = [file("${path.module}/prometheus-rules.yaml")]
}

# Grafana dashboard provisioning
resource "grafana_dashboard" "order_service" {
  folder      = grafana_folder.services.id
  config_json = file("${path.module}/grafana-dashboard.json")
}

# PagerDuty integration for critical alerts
resource "pagerduty_service" "order_service" {
  name                    = "Order Service"
  escalation_policy       = data.pagerduty_escalation_policy.engineering.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400
  acknowledgement_timeout = 600
}

OPS007: Metric Referenced in Alert But Not Defined

[AlertRule("HighMemoryUsage",
    "order_api_memory_usage_bytes",     // <-- this metric is not declared anywhere
    "order_api_memory_usage_bytes > 1073741824",
    Severity = AlertSeverity.Warning)]
public sealed class OrderServiceObservability { }

// error OPS007: Alert 'HighMemoryUsage' references metric
//   'order_api_memory_usage_bytes' but no [Metric] with this name
//   exists in the compilation. Add [Metric("order_api_memory_usage_bytes", ...)]
//   or correct the metric name.

The analyzer scans every [AlertRule] and verifies that the Metric property matches a declared [Metric] attribute somewhere in the compilation. This catches the renamed-metric problem at compile time.
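The check itself is a set difference over the compilation. A sketch of the OPS007 logic in Python (the real analyzer is a Roslyn component; check_alert_metrics is a hypothetical name used here for illustration):

```python
def check_alert_metrics(declared_metrics, alert_rules):
    """OPS007 sketch: every alert's Metric must match a declared
    [Metric] name. Returns the (alert, metric) pairs that don't."""
    declared = set(declared_metrics)
    return [(alert, metric) for alert, metric in alert_rules
            if metric not in declared]

declared = {"order_api_request_duration_seconds",
            "order_api_http_errors_total"}
alerts = [("OrderHighLatency", "order_api_request_duration_seconds"),
          ("HighMemoryUsage", "order_api_memory_usage_bytes")]  # not declared

missing = check_alert_metrics(declared, alerts)
# [("HighMemoryUsage", "order_api_memory_usage_bytes")] -> error OPS007
```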

OPS008: Critical Alert Without Runbook

[AlertRule("DatabaseDown",
    "order_db_connection_pool_active",
    "order_db_connection_pool_active == 0",
    Severity = AlertSeverity.Critical,
    NotifyChannels = ["#order-alerts"])]
    // No Runbook specified
public sealed class OrderServiceObservability { }

// error OPS008: Alert 'DatabaseDown' has Severity = Critical but no
//   Runbook URL. Critical alerts must have a runbook so the on-call
//   engineer can act immediately. Set Runbook = "https://..."

Critical and Fatal alerts require a runbook. Warning and Info do not. The analyzer enforces this because a page at 3 AM without a runbook is a page that escalates to the entire team.
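The rule reduces to a single predicate over the severity enum. A sketch of the OPS008 policy in Python (needs_runbook is a hypothetical name; illustrative only):

```python
PAGING_SEVERITIES = {"Critical", "Fatal"}   # severities that page a human

def needs_runbook(severity, runbook):
    """OPS008 sketch: a paging severity with an empty Runbook
    is a compile-time error; Warning and Info may omit it."""
    return severity in PAGING_SEVERITIES and not runbook

violation = needs_runbook("Critical", "")   # fires OPS008
ok = needs_runbook("Warning", "")           # Warning may omit the runbook
```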

OPS009: Dashboard Panel Referencing Nonexistent Metric

[DashboardPanel("CPU Usage",
    "order_api_cpu_usage_percent",     // <-- not declared as a [Metric]
    Visualization = PanelVisualization.TimeSeries)]
public sealed class OrderServiceObservability { }

// warning OPS009: Dashboard panel 'CPU Usage' references metric
//   'order_api_cpu_usage_percent' which is not declared as a [Metric]
//   in this compilation. The panel will show 'No data' in Grafana.

OPS010: App Without Any Health Check

// OrderServiceObservability exists but has no [HealthCheck] methods

// warning OPS010: No [HealthCheck] attributes found for service
//   'order-service'. Services without health checks cannot report
//   readiness to load balancers or orchestrators. Add at least one
//   [HealthCheck] with Kind = Readiness.

This is the same diagnostic that the Deployment DSL (OPS002) triggers, but from the Observability side. Both analyzers report it, ensuring the gap is visible regardless of which NuGet package the team uses.


Observability to Incident

When an alert with Severity = Critical fires, the Incident DSL (another of the 22 sub-DSLs) needs to create an incident. The Observability generator cross-references:

// Alert declares:
[AlertRule("OrderHighErrorRate", ..., Severity = AlertSeverity.Critical)]

// Incident DSL expects:
[IncidentEscalation("OrderHighErrorRate",
    Team = "orders",
    SlaMinutes = 15,
    AutoCreate = true)]

If a Critical alert has no corresponding [IncidentEscalation], the analyzer warns:

warning OPS030: Critical alert 'OrderHighErrorRate' has no matching
  [IncidentEscalation] in the Incident DSL. Critical alerts should
  have an escalation policy to ensure timely response.

Observability to Performance

The Performance DSL defines SLIs (Service Level Indicators) that reference metrics:

[ServiceLevelIndicator("order-api-latency",
    Metric = "order_api_request_duration_seconds",
    Objective = "p95 < 500ms",
    Window = "30d")]

The Observability analyzer verifies that the referenced metric exists and has the correct MetricKind (a latency SLI must reference a Histogram, not a Counter).
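A percentile objective only makes sense against bucketed data, so the kind check is a simple pairing of objective shape and MetricKind. A sketch in Python (sli_kind_error is a hypothetical name; the real check runs in the analyzer):

```python
def sli_kind_error(objective, metric_kind):
    """Sketch of the SLI kind check: a percentile objective
    (p50/p95/p99 ...) must reference a Histogram metric,
    because quantiles are estimated from bucket counts."""
    is_percentile = objective.lstrip().lower().startswith("p")
    if is_percentile and metric_kind != "Histogram":
        return (f"objective '{objective}' requires a Histogram metric, "
                f"got {metric_kind}")
    return None

ok = sli_kind_error("p95 < 500ms", "Histogram")   # None: valid pairing
bad = sli_kind_error("p95 < 500ms", "Counter")    # diagnostic string
```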

Observability to Chaos

The Chaos DSL (steady-state verification) references metrics as probes:

[SteadyStateProbe("order-error-rate",
    Metric = "order_api_http_errors_total",
    Assertion = "rate < 0.01",
    Description = "Error rate must stay below 1% during chaos experiment")]

The Observability analyzer verifies the metric reference. If the metric is renamed, the chaos experiment's steady-state probe updates automatically on the next build.


Why This Matters

The Observability DSL produces more generated artifacts than any other sub-DSL: C# health check registrations, OpenTelemetry meter configurations, Prometheus alerting rules, and complete Grafana dashboard JSON.

Every artifact references the same metric names, because they all come from the same [Metric] attribute. When a metric is renamed, one attribute changes, and dotnet build regenerates every artifact. The Prometheus rule updates. The Grafana panel updates. The alert condition updates. The health check registration updates.

No drift. No "the dashboard was renamed last week." No 3 AM page with a 404 runbook link.

The monitoring stack is a compiler output.
