Ops.Performance -- SLIs, SLOs, and Performance Budgets

The Problem

Every organization has performance goals. Very few enforce them.

The pattern repeats:

The SRE team defines SLOs in a Google Doc. The document is titled "SLO Definitions Q3 2025." It has a table with four columns: service, SLI, target, window. Nobody updates it after Q3.
A developer adds an endpoint. There is no performance budget. The endpoint returns 2 MB of JSON because it eagerly loads three navigation properties. Nobody notices until a mobile client on 3G reports a 12-second load time.
A cache is added to fix the slow endpoint. The cache has no invalidation strategy. A customer updates their order and sees stale data for 5 minutes. The fix is to reduce the TTL to 30 seconds, which defeats the purpose of the cache.
A performance regression ships because the benchmark suite was last run six months ago. The BenchmarkDotNet project still references the old API. It does not compile.

What is missing:

SLIs defined in code. An SLI is a measurement. It should be declared next to the service it measures, not in a wiki.
SLOs that reference SLIs. An SLO is a target for an SLI. If the SLI does not exist, the SLO is meaningless. The compiler should reject orphan SLOs.
Per-endpoint performance budgets. Every endpoint should have a P50, P95, P99 latency budget and a maximum payload size. The build should fail if an endpoint has no budget.
Cache policies with invalidation. A cache without an invalidation strategy is a bug waiting to happen. The analyzer should require an invalidation event for every cache policy.
Benchmark targets that stay in sync. If a method signature changes, the benchmark should fail to compile — not silently skip the method.

Attribute Definitions

// =================================================================
// Ops.Performance.Lib -- Performance DSL Attributes
// =================================================================

/// The kind of service-level indicator being measured.
public enum SliKind
{
    Availability,   // percentage of successful requests
    Latency,        // response time distribution
    Throughput,     // requests per second
    ErrorRate,      // percentage of failed requests
    Saturation      // resource utilization (CPU, memory, connections)
}

/// The unit of measurement for an SLI.
public enum SliMeasurement
{
    Milliseconds,
    Seconds,
    Percentage,
    RequestsPerSecond,
    BytesPerSecond,
    Count
}

/// Declares a service-level indicator — a concrete measurement
/// attached to a service or endpoint.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method, AllowMultiple = true)]
public sealed class SliAttribute : Attribute
{
    public string Name { get; }
    public SliKind Kind { get; }
    public SliMeasurement Measurement { get; }
    public string Description { get; init; } = "";
    public string[] Labels { get; init; } = [];

    public SliAttribute(string name, SliKind kind, SliMeasurement measurement)
    {
        Name = name;
        Kind = kind;
        Measurement = measurement;
    }
}

/// Declares a service-level objective — a target for an SLI
/// over a rolling window, with error budget and burn-rate alerting.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class SloAttribute : Attribute
{
    public string Name { get; }
    public string SliName { get; }
    public double Target { get; }
    public int WindowDays { get; init; } = 30;
    public double ErrorBudget { get; init; }
    public double BurnRateAlertThreshold { get; init; } = 14.4;
    public string EscalationChannel { get; init; } = "";

    public SloAttribute(string name, string sliName, double target)
    {
        Name = name;
        SliName = sliName;
        Target = target;
        ErrorBudget = 1.0 - target;
    }
}

/// Per-endpoint performance budget. Every public endpoint must have one.
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public sealed class PerformanceBudgetAttribute : Attribute
{
    public string Endpoint { get; }
    public int P50Ms { get; init; }
    public int P95Ms { get; init; }
    public int P99Ms { get; init; }
    public int MaxPayloadBytes { get; init; }
    public string Owner { get; init; } = "";

    public PerformanceBudgetAttribute(string endpoint)
    {
        Endpoint = endpoint;
    }
}

/// Cache strategy for a data source or endpoint.
public enum CacheStrategy
{
    ReadThrough,    // read from cache; on miss, read from source and populate
    WriteThrough,   // write to source and cache simultaneously
    WriteBehind,    // write to cache immediately, source asynchronously
    Aside           // application manages cache explicitly
}

/// Declares a cache policy with an explicit invalidation event.
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class, AllowMultiple = true)]
public sealed class CachePolicyAttribute : Attribute
{
    public string Key { get; }
    public int TtlSeconds { get; init; } = 300;
    public CacheStrategy Strategy { get; init; } = CacheStrategy.ReadThrough;
    public string InvalidationEvent { get; init; } = "";
    public bool SlidingExpiration { get; init; } = false;
    public string Region { get; init; } = "default";

    public CachePolicyAttribute(string key)
    {
        Key = key;
    }
}

/// Links a method to a BenchmarkDotNet target with regression thresholds.
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public sealed class BenchmarkTargetAttribute : Attribute
{
    public string Method { get; }
    public int MaxDurationMs { get; init; }
    public long MaxAllocationsBytes { get; init; }
    public double RegressionThresholdPercent { get; init; } = 10.0;
    public int WarmupCount { get; init; } = 3;
    public int IterationCount { get; init; } = 10;

    public BenchmarkTargetAttribute(string method)
    {
        Method = method;
    }
}

Usage: Order API Performance Contract

An e-commerce order service with latency SLIs, SLOs with error budgets, per-endpoint budgets, and cache policies that invalidate on domain events.

// =================================================================
// OrderService Performance Contract
// =================================================================

[OpsTarget("order-service")]

// SLIs: what we measure
[Sli("order-latency", SliKind.Latency, SliMeasurement.Milliseconds,
    Description = "End-to-end latency for order operations",
    Labels = ["endpoint", "method", "status_code"])]
[Sli("order-availability", SliKind.Availability, SliMeasurement.Percentage,
    Description = "Percentage of non-5xx responses")]
[Sli("order-throughput", SliKind.Throughput, SliMeasurement.RequestsPerSecond,
    Description = "Orders processed per second")]

// SLOs: what we promise
[Slo("order-latency-slo", "order-latency", 0.999,
    WindowDays = 30,
    BurnRateAlertThreshold = 14.4,
    EscalationChannel = "#order-oncall")]
[Slo("order-availability-slo", "order-availability", 0.999,
    WindowDays = 30,
    BurnRateAlertThreshold = 10.0,
    EscalationChannel = "#order-oncall")]

public partial class OrderPerformanceContract
{
    // Per-endpoint budgets
    [PerformanceBudget("POST /api/orders",
        P50Ms = 50, P95Ms = 150, P99Ms = 300, MaxPayloadBytes = 8192)]
    public partial void CreateOrder();

    [PerformanceBudget("GET /api/orders/{id}",
        P50Ms = 20, P95Ms = 50, P99Ms = 100, MaxPayloadBytes = 4096)]
    [CachePolicy("order:{id}",
        TtlSeconds = 60,
        Strategy = CacheStrategy.ReadThrough,
        InvalidationEvent = nameof(OrderUpdatedEvent))]
    public partial void GetOrder();

    [PerformanceBudget("GET /api/orders?status={status}",
        P50Ms = 80, P95Ms = 200, P99Ms = 400, MaxPayloadBytes = 32768)]
    [CachePolicy("orders:list:{status}",
        TtlSeconds = 30,
        Strategy = CacheStrategy.Aside,
        InvalidationEvent = nameof(OrderStatusChangedEvent),
        SlidingExpiration = true)]
    public partial void ListOrders();

    // Benchmark targets for hot paths
    [BenchmarkTarget(nameof(OrderPriceCalculator.CalculateTotal),
        MaxDurationMs = 5, MaxAllocationsBytes = 1024,
        RegressionThresholdPercent = 15.0)]
    public partial void BenchmarkPriceCalculation();

    [BenchmarkTarget(nameof(OrderValidator.Validate),
        MaxDurationMs = 2, MaxAllocationsBytes = 512)]
    public partial void BenchmarkOrderValidation();
}

One class. Five SLI/SLO definitions. Three endpoint budgets. Two cache policies with domain-event invalidation. Two benchmark targets with allocation limits. Zero ambiguity about what "fast enough" means.
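
To make the SLO numbers above concrete, the following sketch works through the error-budget arithmetic for a 0.999 target over a 30-day window with a 14.4x burn-rate threshold. The helper names (ErrorBudget, BudgetAsTime, BudgetConsumed, TimeToExhaustion) are hypothetical and not part of Ops.Performance.Lib, and the conversion to wall-clock time assumes uniform traffic.

```csharp
using System;

// Allowed "bad" fraction implied by an SLO target, e.g. 0.999 -> 0.001.
double ErrorBudget(double target) => 1.0 - target;

// Wall-clock time the budget represents over the window (assumes uniform traffic).
TimeSpan BudgetAsTime(double target, int windowDays) =>
    TimeSpan.FromDays(windowDays * ErrorBudget(target));

// Fraction of the window's total budget consumed when errors arrive at
// `burnRate` times the sustainable rate for the duration `elapsed`.
double BudgetConsumed(double burnRate, TimeSpan elapsed, int windowDays) =>
    burnRate * elapsed.TotalDays / windowDays;

// Time until the budget is gone at a constant burn rate.
TimeSpan TimeToExhaustion(double burnRate, int windowDays) =>
    TimeSpan.FromDays(windowDays / burnRate);

Console.WriteLine(BudgetAsTime(0.999, 30).TotalMinutes);            // about 43.2 minutes
Console.WriteLine(BudgetConsumed(14.4, TimeSpan.FromHours(1), 30)); // about 0.02 (2% of the budget)
Console.WriteLine(TimeToExhaustion(14.4, 30).TotalHours);           // about 50 hours
```

Read this way, a 14.4x threshold pages when roughly 2% of the monthly budget disappears within a single hour.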

Generated Artifacts

Tier 1: InProcess

BenchmarkConfig.g.cs

The generator produces a BenchmarkDotNet harness that compiles against the actual method signatures and fails if a regression exceeds the declared threshold.

// <auto-generated by Ops.Performance.Generator />
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Validators;

[MemoryDiagnoser]
[SimpleJob(warmupCount: 3, iterationCount: 10)]
public class OrderPerformanceBenchmarks
{
    private OrderPriceCalculator _calculator = null!;
    private OrderValidator _validator = null!;
    private Order _testOrder = null!;

    [GlobalSetup]
    public void Setup()
    {
        _calculator = new OrderPriceCalculator();
        _validator = new OrderValidator();
        _testOrder = OrderTestData.CreateTypicalOrder();
    }

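    // [MaxDuration] and [MaxAllocations] are not BenchmarkDotNet attributes;
    // they are presumably supplied alongside the Ops.Performance DSL.
    [Benchmark]
    [MaxDuration(milliseconds: 5)]
    [MaxAllocations(bytes: 1024)]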
    public decimal CalculateTotal() => _calculator.CalculateTotal(_testOrder);

    [Benchmark]
    [MaxDuration(milliseconds: 2)]
    [MaxAllocations(bytes: 512)]
    public ValidationResult Validate() => _validator.Validate(_testOrder);
}

/// Regression gate: fails CI if mean exceeds threshold.
public class OrderPerformanceRegressionValidator : IValidator
{
    private static readonly Dictionary<string, (int MaxMs, long MaxBytes, double Threshold)> _targets = new()
    {
        ["CalculateTotal"] = (5, 1024, 0.15),
        ["Validate"] = (2, 512, 0.10),
    };

    public bool TreatsWarningsAsErrors => true;

    public IEnumerable<ValidationError> Validate(ValidationParameters parameters)
    {
        // Compares current run against baseline stored in benchmarks/baseline.json
        // Returns ValidationError for each method exceeding RegressionThresholdPercent
        // (comparison logic elided in this excerpt; yield break keeps the method compilable)
        yield break;
    }
}

CacheRegistration.g.cs

The generator wires up IDistributedCache or IMemoryCache with the declared policies and subscribes to domain events for invalidation.

// <auto-generated by Ops.Performance.Generator />
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.DependencyInjection;

public static class OrderCacheRegistration
{
    public static IServiceCollection AddOrderCachePolicies(this IServiceCollection services)
    {
        services.AddSingleton<ICachePolicy>(new CachePolicy
        {
            Key = "order:{id}",
            TtlSeconds = 60,
            Strategy = CacheStrategy.ReadThrough,
            SlidingExpiration = false,
            Region = "default",
        });

        services.AddSingleton<ICachePolicy>(new CachePolicy
        {
            Key = "orders:list:{status}",
            TtlSeconds = 30,
            Strategy = CacheStrategy.Aside,
            SlidingExpiration = true,
            Region = "default",
        });

        // Subscribe to invalidation events
        services.AddTransient<IEventHandler<OrderUpdatedEvent>, OrderCacheInvalidator>();
        services.AddTransient<IEventHandler<OrderStatusChangedEvent>, OrderListCacheInvalidator>();

        return services;
    }
}

/// Invalidates "order:{id}" when OrderUpdatedEvent is raised.
public sealed class OrderCacheInvalidator : IEventHandler<OrderUpdatedEvent>
{
    private readonly IDistributedCache _cache;

    public OrderCacheInvalidator(IDistributedCache cache) => _cache = cache;

    public async Task HandleAsync(OrderUpdatedEvent evt, CancellationToken ct)
    {
        await _cache.RemoveAsync($"order:{evt.OrderId}", ct);
    }
}

/// Invalidates "orders:list:{status}" when OrderStatusChangedEvent is raised.
public sealed class OrderListCacheInvalidator : IEventHandler<OrderStatusChangedEvent>
{
    private readonly IDistributedCache _cache;

    public OrderListCacheInvalidator(IDistributedCache cache) => _cache = cache;

    public async Task HandleAsync(OrderStatusChangedEvent evt, CancellationToken ct)
    {
        // Invalidate both old and new status lists
        await _cache.RemoveAsync($"orders:list:{evt.OldStatus}", ct);
        await _cache.RemoveAsync($"orders:list:{evt.NewStatus}", ct);
    }
}

InProcess SLO Tracker

A lightweight in-memory SLO tracker for development and test environments.

// <auto-generated by Ops.Performance.Generator />
using System;
using System.Collections.Concurrent;

public sealed class OrderSloTracker : ISloTracker
{
    private readonly ConcurrentDictionary<string, SliWindow> _windows = new();

    public void RecordLatency(string endpoint, double ms)
    {
        var window = _windows.GetOrAdd("order-latency", _ => new SliWindow(TimeSpan.FromDays(30)));
        window.Record(ms);
    }

    // Evaluate(...) and SliWindow are not shown in this excerpt.
    public SloStatus GetStatus(string sloName) => sloName switch
    {
        "order-latency-slo" => Evaluate("order-latency", target: 0.999, burnRateThreshold: 14.4),
        "order-availability-slo" => Evaluate("order-availability", target: 0.999, burnRateThreshold: 10.0),
        _ => SloStatus.Unknown,
    };

    public double GetErrorBudgetRemaining(string sloName)
    {
        var status = GetStatus(sloName);
        return status.ErrorBudgetRemainingPercent;
    }
}

Tier 2: Container

prometheus-slo-rules.yaml

Multi-window, multi-burn-rate alert rules following the pattern described in Google's Site Reliability Workbook.

# Auto-generated by Ops.Performance.Generator
# Source: OrderPerformanceContract

groups:
  - name: order-service-slo-rules
    rules:
      # ── SLI Recording Rules ──────────────────────────────────
      - record: order_service:latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{service="order-service"}[5m])))

      - record: order_service:availability:ratio_5m
        expr: |
          1 - (
            sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-service"}[5m]))
          )

      - record: order_service:throughput:rps_5m
        expr: |
          sum(rate(http_requests_total{service="order-service"}[5m]))

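      # ── Error Budget ─────────────────────────────────────────
      # Assumes order_service:latency:slo_compliance_30d and the
      # order_service:*:error_ratio_<window> series used below are
      # recorded by rules not shown in this excerpt.
      - record: order_service:latency_slo:error_budget_remaining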
        expr: |
          1 - (
            (1 - order_service:latency:slo_compliance_30d)
            /
            (1 - 0.999)
          )

      # ── Multi-window Burn Rate Alerts ────────────────────────
      # Fast burn: 14.4x in 1h (page)
      - alert: OrderLatencyBurnRateCritical
        expr: |
          (
            order_service:latency:error_ratio_1h > (14.4 * 0.001)
            and
            order_service:latency:error_ratio_5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          slo: order-latency-slo
          escalation: "#order-oncall"
        annotations:
          summary: "Order latency SLO burn rate critical (14.4x)"
          description: |
            The 1h error ratio for order-latency-slo is {{ $value | humanizePercentage }},
            above the 14.4x burn-rate threshold of 1.44%. At a sustained 14.4x burn,
            the 30-day error budget is exhausted in about 2 days.

      # Slow burn: 3x in 3d (ticket)
      - alert: OrderLatencyBurnRateSlow
        expr: |
          (
            order_service:latency:error_ratio_3d > (3.0 * 0.001)
            and
            order_service:latency:error_ratio_6h > (3.0 * 0.001)
          )
        for: 1h
        labels:
          severity: warning
          slo: order-latency-slo
        annotations:
          summary: "Order latency SLO burn rate elevated (3x)"

      # ── Availability SLO ─────────────────────────────────────
      - alert: OrderAvailabilityBurnRateCritical
        expr: |
          (
            order_service:availability:error_ratio_1h > (10.0 * 0.001)
            and
            order_service:availability:error_ratio_5m > (10.0 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          slo: order-availability-slo
          escalation: "#order-oncall"
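
The alert expressions above all share one shape: a long window and a short window must both exceed the burn rate multiplied by the error budget. A minimal sketch of that condition follows, mainly useful for unit-testing thresholds before they reach Prometheus. The function name BurnRateExceeded is hypothetical, and the error ratios are assumed to be supplied by the caller.

```csharp
using System;

// True when both windows show errors arriving faster than
// `burnRate` times the sustainable rate implied by the SLO target.
bool BurnRateExceeded(double longWindowErrorRatio, double shortWindowErrorRatio,
                      double burnRate, double sloTarget)
{
    double threshold = burnRate * (1.0 - sloTarget); // e.g. 14.4 * 0.001 = 0.0144
    return longWindowErrorRatio > threshold && shortWindowErrorRatio > threshold;
}

// Both windows above 1.44%: the alert condition holds.
Console.WriteLine(BurnRateExceeded(0.020, 0.030, 14.4, 0.999)); // True
// The short window has recovered: the condition no longer holds.
Console.WriteLine(BurnRateExceeded(0.020, 0.001, 14.4, 0.999)); // False
```

Requiring the short window as well means the alert clears soon after the problem stops, instead of staying active until the long window drains.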

grafana-slo-dashboard.json

A Grafana dashboard with panels for each SLI, error budget burn-down, and per-endpoint budget compliance.

{
  "dashboard": {
    "title": "Order Service SLO Dashboard",
    "uid": "order-service-slo",
    "panels": [
      {
        "title": "Latency SLO (99.9% target, 30d window)",
        "type": "gauge",
        "targets": [{
          "expr": "order_service:latency:slo_compliance_30d",
          "legendFormat": "Compliance"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 0.995, "color": "yellow" },
                { "value": 0.999, "color": "green" }
              ]
            },
            "min": 0.99, "max": 1.0
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "timeseries",
        "targets": [{
          "expr": "order_service:latency_slo:error_budget_remaining",
          "legendFormat": "Budget %"
        }]
      },
      {
        "title": "Per-Endpoint P95 vs Budget",
        "type": "table",
        "targets": [{
          "expr": "1000 * histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket{service=\"order-service\"}[5m])))",
          "legendFormat": "{{ endpoint }}"
        }],
        "transformations": [{
          "id": "addFieldFromCalculation",
          "options": {
            "mode": "binary",
            "fieldName": "budget_ms",
            "values": {
              "POST /api/orders": 150,
              "GET /api/orders/{id}": 50,
              "GET /api/orders?status={status}": 200
            }
          }
        }]
      }
    ]
  }
}

Tier 3: Cloud

The Cloud tier produces production SLO monitoring integrated with the cloud provider's native monitoring.

# Auto-generated by Ops.Performance.Generator
# terraform/performance/order-slo/main.tf

resource "azurerm_monitor_metric_alert" "order_latency_slo" {
  name                = "order-latency-slo-burn-rate"
  resource_group_name = var.resource_group_name
  scopes              = [var.app_service_id]
  description         = "Order service average response time over 1h exceeds the 300 ms P99 budget"
  severity            = 0
  frequency           = "PT1M"
  window_size         = "PT1H"

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "HttpResponseTime"
    aggregation      = "Average"  # platform metrics offer no percentile aggregation; Average is a proxy
    operator         = "GreaterThan"
    threshold        = 0.300  # P99 budget = 300ms, expressed in seconds
  }
  }

  action {
    action_group_id = var.oncall_action_group_id
  }
}

resource "azurerm_monitor_scheduled_query_rules_alert_v2" "order_error_budget" {
  name                = "order-error-budget-alert"
  resource_group_name = var.resource_group_name
  location            = var.location
  scopes              = [var.log_analytics_workspace_id]
  description         = "Order service error budget consumed past 50%"

  criteria {
    query = <<-QUERY
      AppRequests
      | where AppRoleName == "order-service"
      | where TimeGenerated > ago(30d)
      | summarize
          total = count(),
          failures = countif(toint(ResultCode) >= 500 or DurationMs > 300)
      | extend error_ratio = todouble(failures) / todouble(total)
      | extend budget_consumed = error_ratio / 0.001
      | where budget_consumed > 0.5
    QUERY

    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0
  }

  action {
    action_groups = [var.oncall_action_group_id]
  }
}

resource "grafana_dashboard" "order_slo" {
  config_json = file("${path.module}/grafana-slo-dashboard.json")
  folder      = var.grafana_slo_folder_id
}

Analyzer Diagnostics

ID	Severity	Rule	Example
PRF001	Error	SLO references nonexistent SLI	`[Slo("x", "missing-sli", 0.99)]` -- no `[Sli]` with name `"missing-sli"` in scope
PRF002	Warning	Public endpoint without performance budget	Controller action `GetOrderHistory` has `[HttpGet]` but no `[PerformanceBudget]`
PRF003	Error	Cache policy without invalidation event	`[CachePolicy("key", TtlSeconds = 300)]` -- `InvalidationEvent` is empty string
PRF004	Warning	Performance budget P99 exceeds SLO target	P99 of 500ms on an endpoint when the service SLO implies max 300ms
PRF005	Info	Benchmark target method signature changed	`[BenchmarkTarget("OldMethodName")]` -- method was renamed or parameters changed

PRF003 is the one that catches the most bugs. Every cache must declare how it gets invalidated, with no exceptions. If the answer is "it just expires after the TTL," the developer must explicitly set InvalidationEvent = "TTL_ONLY" to acknowledge the decision.
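
The logic behind PRF001 and PRF003 is simple enough to sketch over plain data. The version below is illustrative only: a real analyzer would presumably walk attribute syntax through Roslyn, and the function names and message wording here are assumptions, not the generator's actual output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// PRF001: every SLO must name an SLI that is declared in scope.
IEnumerable<string> FindOrphanSlos(
    IEnumerable<string> sliNames,
    IEnumerable<(string Name, string SliName)> slos)
{
    var known = new HashSet<string>(sliNames, StringComparer.Ordinal);
    return slos
        .Where(slo => !known.Contains(slo.SliName))
        .Select(slo => $"PRF001: SLO '{slo.Name}' references nonexistent SLI '{slo.SliName}'");
}

// PRF003: every cache policy must name an invalidation event (or "TTL_ONLY").
IEnumerable<string> FindCachesWithoutInvalidation(
    IEnumerable<(string Key, string InvalidationEvent)> policies) =>
    policies
        .Where(p => string.IsNullOrWhiteSpace(p.InvalidationEvent))
        .Select(p => $"PRF003: cache policy '{p.Key}' has no InvalidationEvent");

var slis = new[] { "order-latency", "order-availability" };
var slos = new[] { ("order-latency-slo", "order-latency"), ("x", "missing-sli") };
var caches = new[] { ("order:{id}", "OrderUpdatedEvent"), ("key", "") };

// Prints one PRF001 line for "x" and one PRF003 line for "key".
foreach (var diagnostic in FindOrphanSlos(slis, slos).Concat(FindCachesWithoutInvalidation(caches)))
    Console.WriteLine(diagnostic);
```

Because "TTL_ONLY" is a non-empty string, a policy that sets it passes the PRF003 check, which is the explicit acknowledgement the rule asks for.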

Cross-DSL Integration

Performance to Observability

Every [Sli] maps to an [OpsMetric] in the Observability DSL. The generator verifies that a Prometheus metric exists for each SLI:

// Ops.Performance declares the SLI
[Sli("order-latency", SliKind.Latency, SliMeasurement.Milliseconds)]

// Ops.Observability must have a matching metric — PRF006 fires if missing
[OpsMetric("http_request_duration_seconds", MetricKind.Histogram,
    Labels = ["service", "endpoint", "method", "status_code"])]

Performance to Requirements

Performance budgets link to features via OpsRequirementLink:

[PerformanceBudget("POST /api/orders", P50Ms = 50, P95Ms = 150, P99Ms = 300)]
[OpsRequirementLink("FEATURE-789", "Order creation must complete within 300ms at P99")]
public partial void CreateOrder();

The compliance report shows which features have performance budgets and which do not. If a feature acceptance criterion mentions latency, the analyzer warns if no [PerformanceBudget] matches.

Performance to LoadTesting

Every [PerformanceBudget] defines the pass/fail criteria for the load test of that endpoint. The LoadTesting DSL reads the budget and generates k6 thresholds that match:

// PerformanceBudget says P95 <= 150ms for POST /api/orders
// LoadTesting generator produces:
//   thresholds: { 'http_req_duration{endpoint:POST /api/orders}': ['p(95)<150'] }

No manual synchronization. Change the budget attribute, the load test threshold updates on next build.
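
As a rough illustration of that mapping, the sketch below renders a k6 threshold entry from the three percentile fields of a budget. The function name ToK6Threshold is hypothetical, the endpoint tag name simply mirrors the comment above, and a percentile left at 0 is treated as "not declared"; the real LoadTesting generator may behave differently.

```csharp
using System;
using System.Collections.Generic;

// Render one k6 threshold entry for an endpoint budget.
// Percentiles set to 0 are skipped (assumed to mean "not declared").
string ToK6Threshold(string endpoint, int p50Ms, int p95Ms, int p99Ms)
{
    var checks = new List<string>();
    if (p50Ms > 0) checks.Add("'p(50)<" + p50Ms + "'");
    if (p95Ms > 0) checks.Add("'p(95)<" + p95Ms + "'");
    if (p99Ms > 0) checks.Add("'p(99)<" + p99Ms + "'");

    return "'http_req_duration{endpoint:" + endpoint + "}': ["
         + string.Join(", ", checks) + "]";
}

Console.WriteLine(ToK6Threshold("POST /api/orders", 50, 150, 300));
// 'http_req_duration{endpoint:POST /api/orders}': ['p(50)<50', 'p(95)<150', 'p(99)<300']
```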

Performance to Resilience

When a [PerformanceBudget] P99 is declared, the generator verifies that the corresponding endpoint has a [CircuitBreaker] with a timeout that does not exceed the P99. If a circuit breaker timeout is 5 seconds but the P99 budget is 300ms, the analyzer flags the inconsistency: the circuit breaker will never trip before the budget is blown.
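
The consistency check described above reduces to a single comparison. A minimal sketch, assuming the timeout is available as a TimeSpan; the function name CheckTimeoutAgainstBudget and the message text are hypothetical:

```csharp
using System;

// Returns a diagnostic message when the breaker timeout is looser than the
// P99 budget, or null when the pair is consistent.
string? CheckTimeoutAgainstBudget(string endpoint, int p99Ms, TimeSpan breakerTimeout) =>
    breakerTimeout.TotalMilliseconds > p99Ms
        ? $"{endpoint}: circuit breaker timeout {breakerTimeout.TotalMilliseconds} ms exceeds P99 budget {p99Ms} ms"
        : null;

// A 5-second timeout against a 300 ms budget is flagged.
Console.WriteLine(CheckTimeoutAgainstBudget("POST /api/orders", 300, TimeSpan.FromSeconds(5)));
// A 250 ms timeout is within budget, so no diagnostic is produced.
Console.WriteLine(CheckTimeoutAgainstBudget("POST /api/orders", 300, TimeSpan.FromMilliseconds(250)) ?? "ok");
```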
