
Ops.Capacity -- Autoscaling, Throttling, and Projections

"It scaled to 100 replicas and the cloud bill exploded." The next month: "It didn't scale and users got 503s."


The Problem

Two incidents, same quarter, same service.

Incident 1: Uncontrolled scale-up. A marketing campaign drove 10x normal traffic to the product catalog. The Horizontal Pod Autoscaler (HPA), configured with maxReplicas: 100 because a developer had copied the value from a Kubernetes tutorial, scaled from 3 replicas all the way to 100. The campaign lasted 6 hours. The 100 replicas cost $2,400; the service needed at most 15 to handle the load. The scale-down stabilization window was left at the default 5 minutes, but the metric was noisy, so the replica count oscillated between 80 and 100 for the full 6 hours.

Incident 2: No scale-up. The next month, the same service received a traffic spike from a partner integration. The HPA did not scale because the team had reduced maxReplicas to 10 after the cost incident. The 10 replicas could not handle the load. Users received 503 errors for 45 minutes. The SLO was breached. The postmortem said "we need better autoscaling."

Rate limiting was absent. Neither incident would have been as severe if the API had rate limiting. But rate limiting was never implemented because nobody could agree on the limits. The product team wanted no limits ("it might affect legitimate users"). The platform team wanted strict limits ("it might affect the infrastructure"). The compromise was no limits, which affected everyone.

Capacity planning was reactive. The team had no projection of future load. They could not answer "will the current infrastructure handle Black Friday traffic?" because they had no model of growth rate, no baseline of current load, and no review cadence.

Scale-to-zero was feared. The staging environment ran 24/7 despite being used only during business hours. Nobody configured scale-to-zero because the cold start time was "too long." Nobody measured the cold start time. It was 3 seconds.

What is missing:

  • Autoscaling rules in code. Min replicas, max replicas, scale-up metrics, scale-down metrics, and cooldown periods should be declared and reviewed, not copied from tutorials.
  • Rate limiting as typed policy. Requests per second, burst size, and retry-after header should be attributes on the endpoint.
  • Capacity projections as a planning tool. Current load, growth rate, and review date should be declared so the team can see when they will outgrow the current infrastructure.
  • Scale-to-zero with explicit warmup. Idle timeout and warmup duration should be typed values, not things people are afraid of.

Attribute Definitions

// =================================================================
// Ops.Capacity.Lib -- Capacity DSL Attributes
// =================================================================

/// Declares autoscaling behavior for a workload.
[AttributeUsage(AttributeTargets.Class)]
public sealed class AutoscaleRuleAttribute : Attribute
{
    public string Target { get; }
    public int MinReplicas { get; }
    public int MaxReplicas { get; }
    public string ScaleUpMetric { get; }
    public string ScaleDownMetric { get; }
    public string CooldownPeriod { get; }

    public AutoscaleRuleAttribute(
        string target,
        int minReplicas,
        int maxReplicas,
        string scaleUpMetric,
        string scaleDownMetric,
        string cooldownPeriod)
    {
        Target = target;
        MinReplicas = minReplicas;
        MaxReplicas = maxReplicas;
        ScaleUpMetric = scaleUpMetric;
        ScaleDownMetric = scaleDownMetric;
        CooldownPeriod = cooldownPeriod;
    }

    /// <summary>
    /// Scale-up threshold (e.g., 80 for 80% CPU utilization).
    /// </summary>
    public int ScaleUpThreshold { get; init; } = 80;

    /// <summary>
    /// Scale-down threshold (e.g., 30 for 30% CPU utilization).
    /// </summary>
    public int ScaleDownThreshold { get; init; } = 30;

    /// <summary>
    /// Scale-down stabilization window to prevent flapping.
    /// </summary>
    public string ScaleDownStabilization { get; init; } = "5m";

    /// <summary>
    /// Maximum number of replicas to add per scale-up event.
    /// Prevents the "0 to 100" problem.
    /// </summary>
    public int MaxScaleUpStep { get; init; } = 4;

    /// <summary>
    /// VPA mode: Off, Initial (set at pod creation), Auto (live resize).
    /// </summary>
    public string VpaMode { get; init; } = "Off";
}

/// Declares rate limiting / throttle policy for an endpoint or service.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method,
    AllowMultiple = true)]
public sealed class ThrottlePolicyAttribute : Attribute
{
    public string Endpoint { get; }
    public int RequestsPerSecond { get; }
    public int BurstSize { get; }
    public bool RetryAfterHeader { get; }

    public ThrottlePolicyAttribute(
        string endpoint,
        int requestsPerSecond,
        int burstSize,
        bool retryAfterHeader = true)
    {
        Endpoint = endpoint;
        RequestsPerSecond = requestsPerSecond;
        BurstSize = burstSize;
        RetryAfterHeader = retryAfterHeader;
    }

    /// <summary>
    /// Partition key: PerIp, PerUser, PerApiKey, Global.
    /// </summary>
    public string PartitionBy { get; init; } = "PerIp";

    /// <summary>
    /// Queue excess requests instead of rejecting immediately.
    /// </summary>
    public int QueueSize { get; init; } = 0;

    /// <summary>HTTP status code for rejected requests.</summary>
    public int RejectStatusCode { get; init; } = 429;

    /// <summary>
    /// Exempt specific roles or clients from throttling.
    /// </summary>
    public string[]? ExemptRoles { get; init; }
}

/// Declares capacity projection for planning.
[AttributeUsage(AttributeTargets.Class)]
public sealed class CapacityProjectionAttribute : Attribute
{
    public string Target { get; }
    public string CurrentLoad { get; }
    public string GrowthRate { get; }
    public string ReviewDate { get; }

    public CapacityProjectionAttribute(
        string target,
        string currentLoad,
        string growthRate,
        string reviewDate)
    {
        Target = target;
        CurrentLoad = currentLoad;
        GrowthRate = growthRate;
        ReviewDate = reviewDate;
    }

    /// <summary>
    /// Maximum capacity of current infrastructure.
    /// </summary>
    public string MaxCapacity { get; init; } = "";

    /// <summary>
    /// Projected date when current capacity is exhausted.
    /// Calculated by generator from CurrentLoad, GrowthRate, MaxCapacity.
    /// </summary>
    public string? ExhaustionDate { get; init; }
}

/// Declares scale-to-zero behavior for serverless-style scaling.
[AttributeUsage(AttributeTargets.Class)]
public sealed class ScaleToZeroAttribute : Attribute
{
    public string Target { get; }
    public string IdleTimeout { get; }
    public string WarmupDuration { get; }

    public ScaleToZeroAttribute(
        string target, string idleTimeout, string warmupDuration)
    {
        Target = target;
        IdleTimeout = idleTimeout;
        WarmupDuration = warmupDuration;
    }

    /// <summary>
    /// Environments where scale-to-zero is enabled.
    /// Production typically excluded.
    /// </summary>
    public string[] EnabledEnvironments { get; init; }
        = new[] { "Development", "Staging" };

    /// <summary>
    /// Minimum instances to keep warm (0 = true scale-to-zero).
    /// </summary>
    public int MinWarmInstances { get; init; } = 0;
}

Usage

[DeploymentApp("order-service")]

// Autoscaling: 3-20 replicas, CPU-based, max 4 per step
[AutoscaleRule(
    "order-service",
    minReplicas: 3,
    maxReplicas: 20,
    scaleUpMetric: "cpu_utilization",
    scaleDownMetric: "cpu_utilization",
    cooldownPeriod: "3m",
    ScaleUpThreshold = 75,
    ScaleDownThreshold = 30,
    ScaleDownStabilization = "10m",
    MaxScaleUpStep = 4,
    VpaMode = "Initial")]

// Throttle: public API — 100 req/s per IP, burst of 20
[ThrottlePolicy(
    "/api/orders",
    requestsPerSecond: 100,
    burstSize: 20,
    retryAfterHeader: true,
    PartitionBy = "PerIp",
    RejectStatusCode = 429)]

// Throttle: internal API — 1000 req/s per API key, burst of 100
[ThrottlePolicy(
    "/internal/orders",
    requestsPerSecond: 1000,
    burstSize: 100,
    retryAfterHeader: false,
    PartitionBy = "PerApiKey",
    ExemptRoles = new[] { "service-account" })]

// Throttle: expensive endpoint — 10 req/s per user, queue 5
[ThrottlePolicy(
    "/api/orders/export",
    requestsPerSecond: 10,
    burstSize: 5,
    retryAfterHeader: true,
    PartitionBy = "PerUser",
    QueueSize = 5)]

// Capacity projection: 15k RPM now, growing 8%/month
[CapacityProjection(
    "order-service",
    currentLoad: "15000 rpm",
    growthRate: "8% monthly",
    reviewDate: "2026-07-01",
    MaxCapacity = "50000 rpm")]

// Scale-to-zero for non-production
[ScaleToZero(
    "order-service",
    idleTimeout: "15m",
    warmupDuration: "3s",
    EnabledEnvironments = new[] { "Development", "Staging" },
    MinWarmInstances = 0)]
public partial class OrderServiceCapacity { }
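The ExhaustionDate the generator derives from a [CapacityProjection] is plain compound-growth arithmetic. A minimal Python sketch of that calculation (month granularity and the ceiling are assumptions of this sketch, not guaranteed generator behavior):

```python
import math

def months_until_exhaustion(current_load: float,
                            monthly_growth: float,
                            max_capacity: float) -> int:
    """Smallest whole n with current_load * (1 + g)^n >= max_capacity."""
    if current_load >= max_capacity:
        return 0
    return math.ceil(math.log(max_capacity / current_load)
                     / math.log(1 + monthly_growth))

# The order-service declaration: 15,000 rpm now, 8%/month, 50,000 rpm max.
# Exact crossover is ~15.6 months; ceiling gives the first exhausted month.
print(months_until_exhaustion(15_000, 0.08, 50_000))  # → 16
```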

InProcess Tier

The InProcess tier generates ASP.NET rate limiting middleware and unit tests that verify throttle behavior.

The generator reads every [ThrottlePolicy] and emits a registration method that configures the built-in ASP.NET Core rate-limiting middleware, backed by System.Threading.RateLimiting. No third-party package. No manual configuration. The attributes are the configuration.

For autoscaling, the InProcess tier does not generate Kubernetes manifests. Instead, it generates a test that verifies the autoscaling configuration is internally consistent: MinReplicas <= MaxReplicas, cooldown period is parseable, metrics reference observability endpoints that exist.
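Those consistency checks are small enough to sketch. A Python model of the rules (the real generator emits them as C# unit tests; parse_duration and its s/m/h suffixes are assumptions of this sketch):

```python
# Illustrative model of the InProcess-tier consistency checks:
# MinReplicas <= MaxReplicas, cooldown parseable, metrics declared.

def parse_duration(value: str) -> int:
    """Parse '3m' / '90s' / '1h' into seconds. Raises on junk."""
    units = {"s": 1, "m": 60, "h": 3600}
    suffix = value[-1]
    if suffix not in units or not value[:-1].isdigit():
        raise ValueError(f"unparseable duration: {value!r}")
    return int(value[:-1]) * units[suffix]

def check_autoscale_rule(rule: dict, known_metrics: set) -> list:
    """Return a list of violations; empty list means consistent."""
    errors = []
    if rule["min_replicas"] > rule["max_replicas"]:
        errors.append("MinReplicas must be <= MaxReplicas")
    try:
        parse_duration(rule["cooldown_period"])
    except ValueError as e:
        errors.append(str(e))
    for key in ("scale_up_metric", "scale_down_metric"):
        if rule[key] not in known_metrics:
            errors.append(f"unknown metric: {rule[key]}")
    return errors

rule = {"min_replicas": 3, "max_replicas": 20,
        "cooldown_period": "3m",
        "scale_up_metric": "cpu_utilization",
        "scale_down_metric": "cpu_utilization"}
print(check_autoscale_rule(rule, {"cpu_utilization"}))  # → []
```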

Container Tier

The Container tier generates a metrics-server mock that simulates CPU and memory metrics. The HPA controller in a local Kind or k3d cluster reads these metrics and scales accordingly. This lets developers test autoscaling behavior locally without a cloud provider.

The mock allows injecting specific metric values:

# metrics-server-mock.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-mock-config
  namespace: order-service
data:
  scenarios.json: |
    {
      "baseline": { "cpu": 40, "memory": 55 },
      "spike": { "cpu": 90, "memory": 75 },
      "idle": { "cpu": 5, "memory": 20 }
    }

Cloud Tier

The Cloud tier generates production Kubernetes HPA, VPA, and KEDA manifests. It generates Terraform resources for cloud-native autoscaling (AWS Application Auto Scaling, Azure VMSS). It generates the full rate limiting infrastructure.


Three-Tier Projection

[AutoscaleRule], [ThrottlePolicy], and [ScaleToZero] declarations project across all three tiers. Local runs get ASP.NET rate-limiting middleware and consistency tests. Container runs get a Compose overlay with replicas and resource limits/reservations that mirrors the Cloud-tier HPA bounds, so developers can reproduce horizontal scaling on a laptop. Cloud runs get the full HPA + VPA + KEDA + Terraform suite.

Diagram: Capacity attributes project to ASP.NET rate limiting and consistency tests locally, Compose-based replicas with a metrics-server mock in containers, and the full HPA + VPA + KEDA + Terraform suite in the cloud.

VPA at the Container tier: not applicable. Docker Compose has no per-container resource recommender (VPA is a Kubernetes-only construct backed by the metrics-server pipeline). The Container-tier docker-compose.scale.yaml below covers replicas and resource limits/reservations, but recommendation-driven right-sizing only exists at the Cloud tier.

InProcess: RateLimitRegistration.g.cs

// <auto-generated by Ops.Capacity.Generator />
using System.Threading.RateLimiting;
using Microsoft.AspNetCore.RateLimiting;

namespace OrderService.Generated;

public static class RateLimitRegistration
{
    public static IServiceCollection AddGeneratedRateLimiting(
        this IServiceCollection services)
    {
        services.AddRateLimiter(options =>
        {
            // /api/orders — 100 req/s per IP, burst 20
            options.AddPolicy("orders-api", httpContext =>
            {
                var remoteIp = httpContext.Connection.RemoteIpAddress?
                    .ToString() ?? "unknown";

                return RateLimitPartition.GetTokenBucketLimiter(
                    partitionKey: remoteIp,
                    factory: _ => new TokenBucketRateLimiterOptions
                    {
                        TokenLimit = 20,          // burst size
                        // 20 tokens per 200ms = 100 req/s sustained.
                        // (TokensPerPeriod above TokenLimit would be
                        // clipped to the bucket size on each refill.)
                        ReplenishmentPeriod =
                            TimeSpan.FromMilliseconds(200),
                        TokensPerPeriod = 20,
                        QueueLimit = 0,
                        QueueProcessingOrder =
                            QueueProcessingOrder.OldestFirst,
                        AutoReplenishment = true,
                    });
            });

            // /internal/orders — 1000 req/s per API key, burst 100
            options.AddPolicy("orders-internal", httpContext =>
            {
                var apiKey = httpContext.Request.Headers["X-Api-Key"]
                    .FirstOrDefault() ?? "unknown";

                // Exempt service accounts
                if (httpContext.User.IsInRole("service-account"))
                {
                    return RateLimitPartition
                        .GetNoLimiter("exempt");
                }

                return RateLimitPartition.GetTokenBucketLimiter(
                    partitionKey: apiKey,
                    factory: _ => new TokenBucketRateLimiterOptions
                    {
                        TokenLimit = 100,         // burst size
                        // 100 tokens per 100ms = 1000 req/s sustained
                        ReplenishmentPeriod =
                            TimeSpan.FromMilliseconds(100),
                        TokensPerPeriod = 100,
                        QueueLimit = 0,
                        AutoReplenishment = true,
                    });
            });

            // /api/orders/export — 10 req/s per user, burst 5, queue 5
            options.AddPolicy("orders-export", httpContext =>
            {
                var userId = httpContext.User.FindFirst("sub")?.Value
                    ?? "anonymous";

                return RateLimitPartition.GetTokenBucketLimiter(
                    partitionKey: userId,
                    factory: _ => new TokenBucketRateLimiterOptions
                    {
                        TokenLimit = 5,           // burst size
                        // 5 tokens per 500ms = 10 req/s sustained
                        ReplenishmentPeriod =
                            TimeSpan.FromMilliseconds(500),
                        TokensPerPeriod = 5,
                        QueueLimit = 5,
                        QueueProcessingOrder =
                            QueueProcessingOrder.OldestFirst,
                        AutoReplenishment = true,
                    });
            });

            // Global rejection behavior
            options.RejectionStatusCode = 429;
            options.OnRejected = async (context, ct) =>
            {
                context.HttpContext.Response.Headers
                    .RetryAfter = "1";

                await context.HttpContext.Response.WriteAsJsonAsync(
                    new
                    {
                        error = "rate_limit_exceeded",
                        retryAfter = 1
                    }, ct);
            };
        });

        return services;
    }

    /// <summary>
    /// Maps endpoints to rate limit policies.
    /// Call in Program.cs: app.UseGeneratedRateLimiting();
    /// </summary>
    public static IEndpointConventionBuilder
        UseGeneratedRateLimiting(this WebApplication app)
    {
        app.UseRateLimiter();

        // The generator also emits endpoint-specific middleware:
        app.MapGroup("/api/orders")
            .RequireRateLimiting("orders-api");
        app.MapGroup("/internal/orders")
            .RequireRateLimiting("orders-internal");
        app.MapGroup("/api/orders/export")
            .RequireRateLimiting("orders-export");

        return app.MapGroup("/"); // return for chaining
    }
}

Usage in Program.cs is two lines:

builder.Services.AddGeneratedRateLimiting();
// ...
app.UseGeneratedRateLimiting();
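The token-bucket semantics behind those options are worth internalizing: burst size and sustained rate are independent knobs. A toy Python bucket, modeling the algorithm rather than the .NET API, makes the /api/orders policy (100 req/s, burst 20) concrete:

```python
# Toy token bucket: capacity = burst size, refill rate = sustained rate.

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: burst available
        self.refill_per_sec = refill_per_sec

    def advance(self, seconds: float) -> None:
        """Refill, clipped at capacity (why the bucket caps bursts)."""
        self.tokens = min(self.capacity,
                          self.tokens + seconds * self.refill_per_sec)

    def try_acquire(self) -> bool:
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=20, refill_per_sec=100)

# A cold client can burst 20 requests instantly, then gets rejected.
burst = sum(bucket.try_acquire() for _ in range(30))
print(burst)   # → 20

# After the burst, throughput settles at the refill rate: ~100/s.
served = 0
for _ in range(100):            # simulate 1 second in 10ms slices
    bucket.advance(0.010)
    while bucket.try_acquire():
        served += 1
print(served)  # → 100
```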

Container: docker-compose.scale.yaml

For local Docker Compose runs the same [AutoscaleRule] declaration projects to a Compose overlay with deploy.replicas set to the autoscale floor and deploy.resources.limits/reservations matching the Cloud-tier HPA target metrics. Developers run docker compose up --scale order-api=N and observe the same resource ceilings the production HPA enforces.

# <auto-generated by Ops.Capacity.Generator />
# docker-compose.scale.yaml -- Capacity overlay for OrderServiceCapacity
# Source: [AutoscaleRule] + [ThrottlePolicy] on OrderServiceCapacity
# Usage: docker compose -f docker-compose.ops.yaml -f docker-compose.scale.yaml up

services:
  order-api:
    deploy:
      # Replica floor mirrors HPA minReplicas (3); to test scale-out
      # use `docker compose up --scale order-api=10`
      replicas: 3
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      resources:
        limits:
          cpus: "1.0"            # per-replica CPU ceiling, mirrors the Cloud-tier pod limit
          memory: 1024M
        reservations:
          cpus: "0.25"
          memory: 256M
      # Cooldown approximation -- Compose has no native cooldown, so
      # restart_policy.window (120s) mirrors the HPA scale-down policy
      # period of 120 seconds
    labels:
      ops.capacity/min-replicas: "3"
      ops.capacity/max-replicas: "20"        # for documentation; not enforced locally
      ops.capacity/source: "OrderServiceCapacity.cs"

  # Co-deployed metrics mock so HPA-style scaling decisions can be observed locally
  metrics-mock:
    image: ghcr.io/example/metrics-server-mock:latest
    configs:
      - source: scale_scenarios
        target: /etc/scenarios.json

configs:
  scale_scenarios:
    file: ./metrics-server-mock.yaml

The ops.capacity/max-replicas label is informational only -- Docker Compose has no controller to enforce upper bounds. Developers who need real auto-scale-out behavior switch to k3d/Kind and use the Cloud-tier HPA manifest below. The point of the Container tier is fidelity of resource limits and replicas, not auto-scaling.

Cloud: hpa.yaml

# auto-generated by Ops.Capacity.Generator
# Source: OrderServiceCapacity [AutoscaleRule]
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: order-service
  labels:
    ops.dsl/generator: capacity
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75     # ScaleUpThreshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 180  # CooldownPeriod: 3m
      policies:
        - type: Pods
          value: 4                     # MaxScaleUpStep
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 600  # ScaleDownStabilization: 10m
      policies:
        - type: Percent
          value: 25                    # scale down 25% at a time
          periodSeconds: 120
      selectPolicy: Min

This HPA would have prevented Incident 1. MaxReplicas: 20 instead of 100 caps the cost. MaxScaleUpStep: 4 prevents jumping from 3 to 100 in one step. ScaleDownStabilization: 10m prevents flapping but still allows scale-down within a reasonable window.
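MaxScaleUpStep also bounds how fast the fleet can grow: at most 4 pods per 60-second period means the trip from the floor of 3 to the ceiling of 20 takes five periods, not one jump. A quick sketch of that arithmetic (of the math, not of the HPA controller itself):

```python
import math

def periods_to_reach(current: int, target: int, step: int) -> int:
    """HPA scale-up periods needed to grow current -> target,
    adding at most `step` pods per period."""
    return math.ceil((target - current) / step)

# order-service: floor 3, ceiling 20, MaxScaleUpStep 4, period 60s
periods = periods_to_reach(3, 20, 4)
print(periods)                  # → 5
print(periods * 60, "seconds")  # → 300 seconds
```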

Cloud: vpa.yaml

# auto-generated by Ops.Capacity.Generator
# Source: OrderServiceCapacity [AutoscaleRule(VpaMode = "Initial")]
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
  namespace: order-service
  labels:
    ops.dsl/generator: capacity
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Initial"   # set resources at pod creation, no live resize
  resourcePolicy:
    containerPolicies:
      - containerName: order-service
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledResources: ["cpu", "memory"]

Cloud: keda-scaledobject.yaml

When the ScaleUpMetric references an event-driven source (queue depth, message count, etc.), the generator emits a KEDA ScaledObject instead of (or alongside) an HPA:

# auto-generated by Ops.Capacity.Generator
# Source: OrderServiceCapacity [AutoscaleRule]
# Emitted when scaleUpMetric matches a KEDA-supported trigger
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-service-keda
  namespace: order-service
  labels:
    ops.dsl/generator: capacity
spec:
  scaleTargetRef:
    name: order-service
  minReplicaCount: 3
  maxReplicaCount: 20
  cooldownPeriod: 180             # CooldownPeriod: 3m
  pollingInterval: 15
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 180
          policies:
            - type: Pods
              value: 4
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
            - type: Percent
              value: 25
              periodSeconds: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{
            service="order-service"
          }[2m]))
        threshold: "833"          # MaxCapacity: 50000 rpm / 60 ≈ 833 req/s

Cloud: terraform/capacity/main.tf

# auto-generated by Ops.Capacity.Generator
# Source: OrderServiceCapacity

# --- Scale-to-Zero for Non-Production ---

resource "kubernetes_cron_job_v1" "scale_to_zero" {
  metadata {
    name      = "order-service-scale-to-zero"
    namespace = "order-service-staging"
  }

  spec {
    # Scale down after business hours (idle timeout: 15m, but
    # implemented via cron for predictability)
    schedule = "0 19 * * 1-5"  # 7 PM weekdays

    job_template {
      spec {
        template {
          spec {
            container {
              name    = "scaler"
              image   = "bitnami/kubectl:latest"
              command = [
                "kubectl", "scale", "deployment",
                "order-service", "--replicas=0",
                "-n", "order-service-staging"
              ]
            }
            restart_policy = "OnFailure"
          }
        }
      }
    }
  }
}

resource "kubernetes_cron_job_v1" "scale_from_zero" {
  metadata {
    name      = "order-service-scale-from-zero"
    namespace = "order-service-staging"
  }

  spec {
    # Scale up before business hours (warmup: 3s, so 7:55 AM leaves ample margin)
    schedule = "55 7 * * 1-5"  # 7:55 AM weekdays

    job_template {
      spec {
        template {
          spec {
            container {
              name    = "scaler"
              image   = "bitnami/kubectl:latest"
              command = [
                "kubectl", "scale", "deployment",
                "order-service", "--replicas=3",
                "-n", "order-service-staging"
              ]
            }
            restart_policy = "OnFailure"
          }
        }
      }
    }
  }
}

Analyzers

ID      Severity  Rule
CAP001  Warning   [DeploymentApp] without [AutoscaleRule]
CAP002  Error     Autoscale metric not declared in Observability DSL
CAP003  Warning   [ScaleToZero] without WarmupDuration
CAP004  Error     MaxReplicas without corresponding [ResourceBudget]

CAP001 -- Deployment Without Autoscaling

warning CAP001: Service 'order-service' has [DeploymentApp] but no
[AutoscaleRule]. The service will run at a fixed replica count. Add
[AutoscaleRule("order-service", ...)] or document why fixed scaling
is intentional.

CAP002 -- Metric Not in Observability

The autoscale metric must be a metric that the Observability DSL is already collecting. If scaleUpMetric: "custom_queue_depth" is declared but no [MetricDefinition("custom_queue_depth")] exists in the Observability DSL, the scaling will not work because the metric does not exist.

error CAP002: AutoscaleRule for 'order-service' references metric
'custom_queue_depth' but no [MetricDefinition] for this metric exists
in the Observability DSL. The HPA cannot scale on a metric that is not
being collected. Add [MetricDefinition("custom_queue_depth", ...)] to
your observability class.

CAP003 -- Scale-to-Zero Without Warmup

warning CAP003: ScaleToZero for 'order-service' has IdleTimeout='15m'
but WarmupDuration is not set. The first request after idle will
experience cold start latency. Set WarmupDuration to document the
expected cold start time.

CAP004 -- Max Replicas Without Budget

This is the analyzer that prevents the surprise cloud bill. If MaxReplicas is declared but no [ResourceBudget] exists in the Cost DSL, the team does not know what peak scaling costs.

error CAP004: AutoscaleRule for 'order-service' declares MaxReplicas=20
but no [ResourceBudget] exists in the Cost DSL. At 2 vCPU / 4Gi per
replica, peak cost is approximately $2,900/month. Add
[ResourceBudget("order-service", ...)] to your cost class.
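The $2,900/month figure in the diagnostic is simple multiplication. Assuming roughly $145 per replica-month for a 2 vCPU / 4Gi pod (an illustrative price, not any provider's quote):

```python
def peak_monthly_cost(max_replicas: int,
                      per_replica_monthly_usd: float) -> float:
    """Worst-case spend if the HPA pins the fleet at MaxReplicas."""
    return max_replicas * per_replica_monthly_usd

# Assumed price: ~$145 per replica-month for a 2 vCPU / 4Gi pod.
print(peak_monthly_cost(20, 145.0))   # → 2900.0
print(peak_monthly_cost(100, 145.0))  # → 14500.0  (the tutorial value)
```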

Capacity --> Observability

Every metric referenced in ScaleUpMetric or ScaleDownMetric must exist in the Observability DSL. The cross-DSL analyzer verifies this at compile time. The generator also adds the scaling metrics to the observability dashboard:

  • Current replica count vs. min/max
  • Scale events over time
  • Throttled request count per endpoint
  • Queue depth for KEDA-triggered scaling

Capacity --> Cost

MaxReplicas multiplied by per-replica cost must not exceed the [ResourceBudget]. The Capacity and Cost DSLs create a feedback loop: capacity defines the scaling envelope, cost defines the spending envelope. If they conflict, the analyzer says so at compile time, not when the invoice arrives.

Capacity --> Performance

The Performance DSL declares SLOs ([Slo(p99Latency: "200ms")]). The Capacity DSL's throttle policies affect those SLOs. If a [ThrottlePolicy] allows 10 req/s with QueueSize = 5, a request at the back of a full queue waits up to 500ms before processing even begins. The cross-DSL analyzer checks whether the throttle policy's maximum queue wait plus the service's baseline latency exceeds the SLO:

warning CAP-PERF001: ThrottlePolicy for '/api/orders/export' has
QueueSize=5 at 10 req/s (max queue time: 500ms). The Performance DSL
declares p99 latency SLO of 200ms. Queued requests will likely breach
the SLO. Consider reducing QueueSize or increasing RequestsPerSecond.
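The 500ms figure is the queue drain time. A sketch of the analyzer's arithmetic (the baseline-latency parameter is an assumption added for illustration):

```python
def max_queue_wait_ms(queue_size: int, requests_per_second: int) -> float:
    """Worst-case wait for the last request in a full queue."""
    return queue_size / requests_per_second * 1000

def breaches_slo(queue_size: int, rps: int,
                 baseline_latency_ms: float, slo_p99_ms: float) -> bool:
    """Queue wait plus baseline latency vs. the declared p99 SLO."""
    return (max_queue_wait_ms(queue_size, rps)
            + baseline_latency_ms) > slo_p99_ms

print(max_queue_wait_ms(5, 10))          # → 500.0
print(breaches_slo(5, 10, 50.0, 200.0))  # → True
```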

Capacity --> LoadTesting

The LoadTesting DSL declares load test scenarios. The Capacity DSL's autoscaling rules should be exercised by those tests. The cross-DSL analyzer verifies that at least one load test scenario generates enough traffic to trigger a scale-up event:

info CAP-LT001: AutoscaleRule for 'order-service' scales up at 75%
CPU utilization with 3 replicas. Load test 'peak-traffic' generates
5000 rpm. Estimated CPU at 5000 rpm: 65% per replica. This load test
will not trigger scale-up. Consider adding a scenario that reaches
75% CPU utilization (approximately 6000 rpm).
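The "approximately 6000 rpm" recommendation is a linear extrapolation from the observed load test, with linearity between rpm and CPU as the simplifying assumption:

```python
def rpm_for_target_cpu(observed_rpm: float, observed_cpu_pct: float,
                       target_cpu_pct: float) -> float:
    """Linear extrapolation: rpm needed to reach the target CPU level."""
    return observed_rpm * target_cpu_pct / observed_cpu_pct

# 5000 rpm drove 65% CPU; the scale-up threshold is 75%.
needed = rpm_for_target_cpu(5000, 65, 75)
print(round(needed))  # → 5769, i.e. roughly 6000 rpm
```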

The Payoff

The two incidents at the start of this post came from the same root cause: autoscaling configuration was a number in a YAML file, disconnected from cost, performance, and capacity planning.

After the Capacity DSL:

  • Autoscaling is bounded. MaxReplicas: 20 with MaxScaleUpStep: 4 prevents 0-to-100 jumps. The cost analyzer verifies the budget can handle peak replicas.
  • Rate limiting is generated. The ASP.NET middleware comes from attributes. No manual configuration. No debate about limits -- they are typed values attached to endpoints, reviewable in pull requests.
  • Capacity is projected. "15,000 RPM now, growing 8% monthly, max capacity 50,000 RPM" tells the team they have roughly 15 months before they need to scale the infrastructure. The review date is set. The conversation is scheduled.
  • Scale-to-zero saves money. Staging runs 0 replicas outside business hours. The warmup duration is documented at 3 seconds. Nobody is afraid of cold starts because the cold start time is a typed value, not a rumor.
  • Every DSL cross-checks. The capacity rules connect to observability (metrics exist), cost (budget covers peak), performance (throttling respects SLOs), and load testing (tests exercise scaling). A change in any DSL triggers validation in all connected DSLs.

The YAML file with maxReplicas: 100 that blew up the cloud bill cannot happen when MaxReplicas is a typed attribute connected to a ResourceBudget connected to an analyzer that says "this will cost $2,900/month at peak" at compile time.
