
Cloud Tier -- Terraform Generation

InProcess proved the circuit breaker works with simulated faults. Container proved the database driver handles real network latency. Neither can answer this question: what happens when an entire availability zone goes down while 500 users are placing orders?

That requires real cloud infrastructure. Real load generators. Real availability-zone failover. The Cloud tier generates the Terraform modules, Kubernetes manifests, and load test scripts from the same attribute pattern used in the other two tiers.


Step 1: The Scenario

OrderService runs in Azure Kubernetes Service across three availability zones. The payment provider is in a different region. We need to verify two things:

  1. Failover. When one AZ goes down, the remaining two handle traffic within 60 seconds of disruption. Kubernetes reschedules pods, the load balancer drains the failed zone, and requests continue.
  2. Peak load. 500 concurrent users placing orders for 15 minutes. The p95 response time stays under 300ms and the error rate stays under 1%.

Neither can be tested locally. The first requires Azure Chaos Studio to simulate AZ failure. The second requires distributed load generation that a single machine cannot produce.
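Stated as code, the two criteria reduce to a small predicate. This is only an illustration of the thresholds above; `verdict` and its field names are not part of any DSL in this series.

```javascript
// The two success criteria from the scenario, as a predicate the experiment
// harness could apply. Field names here are illustrative, not DSL surface.
function verdict({ recoverySeconds, p95Ms, errorRate }) {
  return {
    failover: recoverySeconds <= 60,           // AZ recovery within 60s
    peakLoad: p95Ms < 300 && errorRate < 0.01, // p95 < 300ms, errors < 1%
  };
}

console.log(verdict({ recoverySeconds: 47, p95Ms: 280, errorRate: 0.004 }));
```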


Step 2: Declare the Experiments

Two experiments, both Cloud tier:

// AzureFailoverChaos.cs
using Ops.Chaos;

[ChaosExperiment("AzFailover", Tier = OpsExecutionTier.Cloud,
    Hypothesis = "OrderService recovers within 60s after AZ failure")]
[CloudProvider(CloudProvider.Azure, Region = "westeurope")]
[KubernetesTarget(Namespace = "order-system", Deployment = "order-api")]
[FaultInjection(FaultKind.AvailabilityZoneFailure, Duration = "5m",
    Zone = "westeurope-1")]
[SteadyStateProbe(Metric = "order.api.availability", Expected = "> 99%")]
[SteadyStateProbe(Metric = "order.api.p99", Expected = "< 2000ms")]
[AbortCondition(Metric = "order.api.error_rate", Threshold = "30%")]
public partial class AzureFailoverChaos { }

// PeakTrafficLoadTest.cs
using Ops.LoadTesting;

[LoadTest("PeakTraffic", Tier = OpsExecutionTier.Cloud)]
[KubernetesTarget(Namespace = "order-system", Deployment = "order-api")]
[LoadProfile(ConcurrentUsers = 500, RampUp = "120s", Duration = "15m")]
[LoadTestEndpoint("POST", "/api/orders",
    PayloadGenerator = nameof(GenerateOrder),
    Headers = new[] { "Authorization: Bearer {{token}}" })]
[LoadTestEndpoint("GET", "/api/orders/{{orderId}}",
    Weight = 3)]
[LoadTestThreshold(P95 = "300ms", P99 = "1s", MaxErrorRate = 0.01)]
public partial class PeakTrafficLoadTest
{
    public static object GenerateOrder(int iteration) => new
    {
        customerId = $"customer-{iteration % 100}",
        items = new[]
        {
            new { productId = $"product-{iteration % 50}", quantity = 1 + (iteration % 5) }
        },
        paymentMethod = iteration % 2 == 0 ? "credit_card" : "bank_transfer"
    };
}

The [CloudProvider] attribute determines which Terraform provider modules are generated. The [KubernetesTarget] attribute tells the generator which deployment to target for chaos injection and load testing. The [LoadProfile] attribute translates directly into the k6 scenario configuration.
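As a sketch of that last translation, here is roughly how [LoadProfile] values could become the ramping-vus stage list that shows up in the generated k6 script. `buildStages` is an illustrative name, not the real generator API, and the 30-second ramp-down default is an assumption.

```javascript
// Illustrative translation of [LoadProfile] properties into k6 stages.
// buildStages is not a real Ops.LoadTesting API; the rampDown default
// is an assumption.
function buildStages({ concurrentUsers, rampUp, duration, rampDown = '30s' }) {
  return [
    { duration: rampUp, target: concurrentUsers },   // ramp up
    { duration: duration, target: concurrentUsers }, // hold
    { duration: rampDown, target: 0 },               // ramp down
  ];
}

const stages = buildStages({ concurrentUsers: 500, rampUp: '120s', duration: '15m' });
```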


Step 3: Generated Terraform -- Chaos Infrastructure

The generator produces terraform/chaos-az-failover/main.tf:

# Auto-generated by Ops.Chaos.Generator from AzureFailoverChaos
# Do not edit. Regenerate by building the project.

terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.30"
    }
  }
}

variable "resource_group_name" {
  type        = string
  description = "Resource group containing the AKS cluster"
}

variable "aks_cluster_name" {
  type        = string
  description = "Name of the AKS cluster"
}

variable "region" {
  type    = string
  default = "westeurope"
}

# --- Data Sources ---

data "azurerm_resource_group" "main" {
  name = var.resource_group_name
}

data "azurerm_kubernetes_cluster" "main" {
  name                = var.aks_cluster_name
  resource_group_name = data.azurerm_resource_group.main.name
}

# --- Azure Chaos Studio ---

resource "azurerm_chaos_studio_target" "aks" {
  location            = var.region
  target_resource_id  = data.azurerm_kubernetes_cluster.main.id
  target_type         = "Microsoft-AzureKubernetesService"
}

resource "azurerm_chaos_studio_experiment" "az_failover" {
  name                = "chaos-az-failover"
  location            = var.region
  resource_group_name = data.azurerm_resource_group.main.name

  identity {
    type = "SystemAssigned"
  }

  selectors {
    name                    = "aks-selector"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.aks.id]
  }

  step {
    name = "az-failure-step"

    branch {
      name = "az-failure-branch"

      actions {
        action_type = "continuous"
        duration    = "PT5M"
        selector_name = "aks-selector"

        parameters = {
          jsonParameters = jsonencode({
            action            = "zone-failure"
            availabilityZone  = "westeurope-1"
          })
        }
      }
    }
  }

  tags = {
    experiment = "AzFailover"
    generator  = "Ops.Chaos"
    managed_by = "terraform"
  }
}

# --- Role Assignment ---
# Chaos Studio needs cluster-admin rights on the AKS cluster

resource "azurerm_role_assignment" "chaos_aks" {
  scope                = data.azurerm_kubernetes_cluster.main.id
  role_definition_name = "Azure Kubernetes Service Cluster Admin Role"
  principal_id         = azurerm_chaos_studio_experiment.az_failover.identity[0].principal_id
}

# --- Azure Monitor Workspace (for results) ---

resource "azurerm_log_analytics_workspace" "chaos" {
  name                = "chaos-az-failover-logs"
  location            = var.region
  resource_group_name = data.azurerm_resource_group.main.name
  sku                 = "PerGB2018"
  retention_in_days   = 30

  tags = {
    experiment = "AzFailover"
    managed_by = "terraform"
  }
}

# --- Outputs ---

output "experiment_id" {
  value = azurerm_chaos_studio_experiment.az_failover.id
}

output "log_workspace_id" {
  value = azurerm_log_analytics_workspace.chaos.id
}

The generated Terraform uses data sources for the existing AKS cluster and resource group -- it does not create the cluster. Chaos experiments are ephemeral infrastructure layered on top of existing production or staging environments. The Chaos Studio experiment targets a specific AZ and runs for 5 minutes.
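One detail worth noting: the attribute says Duration = "5m", the Terraform says PT5M, and the Litmus CRD in Step 6 says 300 seconds. A generator has to normalize the shorthand into each target's format. A minimal sketch of that conversion, with illustrative function names:

```javascript
// Illustrative duration normalization: the attribute uses Go-style shorthand
// ("5m"), Chaos Studio wants ISO-8601 ("PT5M"), Litmus wants plain seconds.
// Neither function is part of the real generator.
function toIso8601(shorthand) {
  const m = shorthand.match(/^(\d+)([smh])$/);
  if (!m) throw new Error(`unsupported duration: ${shorthand}`);
  return `PT${m[1]}${m[2].toUpperCase()}`;
}

function toSeconds(shorthand) {
  const m = shorthand.match(/^(\d+)([smh])$/);
  if (!m) throw new Error(`unsupported duration: ${shorthand}`);
  const unit = { s: 1, m: 60, h: 3600 }[m[2]];
  return String(Number(m[1]) * unit);
}

toIso8601('5m');  // → 'PT5M'  (Chaos Studio)
toSeconds('5m');  // → '300'   (Litmus TOTAL_CHAOS_DURATION)
```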


Step 4: Generated Terraform -- Load Test Infrastructure

The generator produces terraform/load-test/main.tf:

# Auto-generated by Ops.LoadTesting.Generator from PeakTrafficLoadTest
# Do not edit. Regenerate by building the project.

terraform {
  required_version = ">= 1.5"
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.30"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.14"
    }
  }
}

variable "kubeconfig_path" {
  type    = string
  default = "~/.kube/config"
}

variable "k6_runner_count" {
  type    = number
  default = 10
  description = "Number of distributed k6 runner pods"
}

# --- k6 Operator ---

resource "helm_release" "k6_operator" {
  name             = "k6-operator"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "k6-operator"
  version          = "3.7.0"
  namespace        = "k6-system"
  create_namespace = true

  set {
    name  = "runner.replicas"
    value = var.k6_runner_count
  }
}

# --- k6 Test Script ConfigMap ---

resource "kubernetes_config_map" "k6_script" {
  metadata {
    name      = "k6-peak-traffic-script"
    namespace = "k6-system"
  }

  data = {
    "peak-traffic.js" = file("${path.module}/k6-peak-traffic.js")
  }

  depends_on = [helm_release.k6_operator]
}

The TestRun CRD itself is emitted as a standalone manifest, terraform/load-test/k6-peak-traffic-testrun.yaml, rather than as a Terraform resource. A TestRun starts executing the moment it is applied, so keeping it out of terraform apply lets the execution script start the load test only after chaos injection has begun:

# Auto-generated by Ops.LoadTesting.Generator from PeakTrafficLoadTest
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: peak-traffic-run
  namespace: k6-system
spec:
  parallelism: 10
  script:
    configMap:
      name: k6-peak-traffic-script
      file: peak-traffic.js
  runner:
    env:
      - name: K6_OUT
        value: experimental-prometheus-rw
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: http://prometheus.monitoring:9090/api/v1/write

Ten k6 runner pods distribute the load. Each pod runs a fraction of the 500 concurrent users. Results stream to Prometheus via the experimental remote-write output.
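The split itself is handled by the k6 operator, which divides the target VU count across the runner pods. Roughly, for 500 VUs on 10 runners, the arithmetic looks like this (the operator does this internally; `vusPerRunner` is just an illustration):

```javascript
// Illustrative arithmetic for splitting VUs across runner pods; the k6
// operator performs this division itself. Any remainder goes to the
// first pods so the total is preserved.
function vusPerRunner(totalVus, runners) {
  const base = Math.floor(totalVus / runners);
  const extra = totalVus % runners;
  return Array.from({ length: runners }, (_, i) => base + (i < extra ? 1 : 0));
}

vusPerRunner(500, 10);  // each of the 10 runners drives 50 VUs
```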


Step 5: Generated k6 Script

The generator produces terraform/load-test/k6-peak-traffic.js:

// Auto-generated by Ops.LoadTesting.Generator from PeakTrafficLoadTest
// Do not edit. Regenerate by building the project.

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter, Rate, Trend } from 'k6/metrics';

// --- Custom Metrics ---
const orderCreateDuration = new Trend('order_create_duration', true);
const orderGetDuration = new Trend('order_get_duration', true);
const orderErrors = new Counter('order_errors');
const orderSuccessRate = new Rate('order_success_rate');

// --- Thresholds ---
export const options = {
  scenarios: {
    peak_traffic: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '120s', target: 500 },   // Ramp up to 500 users
        { duration: '15m',  target: 500 },   // Hold at 500 users
        { duration: '30s',  target: 0 },     // Ramp down
      ],
    },
  },
  thresholds: {
    'http_req_duration{endpoint:create_order}': ['p(95)<300', 'p(99)<1000'],
    'http_req_duration{endpoint:get_order}':    ['p(95)<300', 'p(99)<1000'],
    'order_success_rate':                       ['rate>0.99'],
    'http_req_failed':                          ['rate<0.01'],
  },
};

// --- Payload Generator ---
function generateOrder(iteration) {
  return JSON.stringify({
    customerId: `customer-${iteration % 100}`,
    items: [
      {
        productId: `product-${iteration % 50}`,
        quantity: 1 + (iteration % 5),
      },
    ],
    paymentMethod: iteration % 2 === 0 ? 'credit_card' : 'bank_transfer',
  });
}

// --- Base URL ---
const BASE_URL = __ENV.TARGET_URL || 'http://order-api.order-system.svc.cluster.local:8080';

// --- Test Logic ---
export default function () {
  const iteration = __ITER;

  // POST /api/orders (weight: 1)
  const createPayload = generateOrder(iteration);
  const createRes = http.post(`${BASE_URL}/api/orders`, createPayload, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.AUTH_TOKEN || 'test-token'}`,
    },
    tags: { endpoint: 'create_order' },
  });

  orderCreateDuration.add(createRes.timings.duration);

  const createOk = check(createRes, {
    'create: status is 201': (r) => r.status === 201,
    'create: has orderId': (r) => {
      try { return JSON.parse(r.body).orderId !== undefined; }
      catch { return false; }
    },
  });

  if (createOk) {
    orderSuccessRate.add(1);
  } else {
    orderErrors.add(1);
    orderSuccessRate.add(0);
  }

  // GET /api/orders/{orderId} (weight: 3 -- 3x more reads than writes)
  if (createOk && createRes.status === 201) {
    const orderId = JSON.parse(createRes.body).orderId;

    for (let i = 0; i < 3; i++) {
      const getRes = http.get(`${BASE_URL}/api/orders/${orderId}`, {
        headers: {
          'Authorization': `Bearer ${__ENV.AUTH_TOKEN || 'test-token'}`,
        },
        tags: { endpoint: 'get_order' },
      });

      orderGetDuration.add(getRes.timings.duration);

      check(getRes, {
        'get: status is 200': (r) => r.status === 200,
      });

      sleep(0.5);
    }
  }

  sleep(1);
}

The [LoadTestEndpoint] attributes translate to HTTP calls. The Weight = 3 on the GET endpoint means three reads per write. The payload generator mirrors the C# method GenerateOrder -- the generator transpiles the simple object initializer to JavaScript. The thresholds map directly from [LoadTestThreshold].
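The weights determine the request mix: with the POST at its default weight of 1 and the GET at 3, reads are three quarters of all requests. A sketch of the share calculation (`trafficMix` is an illustrative name, not generator API):

```javascript
// Illustrative share calculation for weighted endpoints. Endpoints without
// an explicit weight default to 1, matching the POST endpoint above.
function trafficMix(endpoints) {
  const total = endpoints.reduce((sum, e) => sum + (e.weight ?? 1), 0);
  return endpoints.map(e => ({ ...e, share: (e.weight ?? 1) / total }));
}

const mix = trafficMix([
  { method: 'POST', path: '/api/orders' },                       // weight 1 (default)
  { method: 'GET', path: '/api/orders/{{orderId}}', weight: 3 },
]);
// POST share: 0.25, GET share: 0.75
```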


Step 6: Generated Litmus Experiment

For the AZ failover chaos, the generator also produces a LitmusChaos CRD at k8s/litmus-az-failover.yaml:

# Auto-generated by Ops.Chaos.Generator from AzureFailoverChaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: az-failover-engine
  namespace: order-system
spec:
  engineState: active
  appinfo:
    appns: order-system
    applabel: app=order-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: az-failure
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"    # 5 minutes
            - name: AVAILABILITY_ZONE
              value: "westeurope-1"
            - name: CLOUD_PROVIDER
              value: "azure"
        probe:
          - name: order-api-availability
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: http://order-api.order-system:8080/health/live
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 10
              retry: 3
              probePollingInterval: 2
          - name: order-api-p99-check
            type: promProbe
            mode: Edge
            promProbe/inputs:
              endpoint: http://prometheus.monitoring:9090
              query: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service=\"order-api\"}[5m]))"
              comparator:
                type: float
                criteria: "<="
                value: "2.0"
            runProperties:
              probeTimeout: 10
              interval: 30
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: order-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-chaos-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services"]
    verbs: ["get", "list", "watch", "delete", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-chaos-admin-binding
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: order-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-chaos-admin

The [SteadyStateProbe] attributes become Litmus probes. The HTTP probe checks availability continuously. The Prometheus probe checks p99 latency at the start and end of the experiment (Edge mode). The [AbortCondition] attribute maps to the Litmus abort criteria.
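The comparator in the promProbe is a plain threshold check on the query result. A sketch of the evaluation, with `evaluateComparator` as an illustrative name rather than Litmus internals:

```javascript
// Illustrative version of the check a Litmus promProbe comparator performs:
// the Prometheus query result is compared against the criteria and value.
function evaluateComparator(observed, { criteria, value }) {
  const v = parseFloat(value);
  switch (criteria) {
    case '<=': return observed <= v;
    case '>=': return observed >= v;
    case '<':  return observed < v;
    case '>':  return observed > v;
    case '==': return observed === v;
    default: throw new Error(`unknown criteria: ${criteria}`);
  }
}

// p99 of 1.8s against the probe's "<= 2.0" comparator passes:
evaluateComparator(1.8, { criteria: '<=', value: '2.0' });
```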


Step 7: Generated Execution Script

This script is generated too. The source generator produces run-cloud-experiments.g.sh alongside the Terraform modules, Litmus CRDs, and k6 scripts. Nobody writes terraform apply or kubectl apply by hand: the developer writes C# attributes, and dotnet build generates every artifact, including the orchestration script that ties them together.

To run it: dotnet ops run --tier cloud (which executes the generated script below).

#!/usr/bin/env bash
# <auto-generated by Ops.Chaos.Generators + Ops.LoadTesting.Generators />
# Generated from: AzureFailoverChaos + PeakTrafficLoadTest
# Run via: dotnet ops run --tier cloud

set -euo pipefail

# 1. Apply chaos infrastructure (Chaos Studio, monitoring)
terraform -chdir=terraform/chaos-az-failover init
terraform -chdir=terraform/chaos-az-failover apply -auto-approve \
  -var="resource_group_name=order-platform-rg" \
  -var="aks_cluster_name=order-aks-westeurope"

# 2. Apply load test infrastructure (k6 operator, test script)
terraform -chdir=terraform/load-test init
terraform -chdir=terraform/load-test apply -auto-approve

# 3. Start the Litmus chaos experiment
kubectl apply -f k8s/litmus-az-failover.yaml

# 4. Wait for chaos to begin, then start the load test
sleep 30
kubectl apply -f terraform/load-test/k6-peak-traffic-testrun.yaml

# 5. Monitor the experiment (streams to stdout)
kubectl logs -f -n k6-system -l k6_cr=peak-traffic-run --tail=50 &
kubectl get chaosresult az-failover-engine-az-failure -n order-system -w &

# 6. Wait for completion
kubectl wait --for=condition=complete testrun/peak-traffic-run -n k6-system --timeout=30m
kubectl wait --for=jsonpath='{.status.experimentStatus.phase}'=Completed \
  chaosengine/az-failover-engine -n order-system --timeout=15m

# 7. Collect results
kubectl get chaosresult az-failover-engine-az-failure -n order-system -o json > chaos-results.json
kubectl logs -n k6-system -l k6_cr=peak-traffic-run --tail=1000 > k6-results.txt

# 8. Destroy ephemeral infrastructure
terraform -chdir=terraform/load-test destroy -auto-approve
terraform -chdir=terraform/chaos-az-failover destroy -auto-approve
kubectl delete -f k8s/litmus-az-failover.yaml

Every line in this script was derived from the C# attributes. The Terraform paths come from [CloudProvider]. The kubectl targets come from [ChaosExperiment] and [LoadTest]. The variable values come from the attribute properties. Change an attribute, rebuild, and the script regenerates.
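For example, the -var flags in step 1 can be derived mechanically from attribute-supplied values. A sketch, with `terraformVarFlags` as an illustrative helper name:

```javascript
// Illustrative derivation of `terraform apply` -var flags from the
// attribute-supplied variable values. Not the real generator API.
function terraformVarFlags(vars) {
  return Object.entries(vars).map(([key, value]) => `-var="${key}=${value}"`);
}

terraformVarFlags({
  resource_group_name: 'order-platform-rg',
  aks_cluster_name: 'order-aks-westeurope',
});
// → ['-var="resource_group_name=order-platform-rg"',
//    '-var="aks_cluster_name=order-aks-westeurope"']
```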

Script steps 3 and 4 overlap deliberately. The chaos experiment starts first, destabilizing the AZ. Then the load test begins, proving that the system handles 500 concurrent users even with a failed zone. This is the scenario that cannot be simulated at any other tier.


Step 8: Results and Integration

The chaos results feed into the other DSLs:

Observability DSL. The SLO burn rate (from Part 9) tracks availability during the experiment. If the burn rate exceeds its threshold, the Observability DSL fires an alert and the steady-state probe is recorded as failed.

SLO: order.api.availability
  Target: 99.9%
  Burn rate during chaos: 2.1x (brief spike at AZ failure, recovered in 47s)
  Result: PASS (availability 99.2% over 5m, within error budget for chaos window)
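Burn rate is the observed error rate divided by the error budget the SLO allows: a 2.1x burn means errors arrived 2.1 times faster than the 99.9% target permits over the measured window. The window is up to the Observability DSL, so the availability value below is illustrative, chosen only to reproduce the 2.1x figure:

```javascript
// Burn rate = observed error rate / error budget implied by the SLO target.
// 0.9979 is an illustrative availability that yields a ~2.1x burn against
// a 99.9% target; it is not taken from the experiment report.
function burnRate(observedAvailability, sloTarget) {
  return (1 - observedAvailability) / (1 - sloTarget);
}

burnRate(0.9979, 0.999);  // ≈ 2.1
```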

Cost DSL. The ephemeral Terraform infrastructure has a cost. The Cost DSL (from Part 19) tracks it:

Chaos experiment: AzFailover
  Duration: 23 minutes (including setup/teardown)
  Resources: Chaos Studio experiment, Log Analytics workspace
  Estimated cost: $0.12

Load test: PeakTraffic
  Duration: 19 minutes
  Resources: 10 k6 runner pods (0.5 CPU, 256Mi each)
  Estimated cost: $0.08

Total cloud tier cost: $0.20
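The load test figure is back-of-envelope arithmetic: 10 pods at 0.5 vCPU for 19 minutes is about 1.6 vCPU-hours. The per-vCPU-hour rate below is an assumption for illustration, not an Azure price:

```javascript
// Rough pod cost: pods × vCPU per pod × hours × $/vCPU-hour.
// The $0.05/vCPU-hour rate is an assumed figure, not an Azure quote.
function podCost(pods, vcpuPerPod, minutes, ratePerVcpuHour) {
  return pods * vcpuPerPod * (minutes / 60) * ratePerVcpuHour;
}

podCost(10, 0.5, 19, 0.05);  // ≈ $0.08
```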

Lifecycle DSL. The experiment results are recorded as evidence for the next compliance audit. The ApiContract DSL verifies that the API responded correctly under load. The Incident DSL records the recovery time (47 seconds) as the baseline for the on-call runbook.

Everything generated. Everything from attributes. The developer wrote two C# classes with a combined total of 13 attributes and one payload generator method. The source generator produced: 2 Terraform modules (95 lines of HCL), 1 Litmus ChaosEngine CRD (75 lines of YAML), 1 k6 test script (100 lines of JavaScript), 1 Kubernetes TestRun manifest, and the execution script. All consistent. All derived from the same source of truth.
