The 3-Tier Execution Model -- InProcess, Container, Cloud
"The fastest chaos experiment runs in your test suite. The most realistic one runs in production. The trick is knowing which one you need."
The Insight
Most operational testing frameworks assume infrastructure. Chaos Monkey needs a cluster. Litmus needs Kubernetes. Gremlin needs an agent on a VM. k6 Cloud needs a cloud account.
This means the feedback loop is slow, expensive, and gated by infrastructure access. A developer who wants to test "what happens when the payment gateway times out" must either:
- Deploy to a staging environment, configure a proxy, inject a fault, run the test, tear down -- 20 minutes minimum
- Or just... not test it
Option 2 wins 95% of the time. Not because developers are lazy, but because the cost-to-feedback ratio is wrong.
The 3-Tier Execution Model fixes this by recognizing a simple truth: not every operational concern needs infrastructure. Most fault scenarios can be simulated in-process. Some need real network behavior. Very few need real cloud resources.
Coverage
────────
Tier 1: InProcess ████████████████████ 80% ← dotnet test, zero infra
Tier 2: Container █████ 15% ← Docker, Toxiproxy
Tier 3: Cloud ██ 5% ← Terraform, k6 Cloud
The developer progresses from fast/free to realistic/costly. The DSL attributes are the same at every tier. Only the generated artifacts change.
The Enum
Every Ops attribute declares its execution tier:
/// The execution tier for an operational concern.
/// Determines what infrastructure is required and what artifacts are generated.
public enum OpsExecutionTier
{
/// Pure in-process simulation. No Docker, no VMs, no cloud.
/// Uses Injectable decorators for fault injection, latency, rate limiting.
/// Runs in dotnet test. Feedback in seconds.
InProcess = 0,
/// Docker Compose + Toxiproxy for real network behavior.
/// Uses Testcontainers for lifecycle management.
/// Runs in CI or local Docker. Feedback in minutes.
Container = 1,
/// Real cloud infrastructure via Terraform/Pulumi.
/// For load testing at scale, multi-AZ failover, geographic latency.
/// Runs in cloud. Feedback in minutes to hours.
Cloud = 2
}
This enum appears on every Ops attribute that generates executable artifacts. It is the single most important design decision in the ecosystem: it determines what gets generated, where it runs, and what it costs.
The Mechanism: Injectable Decorators
The Injectable framework generates decorator classes for every registered service interface. These decorators wrap the real service and intercept every method call. The Ops DSLs exploit this: instead of decorating for logging or caching, they decorate for fault injection.
A chaos experiment declaration:
[ChaosExperiment("PaymentTimeout",
Tier = OpsExecutionTier.InProcess,
Description = "Simulates payment gateway timeout under normal load")]
[TargetService(typeof(IPaymentGateway))]
[FaultInjection(FaultKind.Timeout,
Probability = 0.3,
TimeoutMs = 5000)]
[SteadyStateHypothesis(
Metric = "order.completion.rate",
Condition = ThresholdCondition.GreaterThan,
Value = 0.95,
Description = "95% of orders still complete despite payment timeouts")]
public sealed class PaymentTimeoutExperiment { }
The Source Generator produces an Injectable decorator:
// <auto-generated by Ops.Chaos.Generators />
namespace Ops.Chaos.Generated;
/// Chaos decorator for IPaymentGateway.
/// Experiment: PaymentTimeout
/// Fault: Timeout with 30% probability, 5000ms delay.
[InjectableDecorator(typeof(IPaymentGateway))]
public sealed class PaymentTimeoutChaosDecorator : IPaymentGateway
{
private readonly IPaymentGateway _inner;
private readonly IChaosController _chaos;
private readonly IMeterFactory _meters;
public PaymentTimeoutChaosDecorator(
IPaymentGateway inner,
IChaosController chaos,
IMeterFactory meters)
{
_inner = inner;
_chaos = chaos;
_meters = meters;
}
public async Task<PaymentResult> ProcessPaymentAsync(
PaymentRequest request,
CancellationToken ct)
{
if (_chaos.IsExperimentActive("PaymentTimeout")
&& _chaos.ShouldInjectFault(probability: 0.3))
{
_meters.CreateCounter<long>("chaos.fault.injected")
.Add(1, new("experiment", "PaymentTimeout"),
new("fault", "Timeout"));
// Simulate timeout: delay then throw
await Task.Delay(5000, ct);
throw new TimeoutException(
"Chaos: PaymentTimeout injected Timeout fault on ProcessPaymentAsync");
}
return await _inner.ProcessPaymentAsync(request, ct);
}
// ... remaining IPaymentGateway methods decorated similarly
}
The test uses it:
[Fact]
public async Task Orders_complete_despite_payment_timeouts()
{
// Arrange: register the chaos decorator via DI
var services = new ServiceCollection();
services.AddSingleton<IPaymentGateway, RealPaymentGateway>();
services.AddChaosExperiment<PaymentTimeoutExperiment>();
services.AddSingleton<IChaosController>(
new ChaosController(activeExperiments: ["PaymentTimeout"]));
var provider = services.BuildServiceProvider();
var gateway = provider.GetRequiredService<IPaymentGateway>();
var orderService = new OrderService(gateway);
// Act: process 100 orders with 30% payment timeout probability
var results = new List<OrderResult>();
for (int i = 0; i < 100; i++)
{
try
{
results.Add(await orderService.PlaceOrderAsync(CreateTestOrder()));
}
catch (TimeoutException)
{
results.Add(OrderResult.TimedOut);
}
}
// Assert: steady state hypothesis — 95% completion rate
var completionRate = results.Count(r => r.IsSuccess) / (double)results.Count;
Assert.True(completionRate >= 0.95,
$"Completion rate {completionRate:P1} below 95% threshold");
}
No Docker. No network. No infrastructure. Pure dotnet test. Feedback in seconds. The fault injection is deterministic (controlled by IChaosController) and measurable (metrics via IMeterFactory).
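The 95% steady-state threshold only holds if something compensates for the 30% fault rate. Assuming OrderService retries failed payment attempts (the attributes above do not show the retry policy, so this is a reading, not a quote), the arithmetic works out in a quick sketch:

```python
# Completion rate under independent fault injection with probability p
# and r retries (r + 1 total attempts): an order fails only if every
# attempt hits the injected fault.
def completion_rate(p: float, retries: int) -> float:
    return 1.0 - p ** (retries + 1)

# With the experiment's 30% timeout probability:
no_retries = completion_rate(0.3, 0)   # 0.70  -- hypothesis fails
one_retry = completion_rate(0.3, 1)    # 0.91  -- still below 95%
two_retries = completion_rate(0.3, 2)  # 0.973 -- hypothesis holds
```

So the experiment is really validating the retry policy: with fewer than two retries, the 95% hypothesis cannot hold.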
What Works InProcess
The InProcess tier covers a surprisingly large surface area:
| Scenario | Mechanism | Generated Artifact |
|---|---|---|
| Service timeout | Decorator adds Task.Delay + TimeoutException | ChaosDecorator.g.cs |
| Service exception | Decorator throws configured exception type | ChaosDecorator.g.cs |
| Intermittent failure | Decorator fails with configured probability | ChaosDecorator.g.cs |
| Slow response | Decorator adds latency without failure | LatencyDecorator.g.cs |
| Rate limiting | Decorator tracks call count, rejects excess | RateLimitDecorator.g.cs |
| Data corruption | Decorator mutates response fields | CorruptionDecorator.g.cs |
| Circuit breaker | Decorator tracks failures, opens circuit | CircuitBreakerDecorator.g.cs |
| Retry exhaustion | Decorator fails N times then succeeds | RetryExhaustionDecorator.g.cs |
| Fallback activation | Decorator triggers fallback path | FallbackDecorator.g.cs |
| Performance budget | Decorator measures elapsed time, asserts | BudgetDecorator.g.cs |
| SLI/SLO validation | Decorator measures latency/error rate | SliDecorator.g.cs |
| Resource exhaustion | Decorator simulates memory/thread pool pressure | ResourcePressureDecorator.g.cs |
That is 12 operational scenarios testable with zero infrastructure. Each one is declared with an attribute, generated by a Source Generator, and validated via dotnet test. The coverage estimate of 80% is conservative -- for most microservices, the InProcess tier catches every fault scenario except true network partitions and infrastructure failures.
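The decorator mechanism itself is language-agnostic. A minimal Python analogue of the generated fault-injection wrapper — names and structure here are illustrative, not part of the framework — shows the essential shape: wrap the call, consult a seeded (deterministic) fault source, inject or pass through:

```python
import functools
import random

def with_timeout_fault(probability: float, seed: int = 42):
    """Wrap a callable; raise TimeoutError with the given probability.
    A seeded RNG keeps injection deterministic, like IChaosController."""
    rng = random.Random(seed)
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise TimeoutError(f"chaos: injected timeout on {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorate

@with_timeout_fault(probability=0.3)
def process_payment(amount: int) -> dict:
    return {"status": "ok", "amount": amount}

outcomes = []
for _ in range(1_000):
    try:
        process_payment(10)
        outcomes.append("ok")
    except TimeoutError:
        outcomes.append("timeout")

fault_rate = outcomes.count("timeout") / len(outcomes)  # close to 0.3
```

The Source Generator does the same thing per interface method, with the fault kind, probability, and delay lifted from the attribute arguments.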
The InProcess Guarantee
The Ops Analyzers enforce a critical constraint: InProcess declarations must not reference Container or Cloud resources.
// This will NOT compile:
[ChaosExperiment("BadMix", Tier = OpsExecutionTier.InProcess)]
[ToxiProxy("pg-proxy", Upstream = "postgres:5432")] // ← OPS014 error
public sealed class BadMixExperiment { }
Analyzer diagnostic:
OPS014: InProcess experiment 'BadMix' references Container-tier resource 'ToxiProxy'.
InProcess experiments can only use Injectable decorators.
Change Tier to OpsExecutionTier.Container or remove the ToxiProxy attribute.
This is enforced at compile time. You cannot accidentally write an InProcess test that requires Docker.
When InProcess Is Not Enough
InProcess decorators simulate faults at the application boundary. They intercept the IPaymentGateway call and inject a timeout. But they cannot simulate:
- Network partition: the TCP connection drops mid-request
- DNS failure: the hostname does not resolve
- TLS handshake failure: the certificate is expired or mismatched
- Database crash: PostgreSQL process killed mid-transaction
- Broker slowdown: RabbitMQ accepts messages but delivers them 10x slower
- Partial network degradation: 50% packet loss on one specific connection
These require a real network. The Container tier provides one.
The Mechanism: Docker Compose + Toxiproxy
[ChaosExperiment("DatabasePartition",
Tier = OpsExecutionTier.Container,
Description = "Network partition between app and PostgreSQL")]
[Container("postgres",
Image = "postgres:16",
Ports = ["5432:5432"],
Environment = ["POSTGRES_PASSWORD=test", "POSTGRES_DB=orders"])]
[Container("app",
Image = "order-service:latest",
Ports = ["8080:8080"],
DependsOn = ["postgres"])]
[ToxiProxy("pg-proxy",
Listen = "0.0.0.0:15432",
Upstream = "postgres:5432")]
[NetworkFault(
Proxy = "pg-proxy",
FaultKind = NetworkFaultKind.Timeout,
Latency = 30_000,
Toxicity = 1.0)]
[SteadyStateHypothesis(
Metric = "app.health.status",
Condition = ThresholdCondition.Equals,
Value = 1, // healthy
Description = "App remains healthy during DB partition via circuit breaker")]
public sealed class DatabasePartitionExperiment { }
The Source Generator produces three artifacts:
1. docker-compose.chaos.yaml
# <auto-generated by Ops.Chaos.Generators />
# Experiment: DatabasePartition
version: "3.8"
services:
postgres:
image: postgres:16
ports:
- "5432:5432"
environment:
- POSTGRES_PASSWORD=test
- POSTGRES_DB=orders
toxiproxy:
image: ghcr.io/shopify/toxiproxy:2.9
ports:
- "8474:8474" # Toxiproxy API
- "15432:15432" # pg-proxy listen port
depends_on:
- postgres
app:
image: order-service:latest
ports:
- "8080:8080"
environment:
- ConnectionStrings__OrderDb=Host=toxiproxy;Port=15432;Database=orders;Username=postgres;Password=test
depends_on:
- toxiproxy
2. ToxiProxySetup.g.cs
// <auto-generated by Ops.Chaos.Generators />
namespace Ops.Chaos.Generated;
public static class DatabasePartitionToxiProxySetup
{
public static async Task ConfigureAsync(HttpClient toxiproxyApi)
{
// Create proxy: pg-proxy
await toxiproxyApi.PostAsJsonAsync("/proxies", new
{
name = "pg-proxy",
listen = "0.0.0.0:15432",
upstream = "postgres:5432",
enabled = true
});
}
public static async Task InjectFaultAsync(HttpClient toxiproxyApi)
{
// Add timeout toxic to pg-proxy
await toxiproxyApi.PostAsJsonAsync("/proxies/pg-proxy/toxics", new
{
name = "db-partition-timeout",
type = "timeout",
attributes = new { timeout = 30_000 },
toxicity = 1.0
});
}
public static async Task RemoveFaultAsync(HttpClient toxiproxyApi)
{
await toxiproxyApi.DeleteAsync(
"/proxies/pg-proxy/toxics/db-partition-timeout");
}
}
3. TestInfraFixture.g.cs
// <auto-generated by Ops.Chaos.Generators />
namespace Ops.Chaos.Generated;
public sealed class DatabasePartitionFixture : IAsyncLifetime
{
private readonly DockerComposeEnvironment _env;
private HttpClient? _toxiproxyApi;
public DatabasePartitionFixture()
{
_env = new DockerComposeEnvironment(
composePath: "docker-compose.chaos.yaml",
projectName: "chaos-db-partition");
}
public async Task InitializeAsync()
{
await _env.StartAsync();
await _env.WaitForHealthyAsync("postgres",
check: () => TcpCheck("localhost", 5432),
timeout: TimeSpan.FromSeconds(30));
await _env.WaitForHealthyAsync("toxiproxy",
check: () => HttpCheck("http://localhost:8474/version"),
timeout: TimeSpan.FromSeconds(10));
_toxiproxyApi = new HttpClient
{
BaseAddress = new Uri("http://localhost:8474")
};
await DatabasePartitionToxiProxySetup.ConfigureAsync(_toxiproxyApi);
}
public HttpClient ToxiProxyApi => _toxiproxyApi
?? throw new InvalidOperationException("Fixture not initialized");
public async Task DisposeAsync()
{
_toxiproxyApi?.Dispose();
await _env.StopAsync();
}
}
The test:
public class DatabasePartitionTests : IClassFixture<DatabasePartitionFixture>
{
private readonly DatabasePartitionFixture _infra;
public DatabasePartitionTests(DatabasePartitionFixture infra)
=> _infra = infra;
[Fact]
public async Task App_stays_healthy_during_db_partition()
{
// Arrange: app is running, connected to postgres via toxiproxy
using var appClient = new HttpClient
{
BaseAddress = new Uri("http://localhost:8080")
};
// Verify baseline: app is healthy
var health = await appClient.GetAsync("/health");
Assert.Equal(HttpStatusCode.OK, health.StatusCode);
// Act: inject network partition
await DatabasePartitionToxiProxySetup.InjectFaultAsync(
_infra.ToxiProxyApi);
// Wait for circuit breaker to detect the fault
await Task.Delay(TimeSpan.FromSeconds(5));
// Assert: app is still healthy (circuit breaker activated)
health = await appClient.GetAsync("/health");
Assert.Equal(HttpStatusCode.OK, health.StatusCode);
// Cleanup
await DatabasePartitionToxiProxySetup.RemoveFaultAsync(
_infra.ToxiProxyApi);
}
}
What Needs Containers
| Scenario | Why InProcess Fails | Container Mechanism |
|---|---|---|
| Network partition | No real TCP connection to drop | Toxiproxy timeout toxic |
| Network latency | Decorator latency is application-layer only | Toxiproxy latency toxic |
| Packet loss | No real packets | Toxiproxy slice/reset_peer toxic |
| DB crash | No real database process | docker stop postgres |
| Broker slowdown | No real message queue | Toxiproxy bandwidth toxic on RabbitMQ |
| DNS failure | No real DNS resolution | Docker network disconnect |
| TLS failure | No real TLS handshake | Toxiproxy with invalid upstream cert |
| Connection pool exhaustion | Decorator cannot saturate a real pool | Multiple concurrent Container connections |
| Disk full | No real disk | Docker volume with size limit |
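Each Toxiproxy-based fault in the table maps to a "toxic" posted to the Toxiproxy HTTP API, exactly as ToxiProxySetup.g.cs does for the timeout case. A Python sketch of the request bodies — field names follow the Toxiproxy API; the specific toxic names and values are illustrative:

```python
import json

def toxic(name: str, type_: str, attributes: dict,
          toxicity: float = 1.0, stream: str = "downstream") -> dict:
    """Body for POST /proxies/{proxy}/toxics on the Toxiproxy HTTP API."""
    return {"name": name, "type": type_, "stream": stream,
            "toxicity": toxicity, "attributes": attributes}

# Network partition: hold the connection open, deliver nothing for 30s.
partition = toxic("db-partition-timeout", "timeout", {"timeout": 30_000})

# Broker slowdown: throttle RabbitMQ traffic to 100 KB/s.
slow_broker = toxic("rabbit-slow", "bandwidth", {"rate": 100})

# Partial degradation: 200ms +/- 150ms latency on half the connections.
flaky_link = toxic("flaky", "latency",
                   {"latency": 200, "jitter": 150}, toxicity=0.5)

print(json.dumps(partition, indent=2))
```

The `toxicity` field is what makes partial degradation possible: it is the fraction of connections the toxic applies to.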
When Containers Are Not Enough
Docker Compose runs on one machine. It cannot simulate:
- Multi-AZ failover: killing an availability zone
- Geographic latency: 200ms cross-continent round trip on real infrastructure
- Load at scale: 10,000 concurrent users distributed across regions
- Real autoscaling: HPA responding to actual CPU/memory pressure in a cluster
- Network saturation: saturating a real NIC or load balancer
- CDN behavior: cache invalidation propagation across edge nodes
These need real cloud resources. The Cloud tier generates infrastructure-as-code.
[ChaosExperiment("AzFailover",
Tier = OpsExecutionTier.Cloud,
Description = "Simulate Azure availability zone failure")]
[CloudProvider(CloudProvider.Azure)]
[CloudRegion("westeurope")]
[AzFailure(Zone = "westeurope-1", Duration = "PT10M")]
[SteadyStateHypothesis(
Metric = "request.success.rate",
Condition = ThresholdCondition.GreaterThan,
Value = 0.99,
Description = "99% success rate maintained during AZ failure")]
public sealed class AzFailoverExperiment { }
Generated: terraform/chaos-az-failover/main.tf
# <auto-generated by Ops.Chaos.Generators />
# Experiment: AzFailover
# Description: Simulate Azure availability zone failure
terraform {
required_providers {
azurerm = { source = "hashicorp/azurerm", version = "~> 3.0" }
}
}
resource "azurerm_chaos_studio_target" "aks" {
location = "westeurope"
target_resource_id = var.aks_cluster_id
target_type = "Microsoft-AzureKubernetesService"
}
resource "azurerm_chaos_studio_experiment" "az_failover" {
name = "chaos-az-failover"
location = "westeurope"
resource_group_name = var.resource_group_name
identity { type = "SystemAssigned" }
selectors {
name = "aks-selector"
chaos_studio_target_ids = [
azurerm_chaos_studio_target.aks.id
]
}
step {
name = "az-failure"
branch {
name = "zone-1-failure"
actions {
action_type = "continuous"
duration = "PT10M"
parameters = {
jsonParameters = jsonencode({
zone = "westeurope-1"
})
}
selector_name = "aks-selector"
}
}
}
}
# Steady-state validation
resource "azurerm_monitor_metric_alert" "success_rate" {
name = "chaos-az-failover-steady-state"
resource_group_name = var.resource_group_name
scopes = [var.app_insights_id]
criteria {
metric_namespace = "microsoft.insights/components"
metric_name = "requests/success"
aggregation = "Average"
operator = "LessThan"
threshold = 0.99
}
action {
action_group_id = var.alert_action_group_id
}
}
For load testing, the Cloud tier generates distributed k6 scripts:
[PerformanceTest("OrderServiceLoad",
Tier = OpsExecutionTier.Cloud,
Description = "Distributed load test — 10K concurrent users")]
[CloudProvider(CloudProvider.Azure)]
[LoadProfile(
VirtualUsers = 10_000,
RampUpDuration = "PT5M",
SteadyStateDuration = "PT30M",
RampDownDuration = "PT2M")]
[PerformanceBudget(
P95LatencyMs = 200,
P99LatencyMs = 500,
ErrorRatePercent = 0.1)]
[DistributedFrom(Regions = ["westeurope", "eastus", "southeastasia"])]
public sealed class OrderServiceLoadTest { }
Generated: k6/order-service-load.js + terraform/k6-distributed/main.tf
The Terraform module provisions k6 Cloud runners in three regions, executes the load test, and collects results against the declared performance budgets.
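The translation from [LoadProfile] to k6's stage list is mechanical. A sketch of what the generator presumably emits, with a minimal parser for the ISO 8601 durations used in the attribute (the helper names are mine, not the framework's):

```python
import re

def iso8601_seconds(duration: str) -> int:
    """Parse a subset of ISO 8601 durations like PT5M or PT1H30M."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not m:
        raise ValueError(f"unsupported duration: {duration}")
    h, mins, s = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mins * 60 + s

def to_k6_stages(vus: int, ramp_up: str, steady: str, ramp_down: str) -> list:
    """k6 'stages' option equivalent of the [LoadProfile] attribute."""
    return [
        {"duration": f"{iso8601_seconds(ramp_up)}s", "target": vus},
        {"duration": f"{iso8601_seconds(steady)}s", "target": vus},
        {"duration": f"{iso8601_seconds(ramp_down)}s", "target": 0},
    ]

stages = to_k6_stages(10_000, "PT5M", "PT30M", "PT2M")
```

The [PerformanceBudget] values map onto k6 thresholds in the same way (p95/p99 on http_req_duration, error rate on http_req_failed).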
The Tier Mapping Table
The same Ops DSL generates different artifacts depending on the tier. Here is the complete mapping:
| DSL | InProcess Artifact | Container Artifact | Cloud Artifact |
|---|---|---|---|
| Chaos | ChaosDecorator.g.cs | docker-compose.chaos.yaml + Toxiproxy | Litmus ChaosEngine / Azure Chaos Studio |
| Performance | BenchmarkDotNet config | k6 script + docker-compose | k6 Cloud + Terraform runners |
| Resilience | Polly policy decorator | Docker health check + restart | Kubernetes PDB + HPA |
| Observability | In-memory metrics | Prometheus + Grafana containers | Cloud monitoring (App Insights, CloudWatch) |
| SLA | SLI assertion in tests | SLO burn-rate on Prometheus | Status page + alert routing |
| Network | No-op (InProcess N/A) | Docker network policies | Kubernetes NetworkPolicy / Azure NSG |
| Security | Middleware + header tests | OWASP ZAP in container | Cloud security scanner |
| Scaling | No-op (InProcess N/A) | Docker resource limits | HPA / Azure Autoscale |
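The "SLO burn-rate" in the SLA row is a simple ratio: how fast the error budget is being consumed relative to what the SLO allows. A quick sketch (the 99.9% figure is illustrative, not from the article):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Multiple of the allowed error-budget consumption rate."""
    return observed_error_rate / (1.0 - slo)

# A 99.9% SLO leaves a 0.1% error budget; a sustained 1% error rate
# burns that budget 10x faster than allowed.
rate = burn_rate(observed_error_rate=0.01, slo=0.999)
```

The Container-tier artifact expresses exactly this ratio as a Prometheus alerting rule.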
Some DSLs are not applicable at the InProcess tier (Network, Scaling). The analyzer enforces this:
OPS015: Scaling rule 'AutoscaleOrderService' has Tier = InProcess,
but Ops.Scaling requires at least Container tier.
Scaling cannot be simulated in-process.
Does Terraform Belong on a Dev Machine?
No.
Terraform talks to cloud APIs. It creates real resources that cost real money. It requires credentials with significant permissions. It is not a development tool -- it is a deployment tool.
The 3-Tier Model makes this explicit:
| Location | Tier | Tool |
|---|---|---|
| Developer workstation | InProcess | dotnet test |
| Developer workstation | Container | docker compose up |
| CI pipeline | InProcess + Container | dotnet test + Docker |
| Cloud environment | Cloud | terraform apply via pipeline |
A developer never runs terraform apply from their machine. They declare [ChaosExperiment("AzFailover", Tier = Cloud)] in C#, the Source Generator produces the Terraform module, and the CI/CD pipeline applies it in a controlled environment.
The developer's experience is always the same: write attributes, run dotnet build, get generated artifacts. Whether those artifacts are InProcess decorators, Docker Compose files, or Terraform modules is determined by the tier declaration, not by the developer's environment.
Every artifact at every tier is generated. The docker-compose.yaml is generated. The Terraform module is generated. The k6 script is generated. The Litmus CRD is generated. The Prometheus alert rules are generated. The bash orchestration script that ties them together is generated. The developer writes C# attributes and nothing else.
To execute: dotnet ops run --tier inprocess runs the generated InProcess tests. dotnet ops run --tier container runs the generated docker-compose + Toxiproxy experiments. dotnet ops run --tier cloud runs the generated Terraform + Litmus + k6 pipeline. One command, any tier, all generated.
Tier Constraints: Compile-Time Enforcement
The tier system is not advisory. It is enforced by Roslyn analyzers.
Rule: InProcess Must Not Reference Infrastructure
// OPS014: InProcess experiment references Container resource
[ChaosExperiment("Bad", Tier = OpsExecutionTier.InProcess)]
[Container("postgres", Image = "postgres:16")] // ← compile error
public sealed class BadExperiment { }
Rule: Container Must Not Reference Cloud Resources
// OPS016: Container experiment references Cloud resource
[ChaosExperiment("Bad", Tier = OpsExecutionTier.Container)]
[CloudProvider(CloudProvider.Azure)] // ← compile error
public sealed class BadExperiment { }
Rule: Cloud Must Declare a Provider
// OPS017: Cloud experiment must specify CloudProvider
[ChaosExperiment("Bad", Tier = OpsExecutionTier.Cloud)]
// Missing [CloudProvider] ← compile error
public sealed class BadExperiment { }
Rule: Target Service Required for InProcess
// OPS018: InProcess chaos experiment must declare TargetService
[ChaosExperiment("Bad", Tier = OpsExecutionTier.InProcess)]
// Missing [TargetService] ← compile error
// (InProcess needs a service interface to generate the decorator)
public sealed class BadExperiment { }
These are not warnings. They are errors. The solution does not compile if the tier constraints are violated. This means every Ops declaration is guaranteed to be executable in its declared tier -- the artifact generation cannot produce invalid output because the input is validated at compile time.
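The four rules compose into one small validation pass. A Python sketch of the logic — the attribute-to-tier mapping is my reading of the rules above, not the analyzer's actual implementation:

```python
IN_PROCESS, CONTAINER, CLOUD = 0, 1, 2

# Minimum tier each attribute requires (assumed from the rules above).
RESOURCE_MIN_TIER = {
    "TargetService": IN_PROCESS,
    "FaultInjection": IN_PROCESS,
    "Container": CONTAINER,
    "ToxiProxy": CONTAINER,
    "NetworkFault": CONTAINER,
    "CloudProvider": CLOUD,
    "AzFailure": CLOUD,
}

def validate_experiment(tier: int, attributes: list) -> list:
    """Return diagnostic IDs; empty list means the declaration is valid."""
    errors = []
    for attr in attributes:
        if RESOURCE_MIN_TIER.get(attr, IN_PROCESS) > tier:
            # OPS014: InProcess references Container/Cloud resource.
            # OPS016: Container references Cloud resource.
            errors.append("OPS014" if tier == IN_PROCESS else "OPS016")
    if tier == CLOUD and "CloudProvider" not in attributes:
        errors.append("OPS017")  # Cloud must declare a provider
    if tier == IN_PROCESS and "TargetService" not in attributes:
        errors.append("OPS018")  # InProcess needs an interface to decorate
    return errors
```

The real analyzer runs this check against attribute symbols in the Roslyn semantic model, but the decision table is the same.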
The Progression Path
A team adopts the Ops DSL Ecosystem in stages:
Week 1: Add Ops.Chaos.Lib and Ops.Resilience.Lib. Write InProcess chaos experiments for the 3 most critical service dependencies. Run them in dotnet test. Cost: zero. Time to first result: 30 minutes.
Week 4: The InProcess experiments have caught 2 bugs in retry logic. Add Container-tier experiments for the database connection. Run them in CI with docker compose. Cost: Docker image pulls. Time to feedback: 5 minutes in CI.
Month 3: Add Ops.Performance.Lib. Write InProcess benchmarks with BenchmarkDotNet. Set performance budgets. The analyzer fails the build when a PR regresses p95 by more than 10%. Cost: zero (InProcess).
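The Month 3 p95 gate is easy to state precisely. A sketch of the check using a nearest-rank percentile — the 10% tolerance mirrors the text; the helper names are illustrative:

```python
import math

def p95(samples_ms: list) -> float:
    """Nearest-rank 95th percentile of a latency sample."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def budget_ok(current_ms: list, baseline_p95_ms: float,
              allowed_regression: float = 0.10) -> bool:
    """Fail the build if p95 regressed more than the allowed fraction."""
    return p95(current_ms) <= baseline_p95_ms * (1.0 + allowed_regression)

baseline = 180.0
assert budget_ok([150] * 95 + [190] * 5, baseline)       # p95 = 150, within budget
assert not budget_ok([150] * 90 + [250] * 10, baseline)  # p95 = 250 > 198
```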
Month 6: Add Cloud-tier load tests. The Source Generator produces k6 scripts and Terraform modules. The CI/CD pipeline runs distributed load tests before major releases. Cost: cloud compute for the test duration.
At no point does the team need to "set up chaos engineering infrastructure." They start with dotnet test and escalate as needed. The DSL attributes are the same at every stage -- the tier declaration is the only thing that changes.
Summary
The 3-Tier Execution Model is the architectural foundation of the Ops DSL Ecosystem. It makes three guarantees:
Every Ops concern is testable today -- even without Docker, even without cloud access. InProcess decorators cover 80% of fault scenarios.
The cost matches the need -- InProcess is free, Container is cheap, Cloud costs real money. The developer chooses the minimum tier that validates their hypothesis.
The constraints are enforced by the compiler -- you cannot mix tiers. An InProcess experiment cannot reference a Container resource. A Container experiment cannot reference a Cloud provider. The generated artifacts are guaranteed to be executable in their declared tier.
The rest of this series will show each DSL across all three tiers. Every attribute definition will include its tier. Every generated artifact will be annotated with where it runs. The tier is not an afterthought -- it is the first design decision for every Ops DSL.