Ops.Chaos -- Fault Injection Across Three Tiers
This is the most important part of the series. The three-tier model exists for this DSL. Chaos engineering benefits from tier progression more than any other operational concern: start with a deterministic unit test, scale to a container network partition, graduate to a cloud region failover. Same hypothesis, same abort conditions, different blast radius.
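The attributes below set `Tier = OpsExecutionTier.InProcess` and the three tiers are named InProcess, Container, and Cloud; the enum itself is defined earlier in the series. A minimal sketch of its assumed shape, for readers joining at this post:

```csharp
/// Assumed shape of the tier enum from earlier posts in this series.
public enum OpsExecutionTier
{
    InProcess,  // DI decorator faults, runs as unit tests in CI
    Container,  // Docker Compose + Toxiproxy network faults
    Cloud       // cloud provider fault injection (Terraform + Litmus + k6)
}
```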
The Problem
"We have never tested what happens when the payment gateway times out." -- said at 3 AM during an incident that lasted 4 hours.
The payment gateway returned 504 for 12 minutes. The circuit breaker was configured with a 30-second timeout, but nobody had verified that the timeout actually triggered a fallback. It did not. The circuit breaker was decorating the wrong interface — IPaymentClient instead of IPaymentGateway. Every order attempt hung for 30 seconds, then threw an unhandled TaskCanceledException that the global error handler logged as a 500 with no context.
The post-mortem action items:
- "Add chaos testing for payment gateway timeout." Status: Open. Assignee: TBD.
- "Verify circuit breaker configuration." Status: Open. Assignee: TBD.
- "Add fallback for payment timeout." Status: Open. Assignee: TBD.
Six months later, all three are still open.
What is missing:
- Chaos experiments as code. A hypothesis, a fault injection, a steady-state assertion, and an abort condition — declared in the same codebase as the service, compiled with the service, run in CI.
- Tier progression. The payment timeout experiment should run as a unit test (InProcess), as a Docker network fault (Container), and as a cloud provider failure injection (Cloud). One declaration, three generated artifacts.
- Mandatory coverage. Every [CircuitBreaker] from the Resilience DSL must have a corresponding chaos experiment. If it does not, the analyzer fires CHS001 and the build fails.
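The coverage rule behind CHS001 is simple enough to sketch. The real check is a build-time Roslyn analyzer, but the same logic can be expressed with reflection; this sketch assumes the `CircuitBreakerAttribute` from the Resilience DSL and is illustrative, not the analyzer's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// CHS001 sketch: every service marked [CircuitBreaker] (Resilience DSL)
// must appear as the [TargetService] of at least one [ChaosExperiment].
public static class Chs001Check
{
    public static IReadOnlyList<Type> FindUncoveredServices(Assembly assembly)
    {
        // Services referenced by at least one chaos experiment.
        var covered = assembly.GetTypes()
            .Where(t => t.GetCustomAttribute<ChaosExperimentAttribute>() is not null)
            .SelectMany(t => t.GetCustomAttributes<TargetServiceAttribute>())
            .Select(a => a.ServiceType)
            .ToHashSet();

        // Every circuit-broken service not in that set is a CHS001 violation.
        return assembly.GetTypes()
            .Where(t => t.GetCustomAttribute<CircuitBreakerAttribute>() is not null)
            .Where(t => !covered.Contains(t))
            .ToList();
    }
}
```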
The key principle: you write zero infrastructure code. No Docker Compose. No Toxiproxy config. No Terraform. No Litmus YAML. No bash scripts. You declare C# attributes on a class. dotnet build generates every artifact — the DI decorator, the docker-compose file, the Toxiproxy config, the Terraform module, the Litmus CRD, and the orchestration script that runs them. dotnet ops run --tier inprocess runs the InProcess decorator tests. dotnet ops run --tier container runs docker-compose up + Toxiproxy faults. dotnet ops run --tier cloud runs terraform apply + litmus + k6. All generated. All from attributes.
The Chaos Engineering Scientific Method
Every chaos experiment follows the same structure:
- Hypothesis. "When the payment gateway times out, the circuit breaker trips within 10 seconds and the order service returns a 503 with a retry-after header."
- Steady-state definition. The metrics that must remain within bounds during the experiment. If they go outside bounds, the experiment has revealed a failure.
- Fault injection. The specific failure condition introduced: timeout, exception, latency, packet loss, process kill.
- Abort conditions. The thresholds that trigger an emergency stop. If the error rate exceeds 50%, stop the experiment immediately — the system is failing in a way that harms real users.
- Observation. Did the steady-state hold? If yes, the system is resilient. If no, the experiment found a bug.
This DSL encodes all five steps as attributes.
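The five steps also map onto a runtime loop that the generated orchestration executes while the fault is active. That loop is not shown in this article; here is a conceptual sketch, where the `IMetricSource` abstraction and the method names are illustrative assumptions rather than generated API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

/// Illustrative metric reader; a real orchestrator would scrape Prometheus/OTel.
public interface IMetricSource { double Read(string metric); }

public static class ChaosProbeLoop
{
    /// Returns true if the steady state held for the whole experiment,
    /// false if it drifted or an abort condition fired.
    public static async Task<bool> RunAsync(
        IMetricSource metrics,
        IReadOnlyList<SteadyStateProbeAttribute> probes,
        IReadOnlyList<AbortConditionAttribute> aborts,
        TimeSpan duration,
        CancellationToken ct = default)
    {
        var deadline = DateTime.UtcNow + duration;
        var steadyStateHeld = true;
        var interval = TimeSpan.FromSeconds(probes.Min(p => p.ProbeIntervalSeconds));
        while (DateTime.UtcNow < deadline)
        {
            // Abort conditions first: breaching one is an emergency stop.
            foreach (var abort in aborts)
                if (metrics.Read(abort.Metric) > abort.Threshold)
                    return false;
            // Steady-state drift is a finding, not a stop: keep observing.
            foreach (var probe in probes)
                if (!Holds(metrics.Read(probe.Metric), probe.Operator, probe.Expected))
                    steadyStateHeld = false;
            await Task.Delay(interval, ct);
        }
        return steadyStateHeld;
    }

    public static bool Holds(double actual, string op, double expected) => op switch
    {
        "<=" => actual <= expected,
        ">=" => actual >= expected,
        "<"  => actual < expected,
        ">"  => actual > expected,
        "==" => Math.Abs(actual - expected) < 1e-9,
        _    => throw new ArgumentException($"Unknown operator '{op}'"),
    };
}
```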
Attribute Definitions
// =================================================================
// Ops.Chaos.Lib -- Chaos Engineering DSL Attributes
// =================================================================
/// The kind of fault to inject.
public enum FaultKind
{
Timeout, // delay exceeds timeout threshold
Exception, // throw a specific exception type
Latency, // add random delay (not necessarily timeout)
PacketLoss, // drop N% of network packets
ProcessKill, // kill a container/process
CpuStress, // consume CPU cycles
MemoryStress, // allocate memory until pressure
DiskFull, // fill disk to capacity
DnsFailure, // DNS resolution fails
DependencyTimeout, // downstream service times out
ClockSkew, // system clock drifts
DataCorruption // garble response payloads
}
/// Declares a chaos experiment.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
public sealed class ChaosExperimentAttribute : Attribute
{
public string Name { get; }
public OpsExecutionTier Tier { get; init; } = OpsExecutionTier.InProcess;
public string Hypothesis { get; init; } = "";
public string Owner { get; init; } = "";
public string Schedule { get; init; } = "";
public ChaosExperimentAttribute(string name) => Name = name;
}
/// Identifies the service or interface under attack.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class TargetServiceAttribute : Attribute
{
public Type ServiceType { get; }
public string Instance { get; init; } = "";
public TargetServiceAttribute(Type serviceType) => ServiceType = serviceType;
}
/// Defines the fault to inject.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class FaultInjectionAttribute : Attribute
{
public FaultKind Kind { get; }
public double Probability { get; init; } = 1.0;
public int DurationSeconds { get; init; } = 30;
public int DelayMs { get; init; } = 0;
public double BlastRadius { get; init; } = 1.0;
public string ExceptionType { get; init; } = "";
public FaultInjectionAttribute(FaultKind kind) => Kind = kind;
}
/// A metric that must remain within bounds during the experiment.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class SteadyStateProbeAttribute : Attribute
{
public string Metric { get; }
public string Operator { get; init; } = "<=";
public double Expected { get; }
public int ProbeIntervalSeconds { get; init; } = 5;
public SteadyStateProbeAttribute(string metric, double expected)
{
Metric = metric;
Expected = expected;
}
}
/// Emergency stop: if this threshold is breached, abort immediately.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class AbortConditionAttribute : Attribute
{
public string Metric { get; }
public double Threshold { get; }
public string Action { get; init; } = "abort";
public AbortConditionAttribute(string metric, double threshold)
{
Metric = metric;
Threshold = threshold;
}
}
/// Container-tier: declare a Toxiproxy for network fault injection.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class ToxiProxyAttribute : Attribute
{
public string Name { get; }
public string Upstream { get; }
public int ListenPort { get; init; }
public int UpstreamPort { get; init; }
public ToxiProxyAttribute(string name, string upstream)
{
Name = name;
Upstream = upstream;
}
}
/// Container-tier: declare a Docker container dependency.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class ContainerAttribute : Attribute
{
public string Name { get; }
public string Image { get; }
public int Port { get; init; }
public string[] Environment { get; init; } = [];
public ContainerAttribute(string name, string image)
{
Name = name;
Image = image;
}
}
/// Cloud-tier: declare the cloud provider for experiment execution.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
public sealed class CloudProviderAttribute : Attribute
{
public string Provider { get; }
public string Region { get; init; } = "";
public string SubscriptionId { get; init; } = "";
public CloudProviderAttribute(string provider) => Provider = provider;
}
The Experiment
[ChaosExperiment("payment-timeout",
Tier = OpsExecutionTier.InProcess,
Hypothesis = "When IPaymentGateway.ChargeAsync times out, the circuit breaker " +
"trips within 10 seconds and OrderService returns a 503 with retry-after header")]
[TargetService(typeof(IPaymentGateway))]
[FaultInjection(FaultKind.Timeout,
Probability = 1.0,
DurationSeconds = 30,
DelayMs = 35_000)] // 35s delay against a 30s timeout
[SteadyStateProbe("order_error_rate", 0.05,
Operator = "<=", ProbeIntervalSeconds = 2)]
[SteadyStateProbe("circuit_breaker_state", 1.0,
Operator = "==")] // 1.0 = Open
[AbortCondition("order_error_rate", 0.50)]
public class PaymentTimeoutChaos { }
Generated: PaymentGatewayChaosDecorator.g.cs
The generator produces a full Injectable decorator that wraps every method of IPaymentGateway with probabilistic fault injection, controlled at runtime via IChaosConfiguration.
// <auto-generated by Ops.Chaos.Generator />
// Experiment: payment-timeout
// Target: IPaymentGateway
// Fault: Timeout (P=1.0, Delay=35000ms)
using System.Diagnostics;
public sealed class PaymentGatewayChaosDecorator : IPaymentGateway
{
private readonly IPaymentGateway _inner;
private readonly IChaosConfiguration _config;
private readonly IMeterFactory _meterFactory;
private readonly ILogger<PaymentGatewayChaosDecorator> _logger;
private static readonly Meter s_meter = new("Ops.Chaos.PaymentTimeout");
private static readonly Counter<long> s_faultsInjected =
s_meter.CreateCounter<long>("chaos.faults_injected");
private static readonly Counter<long> s_faultsSkipped =
s_meter.CreateCounter<long>("chaos.faults_skipped");
public PaymentGatewayChaosDecorator(
IPaymentGateway inner,
IChaosConfiguration config,
IMeterFactory meterFactory,
ILogger<PaymentGatewayChaosDecorator> logger)
{
_inner = inner;
_config = config;
_meterFactory = meterFactory;
_logger = logger;
}
public async Task<ChargeResult> ChargeAsync(
PaymentRequest request, CancellationToken ct = default)
{
if (!_config.IsExperimentEnabled("payment-timeout"))
return await _inner.ChargeAsync(request, ct);
if (Random.Shared.NextDouble() > 1.0) // Probability = 1.0
{
s_faultsSkipped.Add(1);
return await _inner.ChargeAsync(request, ct);
}
// Inject fault: Timeout after 35000ms
_logger.LogWarning(
"Chaos: injecting Timeout fault on IPaymentGateway.ChargeAsync " +
"(experiment=payment-timeout, delay=35000ms)");
s_faultsInjected.Add(1);
await Task.Delay(TimeSpan.FromMilliseconds(35_000), ct);
// The caller's CancellationToken or Polly timeout should cancel
// before this line is reached. If it does not, that is a finding.
return await _inner.ChargeAsync(request, ct);
}
public async Task<RefundResult> RefundAsync(
RefundRequest request, CancellationToken ct = default)
{
// Same fault injection logic for all interface methods
if (!_config.IsExperimentEnabled("payment-timeout"))
return await _inner.RefundAsync(request, ct);
if (Random.Shared.NextDouble() > 1.0)
{
s_faultsSkipped.Add(1);
return await _inner.RefundAsync(request, ct);
}
_logger.LogWarning(
"Chaos: injecting Timeout fault on IPaymentGateway.RefundAsync " +
"(experiment=payment-timeout, delay=35000ms)");
s_faultsInjected.Add(1);
await Task.Delay(TimeSpan.FromMilliseconds(35_000), ct);
return await _inner.RefundAsync(request, ct);
}
}
/// Runtime configuration for chaos experiments.
public interface IChaosConfiguration
{
bool IsExperimentEnabled(string experimentName);
void EnableExperiment(string experimentName);
void DisableExperiment(string experimentName);
void DisableAll();
}
Generated: DI Registration
// <auto-generated by Ops.Chaos.Generator />
public static class PaymentTimeoutChaosRegistration
{
public static IServiceCollection AddPaymentTimeoutChaos(this IServiceCollection services)
{
services.AddSingleton<IChaosConfiguration, InMemoryChaosConfiguration>();
services.Decorate<IPaymentGateway, PaymentGatewayChaosDecorator>();
return services;
}
}
Test That Verifies the Circuit Breaker
[Fact]
public async Task PaymentTimeout_CircuitBreaker_Trips_Within_10_Seconds()
{
// Arrange
var services = new ServiceCollection();
services.AddSingleton<IPaymentGateway, RealPaymentGateway>();
services.AddPaymentTimeoutChaos();
services.AddPollyCircuitBreaker<IPaymentGateway>(options =>
{
options.TimeoutSeconds = 30;
options.FailureThreshold = 3;
options.BreakDurationSeconds = 60;
});
var provider = services.BuildServiceProvider();
var config = provider.GetRequiredService<IChaosConfiguration>();
var gateway = provider.GetRequiredService<IPaymentGateway>();
// Act: enable chaos and send requests
config.EnableExperiment("payment-timeout");
var sw = Stopwatch.StartNew();
var tasks = Enumerable.Range(0, 5).Select(async _ =>
{
try
{
await gateway.ChargeAsync(new PaymentRequest { Amount = 100 });
}
catch (BrokenCircuitException)
{
// Expected after the circuit breaker trips
}
catch (TimeoutRejectedException)
{
// Expected during the timeout phase
}
});
await Task.WhenAll(tasks);
sw.Stop();
// Assert: circuit breaker tripped (steady-state probe: circuit_breaker_state == Open)
var breaker = provider.GetRequiredService<ICircuitBreakerStateProvider>();
Assert.Equal(CircuitState.Open, breaker.GetState<IPaymentGateway>());
// Assert: it happened within 10 seconds (not 35s * 5 = 175s)
Assert.True(sw.Elapsed < TimeSpan.FromSeconds(10),
$"Circuit breaker took {sw.Elapsed.TotalSeconds:F1}s to trip, expected < 10s");
config.DisableAll();
}
This test runs in CI. Every build. No Docker. No network. Pure DI decorator injection. If someone changes the circuit breaker configuration and breaks the timeout behavior, this test catches it before the code reaches any environment.
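The generated registration binds IChaosConfiguration to an InMemoryChaosConfiguration whose source is not shown above. A minimal thread-safe sketch that satisfies the interface (the generated version may differ):

```csharp
using System.Collections.Concurrent;

// Minimal sketch of InMemoryChaosConfiguration: a concurrent set of
// enabled experiment names, toggled by the tests at runtime.
public sealed class InMemoryChaosConfiguration : IChaosConfiguration
{
    private readonly ConcurrentDictionary<string, byte> _enabled = new();

    public bool IsExperimentEnabled(string experimentName)
        => _enabled.ContainsKey(experimentName);

    public void EnableExperiment(string experimentName)
        => _enabled.TryAdd(experimentName, 0);

    public void DisableExperiment(string experimentName)
        => _enabled.TryRemove(experimentName, out _);

    public void DisableAll() => _enabled.Clear();
}
```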
The Experiment
[ChaosExperiment("database-partition",
Tier = OpsExecutionTier.Container,
Hypothesis = "When the database connection is severed, read replicas serve " +
"stale data and writes queue for retry. No data loss occurs.")]
[TargetService(typeof(IOrderRepository))]
[FaultInjection(FaultKind.PacketLoss,
Probability = 1.0,
DurationSeconds = 120)]
[SteadyStateProbe("orders_created_count", 0,
Operator = ">=")]
[SteadyStateProbe("data_loss_events", 0,
Operator = "==")]
[AbortCondition("unhandled_exception_count", 10)]
// Container infrastructure
[ToxiProxy("postgres-proxy", "postgres",
ListenPort = 5433, UpstreamPort = 5432)]
[Container("postgres", "postgres:16-alpine", Port = 5432,
Environment = ["POSTGRES_DB=orders", "POSTGRES_PASSWORD=test"])]
[Container("order-api", "order-service:latest", Port = 5000,
Environment = ["ConnectionStrings__Orders=Host=toxiproxy;Port=5433;Database=orders"])]
public class DatabasePartitionChaos { }
Generated: docker-compose.chaos.yaml
# Auto-generated by Ops.Chaos.Generator
# Experiment: database-partition
services:
postgres:
image: postgres:16-alpine
ports:
- "5432:5432"
environment:
POSTGRES_DB: orders
POSTGRES_PASSWORD: test
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 3s
retries: 5
toxiproxy:
image: ghcr.io/shopify/toxiproxy:2.9.0
ports:
- "5433:5433"
- "8474:8474" # Toxiproxy API
depends_on:
postgres:
condition: service_healthy
order-api:
image: order-service:latest
ports:
- "5000:5000"
environment:
ConnectionStrings__Orders: "Host=toxiproxy;Port=5433;Database=orders;Username=postgres;Password=test"
depends_on:
toxiproxy:
condition: service_started
postgres:
condition: service_healthy
Generated: ToxiProxyClient.g.cs
A fluent client for the Toxiproxy HTTP API, generated from the [ToxiProxy] attribute.
// <auto-generated by Ops.Chaos.Generator />
// Experiment: database-partition
public sealed class DatabasePartitionToxiProxyClient : IAsyncDisposable
{
private readonly HttpClient _http;
private readonly string _proxyName;
public DatabasePartitionToxiProxyClient(string toxiproxyHost = "localhost", int port = 8474)
{
_http = new HttpClient { BaseAddress = new Uri($"http://{toxiproxyHost}:{port}") };
_proxyName = "postgres-proxy";
}
public async Task CreateProxyAsync()
{
var proxy = new
{
name = _proxyName,
listen = "0.0.0.0:5433",
upstream = "postgres:5432",
enabled = true,
};
var response = await _http.PostAsJsonAsync("/proxies", proxy);
response.EnsureSuccessStatusCode();
}
/// Simulate a complete network partition. Toxiproxy has no probabilistic
/// packet-loss toxic, so the generator uses limit_data with bytes = 0,
/// which severs the downstream stream entirely.
public async Task InjectPacketLossAsync(double toxicity = 1.0)
{
var toxic = new
{
name = "packet_loss_downstream",
type = "limit_data",
stream = "downstream",
toxicity,
attributes = new { bytes = 0 },
};
await _http.PostAsJsonAsync($"/proxies/{_proxyName}/toxics", toxic);
}
/// Remove all toxics — restore normal connectivity.
public async Task RemoveAllToxicsAsync()
{
var response = await _http.GetAsync($"/proxies/{_proxyName}/toxics");
var toxics = await response.Content.ReadFromJsonAsync<JsonArray>();
if (toxics is not null)
{
foreach (var toxic in toxics)
{
var name = toxic!["name"]!.GetValue<string>();
await _http.DeleteAsync($"/proxies/{_proxyName}/toxics/{name}");
}
}
}
/// Disable the proxy entirely — hard cut.
public async Task DisableProxyAsync()
{
var payload = new { enabled = false };
await _http.PostAsJsonAsync($"/proxies/{_proxyName}", payload);
}
public async Task EnableProxyAsync()
{
var payload = new { enabled = true };
await _http.PostAsJsonAsync($"/proxies/{_proxyName}", payload);
}
public async ValueTask DisposeAsync()
{
try { await RemoveAllToxicsAsync(); } catch { /* cleanup best effort */ }
_http.Dispose();
}
}
Generated: TestInfraFixture.g.cs
An xUnit IAsyncLifetime fixture that starts Docker Compose, creates the proxy, runs the experiment, and tears everything down.
// <auto-generated by Ops.Chaos.Generator />
// Experiment: database-partition
public sealed class DatabasePartitionChaosFixture : IAsyncLifetime
{
private readonly string _composeFile;
public DatabasePartitionToxiProxyClient ToxiProxy { get; private set; } = null!;
public HttpClient OrderApi { get; private set; } = null!;
public DatabasePartitionChaosFixture()
{
_composeFile = Path.Combine(
AppContext.BaseDirectory, "chaos", "docker-compose.chaos.yaml");
}
public async Task InitializeAsync()
{
// Start infrastructure
await RunAsync("docker", $"compose -f {_composeFile} up -d --wait");
// Wait for services
await WaitForHealthy("http://localhost:5000/health", timeout: TimeSpan.FromSeconds(30));
// Create ToxiProxy proxy
ToxiProxy = new DatabasePartitionToxiProxyClient();
await ToxiProxy.CreateProxyAsync();
// Create API client
OrderApi = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };
}
public async Task DisposeAsync()
{
await ToxiProxy.DisposeAsync();
OrderApi.Dispose();
await RunAsync("docker", $"compose -f {_composeFile} down -v");
}
private static async Task RunAsync(string command, string args)
{
var psi = new ProcessStartInfo(command, args)
{
RedirectStandardOutput = true,
RedirectStandardError = true,
};
var process = Process.Start(psi)!;
await process.WaitForExitAsync();
if (process.ExitCode != 0)
throw new InvalidOperationException(
$"{command} {args} exited with code {process.ExitCode}");
}
private static async Task WaitForHealthy(string url, TimeSpan timeout)
{
using var http = new HttpClient();
var deadline = DateTime.UtcNow + timeout;
while (DateTime.UtcNow < deadline)
{
try
{
var response = await http.GetAsync(url);
if (response.IsSuccessStatusCode) return;
}
catch { /* retry */ }
await Task.Delay(1000);
}
throw new TimeoutException($"{url} did not become healthy within {timeout}");
}
}
[CollectionDefinition("DatabasePartitionChaos")]
public class DatabasePartitionChaosCollection
: ICollectionFixture<DatabasePartitionChaosFixture> { }
The Container-Tier Test
[Collection("DatabasePartitionChaos")]
public class DatabasePartitionChaosTests
{
private readonly DatabasePartitionChaosFixture _fixture;
public DatabasePartitionChaosTests(DatabasePartitionChaosFixture fixture)
=> _fixture = fixture;
[Fact]
public async Task Database_Partition_Does_Not_Cause_Data_Loss()
{
// 1. Create an order (baseline)
var createResponse = await _fixture.OrderApi.PostAsJsonAsync("/api/orders", new
{
customerId = "cust-001",
items = new[] { new { productId = "prod-001", quantity = 1 } },
});
Assert.Equal(HttpStatusCode.Created, createResponse.StatusCode);
var orderId = (await createResponse.Content.ReadFromJsonAsync<JsonObject>())!["id"]!.ToString();
// 2. Inject network partition
await _fixture.ToxiProxy.InjectPacketLossAsync(toxicity: 1.0);
// 3. Attempt writes during partition — should queue or return 503
var writeResponse = await _fixture.OrderApi.PostAsJsonAsync("/api/orders", new
{
customerId = "cust-002",
items = new[] { new { productId = "prod-002", quantity = 1 } },
});
// Accept 503 (Service Unavailable) or 202 (Accepted, queued for retry)
Assert.True(
writeResponse.StatusCode is HttpStatusCode.ServiceUnavailable
or HttpStatusCode.Accepted,
$"Expected 503 or 202, got {writeResponse.StatusCode}");
// 4. Reads should still work (read replica or cache)
var readResponse = await _fixture.OrderApi.GetAsync($"/api/orders/{orderId}");
Assert.Equal(HttpStatusCode.OK, readResponse.StatusCode);
// 5. Restore connectivity
await _fixture.ToxiProxy.RemoveAllToxicsAsync();
await Task.Delay(5000); // allow retry queue to drain
// 6. Verify no data loss — the queued write should have completed
if (writeResponse.StatusCode == HttpStatusCode.Accepted)
{
var queuedOrderId = writeResponse.Headers.Location?.Segments.Last();
var verifyResponse = await _fixture.OrderApi.GetAsync($"/api/orders/{queuedOrderId}");
Assert.Equal(HttpStatusCode.OK, verifyResponse.StatusCode);
}
}
}

The Experiment
[ChaosExperiment("azure-region-failover",
Tier = OpsExecutionTier.Cloud,
Hypothesis = "When the primary Azure region fails, traffic manager routes to " +
"secondary within 60 seconds. No requests are lost during failover.",
Owner = "platform-team",
Schedule = "0 3 * * 0")] // 3 AM every Sunday
[TargetService(typeof(IOrderService))]
[FaultInjection(FaultKind.ProcessKill,
DurationSeconds = 300,
BlastRadius = 1.0)] // kill the entire primary region
[SteadyStateProbe("http_success_rate", 0.95, Operator = ">=")]
[SteadyStateProbe("failover_duration_seconds", 60, Operator = "<=")]
[AbortCondition("http_success_rate", 0.50)]
[CloudProvider("azure", Region = "westeurope", SubscriptionId = "sub-prod-001")]
public class AzureRegionFailoverChaos { }

Generated: terraform/chaos-az-failover/main.tf
# Auto-generated by Ops.Chaos.Generator
# Experiment: azure-region-failover
# Schedule: Sundays at 03:00 UTC
variable "subscription_id" {
type = string
default = "sub-prod-001"
}
variable "primary_region" {
type = string
default = "westeurope"
}
variable "resource_group_name" {
type = string
}
variable "app_service_name" {
type = string
}
# Stop the primary region App Service to simulate region failure
resource "null_resource" "stop_primary" {
triggers = {
experiment = "azure-region-failover"
timestamp = timestamp()
}
provisioner "local-exec" {
command = <<-EOT
az webapp stop \
--name ${var.app_service_name} \
--resource-group ${var.resource_group_name} \
--subscription ${var.subscription_id}
EOT
}
}
# Monitor: probe the traffic manager endpoint every 5 seconds
resource "null_resource" "steady_state_probe" {
depends_on = [null_resource.stop_primary]
provisioner "local-exec" {
command = <<-EOT
      #!/bin/bash
      START=$(date +%s)
      FAILOVER_DETECTED=false
      for i in $(seq 1 60); do
        RESPONSE=$(curl -s -o /dev/null -w "%%{http_code}" \
          https://order-service.trafficmanager.net/health)
        if [ "$RESPONSE" = "200" ] && [ "$FAILOVER_DETECTED" = false ]; then
          FAILOVER_DETECTED=true
          NOW=$(date +%s)
          DURATION=$((NOW - START))
          # $${...} escapes shell expansion from Terraform interpolation
          echo "Failover detected after $${DURATION}s"
          if [ "$DURATION" -gt 60 ]; then
            echo "FAIL: failover took $${DURATION}s, budget is 60s"
            exit 1
          fi
        fi
        # Abort condition: success rate below 50% after 120s
        if [ "$i" -gt 24 ] && [ "$FAILOVER_DETECTED" = false ]; then
          echo "ABORT: no failover after 120s, success rate critical"
          exit 1
        fi
        sleep 5
      done
      if [ "$FAILOVER_DETECTED" = false ]; then
        echo "FAIL: failover never completed"
        exit 1
      fi
EOT
}
}
# Restore: restart the primary region after experiment
resource "null_resource" "restore_primary" {
depends_on = [null_resource.steady_state_probe]
provisioner "local-exec" {
command = <<-EOT
az webapp start \
--name ${var.app_service_name} \
--resource-group ${var.resource_group_name} \
--subscription ${var.subscription_id}
EOT
}
}

Generated: litmus-experiment.yaml
For Kubernetes environments, the generator produces a LitmusChaos CRD instead of Terraform:
# Auto-generated by Ops.Chaos.Generator
# Experiment: azure-region-failover (Kubernetes variant)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: azure-region-failover
namespace: order-service
labels:
ops.chaos/experiment: azure-region-failover
ops.chaos/tier: cloud
ops.chaos/owner: platform-team
spec:
appinfo:
appns: order-service
applabel: app=order-service
appkind: deployment
engineState: active
chaosServiceAccount: litmus-admin
monitoring: true
experiments:
- name: pod-delete
spec:
probe:
- name: http-success-rate
type: httpProbe
mode: Continuous
runProperties:
probeTimeout: 5
retry: 60
interval: 5
httpProbe/inputs:
url: https://order-service.trafficmanager.net/health
method:
get:
criteria: ==
responseCode: "200"
- name: failover-duration
type: cmdProbe
mode: OnChaos
runProperties:
probeTimeout: 60
retry: 1
cmdProbe/inputs:
command: ./scripts/measure-failover-duration.sh
comparator:
type: int
criteria: "<="
value: "60"
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "300"
- name: CHAOS_INTERVAL
value: "10"
- name: FORCE
value: "true"
- name: PODS_AFFECTED_PERC
value: "100" # BlastRadius = 1.0
annotationCheck: "true"
---
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
name: azure-region-failover-schedule
namespace: order-service
spec:
schedule:
repeat:
properties:
minChaosInterval:
hour:
everyNthHour: 168 # weekly
startTime:
hour: 3
weekDay: "Sun"
engineTemplateSpec:
engineState: active
chaosServiceAccount: litmus-admin
experiments:
      - name: pod-delete

Analyzer Diagnostics
| ID | Severity | Rule | Example |
|---|---|---|---|
| CHS001 | Warning | CircuitBreaker without chaos experiment | [CircuitBreaker] on IPaymentGateway but no [ChaosExperiment] targets typeof(IPaymentGateway). The circuit breaker has never been tested. |
| CHS002 | Error | Experiment without abort conditions | [ChaosExperiment("x")] has no [AbortCondition]. Without a kill switch, a chaos experiment can cause an outage. |
| CHS003 | Error | Container-tier without ToxiProxy | [ChaosExperiment("x", Tier = Container)] with FaultKind.PacketLoss but no [ToxiProxy] declared. Network faults require a proxy. |
| CHS004 | Error | Cloud-tier without CloudProvider | [ChaosExperiment("x", Tier = Cloud)] but no [CloudProvider] attribute. The generator does not know which provider to target. |
| CHS005 | Warning | FaultKind.DataCorruption without test assertion | A DataCorruption fault is injected but no test method asserts data integrity after the experiment. |
| CHS006 | Info | Experiment not scheduled | [ChaosExperiment("x", Tier = Cloud)] has no Schedule. Cloud experiments should run on a recurring basis, not just manually. |
CHS001 is the integration point with the Resilience DSL. Every [CircuitBreaker], every [RetryPolicy], every [Timeout] from Ops.Resilience should have a matching chaos experiment that verifies the policy actually works. A policy without one is a fire extinguisher that has never been inspected: it might work, but you do not know.
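To make CHS002 concrete, here is a minimal sketch using the attribute shapes shown in this section (the class names are illustrative, not generated output):

```csharp
// CHS002 (error): the experiment declares no kill switch.
[ChaosExperiment("payment-timeout", Tier = OpsExecutionTier.InProcess)]
[TargetService(typeof(IPaymentGateway))]
[FaultInjection(FaultKind.Timeout, DelayMs = 35_000)]
public class PaymentTimeoutChaosUnsafe { }

// Fixed: abort automatically when the steady-state metric collapses.
[ChaosExperiment("payment-timeout", Tier = OpsExecutionTier.InProcess)]
[TargetService(typeof(IPaymentGateway))]
[FaultInjection(FaultKind.Timeout, DelayMs = 35_000)]
[AbortCondition("http_success_rate", 0.50)]
public class PaymentTimeoutChaos { }
```

The abort threshold is the line between an experiment and an outage: once http_success_rate drops below 0.50, the orchestrator removes the fault and stops the run.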
Chaos to Resilience
Bidirectional. The Resilience DSL declares the policies. The Chaos DSL verifies them.
// Resilience DSL declares the circuit breaker
[CircuitBreaker(typeof(IPaymentGateway),
TimeoutSeconds = 30, FailureThreshold = 3, BreakDurationSeconds = 60)]
// Chaos DSL tests it — CHS001 fires if this experiment does not exist
[ChaosExperiment("payment-timeout", Tier = OpsExecutionTier.InProcess)]
[TargetService(typeof(IPaymentGateway))]
[FaultInjection(FaultKind.Timeout, DelayMs = 35_000)]

Chaos to Observability
Every [SteadyStateProbe] references a metric from the Observability DSL. The generator verifies that the metric exists and has the correct type:
[SteadyStateProbe("http_success_rate", 0.95)]
// Requires: [OpsMetric("http_success_rate", MetricKind.Gauge)] in Observability DSL

Chaos to Requirements
Chaos experiments link to resilience acceptance criteria:
[ChaosExperiment("payment-timeout")]
[OpsRequirementLink("FEATURE-456-AC3",
"System must handle payment gateway timeouts without data loss")][ChaosExperiment("payment-timeout")]
[OpsRequirementLink("FEATURE-456-AC3",
"System must handle payment gateway timeouts without data loss")]The compliance report shows which resilience requirements have been verified by chaos experiments and which have not.
Chaos to LoadTesting
The combination of chaos and load testing is the most powerful cross-DSL scenario. A spike load test running simultaneously with a latency fault injection tests the system under realistic stress:
// Load test injects traffic
[LoadTest("order-spike", Tier = OpsExecutionTier.Container)]
[LoadProfile(ConcurrentUsers = 100)]
[TrafficPattern(TrafficShape.Spike, PeakMultiplier = 5.0)]
// Chaos experiment injects faults during the load test
[ChaosExperiment("payment-timeout-under-load", Tier = OpsExecutionTier.Container)]
[FaultInjection(FaultKind.Latency, DelayMs = 5000, Probability = 0.3)]

The generated docker-compose file merges both: k6 sends traffic while Toxiproxy injects latency. The steady-state probes verify that the error rate stays below the threshold even under combined stress.
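As a sketch of the merged artifact (service names, images, and ports here are assumptions for illustration, not the generator's literal output), the combined compose file wires all three containers together:

```yaml
# Hypothetical shape of the merged docker-compose file (illustrative only)
services:
  order-service:
    image: order-service:local
    ports: ["5000:8080"]
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:latest
    ports: ["8474:8474"]        # control API; the latency toxic is injected here
  k6:
    image: grafana/k6:latest
    command: run /scripts/order-spike.js   # generated from [LoadTest("order-spike")]
    volumes: ["./k6:/scripts"]
    depends_on: [order-service, toxiproxy]
```

Requests from k6 are routed through the toxiproxy container, so the [FaultInjection] latency applies to exactly the traffic the load test generates.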