
Ops.Incident -- On-Call, Escalation, and Status Pages

"Who's on call?" "Check the spreadsheet." "Which spreadsheet?" -- said during a P1 incident while the error rate climbed to 47%.


The Problem

The payment service went down at 2:17 AM on a Saturday. The alerting system fired a PagerDuty notification to the on-call engineer. The on-call engineer was Sarah, who had left the company three weeks earlier; nobody had updated the rotation. The alert went unacknowledged for 22 minutes.

The secondary on-call was a shared Slack channel. The channel had 847 unread messages. The alert was lost in the noise.

The engineering manager finally noticed because a customer tweeted about the outage. He called the last person who had committed to the payment service -- David, who was on vacation in a different timezone. David opened his laptop at a hotel, VPN'd in, and spent 40 minutes figuring out what was broken because the alert said "PaymentService: CRITICAL" with no runbook link, no severity definition, no response procedure.

The status page still showed "All Systems Operational" at 3:15 AM because the person responsible for updating it was Sarah, who no longer worked there. Customers found out about the outage from Twitter, not from the company's own status page.

The post-mortem was due the following Monday. It was written as a Google Doc with no template. The doc said "payment service went down, we fixed it" in two paragraphs. No timeline. No root cause analysis. No action items. Three months later, the same failure mode caused another outage.

Every piece of this is a knowledge management failure:

  • On-call rotations lived in a Google Sheet that was updated manually. Team changes, departures, and vacations were not reflected.
  • Escalation policies existed in PagerDuty but had not been reviewed since the team was reorganized six months ago. The tiers referenced roles that no longer existed.
  • Status page components were created during the initial launch and never updated. Three new services had been added with no corresponding status page components.
  • Post-mortem quality depended entirely on who wrote it. There was no template, no required sections, no deadline enforcement.
  • Severity definitions were tribal knowledge. "P1 means the site is down" was the extent of the documentation. Response time expectations, notification channels, and escalation timelines were undefined.

The Incident DSL makes all of this compiled, validated, and generated.


OnCallRotation

[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class OnCallRotationAttribute : Attribute
{
    public OnCallRotationAttribute(string team,
        string[] members) { }

    /// <summary>
    /// How long each person is on call before rotation.
    /// Default: 7 days (weekly rotation).
    /// </summary>
    public string RotationPeriod { get; init; } = "7d";

    /// <summary>
    /// How long before an unacknowledged alert escalates.
    /// Default: 15 minutes.
    /// </summary>
    public string EscalationTimeout { get; init; } = "15m";

    /// <summary>
    /// Time zone for the rotation schedule.
    /// Members in different zones get adjusted windows.
    /// </summary>
    public string TimeZone { get; init; } = "UTC";

    /// <summary>
    /// Hours during which this rotation is active.
    /// Null = 24/7. Format: "09:00-18:00" for business hours.
    /// </summary>
    public string? ActiveHours { get; init; }

    /// <summary>
    /// Override schedule source. Null = use member list as-is.
    /// "pagerduty://team-backend" = sync from PagerDuty.
    /// </summary>
    public string? ScheduleSource { get; init; }
}
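
Given the member list and a day-based RotationPeriod, resolving "who is paged right now" is simple arithmetic. A minimal sketch -- the resolver name and the anchor-date convention are assumptions, not part of the DSL:

```csharp
using System;

// Hypothetical resolver: compute the current on-call member from a rotation
// declaration. Assumes the rotation started at an agreed anchor date and
// that nowUtc >= anchorUtc.
public static class RotationResolver
{
    public static string CurrentOnCall(
        string[] members, int rotationDays, DateTime anchorUtc, DateTime nowUtc)
    {
        if (members.Length == 0)
            throw new ArgumentException("rotation has no members");

        // How many whole rotation periods have elapsed since the anchor.
        int daysElapsed = (int)(nowUtc - anchorUtc).TotalDays;
        int rotationIndex = (daysElapsed / rotationDays) % members.Length;
        return members[rotationIndex];
    }
}
```

With four members on a 7-day rotation anchored at day zero, day 21 falls in the fourth window, so the fourth member is paged.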

EscalationPolicy

[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class EscalationPolicyAttribute : Attribute
{
    public EscalationPolicyAttribute(string name,
        string[] tiers) { }

    /// <summary>
    /// Timeout per tier in minutes.
    /// Length must match Tiers length.
    /// e.g., [15, 30, 60] = 15 min for tier 1, 30 for tier 2, 60 for tier 3.
    /// </summary>
    public int[] TimeoutPerTierMinutes { get; init; }
        = new[] { 15, 30, 60 };

    /// <summary>
    /// What happens after all tiers are exhausted.
    /// </summary>
    public EscalationFallback Fallback { get; init; }
        = EscalationFallback.NotifyAll;

    /// <summary>
    /// Number of times to repeat the escalation cycle
    /// before triggering the fallback.
    /// </summary>
    public int RepeatCycles { get; init; } = 2;

    /// <summary>
    /// Notification channels at each tier.
    /// Format: "tier-name:channel" e.g., "oncall:pagerduty", "eng-manager:phone".
    /// Null = use default channel for each tier.
    /// </summary>
    public string[]? NotificationChannels { get; init; }
}

public enum EscalationFallback
{
    /// <summary>Notify all members of all tiers simultaneously.</summary>
    NotifyAll,

    /// <summary>Page the CTO/VP Engineering.</summary>
    ExecutiveEscalation,

    /// <summary>Trigger automated remediation (rollback, failover).</summary>
    AutoRemediate
}
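
The tier list, per-tier timeouts, and RepeatCycles together define a flat paging timeline. A sketch of how the generator might expand one -- the names are illustrative, not the generator's real API:

```csharp
using System;
using System.Collections.Generic;

// Sketch: expand an escalation policy into the minute offsets at which each
// tier is paged, repeating the whole cycle RepeatCycles times before the
// fallback would trigger.
public static class EscalationTimeline
{
    public static List<(int MinuteOffset, string Tier)> Expand(
        string[] tiers, int[] timeoutMinutes, int repeatCycles)
    {
        if (tiers.Length != timeoutMinutes.Length)
            throw new ArgumentException("tier/timeout length mismatch"); // INC004

        var timeline = new List<(int, string)>();
        int offset = 0;
        for (int cycle = 0; cycle < repeatCycles; cycle++)
        {
            for (int i = 0; i < tiers.Length; i++)
            {
                timeline.Add((offset, tiers[i]));  // page this tier now
                offset += timeoutMinutes[i];       // wait before escalating
            }
        }
        return timeline;
    }
}
```

With the default timeouts [15, 30, 60] and two cycles, the tiers are paged at minutes 0, 15, 45, 105, 120, and 150.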

StatusPage

[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class StatusPageAttribute : Attribute
{
    public StatusPageAttribute(string name,
        string[] components) { }

    /// <summary>
    /// Status page provider.
    /// </summary>
    public StatusPageProvider Provider { get; init; }
        = StatusPageProvider.Statuspage;

    /// <summary>
    /// Whether component status should be updated automatically
    /// based on health check results from the Observability DSL.
    /// </summary>
    public bool AutoUpdateFromHealthChecks { get; init; } = true;

    /// <summary>
    /// Minimum incident duration before status page is updated.
    /// Prevents flapping from briefly triggering status changes.
    /// </summary>
    public string MinDurationBeforeUpdate { get; init; } = "5m";

    /// <summary>
    /// Component groups for visual organization on the status page.
    /// Format: "group-name:component1,component2".
    /// </summary>
    public string[]? ComponentGroups { get; init; }
}

public enum StatusPageProvider
{
    Statuspage,    // Atlassian Statuspage
    Cachet,        // Open-source Cachet
    Instatus,      // Instatus
    Custom         // Custom implementation via IStatusPageUpdater
}
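
The MinDurationBeforeUpdate debounce can be sketched as a small state machine: the published component status only changes once the underlying health state has been stable for the configured window. The types below are illustrative assumptions:

```csharp
using System;

// Sketch of status flap suppression. A pending state must persist for
// _minDuration before it replaces the published state.
public sealed class StatusDebouncer
{
    private readonly TimeSpan _minDuration;
    private string _published = "operational";
    private string _pending = "operational";
    private DateTime _pendingSince = DateTime.MinValue;

    public StatusDebouncer(TimeSpan minDuration) => _minDuration = minDuration;

    // Called on every health check observation; returns the status to show.
    public string Observe(string health, DateTime nowUtc)
    {
        if (health != _pending)
        {
            _pending = health;        // state changed: restart the clock
            _pendingSince = nowUtc;
        }
        if (_pending != _published && nowUtc - _pendingSince >= _minDuration)
            _published = _pending;    // stable long enough: publish it
        return _published;
    }
}
```

A 30-second blip never reaches the status page; a sustained outage does, after the 5-minute default.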

PostMortemTemplate

[AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
public sealed class PostMortemTemplateAttribute : Attribute
{
    public PostMortemTemplateAttribute(
        string[] requiredSections) { }

    /// <summary>
    /// Maximum time after incident resolution for post-mortem to be filed.
    /// Default: 3 business days.
    /// </summary>
    public string DueWithin { get; init; } = "3bd";

    /// <summary>
    /// Whether the post-mortem must include at least one action item.
    /// A post-mortem without action items is just a story.
    /// </summary>
    public bool RequiresActionItems { get; init; } = true;

    /// <summary>
    /// Minimum number of action items required.
    /// </summary>
    public int MinActionItems { get; init; } = 1;

    /// <summary>
    /// Whether action items must have owners assigned.
    /// </summary>
    public bool ActionItemsRequireOwners { get; init; } = true;

    /// <summary>
    /// Whether action items must have deadlines.
    /// </summary>
    public bool ActionItemsRequireDeadlines { get; init; } = true;

    /// <summary>
    /// Required reviewers for the post-mortem document.
    /// Null = no review gate.
    /// </summary>
    public string[]? Reviewers { get; init; }
}
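
The "required sections" check on a filed post-mortem reduces to comparing the attribute's section list against the document's headings. A sketch under the assumption that sections are level-two Markdown headings, as in the generated template:

```csharp
using System;
using System.Linq;

// Illustrative check, not the analyzer's actual implementation: every
// required section must appear as a "## " heading in the filed document.
public static class PostMortemChecker
{
    public static string[] MissingSections(string markdown, string[] required)
    {
        var headings = markdown
            .Split('\n')
            .Where(l => l.StartsWith("## "))
            .Select(l => l.Substring(3).Trim())
            .ToHashSet(StringComparer.OrdinalIgnoreCase);

        return required.Where(s => !headings.Contains(s)).ToArray();
    }
}
```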

IncidentSeverity

[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class IncidentSeverityAttribute : Attribute
{
    public IncidentSeverityAttribute(SeverityLevel level,
        string description) { }

    /// <summary>
    /// Maximum time to first response after alert fires.
    /// Null = no target (INC003 requires one for P1).
    /// </summary>
    public string? ResponseTime { get; init; }

    /// <summary>
    /// Notification channels for this severity.
    /// e.g., ["pagerduty", "slack:#incidents", "phone:eng-manager"].
    /// </summary>
    public string[] NotifyChannels { get; init; }
        = Array.Empty<string>();

    /// <summary>
    /// Whether an incident commander must be designated.
    /// </summary>
    public bool RequiresIncidentCommander { get; init; } = false;

    /// <summary>
    /// Whether a status page update is required.
    /// </summary>
    public bool RequiresStatusPageUpdate { get; init; } = false;

    /// <summary>
    /// Whether a post-mortem is required after resolution.
    /// </summary>
    public bool RequiresPostMortem { get; init; } = true;

    /// <summary>
    /// Maximum time to resolution. Null = no target.
    /// Used for SLA tracking, not for automatic escalation.
    /// </summary>
    public string? ResolutionTarget { get; init; }
}
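
The duration strings used throughout the DSL ("5m", "1h", "7d") parse with a one-character suffix convention. A minimal sketch; business-day durations like "3bd" need calendar logic and are deliberately out of scope here:

```csharp
using System;

// Sketch of the DSL's duration format. Supports minutes, hours, and days;
// rejects business-day values, which require a working-day calendar.
public static class DurationParser
{
    public static TimeSpan Parse(string value)
    {
        if (value.EndsWith("bd"))
            throw new NotSupportedException(
                "business-day durations need calendar logic");

        char unit = value[value.Length - 1];
        int amount = int.Parse(value.Substring(0, value.Length - 1));
        return unit switch
        {
            'm' => TimeSpan.FromMinutes(amount),
            'h' => TimeSpan.FromHours(amount),
            'd' => TimeSpan.FromDays(amount),
            _   => throw new FormatException($"unknown duration unit '{unit}'")
        };
    }
}
```

So "5m" becomes a five-minute TimeSpan and "7d" a seven-day one, letting the generator and analyzer compare targets numerically instead of as strings.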

public enum SeverityLevel
{
    /// <summary>
    /// Complete service outage affecting all users.
    /// Revenue impact. Executive visibility.
    /// </summary>
    P1,

    /// <summary>
    /// Major feature degradation affecting many users.
    /// Significant user impact but service partially functional.
    /// </summary>
    P2,

    /// <summary>
    /// Minor feature degradation affecting some users.
    /// Workaround available.
    /// </summary>
    P3,

    /// <summary>
    /// Cosmetic issue or minor inconvenience.
    /// No immediate user impact.
    /// </summary>
    P4
}

Declaration Example

[OnCallRotation("payments-backend",
    new[] { "alice@company.com", "bob@company.com",
            "carol@company.com", "david@company.com" },
    RotationPeriod = "7d",
    EscalationTimeout = "10m",
    TimeZone = "America/New_York")]

[EscalationPolicy("payments-critical",
    new[] { "oncall", "team-lead", "eng-manager", "vp-engineering" },
    TimeoutPerTierMinutes = new[] { 10, 20, 30, 60 },
    NotificationChannels = new[]
    {
        "oncall:pagerduty",
        "team-lead:pagerduty+slack",
        "eng-manager:phone+slack",
        "vp-engineering:phone"
    },
    Fallback = EscalationFallback.NotifyAll)]

[StatusPage("payments",
    new[] { "Payment Processing", "Payment Gateway",
            "Refund Service", "Invoice Generation" },
    Provider = StatusPageProvider.Statuspage,
    AutoUpdateFromHealthChecks = true,
    ComponentGroups = new[] { "Core:Payment Processing,Payment Gateway",
                              "Supporting:Refund Service,Invoice Generation" })]

[PostMortemTemplate(
    new[] { "Summary", "Impact", "Timeline", "Root Cause",
            "Contributing Factors", "Resolution",
            "Action Items", "Lessons Learned" },
    DueWithin = "3bd",
    RequiresActionItems = true,
    MinActionItems = 2,
    ActionItemsRequireOwners = true,
    ActionItemsRequireDeadlines = true,
    Reviewers = new[] { "team-lead", "eng-manager" })]

[IncidentSeverity(SeverityLevel.P1,
    "Complete payment processing outage",
    ResponseTime = "5m",
    NotifyChannels = new[] { "pagerduty", "slack:#p1-incidents",
                             "phone:eng-manager", "phone:vp-engineering" },
    RequiresIncidentCommander = true,
    RequiresStatusPageUpdate = true,
    RequiresPostMortem = true,
    ResolutionTarget = "1h")]

[IncidentSeverity(SeverityLevel.P2,
    "Payment processing degraded or single gateway down",
    ResponseTime = "15m",
    NotifyChannels = new[] { "pagerduty", "slack:#incidents" },
    RequiresIncidentCommander = false,
    RequiresStatusPageUpdate = true,
    RequiresPostMortem = true,
    ResolutionTarget = "4h")]

[IncidentSeverity(SeverityLevel.P3,
    "Invoice generation delayed or minor payment UI issues",
    ResponseTime = "1h",
    NotifyChannels = new[] { "slack:#incidents" },
    RequiresPostMortem = false,
    ResolutionTarget = "24h")]

[IncidentSeverity(SeverityLevel.P4,
    "Cosmetic payment page issues, non-blocking logging errors",
    ResponseTime = "4h",
    NotifyChannels = new[] { "slack:#incidents-low" },
    RequiresPostMortem = false)]

public partial class PaymentServiceOps { }

One class. The entire incident management posture for the payment service. Who is on call, when they rotate, how alerts escalate, what the status page shows, what the post-mortem must contain, and what each severity level means. Every piece validated at compile time.


PagerDuty Escalation Policy

The source generator reads [EscalationPolicy] and emits a PagerDuty-compatible JSON configuration:

{
  "escalation_policy": {
    "name": "payments-critical",
    "escalation_rules": [
      {
        "escalation_delay_in_minutes": 10,
        "targets": [
          {
            "type": "schedule_reference",
            "id": "{{resolve:pagerduty:schedule:payments-backend-oncall}}"
          }
        ]
      },
      {
        "escalation_delay_in_minutes": 20,
        "targets": [
          {
            "type": "user_reference",
            "id": "{{resolve:pagerduty:user:team-lead}}"
          }
        ]
      },
      {
        "escalation_delay_in_minutes": 30,
        "targets": [
          {
            "type": "user_reference",
            "id": "{{resolve:pagerduty:user:eng-manager}}"
          }
        ]
      },
      {
        "escalation_delay_in_minutes": 60,
        "targets": [
          {
            "type": "user_reference",
            "id": "{{resolve:pagerduty:user:vp-engineering}}"
          }
        ]
      }
    ],
    "num_loops": 2,
    "description": "Auto-generated from Ops.Incident DSL"
  }
}

The {{resolve:pagerduty:...}} placeholders are resolved during deployment by a thin CLI wrapper that calls the PagerDuty API. The generator does not need API keys; it produces the shape, and the deployment pipeline fills in the IDs.
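
The placeholder pass itself is a regex substitution. A sketch -- the lookup delegate stands in for the actual PagerDuty API call, which is outside this sketch:

```csharp
using System.Text.RegularExpressions;

// Sketch of deployment-time placeholder resolution: find every
// {{resolve:pagerduty:<type>:<name>}} token and substitute the real ID
// via a caller-supplied lookup.
public static class PlaceholderResolver
{
    private static readonly Regex Token =
        new(@"\{\{resolve:pagerduty:(?<type>[^:]+):(?<name>[^}]+)\}\}");

    public static string Resolve(string json,
        System.Func<string, string, string> lookupId)
        => Token.Replace(json,
            m => lookupId(m.Groups["type"].Value, m.Groups["name"].Value));
}
```

The CLI would call Resolve with a lookup that queries the provider once per distinct (type, name) pair and caches the result.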

An equivalent opsgenie-config.json is emitted when the paging provider is set to Opsgenie. Same attributes, different output format. The DSL abstracts the provider.

Status Page Component Definitions

{
  "components": [
    {
      "name": "Payment Processing",
      "group": "Core",
      "automation": {
        "health_check_ref": "PaymentServiceOps.PaymentProcessing",
        "min_duration_before_update": "5m"
      }
    },
    {
      "name": "Payment Gateway",
      "group": "Core",
      "automation": {
        "health_check_ref": "PaymentServiceOps.PaymentGateway",
        "min_duration_before_update": "5m"
      }
    },
    {
      "name": "Refund Service",
      "group": "Supporting",
      "automation": {
        "health_check_ref": "PaymentServiceOps.RefundService",
        "min_duration_before_update": "5m"
      }
    },
    {
      "name": "Invoice Generation",
      "group": "Supporting",
      "automation": {
        "health_check_ref": "PaymentServiceOps.InvoiceGeneration",
        "min_duration_before_update": "5m"
      }
    }
  ]
}

When AutoUpdateFromHealthChecks = true, the generator cross-references health checks from the Observability DSL. Each status page component maps to a health check endpoint. A thin sidecar polls the health checks and updates the status page provider API. No manual status page updates during incidents.
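
The sidecar's core job is a mapping from health-check results to provider status vocabulary. A sketch; the enum names are assumptions, and the string values follow Atlassian Statuspage's component states ("operational", "degraded_performance", "partial_outage", "major_outage"):

```csharp
// Sketch of the sidecar's health-to-status mapping. Real providers differ
// in vocabulary; this targets the Statuspage component-state strings.
public enum HealthState { Healthy, Degraded, Unhealthy }

public static class ComponentStatusMapper
{
    public static string ToComponentStatus(HealthState health) => health switch
    {
        HealthState.Healthy   => "operational",
        HealthState.Degraded  => "degraded_performance",
        HealthState.Unhealthy => "major_outage",
        _                     => "operational"
    };
}
```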

PostMortemTemplate.g.md

The generator emits a standardized post-mortem Markdown document:

# Post-Mortem: [Incident Title]

**Date:** [YYYY-MM-DD]
**Severity:** [P1/P2/P3/P4]
**Duration:** [Start Time] - [End Time] ([Total Duration])
**Incident Commander:** [Name]
**Author:** [Name]
**Due Date:** [Incident Resolution Date + 3 business days]

---

## Summary

[One paragraph describing what happened, the impact, and the resolution.]

## Impact

- **Users affected:** [Number or percentage]
- **Revenue impact:** [Estimated if applicable]
- **Duration of user-facing impact:** [Time]
- **Services affected:** [List]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired: [alert name] |
| HH:MM | On-call acknowledged |
| HH:MM | [Key event] |
| HH:MM | Resolution applied |
| HH:MM | Monitoring confirmed recovery |

## Root Cause

[Detailed technical explanation of what caused the incident.]

## Contributing Factors

- [Factor 1: e.g., missing monitoring for X]
- [Factor 2: e.g., outdated runbook for Y]
- [Factor 3: e.g., lack of chaos testing for Z]

## Resolution

[What was done to resolve the incident. Include commands, config changes,
rollbacks, or hotfixes applied.]

## Action Items

| # | Action | Owner | Deadline | Priority | Status |
|---|--------|-------|----------|----------|--------|
| 1 | [Action description] | [Name] | [Date] | [P1-P4] | Open |
| 2 | [Action description] | [Name] | [Date] | [P1-P4] | Open |

**Minimum action items required: 2**
**All action items must have an owner and deadline.**

## Lessons Learned

- **What went well:** [e.g., alert fired quickly, team responded within SLA]
- **What went poorly:** [e.g., runbook was outdated, escalation path unclear]
- **Where we got lucky:** [e.g., happened during business hours, not peak traffic]

---

*Reviewers: team-lead, eng-manager*
*This template was generated by Ops.Incident DSL. Required sections are enforced at compile time.*

This is not a wiki page someone wrote and forgot. It is generated from the [PostMortemTemplate] attribute. The required sections match exactly. If someone adds a ninth required section to the attribute, the template regenerates. If someone removes the "Action Items" section from a completed post-mortem, the analyzer flags it.

IncidentResponseGuide.g.md

Per-severity response procedures, generated from [IncidentSeverity] attributes:

# Incident Response Guide -- PaymentServiceOps

## P1: Complete payment processing outage

- **Response time target:** 5 minutes
- **Notification channels:** PagerDuty, Slack #p1-incidents,
  Phone: eng-manager, Phone: vp-engineering
- **Incident commander required:** Yes
- **Status page update required:** Yes
- **Resolution target:** 1 hour
- **Post-mortem required:** Yes (due within 3 business days)

### Response Steps
1. Acknowledge the alert within 5 minutes.
2. Designate an incident commander.
3. Open a dedicated Slack channel: #incident-[date]-[short-desc].
4. Update the status page: Payment Processing -> Major Outage.
5. Begin diagnosis using the linked runbook.
6. Post updates to the status page every 15 minutes.
7. After resolution, schedule post-mortem within 3 business days.

### Escalation Path
| Tier | Role | Timeout | Channel |
|------|------|---------|---------|
| 1 | On-call engineer | 10 min | PagerDuty |
| 2 | Team lead | 20 min | PagerDuty + Slack |
| 3 | Engineering manager | 30 min | Phone + Slack |
| 4 | VP Engineering | 60 min | Phone |

---

## P2: Payment processing degraded or single gateway down

- **Response time target:** 15 minutes
- **Notification channels:** PagerDuty, Slack #incidents
- **Incident commander required:** No
- **Status page update required:** Yes
- **Resolution target:** 4 hours
- **Post-mortem required:** Yes (due within 3 business days)

[... similar structure ...]

## P3: Invoice generation delayed or minor payment UI issues

- **Response time target:** 1 hour
- **Notification channels:** Slack #incidents
- **Status page update required:** No
- **Resolution target:** 24 hours
- **Post-mortem required:** No

## P4: Cosmetic payment page issues, non-blocking logging errors

- **Response time target:** 4 hours
- **Notification channels:** Slack #incidents-low
- **Status page update required:** No
- **Resolution target:** No target
- **Post-mortem required:** No

Every on-call engineer gets the same response guide. It is not buried in a wiki. It is generated, versioned, and deployed with the service.


INC001: Critical Alert Without Escalation Policy

error INC001: IncidentSeverity P1 "Complete payment processing outage"
    has no matching EscalationPolicy. P1 and P2 severities require an
    escalation policy with at least 2 tiers.
    Add [EscalationPolicy] to PaymentServiceOps.

A P1 or P2 severity without an escalation policy means alerts fire into the void. The analyzer enforces that critical severities have a matching escalation path.

INC002: Deployed Service Without On-Call

error INC002: PaymentServiceOps has [DeploymentApp("payment-service")]
    but no [OnCallRotation]. Every deployed service must have an on-call
    rotation. Add [OnCallRotation] to PaymentServiceOps.

Cross-references the Deployment DSL. If a class has a [DeploymentApp] attribute, it must also have an [OnCallRotation]. A service without an on-call rotation is a service that nobody will fix when it breaks.

INC003: P1 Severity Without Response Time Target

error INC003: IncidentSeverity P1 "Complete payment processing outage"
    has no ResponseTime target. P1 incidents require a response time
    target for SLA tracking.

A P1 definition without a response time target is a P1 definition without accountability. The analyzer enforces that the most critical severity levels have explicit response time expectations.

INC004: Escalation Tier Count Mismatch

error INC004: EscalationPolicy "payments-critical" has 4 tiers but
    TimeoutPerTierMinutes has 3 entries. Each tier must have a
    corresponding timeout.

Array length mismatches between tiers and timeouts. A simple validation that prevents silent misconfiguration.
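
The check itself is one comparison. A sketch of what the analyzer would do with the two arrays, with the diagnostic plumbing elided and the method name illustrative:

```csharp
// Sketch of the INC004 validation: a tier without a timeout (or vice versa)
// is a silent misconfiguration, so the arrays must be the same length.
public static class Inc004Check
{
    public static string? Validate(
        string policyName, string[] tiers, int[] timeoutMinutes)
        => tiers.Length == timeoutMinutes.Length
            ? null  // no diagnostic
            : $"INC004: EscalationPolicy \"{policyName}\" has {tiers.Length} " +
              $"tiers but TimeoutPerTierMinutes has {timeoutMinutes.Length} entries.";
}
```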

INC005: Post-Mortem Template Missing Required Section

warning INC005: PostMortemTemplate requires "Action Items" section but
    RequiresActionItems is false. This is contradictory -- either
    remove "Action Items" from required sections or set
    RequiresActionItems to true.

Catches contradictions between the required sections list and the boolean flags.


Incident to Observability

The Observability DSL declares alerts with [AlertRule]. The Incident DSL declares severity levels with [IncidentSeverity]. The connection:

// In the Observability DSL declaration:
[AlertRule("payment-error-rate",
    Query = "rate(http_requests_total{service='payment',status='5xx'}[5m]) > 0.05",
    Severity = AlertSeverity.Critical)]

// In the Incident DSL declaration (same class):
[IncidentSeverity(SeverityLevel.P1,
    "Complete payment processing outage",
    ResponseTime = "5m")]

The generator links AlertSeverity.Critical alerts to the escalation policy for P1 incidents. When the alert fires, the escalation policy activates. No manual mapping. No PagerDuty console configuration. The alert and the escalation are declared together, compiled together, generated together.
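
The linkage can be sketched as a fixed mapping from Observability alert severities to Incident severity levels. The AlertSeverity names come from the Observability example above; the mapping itself is an assumption about the generator's convention:

```csharp
// Sketch of the severity linkage. SeverityLevel is the enum defined earlier
// in this document; the Critical->P1 pairing mirrors the example, while the
// Warning and Info pairings are illustrative assumptions.
public enum AlertSeverity { Critical, Warning, Info }

public static class SeverityLink
{
    public static SeverityLevel ToIncidentSeverity(AlertSeverity s) => s switch
    {
        AlertSeverity.Critical => SeverityLevel.P1,
        AlertSeverity.Warning  => SeverityLevel.P2,
        _                      => SeverityLevel.P3
    };
}
```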

Incident to Deployment

When the Deployment DSL's [CanaryStrategy] detects a failed deployment (error rate exceeds threshold during canary rollout), the Incident DSL generates an automatic incident creation hook:

  • Severity: P2 (degraded service, not full outage -- canary caught it)
  • Notification: the escalation policy for the service
  • Status page: component set to "Degraded Performance"
  • Auto-action: trigger rollback from the Resilience DSL's [RollbackPlan]

The generated deployment pipeline includes this logic. A failed canary is not just a failed deployment -- it is an incident with a severity, an escalation path, and a response procedure.

Incident to Resilience

The Resilience DSL's [RollbackPlan] is an incident response action. When a P1 incident is declared, the response guide includes:

### Automated Remediation Options
- **Rollback:** Execute rollback plan "payment-service-rollback"
  (rollback to previous version, estimated time: 3 minutes)
- **Circuit breaker:** Activate circuit breaker on payment gateway
  (from Resilience DSL, trips after 5 failures in 30s)
- **Feature flag:** Disable payment-v2 feature flag
  (from EnvironmentParity DSL, immediate effect)

These are not suggestions typed into a wiki. They are cross-references to typed attributes in other DSLs, resolved at compile time, included in the generated response guide.


The Before and After

Before the Incident DSL:

  • On-call rotation: Google Sheet, manually updated, stale within weeks
  • Escalation policy: PagerDuty console, configured once, never reviewed
  • Status page: Updated manually during incidents (if someone remembers)
  • Post-mortem: Google Doc, no template, variable quality, no deadline
  • Severity definitions: tribal knowledge, different per team
  • Response procedures: wiki page, last updated 8 months ago

After the Incident DSL:

  • On-call rotation: compiled attribute, validated against team membership, generates PagerDuty config
  • Escalation policy: compiled attribute, tier count validated against timeout count, generates provider config
  • Status page: compiled attribute, auto-updated from health checks, components match deployed services
  • Post-mortem: generated template with required sections, due date enforced, action items require owners and deadlines
  • Severity definitions: compiled attributes with response time targets, notification channels, and escalation paths
  • Response procedures: generated per-severity guide with cross-references to rollback plans, circuit breakers, and feature flags

The incident management posture ships with the service. When the service is deployed, the on-call rotation, escalation policy, status page components, and severity definitions are deployed with it. When the service is decommissioned, the incident management configuration is decommissioned with it. No orphaned PagerDuty services. No stale wiki pages. No spreadsheets.


What This Is Not

This DSL does not replace PagerDuty or OpsGenie. It generates their configuration. PagerDuty is still the runtime that pages people at 2 AM. The DSL is the source of truth for what PagerDuty should be configured to do.

This DSL does not automate incident response. It documents incident response in a way that is compiled, validated, and always current. The human still decides whether to roll back; the DSL ensures the rollback plan is documented and the escalation path is defined.

This DSL does not eliminate post-mortems. It makes them consistent. Every post-mortem has the same sections, the same quality bar, the same deadline. The humans still write the content. The DSL ensures the structure.

The compiler does not respond to incidents. But it ensures that when an incident happens, the response path is defined, the escalation is configured, the status page components exist, and the post-mortem template is ready. Everything that can be known before the incident is known. Everything that can be validated before the incident is validated. The 2 AM scramble becomes a 2 AM execution of a compiled plan.
