Incident Response Runbook Template for DevOps

Incidents are stressful when the team is improvising. A simple runbook reduces MTTR by making response repeatable, not heroic. This post provides a ready-to-use incident response runbook template plus a practical Linux triage checklist you can run from any box.

What this runbook optimizes for

  • Fast acknowledgement and clear ownership (Incident Commander + roles).
  • Early impact assessment and severity assignment to avoid under/over‑reacting.
  • Communication cadence and “known/unknown/next update” structure that builds trust.
  • Evidence capture (commands + logs) to support post‑incident review.

The incident runbook template

Copy this into your internal wiki, README, Notion, or ops repo.

  1. Trigger

    Triggers:

    • Monitoring alert / SLO breach
    • Customer report escalated (a quick probe to confirm is sketched below)
    • Internal detection (logs, latency spikes, error spikes)
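
    If the trigger is a customer report, a quick synthetic probe confirms whether the problem reproduces from outside before you spin up the full response (the URL here is a placeholder for your own health endpoint):

     # One request: status code and total time; run it a few times for signal
     curl -s -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' https://example.com/healthz
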
  2. Acknowledge (0–5 minutes)
    • Acknowledge page/alert in your paging system.
    • Create an incident channel: #inc-YYYYMMDD-service-shortdesc.
    • Assign Incident Commander (IC) and Comms Lead.
    • Start an incident document: timeline + links + decisions (a timestamping helper is sketched below).
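
    Timeline entries are easiest to keep honest with a tiny helper; this is just a sketch, and the file name is an example:

     # Append a UTC-timestamped entry to the incident timeline
     note() { echo "$(date -u +'%Y-%m-%dT%H:%M:%SZ') $*" >> incident-timeline.md; }
     note "Paged for elevated 5xx on checkout; IC: alice, comms: bob"
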
  3. Assess severity (5–10 minutes)

    Answer quickly:

    • What’s impacted (service, region, feature)?
    • How many users affected? Any revenue or compliance impact? (a rough error-rate check is sketched below)
    • Is impact ongoing and spreading?

    Suggested severity:

    • SEV1: Major outage / severe user impact; immediate coordination.
    • SEV2: Partial outage / significant degradation; urgent but controlled.
    • SEV3: Minor impact; can be handled async.
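
    To put a number on user impact quickly, a rough 5xx rate from the web server's access log helps; this sketch assumes nginx's default combined log format at the standard path:

     # 5xx share of the last 10,000 requests ($9 is the status field in the combined format)
     tail -n 10000 /var/log/nginx/access.log | awk '$9 ~ /^5/ { err++ } END { printf "5xx: %d/%d (%.1f%%)\n", err, NR, err * 100 / NR }'
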
  4. Stabilize first (10–30 minutes)

    Goal: stop the bleeding before chasing root cause.

    Typical mitigations (example commands after the list):

    • Roll back the last deploy/config change.
    • Disable a feature flag.
    • Scale up/out temporarily.
    • Fail over if safe.
    • Rate-limit or block abusive traffic.
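
    If you happen to run on Kubernetes, rollback and temporary scale-out are one-liners; the deployment name below is a placeholder:

     # Roll back the most recent rollout
     kubectl rollout undo deployment/myapp
     # Buy headroom by scaling out temporarily
     kubectl scale deployment/myapp --replicas=6
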
  5. Triage checklist (host-level)

    Run these to establish the baseline quickly (copy/paste friendly); a script that snapshots them all into one evidence file follows the checklist.

    CPU

     ps aux --sort=-%cpu | head -15
    

    Alert cue: any single process sustained above 50% CPU.

    Memory

     free -h
    

    Alert cue: available memory below 20% of total RAM.

    Disk

     df -h
     du -sh /var/log/* 2>/dev/null | sort -h | tail -10
    

    Alert cue: any filesystem >90% full.

    Disk I/O

     iostat -x 1 3
    

    Alert cue: %util >80%, await >20ms.

    Network listeners

     ss -tuln
    

    Alert cue: unexpected listeners/ports.

    Logs (example: nginx)

     journalctl -u nginx -f
    

    Alert cue: 5xx errors spiking.
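
    To support the evidence-capture goal from the intro, a minimal snapshot script (assuming the same tools as above are installed) writes each baseline command's output to one timestamped file:

     # Capture the whole host baseline into a single evidence file
     f="triage-$(date -u +%Y%m%dT%H%M%SZ).txt"
     for c in "ps aux --sort=-%cpu | head -15" "free -h" "df -h" "iostat -x 1 3" "ss -tuln"; do
       { echo "== $c =="; sh -c "$c"; echo; } >> "$f"
     done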

  6. Comms cadence (keep it boring)

    • SEV1: updates every 10–15 minutes.
    • SEV2: updates every 30 minutes.
    • SEV3: async updates acceptable.

    Use this structure (a one-liner that pre-fills it follows the list):

    • What we know
    • What we don’t know
    • What we’re doing now
    • Next update at: TIME
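
    A one-liner can print the skeleton with the next update time pre-filled (GNU date syntax for -d; adjust the interval to match the SEV):

     # Update skeleton with "next update" stamped 15 minutes out (GNU date)
     printf "What we know:\nWhat we don't know:\nWhat we're doing now:\nNext update at: %s UTC\n" "$(date -u -d '+15 minutes' +%H:%M)"
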
  7. Verify resolution
    • Confirm user impact is gone (synthetic checks + error rate + latency).
    • Confirm saturation is back to normal (CPU/memory/disk/I/O).
    • Watch for 30–60 minutes for regression (e.g., with the poll loop below).
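
    A simple poll loop covers the watch window; the URL is a placeholder, and you stop it with Ctrl-C once you're confident:

     # Check status and latency once a minute during the regression watch
     while true; do
       curl -s -o /dev/null -w "$(date -u +%H:%M:%S) status=%{http_code} time=%{time_total}s\n" https://example.com/healthz
       sleep 60
     done
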
  8. Close and learn (post-incident)
    • Write a brief timeline (detection → mitigation → resolution).
    • Capture what worked, what didn’t, and what to automate.
    • Create follow-ups: alerts tuning, runbook updates, tests, guardrails.

Bonus: “Golden signals” lens for incidents

When you’re lost, anchor on the four golden signals (one quick command per signal follows the list):

  • Latency (are requests slower?)
  • Traffic (is demand abnormal?)
  • Errors (is failure rate rising?)
  • Saturation (are resources hitting limits?)
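
If you want one rough command per signal, something like this works; the URL and log path are illustrative, not prescriptive:

 # Latency: time one request end to end
 curl -s -o /dev/null -w 'latency: %{time_total}s\n' https://example.com/
 # Traffic: established TCP connections right now
 ss -Htan state established | wc -l
 # Errors: 5xx count in the last 1,000 requests (nginx combined format)
 tail -n 1000 /var/log/nginx/access.log | awk '$9 ~ /^5/' | wc -l
 # Saturation: load, memory, disk at a glance
 uptime && free -h && df -h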

This keeps triage focused on user impact and system limits, not vanity metrics.

Download / reuse

If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.