Incident Response Runbook Template for DevOps
Incidents are stressful when the team is improvising. A simple runbook reduces MTTR by making response repeatable, not heroic. This post provides a ready-to-use incident response runbook template plus a practical Linux triage checklist you can run from any box.
What this runbook optimizes for
- Fast acknowledgement and clear ownership (Incident Commander + roles).
- Early impact assessment and severity assignment to avoid under/over‑reacting.
- Communication cadence and “known/unknown/next update” structure that builds trust.
- Evidence capture (commands + logs) to support post‑incident review.
The incident runbook template
Copy this into your internal wiki, README, Notion, or ops repo.
Trigger
Any of these can open an incident:
- Monitoring alert / SLO breach
- Escalated customer report
- Internal detection (logs, latency spikes, error spikes)
Acknowledge (0–5 minutes)
- Acknowledge page/alert in your paging system.
- Create an incident channel: #inc-YYYYMMDD-service-shortdesc.
- Assign Incident Commander (IC) and Comms Lead.
- Start an incident document: timeline + links + decisions.
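The incident document is mostly a timeline, and timestamps are easy to get wrong under pressure. A tiny shell helper keeps them consistent; the file path, channel name, service, and entries below are hypothetical examples, not part of the template.

```bash
# Minimal timeline helper, a sketch: appends UTC-timestamped entries to a
# shared notes file. Path, names, and entries are all made up.
note() { echo "$(date -u +'%Y-%m-%d %H:%M:%SZ')  $*" >> /tmp/inc-20240101-checkout-5xx.md; }

note "Paged for elevated 5xx on checkout-api; IC: alice, Comms Lead: bob"
note "Created #inc-20240101-checkout-5xx; suspecting the 14:00 UTC deploy"
```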
Assess severity (5–10 minutes)
Answer quickly:
- What’s impacted (service, region, feature)?
- How many users are affected? Is there revenue or compliance impact?
- Is the impact ongoing and spreading?
Suggested severity levels:
- SEV1: Major outage / severe user impact; immediate coordination.
- SEV2: Partial outage / significant degradation; urgent but controlled.
- SEV3: Minor impact; can be handled async.
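For the “ongoing and spreading?” question, a per-minute error count from an access log gives a quick read. This is only a sketch: it assumes nginx with the default combined log format (status code in field 9, timestamp in field 4) and the default log path; adjust for your stack.

```bash
# 5xx responses per minute, for the most recent minutes that had any.
# Assumes nginx "combined" log format and /var/log/nginx/access.log.
awk '$9 ~ /^5/ { sub(/^\[/, "", $4); print substr($4, 1, 17) }' /var/log/nginx/access.log \
  | uniq -c | tail -10
```

If the counts are flat or falling, you likely have time to coordinate; if they are climbing, treat it as spreading.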
Stabilize first (10–30 minutes)
Goal: stop the bleeding before chasing root cause.
Typical mitigations:
- Roll back the last deploy/config change.
- Disable a feature flag.
- Scale up/out temporarily.
- Fail over if safe.
- Rate-limit or block abusive traffic.
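What a rollback or scale-out looks like depends entirely on your platform. As one hedged example, if the service happens to run as a Kubernetes Deployment (the deployment name and namespace below are made up), it might be:

```bash
# Roll back and temporarily scale out a hypothetical Deployment "checkout-api"
# in namespace "prod". Adapt names and sizing to your environment.
kubectl -n prod rollout history deployment/checkout-api   # confirm what changed recently
kubectl -n prod rollout undo deployment/checkout-api      # revert to the previous revision
kubectl -n prod rollout status deployment/checkout-api    # wait until the rollback settles

kubectl -n prod scale deployment/checkout-api --replicas=6  # temporary headroom while you investigate
```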
Triage checklist (host-level)
Run these to establish the baseline quickly (copy/paste friendly).
CPU
ps aux --sort=-%cpu | head -15
Alert cue: any process >50% CPU sustained.
Memory
free -h
Alert cue: available memory <20% of total RAM.
Disk
df -h
du -sh /var/log/* 2>/dev/null | sort -h | tail -10
Alert cue: any filesystem >90% full.
Disk I/O
iostat -x 1 3
Alert cue: %util >80%, await >20 ms.
Network listeners
ss -tuln
Alert cue: unexpected listeners/ports.
Logs (example: nginx)
journalctl -u nginx -f
Alert cue: 5xx errors spiking.
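During a real incident it helps to capture this baseline once and attach it to the incident doc rather than retyping commands later. A minimal sketch (the output path is an assumption):

```bash
# Run the baseline checks once and save the output for the incident doc.
out="/tmp/triage-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== host: $(hostname)  time: $(date -u) =="
  echo "== cpu ==";       ps aux --sort=-%cpu | head -15
  echo "== memory ==";    free -h
  echo "== disk ==";      df -h
  echo "== disk io ==";   iostat -x 1 3
  echo "== listeners =="; ss -tuln
} | tee "$out"
echo "Triage snapshot saved to $out"
```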
Comms cadence (keep it boring)
- SEV1: updates every 10–15 minutes.
- SEV2: updates every 30 minutes.
- SEV3: async updates acceptable.
Use this structure:
- What we know
- What we don’t know
- What we’re doing now
- Next update at: TIME
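If your updates go to Slack, the same structure can be posted from the command line. This is only a sketch: it assumes a Slack incoming webhook whose URL is stored in SLACK_WEBHOOK_URL, and every incident detail in the message is made up.

```bash
# Post a status update using the know / don't know / doing / next-update
# structure. SLACK_WEBHOOK_URL and all incident details are assumptions.
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "*INC update (SEV2, checkout-api)*\n- Know: 5xx rate elevated in eu-west-1 since 14:05 UTC\n- Unknown: root cause (deploy vs dependency)\n- Doing: rolling back the 14:00 UTC deploy\n- Next update: 14:45 UTC"
  }'
```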
Verify resolution
- Confirm user impact is gone (synthetic checks + error rate + latency).
- Confirm saturation is back to normal (CPU, memory, disk, I/O).
- Watch for 30–60 minutes for regression.
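A throwaway synthetic check is often enough to confirm recovery from the user's side. The endpoint, interval, and iteration count below are placeholders; swap in a real user-facing check.

```bash
# Poll a health endpoint and print status code + total request time.
# URL, interval, and count are hypothetical.
url="https://example.com/healthz"
for i in $(seq 1 10); do
  curl -sS -o /dev/null \
    -w "$(date -u +%H:%M:%S)  status=%{http_code}  time=%{time_total}s\n" "$url"
  sleep 30
done
```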
Close and learn (post-incident)
- Write a brief timeline (detection → mitigation → resolution).
- Capture what worked, what didn’t, and what to automate.
- Create follow-ups: alert tuning, runbook updates, tests, guardrails.
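A small seed file keeps the review from starting on a blank page. One possible skeleton, not a prescribed format (the filename and section list are only suggestions):

```bash
# Generate a post-incident review skeleton next to the incident doc.
# "checkout-api" is a hypothetical service name.
cat > "postmortem-$(date +%Y%m%d)-checkout-api.md" <<'EOF'
# Post-incident review: <service>, <date>

## Summary (what happened, user impact, duration)
## Timeline (detection -> mitigation -> resolution, times in UTC)
## What worked / what didn't
## Follow-ups (each with an owner and due date)
- Alert tuning
- Runbook updates
- Tests / guardrails
EOF
```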
Bonus: “Golden signals” lens for incidents
When you’re lost, anchor on the four golden signals:
- Latency (are requests slower?)
- Traffic (is demand abnormal?)
- Errors (is failure rate rising?)
- Saturation (are resources hitting limits?)
This keeps triage focused on user impact and system limits, not vanity metrics.
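If all you have handy is an access log, you can approximate three of the four signals from it. This sketch assumes nginx combined format with the request time appended as the last field of each line (that last part is a custom log_format, not the default); saturation is already covered by the host-level checklist above.

```bash
# Traffic, error rate, and mean latency from the last 5000 access-log lines.
# Assumes status in field 9 and request time as the final field ($NF).
tail -n 5000 /var/log/nginx/access.log | awk '
  { total++; if ($9 ~ /^5/) errors++; latency_sum += $NF }
  END {
    if (total == 0) { print "no requests sampled"; exit }
    printf "traffic: %d requests sampled\n", total
    printf "errors:  %.1f%% 5xx\n", errors * 100 / total
    printf "latency: %.3fs mean\n", latency_sum / total
  }'
```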
Download / reuse
If you reuse this template internally, make one improvement immediately: add links to dashboards, logs, deploy history, and owners for each service. Your future self will thank you.