🚀 Building an AI Incident Copilot: How I Automated the First 15 Minutes of Every Production Incident

Every production incident follows the same painful ritual.

An alert fires at 2am. An engineer wakes up, SSH’s into a server, and begins the manual loop — pulling logs, scanning for errors, guessing what to check next. This loop can take 15 to 45 minutes before the real diagnosis even begins. Multiply that by every incident across every team in your organisation, and you have thousands of engineering hours lost every year to work that is repetitive, stressful, and largely automatable.

I’ve been on that on-call rotation. I know what it costs — not just in time, but in cognitive load, in missed context, and in the compounding pressure of an active incident. So I built incopilot: a CLI tool that automates the entire first-pass triage so engineers can skip straight to actual problem-solving.

This post walks through the architecture, the design decisions, and exactly how to build it yourself. Everything is open source at https://github.com/AutoShiftOps/incopilot.

Architecture

Project structure

incopilot/
  __init__.py
  cli.py          # argument parsing + console output
  collectors.py   # journalctl, docker logs, file, bundle
  analyzer.py     # pattern detection + line normalization
  reporter.py     # report.md / report.json generation
  config.py       # patterns, golden-signal map, safe-command list
scripts/
  demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md

Setup

git clone https://github.com/AutoShiftOps/incopilot.git
cd incopilot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick test (no real services needed)

python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log
ls out/

Systemd journal triage

python -m incopilot journal --unit nginx --since "30 min ago"

Docker triage

python -m incopilot docker --container my-api --since 1h

Both sources (bundle)

python -m incopilot bundle \
  --unit nginx \
  --container my-api \
  --since-journal "30 min ago" \
  --since-docker 1h

What you get

out/report.md — paste into your incident doc
out/report.json — attach to a ticket or POST to a webhook

What to improve next

Per-service pattern packs (nginx, postgres, java, node)
Slack/Teams webhook posting (--webhook <url>)
Unit tests + GitHub Actions CI
Scheduled timer (systemd timer unit) for proactive reports

Sudhakar Sajja is an Application Architect at TechMahindra with 13 years of experience across protocol testing, SDET, DevOps, and cloud architecture. He specialises in AI-powered DevOps operations — building tools that use LLMs to replace manual incident response and query diagnostics. He writes weekly at AutoShiftOps (autoshiftops.com) and built QueryTuner (querytuner.com), an AI-driven SQL query analysis tool. Based in Mississauga, Canada.