Build an AI Incident Copilot (CLI) in Python

When an incident hits, most engineers repeat the same manual loop: pull recent logs, scan for errors, and guess what to check next.

This post builds incopilot, a CLI tool that automates first-pass triage:

  • Collect logs from systemd journal and/or Docker
  • Detect high-signal patterns (timeouts, OOM, disk full, 5xx, panics)
  • Map findings to the Four Golden Signals
  • Output report.md + report.json ready to paste into an incident doc
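
To make the detection and mapping steps above concrete, here is a minimal sketch of how a pattern table might tie each regex to a golden signal. The names PATTERNS and analyze, the regexes, and the signal assignments are illustrative assumptions, not the actual contents of analyzer.py or config.py.

```python
import re
from collections import Counter

# Hypothetical pattern table: name -> (regex, golden signal).
# Both the regexes and the signal assignments are assumptions for illustration.
PATTERNS = {
    "timeout": (re.compile(r"timed out|timeout", re.I), "latency"),
    "oom": (re.compile(r"out of memory|oom-kill", re.I), "saturation"),
    "disk_full": (re.compile(r"no space left on device", re.I), "saturation"),
    # Require an HTTP or status prefix so "took 512 ms" is not counted as a 5xx.
    "http_5xx": (re.compile(r'(?:HTTP/\d(?:\.\d)?"?\s+|status[=: ]+)5\d{2}\b', re.I), "errors"),
    "panic": (re.compile(r"\bpanic\b|Traceback \(most recent call last\)", re.I), "errors"),
}


def analyze(lines):
    """Count lines matching each pattern and roll the counts up per signal."""
    hits, signals = Counter(), Counter()
    for line in lines:
        for name, (regex, signal) in PATTERNS.items():
            if regex.search(line):
                hits[name] += 1
                signals[signal] += 1
    return hits, signals


sample = [
    "upstream timed out (110: Connection timed out) while reading response header",
    '"GET /api HTTP/1.1" 502 157',
    "Out of memory: Killed process 1234 (java)",
    "write /var/lib/app/data: no space left on device",
    "request took 512 ms",
]
hits, signals = analyze(sample)
print(dict(hits))
print(dict(signals))
```

Nothing in this table maps to the traffic signal: request volume is hard to infer from error patterns alone, so a real implementation would probably need a separate rate-based check for it.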

Safe by design: incopilot only suggests what to check next and never runs destructive commands.
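
One way to enforce that guarantee is to filter every suggested command through an allow-list of read-only tools before printing it. The sketch below is an assumption about how such a check could look; SAFE_COMMANDS, SAFE_SUBCOMMANDS, and is_safe are hypothetical names, and the real safe-command list in config.py may differ.

```python
import shlex

# Hypothetical allow-list of read-only commands (an assumption, not the project's list).
SAFE_COMMANDS = {"journalctl", "docker", "df", "free", "uptime", "systemctl"}
# For tools that also have mutating subcommands, allow only the read-only ones.
SAFE_SUBCOMMANDS = {
    "docker": {"logs", "ps", "inspect"},
    "systemctl": {"status", "is-active"},
}
# Shell metacharacters that could chain or redirect a second command.
FORBIDDEN_CHARS = set(";|&><$`\n")


def is_safe(suggestion):
    """Return True only if the suggestion is a single allow-listed, read-only command."""
    if any(ch in FORBIDDEN_CHARS for ch in suggestion):
        return False
    try:
        tokens = shlex.split(suggestion)
    except ValueError:  # unbalanced quotes and similar
        return False
    if not tokens or tokens[0] not in SAFE_COMMANDS:
        return False
    allowed = SAFE_SUBCOMMANDS.get(tokens[0])
    if allowed is not None:
        return len(tokens) > 1 and tokens[1] in allowed
    return True


for cmd in ["df -h", "docker logs my-api --since 1h", "docker rm my-api", "df -h; rm -rf /"]:
    print(f"{cmd!r}: {'ok' if is_safe(cmd) else 'rejected'}")
```

This sketch checks only the command and subcommand, not individual flags, so a stricter version would also need to vet options that mutate state.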

Architecture

[Figure: architecture diagram]

Project structure

incopilot/
  __init__.py
  cli.py          # argument parsing + console output
  collectors.py   # journalctl, docker logs, file, bundle
  analyzer.py     # pattern detection + line normalization
  reporter.py     # report.md / report.json generation
  config.py       # patterns, golden-signal map, safe-command list
scripts/
  demo_generate_sample_logs.py
posts/
requirements.txt
pyproject.toml
README.md
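
The "line normalization" listed for analyzer.py is worth a quick illustration. A common approach, assumed here rather than taken from the project, is to replace variable tokens (timestamps, IPs, numbers) with placeholders so that repeated messages collapse into one countable shape. The rule list and the normalize function are hypothetical.

```python
import re
from collections import Counter

# Hypothetical normalization rules, applied in order (timestamps before plain numbers).
_RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<n>"),
]


def normalize(line):
    """Replace variable tokens with placeholders so similar lines compare equal."""
    out = line.strip()
    for regex, token in _RULES:
        out = regex.sub(token, out)
    return out


lines = [
    "2024-05-01T12:00:03Z worker 17 failed after 3021 ms from 10.0.0.5",
    "2024-05-01T12:00:09Z worker 4 failed after 88 ms from 10.0.0.9",
]
shapes = Counter(normalize(line) for line in lines)
for shape, count in shapes.most_common():
    print(count, shape)
```

Grouping by normalized shape lets a report say "this message occurred 2 times" instead of listing near-identical lines.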

Setup

git clone https://github.com/AutoShiftOps/incopilot.git
cd incopilot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick test (no real services needed)

python scripts/demo_generate_sample_logs.py
python -m incopilot file --path sample.log
ls out/

Systemd journal triage

python -m incopilot journal --unit nginx --since "30 min ago"
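
Under the hood, a journal collector most likely shells out to journalctl. The sketch below shows one way to do that with subprocess; journal_command and collect_journal are hypothetical names, and the choice of the short-iso output format is an assumption rather than the project's documented behavior.

```python
import subprocess


def journal_command(unit, since):
    """Build the journalctl argument list (no shell, so no quoting issues)."""
    return [
        "journalctl",
        "--unit", unit,
        "--since", since,
        "--no-pager",            # never block waiting on a pager
        "--output", "short-iso",  # ISO 8601 timestamps are easier to parse
    ]


def collect_journal(unit, since, timeout=30):
    """Run journalctl and return its output as a list of lines."""
    result = subprocess.run(
        journal_command(unit, since),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"journalctl failed: {result.stderr.strip()}")
    return result.stdout.splitlines()


print(journal_command("nginx", "30 min ago"))
```

Passing the arguments as a list keeps a value such as "30 min ago" intact as a single argument without any shell escaping.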

Docker triage

python -m incopilot docker --container my-api --since 1h
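
A Docker collector can follow the same shape by wrapping docker logs. One detail worth showing: docker logs replays the container's stdout and stderr on separate streams, so the sketch below merges them. The function names and the use of --timestamps are assumptions for illustration.

```python
import subprocess


def docker_command(container, since):
    """Build the docker logs argument list."""
    return ["docker", "logs", "--since", since, "--timestamps", container]


def collect_docker(container, since, timeout=30):
    """Run docker logs and return the container's combined stdout and stderr lines."""
    result = subprocess.run(
        docker_command(container, since),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into one capture
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"docker logs failed: {result.stdout.strip()}")
    return result.stdout.splitlines()


print(docker_command("my-api", "1h"))
```

Merging the streams loses the distinction between stdout and stderr; a collector that cares about that difference would capture them separately instead.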

Both sources (bundle)

python -m incopilot bundle \
  --unit nginx \
  --container my-api \
  --since-journal "30 min ago" \
  --since-docker 1h
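
The four invocations shown so far (file, journal, docker, bundle) map naturally onto argparse subcommands. This is a sketch of how cli.py might declare them; the flag names come from the commands above, while the defaults, the required settings, and the dest name "source" are assumptions.

```python
import argparse


def build_parser():
    """Declare one subcommand per log source."""
    parser = argparse.ArgumentParser(prog="incopilot", description="First-pass incident triage")
    sub = parser.add_subparsers(dest="source", required=True)

    p_file = sub.add_parser("file", help="analyze a log file")
    p_file.add_argument("--path", required=True)

    p_journal = sub.add_parser("journal", help="analyze a systemd unit's journal")
    p_journal.add_argument("--unit", required=True)
    p_journal.add_argument("--since", default="30 min ago")  # assumed default

    p_docker = sub.add_parser("docker", help="analyze a container's logs")
    p_docker.add_argument("--container", required=True)
    p_docker.add_argument("--since", default="1h")  # assumed default

    p_bundle = sub.add_parser("bundle", help="analyze journal and Docker logs together")
    p_bundle.add_argument("--unit", required=True)
    p_bundle.add_argument("--container", required=True)
    p_bundle.add_argument("--since-journal", default="30 min ago")
    p_bundle.add_argument("--since-docker", default="1h")
    return parser


args = build_parser().parse_args(
    ["bundle", "--unit", "nginx", "--container", "my-api",
     "--since-journal", "30 min ago", "--since-docker", "1h"]
)
print(args.source, args.unit, args.container, args.since_journal, args.since_docker)
```

argparse converts dashes in option names to underscores, which is why --since-journal is read back as args.since_journal.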

What you get

  • out/report.md: paste into your incident doc
  • out/report.json: attach to a ticket or POST to a webhook
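
For a sense of how the two files can be produced from the same findings, here is a minimal reporter sketch. The findings schema (a "patterns" list with name, signal, and count), the Markdown layout, and the write_reports name are all assumptions; the real reporter.py output may look different.

```python
import json
from pathlib import Path


def write_reports(findings, out_dir="out"):
    """Write report.json and report.md from one findings dict; return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / "report.json"
    json_path.write_text(json.dumps(findings, indent=2, sort_keys=True) + "\n", encoding="utf-8")

    # Render the same data as a Markdown table for pasting into an incident doc.
    lines = [
        "# Incident triage report",
        "",
        "| Pattern | Signal | Count |",
        "| --- | --- | --- |",
    ]
    for item in findings["patterns"]:
        lines.append(f"| {item['name']} | {item['signal']} | {item['count']} |")
    md_path = out / "report.md"
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return md_path, json_path
```

A call such as write_reports({"patterns": [{"name": "timeout", "signal": "latency", "count": 3}]}) would create out/report.md and out/report.json under the current directory.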

What to improve next

  • Per-service pattern packs (nginx, postgres, java, node)
  • Slack/Teams webhook posting (--webhook <url>)
  • Unit tests + GitHub Actions CI
  • Scheduled timer (systemd timer unit) for proactive reports
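
As a starting point for the webhook item, the sketch below builds a JSON POST with only the standard library. Slack's incoming webhooks accept a JSON body with a "text" field; other services may expect a different payload. The report schema, the function names, and the example URL are hypothetical.

```python
import json
import urllib.request


def build_webhook_request(url, report):
    """Build (but do not send) a JSON POST summarizing the report."""
    summary = ", ".join(f"{p['name']} x{p['count']}" for p in report["patterns"]) or "no findings"
    body = json.dumps({"text": f"incopilot triage: {summary}"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_report(url, report, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_webhook_request(url, report), timeout=timeout) as resp:
        return resp.status


# Building the request has no side effects, so it is safe to inspect offline.
req = build_webhook_request(
    "https://example.invalid/hook",  # placeholder URL, not a real endpoint
    {"patterns": [{"name": "timeout", "count": 3}]},
)
print(req.get_method(), req.data.decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to unit test without any network access.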