Most teams don’t have an alerting problem. They have a decision problem.
Over time, SRE and DevOps teams add more checks, more monitors, more “smart” tools. The result is often alert fatigue: hundreds of alerts, dozens of dashboards, and still no clear answer to the question “Do we wake someone up for this?”
In this post we’ll do two things:
- Design a simple Alert Decision Layer: an explicit step between alerts and humans.
- Build a small, working CLI in Python called `alertdecider` that you can run today.
By the end you’ll have:
- A clear architecture for routing alerts to `page`, `ticket`, `aggregate`, or `suppress`.
- A CLI that reads alerts from JSON, applies transparent rules, and outputs a Markdown + JSON report.
- A foundation you can later extend with AI without turning your incident pipeline into a black box.
## The problem: alerts without decisions
Alert fatigue has been written about a lot, but the symptoms are consistent:
- On-call engineers get spammed by low-value alerts, become desensitized, and miss the important ones.
- Teams add more tools and even AI summaries, but humans still have to manually decide “page vs ticket vs ignore” for each signal.
- Runbooks and SLOs exist, but the mental model that turns raw alerts into decisions lives in people’s heads.
What’s missing is a small decision layer that sits between your alert sources and your incident response.
That layer should be:
- Transparent – decisions come from rules everyone can understand and modify.
- Context-aware – severity alone isn’t enough; you need service tier, environment, and history.
- Composable – easy to extend later with AI-based explanations instead of replacing human judgment.
That’s what we’ll build.
## Design principles
Before touching code, let's describe the design from an architect's point of view:

- **Normalize first, then decide.** Every alert source has its own JSON schema. We'll normalize into a simple `Alert` model (id, name, service, severity, environment, fingerprint, timestamps) before applying any rules.
- **Service profiles matter.** A `critical` alert on a tier1, SLO-critical service is not the same as a `critical` on a best-effort internal tool. We encode this in a `ServiceProfile` model loaded from `services.yml`.
- **History influences action.** A fingerprint that fired once is different from a fingerprint that fired 40 times today. We'll use a small `History` model to treat flapping/noisy alerts differently.
- **Rules are code, not magic.** The `AlertDecisionEngine` is a small, explicit rule engine. The first version has no AI, just clear if/else policies you can review in code.
- **Decisions must be explainable.** Every decision carries a `reason` string. This is where AI can plug in later to generate richer explanations, but today it's hand-written and predictable.
## Architecture

Here's the high-level architecture of the Alert Decision Layer:

- **Sources**
  - Prometheus Alertmanager, PagerDuty, or any system that can export alerts as JSON.
- **Normalizer (`loader.py`)**
  - Reads raw JSON and maps each item into an `Alert` dataclass with a consistent shape.
- **Context loaders**
  - `services.yml` → a `ServiceProfile` per service (tier, SLO criticality, owner).
  - `history.json` → a `History` per fingerprint (how often we've seen this alert recently).
- **Decision engine (`engine.py`)**
  - For each `Alert`, looks up the service profile and history and decides one of: `page`, `ticket`, `aggregate`, `suppress`.
- **Reporters (`reporter.py`)**
  - Writes a human-friendly `decision_report.md` and a machine-friendly `decision_report.json`.
  - Prints a table to the terminal so you can sanity-check the decisions quickly.

This is intentionally simple: a single CLI binary that can be run locally, in CI, or as a cron/systemd job.
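To make the context files concrete, here's one plausible shape for `services.yml`. The field names mirror the `ServiceProfile` fields described above but are illustrative; the repo's `examples/` directory is the source of truth:

```yaml
# services.yml -- one risk profile per service (illustrative field names)
checkout-api:
  tier: tier1
  slo_critical: true
  owner: payments-team
internal-wiki:
  tier: tier3
  slo_critical: false
  owner: platform-team
```

`history.json` can be as simple as a map from fingerprint to a recent-fire count, e.g. `{"abc123": {"count_24h": 40}}`.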
## Project structure

```
alertdecider_agent/
    __init__.py
    __main__.py
    cli.py        # CLI entry point
    models.py     # Alert, ServiceProfile, History dataclasses
    loader.py     # Load alerts/services/history from JSON/YAML
    engine.py     # Decision rules (AlertDecisionEngine)
    reporter.py   # Console + Markdown/JSON output
examples/
    alerts.json   # Example alerts
    services.yml  # Service risk profiles
    history.json  # Simple alert history
```
You can clone the repo and run it against the example data to see how everything works before wiring it into your own tooling.
## Setup

```shell
git clone https://github.com/AutoShiftOps/alertdecider.git
cd alertdecider
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
The dependencies are minimal:
- `rich` for nice CLI tables.
- `PyYAML` for `services.yml`.
## Modelling alerts, services, and history

We start by defining three small dataclasses in `models.py`:

- `Alert`: a normalized alert coming from any source (Prometheus, PagerDuty, etc.).
- `ServiceProfile`: tier and SLO information for a service.
- `History`: how often an alert fingerprint has fired in the last 24 hours.
This gives us a clean domain model to talk about decisions.
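A minimal sketch of those dataclasses, assuming the field names implied by the text above (the repo's `models.py` is authoritative):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Alert:
    """Normalized alert, regardless of source (Prometheus, PagerDuty, ...)."""
    id: str
    name: str
    service: str
    severity: str        # e.g. "critical", "warning", "info", "debug"
    environment: str     # e.g. "prod", "staging"
    fingerprint: str     # stable identifier for "the same" alert over time
    starts_at: Optional[datetime] = None

@dataclass
class ServiceProfile:
    """Risk profile for a service, loaded from services.yml."""
    name: str
    tier: str            # e.g. "tier1", "tier2"
    slo_critical: bool = False
    owner: str = ""

@dataclass
class History:
    """How often a fingerprint has fired recently, loaded from history.json."""
    fingerprint: str
    count_24h: int = 0
```

Keeping these plain dataclasses (rather than source-specific payloads) is what lets the engine stay small: every rule talks about the same three shapes.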
## Implementing the decision engine

`engine.py` contains the `AlertDecisionEngine` with a handful of clear rules.
Examples of policies encoded:

- **Low-severity non-prod noise**: `info`/`debug` in non-prod → `suppress`.
- **Critical on tier1, SLO-critical services**: `critical` + `tier1` + `slo_critical=true` → `page`.
- **Flapping alerts**: `count_24h >= 20` → `aggregate` (don't page for every occurrence; treat as noisy/frequent).
- **Warnings on tier1**: `warning` on `tier1` → `ticket` (not a page, but still important to track).
- **Default behavior**: prod alerts without a specific rule → `ticket`; non-prod alerts without a specific rule → `suppress`.

The whole point is that you can read this engine like a policy document.
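Here's a minimal sketch of how those policies might look as code. The rule ordering and the `Decision`/`decide` names are assumptions, not the repo's exact implementation; the key property is that each rule returns a `reason` alongside the action:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # one of "page", "ticket", "aggregate", "suppress"
    reason: str   # human-readable explanation for the report

def decide(alert, profile, history) -> Decision:
    # Rule: low-severity noise outside prod is suppressed outright.
    if alert.severity in ("info", "debug") and alert.environment != "prod":
        return Decision("suppress", f"{alert.severity} in {alert.environment} is noise")
    # Rule: flapping fingerprints are aggregated, not paged per occurrence.
    if history.count_24h >= 20:
        return Decision("aggregate", f"fired {history.count_24h}x in 24h; treating as noisy")
    # Rule: critical on a tier1, SLO-critical service pages immediately.
    if alert.severity == "critical" and profile.tier == "tier1" and profile.slo_critical:
        return Decision("page", "critical on tier1 SLO-critical service")
    # Rule: warnings on tier1 become tickets.
    if alert.severity == "warning" and profile.tier == "tier1":
        return Decision("ticket", "warning on tier1 service")
    # Default: prod gets a ticket, everything else is suppressed.
    if alert.environment == "prod":
        return Decision("ticket", "prod alert without a specific rule")
    return Decision("suppress", "non-prod alert without a specific rule")
```

Note the deliberate ordering choice here: the flapping check runs before the page rule, so even a critical alert that has fired dozens of times gets aggregated rather than paging on every occurrence. You may well want the opposite precedence; the point is that the trade-off is visible in the code.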
## Running the CLI with example data

With the repo set up, try the example:

```shell
python -m alertdecider_agent --alerts examples/alerts.json --services examples/services.yml --history examples/history.json --out-dir out
cat out/decision_report.md
```
You’ll see a table in your terminal and a Markdown report with each alert’s decision and reason.
With the provided examples, you should observe:
- A critical alert on a tier1, SLO-critical service in prod → page.
- A warning on a tier1 service → ticket.
- A flapping alert with a high `count_24h` → aggregate.
- A low-severity `info` alert in staging → suppress.
From here, you can plug in your own alerts JSON and adjust `services.yml` and `engine.py` to match your reality.
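If your alerts come from Prometheus Alertmanager, its webhook payload carries an `alerts` list where each entry has `labels`, `annotations`, and a `fingerprint`. A sketch of a normalizer for that shape follows; the `service` and `env` label names are assumptions about your labelling scheme, so adjust them to match your own:

```python
def normalize_alertmanager(payload: dict) -> list[dict]:
    """Map an Alertmanager webhook payload into the flat shape the engine expects.

    Assumes alerts carry `service` and `env` labels; change the label names
    to match your labelling scheme.
    """
    normalized = []
    for raw in payload.get("alerts", []):
        labels = raw.get("labels", {})
        normalized.append({
            "id": raw.get("fingerprint", ""),
            "name": labels.get("alertname", "unknown"),
            "service": labels.get("service", "unknown"),
            "severity": labels.get("severity", "info"),
            "environment": labels.get("env", "prod"),
            "fingerprint": raw.get("fingerprint", ""),
        })
    return normalized
```

Because the engine only ever sees this normalized shape, adding a second source (PagerDuty, a custom exporter) is just another small function like this one.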
## Where AI fits in (later)

Right now, `alertdecider` is deliberately rule-based and transparent.
Once you’re happy with the structure, you can start experimenting with AI in safe, incremental ways:
- **Richer explanations**: feed the normalized alert + decision into an LLM to generate a human-friendly explanation and suggested next steps.
- **Runbook suggestions**: use alert name, service name, and history to suggest a runbook link or dashboard.
- **Rule tuning hints**: analyze real decision logs to recommend new rules (e.g., "this pattern is always suppressed but often escalated manually").
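As a sketch of the first idea: keep the decision in the rule engine and only hand the LLM a prompt asking it to explain what was already decided. The function below just builds that prompt string; the function name is hypothetical and wiring it to an actual model is deliberately left out:

```python
def explanation_prompt(alert: dict, decision: dict) -> str:
    """Build an LLM prompt from an already-decided alert.

    The model only explains; the decision itself stays in the rule engine.
    """
    return (
        "You are assisting an on-call engineer.\n"
        f"Alert: {alert['name']} on {alert['service']} "
        f"({alert['severity']}, {alert['environment']})\n"
        f"Decision: {decision['action']} because {decision['reason']}\n"
        "Explain this decision in two sentences and suggest one next step."
    )
```

Even if the model hallucinates, the worst case is a bad explanation, never a missed page, because the action was fixed before the model was involved.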
The important part is: the control plane (what gets paged vs ticketed vs suppressed) remains in code you own.
## Next steps for your team
If you try this out, here are some directions to evolve it:
- **Add time-of-day and on-call load**: don't page at 03:00 for something that can safely be a ticket until business hours.
- **Persist history more robustly**: replace `history.json` with a small SQLite table or a lightweight time-series store.
- **Integrate with your actual alert pipeline**: wire `alertdecider` into Alertmanager/PagerDuty via webhooks, or run it as part of your incident ingestion path.
- **Measure impact**: track how many alerts are suppressed or aggregated, and how many pages you avoided without missing true incidents.
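The first idea can be a small post-processing step on top of the engine's output. In the sketch below, the business-hours window and the SLO-critical exemption are illustrative choices, not a recommendation:

```python
from datetime import datetime, time

# Illustrative window; tune to your team's working hours and time zone.
BUSINESS_HOURS = (time(9, 0), time(18, 0))

def apply_quiet_hours(action: str, slo_critical: bool, now: datetime) -> str:
    """Downgrade non-SLO-critical pages to tickets outside business hours.

    SLO-critical services still page at any hour; everything else waits
    until someone is at their desk.
    """
    start, end = BUSINESS_HOURS
    in_hours = start <= now.time() <= end
    if action == "page" and not slo_critical and not in_hours:
        return "ticket"
    return action
```

Passing `now` explicitly (instead of calling `datetime.now()` inside) keeps the rule testable, which matters once paging behavior depends on the clock.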
This is the kind of small, opinionated tool that can pay off quickly for SRE/DevOps teams drowning in alerts, and it's a great foundation for more advanced AI-assisted incident management.