Manual restarts during incidents are reactive. Self-healing means your containers recover themselves between alerts.
This post shows how to build a lightweight bash watchdog that:
- Monitors container health via Docker health checks
- Restarts unhealthy containers
- Integrates with systemd for daemon-like behavior
- Logs everything for incident review
The Self-Healing Architecture
Docker has built-in restart policies (--restart unless-stopped), but they don’t respect health checks. A container can be “running” but unhealthy (app crashed, dependencies down, etc.).
Our script loops every 30 seconds:
- Query Docker API for container health
- Restart unhealthy containers
- Log actions to systemd journal
- Repeat
Step 1: Docker Health Checks (the foundation)
First, ensure your containers have proper health checks in docker-compose.yml or docker run:
services:
web:
image: nginx
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
This marks containers healthy, unhealthy, or starting.
Step 2: The Self-Healing Watchdog Script
Save this as /usr/local/bin/docker-autoheal.sh:
#!/usr/bin/env bash
set -euo pipefail
# Config
CHECK_INTERVAL=30
LOG_FILE="/var/log/docker-autoheal.log"
CONTAINER_LABEL="autoheal=true" # Label your containers
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
}
heal_containers() {
local unhealthy=($(docker ps --filter "health=unhealthy" --filter "label=$CONTAINER_LABEL" --format ""))
for container in "${unhealthy[@]}"; do
log "RESTARTING UNHEALTHY CONTAINER: $container"
# Graceful stop first
docker stop "$container" || true
# Hard kill after timeout
docker kill "$container" || true
# Restart with original command
docker start "$container"
log "RESTARTED: $container"
done
}
# Main loop
log "Docker Auto-Heal started (check interval: ${CHECK_INTERVAL}s)"
while true; do
heal_containers
sleep "$CHECK_INTERVAL"
done
Step 3: Run as a Systemd Service (production-ready)
Create /etc/systemd/system/docker-autoheal.service:
[Unit]
Description=Docker Container Auto-Healer
After=docker.service
Requires=docker.service
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/docker-autoheal.sh
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable docker-autoheal.service
sudo systemctl start docker-autoheal.service
Check status:
sudo systemctl status docker-autoheal.service
sudo journalctl -u docker-autoheal.service -f
Step 4: Label Your Containers
Tag containers you want to auto-heal:
docker run -d \
--label "autoheal=true" \
--restart unless-stopped \
--health-cmd="curl -f http://localhost:8080/health || exit 1" \
nginx
Step 5: Advanced Features
Grace Period (avoid restart loops)
# Add to script before restart
if docker inspect "$container" --format '' | grep -q "10"; then
log "SKIPPING: $container restart count too high (restart loop?)"
continue
fi
Webhook Alerts
curl -X POST -d "{\"container\":\"$container\",\"action\":\"restart\"}" "$WEBHOOK_URL"
Multi-Host (Docker Swarm)
Use labels + a central orchestrator or run the script on each host. ​
Testing It (safely)
Spin up a container with a failing health check:
docker run -d \
--name test-fail \
--label "autoheal=true" \
--health-cmd="exit 1" \
alpine sleep infinity
Watch it get restarted:
journalctl -u docker-autoheal.service -f
Limitations + When to Upgrade
This works great for:
- Small deployments / homelabs
- Edge services / single-host apps
- Dev/staging environments
Upgrade to:
- Kubernetes: liveness/readiness probes + pod disruption budgets
- Docker Swarm: service replicas + constraints
- Nomad: health checks + restart stanzas
Summary Table (Copy/Paste)
| Component | Command | Purpose |
|---|---|---|
| Health Check | docker ps --filter health=unhealthy |
Find broken containers |
| Watchdog | systemctl status docker-autoheal |
Confirm service running |
| Logs | journalctl -u docker-autoheal |
Review restart history |
| Container Labels | --label autoheal=true |
Target specific services |