Atlas: Liveness Probe Restarts
Symptom → evidence → resolution.
Symptom
Pods restart repeatedly with events indicating failed liveness probes; the application may otherwise appear healthy at times.
Workloads · Reliability · Operations
What this usually means
The kubelet is killing the container because the liveness check reports failure. This is rarely a platform issue; it is usually a mismatch between what the probe checks and what liveness should mean, or a timeout that is unrealistic under load.
Likely causes
Probe failures under load are common and predictable.
- Timeouts too aggressive for real p95/p99 response times.
- Probe endpoint does too much work (calls dependencies, hits databases, cold paths).
- No startupProbe for slow init, so liveness starts killing during warm-up.
- CPU throttling or GC pauses cause transient probe timeouts.
- NetworkPolicy/sidecar routing makes the probe path unreliable.
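To see why aggressive settings bite, it helps to work out how long a container that has stopped answering actually survives. A minimal sketch of that arithmetic, assuming each failed attempt times out (the Kubernetes defaults are periodSeconds=10, timeoutSeconds=1, failureThreshold=3):

```python
# Rough worst-case time before the kubelet restarts a container whose
# probe has started timing out. This is an approximation: it assumes
# consecutive failures and that each attempt consumes its full timeout.
def time_to_restart(initial_delay=0, period=10, timeout=1, failure_threshold=3):
    # First failing attempt fires after initial_delay; each later attempt
    # starts `period` seconds after the previous one; the final attempt
    # still waits out its timeout before counting as the Nth failure.
    return initial_delay + (failure_threshold - 1) * period + timeout

print(time_to_restart())  # defaults: 21 seconds of unresponsiveness before a kill
print(time_to_restart(period=5, timeout=2, failure_threshold=2))  # 7 seconds
```

With a 2-second timeout and failureThreshold=2, a single GC pause or throttling episode of a few seconds is enough to trigger a restart; the defaults already tolerate only ~21 seconds.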
What to inspect first
Confirm that liveness is the restart cause. Then measure probe behavior.
- Look for repeated `Liveness probe failed` events.
- Confirm whether restarts are correlated with traffic spikes.
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o yaml | rg -n "livenessProbe|startupProbe|timeoutSeconds|failureThreshold"
kubectl logs <pod> -n <ns> --previous --all-containers=true

Resolution guidance
Make liveness conservative; make readiness expressive.
- Move dependency checks to readiness, not liveness.
- Use startupProbe for warm-up; keep liveness simple (process loop) when possible.
- Tune timeouts and thresholds based on observed latency under realistic load.
- If CPU throttling is present, fix requests/limits; probes should not be the victim of starvation.
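The split above can be sketched as a probe configuration. This is illustrative, not a drop-in manifest: the container name, port, and the /healthz and /ready paths are assumptions, and the thresholds should come from your own observed latency.

```yaml
containers:
- name: app                  # illustrative name
  startupProbe:              # absorbs slow init; liveness is held off until this passes
    httpGet: {path: /healthz, port: 8080}   # assumed endpoint
    periodSeconds: 5
    failureThreshold: 24     # tolerates up to ~2 minutes of warm-up
  livenessProbe:             # cheap check only: process loop, no dependencies
    httpGet: {path: /healthz, port: 8080}
    periodSeconds: 10
    timeoutSeconds: 5        # generous relative to observed p99
    failureThreshold: 3
  readinessProbe:            # expressive check: may consult dependencies
    httpGet: {path: /ready, port: 8080}     # assumed endpoint
    periodSeconds: 10
    timeoutSeconds: 5
```

Note that a failing readiness probe only removes the pod from Service endpoints; it never restarts the container, which is why dependency checks belong there.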
Canonical link
Canonical URL: /atlas/liveness-probe-restarts