Atlas: Liveness Probe Restarts
Symptom → evidence → resolution.
Symptom
Pods restart repeatedly with events indicating failed liveness probes; the application may otherwise appear healthy at times.
Workloads · Reliability · Operations
What this usually means
The kubelet is killing the container because the liveness check reports failure. This is rarely a platform issue; it is usually a mismatch between what the probe checks and what liveness should mean, or a timeout that is unrealistic under load.
Likely causes
Probe failures under load are common and predictable.
- Timeouts too aggressive for real p95/p99 response times.
- Probe endpoint does too much work (calls dependencies, hits databases, cold paths).
- No startupProbe for slow init, so liveness starts killing during warm-up.
- CPU throttling or GC pauses cause transient probe timeouts.
- NetworkPolicy/sidecar routing makes the probe path unreliable.
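To see why aggressive settings bite, it helps to work out how long a container that has stopped answering actually survives. A minimal sketch of that arithmetic, assuming each failed attempt times out (the Kubernetes defaults are periodSeconds=10, timeoutSeconds=1, failureThreshold=3):

```python
# Rough worst-case time before the kubelet restarts a container whose
# probe has started timing out. This is an approximation: it assumes
# consecutive failures and that each attempt consumes its full timeout.
def time_to_restart(initial_delay=0, period=10, timeout=1, failure_threshold=3):
    # First failing attempt fires after initial_delay; each later attempt
    # starts `period` seconds after the previous one; the final attempt
    # still waits out its timeout before counting as the Nth failure.
    return initial_delay + (failure_threshold - 1) * period + timeout

print(time_to_restart())  # defaults: 21 seconds of unresponsiveness before a kill
print(time_to_restart(period=5, timeout=2, failure_threshold=2))  # 7 seconds
```

With a 2-second timeout and failureThreshold=2, a single GC pause or throttling episode of a few seconds is enough to trigger a restart; the defaults already tolerate only ~21 seconds.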
What to inspect first
Confirm that liveness is the restart cause. Then measure probe behavior.
- Look for repeated `Liveness probe failed` events.
- Confirm whether restarts are correlated with traffic spikes.
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o yaml | rg -n "livenessProbe|startupProbe|timeoutSeconds|failureThreshold"
kubectl logs <pod> -n <ns> --previous --all-containers=true

Resolution guidance
Make liveness conservative; make readiness expressive.
- Move dependency checks to readiness, not liveness.
- Use startupProbe for warm-up; keep liveness simple (process loop) when possible.
- Tune timeouts and thresholds based on observed latency under realistic load.
- If CPU throttling is present, fix requests/limits; probes should not be the victim of starvation.
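The split above can be sketched as a probe configuration. This is illustrative, not a drop-in manifest: the container name, port, and the /healthz and /ready paths are assumptions, and the thresholds should come from your own observed latency.

```yaml
containers:
- name: app                  # illustrative name
  startupProbe:              # absorbs slow init; liveness is held off until this passes
    httpGet: {path: /healthz, port: 8080}   # assumed endpoint
    periodSeconds: 5
    failureThreshold: 24     # tolerates up to ~2 minutes of warm-up
  livenessProbe:             # cheap check only: process loop, no dependencies
    httpGet: {path: /healthz, port: 8080}
    periodSeconds: 10
    timeoutSeconds: 5        # generous relative to observed p99
    failureThreshold: 3
  readinessProbe:            # expressive check: may consult dependencies
    httpGet: {path: /ready, port: 8080}     # assumed endpoint
    periodSeconds: 10
    timeoutSeconds: 5
```

Note that a failing readiness probe only removes the pod from Service endpoints; it never restarts the container, which is why dependency checks belong there.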
Canonical link
Canonical URL: /atlas/liveness-probe-restarts