

Atlas: Liveness Probe Restarts



Symptom → evidence → resolution.

Symptom

Pods restart repeatedly with events indicating failed liveness probes; the application may otherwise appear healthy at times.

Tags: Workloads · Reliability · Operations

What this usually means

The kubelet is killing containers because the liveness check reports failure. This is rarely a platform issue; more often the probe's semantics do not match what "alive" should mean for the application, or its timeout is unrealistic under load.

Likely causes

Probe failures under load are common and predictable.

  • Timeouts too aggressive for real p95/p99 response times.
  • Probe endpoint does too much work (calls dependencies, hits databases, cold paths).
  • No startupProbe for slow init, so liveness starts killing during warm-up.
  • CPU throttling or GC pauses cause transient probe timeouts.
  • NetworkPolicy/sidecar routing makes the probe path unreliable.
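The first two causes often look like the following probe spec. This is a hypothetical sketch using standard Kubernetes `livenessProbe` fields; the endpoint, port, and numbers are illustrative, not taken from any real workload:

```yaml
# Hypothetical: an overly aggressive liveness probe.
# timeoutSeconds: 1 fails whenever p99 latency exceeds one second,
# and failureThreshold: 1 restarts the container on a single miss.
livenessProbe:
  httpGet:
    path: /health        # assumption: this handler also calls dependencies
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 1
```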

What to inspect first

Confirm that liveness is the restart cause. Then measure probe behavior.

  • Look for repeated `Liveness probe failed` events.
  • Confirm whether restarts are correlated with traffic spikes.

```shell
# Inspect probe-failure events, the configured probe fields, and the
# logs of the previous (restarted) container instance.
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o yaml | rg -n "livenessProbe|startupProbe|timeoutSeconds|failureThreshold"
kubectl logs <pod> -n <ns> --previous --all-containers=true
```

Resolution guidance

Make liveness conservative; make readiness expressive.

  • Move dependency checks to readiness, not liveness.
  • Use startupProbe for warm-up; keep liveness simple (process loop) when possible.
  • Tune timeouts and thresholds based on observed latency under realistic load.
  • If CPU throttling is present, fix requests/limits; probes should not be the victim of starvation.
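The guidance above can be sketched as probe configuration. The fields are standard Kubernetes probe fields; the paths, port, and numbers are illustrative assumptions to be tuned against observed latency, not recommended values:

```yaml
# Sketch: conservative liveness, expressive readiness, startupProbe for warm-up.
startupProbe:            # tolerates up to 30 * 5s = 150s of slow init
  httpGet:
    path: /healthz       # assumption: cheap, in-process check only
    port: 8080
  periodSeconds: 5
  failureThreshold: 30
livenessProbe:           # simple "process loop is alive" check
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5      # sized above observed p99, not p50
  failureThreshold: 3    # several consecutive misses before a restart
readinessProbe:          # dependency checks live here, not in liveness
  httpGet:
    path: /ready         # assumption: verifies databases and downstreams
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
```

With this split, a slow dependency takes the pod out of rotation via readiness instead of triggering a restart via liveness.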

Canonical link

Canonical URL: /atlas/liveness-probe-restarts