Atlas: Readiness Flapping
Symptom → evidence → resolution.
Symptom
Pods alternate between Ready and NotReady; endpoints churn and traffic becomes unstable.
Workloads · Networking · Reliability · Operations
What this usually means
Your traffic gate is unstable. Either the workload cannot sustain load, the probe endpoint is too strict, or the probe is checking a dependency that is intermittently unhealthy.
What to inspect first
Correlate readiness transitions with load and probe failures.
- Distinguish probe timeouts from explicit failure responses; they point to different causes (saturation vs. a failing check).
- If endpoints churn, validate whether upstream retries are amplifying pressure.
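To correlate transitions, it can help to see which pods fail readiness most often. A minimal sketch, assuming you have saved the events output to a file (`events.txt` is a hypothetical name, produced by the `kubectl get events` command below):

```shell
# Hypothetical workflow: first export events, e.g.
#   kubectl get events -n <ns> --sort-by=.lastTimestamp > events.txt
# Then count readiness-probe failures per object to find the worst flappers.
# In the default events output, column 4 is the OBJECT (e.g. pod/web-a).
grep "Readiness probe failed" events.txt \
  | awk '{print $4}' \
  | sort | uniq -c | sort -rn
```

Pods that dominate this count are the place to start; a failure count spread evenly across all pods points at a shared dependency or cluster-wide load instead.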
```shell
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "Readiness probe failed" | tail -n 30 || true
```
Resolution guidance
Fix semantics first. Tune numbers second.
- Ensure readiness means “can serve” and the endpoint is fast and stable under load.
- Remove deep dependency checks from readiness unless required; otherwise you create cascading failures.
- Tune timeouts based on measured p99 under expected load; add startupProbe when warm-up is slow.
- If the service is genuinely overloaded, reduce load or add capacity—don’t tune probes to hide collapse.
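The guidance above can be sketched as a probe configuration. This is illustrative only: the paths, port, and timings are assumptions, not measured values, and must be tuned against your own p99 latencies.

```yaml
# Illustrative sketch, not a recommended default.
containers:
  - name: app
    startupProbe:               # absorbs slow warm-up so readiness can stay strict
      httpGet: {path: /healthz, port: 8080}
      periodSeconds: 10
      failureThreshold: 30      # up to 30 * 10s = 5 min of startup grace
    readinessProbe:             # "can serve" only; no deep dependency calls here
      httpGet: {path: /readyz, port: 8080}
      periodSeconds: 5
      timeoutSeconds: 2         # set above the endpoint's measured p99 under load
      failureThreshold: 3       # ride out brief blips instead of flapping
```

Note the division of labor: the startupProbe owns warm-up, so the readinessProbe can keep tight thresholds without penalizing cold starts.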
Canonical URL: /atlas/readiness-flapping