Atlas: Readiness Flapping
Symptom → evidence → resolution.
Symptom
Pods alternate between Ready and NotReady; endpoints churn and traffic becomes unstable.
Workloads · Networking · Reliability · Operations
What this usually means
Your traffic gate is unstable. Either the workload cannot sustain load, the probe endpoint is too strict, or the probe is checking a dependency that is intermittently unhealthy.
What to inspect first
Correlate readiness transitions with load and probe failures.
- Distinguish probe timeouts from explicit failure responses; they point to different causes (saturation vs. a failing check).
- If endpoints churn, validate whether upstream retries are amplifying pressure.
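To correlate transitions, it can help to see which pods fail readiness most often. A minimal sketch, assuming you have saved the events output to a file (`events.txt` is a hypothetical name, produced by the `kubectl get events` command below):

```shell
# Hypothetical workflow: first export events, e.g.
#   kubectl get events -n <ns> --sort-by=.lastTimestamp > events.txt
# Then count readiness-probe failures per object to find the worst flappers.
# In the default events output, column 4 is the OBJECT (e.g. pod/web-a).
grep "Readiness probe failed" events.txt \
  | awk '{print $4}' \
  | sort | uniq -c | sort -rn
```

Pods that dominate this count are the place to start; a failure count spread evenly across all pods points at a shared dependency or cluster-wide load instead.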
```shell
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "Readiness probe failed" | tail -n 30 || true
```
Resolution guidance
Fix semantics first. Tune numbers second.
- Ensure readiness means “can serve” and the endpoint is fast and stable under load.
- Remove deep dependency checks from readiness unless required; otherwise you create cascading failures.
- Tune timeouts based on measured p99 under expected load; add startupProbe when warm-up is slow.
- If the service is genuinely overloaded, reduce load or add capacity—don’t tune probes to hide collapse.
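The guidance above can be sketched as a probe configuration. This is illustrative only: the paths, port, and timings are assumptions, not measured values, and must be tuned against your own p99 latencies.

```yaml
# Illustrative sketch, not a recommended default.
containers:
  - name: app
    startupProbe:               # absorbs slow warm-up so readiness can stay strict
      httpGet: {path: /healthz, port: 8080}
      periodSeconds: 10
      failureThreshold: 30      # up to 30 * 10s = 5 min of startup grace
    readinessProbe:             # "can serve" only; no deep dependency calls here
      httpGet: {path: /readyz, port: 8080}
      periodSeconds: 5
      timeoutSeconds: 2         # set above the endpoint's measured p99 under load
      failureThreshold: 3       # ride out brief blips instead of flapping
```

Note the division of labor: the startupProbe owns warm-up, so the readinessProbe can keep tight thresholds without penalizing cold starts.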
Canonical URL: /atlas/readiness-flapping