Atlas Entry
Atlas: Deployment Rollout Stalled
A Deployment rollout hangs or reports ProgressDeadlineExceeded; new pods don’t become Ready or traffic shifts incorrectly.
Symptom → evidence → resolution.
Symptom
`kubectl rollout status` hangs, or the Deployment's `Progressing` condition reports the reason `ProgressDeadlineExceeded`. New pods don’t become Ready, or traffic shifts incorrectly.
What this usually means
The Deployment controller is creating new pods but cannot advance the rollout: pods fail readiness, availability gates cannot be satisfied, or the cluster cannot schedule the desired replicas.
Likely causes
Rollouts stall for a small set of reasons. Identify the gate that is holding.
- Readiness never becomes true (probe mis-specified, dependency down, wrong port/path).
- Scheduling cannot place new pods (insufficient resources, taints/affinity constraints).
- Update strategy constraints block replacement (for example `maxUnavailable: 0` when the cluster has no spare capacity to schedule surge pods).
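- A PodDisruptionBudget blocks the evictions that would free capacity (a Deployment's own rolling update is not limited by PDBs), or topology constraints leave no valid placement for new pods.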
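- Admission or policy rejects pods created by the new ReplicaSet (for example a ResourceQuota, admission webhook, or Pod Security violation); this typically surfaces as a `ReplicaFailure` condition on the Deployment.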
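As a first pass at identifying the gate, the pod listing can be sorted into rough buckets. The sketch below is illustrative only: `classify_pods` is a hypothetical helper, it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...), and its buckets are a heuristic that should be confirmed against pod events.

```shell
# Sketch (hypothetical helper): guess the holding gate per pod from
# `kubectl get pods --no-headers` output read on stdin.
# Assumed columns: NAME READY STATUS ... where READY looks like "0/1".
classify_pods() {
  awk '
    {
      split($2, r, "/")   # r[1] = ready containers, r[2] = total containers
      if ($3 == "Pending")
        gate = "scheduling"     # usually unschedulable; confirm with events
      else if ($3 == "Running" && (r[1] + 0) < (r[2] + 0))
        gate = "readiness"      # running, but readiness has not passed
      else if ($3 == "Running")
        gate = "ok"
      else
        gate = "container"      # e.g. ImagePullBackOff, CrashLoopBackOff
      print $1, gate
    }'
}
```

Usage, with the same placeholders as elsewhere in this entry: `kubectl get pods -n <ns> -l app=<label> --no-headers | classify_pods`.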
What to inspect first
Let the controller speak. Then inspect the pods it created.
- Look at `.status.conditions` on the Deployment.
- Describe a new pod and read its events.
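The two checks above can be sketched as small shell helpers. The function names are hypothetical, and the arguments (Deployment name, pod name, namespace) are supplied by the caller; the helpers only wrap `kubectl get` with a JSONPath template and a field selector.

```shell
# Sketch: print each Deployment condition as "Type=Status (Reason)".
# $1 = deployment name, $2 = namespace
deploy_conditions() {
  kubectl get deploy "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}

# Sketch: succeed only if the rollout has exceeded its progress deadline.
rollout_deadline_exceeded() {
  deploy_conditions "$1" "$2" |
    grep -q '^Progressing=False (ProgressDeadlineExceeded)$'
}

# Sketch: list events for one pod, oldest first.
# $1 = pod name, $2 = namespace
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

For example, `rollout_deadline_exceeded <name> <ns> && echo stalled` reports a stall only when the controller itself has flagged one.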
kubectl rollout status deploy/<name> -n <ns>
kubectl describe deploy/<name> -n <ns>
kubectl get rs -n <ns> -l app=<label> -o wide
kubectl get pods -n <ns> -l app=<label> -o wide
Resolution guidance
Restore availability before you optimize. A safe rollout is one that can be undone.
- If readiness is wrong, fix probes/ports and re-deploy; don’t force traffic onto unready pods.
- If capacity is the gate, add nodes or reduce requests before expanding replicas.
- If strategy is the gate, adjust maxSurge/maxUnavailable with intent and document the safety posture.
- If impact is growing, roll back with `kubectl rollout undo deploy/<name> -n <ns>`, then fix the root cause deliberately.
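A rollback is only finished once the previous revision is serving again. One possible sketch: `rollback_and_wait` is a hypothetical helper, and the 120s default timeout is an arbitrary assumption to adjust for your workload.

```shell
# Sketch (hypothetical helper): roll back a Deployment, then wait for the
# restored revision to become available.
# $1 = deployment name, $2 = namespace, $3 = timeout (optional; 120s assumed)
rollback_and_wait() {
  kubectl rollout undo "deploy/$1" -n "$2" &&
    kubectl rollout status "deploy/$1" -n "$2" --timeout="${3:-120s}"
}
```

Running `kubectl rollout history deploy/<name> -n <ns>` beforehand shows which revisions are available to return to.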
Common mistakes
Avoid changes that destroy attribution.
- Editing multiple fields at once and losing the causal thread.
- Forcing readiness to “pass” while the service is still broken downstream.
- Treating a stalled rollout as an image problem without reading events.
Canonical link
Canonical URL: /atlas/deployment-rollout-stalled