Atlas: Deployment Rollout Stalled

Symptom

A Deployment rollout hangs or reports ProgressDeadlineExceeded; new pods don’t become Ready or traffic shifts incorrectly.

What this usually means

The Deployment controller is creating new pods but cannot advance the rollout: pods fail readiness, availability gates cannot be satisfied, or the cluster cannot schedule the desired replicas.

Likely causes

Rollouts stall for a small set of reasons. Identify the gate that is holding.

Readiness never becomes true (probe mis-specified, dependency down, wrong port/path).
Scheduling cannot place new pods (insufficient resources, taints/affinity constraints).
Update strategy constraints block replacement (maxUnavailable=0 while the cluster has no capacity for surge pods).
PodDisruptionBudget or topology constraints prevent moving capacity.
Admission or policy controls reject the pods that the new ReplicaSet tries to create.
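
To see why a strategy setting can hold a rollout, it helps to work out the pod-count bounds it implies. The sketch below is illustrative only: `resolve_count` and `rollout_bounds` are hypothetical helper names, not kubectl features. It assumes the documented RollingUpdate rounding, where a percentage maxSurge rounds up and a percentage maxUnavailable rounds down.

```shell
# Resolve an absolute count or a percentage against the replica count.
# $1 = value ("2" or "25%"), $2 = replicas, $3 = rounding ("up" or "down").
resolve_count() {
  case "$1" in
    *%)
      pct="${1%\%}"                          # strip the trailing percent sign
      if [ "$3" = "up" ]; then
        echo $(( (pct * $2 + 99) / 100 ))    # integer ceiling
      else
        echo $(( pct * $2 / 100 ))           # integer floor
      fi
      ;;
    *) echo "$1" ;;                          # already an absolute count
  esac
}

# Print the bounds a RollingUpdate must respect.
# $1 = replicas, $2 = maxSurge, $3 = maxUnavailable.
rollout_bounds() {
  replicas="$1"
  surge=$(resolve_count "$2" "$replicas" up)
  unavail=$(resolve_count "$3" "$replicas" down)
  echo "max_total=$(( replicas + surge )) min_available=$(( replicas - unavail ))"
}
```

For example, `rollout_bounds 4 1 0` prints `max_total=5 min_available=4`: no old pod may be removed until a fifth pod is scheduled and Ready, so a cluster without room for that one extra pod stalls the rollout.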

What to inspect first

Let the controller speak. Then inspect the pods it created.

Look at `.status.conditions` on the Deployment.
Describe a new pod and read its events.

kubectl rollout status deploy/<name> -n <ns>
kubectl describe deploy/<name> -n <ns>
kubectl get rs -n <ns> -l app=<label> -o wide
kubectl get pods -n <ns> -l app=<label> -o wide
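
The conditions and pod events can also be read directly instead of scanning `describe` output. This is a minimal sketch, assuming `kubectl` is on PATH and the current context points at the affected cluster. The function names are hypothetical wrappers. The `Progressing=False` check reflects how the Deployment controller normally reports `ProgressDeadlineExceeded`.

```shell
# Print each Deployment condition as "Type=Status Reason", one per line.
# $1 = Deployment name, $2 = namespace.
deploy_conditions() {
  kubectl get deploy "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
}

# Succeed (exit 0) only if the rollout has hit its progress deadline.
rollout_deadline_exceeded() {
  deploy_conditions "$1" "$2" |
    grep -q '^Progressing=False ProgressDeadlineExceeded$'
}

# List the events recorded for one pod, oldest first.
# $1 = pod name, $2 = namespace.
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

Usage: `rollout_deadline_exceeded <name> <ns> && echo "deadline exceeded"`, then `pod_events <new-pod> <ns>` on one of the pods from the new ReplicaSet.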

Resolution guidance

Restore availability before you optimize. A safe rollout is one that can be undone.

If readiness is wrong, fix probes/ports and re-deploy; don’t force traffic onto unready pods.
If capacity is the gate, add nodes or reduce requests before expanding replicas.
If strategy is the gate, adjust maxSurge/maxUnavailable with intent and document the safety posture.
If impact is growing, roll back with `kubectl rollout undo deploy/<name> -n <ns>`, then fix the root cause deliberately.
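
A rollback is only finished once the previous revision is serving again. The sketch below wraps the undo in a wait, under assumptions: `rollback_and_wait` is a hypothetical helper, and the 120s default timeout is an arbitrary placeholder to tune per service.

```shell
# Roll back to the previous revision and wait for it to settle.
# $1 = Deployment name, $2 = namespace, $3 = optional timeout (default 120s).
rollback_and_wait() {
  timeout="${3:-120s}"
  # Record the revision history first so the incident notes show what was undone.
  kubectl rollout history "deploy/$1" -n "$2"
  # Only wait on status if the undo itself was accepted.
  kubectl rollout undo "deploy/$1" -n "$2" &&
    kubectl rollout status "deploy/$1" -n "$2" --timeout="$timeout"
}
```

Usage: `rollback_and_wait <name> <ns> 300s`. A non-zero exit means the rollback itself did not complete, which points at a cluster-level gate (scheduling, capacity) rather than the new revision.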

Common mistakes

Avoid changes that destroy attribution.

Editing multiple fields at once and losing the causal thread.
Forcing readiness to “pass” while the service is still broken downstream.
Treating a stalled rollout as an image problem without reading events.
