Atlas: Deployment Rollout Stalled

Symptom → evidence → resolution.

Symptom

A Deployment rollout hangs or reports ProgressDeadlineExceeded; new pods don’t become Ready or traffic shifts incorrectly.

Workloads, Reliability, Operations

What this usually means

The Deployment controller is creating new pods but cannot advance the rollout: pods fail readiness, availability gates cannot be satisfied, or the cluster cannot schedule the desired replicas.
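One way to see this state is to compare the replica counts the controller reports. The following is a minimal sketch, assuming a hypothetical helper name (`rollout_counts`) and placeholder arguments for the Deployment and namespace. A rollout that is stuck typically shows `updated` above zero while `ready` stays below `desired`.

```shell
# Print desired, updated, ready, and unavailable replica counts on one line.
# Usage: rollout_counts <deployment> <namespace>
# The helper name is illustrative; fields that are unset print as empty.
rollout_counts() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='desired={.spec.replicas} updated={.status.updatedReplicas} ready={.status.readyReplicas} unavailable={.status.unavailableReplicas}{"\n"}'
}
```

Running `rollout_counts <name> <ns>` a few times, some seconds apart, shows whether the numbers are moving at all.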

Likely causes

Rollouts stall for a small set of reasons. Identify the gate that is holding.

  • Readiness never becomes true (probe mis-specified, dependency down, wrong port/path).
  • Scheduling cannot place new pods (insufficient resources, taints/affinity constraints).
  • Update strategy constraints block replacement (maxUnavailable=0 with no surge headroom).
  • PodDisruptionBudget or topology constraints prevent moving capacity.
  • Admission control or policy rejects the pods that the new ReplicaSet tries to create.
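The causes above can each be checked with a read-only query. This is a rough sketch rather than a complete diagnostic. The function name `rollout_gate_checks` is hypothetical, and the label selector is assumed to match the Deployment's pods.

```shell
# Print one quick, read-only check per likely cause. Nothing is modified.
# Usage: rollout_gate_checks <deployment> <namespace> <label-selector>
rollout_gate_checks() {
  local name="$1"
  local ns="$2"
  local selector="$3"

  echo "== Scheduling: pods still Pending =="
  kubectl get pods -n "$ns" -l "$selector" --field-selector=status.phase=Pending

  echo "== Strategy: maxSurge / maxUnavailable =="
  kubectl get deployment "$name" -n "$ns" \
    -o jsonpath='{.spec.strategy.rollingUpdate.maxSurge}{" / "}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}'

  echo "== PodDisruptionBudgets in the namespace =="
  kubectl get pdb -n "$ns"

  echo "== Recent warnings (probe failures, scheduling, rejected pods) =="
  kubectl get events -n "$ns" --field-selector=type=Warning --sort-by=.lastTimestamp
}
```

For example, `rollout_gate_checks <name> <ns> app=<label>` prints the four sections in order. An empty Pending list together with repeated readiness warnings points at probes rather than capacity.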

What to inspect first

Let the controller speak. Then inspect the pods it created.

  • Look at `.status.conditions` on the Deployment, especially the `Progressing` and `Available` conditions.
  • Describe one of the new pods and read its events.

kubectl

kubectl rollout status deploy/<name> -n <ns>
kubectl describe deploy/<name> -n <ns>
kubectl get rs -n <ns> -l app=<label> -o wide
kubectl get pods -n <ns> -l app=<label> -o wide
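To complement the commands above, the `Progressing` condition can be read directly instead of scanning the `describe` output. The sketch below assumes two hypothetical helper names, `progressing_reason` and `rollout_deadline_exceeded`. The reason `ProgressDeadlineExceeded` is the value the Deployment controller sets when the rollout has not advanced within `.spec.progressDeadlineSeconds`.

```shell
# Print the reason of the Deployment's Progressing condition.
# Usage: progressing_reason <deployment> <namespace>
progressing_reason() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
}

# Exit 0 only when the controller has declared the rollout stalled.
# Usage: rollout_deadline_exceeded <deployment> <namespace>
rollout_deadline_exceeded() {
  [ "$(progressing_reason "$1" "$2")" = "ProgressDeadlineExceeded" ]
}
```

A script could use it as `rollout_deadline_exceeded <name> <ns> && echo "stalled"`, which keeps the check usable in alerts or CI steps.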

Resolution guidance

Restore availability before you optimize. A safe rollout is one that can be undone.

  • If readiness is wrong, fix probes/ports and re-deploy; don’t force traffic onto unready pods.
  • If capacity is the gate, add nodes or reduce requests before expanding replicas.
  • If strategy is the gate, adjust maxSurge/maxUnavailable with intent and document the safety posture.
  • If impact is growing, roll back with `kubectl rollout undo`, then fix the root cause deliberately.
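The rollback step can be sketched as a small function that records the revision history, undoes the rollout, and waits for the result. The name `rollback_and_wait` and the 120s default timeout are assumptions for illustration, not recommended values.

```shell
# Roll a Deployment back one revision and wait for the result.
# Usage: rollback_and_wait <deployment> <namespace> [timeout, default 120s]
rollback_and_wait() {
  local name="$1"
  local ns="$2"
  local timeout="${3:-120s}"

  # Print the revision list first so the rollback can be attributed later.
  kubectl rollout history "deployment/$name" -n "$ns" || return 1

  # Without --to-revision, undo returns to the previous revision.
  kubectl rollout undo "deployment/$name" -n "$ns" || return 1

  # Exits non-zero if the rollback itself does not complete in time.
  kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"
}
```

Stopping at the first failed step keeps the function from reporting a healthy status for a rollback that never started.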

Common mistakes

Avoid changes that destroy attribution.

  • Editing multiple fields at once and losing the causal thread.
  • Forcing readiness to “pass” while the service is still broken downstream.
  • Treating a stalled rollout as an image problem without reading events.
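Before blaming the image, it helps to check what the containers are actually waiting on. The following sketch assumes hypothetical helper names (`pod_waiting_reasons`, `image_pull_failures`) and a label selector that matches the Deployment's pods. It separates image pull failures from other waiting reasons such as `CrashLoopBackOff`.

```shell
# List each pod with the waiting reason of its containers, if any.
# Usage: pod_waiting_reasons <namespace> <label-selector>
pod_waiting_reasons() {
  kubectl get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Keep only pods whose waiting reason points at the image itself.
# Prints nothing (and exits non-zero) when no pod matches.
# Usage: image_pull_failures <namespace> <label-selector>
image_pull_failures() {
  pod_waiting_reasons "$1" "$2" | grep -E 'ErrImagePull|ImagePullBackOff|InvalidImageName'
}
```

If `image_pull_failures <ns> app=<label>` prints nothing, the image is probably not the gate, and the pod events are the next place to look.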

Canonical link

Canonical URL: /atlas/deployment-rollout-stalled