Sacred Systems

Pod Lifecycle and Failure States

Pods are the symptom surface. If you can’t interpret their phases, reasons, and events, you cannot diagnose the cluster with discipline.

Authored as doctrine; evaluated as systems craft.

Doctrine

A pod is the smallest schedulable unit, but it is not the right unit of operation. Operators still read pods because pods record the first truth of failure: what the scheduler decided, what kubelet attempted, and what the runtime rejected.

Kubblai doctrine: read the pod as testimony, then return to the controller as the unit of intent.
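One way to act on that doctrine is to read the pod's `ownerReferences` and step up to whatever controller created it. The following is a minimal sketch, not a prescribed tool: the function name `pod_owner` and its argument order are hypothetical, and it assumes `kubectl` is already pointed at the right cluster.

```shell
# pod_owner NAMESPACE POD
# Hypothetical helper: prints one "Kind/name" line per owner of the pod.
# Prints nothing if the pod has no ownerReferences (a bare pod).
pod_owner() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'
}
```

For a pod managed by a Deployment, the owner is normally a ReplicaSet, whose own `ownerReferences` point to the Deployment. That is where intent is declared.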

Phases vs container states

Pod phase is a coarse summary. Container state carries the detail: waiting reasons, termination reasons, exit codes, and the previous termination, which is preserved in `lastState`.

Investigations often stall because the operator reads only `kubectl get pods` and skips `kubectl describe` and the event stream.

Pending often means unmet scheduling constraints or storage prerequisites, such as an unbound PVC.
Running does not mean Ready; readiness gates traffic.
CrashLoopBackOff is a restart pattern, not a root cause.
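The distinction between phase and container state can be read directly from the pod's status. The sketch below is illustrative only: `pod_states` and `interpret` are hypothetical helper names, and the wording of each verdict is an assumption that restates the three points above.

```shell
# pod_states NAMESPACE POD
# Prints the pod phase, then one tab-separated line per container:
# name, ready, restartCount, waiting reason, last termination reason, last exit code.
# Fields that do not apply to a container are left empty.
pod_states() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.status.phase}{"\n"}{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# interpret PHASE READY
# Pure helper (no cluster access): combines the coarse phase with a
# container's ready flag. The verdict strings are illustrative.
interpret() {
  case "$1:$2" in
    Pending:*)    echo "pending: often scheduling constraints or storage prerequisites" ;;
    Running:true) echo "running and ready: eligible for traffic" ;;
    Running:*)    echo "running but not ready: traffic is gated" ;;
    *)            echo "phase $1: read container states and events" ;;
  esac
}
```

When the waiting reason is CrashLoopBackOff, the last termination reason and exit code are the fields to read next, together with `kubectl logs --previous` for the crashed container.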

Events are the platform speaking

Events are not perfect, but they are often the shortest path to classification: image pull, mount, admission, probes, scheduling. When they repeat, they are also a cost signal.

Treat repeated events as a control loop that is not converging: the platform keeps retrying the same failing action, and each retry spends resources without changing the outcome.

kubectl

shell

kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 40
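Building on the two commands above, repetition can be made visible by keeping only Warning events and ranking them by count. This is a sketch under assumptions: `warning_events` and `repeated` are hypothetical names, and it assumes the cluster populates the `count` field on events, which may be empty on some clusters.

```shell
# warning_events NAMESPACE
# Lists Warning events as tab-separated lines: count, reason, object kind/name.
# Assumption: events carry the core v1 `count` field.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning \
    -o jsonpath='{range .items[*]}{.count}{"\t"}{.reason}{"\t"}{.involvedObject.kind}/{.involvedObject.name}{"\n"}{end}'
}

# repeated THRESHOLD
# Pure filter: reads the lines above on stdin and keeps those whose count
# is at least THRESHOLD, highest count first.
repeated() {
  awk -F '\t' -v min="$1" '$1 + 0 >= min' | sort -t "$(printf '\t')" -k1,1nr
}
```

A possible use is `warning_events my-namespace | repeated 5`, where `my-namespace` and the threshold of 5 are placeholders to adjust.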

Common failure signatures

Learn a small set of signatures. Do not treat each incident as a new story.

ImagePullBackOff: naming/auth/egress/rate limiting; no app signal exists yet.
OOMKilled: memory limit exceeded or node memory exhausted; fix the resource economics (requests and limits) before you tune probes.
Readiness failing: traffic gate is closed; endpoints empty or withheld.
Pending: placement failure, quota, policy, or prerequisites (PVC).
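The signatures above can be held as a small lookup so that classification is the same on every incident. This is an illustrative sketch, not an exhaustive table: the function name `classify` is hypothetical, and treating `Unhealthy` (the event reason kubelet emits for failed probes) as the marker for readiness failure is an assumption of this example.

```shell
# classify REASON
# Maps a pod status or event reason to a failure class and a first check.
# The classes restate the list above; the mapping is illustrative, not exhaustive.
classify() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "image: check image name, pull secret, registry egress, rate limits" ;;
    OOMKilled)
      echo "memory: check memory limit and node pressure before tuning probes" ;;
    Unhealthy)
      echo "probe: check readiness gate and Service endpoints" ;;
    Pending|FailedScheduling)
      echo "placement: check scheduling constraints, quota, policy, PVC" ;;
    CrashLoopBackOff)
      echo "restart pattern: read the previous exit code and logs for the cause" ;;
    *)
      echo "unclassified: read describe output and events" ;;
  esac
}
```

Keeping the fallback branch explicit matters: an unknown reason should send the operator back to `describe` and events, not to a guess.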

Field notes

Pods die for reasons unrelated to your application: node pressure, eviction thresholds, storage detach latency, admission failure. Don’t let the surface symptom trick you into changing code when the platform is the cause.
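Before touching application code, it is worth checking whether the node itself is reporting pressure or evicting pods. The helpers below are a hedged sketch: the names `node_conditions`, `under_pressure`, and `evicted_pods` are hypothetical, and the checks cover only node pressure and eviction, not storage detach latency or admission failure.

```shell
# node_conditions
# Prints one line per node: name, a tab, then TYPE=STATUS for each node condition.
node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}{end}'
}

# under_pressure
# Pure filter: reads node_conditions output on stdin and keeps nodes that
# report MemoryPressure, DiskPressure, or PIDPressure as True.
under_pressure() {
  grep -E '(Memory|Disk|PID)Pressure=True'
}

# evicted_pods NAMESPACE
# Lists Failed pods with their status reason; kubelet evictions show "Evicted".
evicted_pods() {
  kubectl get pods -n "$1" --field-selector status.phase=Failed \
    -o custom-columns=NAME:.metadata.name,REASON:.status.reason,NODE:.spec.nodeName
}
```

Running `node_conditions | under_pressure` and finding the failing pod's node in the output is a strong hint that the platform, not the application, is the cause.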

If you’re on call, your first job is classification. Your second job is containment.

Canonical Link

Canonical URL: /library/pod-lifecycle-and-failure-states