
Lab · Foundations

Lab: CrashLoopBackOff Triage

Practice the Order’s first rule of troubleshooting: evidence before action. Diagnose a crash loop using events, logs, exit codes, and probe semantics.

Prerequisites

What you should have before you begin.

Workloads · Operations · Reliability
  • A local cluster (kind/minikube) or a sandbox namespace
  • kubectl installed
  • Comfort reading pod status

Lab text

Follow the sequence. Change one thing at a time.

Goal

You will learn to triage CrashLoopBackOff without guesswork: identify whether the crash is application-level, configuration-level, or platform-level (probes, resources, permissions).

The ritual is calm: read state, collect evidence, change one thing, verify, and record the outcome.

  • Read the pod state and container exit code.
  • Confirm whether the restart is caused by probes or process exit.
  • Use targeted commands that preserve attribution.

What to inspect first (the canonical sequence)

CrashLoopBackOff is not a diagnosis. It is the platform telling you “the container keeps exiting, and the kubelet is backing off before restarting it.”

  • Events: probe failures, image pulls, mount errors, permission denials.
  • Container state: exitCode, reason (Error/OOMKilled), lastState.
  • Restart cadence: fast loops usually mean the process exits immediately on start; restarts at a regular interval often track a failing probe’s period.
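The cadence itself carries information because the back-off is predictable. A minimal sketch of the kubelet’s default behavior (delays start around 10 seconds, double per restart, and cap at 5 minutes; exact timing varies):

```shell
# Sketch: approximate crash back-off schedule (kubelet defaults).
# Values are illustrative; the kubelet resets the delay after a
# sustained period of successful running.
delay=10
for restart in 1 2 3 4 5 6; do
  echo "restart $restart: next back-off ~${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

If restarts arrive faster than this schedule allows, you are probably watching multiple containers or multiple pods, not one loop.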

shell

kubectl get pod <pod> -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous --all-containers=true
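The exit code from `lastState.terminated` is the single highest-value datum. A sketch of how to read it (the code itself would come from `kubectl describe`; the value here is hard-coded for illustration):

```shell
# Sketch: interpret a container exit code.
# Codes above 128 mean "killed by signal (code - 128)".
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128)) (137 = SIGKILL, often OOMKilled; 143 = SIGTERM)"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit: the process finished and the restartPolicy restarted it"
  else
    echo "application error exit ($code): read logs --previous"
  fi
}

decode_exit 137
```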

Classify the failure

Classify before you fix. The corrective action depends on the class.

  • Immediate process exit: bad args, missing files, missing env, wrong entrypoint.
  • Probe-driven restarts: liveness too aggressive, startup probe too slow or missing. (Readiness failures never restart a container; they only remove the pod from Service endpoints.)
  • Resource pressure: OOMKilled, throttling, node pressure and eviction.
  • External dependency: DNS, service routing, auth tokens, secrets/config missing.
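The classes above can be driven mechanically from the evidence you already collected. A first-pass sketch, where the reason and exit code are assumed to come from `lastState.terminated`:

```shell
# Sketch: first-pass classification from describe output.
# Inputs are illustrative; real values come from lastState.terminated.
classify() {
  local reason=$1 code=$2
  case "$reason" in
    OOMKilled)
      echo "resource pressure: raise the memory limit or fix the leak" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128)): check probes, limits, and eviction events"
      else
        echo "process exit $code: check args, env, mounts, dependencies"
      fi ;;
  esac
}

classify Error 1
```

This is deliberately coarse: its job is to pick which class to investigate, not to name the fix.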

Apply a smallest-safe fix

Do not rewrite the world. Fix one governing mechanism, then verify.

The Order prefers reversibility: change a config value, add a guard, stage the rollout, and watch convergence.

  • If probes are the cause, tune initialDelaySeconds / periodSeconds / timeoutSeconds / failureThreshold (or add a startupProbe) and verify against measured startup time.
  • If config is missing, confirm the ConfigMap/Secret exists and is referenced correctly (name, key).
  • If the process exits due to args, verify the command/args rendered into the pod spec.
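For the probe case, the arithmetic is worth doing explicitly: the first possible liveness kill lands at roughly initialDelaySeconds + failureThreshold × periodSeconds after start, and that budget must exceed real startup time. A sketch with illustrative numbers (not from any real manifest):

```shell
# Sketch: compare a liveness budget to measured startup time.
# All values are assumptions for illustration.
initial_delay=5; period=10; failure_threshold=3
startup_seconds=40   # measured from application logs

budget=$((initial_delay + failure_threshold * period))
echo "first possible liveness kill at ~${budget}s after start"
if [ "$startup_seconds" -ge "$budget" ]; then
  echo "probe kills the container before it is ready: raise the budget or add a startupProbe"
fi
```

Here the budget is 35s against a 40s startup, so the liveness probe itself is the crash loop.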

Verification

Your fix is real only if the platform converges and stays converged.

  • Restart count stops increasing.
  • Readiness stays true under real traffic/requests.
  • Events stop repeating.

shell

kubectl rollout status deploy/<deploy> -n <ns>
kubectl get pods -n <ns> -w
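“Restart count stops increasing” can be checked mechanically by sampling the count twice, a few back-off periods apart. A sketch with the samples hard-coded; in a real session each would come from `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].restartCount}'`:

```shell
# Sketch: convergence means the restart count stops moving between samples.
# Values are hard-coded for illustration.
sample_1=4
sleep 1   # in practice, wait several back-off periods between samples
sample_2=4

if [ "$sample_2" -eq "$sample_1" ]; then
  echo "converged: restart count stable at $sample_2"
else
  echo "still looping: $((sample_2 - sample_1)) new restarts"
fi
```

Record both samples and the interval in your notes; “it looked fine” is not evidence.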

Canonical link

Canonical URL: /labs/crashloopbackoff-triage