Lab: CrashLoopBackOff Triage
Practice the Order’s first rule of troubleshooting: evidence before action. Diagnose a crash loop using events, logs, exit codes, and probe semantics.
Prerequisites
What you should have before you begin.
- A local cluster (kind/minikube) or a sandbox namespace
- kubectl installed
- Comfort reading pod status
Lab text
Follow the sequence. Change one thing at a time.
Goal
You will learn to triage CrashLoopBackOff without guesswork: identify whether the crash is application-level, configuration-level, or platform-level (probes, resources, permissions).
The ritual is calm: read state, collect evidence, change one thing, verify, and record the outcome.
- Read the pod state and container exit code.
- Confirm whether the restart is caused by probes or process exit.
- Use targeted commands that preserve attribution.
What to inspect first (the canonical sequence)
CrashLoopBackOff is not a diagnosis. It is the platform telling you "the container keeps exiting, and restarts are being backed off."
- Events: probe failures, image pulls, mount errors, permission denials.
- Container state: exitCode, reason (Error/OOMKilled), lastState.
- Restart cadence: fast loops often indicate immediate process exit or failing startup behavior.
```shell
kubectl get pod <pod> -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous --all-containers=true
```
Classify the failure
Classify before you fix. The corrective action depends on the class.
- Immediate process exit: bad args, missing files, missing env, wrong entrypoint.
- Probe-driven restarts: liveness too aggressive, readiness mis-specified, startup too slow.
- Resource pressure: OOMKilled, throttling, node pressure and eviction.
- External dependency: DNS, service routing, auth tokens, secrets/config missing.
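The exit code does much of this classification for you. The jsonpath in the comment pulls the last terminated state; the helper below is a sketch of the common 128+signal convention (the function name is ours, and your application may assign codes differently):

```shell
# Read the last exit code first (placeholder pod/namespace names):
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Sketch: map a container exit code to a likely failure class.
classify_exit() {
  case "$1" in
    0)   echo "clean exit (probe-driven restart likely)" ;;
    1)   echo "application error (check logs --previous)" ;;
    126) echo "command found but not executable (entrypoint/permissions)" ;;
    127) echo "command not found (wrong entrypoint or missing binary)" ;;
    137) echo "SIGKILL: OOMKilled or forced termination" ;;
    139) echo "SIGSEGV: segmentation fault" ;;
    143) echo "SIGTERM: graceful shutdown requested" ;;
    *)   echo "unclassified exit code $1" ;;
  esac
}
```

A 137 with `reason: OOMKilled` in `lastState` points at resources; a 0 with climbing restarts points at liveness probes killing a healthy process.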
Apply a smallest-safe fix
Do not rewrite the world. Fix one governing mechanism, then verify.
The Order prefers reversibility: change a config value, add a guard, stage the rollout, and watch convergence.
- If probes are the cause, tune initialDelaySeconds / timeoutSeconds / failureThreshold and verify with real startup time.
- If config is missing, confirm the ConfigMap/Secret exists and is referenced correctly (name, key).
- If the process exits due to args, verify the command/args rendered into the pod spec.
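If probes are driving the restarts, the governing fields live on the container spec. A hedged fragment (path, port, and values are illustrative; measure your real startup time first):

```yaml
# Illustrative values only: derive initialDelaySeconds from measured startup.
livenessProbe:
  httpGet:
    path: /healthz          # assumes the app exposes a health endpoint here
    port: 8080              # placeholder port
  initialDelaySeconds: 30   # > worst-case observed startup
  timeoutSeconds: 5
  failureThreshold: 6       # tolerate transient slowness before restarting
```

For genuinely slow-starting apps, a startupProbe is usually a cleaner fix than inflating liveness delays, since it only guards the boot phase.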
Verification
Your fix is real only if the platform converges and stays converged.
- Restart count stops increasing.
- Readiness stays true under real traffic/requests.
- Events stop repeating.
```shell
kubectl rollout status deploy/<deploy> -n <ns>
kubectl get pods -n <ns> -w
```
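Restart count is the clearest convergence signal. A small comparator sketch for two samples taken some minutes apart (the kubectl jsonpath in the comment uses placeholder names; the function is ours):

```shell
# Capture a sample before and after the fix (placeholder names):
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Sketch: compare two restart-count samples; convergence means no growth.
converged() {
  [ "$2" -le "$1" ] && echo "converged" || echo "still restarting"
}
```

Record both samples with the fix you applied; attribution is part of the ritual.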