
Lab · Foundations

Lab: CrashLoopBackOff Triage

Practice the Order’s first rule of troubleshooting: evidence before action. Diagnose a crash loop using events, logs, exit codes, and probe semantics.

Prerequisites

What you should have before you begin.

Workloads · Operations · Reliability
  • A local cluster (kind/minikube) or a sandbox namespace
  • kubectl installed
  • Comfort reading pod status

Lab text

Follow the sequence. Change one thing at a time.

Goal

You will learn to triage CrashLoopBackOff without guesswork: identify whether the crash is application-level, configuration-level, or platform-level (probes, resources, permissions).

The ritual is calm: read state, collect evidence, change one thing, verify, and record the outcome.

  • Read the pod state and container exit code.
  • Confirm whether the restart is caused by probes or process exit.
  • Use targeted commands that preserve attribution.

What to inspect first (the canonical sequence)

CrashLoopBackOff is not a diagnosis. It is the platform telling you “the container keeps exiting, and the kubelet is backing off before restarting it.”

  • Events: probe failures, image pulls, mount errors, permission denials.
  • Container state: exitCode, reason (Error/OOMKilled), lastState.
  • Restart cadence: fast loops usually mean the process exits immediately on start; restarts at a regular interval often track a failing probe’s period.
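The cadence itself carries information because the back-off is predictable. A minimal sketch of the kubelet’s default behavior (delays start around 10 seconds, double per restart, and cap at 5 minutes; exact timing varies):

```shell
# Sketch: approximate crash back-off schedule (kubelet defaults).
# Values are illustrative; the kubelet resets the delay after a
# sustained period of successful running.
delay=10
for restart in 1 2 3 4 5 6; do
  echo "restart $restart: next back-off ~${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

If restarts arrive faster than this schedule allows, you are probably watching multiple containers or multiple pods, not one loop.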

shell

kubectl get pod <pod> -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous --all-containers=true
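The exit code from `lastState.terminated` is the single highest-value datum. A sketch of how to read it (the code itself would come from `kubectl describe`; the value here is hard-coded for illustration):

```shell
# Sketch: interpret a container exit code.
# Codes above 128 mean "killed by signal (code - 128)".
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128)) (137 = SIGKILL, often OOMKilled; 143 = SIGTERM)"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit: the process finished and the restartPolicy restarted it"
  else
    echo "application error exit ($code): read logs --previous"
  fi
}

decode_exit 137
```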

Classify the failure

Classify before you fix. The corrective action depends on the class.

  • Immediate process exit: bad args, missing files, missing env, wrong entrypoint.
  • Probe-driven restarts: liveness too aggressive, startup probe too slow or missing. (Readiness failures never restart a container; they only remove the pod from Service endpoints.)
  • Resource pressure: OOMKilled, throttling, node pressure and eviction.
  • External dependency: DNS, service routing, auth tokens, secrets/config missing.
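The classes above can be driven mechanically from the evidence you already collected. A first-pass sketch, where the reason and exit code are assumed to come from `lastState.terminated`:

```shell
# Sketch: first-pass classification from describe output.
# Inputs are illustrative; real values come from lastState.terminated.
classify() {
  local reason=$1 code=$2
  case "$reason" in
    OOMKilled)
      echo "resource pressure: raise the memory limit or fix the leak" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128)): check probes, limits, and eviction events"
      else
        echo "process exit $code: check args, env, mounts, dependencies"
      fi ;;
  esac
}

classify Error 1
```

This is deliberately coarse: its job is to pick which class to investigate, not to name the fix.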

Apply a smallest-safe fix

Do not rewrite the world. Fix one governing mechanism, then verify.

The Order prefers reversibility: change a config value, add a guard, stage the rollout, and watch convergence.

  • If probes are the cause, tune initialDelaySeconds / periodSeconds / timeoutSeconds / failureThreshold (or add a startupProbe) and verify against measured startup time.
  • If config is missing, confirm the ConfigMap/Secret exists and is referenced correctly (name, key).
  • If the process exits due to args, verify the command/args rendered into the pod spec.
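For the probe case, the arithmetic is worth doing explicitly: the first possible liveness kill lands at roughly initialDelaySeconds + failureThreshold × periodSeconds after start, and that budget must exceed real startup time. A sketch with illustrative numbers (not from any real manifest):

```shell
# Sketch: compare a liveness budget to measured startup time.
# All values are assumptions for illustration.
initial_delay=5; period=10; failure_threshold=3
startup_seconds=40   # measured from application logs

budget=$((initial_delay + failure_threshold * period))
echo "first possible liveness kill at ~${budget}s after start"
if [ "$startup_seconds" -ge "$budget" ]; then
  echo "probe kills the container before it is ready: raise the budget or add a startupProbe"
fi
```

Here the budget is 35s against a 40s startup, so the liveness probe itself is the crash loop.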

Verification

Your fix is real only if the platform converges and stays converged.

  • Restart count stops increasing.
  • Readiness stays true under real traffic/requests.
  • Events stop repeating.

shell

kubectl rollout status deploy/<deploy> -n <ns>
kubectl get pods -n <ns> -w
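“Restart count stops increasing” can be checked mechanically by sampling the count twice, a few back-off periods apart. A sketch with the samples hard-coded; in a real session each would come from `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].restartCount}'`:

```shell
# Sketch: convergence means the restart count stops moving between samples.
# Values are hard-coded for illustration.
sample_1=4
sleep 1   # in practice, wait several back-off periods between samples
sample_2=4

if [ "$sample_2" -eq "$sample_1" ]; then
  echo "converged: restart count stable at $sample_2"
else
  echo "still looping: $((sample_2 - sample_1)) new restarts"
fi
```

Record both samples and the interval in your notes; “it looked fine” is not evidence.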

Canonical link

Canonical URL: /labs/crashloopbackoff-triage