Rites & Trials

On Call in the Dark Order: Kubernetes Failure Triage

When the pager rings, the rite is restraint: preserve evidence, choose reversible actions, and stabilize the control plane before you chase symptoms.

Authored as doctrine; evaluated as systems craft.

Doctrine

On-call success is not brilliance; it is procedure. Your first job is to stop the incident from expanding. Your second job is to create a clean narrative from evidence: what changed, what failed, and what the system is doing now.

Kubblai triage doctrine is simple: control plane first, blast radius second, workload symptoms last.

  • Preserve evidence: events, controller logs, audit, and metrics snapshots.
  • Reduce change: freeze deploys and policy churn during diagnosis.
  • Choose reversible actions first; irreversible actions must be justified.
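
A quick way to honor "preserve evidence" is to snapshot state before any mutation. A minimal sketch, assuming kubectl access; the directory layout and file names are illustrative:

```shell
# Snapshot evidence into a timestamped directory before touching anything.
# Events are garbage-collected quickly, so capture them first.
EVIDENCE_DIR="incident-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$EVIDENCE_DIR"

kubectl get events -A --sort-by=.lastTimestamp -o wide > "$EVIDENCE_DIR/events.txt"
kubectl get nodes -o wide > "$EVIDENCE_DIR/nodes.txt"
kubectl describe nodes > "$EVIDENCE_DIR/nodes-describe.txt"

# Raw API server metrics: request latency, error rates, inflight requests.
kubectl get --raw /metrics > "$EVIDENCE_DIR/apiserver-metrics.txt"
```

Everything here is read-only, so it is safe even on a stressed cluster; if the API server itself is the problem, expect these calls to be slow, and capture that too.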

The first five minutes: classify the incident

Do not begin with kubectl randomness. Begin by classifying: is the control plane healthy, is the data plane healthy, or is this an application failure? Each class has different safe actions.

If the API is slow or failing admission, everything else is theatre.

  • API latency and error rates (429/5xx) indicate systemic control-plane stress.
  • Node NotReady storms indicate a data-plane or network partition.
  • Single namespace failures often point to quota, policy, or a workload regression.

```shell
kubectl get --raw '/readyz?verbose'
kubectl get nodes
kubectl get events -A --sort-by=.lastTimestamp | tail -n 80
```

Stabilize: stop contributing to churn

Many operators worsen incidents by increasing write pressure on an already stressed API server. Stop applying manifests. Stop restarting controllers. Stop “fixing” by creating more objects.

Stabilization means reducing work: pause noisy controllers, disable non-essential automation, and reduce resync pressure.

  • If admission is failing, prioritize webhook health and bypass posture.
  • If etcd is saturated, reduce write amplification (including status spam).
  • If nodes are flapping, identify the failure domain and stop draining blindly.
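
The stabilization moves above can be expressed as reversible kubectl commands. The workload and namespace names below are illustrative; identify your actual noisy clients from audit logs or API server metrics before acting:

```shell
# Pause a Deployment's rollout so the controller stops creating new ReplicaSets.
# Reversible with: kubectl rollout resume deployment/example-app -n example-ns
kubectl rollout pause deployment/example-app -n example-ns

# Scale a known chatty operator to zero; reversible by scaling back up.
kubectl scale deployment/example-operator -n operators --replicas=0

# Inspect admission webhooks before changing bypass posture; each webhook's
# failurePolicy determines whether a dead webhook blocks or admits requests.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
```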

Evidence standards

Triage must be communicable. If you cannot explain the system’s state to another operator with concrete evidence, you are guessing.

The Order expects a short incident spine: symptom → scope → suspected mechanism → reversible mitigation → verification.

  • Events: look for admission rejections, image pulls, probes, eviction, and node pressure.
  • Metrics: API latency, etcd fsync/commit, controller queue depth, kubelet errors.
  • Logs: the first error is often upstream of the loudest error.
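
The evidence sources above map to concrete read-only queries. The label selectors below assume a kubeadm-style cluster with control-plane pods in kube-system; managed offerings expose these signals differently:

```shell
# API server request counters by verb and code, in Prometheus text format.
kubectl get --raw /metrics | grep '^apiserver_request_total' | head -n 20

# etcd health as seen by the API server's own readiness checks.
kubectl get --raw /readyz/etcd

# Controller manager and scheduler logs: the first error is often upstream
# of the loudest one, so read from the start of the incident window.
kubectl logs -n kube-system -l component=kube-controller-manager --tail=200
kubectl logs -n kube-system -l component=kube-scheduler --tail=200
```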

Safe mitigations that scale

A mitigation is safe when it reduces blast radius and preserves optionality. It is unsafe when it destroys state, masks evidence, or commits you to a single narrative.

Do not “fix” by deleting objects you do not understand.

  • Scale down known noisy deployments that hammer the API.
  • Temporarily relax non-critical policies only with explicit time bounds and audit.
  • Fail open only when you can measure and compensate.
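
A time-bounded, audited mitigation might look like the sketch below. The deployment, namespace, annotation key, and incident ID are all hypothetical; `date -d` is GNU coreutils syntax:

```shell
# Record who/why/when on the object itself, then apply the reversible change.
RESTORE_BY=$(date -u -d '+2 hours' +%FT%TZ)   # GNU date; use -v+2H on BSD/macOS
kubectl annotate deployment/example-batch -n jobs --overwrite \
  incident.example.com/mitigation="scaled to 0 for INC-1234; restore by $RESTORE_BY"

kubectl scale deployment/example-batch -n jobs --replicas=0
```

The annotation survives on the object and in audit logs, so the mitigation stays explainable after the incident and the time bound is visible to whoever finds it later.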

After the restore: convert pain into doctrine

The rite ends only when memory is preserved. Postmortems are not ceremonies; they are how you reduce entropy across the platform.

Write the guardrail you wish had existed: policy, test, runbook, or automation with clear failure modes.

  • Capture a timeline with the evidence that justified each action.
  • Create a control-plane health dashboard that answers readiness questions fast.
  • Ensure incident learnings become shared texts in the archive.
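
A dashboard can start as a script. This sketch polls the stock API server health endpoints, one line per question an incident commander will ask; the paths are the standard `livez`/`readyz` checks:

```shell
# One readiness question per line; individual checks such as /readyz/etcd
# isolate which dependency is unhealthy.
for check in livez readyz readyz/etcd; do
  printf '%-12s: ' "$check"
  kubectl get --raw "/$check" || printf 'FAIL'
  echo
done
```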