On Call in the Dark Order: Kubernetes Failure Triage
When the pager rings, the rite is restraint: preserve evidence, choose reversible actions, and stabilize the control plane before you chase symptoms.
Authored as doctrine; evaluated as systems craft.
Doctrine
On-call success is not brilliance; it is procedure. Your first job is to stop the incident from expanding. Your second job is to create a clean narrative from evidence: what changed, what failed, and what the system is doing now.
Kubblai triage doctrine is simple: control plane first, blast radius second, workload symptoms last.
- Preserve evidence: events, controller logs, audit, and metrics snapshots.
- Reduce change: freeze deploys and policy churn during diagnosis.
- Choose reversible actions first; irreversible actions must be justified.
The first five minutes: classify the incident
Do not begin by running kubectl commands at random. Begin by classifying: is the control plane healthy, is the data plane healthy, or is this an application failure? Each class has different safe actions.
If the API is slow or failing admission, everything else is theatre.
- API latency and error rates (429/5xx) indicate systemic control-plane stress.
- Node NotReady storms indicate a data-plane or network partition.
- Single namespace failures often point to quota, policy, or a workload regression.
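One way to put numbers on the first signal is to read the API server's own request counters. The sketch below assumes the API server exposes the Prometheus counter apiserver_request_total with a code label on /metrics, and that your credentials may read that endpoint. The helper name summarize_api_codes is hypothetical, not part of kubectl.

```shell
# Hypothetical helper: sum apiserver_request_total by HTTP status code and
# print only throttling (429) and server-error (5xx) totals.
# Reads Prometheus text-format metrics on stdin.
summarize_api_codes() {
  awk '
    /^apiserver_request_total[{]/ {
      # Pull the numeric value out of the code="NNN" label.
      if (match($0, /code="[0-9]+"/)) {
        code = substr($0, RSTART + 6, RLENGTH - 7)
        total[code] += $NF   # the sample value is the last field
      }
    }
    END {
      for (c in total)
        if (c == "429" || c ~ /^5/) printf "%s %d\n", c, total[c]
    }
  ' | sort
}
```

Typical use would be: kubectl get --raw /metrics | summarize_api_codes. The counters are cumulative since the API server started, so a single reading only shows that throttling or server errors have happened at some point. Take two readings a minute apart and compare them to see whether they are happening now.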
kubectl
shell
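# Per-check readiness of the API server (quoted so the shell does not glob the "?")
kubectl get --raw '/readyz?verbose'
# Node status at a glance: look for groups of NotReady nodes
kubectl get nodes
# The 80 most recent events across all namespaces
kubectl get events -A --sort-by=.lastTimestamp | tail -n 80
Stabilize: stop contributing to churn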
Many operators worsen incidents by increasing write pressure on an already stressed API server. Stop applying manifests. Stop restarting controllers. Stop “fixing” by creating more objects.
Stabilization means reducing work: pause noisy controllers, disable non-essential automation, and reduce resync pressure.
- If admission is failing, prioritize webhook health and bypass posture: check whether the webhook backends are reachable and whether each webhook fails open or closed.
- If etcd is saturated, reduce write amplification (including status spam).
- If nodes are flapping, identify the failure domain and stop draining blindly.
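For the admission case, bypass posture can be read without writing anything to the API. The following is a minimal sketch, assuming your credentials can list webhook configurations. The helper name webhook_posture and the column labels are hypothetical; failurePolicy and timeoutSeconds are standard fields of admission webhook configurations.

```shell
# Hypothetical helper: read-only summary of admission webhook posture.
# POLICY "Fail" means requests are rejected when the webhook is unreachable;
# "Ignore" means the webhook fails open.
webhook_posture() {
  cols='NAME:.metadata.name,POLICY:.webhooks[*].failurePolicy,TIMEOUT_S:.webhooks[*].timeoutSeconds'
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    echo "== $kind =="
    kubectl get "$kind" -o "custom-columns=$cols"
  done
}
```

Because it only issues two list calls, a check like this adds little pressure to a stressed API server, and it tells you which webhooks can block all writes before you decide whether any bypass is justified.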
Evidence standards
Triage must be communicable. If you cannot explain the system’s state to another operator with concrete evidence, you are guessing.
The Order expects a short incident spine: symptom → scope → suspected mechanism → reversible mitigation → verification.
- Events: look for admission rejections, image pulls, probes, eviction, and node pressure.
- Metrics: API latency, etcd WAL fsync and backend commit latency, controller queue depth, kubelet errors.
- Logs: the first error is often upstream of the loudest error.
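Evidence is easier to communicate when it is captured once, early, and to disk. The sketch below is one possible shape for that, assuming read access to nodes and events. The helper name snapshot_evidence and the file names are hypothetical. Events in particular are worth saving first, since the API server keeps them only for a limited time (one hour by default).

```shell
# Hypothetical helper: write a read-only evidence snapshot into a directory.
# Usage: snapshot_evidence [DIR]   (DIR defaults to a UTC-timestamped name)
snapshot_evidence() {
  dir="${1:-incident-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$dir" || return 1
  # Errors are captured too: a failing call is itself evidence.
  kubectl get --raw '/readyz?verbose' > "$dir/readyz.txt" 2>&1
  kubectl get nodes -o wide > "$dir/nodes.txt" 2>&1
  kubectl get events -A --sort-by=.lastTimestamp > "$dir/events.txt" 2>&1
  # Print the directory so it can be pasted into the incident timeline.
  echo "$dir"
}
```

The snapshot is deliberately small: three reads, no writes. Controller logs, audit logs, and metrics usually live outside the API and would need their own capture steps.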
Safe mitigations that scale
A mitigation is safe when it reduces blast radius and preserves optionality. It is unsafe when it destroys state, masks evidence, or commits you to a single narrative.
Do not “fix” by deleting objects you do not understand.
- Scale down known noisy deployments that hammer the API.
- Temporarily relax non-critical policies only with explicit time bounds and audit.
- Fail open only when you can measure and compensate.
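Scaling down a noisy deployment is only reversible if you know what to scale back to. The following is a minimal sketch of that idea, assuming the workload is a Deployment and that no autoscaler will fight the manual change. The helper names scale_down_reversibly and restore_scale and the record file are hypothetical.

```shell
# Hypothetical helpers: scale a deployment to zero while recording the
# previous replica count, so the action can be undone later.
# Usage: scale_down_reversibly NAMESPACE NAME [RECORD_FILE]
scale_down_reversibly() {
  ns="$1"; name="$2"; record="${3:-./replicas-$ns-$name.txt}"
  replicas="$(kubectl get deployment "$name" -n "$ns" \
    -o jsonpath='{.spec.replicas}')" || return 1
  # Refuse to act unless we read back a plain non-negative integer.
  case "$replicas" in
    ''|*[!0-9]*) echo "could not read replicas for $ns/$name" >&2; return 1 ;;
  esac
  printf '%s\n' "$replicas" > "$record" || return 1
  kubectl scale deployment "$name" -n "$ns" --replicas=0
}

# Usage: restore_scale NAMESPACE NAME [RECORD_FILE]
restore_scale() {
  ns="$1"; name="$2"; record="${3:-./replicas-$ns-$name.txt}"
  kubectl scale deployment "$name" -n "$ns" --replicas="$(cat "$record")"
}
```

The record file doubles as evidence for the timeline: it shows what the replica count was and that the change was made deliberately. If a HorizontalPodAutoscaler targets the deployment, check how it interacts with a manual scale to zero before relying on this.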
After the restore: convert pain into doctrine
The rite ends only when memory is preserved. Postmortems are not ceremonies; they are how you reduce entropy across the platform.
Write the guardrail you wish had existed: policy, test, runbook, or automation with clear failure modes.
- Capture a timeline with the evidence that justified each action.
- Create a control-plane health dashboard that answers readiness questions fast.
- Ensure incident learnings become shared texts in the archive.
Canonical Link
Canonical URL: /library/on-call-in-the-dark-order-kubernetes-failure-triage
Related Readings
Canonical Texts
Library: Incident Response as a Trial of Faith
Incidents reveal the true governance of your platform: who can act, what can be changed, and whether your system can recover with discipline.
Advanced Disciplines
Library: Observability as Revelation
Observability is the discipline of evidence. Without it, incident response becomes storytelling.
Sacred Systems
Library: The API Server as the Gate of Truth
The API is the only public reality in Kubernetes. Everything else is implementation detail and transient effect.
Sacred Systems
Library: The Hidden Burdens of etcd
etcd is where intent is stored. It is also where unbounded ambition becomes latency, instability, and collapse.
Canonical Texts
Library: Kubblai Doctrine: Cluster Discipline and Operational Safety
Operational safety is not a mood. It is a set of constraints and practices that keep change survivable and failure contained.