Observability for People Who Actually Carry the Pager
If observability does not change decisions during an incident, it is decoration. Signal must be tied to failure modes and owned by the people who respond.
Authored as doctrine; evaluated as systems craft.
Doctrine
Observability is a discipline of truth. In Kubernetes, truth is expensive: distributed state, eventual consistency, and layers of control loops that obscure causality.
Kubblai doctrine: instrument what you can act on. If you cannot name the action, the metric is a distraction.
- Tie signals to failure modes and runbook decisions.
- Prefer low-cardinality, high-meaning metrics over decorative dashboards.
- Treat alert fatigue as an availability risk.
Control plane visibility is mandatory
Most platforms are blind where it matters: API server latency, request throttling, admission performance, and controller churn. Without these, you misdiagnose system failures as application failures.
You must be able to answer: is the platform converging?
- API server request duration, inflight requests, and 429/5xx rates.
- Admission webhook latency/error rates and timeouts.
- Controller queue depth, reconcile duration, and conflict retries.
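The question "is the platform converging?" can be reduced to a small decision function over these signals. A minimal sketch, assuming the values have already been sampled from your metrics backend; the field names and thresholds below are illustrative, not doctrine, and must be tuned per cluster:

```python
from dataclasses import dataclass

@dataclass
class ControlPlaneSample:
    """Point-in-time readings of the control-plane signals listed above."""
    p99_request_seconds: float   # API server request duration, 99th percentile
    throttled_rate: float        # 429 responses per second
    error_rate: float            # 5xx responses per second
    webhook_timeout_rate: float  # admission webhook timeouts per second
    max_queue_depth: int         # deepest controller workqueue

def converging(s: ControlPlaneSample) -> bool:
    """True if the platform looks like it is converging.

    Every threshold here is an assumption for illustration; the point is
    that each signal maps to a yes/no question the on-call can act on.
    """
    return (
        s.p99_request_seconds < 1.0
        and s.throttled_rate < 1.0
        and s.error_rate < 0.1
        and s.webhook_timeout_rate == 0
        and s.max_queue_depth < 100
    )
```

A function like this is less a monitoring tool than a forcing device: if you cannot write the predicate, you have not yet named the failure mode.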
Change correlation: the missing organ
Incidents frequently begin as "nothing changed" until someone remembers a deployment, a policy rollout, or a node pool upgrade. Correlation should be engineered, not improvised.
Every material change should emit an event that can be queried and graphed.
- Emit deploy events from GitOps/CI with version, SHA, and scope.
- Emit policy change events with owners and rollback links.
- Correlate node pool upgrades and autoscaler actions with workload health.
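One way to engineer that correlation is to give every material change a single, queryable shape. A sketch, assuming a JSON event sink such as a dashboard annotations API or a log stream; the field names are hypothetical, what matters is that they are uniform across deploys, policy changes, and node operations:

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class ChangeEvent:
    """A material change, emitted by CI/GitOps, policy tooling, or node
    lifecycle automation. Field names are illustrative."""
    kind: str                 # "deploy" | "policy" | "node-pool-upgrade"
    version: str              # human-readable version
    sha: str                  # commit or image digest
    scope: str                # cluster, namespace, or workload affected
    owner: str                # team that can roll it back
    rollback_url: str = ""    # link to the rollback procedure
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize for a sink that can graph events against metrics."""
        return json.dumps(asdict(self), sort_keys=True)
```

The sink matters less than the discipline: the event must exist before the incident does, so that "what changed?" is a query, not an interview.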
Alert hygiene as reliability work
Alerts must be sparse, owned, and actionable. If an alert triggers without a runbook action, it trains the on-call to ignore it.
Page on symptoms that require human intervention; ticket on slow drift.
- Separate paging alerts from informational alerts.
- Use multi-window burn rates for SLO-based paging.
- Retire alerts that do not lead to action in postmortems.
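The multi-window rule can be stated precisely. A sketch of the common fast-burn page, where a long window proves the problem is real and a short window proves it is still happening; the 14.4x factor is the widely used value for consuming 2% of a 30-day error budget in one hour, and should be tuned to your own budget policy:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(err_1h: float, err_5m: float,
                slo_target: float = 0.999, factor: float = 14.4) -> bool:
    """Page only when BOTH windows burn fast.

    The 1h window filters out blips; the 5m window stops paging once the
    problem has already subsided.
    """
    return (burn_rate(err_1h, slo_target) >= factor
            and burn_rate(err_5m, slo_target) >= factor)
```

Everything below the paging factor becomes a ticket, which is exactly the sparse/owned/actionable split the doctrine demands.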
The on-call dashboard: what it must answer
A single dashboard should let the on-call classify an incident within minutes.
- Is the control plane healthy? (API latency, errors, etcd pressure).
- Are nodes healthy? (NotReady, pressure, kubelet errors).
- Are rollouts converging? (deployment conditions, replica set churn).
- Is traffic flowing? (ingress/controller health, DNS, service discovery).
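The four questions map to a triage order: platform first, then nodes, then rollouts, then traffic, because each layer makes the signals below it suspect. A sketch, assuming a boolean health check already exists for each layer:

```python
def classify(control_plane_ok: bool, nodes_ok: bool,
             rollouts_ok: bool, traffic_ok: bool) -> str:
    """Classify in the order a responder should look.

    An unhealthy control plane makes every downstream signal unreliable,
    so it wins; the same logic applies at each subsequent layer.
    """
    if not control_plane_ok:
        return "control-plane"
    if not nodes_ok:
        return "node"
    if not rollouts_ok:
        return "rollout"
    if not traffic_ok:
        return "traffic"
    return "healthy"
```

If the dashboard cannot feed this function, it is decoration in the sense the opening line warns about.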
Canonical Link
Canonical URL: /library/observability-for-people-who-carry-the-pager
Related Readings
- Observability as Revelation (Advanced Disciplines). Observability is the discipline of evidence. Without it, incident response becomes storytelling.
- Traces, Metrics, and the Reading of Omens (Advanced Disciplines). Telemetry is a system. If you do not govern cardinality and cost, observability becomes its own outage.
- On Call in the Dark Order: Kubernetes Failure Triage (Rites & Trials). When the pager rings, the rite is restraint: preserve evidence, choose reversible actions, and stabilize the control plane before you chase symptoms.
- Debugging the Control Plane Under Pressure (Advanced Disciplines). The control plane fails quietly, then all at once. Debugging it requires you to reduce churn, read saturation signals, and avoid write amplification.
- Incident Doctrine for Platform Teams (Rites & Trials). Platform incidents are governance incidents. The doctrine must define authority, evidence, safe mitigations, and how memory becomes guardrail.