Canonical Texts

Observability for People Who Actually Carry the Pager

If observability does not change decisions during an incident, it is decoration. Signal must be tied to failure modes and owned by the people who respond.

Authored as doctrine; evaluated as systems craft.

Doctrine

Observability is a discipline of truth. In Kubernetes, truth is expensive: distributed state, eventual consistency, and layers of control loops that obscure causality.

Kubblai doctrine: instrument what you can act on. If you cannot name the action, the metric is a distraction.

  • Tie signals to failure modes and runbook decisions.
  • Prefer low-cardinality, high-meaning metrics over decorative dashboards.
  • Treat alert fatigue as an availability risk.
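The cardinality point is arithmetic: a metric's series count is the product of its label cardinalities. A minimal sketch (the label sets and counts below are hypothetical, chosen only to show the multiplication):

```python
# Illustrative arithmetic: one metric name produces one time series per
# unique label combination, so series count multiplies across labels.
def series_count(label_cardinalities: dict) -> int:
    """Estimate the number of time series emitted by a single metric name."""
    n = 1
    for cardinality in label_cardinalities.values():
        n *= cardinality
    return n

# A "decorative" metric labeled per pod and per URL path explodes
# (hypothetical fleet sizes):
decorative = series_count({"pod": 500, "path": 2000, "status": 10})
# The same signal keyed to a decision stays cheap:
actionable = series_count({"deployment": 40, "status_class": 3})
```

Here `decorative` is 10,000,000 series against 120 for `actionable`; the second answers "which deployment is erroring" just as well during a page.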

Control plane visibility is mandatory

Most platforms are blind where it matters: API server latency, request throttling, admission performance, and controller churn. Without these, you misdiagnose system failures as application failures.

You must be able to answer: is the platform converging?

  • API server request duration, inflight requests, and 429/5xx rates.
  • Admission webhook latency/error rates and timeouts.
  • Controller queue depth, reconcile duration, and conflict retries.
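One way to make "is the platform converging?" concrete is to turn API server counter deltas into ratios the on-call can threshold. A sketch, assuming you already scrape request counts by status code; the sample numbers and field names are illustrative, not doctrine:

```python
# Sketch: summarize API server health from counter increases over a window.
def error_rate(requests_by_code: dict, window_s: float) -> dict:
    """Turn per-status-code request deltas into rate and error ratios."""
    total = sum(requests_by_code.values())
    throttled = requests_by_code.get("429", 0)
    server_err = sum(v for code, v in requests_by_code.items()
                     if code.startswith("5"))
    return {
        "rps": total / window_s,
        "throttle_ratio": throttled / total if total else 0.0,
        "5xx_ratio": server_err / total if total else 0.0,
    }

# Hypothetical counter deltas over a 5-minute window:
health = error_rate({"200": 290_000, "429": 6_000,
                     "500": 1_200, "503": 2_800}, window_s=300)
```

A `throttle_ratio` climbing toward a few percent says "platform", not "application"; that distinction is exactly the misdiagnosis the section warns about.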

Change correlation: the missing organ

Incidents frequently open with "nothing changed" until someone remembers a deployment, a policy rollout, or a node pool upgrade. Correlation should be engineered, not improvised.

Every material change should emit an event that can be queried and graphed.

  • Emit deploy events from GitOps/CI with version, SHA, and scope.
  • Emit policy change events with owners and rollback links.
  • Correlate node pool upgrades and autoscaler actions with workload health.
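The event contract above can be sketched as a small payload builder. This assumes events land in some time-indexed store you can query and graph alongside metrics; the field names and values are illustrative:

```python
import json
import time

def deploy_event(service: str, version: str, sha: str, scope: str) -> str:
    """Build a queryable change event as JSON (illustrative schema)."""
    return json.dumps({
        "type": "deploy",
        "ts": int(time.time()),  # lets the event be overlaid on the incident timeline
        "service": service,
        "version": version,
        "sha": sha,
        "scope": scope,          # e.g. cluster, namespace, or node pool
    })

# Emitted from CI/GitOps at apply time (hypothetical values):
event = deploy_event("checkout", "v1.42.0", "9f3c2ab", "namespace=payments")
```

Policy changes and node pool upgrades get the same shape with a different `type`, plus owner and rollback-link fields; the point is a uniform schema, not this particular one.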

Alert hygiene as reliability work

Alerts must be sparse, owned, and actionable. If an alert triggers without a runbook action, it trains the on-call to ignore it.

Page on symptoms that require human intervention; ticket on slow drift.

  • Separate paging alerts from informational alerts.
  • Use multi-window burn rates for SLO-based paging.
  • Retire alerts that do not lead to action in postmortems.
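Multi-window burn-rate paging is simple arithmetic: burn rate is how many times faster than allowed the error budget is being consumed. A minimal sketch following the common SRE-workbook pattern; the 14.4x threshold and window pairing are conventional defaults, not prescriptions:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_ratio / budget

def should_page(err_1h: float, err_5m: float, slo: float = 0.999) -> bool:
    # Long window proves the burn is sustained; short window proves
    # it is still happening, so a resolved spike does not page anyone.
    return burn_rate(err_1h, slo) > 14.4 and burn_rate(err_5m, slo) > 14.4

should_page(err_1h=0.02, err_5m=0.03)    # sustained 2% errors vs 0.1% budget: page
should_page(err_1h=0.02, err_5m=0.0002)  # burn already stopped: no page
```

The AND of both windows is what keeps these alerts sparse: a page fires only while the budget is actively being destroyed.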

The on-call dashboard: what it must answer

A single dashboard should let the on-call classify the incident within minutes.

  • Is the control plane healthy? (API latency, errors, etcd pressure).
  • Are nodes healthy? (NotReady, pressure, kubelet errors).
  • Are rollouts converging? (deployment conditions, replica set churn).
  • Is traffic flowing? (ingress/controller health, DNS, service discovery).
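The four questions form an ordered triage: check the platform before blaming the application. A sketch of that ordering as data; the query strings are PromQL-flavored illustrations (metric names from the Kubernetes and kube-state-metrics ecosystems), not verified expressions:

```python
# Ordered triage checklist: the list order encodes where blame lands first.
TRIAGE = [
    ("control plane healthy?", [
        "histogram_quantile(0.99, apiserver_request_duration_seconds_bucket)",
        "etcd disk fsync latency",
    ]),
    ("nodes healthy?", [
        "kube_node_status_condition{condition='Ready',status='false'}",
    ]),
    ("rollouts converging?", [
        "kube_deployment_status_replicas_unavailable",
    ]),
    ("traffic flowing?", [
        "ingress controller error rate",
        "coredns_dns_requests_total",
    ]),
]

def classify(answers: dict) -> str:
    """Return the first failing layer; unanswered questions default to healthy."""
    for question, _signals in TRIAGE:
        if not answers.get(question, True):
            return question
    return "application layer"

classify({"control plane healthy?": True, "nodes healthy?": False})
```

If every platform question answers yes, the incident is an application problem by elimination, which is itself a classification worth reaching in minutes.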