The Dark Order’s Guide to Observability in Kubernetes

Observability is not dashboards. It is the discipline of evidence: the ability to prove what happened, what changed, and why the system behaved as it did.

Authored as doctrine; evaluated as systems craft.

Doctrine: evidence before narrative

During incidents, teams tell stories. Observability is how you replace stories with evidence.

The Order’s rule: if you cannot answer ‘what changed’ and ‘where is the bottleneck’, you do not yet have an observability system.

  • Events for object-level narrative.
  • Metrics for saturation and error budgets.
  • Traces for causality across boundaries.
  • Audit for governance and attribution.

Control plane visibility is mandatory

Clusters often fail through the control plane: API server latency, admission failures, scheduler stalls, etcd pressure. If you only observe workloads, you will misdiagnose platform incidents as application incidents.
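A minimal sketch of control plane evidence, assuming kubectl is configured and the caller may read the API server's `/readyz` and `/metrics` endpoints. The helper names `failing_checks` and `count_5xx` are hypothetical; they only filter text that kubectl returns.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of kubectl.

# Keep only failing checks from `/readyz?verbose` output read on stdin.
# Passing checks start with "[+]", failing ones with "[-]".
failing_checks() {
  awk 'index($0, "[-]") == 1'
}

# Sum apiserver_request_total samples whose HTTP code label is 5xx.
# Reads Prometheus text format on stdin; prints 0 when nothing matches.
count_5xx() {
  awk '/^apiserver_request_total[{]/ && /code="5[0-9][0-9]"/ { sum += $NF }
       END { print sum + 0 }'
}

# Usage against a live cluster (requires read access to these endpoints):
#   kubectl get --raw '/readyz?verbose' | failing_checks
#   kubectl get --raw /metrics | count_5xx
```

The 5xx figure is a cumulative counter, so a rate needs two samples taken some time apart. When a readiness check fails, kubectl may report the response as an error instead of plain output, so check its exit status as well and treat an empty result as inconclusive.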

A serious shrine keeps watch over the gate of truth.

Signal quality: the quiet art

An excess of logs is not observability; it is storage debt. An excess of alerts is not readiness; it is learned helplessness.

Align signals with failure modes: API server request rate (QPS) and latency, controller reconcile duration, work queue depths, node pressure conditions, and rollout health.
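Two of these signals, node pressure and rollout health, can be read directly with kubectl. The sketch below assumes a configured kubectl; the function names are hypothetical, and the namespace and deployment passed to `rollout_healthy` are whatever your cluster uses.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical wrappers around kubectl.

# Print one line per node as "name:Type=Status;Type=Status;...".
node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{"\n"}{end}'
}

# From node_conditions output on stdin, print "node Condition=True"
# for every MemoryPressure, DiskPressure, or PIDPressure that is True.
under_pressure() {
  awk -F'[:;]' '{ for (i = 2; i <= NF; i++) if ($i ~ /Pressure=True$/) print $1, $i }'
}

# Exit non-zero if the deployment has not finished rolling out in time.
# Arguments: namespace, deployment name, optional timeout (default 60s).
rollout_healthy() {
  kubectl -n "$1" rollout status "deployment/$2" --timeout="${3:-60s}"
}

# Usage:
#   node_conditions | under_pressure
#   rollout_healthy default my-app 120s   # "my-app" is a placeholder name
```

An empty result from `under_pressure` means no node reports a pressure condition; it says nothing about nodes that are NotReady, which is a separate signal.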

Runbooks as canonical texts

Runbooks are doctrine translated into action. They must be executable under stress: concrete commands, expected outputs, and decision points.
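One way to make a runbook step executable is to encode the command, the expected output, and the decision point together. The sketch below is illustrative: the step, its wording, and the function names are assumptions, not a prescribed format.

```shell
#!/usr/bin/env bash
# Runbook step sketch (hypothetical): "Are any nodes NotReady?"
# Command:  kubectl get nodes --no-headers
# Expected: every line shows a STATUS beginning with "Ready"
# Decision: none NotReady -> continue; otherwise escalate as a platform incident.

# Print the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# Run the step and state the decision explicitly; returns 1 on failure.
step_check_nodes() {
  step_bad_nodes=$(not_ready_nodes)
  if [ -z "$step_bad_nodes" ]; then
    echo "PASS: all nodes Ready; continue to the next step"
  else
    echo "FAIL: NotReady nodes: $(echo "$step_bad_nodes" | tr '\n' ' ')"
    echo "DECISION: treat as a platform incident; escalate per your on-call policy"
    return 1
  fi
}

# Usage:
#   step_check_nodes || echo "stop here and escalate"
```

Printing the decision alongside the result means the responder does not have to interpret raw output under stress.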

Postmortems are not documents. They are how you update doctrine.

kubectl

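# Most recent 50 events across all namespaces (object-level narrative)
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
# Node and pod resource usage (requires metrics-server)
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head
# Raw API server metrics in Prometheus text format
kubectl get --raw /metrics | head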

Common failure modes (and how to avoid them)

The Order sees the same patterns repeat across organizations.

  • No control plane telemetry → platform incidents misdiagnosed.
  • Alert storms without ownership → alerts ignored during real events.
  • Missing change correlation → ‘nothing changed’ becomes the default lie.
  • Metrics without SLOs → data without decisions.
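For the missing change correlation pattern, one rough source of evidence is ReplicaSet creation time, since a Deployment creates a new ReplicaSet whenever its pod template changes. The sketch below assumes a configured kubectl; `recent_changes` is a hypothetical helper, and it only sees Deployment rollouts, not ConfigMap, Secret, or node-level changes.

```shell
#!/usr/bin/env bash
# Sketch only: recent_changes is a hypothetical helper around kubectl.

# Print the N most recently created ReplicaSets across all namespaces
# (default 20), oldest first, as "created namespace name".
recent_changes() {
  recent_limit="${1:-20}"
  kubectl get replicasets -A \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns='CREATED:.metadata.creationTimestamp,NAMESPACE:.metadata.namespace,REPLICASET:.metadata.name' \
    --no-headers | tail -n "$recent_limit"
}

# Usage: compare these timestamps with the incident start time.
#   recent_changes 10
```

If a timestamp falls shortly before the incident began, `kubectl rollout history` on the owning deployment shows which revision was introduced.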