Kubblai Doctrine: Cluster Discipline and Operational Safety

Operational safety is not a mood. It is a set of constraints and practices that keep change survivable and failure contained.

Authored as doctrine; evaluated as systems craft.

The doctrine

The Order is severe about one thing: ungoverned change is the root of most outages.

Kubernetes gives you control loops. It does not give you discipline. Discipline is the layer you build: policy, review, rollout safety, and incident procedure.

  • Change must be reversible.
  • Blast radius must be bounded.
  • Evidence precedes action.

Reversible change is a design requirement

Rollback is not a button. It is an architecture property: statelessness where possible, safe migrations where not, and clear boundaries for what can be reversed and what cannot.

Operators who cannot roll back are forced into improvisation. Improvisation under pressure is where mythology becomes incident.

  • Define rollback boundaries for every workload class.
  • Treat stateful changes as planned incidents with explicit risk budgets.
  • Prefer progressive delivery tied to measurable signals.
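A rollback boundary is easier to honor when the rollback target is written down before the change. The sketch below is one way to do that for a Deployment. The function names `current_revision` and `record_cause` are hypothetical helpers, not part of kubectl. It assumes the workload is a Deployment, whose controller maintains the `deployment.kubernetes.io/revision` annotation, and it uses the `kubernetes.io/change-cause` annotation that `kubectl rollout history` displays.

```shell
# Hypothetical helpers; names and argument order are illustrative assumptions.

# Print the revision currently deployed. Run this BEFORE the change:
# it is the revision you would roll back to.
# Usage: current_revision <deployment> <namespace>
current_revision() {
  kubectl get deploy "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'
}

# Record why the change was made. Run this AFTER applying the change,
# so the reason is attached to the new revision in `kubectl rollout history`.
# Usage: record_cause <deployment> <namespace> <reason>
record_cause() {
  kubectl annotate deploy "$1" -n "$2" --overwrite \
    "kubernetes.io/change-cause=$3"
}
```

A typical sequence would be `current_revision web shop`, then the change itself, then `record_cause web shop "bump image"`. This covers only the reversible part of a change: a schema migration or other stateful step still needs its own rollback plan.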

Blast radius is controlled by structure

Most organizations learn blast radius by accident. The Order learns it by design: namespaces, network policies, quotas, and ownership boundaries that limit failure propagation.

Multi-tenancy is not ‘just labels’. A shared cluster is a governance system.

  • Use namespaces as administrative units with explicit owners.
  • Apply default-deny network policies where feasible, and roll them out with observability in place so that blocked traffic is visible.
  • Least privilege is reliability: reduce who can change what during incidents.
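The three practices above can be sketched as a single namespace setup step. This is a minimal illustration, not a recommended baseline. The helper names `fence_namespace` and `can_change`, the `owner` label key, and the quota values are all assumptions made for the example. The NetworkPolicy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```shell
# Hypothetical helper: give a namespace an owner, a quota, and default-deny ingress.
# Usage: fence_namespace <namespace> <owning-team>
fence_namespace() {
  # Ownership: the "owner" label key is an assumed convention.
  kubectl label namespace "$1" "owner=$2" --overwrite || return 1
  # Quota: values are placeholders. This call fails if the quota already exists.
  kubectl create quota "$1-quota" -n "$1" --hard=pods=20,cpu=4,memory=8Gi || return 1
  # Default-deny ingress: selects every pod and allows no inbound traffic.
  kubectl apply -n "$1" -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}

# Hypothetical helper: check whether a user may update Deployments in a namespace.
# Usage: can_change <user> <namespace>   (prints "yes" or "no")
can_change() {
  kubectl auth can-i update deployments -n "$2" --as "$1"
}
```

For example, `fence_namespace shop team-a` followed by `can_change alice shop`. Applying default-deny to a namespace that already carries traffic will cut that traffic, which is why the rollout should be staged and observed.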

Rollout discipline: make change boring

Safe change needs to be slow only when the risk demands it. The goal is controlled change: canaries, staged promotion, health gates, and rapid rollback when signals fail.

The Order’s posture: change should be frequent enough to be routine and governed enough to be survivable.

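```shell
# Watch the rollout until it completes or fails.
kubectl rollout status deploy/<name> -n <ns>
# List recorded revisions to choose a rollback target.
kubectl rollout history deploy/<name> -n <ns>
# Roll back to a specific revision.
kubectl rollout undo deploy/<name> -n <ns> --to-revision=<n>
```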
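The same commands can be combined into a simple health gate. The sketch below waits for a rollout and undoes it if it does not become healthy in time. `gated_rollout` is a hypothetical helper and the 120s default is an arbitrary assumption. The only signal it uses is what `kubectl rollout status` reports, which reflects pod readiness rather than error rates or latency.

```shell
# Hypothetical helper: wait for a rollout, and undo it if the gate fails.
# Usage: gated_rollout <deployment> <namespace> [timeout, default 120s]
gated_rollout() {
  if kubectl rollout status "deploy/$1" -n "$2" --timeout="${3:-120s}"; then
    echo "rollout healthy: $2/$1"
  else
    echo "rollout failed its gate, undoing: $2/$1" >&2
    # Without --to-revision, undo returns to the previous revision.
    kubectl rollout undo "deploy/$1" -n "$2"
    return 1
  fi
}
```

Run it immediately after applying a change, for example `gated_rollout web shop 180s`. A non-zero exit means the rollout was reverted, so a pipeline can stop promotion at that point.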

Incidents are procedure, not performance

Incident response is where discipline becomes visible. The first move is not ‘fix’. The first move is scope and evidence: what changed, what is broken, what is the blast radius.

The Order teaches calm because calm preserves options.

  • Stabilize the control plane before introducing further change.
  • Prefer read-only diagnosis until a clear intervention emerges.
  • Postmortems must change the system: guardrails, alerts, runbooks.
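Read-only diagnosis can be made a habit by scripting it. The sketch below gathers scope and evidence without changing anything, starting with the control plane. `triage` is a hypothetical helper, and the choice of commands is an assumption about what a first look should cover. Every call is a `kubectl get`.

```shell
# Hypothetical helper: read-only first look at an incident.
# Usage: triage <namespace>
triage() {
  # Control plane first: API server readiness checks.
  kubectl get --raw='/readyz?verbose'
  # Node status across the cluster.
  kubectl get nodes -o wide
  # What is running in the affected namespace, and where.
  kubectl get deploy,rs,pods -n "$1" -o wide
  # Recent events in the namespace, oldest first.
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}
```

Saving the output, for example `triage shop > triage.txt`, gives the postmortem a record of what the cluster looked like before anyone intervened.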