Advanced Disciplines
The Ritual of Safe Cluster Upgrades
Upgrades are not events. They are a governance loop: preflight, stage, validate, and preserve reversibility under pressure.
Authored as doctrine; evaluated as systems craft.
Doctrine
A cluster upgrade is not a checkbox; it is a multi-system negotiation under constrained time. Your enemy is not the new version. Your enemy is unexamined coupling: webhooks, CNIs, CSIs, policy, and workloads that depend on undefined behavior.
Kubblai doctrine is restraint: you upgrade when you can prove you can stop. You ship a rollback plan you have rehearsed, not a hope you have written.
- Treat version skew as law: control plane, kubelet, kubectl, and addons.
- Define exit criteria before you begin: SLO impact, error budget, and validation steps.
- Reduce change surface: freeze policy churn and unrelated deploys during the window.
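The skew rule in the first bullet can be checked mechanically before the window opens. This is a minimal sketch, assuming bash. The names `minor_of`, `skew_ok`, `check_nodes`, and `MAX_SKEW` are hypothetical, and the default limit of 2 minor versions is an assumption that should be confirmed against the Kubernetes version skew policy for your release.

```shell
#!/usr/bin/env bash
# Extract the minor number from a version string such as "v1.29.3".
minor_of() {
  local v="${1#v}"   # drop a leading "v" if present
  v="${v#*.}"        # drop the major component
  echo "${v%%.*}"    # keep only the minor component
}

# Succeeds when the kubelet is not newer than the API server and is at most
# MAX_SKEW minor versions behind it. The default of 2 is an assumption.
skew_ok() {
  local apiserver="$1" kubelet="$2" max="${3:-${MAX_SKEW:-2}}"
  local diff=$(( $(minor_of "$apiserver") - $(minor_of "$kubelet") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le "$max" ]
}

# Reads "node-name kubelet-version" pairs on stdin and reports every
# node that violates the skew rule. Returns non-zero if any node fails.
check_nodes() {
  local apiserver="$1" name version rc=0
  while read -r name version; do
    if ! skew_ok "$apiserver" "$version"; then
      echo "SKEW VIOLATION: $name runs $version against API server $apiserver"
      rc=1
    fi
  done
  return "$rc"
}
```

One way to feed it is to pipe `kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'` into `check_nodes` with the target control plane version as the argument. Checking against the target version, not the current one, is the point: it shows which nodes would fall out of policy after the control plane moves.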
Preflight: map the blast radius
Before you touch versions, inventory the moving parts that will break first: admission webhooks, Pod Security posture (PodSecurityPolicy was removed in v1.25; Pod Security Admission replaces it), CNI/CSI, service mesh, ingress controller, autoscalers, and any controller that writes aggressively to the API.
If you cannot name what will fail, you have not earned the right to proceed.
- Collect addon versions, compatibility notes, and rollback paths.
- Confirm your etcd backup posture and restore practice.
- Verify you can bypass admission for emergency recovery (with explicit compensations).
kubectl
shell
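# Client and server versions (the --short flag was removed in kubectl 1.28)
kubectl version
# Node versions, OS images, and container runtimes
kubectl get nodes -o wide
# Admission webhooks that sit in the write path
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Pods that are already unhealthy before the upgrade starts
kubectl get pods -A | grep -E "CrashLoopBackOff|Pending|ImagePullBackOff" || true
Stage the upgrade like a control loop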
A safe upgrade is staged: small scope, verify signals, expand scope. You do not “finish faster” by upgrading everything at once; you only reduce your ability to attribute causes.
Use canary pools, surge capacity, and measured drain. If you have stateful workloads, draining is not a button; it is a contract with storage and disruption budgets.
- One node pool at a time; one failure domain at a time.
- Observe admission latency and API server saturation during drains.
- Confirm PDB behavior and eviction rates under real load.
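The staging loop above can be sketched as a function that drains one pool node by node and refuses to expand scope when a drain or a validation gate fails. This is an illustration under assumptions, not a complete upgrade tool: `drain_pool` is a hypothetical name, the label selector and the gate command are supplied by the caller, and the 10 minute timeout is an arbitrary placeholder. Note that `--delete-emptydir-data` discards emptyDir contents on the drained node, so omit it if that data matters.

```shell
#!/usr/bin/env bash
# Drain every node matching a label selector, one at a time.
# Usage: drain_pool <label-selector> <gate-command> [gate-args...]
# The gate command must succeed after each node before the loop continues.
drain_pool() {
  local selector="$1"; shift
  [ "$#" -gt 0 ] || set -- true   # no gate supplied: always pass
  local node
  for node in $(kubectl get nodes -l "$selector" \
      -o jsonpath='{.items[*].metadata.name}'); do
    echo "draining $node"
    kubectl cordon "$node" || return 1
    # Drain evicts through the eviction API, so PodDisruptionBudgets apply;
    # the timeout turns a blocked eviction into a visible failure.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
      --timeout=10m || { echo "drain blocked on $node; stopping"; return 1; }
    # Verify signals before touching the next node.
    "$@" || { echo "gate failed after $node; stopping"; return 1; }
  done
}
```

A call such as `drain_pool pool=canary ./check-slo.sh` (both the label and the script are hypothetical) keeps the scope to one pool and makes the stop condition explicit. Stopping on the first failure preserves attribution: the last node drained is the first place to look.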
Admission and policy: the silent outage vector
Many upgrades fail because admission is treated as decoration. Webhooks add latency to every write they match and are tightly coupled to API behavior, TLS, and network policy. During an upgrade, admission becomes the gate of truth, and the gate can jam.
Upgrade doctrine includes budgets: maximum admission latency, maximum webhook error rate, and an emergency bypass procedure that is logged and reviewed.
- Budget admission latency; instrument it and alert on it.
- Treat webhook failurePolicy choices as incident posture decisions.
- Stage policy changes separately from version changes.
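To make the budgets above concrete, it helps to see each webhook's failure posture and the latency the API server already records. The sketch below assumes bash and a user with permission to read webhook configurations and the API server's `/metrics` endpoint; `webhook_posture` and `webhook_latency_metrics` are hypothetical names.

```shell
#!/usr/bin/env bash
# Print failurePolicy and timeoutSeconds for every admission webhook.
# A webhook with failurePolicy "Fail" blocks matching writes when it is down.
webhook_posture() {
  local kind
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$kind" -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT_S:.webhooks[*].timeoutSeconds'
  done
}

# Dump the API server's webhook latency histogram series, the raw input
# for an admission latency budget.
webhook_latency_metrics() {
  kubectl get --raw /metrics \
    | grep '^apiserver_admission_webhook_admission_duration_seconds'
}
```

Capturing both outputs before the window gives a baseline, so a latency regression during drains can be attributed to the upgrade and not to a webhook that was already slow.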
Rollback is not “undo”
Rollback is a system design problem. Some changes are not reversible: API removals, etcd migrations, and addon state transitions. Your rollback plan must name what you can revert and what you must repair forward.
Kubblai doctrine: if the rollback plan is “restore from backup,” it must be rehearsed with timing and data loss assumptions written explicitly.
- Write a rollback matrix: component → reversible? → how → time cost.
- Define a stop-loss threshold (SLO breach, error budget burn) that triggers freeze/rollback.
- Keep an operator timeline: decisions, observations, and why.
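The stop-loss threshold and the operator timeline are small enough to script. The sketch below assumes bash and awk. The names `stop_loss`, `log_decision`, and `TIMELINE` are hypothetical, and the burn figure is assumed to come from your own SLO tooling as a fraction of the error budget consumed during the window.

```shell
#!/usr/bin/env bash
# Compare error budget burned during the window against a threshold agreed
# before the upgrade. Prints STOP at or above the threshold, else CONTINUE.
stop_loss() {
  local burned="$1" threshold="$2"
  # awk handles the floating point comparison that plain sh cannot.
  if awk -v b="$burned" -v t="$threshold" 'BEGIN { exit !(b >= t) }'; then
    echo "STOP"
  else
    echo "CONTINUE"
  fi
}

# Append a UTC-timestamped entry to the operator timeline.
log_decision() {
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
    >> "${TIMELINE:-upgrade-timeline.log}"
}
```

For example, `stop_loss 0.30 0.25` prints `STOP`, and `log_decision "froze pool-b: budget burn 0.30 >= 0.25"` records why. Writing the threshold down as an argument removes the temptation to renegotiate it mid-incident.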
Validation: what you must prove
Validation is not “pods are Running.” Validate the paths that actually serve traffic: ingress, DNS, service discovery, identity, and storage. Validate controllers you rely on for safety: HPA, disruption controllers, admission, policy.
The Order measures upgrades by the absence of surprise, not by the speed of completion.
- Probe service-to-service connectivity across namespaces and nodes.
- Validate storage operations: attach/detach, reschedule, and recovery.
- Validate observability: events, audit logs, metrics pipelines.
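A few of these proofs can be scripted as probes that fail loudly. The sketch below assumes bash, a cluster that can pull `busybox:1.36`, and that `kubectl run` with `--rm -i` returns the probe container's exit status. `probe`, `validate_paths`, `PROBE_IMAGE`, and `PROBE_NS` are hypothetical names, and the service URL is supplied by the caller. It covers only API readiness, DNS, and one service path; storage and observability checks need their own probes.

```shell
#!/usr/bin/env bash
# Run a command in a throwaway pod and return its exit status.
# Set PROBE_NS to probe from a different namespace.
probe() {
  local name="$1"; shift
  kubectl run "$name" -n "${PROBE_NS:-default}" --rm -i --restart=Never \
    --image="${PROBE_IMAGE:-busybox:1.36}" -- "$@"
}

# Check API server readiness, cluster DNS, and one service URL.
# Prints one FAIL line per broken path and returns non-zero if any failed.
validate_paths() {
  local url="$1" failed=0
  kubectl get --raw='/readyz?verbose' >/dev/null \
    || { echo "FAIL apiserver readyz"; failed=1; }
  probe dns-probe nslookup kubernetes.default.svc.cluster.local >/dev/null \
    || { echo "FAIL cluster DNS"; failed=1; }
  probe svc-probe wget -q -T 5 -O /dev/null "$url" \
    || { echo "FAIL service $url"; failed=1; }
  return "$failed"
}
```

Running `validate_paths` from more than one namespace, and after each pool is upgraded, exercises the cross-namespace and cross-node paths that "pods are Running" never touches.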
Canonical Link
Canonical URL: /library/the-ritual-of-safe-cluster-upgrades