Advanced Disciplines

The Ritual of Safe Cluster Upgrades

Upgrades are not events. They are a governance loop: preflight, stage, validate, and preserve reversibility under pressure.

Authored as doctrine; evaluated as systems craft.

Doctrine

A cluster upgrade is not a checkbox; it is a multi-system negotiation under constrained time. Your enemy is not the new version. Your enemy is unexamined coupling: webhooks, CNIs, CSIs, policy, and workloads that depend on undefined behavior.

Kubblai doctrine is restraint: you upgrade when you can prove you can stop. You ship a rollback plan you have rehearsed, not a hope you have written.

Treat version skew as law: control plane, kubelet, kubectl, and addons each have a supported skew, and a kubelet must never be newer than the API server it talks to.
Define exit criteria before you begin: SLO impact, error budget, and validation steps.
Reduce change surface: freeze policy churn and unrelated deploys during the window.
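
The skew law can be checked mechanically before the window opens. The sketch below is a minimal illustration, not part of the doctrine itself: `minor_of` and `kubelet_skew_ok` are hypothetical helper names, and the default limit of three minor versions is an assumption based on the upstream kubelet skew policy for recent releases (older releases allowed two). Confirm the limit for the versions you run.

```shell
# Extract the minor version from a string such as "v1.29.3" (prints 29).
minor_of() {
  v="${1#v}"        # drop the leading "v"
  v="${v#*.}"       # drop the major version and its dot
  echo "${v%%.*}"   # keep everything before the next dot
}

# Succeeds when the kubelet is not newer than the API server and is at most
# max_skew minor versions behind it. max_skew defaults to 3 (an assumption;
# check the skew policy for your release).
kubelet_skew_ok() {
  api="$(minor_of "$1")"
  kubelet="$(minor_of "$2")"
  max_skew="${3:-3}"
  diff=$(( api - kubelet ))
  [ "$diff" -ge 0 ] && [ "$diff" -le "$max_skew" ]
}

# Example wiring (needs cluster access, so it is left commented out):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
```

Feed each node's kubelet version and the target control plane version to `kubelet_skew_ok`; any node that fails the check must be upgraded or replaced before the control plane moves.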

Preflight: map the blast radius

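Before you touch versions, inventory the moving parts that will break first: admission webhooks, PSP/PSA posture (PodSecurityPolicy is gone as of v1.25, so any remaining PSP dependence must move to Pod Security Admission or another policy engine before you cross that version), CNI/CSI, service mesh, ingress controller, autoscalers, and any controller that writes aggressively to the API.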

If you cannot name what will fail, you have not earned the right to proceed.

Collect addon versions, compatibility notes, and rollback paths.
Confirm your etcd backup posture and restore practice.
Verify you can bypass admission for emergency recovery (with explicit compensations).

kubectl

shell

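# Client and server versions (--short was removed from recent kubectl releases; the default output is already concise).
kubectl version
# Node versions, OS images, and container runtimes.
kubectl get nodes -o wide
# Admission webhooks that sit in the write path.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Pods that are already unhealthy before the upgrade starts.
kubectl get pods -A | grep -E "CrashLoopBackOff|Pending|ImagePullBackOff" || true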
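
Beyond the inventory commands above, one way to name what will fail is to ask the API server which deprecated APIs clients are still calling. The sketch below parses the `apiserver_requested_deprecated_apis` metric from Prometheus-format text on stdin. The helper names `label` and `deprecated_in_use` are hypothetical, and the metric only reflects requests seen by the API server instance you query since it started, so treat an empty result as weak evidence rather than proof.

```shell
# Print the value of one label from a Prometheus-format metric line.
# The [{,] anchor keeps "resource" from matching inside "subresource".
label() {
  printf '%s\n' "$2" | sed -n -E "s/.*[{,]$1=\"([^\"]*)\".*/\1/p"
}

# Read metrics text on stdin and print "group version resource" for every
# deprecated API that was requested and is removed in the target release.
deprecated_in_use() {
  target="$1"   # for example "1.25"
  grep '^apiserver_requested_deprecated_apis{' | while IFS= read -r line; do
    [ "$(label removed_release "$line")" = "$target" ] || continue
    echo "$(label group "$line") $(label version "$line") $(label resource "$line")"
  done
}

# Example wiring (needs cluster access and permission to read /metrics):
# kubectl get --raw /metrics | deprecated_in_use 1.25
```

Anything this prints is a client that will break when the target version removes the API; find the caller before the window, not during it.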

Stage the upgrade like a control loop

A safe upgrade is staged: small scope, verify signals, expand scope. You do not “finish faster” by upgrading everything at once; you only reduce your ability to attribute causes.

Use canary pools, surge capacity, and measured drain. If you have stateful workloads, draining is not a button; it is a contract with storage and disruption budgets.

One node pool at a time; one failure domain at a time.
Observe admission latency and API server saturation during drains.
Confirm PDB behavior and eviction rates under real load.
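
The staged loop described here can be sketched as a drain that refuses to expand scope when a gate fails. This is an illustration under stated assumptions: `gate_ok` and `drain_pool` are hypothetical names, the gate shown (no node reports NotReady) is a placeholder for your real SLO signals, the ten-minute drain timeout is arbitrary, and the node upgrade step itself is provider-specific and left as a comment.

```shell
# Placeholder gate: succeeds only when no node reports NotReady.
# Replace with the signals you defined as exit criteria.
gate_ok() {
  ! kubectl get nodes --no-headers | awk '{ print $2 }' | grep -q 'NotReady'
}

# Drain the nodes matching a label selector one at a time.
# Stops at the first failed drain or failed gate, leaving later nodes untouched.
drain_pool() {
  selector="$1"
  for node in $(kubectl get nodes -l "$selector" -o jsonpath='{.items[*].metadata.name}'); do
    kubectl cordon "$node" || return 1
    # Evictions respect PodDisruptionBudgets; the timeout bounds a blocked drain.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m || return 1
    # ... upgrade or replace the node here (provider-specific) ...
    if ! gate_ok; then
      echo "gate failed after $node; stopping" >&2
      return 1
    fi
    kubectl uncordon "$node"
  done
}
```

The design choice is that a failed gate leaves the node cordoned and the loop halted, so a human decides the next move with the cause still attributable to one node.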

Admission and policy: the silent outage vector

Many upgrades fail because admission is treated as decoration. Webhooks add latency to every write and are tightly coupled to API behavior, TLS, and network policy. During an upgrade, admission becomes the gate of truth—and the gate can jam.

Upgrade doctrine includes budgets: maximum admission latency, maximum webhook error rate, and an emergency bypass procedure that is logged and reviewed.

Budget admission latency; instrument it and alert on it.
Treat webhook failurePolicy choices as incident posture decisions.
Stage policy changes separately from version changes.
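
To treat failurePolicy as an incident posture decision, you first need to see it. The sketch below lists every webhook with its failurePolicy and timeout, then filters for the ones that fail closed. The helper names `list_webhooks` and `fail_closed` are hypothetical, and the output format is an illustration rather than a standard.

```shell
# Print one tab-separated line per webhook: name, failurePolicy, timeoutSeconds.
list_webhooks() {
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$kind" -o jsonpath='{range .items[*]}{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}{end}'
  done
}

# Read list_webhooks output on stdin and keep only webhooks that reject
# requests when the webhook backend is unreachable (failurePolicy: Fail).
fail_closed() {
  awk -F '\t' '$2 == "Fail" { print $1 " (timeout " $3 "s)" }'
}

# Example wiring (needs cluster access):
# list_webhooks | fail_closed
```

Every name in that output is a component whose outage blocks writes during the upgrade; each one deserves a named owner and a documented bypass.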

Rollback is not “undo”

Rollback is a system design problem. Some changes are not reversible: API removals, etcd migrations, and addon state transitions. Your rollback plan must name what you can revert and what you must repair forward.

Kubblai doctrine: if the rollback plan is “restore from backup,” it must be rehearsed with timing and data loss assumptions written explicitly.

Write a rollback matrix: component → reversible? → how → time cost.
Define a stop-loss threshold (SLO breach, error budget burn) that triggers freeze/rollback.
Keep an operator timeline: decisions, observations, and why.
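
A stop-loss threshold only works if it is a number someone can compute under pressure. The sketch below uses the common error budget burn rate, (errors / total) / (1 - SLO), as one possible trigger. The names `burn_rate` and `should_stop` are hypothetical, the error and request counts are assumed to come from your own metrics system over a window you choose, and the threshold is yours to set.

```shell
# Print the burn rate for a window: observed error ratio divided by the
# error ratio the SLO allows. Arguments: errors, total requests, SLO (0-1).
# Assumes total is nonzero and SLO is below 1.
burn_rate() {
  awk -v e="$1" -v t="$2" -v slo="$3" 'BEGIN { printf "%.2f\n", (e / t) / (1 - slo) }'
}

# Succeeds (exit 0) when the burn rate exceeds the stop-loss threshold.
# Arguments: errors, total requests, SLO, maximum tolerated burn rate.
should_stop() {
  awk -v b="$(burn_rate "$1" "$2" "$3")" -v max="$4" 'BEGIN { exit !(b > max) }'
}

# Example: 50 errors in 10000 requests against a 99.9% SLO, stop above 2x burn.
# if should_stop 50 10000 0.999 2; then echo "freeze: stop-loss breached"; fi
```

Record each evaluation in the operator timeline alongside the decision it produced, so the freeze or rollback call can be reviewed later against the numbers that drove it.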

Validation: what you must prove

Validation is not “pods are Running.” Validate the paths that actually serve traffic: ingress, DNS, service discovery, identity, and storage. Validate controllers you rely on for safety: HPA, disruption controllers, admission, policy.

The Order measures upgrades by the absence of surprise, not by the speed of completion.

Probe service-to-service connectivity across namespaces and nodes.
Validate storage operations: attach/detach, reschedule, and recovery.
Validate observability: events, audit logs, metrics pipelines.
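
Connectivity validation can be sketched as probes launched from inside the cluster, one per path you claim to serve. This is a minimal illustration with assumptions stated plainly: `probe_from` and `validate_paths` are hypothetical names, the `busybox:1.36` image and five-second timeout are arbitrary choices, the namespaces and URLs are placeholders you supply, and the probe relies on `kubectl run` returning the container's exit status for an attached pod with `--restart=Never`.

```shell
# Run a short-lived pod in the given namespace and fetch the URL from it.
# Succeeds only if the request succeeds from inside the cluster.
probe_from() {
  ns="$1"
  url="$2"
  kubectl run "probe-$$" -n "$ns" --rm -i --restart=Never \
    --image=busybox:1.36 -- wget -q -O /dev/null -T 5 "$url"
}

# Read "namespace url" pairs on stdin, probe each, and report per path.
# Succeeds only when every probe succeeds.
validate_paths() {
  failures=0
  while read -r ns url; do
    [ -z "$ns" ] && continue
    # </dev/null keeps the probe from consuming the remaining input lines.
    if probe_from "$ns" "$url" </dev/null; then
      echo "ok   $ns -> $url"
    else
      echo "FAIL $ns -> $url"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Example wiring (namespaces and service URLs are placeholders):
# printf 'team-a http://api.team-b.svc.cluster.local\n' | validate_paths
```

Because each probe resolves a service name and crosses namespaces, a single pass exercises DNS, service discovery, and network policy together; a failure tells you a path is broken, and the per-path output tells you which one.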

Canonical Link

Canonical URL: /library/the-ritual-of-safe-cluster-upgrades