Advanced Disciplines

Upgrade Windows, Rollback Reality, and the Myth of Zero Risk

Zero risk is not a promise; it is an unpriced liability. Upgrade windows exist to concentrate attention where systems are most fragile: the boundary between versions.

Authored as doctrine; evaluated as systems craft.

Doctrine

Upgrade windows are not bureaucracy. They are how an institution concentrates attention, reduces competing changes, and preserves the ability to recover when reality disagrees.

Kubblai doctrine: if you cannot allocate attention for a risky change, you cannot claim you can operate it.

  • Schedule attention, not just time.
  • Define stop-loss thresholds and rollback triggers in advance.
  • Reduce concurrency: fewer moving parts means better attribution.

Why rollback is harder than you remember

Rollback is not ‘deploy the previous manifest.’ It is a multi-system event that touches schema compatibility, client version skew, addon state, and operational side effects that persist after you revert versions.

Many teams discover too late that their rollback plan depended on a world that no longer exists.

  • API removals and behavior changes cannot be reversed by redeploying.
  • etcd and addon migrations can create one-way doors.
  • Workload state (DB migrations, caches) can outlive the rollback.
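
One way to make one-way doors visible before the window opens is to annotate each step of the plan with whether it can be reversed. The following is a minimal sketch, not a prescribed tool: the `Step` type, the helper names, and the example plan are all hypothetical, and whether a given step is truly reversible is a judgment your team must make for its own systems.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Step:
    """One step of an upgrade plan (hypothetical structure)."""
    name: str
    reversible: bool  # False marks a one-way door


def one_way_doors(steps: Sequence[Step]) -> List[str]:
    """Names of the steps that a rollback cannot undo."""
    return [s.name for s in steps if not s.reversible]


def redeploy_rollback_limit(steps: Sequence[Step]) -> int:
    """Number of leading steps that a plain redeploy can still undo.

    Once the first irreversible step has run, 'deploy the previous
    manifest' no longer returns you to the starting state.
    """
    for index, step in enumerate(steps):
        if not step.reversible:
            return index
    return len(steps)


# Illustrative plan only; the names and flags are assumptions.
PLAN = [
    Step("deploy new workload image", reversible=True),
    Step("update addon configuration", reversible=True),
    Step("migrate addon storage format", reversible=False),
    Step("apply workload DB migration", reversible=False),
]
```

With this plan, `redeploy_rollback_limit(PLAN)` marks the point after which the rollback runbook needs something stronger than a redeploy, such as a backup restore.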

The stop-loss contract

A stop-loss contract is a decision made while calm: the measurable conditions under which you stop, freeze, or roll back.

Without a stop-loss contract, your upgrade window becomes a panic window.

  • Tie stop-loss to SLO burn, error rate, and control-plane health.
  • Define who has authority to stop the rollout and how it is communicated.
  • Practice the ‘freeze’ move: stopping churn often restores truth, because signals become attributable again once nothing else is changing.
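
A stop-loss contract can be written down as data plus one decision function, so the thresholds are agreed while calm and only evaluated during the window. This is a sketch under stated assumptions: the threshold values are placeholders rather than recommendations, the `Signals` fields are hypothetical names for whatever your monitoring exposes, and treating an unhealthy control plane as a freeze (not a rollback) is one possible policy choice.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    FREEZE = "freeze"
    ROLL_BACK = "roll_back"


@dataclass(frozen=True)
class Signals:
    """Observed state during the rollout (hypothetical field names)."""
    slo_burn_rate: float         # multiple of the sustainable error-budget burn
    error_rate: float            # fraction of failed requests, 0.0 to 1.0
    control_plane_healthy: bool


@dataclass(frozen=True)
class StopLoss:
    """Thresholds agreed before the window; values are placeholders."""
    freeze_burn_rate: float = 2.0
    rollback_burn_rate: float = 10.0
    freeze_error_rate: float = 0.01
    rollback_error_rate: float = 0.05

    def decide(self, s: Signals) -> Action:
        # Rollback triggers are checked first so the stronger action wins.
        if (s.slo_burn_rate >= self.rollback_burn_rate
                or s.error_rate >= self.rollback_error_rate):
            return Action.ROLL_BACK
        # Assumed policy: an unhealthy control plane means stop making
        # changes, since a rollback is itself a change that depends on it.
        if (not s.control_plane_healthy
                or s.slo_burn_rate >= self.freeze_burn_rate
                or s.error_rate >= self.freeze_error_rate):
            return Action.FREEZE
        return Action.CONTINUE
```

The function decides nothing about authority or communication; those remain human parts of the contract. Its value is that nobody negotiates thresholds mid-incident.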

Rehearsal: the missing discipline

Most teams rehearse deploys; few rehearse rollback. But rollback is where your process is tested: credential access, runbooks, backup restore, and coordination.

The Order expects rehearsal in a representative environment, not a toy cluster.

  • Rehearse bypassing admission control and restoring it afterward.
  • Rehearse addon downgrade and recovery, or explicitly declare it unsupported.
  • Measure time-to-restore and document the assumptions.
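
Measuring time-to-restore is easier to keep honest when the rehearsal itself records it alongside the assumptions it ran under. Below is a minimal sketch of such a record; `RehearsalRecord` and its methods are hypothetical, and the steps passed to `run` stand in for whatever your runbook actually executes.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RehearsalRecord:
    """Timings and assumptions for one rollback rehearsal (hypothetical)."""
    assumptions: List[str]
    timings: Dict[str, float] = field(default_factory=dict)

    def run(self, name: str, step: Callable[[], None],
            clock: Callable[[], float] = time.monotonic) -> None:
        """Execute one runbook step and record how long it took, in seconds."""
        start = clock()
        step()
        self.timings[name] = clock() - start

    @property
    def time_to_restore(self) -> float:
        """Total seconds across all recorded steps, assuming they ran serially."""
        return sum(self.timings.values())

    def report(self) -> str:
        lines = [f"time-to-restore: {self.time_to_restore:.1f}s"]
        lines += [f"  {name}: {secs:.1f}s" for name, secs in self.timings.items()]
        lines += [f"  assumes: {a}" for a in self.assumptions]
        return "\n".join(lines)
```

The `clock` parameter exists so the record can be exercised with a fake clock; in a real rehearsal the default monotonic clock is used and each `step` is a callable that performs the restore action.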

Windows as governance

An upgrade window is governance: it defines what changes are permitted, who is on call, and which systems are frozen.

Done well, windows reduce incidents. Done poorly, windows become ritual without control.

  • Freeze non-essential deploys during the window.
  • Staff with operators who can read control-plane signals.
  • Publish a post-window report: what changed, what was observed, what was learned.
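
The freeze rule above can be enforced mechanically as a gate in front of deploys. The sketch below assumes a hypothetical `UpgradeWindow` type and change identifiers; how a real pipeline would consult it, and what counts as an essential change, are left to the institution.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet


@dataclass(frozen=True)
class UpgradeWindow:
    """A declared window and the changes permitted inside it (hypothetical)."""
    start: datetime              # timezone-aware
    end: datetime                # timezone-aware, exclusive
    allowed_changes: FrozenSet[str]

    def permits(self, change_id: str, at: datetime) -> bool:
        """Outside the window anything may deploy; inside, only listed changes."""
        in_window = self.start <= at < self.end
        return (not in_window) or change_id in self.allowed_changes


# Illustrative values only.
WINDOW = UpgradeWindow(
    start=datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc),
    allowed_changes=frozenset({"cluster-upgrade"}),
)
```

A gate like this only controls changes that pass through it. Ritual without control begins where changes can route around the gate.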