Advanced Disciplines

The Dark Arts of Rollout Safety

Safe rollouts are engineered: explicit health signals, bounded blast radius, and stop-loss thresholds tied to SLOs, not optimism tied to dashboards.

Authored as doctrine; evaluated as systems craft.

Doctrine

Rollout safety is a system of constraints: probes, budgets, and staged exposure. The goal is not ‘zero downtime.’ The goal is controlled degradation and fast recovery when reality disagrees.

Kubblai doctrine: rollouts must be reversible, observable, and boring.

Define health signals you trust (and instrument them).
Bound blast radius with staged rollouts and traffic control.
Write stop-loss thresholds before you begin.

Probes as contracts

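Probes are not checkboxes. They are contracts: what does it mean for a workload to be ready, and how fast should it fail? Bad probes turn deploys into incidents: a readiness probe that passes too early routes traffic to instances that cannot serve it yet, and a liveness probe that fires too eagerly restarts instances that were only slow.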

Readiness should represent serving ability. Liveness should represent irrecoverable deadlock, not transient slowness.

Avoid liveness probes that kill slow startups under load; give slow starters a startup probe or a longer initial delay.
Keep readiness fast and meaningful; fail closed on dependency loss only when justified.
Instrument probe failures and correlate with deploys.
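
The readiness and liveness contracts above can be sketched as two separate predicates. This is a minimal illustration, not a prescribed implementation: `WorkloadState`, its fields, the `fail_closed_on_deps` flag, and the 60-second deadlock window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkloadState:
    """Hypothetical snapshot of a process's internal state."""
    started: bool                      # initialization has finished
    dependencies_ok: bool              # e.g. the database is reachable
    last_heartbeat: float              # time (seconds) the main loop last made progress
    fail_closed_on_deps: bool = False  # opt in only when serving without deps is harmful


def is_ready(state: WorkloadState) -> bool:
    """Readiness: can this instance serve traffic right now?"""
    if not state.started:
        return False
    # Fail closed on dependency loss only when explicitly justified.
    if state.fail_closed_on_deps and not state.dependencies_ok:
        return False
    return True


def is_live(state: WorkloadState, now: float, deadlock_after_s: float = 60.0) -> bool:
    """Liveness: report False only when the process has stopped making progress."""
    if not state.started:
        # Assumption: a slow startup is not a deadlock; a startup probe
        # or grace period covers this phase instead.
        return True
    return (now - state.last_heartbeat) < deadlock_after_s
```

Under these assumptions, an instance that loses a dependency stays live and, unless it opted into failing closed, stays ready. Only a stalled heartbeat makes it a restart candidate.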

Surge math and capacity reality

maxSurge and maxUnavailable are capacity decisions. If you surge without headroom, the extra pods cannot be scheduled and pile up in Pending. If you allow too much unavailability, you violate SLOs.

Surge math must consider cluster fragmentation and node pool constraints.

Validate headroom before large rollouts; use canary pools where needed.
Prefer small canaries with tight verification windows.
Avoid global rollouts during control-plane stress or upgrade windows.
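
The surge arithmetic can be worked through in a few lines. The sketch below assumes percentage values and follows the Kubernetes Deployment rounding convention (maxSurge rounds up, maxUnavailable rounds down). The function names, the CPU-only fit check, and the example numbers are illustrative assumptions; a real scheduler also weighs memory, affinity, and taints.

```python
from typing import Dict, Sequence


def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> Dict[str, int]:
    """Translate percentage settings into pod counts for a rolling update."""
    # Integer arithmetic avoids float rounding: ceil for surge, floor for unavailable.
    surge = -(-replicas * max_surge_pct // 100)
    unavailable = replicas * max_unavailable_pct // 100
    return {
        "surge_pods": surge,
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


def surge_fits(surge_pods: int, pod_cpu_millicores: int,
               free_cpu_per_node: Sequence[int]) -> bool:
    """Check headroom node by node, so fragmented free capacity is not overcounted."""
    slots = sum(free // pod_cpu_millicores for free in free_cpu_per_node)
    return slots >= surge_pods


bounds = rollout_bounds(replicas=10, max_surge_pct=25, max_unavailable_pct=25)
# Three nodes with 300m free each hold 900m in total, yet no single node
# can host a 500m pod, so the surge pods would sit in Pending.
fragmented_ok = surge_fits(bounds["surge_pods"], 500, [300, 300, 300])
```

With 10 replicas at 25% and 25%, the rollout may run up to 13 pods and must keep at least 8 available. The fragmented example shows why summing free capacity across the cluster is not a sufficient headroom check.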

Traffic shaping and progressive delivery

If you serve traffic, you need traffic control: weighted routing, request mirroring, and staged exposure. Without it, deploys are binary events with binary failure.

Progressive delivery tools help, but doctrine matters more than tooling.

Start with a small percentage; validate error rates and latency, not only readiness.
Use automatic rollback tied to SLO burn where possible.
Treat dependency changes (DB migrations) as separate, gated steps.
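
Staged exposure with a gate at each step might look like the following sketch. It is tool-agnostic by design: `observe` is a hypothetical callback that would set the traffic weight, wait out the verification window, and return measured metrics. The stage weights and limits used with it are example values, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class StageMetrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency during the window


@dataclass
class RolloutResult:
    outcome: str              # "promoted" or "rolled_back"
    failed_at: Optional[int]  # weight (percent) at which a gate failed
    last_good: int            # highest weight that passed its gate


def run_progressive_rollout(weights: Sequence[int],
                            observe: Callable[[int], StageMetrics],
                            max_error_rate: float,
                            max_p99_ms: float) -> RolloutResult:
    """Advance through traffic weights, stopping at the first failed gate."""
    last_good = 0
    for weight in weights:
        metrics = observe(weight)  # hypothetical: shift traffic, wait, measure
        # Gate on error rate and latency, not only on readiness.
        if metrics.error_rate > max_error_rate or metrics.p99_latency_ms > max_p99_ms:
            return RolloutResult("rolled_back", failed_at=weight, last_good=last_good)
        last_good = weight
    return RolloutResult("promoted", failed_at=None, last_good=last_good)
```

Calling it with weights such as `[1, 5, 25, 100]` keeps the first exposure small. A regression that only appears under real load then fails the gate at a stage where most traffic is still on the old version.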

Stop-loss thresholds

A stop-loss threshold is a decision you make while calm: if error rate exceeds X or latency exceeds Y, you roll back or freeze. It prevents panic-driven improvisation.

Kubblai doctrine: if you cannot define stop-loss thresholds, you are not ready to ship the change.

Define thresholds per critical endpoint; use error budget burn rates.
Automate rollback where reliable; otherwise codify operator procedure.
Record the decision; feed it into future rollout playbooks.
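
A stop-loss rule written down ahead of time can be as small as the sketch below. It uses the common definition of burn rate (observed error rate divided by the error budget, where the budget is one minus the SLO target). The `StopLoss` fields, the three action names, and the choice to freeze when there is no traffic to judge are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopLoss:
    """Thresholds for one critical endpoint, decided before the rollout starts."""
    slo_target: float          # e.g. 0.999 for a 99.9% availability SLO
    rollback_burn_rate: float  # burn rate at or above which we roll back
    freeze_burn_rate: float    # burn rate at or above which we stop advancing
    max_p99_ms: float          # latency ceiling that also triggers rollback


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def decide(rule: StopLoss, errors: int, requests: int, p99_ms: float) -> str:
    """Return 'rollback', 'freeze', or 'continue' for one observation window."""
    if requests == 0:
        # Assumption: with no signal we cannot verify health, so do not advance.
        return "freeze"
    rate = burn_rate(errors, requests, rule.slo_target)
    if rate >= rule.rollback_burn_rate or p99_ms > rule.max_p99_ms:
        return "rollback"
    if rate >= rule.freeze_burn_rate:
        return "freeze"
    return "continue"
```

For example, with a 99.9% target, 20 errors in 1,000 requests is a burn rate of about 20. A rule with `rollback_burn_rate=10.0` would roll that back without anyone having to argue about it mid-incident.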

Canonical Link

Canonical URL: /library/the-dark-arts-of-rollout-safety