Advanced Disciplines
The Dark Arts of Rollout Safety
Safe rollouts are engineered: explicit health signals, bounded blast radius, and stop-loss thresholds tied to SLOs, not optimism tied to dashboards.
Authored as doctrine; evaluated as systems craft.
Doctrine
Rollout safety is a system of constraints: probes, budgets, and staged exposure. The goal is not ‘zero downtime.’ The goal is controlled degradation and fast recovery when reality disagrees.
Kubblai doctrine: rollouts must be reversible, observable, and boring.
- Define health signals you trust (and instrument them).
- Bound blast radius with staged rollouts and traffic control.
- Write stop-loss thresholds before you begin.
Probes as contracts
Probes are not checkboxes. They are contracts: what does it mean for a workload to be ready, and how fast should it fail? Bad probes turn deploys into latency incidents.
Readiness should represent serving ability. Liveness should represent irrecoverable deadlock, not transient slowness.
- Avoid liveness probes that kill slow startups under load.
- Keep readiness fast and meaningful; fail closed on dependency loss only when justified.
- Instrument probe failures and correlate with deploys.
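The readiness and liveness distinction above can be sketched as two separate predicates over in-process state. This is a minimal illustration, not a prescribed implementation: `HealthState`, `heartbeat`, `is_ready`, `is_live`, and the 60-second deadlock window are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealthState:
    """Hypothetical in-process state that the probe handlers read."""
    warmed_up: bool = False         # caches loaded, listeners bound
    draining: bool = False          # shutdown requested; stop taking new traffic
    last_progress: float = field(default_factory=time.monotonic)
    deadlock_after_s: float = 60.0  # assumed value; tune per workload

    def heartbeat(self) -> None:
        """Called by the main work loop each time it makes progress."""
        self.last_progress = time.monotonic()


def is_ready(state: HealthState) -> bool:
    """Readiness: can this instance serve a request right now?"""
    return state.warmed_up and not state.draining


def is_live(state: HealthState, now: Optional[float] = None) -> bool:
    """Liveness: fail only after a long stretch with no progress at all.

    Transient slowness does not trip this check, because the work loop
    still calls heartbeat() while it is slow.
    """
    current = time.monotonic() if now is None else now
    return (current - state.last_progress) < state.deadlock_after_s
```

A slow startup leaves `is_ready` false while `is_live` stays true, so the instance is kept out of rotation without being restarted. Only a stalled work loop fails liveness.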
Surge math and capacity reality
`maxSurge` and `maxUnavailable` are capacity decisions. If you surge without headroom, the extra pods sit in Pending and the rollout stalls. If you allow too much unavailability, you violate SLOs.
Surge math must consider cluster fragmentation and node pool constraints.
- Validate headroom before large rollouts; use canary pools where needed.
- Prefer small canaries with tight verification windows.
- Avoid global rollouts during control-plane stress or upgrade windows.
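The arithmetic behind that headroom check can be sketched as follows. Kubernetes resolves a percentage `maxSurge` by rounding up and a percentage `maxUnavailable` by rounding down; everything else here is an assumption made for illustration: the function names are hypothetical, and the fit check looks at CPU requests only, ignoring memory, affinity, taints, and other scheduling constraints.

```python
from typing import List, Tuple


def resolve_rollout_bounds(
    replicas: int, max_surge_pct: int, max_unavailable_pct: int
) -> Tuple[int, int, int]:
    """Turn percentage settings into pod counts.

    Returns (peak_pods, min_available_pods, surge_pods).
    """
    surge = (replicas * max_surge_pct + 99) // 100       # round up
    unavailable = (replicas * max_unavailable_pct) // 100  # round down
    return replicas + surge, replicas - unavailable, surge


def schedulable_pods(free_cpu_millis_per_node: List[int], pod_cpu_millis: int) -> int:
    """Count pods that fit node by node, not against the cluster-wide total.

    Summing free CPU across nodes overstates headroom when the free
    capacity is fragmented into slices smaller than one pod's request.
    """
    return sum(free // pod_cpu_millis for free in free_cpu_millis_per_node)


def has_surge_headroom(
    replicas: int,
    max_surge_pct: int,
    free_cpu_millis_per_node: List[int],
    pod_cpu_millis: int,
) -> bool:
    """True if every surge pod can be placed given per-node free CPU."""
    _, _, surge = resolve_rollout_bounds(replicas, max_surge_pct, 0)
    return schedulable_pods(free_cpu_millis_per_node, pod_cpu_millis) >= surge
```

With 10 replicas at 25% surge and 25% unavailability, the rollout may run up to 13 pods and must keep at least 8 available. Eight nodes with 500m free each total 4000m of spare CPU, yet none of them can place a pod that requests 1000m, so the three surge pods would stay Pending.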
Traffic shaping and progressive delivery
If you serve traffic, you need traffic control: weighted routing, request mirroring, and staged exposure. Without it, deploys are binary events with binary failure.
Progressive delivery tools help, but doctrine matters more than tooling.
- Start with a small percentage; validate error rates and latency, not only readiness.
- Use automatic rollback tied to SLO burn where possible.
- Treat dependency changes (DB migrations) as separate, gated steps.
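Staged exposure with a verification gate at each step can be sketched as a small loop. This is an illustration under stated assumptions: `set_weight` and `observe` are hypothetical callbacks standing in for whatever traffic router and metrics source you use, and the `Snapshot` and `Gate` types are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Metrics observed for the canary during one verification window."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


@dataclass(frozen=True)
class Gate:
    """Limits a stage must stay within before exposure increases."""
    max_error_rate: float
    max_p99_latency_ms: float

    def passes(self, snapshot: Snapshot) -> bool:
        return (
            snapshot.error_rate <= self.max_error_rate
            and snapshot.p99_latency_ms <= self.max_p99_latency_ms
        )


def run_staged_rollout(
    stages: Iterable[int],
    set_weight: Callable[[int], None],
    observe: Callable[[int], Snapshot],
    gate: Gate,
) -> Tuple[bool, int]:
    """Walk the canary through increasing traffic percentages.

    Returns (completed, last_weight_that_passed). On the first failed
    gate the canary weight is set back to 0 and the walk stops.
    """
    last_good = 0
    for weight in stages:
        set_weight(weight)
        if not gate.passes(observe(weight)):
            set_weight(0)  # send all traffic back to the stable version
            return False, last_good
        last_good = weight
    return True, last_good
```

The gate checks error rate and latency rather than readiness, so a canary that reports ready but serves slowly is still stopped before its share of traffic grows.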
Stop-loss thresholds
A stop-loss threshold is a decision you make while calm: if error rate exceeds X or latency exceeds Y, you roll back or freeze. It prevents panic-driven improvisation.
Kubblai doctrine: if you cannot define stop-loss thresholds, you are not ready to ship the change.
- Define thresholds per critical endpoint; use error budget burn rates.
- Automate rollback where reliable; otherwise codify operator procedure.
- Record the decision; feed it into future rollout playbooks.
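A burn-rate threshold can be written down as code before the rollout starts. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below is illustrative: the function names, the two-window check, and the default thresholds of 2 and 10 are assumptions to be replaced with values you choose per endpoint.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    slo_target is a fraction such as 0.999. Returns 0.0 when there
    was no traffic in the window.
    """
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def decide(
    short_window_burn: float,
    long_window_burn: float,
    freeze_at: float = 2.0,     # assumed threshold, not a recommendation
    rollback_at: float = 10.0,  # assumed threshold, not a recommendation
) -> str:
    """Map burn rates to a pre-agreed action.

    Roll back only when both windows agree the burn is severe, which
    filters out brief spikes. Freeze on a short-window burn alone.
    """
    if short_window_burn >= rollback_at and long_window_burn >= rollback_at:
        return "rollback"
    if short_window_burn >= freeze_at:
        return "freeze"
    return "continue"
```

For a 99.9% SLO, 50 failures in 10,000 requests is an error rate of 0.5% against a budget of 0.1%, a burn rate of about 5. With the assumed thresholds that freezes the rollout without rolling it back.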
Canonical Link
Canonical URL: /library/the-dark-arts-of-rollout-safety