Rites & Trials
Incident Doctrine for Platform Teams
Platform incidents are governance incidents. The doctrine must define authority, evidence, safe mitigations, and how memory becomes guardrail.
Text
Authored as doctrine; evaluated as systems craft.
Doctrine
In platform incidents, ambiguity kills. Multiple teams change the same system; multiple tools report conflicting truths. The doctrine must create a single spine: authority, evidence, and procedure.
Kubblai doctrine: incident command is a technical role. It exists to preserve optionality and reduce harm.
- Define incident roles: commander, operations, communications, scribe.
- Prefer reversible mitigations; require justification for irreversible actions.
- Write the timeline as you go; memory evaporates under stress.
Evidence standards
A platform team must refuse folklore. Every major decision should reference a signal: metrics, logs, traces, events, or audit evidence.
If you cannot cite evidence, label the statement as a hypothesis and test it.
- Capture control plane readiness and latency snapshots early.
- Record admission errors, webhook timeouts, and throttling events.
- Correlate symptoms to recent changes: deploys, policy, upgrades.
Authority boundaries
Incidents get worse when everyone has the power to ‘help.’ Authority boundaries prevent conflicting actions.
Kubblai governance defines who can pause GitOps, bypass admission, drain nodes, or change policy baselines.
- Define break-glass access with explicit approvals and audit.
- Keep an emergency rollback path for policy and admission systems.
- Prevent uncoordinated changes to control-plane adjacent components.
Mitigation patterns that hold
Mitigation is a sequence: stabilize → reduce blast radius → restore critical paths → repair forward.
Do not chase every error. Restore the system’s ability to converge.
- Stabilize admission and API health before scaling workloads.
- Throttle pathological tenants or controllers rather than draining blindly.
- Restore DNS/ingress/telemetry as critical infrastructure.
Postmortems as institutional memory
Postmortems are how doctrine evolves. The output is not a PDF; it is a guardrail: policy change, test suite addition, or runbook update that prevents repetition.
Every incident should end with at least one concrete platform improvement.
- Write a ‘should have caught’ section tied to tests/alerts/policy.
- Track remediation to completion; incomplete repairs become future outages.
- Publish lessons as archive texts and reference them in runbooks.
Canonical Link
Canonical URL: /library/incident-doctrine-for-platform-teams
Related Readings
Canonical Texts
LibraryIncident Response as a Trial of Faith
Incidents reveal the true governance of your platform: who can act, what can be changed, and whether your system can recover with discipline.
Rites & Trials
LibraryOn Call in the Dark Order: Kubernetes Failure Triage
When the pager rings, the rite is restraint: preserve evidence, choose reversible actions, and stabilize the control plane before you chase symptoms.
Canonical Texts
LibraryObservability for People Who Actually Carry the Pager
If observability does not change decisions during an incident, it is decoration. Signal must be tied to failure modes and owned by the people who respond.
Advanced Disciplines
LibraryThe Ritual of Safe Cluster Upgrades
Upgrades are not events. They are a governance loop: preflight, stage, validate, and preserve reversibility under pressure.
Advanced Disciplines
LibraryGitOps Beyond Ceremony: Where Declarative Systems Break
GitOps is powerful because it makes intent legible. It fails when intent is ambiguous, ownership is unclear, and emergency changes are not governed.