Rites & Trials

Incident Doctrine for Platform Teams

Platform incidents are governance incidents. The doctrine must define authority, evidence, safe mitigations, and how memory becomes a guardrail.

Authored as doctrine; evaluated as systems craft.

Doctrine

In platform incidents, ambiguity kills. Multiple teams change the same system; multiple tools report conflicting truths. The doctrine must create a single spine: authority, evidence, and procedure.

Kubblai doctrine: incident command is a technical role. It exists to preserve optionality and reduce harm.

  • Define incident roles: commander, operations, communications, scribe.
  • Prefer reversible mitigations; require justification for irreversible actions.
  • Write the timeline as you go; memory evaporates under stress.
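
The roles and the timeline habit above can be sketched as a small append-only log. This is a minimal illustration, not a prescribed tool: `Role`, `TimelineEntry`, and `IncidentTimeline` are hypothetical names, and the only rule enforced is that an irreversible action cannot be recorded without a justification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Role(Enum):
    COMMANDER = "commander"
    OPERATIONS = "operations"
    COMMUNICATIONS = "communications"
    SCRIBE = "scribe"


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime          # UTC timestamp taken when the entry is written
    role: Role            # role that took or reported the action
    note: str
    reversible: bool
    justification: str


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def record(self, role, note, reversible=True, justification=""):
        # Irreversible actions must carry a written justification.
        if not reversible and not justification.strip():
            raise ValueError("irreversible action requires a justification")
        entry = TimelineEntry(
            at=datetime.now(timezone.utc),
            role=role,
            note=note,
            reversible=reversible,
            justification=justification.strip(),
        )
        self.entries.append(entry)  # append-only: entries are never edited
        return entry
```

A scribe would call `timeline.record(Role.OPERATIONS, "paused rollout")` as events happen, so the postmortem starts from timestamps rather than recollection.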

Evidence standards

A platform team must refuse folklore. Every major decision should reference a signal: metrics, logs, traces, events, or audit records.

If you cannot cite evidence, label the statement as a hypothesis and test it.

  • Capture control-plane readiness and latency snapshots early.
  • Record admission errors, webhook timeouts, and throttling events.
  • Correlate symptoms to recent changes: deploys, policy, upgrades.
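
Two of these habits are easy to mechanize: labeling uncited statements as hypotheses, and listing the changes that landed shortly before symptoms began. The sketch below assumes change records are already available as plain objects; `Change`, `classify_claim`, `changes_before`, and the two-hour default window are illustrative choices, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    at: datetime
    kind: str      # e.g. "deploy", "policy", "upgrade"
    summary: str


def classify_claim(statement, evidence):
    """Prefix a statement with EVIDENCE or HYPOTHESIS depending on citations."""
    label = "EVIDENCE" if evidence else "HYPOTHESIS"
    return f"[{label}] {statement}"


def changes_before(symptom_start, changes, window=timedelta(hours=2)):
    """Return changes inside the window before symptom onset, newest first."""
    candidates = [
        c for c in changes
        if symptom_start - window <= c.at <= symptom_start
    ]
    return sorted(candidates, key=lambda c: c.at, reverse=True)
```

The newest-first ordering reflects a common heuristic (the most recent change is the first suspect), but proximity in time is itself only a hypothesis until a signal confirms it.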

Authority boundaries

Incidents get worse when everyone has the power to ‘help.’ Authority boundaries prevent conflicting actions.

Kubblai governance defines who can pause GitOps, bypass admission, drain nodes, or change policy baselines.

  • Define break-glass access with explicit approvals and audit.
  • Keep an emergency rollback path for policy and admission systems.
  • Prevent uncoordinated changes to control-plane-adjacent components.
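
One way to make these boundaries concrete is an explicit authority map with an audit trail. The sketch below is hypothetical throughout: the action names, the role assignments in `AUTHORITY`, and the rule that break-glass actions need one independent approver are assumptions chosen for illustration, not the doctrine's actual policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical authority map: which incident roles may take which action.
AUTHORITY = {
    "pause_gitops": {"commander", "operations"},
    "drain_nodes": {"operations"},
    "bypass_admission": {"commander"},
    "change_policy_baseline": {"commander"},
}
# Actions treated as break-glass: they also need an independent approver.
BREAK_GLASS = {"bypass_admission", "change_policy_baseline"}


@dataclass
class Authorizer:
    audit_log: list = field(default_factory=list)

    def authorize(self, actor, role, action, approvers=()):
        # Unknown actions are denied by default.
        allowed = role in AUTHORITY.get(action, set())
        if allowed and action in BREAK_GLASS:
            # The actor cannot approve their own break-glass request.
            allowed = bool(set(approvers) - {actor})
        # Every attempt is audited, including denials.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "role": role,
            "action": action,
            "approvers": sorted(approvers),
            "allowed": allowed,
        })
        return allowed
```

Recording denials alongside approvals matters here: a burst of denied attempts during an incident is itself evidence that people are trying to act outside the command structure.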

Mitigation patterns that hold

Mitigation is a sequence: stabilize → reduce blast radius → restore critical paths → repair forward.

Do not chase every error. Restore the system’s ability to converge.

  • Stabilize admission and API health before scaling workloads.
  • Throttle pathological tenants or controllers rather than draining blindly.
  • Restore DNS/ingress/telemetry as critical infrastructure.
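
The ordering in the mitigation sequence can be enforced rather than remembered. The following is a minimal sketch, assuming the four phases are tracked as a set of completed steps; `Phase`, `next_phase`, and `may_start` are illustrative names.

```python
from enum import IntEnum
from typing import Optional, Set


class Phase(IntEnum):
    STABILIZE = 1
    REDUCE_BLAST_RADIUS = 2
    RESTORE_CRITICAL_PATHS = 3
    REPAIR_FORWARD = 4


def next_phase(completed: Set[Phase]) -> Optional[Phase]:
    """Return the earliest phase not yet completed, or None when all are done."""
    for phase in Phase:  # iterates in definition order
        if phase not in completed:
            return phase
    return None


def may_start(phase: Phase, completed: Set[Phase]) -> bool:
    """A phase may start only once every earlier phase is complete."""
    return all(p in completed for p in Phase if p < phase)
```

For example, `may_start(Phase.RESTORE_CRITICAL_PATHS, {Phase.STABILIZE})` is `False`: restoring critical paths waits until the blast radius has been reduced.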

Postmortems as institutional memory

Postmortems are how doctrine evolves. The output is not a PDF; it is a guardrail: a policy change, a test-suite addition, or a runbook update that prevents repetition.

Every incident should end with at least one concrete platform improvement.

  • Write a ‘should have caught’ section tied to specific tests, alerts, or policies.
  • Track remediation to completion; incomplete repairs become future outages.
  • Publish lessons as archive texts and reference them in runbooks.
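
The closing rule can be expressed as a simple gate: an incident is not closed until it has produced at least one guardrail and every remediation item is finished. This sketch is an assumption about how such tracking might look; `Remediation`, `GUARDRAIL_KINDS`, and `can_close` are hypothetical names.

```python
from dataclasses import dataclass

# Kinds of remediation that count as a guardrail in this sketch.
GUARDRAIL_KINDS = {"policy", "test", "alert", "runbook"}


@dataclass
class Remediation:
    title: str
    kind: str          # one of GUARDRAIL_KINDS, or something else such as "cleanup"
    done: bool = False


def can_close(items):
    """True only if at least one guardrail exists and every item is complete."""
    has_guardrail = any(item.kind in GUARDRAIL_KINDS for item in items)
    return has_guardrail and all(item.done for item in items)
```

An empty remediation list fails the gate by design, since an incident with no follow-up has produced no guardrail.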