Probes, Liveness, Readiness, and the Test of Worthiness

Doctrine

Readiness is whether the service should receive traffic. Liveness is whether the process should be restarted. Confusing them is a classic self-inflicted outage.
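
The split can be sketched as two separate handlers. This is a minimal illustration, not a prescribed implementation: ServiceState and its fields are hypothetical names, and the returned integers stand in for HTTP status codes (Kubernetes treats any HTTP status from 200 to 399 as a probe success).

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical in-process state; the field names are illustrative."""
    event_loop_responsive: bool = True   # process-local health only
    caches_warmed: bool = False          # set once startup work finishes
    database_reachable: bool = False     # a downstream dependency


def liveness_status(state: ServiceState) -> int:
    """Answers: should this process be restarted?

    Looks only at process-local state, so a downstream outage
    never triggers a restart.
    """
    return 200 if state.event_loop_responsive else 500


def readiness_status(state: ServiceState) -> int:
    """Answers: should this instance receive traffic?

    May consider dependencies, because a failed readiness probe only
    removes the pod from load balancing; it does not restart it.
    """
    ready = state.caches_warmed and state.database_reachable
    return 200 if ready else 503
```

With the database down, readiness_status returns 503 while liveness_status still returns 200: the pod leaves rotation but keeps running.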

Kubblai doctrine: probes must be designed with the same seriousness as SLOs.

Flapping is a failure mode

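A readiness probe whose timeout sits below normal p95 latency pulls healthy pods out of rotation, which shifts load onto the remaining pods and invites cascading retries and overload. A liveness probe that kills a struggling process can prevent recovery, because each restart discards in-memory progress and forces another cold start.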

Probe aggressiveness must be tuned to reality, not optimism.
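
One way to check a setting against reality is to replay observed probe latencies through the threshold logic and count the flips. The sketch below is a simplified model, not the kubelet's actual code. The function name is hypothetical and the sample latencies are made up for illustration.

```python
def count_readiness_flips(latencies_ms, timeout_ms,
                          failure_threshold=3, success_threshold=1):
    """Count ready <-> not-ready transitions for a series of probe latencies.

    An attempt fails when its latency exceeds timeout_ms. The pod becomes
    not ready after failure_threshold consecutive failures and ready again
    after success_threshold consecutive successes.
    """
    ready = True
    consecutive_failures = 0
    consecutive_successes = 0
    flips = 0
    for latency in latencies_ms:
        if latency > timeout_ms:
            consecutive_failures += 1
            consecutive_successes = 0
            if ready and consecutive_failures >= failure_threshold:
                ready = False
                flips += 1
        else:
            consecutive_successes += 1
            consecutive_failures = 0
            if not ready and consecutive_successes >= success_threshold:
                ready = True
                flips += 1
    return flips


# Made-up latencies (ms) with an ordinary slow tail, for illustration only.
SAMPLE_LATENCIES_MS = [120, 900, 130, 950, 980, 140, 110, 1200, 1100, 1050, 150]

strict = count_readiness_flips(SAMPLE_LATENCIES_MS, timeout_ms=500, failure_threshold=1)
tolerant = count_readiness_flips(SAMPLE_LATENCIES_MS, timeout_ms=500, failure_threshold=3)
```

On this sample the strict setting flips more often than the tolerant one. Raising the timeout above the slow tail removes the flapping entirely.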

Use generous timeouts and failure thresholds for cold starts.
Avoid liveness probes that depend on downstream services.
Prefer startup probes for slow initialization paths.
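
The guidance above can be turned into a rough budget check. A startup probe gives the container roughly periodSeconds times failureThreshold to come up, after any initial delay. The sketch below assumes that approximation and ignores per-attempt timeouts. ProbeTiming mirrors the Kubernetes field names and defaults, while the helper functions and the 1.5x headroom factor are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Mirrors the Kubernetes probe fields initialDelaySeconds,
    periodSeconds, and failureThreshold, with their default values."""
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def startup_budget_seconds(probe: ProbeTiming) -> int:
    """Approximate time the container has before the probe gives up."""
    return (probe.initial_delay_seconds
            + probe.period_seconds * probe.failure_threshold)


def covers_cold_start(probe: ProbeTiming,
                      worst_cold_start_seconds: float,
                      headroom: float = 1.5) -> bool:
    """True if the budget exceeds the worst observed cold start.

    headroom is an assumed safety factor, not a Kubernetes setting.
    """
    return startup_budget_seconds(probe) >= worst_cold_start_seconds * headroom


defaults = ProbeTiming()
generous = ProbeTiming(period_seconds=10, failure_threshold=30)
```

Here the defaults give about 30 seconds, while the generous setting gives about 300 seconds, enough for a 120-second cold start with headroom.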

Dependency graphs and partial outages

A service can be alive but not safe to serve. It can be ready for some routes but not others. Probe design should reflect the failure boundaries you actually care about.

If you model your service as a single boolean, you will eventually be wrong.
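
A sketch of the alternative is to track health per dependency and derive route-level health from it. Kubernetes readiness is still a single answer per container, so this example reports the pod ready only when a chosen set of critical routes works and leaves degraded routes for the application to reject itself. The dependency names, route groups, and the choice of critical routes are all hypothetical.

```python
from enum import Enum


class Dep(Enum):
    DATABASE = "database"
    SEARCH = "search"
    PAYMENTS = "payments"


# Hypothetical mapping from route group to the dependencies it needs.
ROUTE_REQUIREMENTS = {
    "static": set(),
    "catalog": {Dep.DATABASE, Dep.SEARCH},
    "checkout": {Dep.DATABASE, Dep.PAYMENTS},
}

# Assumed policy: the pod stays in rotation while these routes work.
CRITICAL_ROUTES = {"static", "catalog"}


def route_health(healthy_deps):
    """Map each route group to whether all of its dependencies are healthy."""
    return {route: needs <= healthy_deps
            for route, needs in ROUTE_REQUIREMENTS.items()}


def pod_ready(healthy_deps):
    """Readiness answer for the probe: every critical route is healthy."""
    health = route_health(healthy_deps)
    return all(health[route] for route in CRITICAL_ROUTES)
```

With payments down, pod_ready({Dep.DATABASE, Dep.SEARCH}) is still true and only the checkout route is marked unhealthy, so a partial outage does not pull the whole pod out of rotation.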

Operator practice

During incidents, inspect probe failures and correlate them with node pressure, CPU throttling, and upstream timeouts. Probes rarely fail in isolation.

A probe storm is often a symptom of systemic saturation.
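
A first pass at that correlation is to group probe failures by node and see whether they cluster. The sketch below assumes events have already been exported into dictionaries with node, pod, and reason keys. That record shape, the function names, and the min_pods cutoff are hypothetical. The one detail taken from Kubernetes is that probe failures are recorded as events with the reason Unhealthy.

```python
from collections import defaultdict


def failing_pods_by_node(events):
    """Group distinct pods with probe failures by the node they run on.

    events: iterable of dicts with 'node', 'pod', and 'reason' keys
    (a hypothetical, pre-parsed shape).
    """
    pods = defaultdict(set)
    for event in events:
        if event["reason"] == "Unhealthy":
            pods[event["node"]].add(event["pod"])
    return dict(pods)


def suspect_nodes(events, min_pods=3):
    """Nodes where at least min_pods distinct pods failed probes.

    Many unrelated pods failing on one node points at the node
    (pressure, throttling) rather than at any single service.
    """
    grouped = failing_pods_by_node(events)
    return sorted(node for node, pods in grouped.items()
                  if len(pods) >= min_pods)
```

A node that shows up here is worth checking for memory or disk pressure and CPU throttling before anyone tunes an individual service's probe.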
