Advanced Disciplines
Probes, Liveness, Readiness, and the Test of Worthiness
A probe is a contract between the workload and the cluster. Poor probes turn minor latency into systemic failure.
Text
Authored as doctrine; evaluated as operations.
Doctrine
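Readiness answers whether the service should receive traffic. Liveness answers whether the process should be restarted. Confusing the two is a classic self-inflicted outage: a liveness probe that encodes a readiness condition restarts processes that only needed to be taken out of rotation.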
Kubblai doctrine: probes must be designed with the same seriousness as SLOs.
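The separation between the two signals can be sketched as two endpoints backed by different conditions. This is a minimal illustration, not a reference implementation: the `/healthz` and `/readyz` paths, the `HealthState` flags, and the `serve` helper are assumptions chosen for the example. It relies only on the fact that Kubernetes treats an HTTP probe response with a status of 200 to 399 as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class HealthState:
    # Hypothetical flags; a real service would set these from its own lifecycle.
    warmed_up: bool = False  # caches loaded, connections established
    draining: bool = False   # shutting down, should stop taking new traffic
    wedged: bool = False     # unrecoverable state, e.g. a detected deadlock


def liveness_status(state: HealthState) -> int:
    """Fail only when a restart is the right remedy."""
    return 500 if state.wedged else 200


def readiness_status(state: HealthState) -> int:
    """Fail whenever this instance should not receive traffic right now."""
    serving = state.warmed_up and not state.draining and not state.wedged
    return 200 if serving else 503


def make_handler(state: HealthState):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":      # assumed liveness path
                code = liveness_status(state)
            elif self.path == "/readyz":     # assumed readiness path
                code = readiness_status(state)
            else:
                code = 404
            self.send_response(code)
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep probe traffic out of the log in this sketch

    return Handler


def serve(state: HealthState, port: int = 8080) -> None:
    """Blocking helper for local experiments; not called on import."""
    HTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

A cold instance in this sketch answers 200 on liveness and 503 on readiness: it is left alone to finish warming up, but it receives no traffic until it is ready.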
Flapping is a failure mode
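A readiness probe whose timeout sits below normal p95 latency fails on healthy pods. Each failure removes a pod from rotation, shifts its traffic onto the remaining pods, and invites cascading retries and overload. A liveness probe that kills a struggling process can prevent recovery, because the restart adds a cold start on top of the load that caused the struggle.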
Probe aggressiveness must be tuned to reality, not optimism.
- Use generous timeouts and failure thresholds for cold starts.
- Avoid liveness probes that depend on downstream services.
- Prefer startup probes for slow initialization paths.
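The tuning advice above can be checked with simple arithmetic before a probe ships. The sketch below models two things under simplifying assumptions: the time a startup probe allows before the container is killed (roughly `failureThreshold × periodSeconds`, plus any initial delay), and how often a readiness probe flips a pod to unready for a given series of probe latencies. The dataclass fields mirror the Kubernetes probe fields and their defaults; the function names and the latency samples are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeTiming:
    # Mirrors initialDelaySeconds, periodSeconds, timeoutSeconds,
    # failureThreshold and successThreshold, with the Kubernetes defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    timeout_seconds: float = 1.0
    failure_threshold: int = 3
    success_threshold: int = 1


def startup_budget_seconds(p: ProbeTiming) -> int:
    """Approximate time a startup probe grants before the container is killed."""
    return p.initial_delay_seconds + p.failure_threshold * p.period_seconds


def count_unready_transitions(latencies_s: Sequence[float], p: ProbeTiming) -> int:
    """Count ready -> unready flips for a series of probe response latencies.

    Simplified model: a probe fails when its latency exceeds the timeout,
    the pod turns unready after failure_threshold consecutive failures,
    and turns ready again after success_threshold consecutive successes.
    """
    ready = True
    fails = oks = 0
    transitions = 0
    for latency in latencies_s:
        if latency > p.timeout_seconds:
            fails += 1
            oks = 0
            if ready and fails >= p.failure_threshold:
                ready = False
                transitions += 1
        else:
            oks += 1
            fails = 0
            if not ready and oks >= p.success_threshold:
                ready = True
    return transitions


# Hypothetical probe latencies (seconds) for a service under ordinary load.
samples = [0.2, 1.5, 1.4, 1.6, 0.3, 1.2, 1.3, 1.1, 0.2]
print(count_unready_transitions(samples, ProbeTiming(timeout_seconds=1.0)))  # 2
print(count_unready_transitions(samples, ProbeTiming(timeout_seconds=2.0)))  # 0
print(startup_budget_seconds(ProbeTiming(failure_threshold=30)))             # 300
```

With the default one-second timeout the pod flaps twice on latencies the service produces routinely. Raising the timeout above the observed tail, or raising the failure threshold, removes the flapping without hiding a real outage.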
Dependency graphs and partial outages
A service can be alive but not safe to serve. It can be ready for some routes but not others. Probe design should reflect the failure boundaries you actually care about.
If you model your service as a single boolean, you will eventually be wrong.
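One way to avoid the single-boolean trap is to track readiness per failure boundary and decide deliberately which boundaries roll up into the probe response. The sketch below is illustrative only: the route names, dependency names, and the `ROUTE_DEPENDENCIES` table are hypothetical, and the roll-up rule (ready when every critical route is servable) is one possible policy, not a prescribed one.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from routes to the dependencies each one needs.
ROUTE_DEPENDENCIES: Dict[str, Set[str]] = {
    "/search": {"index"},
    "/checkout": {"db", "payments"},
    "/static": set(),
}


def route_ready(route: str, dep_up: Dict[str, bool]) -> bool:
    """A route is servable when every dependency it needs is up."""
    return all(dep_up.get(dep, False) for dep in ROUTE_DEPENDENCIES[route])


def pod_ready(dep_up: Dict[str, bool], critical_routes: Iterable[str]) -> bool:
    """The readiness probe still returns one boolean to the cluster, but it
    is derived from the routes declared critical, not from every dependency."""
    return all(route_ready(route, dep_up) for route in critical_routes)


# Partial outage: the database is down, everything else is up.
dep_up = {"index": True, "db": False, "payments": True}
print({route: route_ready(route, dep_up) for route in ROUTE_DEPENDENCIES})
print(pod_ready(dep_up, critical_routes=["/search"]))    # True
print(pod_ready(dep_up, critical_routes=["/checkout"]))  # False
```

In this example the pod stays in rotation for search traffic during a database outage, while the checkout route can fail fast at the application layer. Marking every route critical would remove the whole pod for a fault that affects one path.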
Operator practice
During incidents, inspect probe failures and correlate them with node pressure, CPU throttling, and upstream timeouts. Probes rarely fail in isolation.
A probe storm is often a symptom of systemic saturation.
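The correlation step can be as simple as asking what fraction of probe failures fall inside windows where another signal was active. The sketch below assumes the timestamps have already been exported from whatever event and metrics sources are in use; the function name, the `slack_s` parameter, and the sample data are hypothetical.

```python
from typing import Sequence, Tuple

Window = Tuple[float, float]  # (start, end) in seconds on a shared clock


def fraction_within(event_ts: Sequence[float],
                    windows: Sequence[Window],
                    slack_s: float = 0.0) -> float:
    """Fraction of events that fall inside any window, widened by slack_s
    on both sides to tolerate clock skew and scrape intervals."""
    if not event_ts:
        return 0.0
    hits = sum(
        1
        for t in event_ts
        if any(start - slack_s <= t <= end + slack_s for start, end in windows)
    )
    return hits / len(event_ts)


# Hypothetical incident data: probe failure times and CPU throttling windows.
probe_failures = [100.0, 105.0, 130.0, 400.0]
throttling = [(95.0, 110.0), (125.0, 128.0)]

print(fraction_within(probe_failures, throttling))               # 0.5
print(fraction_within(probe_failures, throttling, slack_s=5.0))  # 0.75
```

A high fraction suggests the probes are reporting saturation and not an independent fault. A low fraction points back at the probe configuration or the workload itself.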
Canonical Link
Canonical URL: /library/probes-liveness-readiness-and-the-test-of-worthiness