Advanced Disciplines
Latency Budgets for Inference Systems Running on Kubernetes
Inference fails at the tail. A latency budget is doctrine: a written allocation of time across the request path, enforced by scaling, shedding, and controlled degradation.
Authored as doctrine; evaluated as systems craft.
Doctrine
Latency budgets are the most honest form of system design. They force you to admit what you can afford: queue time, compute time, network time, retries, and cold starts.
Kubblai doctrine: you cannot ‘optimize later’ if you never allocated time for the components that dominate p99.
- Write the budget as a table: component → target → measurement method.
- Measure p99 under load; optimize p50 only when p99 is bounded.
- Treat retries as debt; they multiply load during incidents.
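The budget-as-table discipline above can be sketched in a few lines. The component names, targets, and measurement methods here are illustrative assumptions, not prescriptions; the point is that the table is machine-checkable.

```python
# Hypothetical latency budget for a 250 ms p99 end-to-end target.
# Components, numbers, and measurement methods are illustrative.
BUDGET_MS = {
    "queue":   {"target_p99": 30,  "measured_by": "model-server queue wait histogram"},
    "compute": {"target_p99": 180, "measured_by": "GPU execution span"},
    "network": {"target_p99": 40,  "measured_by": "ingress duration minus upstream duration"},
}

def check_budget(observed_p99_ms: dict) -> list[str]:
    """Return the components whose observed p99 exceeds their allocation."""
    return [
        name for name, row in BUDGET_MS.items()
        if observed_p99_ms.get(name, 0.0) > row["target_p99"]
    ]

breaches = check_budget({"queue": 55.0, "compute": 160.0, "network": 20.0})
# "queue" breaches its 30 ms allocation even though total latency looks healthy
```

A breach list per component, rather than a single end-to-end number, is what makes the budget operable: it tells you which posture lever to pull.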
Decompose the path
Most inference stacks hide queue time. Kubernetes adds its own contributors on top: service routing, sidecars, node networking, and GPU contention.
If you cannot break latency into components, you cannot operate it.
- Queue time: admission into the model server (request queue/concurrency).
- Compute time: model execution, batching, and GPU saturation.
- Network time: ingress, service mesh, node hops, and TLS overhead.
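One minimal way to decompose a request, assuming you have span timestamps for enqueue and compute (the field names here are hypothetical, not a real tracing API):

```python
# Sketch: split one request's end-to-end latency into the three budget
# components from span timestamps (ms since request start). Field names
# are assumptions for illustration.
def decompose(spans: dict) -> dict:
    queue   = spans["compute_start"] - spans["enqueue"]
    compute = spans["compute_end"] - spans["compute_start"]
    # everything else is attributed to network: ingress, mesh, node hops, TLS
    network = spans["total"] - (queue + compute)
    return {"queue": queue, "compute": compute, "network": network}

parts = decompose({"enqueue": 5, "compute_start": 42, "compute_end": 210, "total": 238})
# {"queue": 37, "compute": 168, "network": 33}
```

Attributing the remainder to network is a simplification; the discipline it enforces is that no millisecond goes unaccounted.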
Budgets require posture decisions
A budget implies posture: warm capacity, scaling thresholds, and what you do when demand exceeds supply. Without posture, budgets become promises you cannot keep.
The Order prefers controlled degradation: load shedding and bounded concurrency.
- Warm pools for critical models; autoscaling for non-critical models.
- Concurrency limits to prevent tail collapse.
- Load shedding policies that preserve core functionality.
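Bounded concurrency with shedding can be sketched as follows. This is a minimal asyncio illustration, not a production admission controller; the limits are assumed tuning values derived from the measured budget.

```python
import asyncio

# Sketch: admit at most MAX_INFLIGHT concurrent requests and MAX_QUEUED
# waiters; beyond that, shed immediately rather than let the queue grow
# and collapse the tail. Limits are illustrative assumptions.
MAX_INFLIGHT = 8
MAX_QUEUED = 16

_sem = asyncio.Semaphore(MAX_INFLIGHT)
_waiting = 0

class Shed(Exception):
    """Request rejected to protect the tail (map to HTTP 503)."""

async def guarded(handler, request):
    global _waiting
    if _waiting >= MAX_QUEUED:
        raise Shed("queue full")
    _waiting += 1
    try:
        await _sem.acquire()       # bounded wait for an execution slot
    finally:
        _waiting -= 1
    try:
        return await handler(request)
    finally:
        _sem.release()
```

Rejecting early is the controlled degradation: a fast 503 for some requests preserves the budget for the rest, where an unbounded queue destroys it for all.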
Autoscaling under tail constraints
Scaling decisions must happen before p99 collapses. This often means scaling on queue depth or saturation rather than CPU. It also means accounting for node provisioning latency.
Autoscaling that reacts too late is a ritual without control.
- Scale on queue length and measured saturation; use stabilization windows.
- Budget for node pool scale-up time; keep headroom for tail.
- Avoid thrash; slow scale-down for stability.
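The scale-on-queue-depth posture with a stabilization window can be sketched like this, in the spirit of HPA/KEDA behavior. The target of four queued requests per replica is an assumed tuning value.

```python
# Sketch: compute desired replicas from queue depth. Scale up
# immediately; scale down only to the maximum desired seen over a
# recent stabilization window, so transient dips do not cause thrash.
TARGET_QUEUE_PER_REPLICA = 4  # assumed tuning value

def desired_replicas(queue_depth: int, current: int,
                     recent_desired: list[int]) -> int:
    raw = max(1, -(-queue_depth // TARGET_QUEUE_PER_REPLICA))  # ceil division
    if raw >= current:
        return raw                 # react fast on the way up
    return max(raw, max(recent_desired, default=raw))  # slow on the way down

desired_replicas(40, 5, [6, 4, 3])   # → 10: scale up immediately
desired_replicas(4, 10, [8, 6, 5])   # → 8: window caps the scale-down
```

Note what is absent: CPU. Queue depth and saturation lead the tail; CPU lags it.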
Observability that matches the budget
To enforce a latency budget, you need telemetry that reflects the budget components: queue time, compute time, and network time—with correlation to deploys and autoscaler actions.
Tail problems are often caused by rare interactions. You need traces.
- Trace queue time separately from compute time.
- Correlate latency spikes with node events, GPU contention, and rollouts.
- Alert on budget breach with clear runbook actions.
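The budget-breach alert above reduces to a quantile estimate over histogram buckets, the shape Prometheus-style telemetry exposes. This is a simplified sketch; the bucket bounds and the 200 ms budget are illustrative assumptions.

```python
# Sketch: estimate p99 from cumulative histogram buckets
# (upper_bound_ms, cumulative_count), sorted ascending, and flag a
# budget breach. Bounds and budget are illustrative.
BUDGET_P99_MS = 200  # assumed end-to-end allocation

def p99_from_buckets(buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]
    threshold = 0.99 * total
    for bound, cum in buckets:
        if cum >= threshold:
            return bound           # conservative: report the bucket's upper bound
    return buckets[-1][0]

p99 = p99_from_buckets([(50, 900), (100, 970), (250, 990), (500, 1000)])
breach = p99 > BUDGET_P99_MS       # True here: 250 ms estimate against a 200 ms budget
```

The alert that fires on `breach` should name the component that broke, per the budget table, and point at the runbook action for that component.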
Canonical Link
Canonical URL: /library/latency-budgets-for-inference-systems-running-on-kubernetes