Advanced Disciplines
Latency Budgets for Inference Systems Running on Kubernetes
Inference fails at the tail. A latency budget is doctrine: a written allocation of time across the request path, enforced by scaling, shedding, and controlled degradation.
Authored as doctrine; evaluated as systems craft.
Doctrine
Latency budgets are the most honest form of system design. They force you to admit what you can afford: queue time, compute time, network time, retries, and cold starts.
Kubblai doctrine: you cannot ‘optimize later’ if you never allocated time for the components that dominate p99.
- Write the budget as a table: component → target → measurement method.
- Measure p99 under load; optimize p50 only when p99 is bounded.
- Treat retries as debt; they multiply load during incidents.
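The budget-as-table discipline above can be sketched in a few lines. The component names, targets, and measurement methods here are illustrative assumptions, not prescriptions; the point is that the table is machine-checkable.

```python
# Hypothetical latency budget for a 250 ms p99 end-to-end target.
# Components, numbers, and measurement methods are illustrative.
BUDGET_MS = {
    "queue":   {"target_p99": 30,  "measured_by": "model-server queue wait histogram"},
    "compute": {"target_p99": 180, "measured_by": "GPU execution span"},
    "network": {"target_p99": 40,  "measured_by": "ingress duration minus upstream duration"},
}

def check_budget(observed_p99_ms: dict) -> list[str]:
    """Return the components whose observed p99 exceeds their allocation."""
    return [
        name for name, row in BUDGET_MS.items()
        if observed_p99_ms.get(name, 0.0) > row["target_p99"]
    ]

breaches = check_budget({"queue": 55.0, "compute": 160.0, "network": 20.0})
# "queue" breaches its 30 ms allocation even though total latency looks healthy
```

A breach list per component, rather than a single end-to-end number, is what makes the budget operable: it tells you which posture lever to pull.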
Decompose the path
Most inference stacks hide queue time. Kubernetes adds its own contributors on top: service routing, sidecars, node networking, and GPU contention.
If you cannot break latency into components, you cannot operate it.
- Queue time: admission into the model server (request queue/concurrency).
- Compute time: model execution, batching, and GPU saturation.
- Network time: ingress, service mesh, node hops, and TLS overhead.
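One minimal way to decompose a request, assuming you have span timestamps for enqueue and compute (the field names here are hypothetical, not a real tracing API):

```python
# Sketch: split one request's end-to-end latency into the three budget
# components from span timestamps (ms since request start). Field names
# are assumptions for illustration.
def decompose(spans: dict) -> dict:
    queue   = spans["compute_start"] - spans["enqueue"]
    compute = spans["compute_end"] - spans["compute_start"]
    # everything else is attributed to network: ingress, mesh, node hops, TLS
    network = spans["total"] - (queue + compute)
    return {"queue": queue, "compute": compute, "network": network}

parts = decompose({"enqueue": 5, "compute_start": 42, "compute_end": 210, "total": 238})
# {"queue": 37, "compute": 168, "network": 33}
```

Attributing the remainder to network is a simplification; the discipline it enforces is that no millisecond goes unaccounted.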
Budgets require posture decisions
A budget implies posture: warm capacity, scaling thresholds, and what you do when demand exceeds supply. Without posture, budgets become promises you cannot keep.
The Order prefers controlled degradation: load shedding and bounded concurrency.
- Warm pools for critical models; autoscaling for non-critical models.
- Concurrency limits to prevent tail collapse.
- Load shedding policies that preserve core functionality.
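Bounded concurrency with shedding can be sketched as follows. This is a minimal asyncio illustration, not a production admission controller; the limits are assumed tuning values derived from the measured budget.

```python
import asyncio

# Sketch: admit at most MAX_INFLIGHT concurrent requests and MAX_QUEUED
# waiters; beyond that, shed immediately rather than let the queue grow
# and collapse the tail. Limits are illustrative assumptions.
MAX_INFLIGHT = 8
MAX_QUEUED = 16

_sem = asyncio.Semaphore(MAX_INFLIGHT)
_waiting = 0

class Shed(Exception):
    """Request rejected to protect the tail (map to HTTP 503)."""

async def guarded(handler, request):
    global _waiting
    if _waiting >= MAX_QUEUED:
        raise Shed("queue full")
    _waiting += 1
    try:
        await _sem.acquire()       # bounded wait for an execution slot
    finally:
        _waiting -= 1
    try:
        return await handler(request)
    finally:
        _sem.release()
```

Rejecting early is the controlled degradation: a fast 503 for some requests preserves the budget for the rest, where an unbounded queue destroys it for all.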
Autoscaling under tail constraints
Scaling decisions must happen before p99 collapses. This often means scaling on queue depth or saturation rather than CPU. It also means accounting for node provisioning latency.
Autoscaling that reacts too late is a ritual without control.
- Scale on queue length and measured saturation; use stabilization windows.
- Budget for node pool scale-up time; keep headroom for tail.
- Avoid thrash; slow scale-down for stability.
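The scale-on-queue-depth posture with a stabilization window can be sketched like this, in the spirit of HPA/KEDA behavior. The target of four queued requests per replica is an assumed tuning value.

```python
# Sketch: compute desired replicas from queue depth. Scale up
# immediately; scale down only to the maximum desired seen over a
# recent stabilization window, so transient dips do not cause thrash.
TARGET_QUEUE_PER_REPLICA = 4  # assumed tuning value

def desired_replicas(queue_depth: int, current: int,
                     recent_desired: list[int]) -> int:
    raw = max(1, -(-queue_depth // TARGET_QUEUE_PER_REPLICA))  # ceil division
    if raw >= current:
        return raw                 # react fast on the way up
    return max(raw, max(recent_desired, default=raw))  # slow on the way down

desired_replicas(40, 5, [6, 4, 3])   # → 10: scale up immediately
desired_replicas(4, 10, [8, 6, 5])   # → 8: window caps the scale-down
```

Note what is absent: CPU. Queue depth and saturation lead the tail; CPU lags it.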
Observability that matches the budget
To enforce a latency budget, you need telemetry that reflects the budget components: queue time, compute time, and network time—with correlation to deploys and autoscaler actions.
Tail problems are often caused by rare interactions. You need traces.
- Trace queue time separately from compute time.
- Correlate latency spikes with node events, GPU contention, and rollouts.
- Alert on budget breach with clear runbook actions.
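The budget-breach alert above reduces to a quantile estimate over histogram buckets, the shape Prometheus-style telemetry exposes. This is a simplified sketch; the bucket bounds and the 200 ms budget are illustrative assumptions.

```python
# Sketch: estimate p99 from cumulative histogram buckets
# (upper_bound_ms, cumulative_count), sorted ascending, and flag a
# budget breach. Bounds and budget are illustrative.
BUDGET_P99_MS = 200  # assumed end-to-end allocation

def p99_from_buckets(buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]
    threshold = 0.99 * total
    for bound, cum in buckets:
        if cum >= threshold:
            return bound           # conservative: report the bucket's upper bound
    return buckets[-1][0]

p99 = p99_from_buckets([(50, 900), (100, 970), (250, 990), (500, 1000)])
breach = p99 > BUDGET_P99_MS       # True here: 250 ms estimate against a 200 ms budget
```

The alert that fires on `breach` should name the component that broke, per the budget table, and point at the runbook action for that component.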
Canonical Link
Canonical URL: /library/latency-budgets-for-inference-systems-running-on-kubernetes