Governance & Power

Platform Cost Doctrine: Waste, Density, and the Economics of the Cluster

Cost is a signal. When ignored, it reappears as fragility: overloaded nodes, under-provisioned control planes, and rushed change driven by budget panic.

Authored as doctrine; evaluated as systems craft.

Doctrine

Platforms are economic systems. Every choice (requests, placement constraints, isolation posture) allocates scarce resources. If you refuse to govern cost, you govern outages instead.

Kubblai doctrine: cost must be legible to those who make decisions, and the incentives must align with reliability.

  • Make resource usage visible per tenant and workload class.
  • Treat requests discipline as a reliability control, not merely a budget issue.
  • Publish a cost posture: where you overpay intentionally for resilience.
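One way to make usage legible is to roll resource requests up by tenant and workload class. The sketch below is illustrative only: the record fields (`tenant`, `workload_class`, `cpu_request_cores`, `memory_request_gib`) are hypothetical names, and in practice the records would come from whatever inventory or metrics export the platform already has.

```python
from collections import defaultdict


def requests_by_tenant(pods):
    """Sum CPU and memory requests per (tenant, workload_class).

    `pods` is a list of dicts with hypothetical keys; this is an assumed
    record shape, not a Kubernetes API object.
    """
    totals = defaultdict(lambda: {"cpu_cores": 0.0, "memory_gib": 0.0})
    for pod in pods:
        key = (pod["tenant"], pod["workload_class"])
        totals[key]["cpu_cores"] += pod["cpu_request_cores"]
        totals[key]["memory_gib"] += pod["memory_request_gib"]
    return dict(totals)


# Made-up sample records for illustration.
sample_pods = [
    {"tenant": "payments", "workload_class": "critical",
     "cpu_request_cores": 2.0, "memory_request_gib": 4.0},
    {"tenant": "payments", "workload_class": "critical",
     "cpu_request_cores": 1.0, "memory_request_gib": 2.0},
    {"tenant": "search", "workload_class": "batch",
     "cpu_request_cores": 0.5, "memory_request_gib": 1.0},
]

report = requests_by_tenant(sample_pods)
for (tenant, workload_class), total in sorted(report.items()):
    print(tenant, workload_class, total["cpu_cores"], total["memory_gib"])
```

Grouping by both tenant and class keeps the intentional overpayment for critical tiers visible as its own line rather than hiding it inside a tenant total.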

Waste is rarely malicious

Most waste is structural: default limits, cargo-cult requests, and fear-driven overprovisioning. Teams pad requests to avoid incident blame; the scheduler then packs nodes less efficiently, and autoscalers add nodes for capacity that is reserved but never used.

The Order addresses waste by making the system safe enough to be honest.

  • Provide safe rollback and incident processes so teams don’t hoard capacity.
  • Use VPA recommendations judiciously; treat them as inputs, not authority.
  • Audit the top sources of waste quarterly and tie each one to remediation work.
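Treating a recommendation as an input rather than authority can be as simple as bounding how far a request may move in one review cycle. The following is a minimal sketch under assumed policy values: `propose_request`, `max_step`, and `floor` are hypothetical names and defaults, not part of the VPA API.

```python
def propose_request(current, recommended, max_step=0.25, floor=0.05):
    """Move `current` toward `recommended`, changing by at most `max_step`
    (a fraction of the current value) per review, and never below `floor`.

    All values are in the same unit (for example CPU cores). The step
    limit and floor are assumed policy knobs, chosen here for illustration.
    """
    lower = current * (1 - max_step)
    upper = current * (1 + max_step)
    bounded = min(max(recommended, lower), upper)
    return max(bounded, floor)


# A recommendation far below the current request is applied gradually.
print(propose_request(4.0, 1.0))   # 3.0
# A recommendation inside the allowed band is accepted as-is.
print(propose_request(1.0, 1.1))   # 1.1
# A large increase is also stepped, leaving room for human review.
print(propose_request(2.0, 8.0))   # 2.5
```

Stepping in both directions is a design choice: it limits the blast radius of a bad recommendation, at the cost of slower convergence when the recommendation is right.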

Density vs isolation: the governance tradeoff

Higher density reduces cost but increases blast radius. Isolation increases cost but reduces coupling and incident spread. There is no free posture; there is only an explicit trade.

Mature platforms choose isolation boundaries based on threat model and failure domains, not aesthetics.

  • Use multi-cluster or node pool separation for high-risk tenants/workloads.
  • Use quotas, policy baselines, and network boundaries in shared clusters.
  • Avoid false isolation: names without enforcement.

Autoscaling economics

Autoscalers convert demand into nodes. Their economics are shaped by provisioning latency, fragmentation, and your willingness to pay for headroom. Underprovisioning causes latency and scheduling failures; overprovisioning burns budget silently.

The Order defines headroom targets per failure domain and workload tier.

  • Set headroom targets for critical tiers; measure them continuously.
  • Plan for provisioning time; scarcity is often a matter of how quickly capacity arrives, not how much exists in total.
  • Treat node pool churn as an operational cost; minimize thrash.
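A headroom target only works if it is checked. As a rough sketch, headroom can be computed per failure domain as the unrequested share of allocatable capacity and compared against a per-tier target. The field names, the sample zones, and the target values below are assumptions for illustration.

```python
def headroom_report(domains, targets):
    """Return one row per failure domain with its headroom fraction and
    whether it meets the target for its tier.

    headroom = (allocatable - requested) / allocatable
    `domains` and `targets` use hypothetical field names and values.
    """
    rows = []
    for d in domains:
        if d["allocatable_cores"] <= 0:
            raise ValueError(f"no allocatable capacity in {d['name']}")
        free = d["allocatable_cores"] - d["requested_cores"]
        headroom = free / d["allocatable_cores"]
        target = targets[d["tier"]]
        rows.append({
            "name": d["name"],
            "headroom": headroom,
            "target": target,
            "meets_target": headroom >= target,
        })
    return rows


# Assumed targets: critical tiers keep 20% free, batch keeps 5%.
tier_targets = {"critical": 0.20, "batch": 0.05}
sample_domains = [
    {"name": "zone-a", "tier": "critical",
     "allocatable_cores": 100.0, "requested_cores": 75.0},
    {"name": "zone-b", "tier": "critical",
     "allocatable_cores": 80.0, "requested_cores": 72.0},
]

rows = headroom_report(sample_domains, tier_targets)
for row in rows:
    status = "ok" if row["meets_target"] else "BELOW TARGET"
    print(f"{row['name']}: {row['headroom']:.0%} (target {row['target']:.0%}) {status}")
```

This measures headroom against requests, which is what the scheduler sees. It does not account for provisioning latency or fragmentation, so a domain can pass this check and still fail to place a large pod quickly.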

A practical cost protocol

Cost discipline that survives real orgs is a cadence, not a crusade.

  • Monthly: rank the top 20 workloads by the gap between requested and used CPU/memory; remediate the worst.
  • Quarterly: review priority classes, quotas, and tenant budgets; adjust governance.
  • After incidents: update requests and rollout safety based on observed contention.
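The monthly review can be sketched as a ranking of workloads by the gap between what they request and what they use. This is a minimal illustration for CPU; memory works the same way. The field names are hypothetical, and `used_cores` is assumed to be a usage figure chosen by the platform team (for example a high percentile over the month), since the right measure depends on the workload.

```python
def top_waste(workloads, n=20):
    """Rank workloads by requested-minus-used capacity, largest gap first.

    Workloads that use more than they request count as zero waste here;
    they are a contention risk, not a waste source.
    """
    def waste(w):
        return max(w["requested_cores"] - w["used_cores"], 0.0)

    ranked = sorted(workloads, key=waste, reverse=True)
    return [(w["name"], waste(w)) for w in ranked[:n]]


# Made-up sample data for illustration.
sample_workloads = [
    {"name": "checkout-api", "requested_cores": 4.0, "used_cores": 1.0},
    {"name": "indexer", "requested_cores": 2.0, "used_cores": 1.5},
    {"name": "report-job", "requested_cores": 1.0, "used_cores": 2.0},
]

worst = top_waste(sample_workloads, n=2)
for name, wasted in worst:
    print(name, wasted)
```

Ranking by absolute gap rather than by ratio puts the largest reclaimable capacity first, which fits a cadence that only has time to remediate the worst offenders.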