CSI and the Persistence of State

Doctrine

Stateful systems demand humility. Storage is slower than compute, less forgiving than networking, and more expensive to recover.

Kubblai doctrine: treat the Container Storage Interface (CSI) as critical infrastructure, not as a feature.

Topology and attachment realities

Volumes are not universal. They have zones, attachment limits, and failure domains. Where the scheduler places a pod and where its volume can attach must agree, or the workload becomes unschedulable.

Operators who ignore topology inevitably debug ‘Pending’ at 3 a.m.

Understand storage class topology constraints.
Know attachment limits per node type.
Test failover and reattach time.
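
The checks above can be sketched as a small feasibility test. This is an illustrative model only, not how the Kubernetes scheduler or any CSI driver is implemented: the `Node` and `Volume` classes, the zone names, and the attachment limits are all hypothetical and assume you already know each node type's limit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    zone: str
    max_volumes: int       # attachment limit for this node type (assumed known)
    attached_volumes: int  # volumes currently attached


@dataclass
class Volume:
    name: str
    zone: str  # a zonal volume can only attach to nodes in the same zone


def feasible_nodes(volume: Volume, nodes: list[Node]) -> list[str]:
    """Return the names of nodes that could attach the volume."""
    return [
        n.name
        for n in nodes
        if n.zone == volume.zone and n.attached_volumes < n.max_volumes
    ]


# Hypothetical cluster state.
nodes = [
    Node("node-a", zone="zone-1", max_volumes=2, attached_volumes=2),
    Node("node-b", zone="zone-2", max_volumes=2, attached_volumes=0),
]
data = Volume("data-0", zone="zone-1")

# node-a is in the right zone but full; node-b has room but is in the wrong zone.
print(feasible_nodes(data, nodes))  # prints []
```

An empty result here is the toy equivalent of a pod stuck in ‘Pending’: capacity exists in the cluster, but not where the volume lives.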

Expansion, snapshots, and recovery

Volume expansion and snapshots are operational tools. They must be tested before you need them. The first time you attempt restoration should not be during incident response.

Treat backup/restore as a production workflow with SLOs.
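
One way to make that concrete is to record every restore drill and compare it against a target. The sketch below assumes a 30-minute recovery target and hand-written drill results; both are hypothetical, and a real workflow would pull durations and verification results from whatever tooling runs the drills.

```python
from datetime import timedelta

RESTORE_SLO = timedelta(minutes=30)  # hypothetical recovery-time target

# Hypothetical drill records: (drill id, restore duration, data verified after restore)
drills = [
    ("drill-01", timedelta(minutes=12), True),
    ("drill-02", timedelta(minutes=41), True),
    ("drill-03", timedelta(minutes=18), False),
]


def slo_violations(drills, slo):
    """Return ids of drills that exceeded the SLO or failed verification."""
    return [
        drill_id
        for drill_id, duration, verified in drills
        if duration > slo or not verified
    ]


print(slo_violations(drills, RESTORE_SLO))  # prints ['drill-02', 'drill-03']
```

A fast restore that fails verification counts as a violation here, since a restore that returns bad data has not recovered anything.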

Failure signatures

CSI failures are often slow: attach timeouts, mount hangs, and node-level issues. Your observability must include node logs, CSI controller logs, and events.

Do not debug stateful incidents with only application logs.
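
Because these failures unfold slowly and across components, it helps to read the sources as one timeline. The following is a minimal sketch with hand-written, hypothetical records; the timestamps and messages are invented for illustration, and real entries would come from the node, the CSI controller, and the cluster event stream.

```python
from datetime import datetime

# Hypothetical records: (ISO timestamp, source, message)
node_logs = [("2024-01-01T03:00:40", "node", "mount still in progress")]
controller_logs = [("2024-01-01T03:00:05", "csi-controller", "attach requested")]
events = [("2024-01-01T03:02:10", "event", "FailedMount: timed out waiting for volume")]


def merged_timeline(*sources):
    """Merge records from all sources into one list ordered by timestamp."""
    entries = [entry for source in sources for entry in source]
    return sorted(entries, key=lambda entry: datetime.fromisoformat(entry[0]))


for timestamp, source, message in merged_timeline(node_logs, controller_logs, events):
    print(timestamp, source, message)
```

In this toy timeline the attach request and the mount wait both appear before the event surfaces, which is the kind of ordering application logs alone would never show.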
