Sacred Systems
DNS in Kubernetes: What Fails and Why
DNS is not a single system. It is a chain with distinct failure classes. Learn to classify by symptom and prove from inside the namespace.
Authored as doctrine; evaluated as systems craft.
Doctrine
DNS is often blamed because it is where failures are observed. Many ‘DNS incidents’ are actually routing incidents, policy incidents, or upstream availability incidents.
Kubblai doctrine: classify by symptom (NXDOMAIN vs timeout vs SERVFAIL), then prove the chain from inside the cluster.
- NXDOMAIN: usually a name mismatch or the wrong namespace.
- Timeout: usually reachability, policy, or egress.
- SERVFAIL: usually CoreDNS health or upstream resolution.
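The classification above can be sketched as two small shell helpers. This is a minimal illustration, not a definitive tool: the function names `dns_status_from_dig` and `classify_dns_symptom` are hypothetical, and the parser assumes the usual `dig` header line containing `status: <RCODE>,` and a "timed out" or "no servers could be reached" message when no reply arrives.

```shell
# Extract a symptom from dig output read on stdin.
# Prints the rcode from the header (for example NXDOMAIN or SERVFAIL),
# TIMEOUT if no reply arrived, or UNKNOWN otherwise.
dns_status_from_dig() {
  dsd_out=$(cat)
  dsd_status=$(printf '%s\n' "$dsd_out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
  if [ -n "$dsd_status" ]; then
    printf '%s\n' "$dsd_status"
  elif printf '%s\n' "$dsd_out" | grep -qiE 'timed out|no servers could be reached'; then
    printf 'TIMEOUT\n'
  else
    printf 'UNKNOWN\n'
  fi
}

# Map a symptom to the failure class to investigate first.
classify_dns_symptom() {
  case "$1" in
    NXDOMAIN) echo "name: check spelling, namespace, and search-domain expansion" ;;
    TIMEOUT)  echo "path: check reachability, NetworkPolicy, and egress" ;;
    SERVFAIL) echo "resolver: check CoreDNS health and upstream resolution" ;;
    NOERROR)  echo "ok: resolution succeeded; look beyond DNS" ;;
    *)        echo "unknown: capture the raw resolver output" ;;
  esac
}
```

One way to use it, assuming a pod whose image ships `dig` (the pod, namespace, and service names are placeholders): run `kubectl exec -n team-a my-pod -- dig +time=2 +tries=1 my-svc.team-a.svc.cluster.local | dns_status_from_dig`, then pass the result to `classify_dns_symptom`.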
Search domains and ndots
Client resolvers expand names using search domains. When a name has fewer dots than the ndots threshold, the resolver tries each search domain before the name as written, so a single lookup can generate several queries, amplifying load and latency.
In high-throughput systems, DNS query amplification becomes a real cost and a real failure mode.
- Measure DNS QPS and latency; treat it as production telemetry.
- Be conservative about global resolver tweaks; validate impact per workload class.
- Use fully qualified names in critical paths when ambiguity causes harm.
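To make the amplification concrete, the sketch below lists the candidate queries a glibc-style resolver would try for a given name. It is an illustration under stated assumptions: `expand_queries` is a hypothetical helper, other resolver implementations may order or skip candidates differently, and the ndots value and search list in the usage note mirror what Kubernetes typically writes into a pod's /etc/resolv.conf for a namespace called `team-a` (a placeholder).

```shell
# Usage: expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints, one per line and in order, the names a glibc-style resolver would try.
expand_queries() {
  eq_name=$1
  eq_ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: no search expansion.
  case "$eq_name" in
    *.) printf '%s\n' "$eq_name"; return 0 ;;
  esac

  # Count the dots in the name.
  eq_only_dots=$(printf '%s' "$eq_name" | tr -cd '.')
  eq_dots=${#eq_only_dots}

  # At or above the threshold, the name as written is tried first.
  if [ "$eq_dots" -ge "$eq_ndots" ]; then
    printf '%s.\n' "$eq_name"
  fi

  # Each search domain is appended in turn.
  for eq_domain in "$@"; do
    printf '%s.%s.\n' "$eq_name" "$eq_domain"
  done

  # Below the threshold, the name as written is tried last.
  if [ "$eq_dots" -lt "$eq_ndots" ]; then
    printf '%s.\n' "$eq_name"
  fi
}
```

For example, `expand_queries api.example.com 5 team-a.svc.cluster.local svc.cluster.local cluster.local` prints four candidates, with `api.example.com.` last, while `expand_queries api.example.com. 5 ...` prints only the one fully qualified name. That difference is the case for trailing-dot names in critical paths.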
CoreDNS posture
CoreDNS is a deployment with resources, scaling characteristics, and upstream dependencies. If it is starved, everything looks broken.
If you enforce NetworkPolicy, allow DNS egress explicitly: pods need to reach the cluster DNS on port 53 over both UDP and TCP.
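A minimal sketch of an explicit DNS egress rule, rendered from a shell function so it can be reviewed before it is applied. Several details are assumptions to verify against your cluster: the helper name `render_dns_egress_policy` and the policy name `allow-dns-egress` are hypothetical, the `k8s-app: kube-dns` pod label is the one commonly used for CoreDNS, and the `kubernetes.io/metadata.name` namespace label is only set automatically on recent Kubernetes versions. Clusters that run a node-local DNS cache will need a different destination.

```shell
# Usage: render_dns_egress_policy NAMESPACE
# Prints a NetworkPolicy that lets every pod in NAMESPACE reach cluster DNS.
render_dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # assumed CoreDNS pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

To check the manifest without changing the cluster, pipe it through a client-side dry run: `render_dns_egress_policy team-a | kubectl apply --dry-run=client -f -`.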
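# List the CoreDNS pods and the cluster DNS Service (names vary by distribution).
# "|| true" keeps the exit status zero when grep finds no match.
kubectl get pods -n kube-system | grep -E "coredns|dns" || true
kubectl get svc -n kube-system | grep "dns" || true
Field notes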
Treat DNS timeouts as a signal that a client cannot reach its resolver, or that the resolver cannot reach its upstream. Do not assume the resolver is wrong; assume the path is blocked until proven otherwise.
When CoreDNS is unhealthy, stop debugging application symptoms. Fix DNS first.
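Proving the chain from inside a namespace can be sketched as a single throwaway pod that shows its resolver configuration, resolves a cluster-internal name, then resolves an external one. This is an illustration under stated assumptions: `dns_chain_probe` and the pod name `dns-probe` are hypothetical, the `busybox:1.36` image must be pullable in your environment, and a temporary pod does not carry the labels of the workload under investigation, so label-selected NetworkPolicy may treat it differently.

```shell
# Usage: dns_chain_probe NAMESPACE [EXTERNAL_NAME]
# Set DRY_RUN=1 to print the kubectl command instead of running it.
dns_chain_probe() {
  probe_ns=$1
  probe_ext=${2:-example.com}

  # Steps are joined with ";" so each one runs even if the previous one fails:
  # 1. resolver config, 2. cluster-internal name, 3. external (upstream) name.
  probe_script="cat /etc/resolv.conf; nslookup kubernetes.default.svc.cluster.local; nslookup $probe_ext"

  # Build the command as positional parameters to preserve quoting.
  set -- kubectl run dns-probe --rm -i --restart=Never \
    --image=busybox:1.36 -n "$probe_ns" --command -- sh -c "$probe_script"

  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Reading the output in order narrows the failure class: a wrong search list points at naming, a timeout on the internal name points at the path to cluster DNS, and a failure only on the external name points at upstream resolution.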
Canonical Link
Canonical URL: /library/dns-in-kubernetes-what-fails-and-why