
Sacred Systems

DNS in Kubernetes: What Fails and Why

DNS is not a single system. It is a chain with distinct failure classes. Learn to classify by symptom and prove each link from inside the affected namespace.


Authored as doctrine; evaluated as systems craft.

Doctrine

DNS is often blamed because it is where failures are observed. Many ‘DNS incidents’ are actually routing incidents, policy incidents, or upstream availability incidents.

Kubblai doctrine: classify by symptom (NXDOMAIN vs timeout vs SERVFAIL), then prove the chain from inside the cluster.

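  • NXDOMAIN: the resolver answered and the name does not exist. A misspelled Service name or the wrong namespace is common.
  • Timeout: no answer arrived at all. A reachability, policy, or egress problem is common.
  • SERVFAIL: the resolver answered but could not complete the lookup. A CoreDNS health problem or an upstream resolution failure is common.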
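The classification above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: it assumes the input is the text output of dig, that the response header carries a "status:" field, and that timeouts contain the phrase "timed out". The function name classify_dns is hypothetical.

```shell
# classify_dns: read dig output on stdin and print the likely failure class.
# Illustrative only; the matching strings assume dig's usual output format.
classify_dns() {
  out=$(cat)
  case "$out" in
    *"status: NXDOMAIN"*) echo "NXDOMAIN: check the name and the namespace" ;;
    *"status: SERVFAIL"*) echo "SERVFAIL: check CoreDNS health and upstream resolution" ;;
    *"status: NOERROR"*)  echo "NOERROR: resolution works; look elsewhere" ;;
    *"timed out"*)        echo "TIMEOUT: check reachability, policy, and egress" ;;
    *)                    echo "UNKNOWN: inspect the raw output" ;;
  esac
}

# Assumed usage from inside the cluster (placeholders in angle brackets;
# the image must ship dig):
#   kubectl exec -n <namespace> <pod> -- dig <service>.<namespace>.svc.cluster.local | classify_dns
```

Run it from a pod in the affected namespace, not from a workstation. The point is to observe what the workload observes.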

Search domains and ndots

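Client resolvers expand names using search domains. Kubernetes pods default to ndots:5 under the ClusterFirst DNS policy, so a name with fewer than five dots is tried against each search domain before it is tried as written. One lookup can therefore generate several queries, amplifying load and latency.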
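To make the expansion concrete, here is a minimal sketch that prints the candidate names a glibc-style resolver would try, in order. The function name expand_queries is hypothetical, and the search list in the usage example is the typical one for a pod in the default namespace, stated here as an assumption. Other resolvers (musl, for example) order or skip attempts differently.

```shell
# expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints one candidate name per line, in glibc-style attempt order.
# A real resolver stops at the first candidate that resolves.
expand_queries() {
  name=$1; ndots=$2; shift 2
  # A trailing dot marks the name as fully qualified: no search expansion.
  case "$name" in
    *.) echo "$name"; return 0 ;;
  esac
  # Count the dots in the name.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  # With at least ndots dots, the name as written is tried first.
  if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
  for domain in "$@"; do echo "$name.$domain."; done
  # With fewer dots, the name as written is the last attempt.
  if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

# An external name with two dots and ndots=5 yields four candidates:
expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Each candidate is usually queried for both A and AAAA records, so the worst case here is eight queries for one lookup. With a trailing dot (api.example.com.) the function prints a single candidate.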

In high-throughput systems, DNS query amplification becomes a real cost and a real failure mode.

  • Measure DNS QPS and latency; treat it as production telemetry.
  • Be conservative about global resolver tweaks; validate impact per workload class.
  • Use fully qualified names in critical paths when ambiguity causes harm.

CoreDNS posture

CoreDNS is a deployment with resources, scaling characteristics, and upstream dependencies. If it is starved, everything looks broken.

If you enforce NetworkPolicy, ensure DNS egress is explicit: allow traffic to the cluster DNS on port 53 over both UDP and TCP.

kubectl

shell

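# List kube-system pods whose names suggest cluster DNS (rg is ripgrep; grep -nE works the same way here).
kubectl get pods -n kube-system | rg -n "coredns|dns" || true
# List the matching Services to find the cluster DNS Service and its ClusterIP.
kubectl get svc -n kube-system | rg -n "dns" || true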
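
Explicit DNS egress under NetworkPolicy could look like the sketch below. It is a starting point under stated assumptions, not a drop-in policy: it assumes the DNS pods run in kube-system with the label k8s-app=kube-dns, that the namespace carries the kubernetes.io/metadata.name label, and that DNS listens on port 53. The function name dns_egress_policy and the policy name allow-dns-egress are hypothetical. Verify the labels and port in your cluster before applying.

```shell
# dns_egress_policy NAMESPACE
# Prints a NetworkPolicy manifest allowing DNS egress for every pod in NAMESPACE.
# Assumptions: DNS pods are labeled k8s-app=kube-dns in kube-system, port 53.
dns_egress_policy() {
  namespace=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${namespace}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Assumed usage (placeholder in angle brackets):
#   dns_egress_policy <namespace> | kubectl apply -f -
```

TCP is included alongside UDP because large responses fall back to TCP. A policy that allows only UDP can pass simple tests and still fail on bigger answers.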

Field notes

Treat DNS timeouts as a signal that the client cannot reach the resolver, or that the resolver cannot reach its upstream. Do not assume the resolver is wrong; assume the path is blocked until proven otherwise.

When CoreDNS is unhealthy, stop debugging application symptoms. Fix DNS first.
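
The "fix DNS first" rule can be turned into a gate at the top of a runbook. A minimal sketch follows, assuming the DNS Deployment is named coredns in kube-system (the kubeadm default; other distributions may differ). The function name coredns_ready is hypothetical.

```shell
# coredns_ready READY DESIRED
# Succeeds only when at least one replica is desired and all of them are ready.
# An empty READY (no ready replicas reported) is treated as zero.
coredns_ready() {
  ready=${1:-0}; desired=${2:-0}
  [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Assumed usage against a cluster:
#   ready=$(kubectl get deploy coredns -n kube-system -o jsonpath='{.status.readyReplicas}')
#   desired=$(kubectl get deploy coredns -n kube-system -o jsonpath='{.spec.replicas}')
#   coredns_ready "$ready" "$desired" || echo "CoreDNS is degraded: fix DNS first"
```

Ready replicas are a floor, not proof of health. A pod can be ready and still fail upstream lookups, so pair this gate with a real query from inside the namespace.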