Atlas: DNS Resolution Failures

Symptom

Pods cannot resolve service names; errors show NXDOMAIN, SERVFAIL, or timeouts.

Tags: DNS, Networking, Operations, Reliability

What this usually means

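Each error points to a different class of failure. NXDOMAIN means the queried name does not exist, usually because the name or namespace is wrong. SERVFAIL means the DNS server could not complete the query, usually because CoreDNS or its upstream resolver is unhealthy. A timeout means the query never reached the DNS service, or the response never came back. The fix depends on which class you are in.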

Likely causes

Treat DNS as a chain: application → network/policy → CoreDNS → upstream resolution.

Name mismatch or wrong namespace (NXDOMAIN).
NetworkPolicy default-deny blocks UDP/TCP 53 to kube-dns/CoreDNS.
CoreDNS pods unhealthy or overloaded; upstream resolution failing.
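Search domains/ndots multiplying queries: with the Kubernetes default of ndots:5, a name with fewer than five dots is tried against each search domain before it is tried as-is, which amplifies load on CoreDNS.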
Node/CNI routing issues prevent reaching the kube-dns service VIP.
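
The search-domain/ndots cause above is easier to reason about with a concrete expansion. The sketch below is a hypothetical helper (expand_queries is not a real tool) that prints the names a resolver would try, in order, for a given name, ndots value, and search list. It approximates the glibc ordering; other resolvers, such as musl, differ in the details, so treat it as an illustration rather than a reference.

```shell
#!/bin/sh
# Hypothetical helper: list the candidate names tried for one lookup.
# Usage: expand_queries NAME NDOTS SEARCH_DOMAIN...
expand_queries() {
  name=$1; ndots=$2; shift 2   # remaining arguments are the search domains

  # A trailing dot marks an absolute name: the search list is skipped.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # Count the dots in the name by deleting every other character.
  dots_only=$(printf '%s' "$name" | tr -cd '.')
  dots=${#dots_only}

  if [ "$dots" -ge "$ndots" ]; then
    # Enough dots: try the name as-is first, then the search list.
    printf '%s.\n' "$name"
    for domain in "$@"; do printf '%s.%s.\n' "$name" "$domain"; done
  else
    # Too few dots: walk the search list first, then the name as-is.
    for domain in "$@"; do printf '%s.%s.\n' "$name" "$domain"; done
    printf '%s.\n' "$name"
  fi
}
```

For example, expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local prints three search-list candidates before api.example.com. itself, so one external lookup can turn into four names to resolve. Writing the name with a trailing dot avoids the expansion.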

What to inspect first

Classify by symptom, then prove from inside the affected namespace.

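NXDOMAIN suggests a naming or scope problem (service name, namespace, or FQDN).
SERVFAIL suggests unhealthy CoreDNS pods or failing upstream resolution.
Timeout suggests routing, policy, or egress problems, or a CoreDNS outage.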
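
The classification step can be sketched as a small shell function that maps captured lookup output to a failure class. classify_dns_output is a hypothetical helper, and the matched strings are the ones nslookup and dig commonly print; the exact wording varies between tools and versions, so adjust the patterns to what your client emits.

```shell
#!/bin/sh
# Hypothetical helper: map captured lookup output to a failure class.
# The patterns assume common nslookup/dig wording and may need adjusting.
classify_dns_output() {
  output=$1
  case $output in
    *NXDOMAIN*)
      echo "NXDOMAIN" ;;   # name does not exist: check name, namespace, FQDN
    *SERVFAIL*)
      echo "SERVFAIL" ;;   # server failed: check CoreDNS and upstream
    *"timed out"*|*"no servers could be reached"*)
      echo "TIMEOUT" ;;    # no answer: check policy, routing, CoreDNS pods
    *)
      echo "OTHER" ;;      # success or an unrecognized message: read it
  esac
}
```

A possible use, once a debug pod exists in the affected namespace: classify_dns_output "$(kubectl exec -n <ns> dns-debug -- nslookup <name> 2>&1)".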

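Confirm that the CoreDNS pods are running and that the DNS service exists:

# CoreDNS pods, with node placement and restart counts
kubectl get pods -n kube-system -o wide | grep -Ei "coredns|dns" || true
# Cluster DNS service (commonly named kube-dns)
kubectl get svc -n kube-system | grep -i "dns" || true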

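Run a lookup from a debug pod in the affected namespace:

kubectl run -n <ns> dns-debug --image=busybox:1.36 --restart=Never --command -- sh -c "sleep 3600"
# Wait until the pod is Ready so the exec does not race pod startup
kubectl wait -n <ns> --for=condition=Ready pod/dns-debug --timeout=60s
kubectl exec -n <ns> dns-debug -- nslookup kubernetes.default.svc.cluster.local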
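
If that lookup fails, the same debug pod can be used to walk the chain link by link. The following is a minimal sketch, not a definitive procedure: dns_chain_checks is a hypothetical helper, and it assumes the DNS service is named kube-dns in kube-system and that the CoreDNS pods carry the label k8s-app=kube-dns, which is common but not guaranteed on every distribution.

```shell
#!/bin/sh
# Hypothetical helper: check each link between the client and CoreDNS.
# Assumes a pod named dns-debug exists in namespace $1, a DNS service named
# kube-dns in kube-system, and CoreDNS pods labeled k8s-app=kube-dns.
dns_chain_checks() {
  ns=$1

  # 1. Client config: nameserver, search domains, and ndots seen by the pod.
  kubectl exec -n "$ns" dns-debug -- cat /etc/resolv.conf

  # 2. Service VIP: send the query to the DNS service address explicitly.
  dns_ip=$(kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')
  kubectl exec -n "$ns" dns-debug -- nslookup kubernetes.default.svc.cluster.local "$dns_ip"

  # 3. CoreDNS: recent log lines from the DNS pods.
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
}
```

Call it as dns_chain_checks <ns>. If step 1 shows an unexpected nameserver, the problem is client configuration; if step 2 times out, suspect policy or routing; if step 3 shows errors, suspect CoreDNS or its upstream.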

Resolution guidance

Fix the first broken link in the chain, then verify stability under real traffic.

If policy blocks DNS, add explicit egress to kube-dns and monitor query volume.
If CoreDNS is unhealthy, restore it first; downstream debugging is wasted until DNS is stable.
If ndots/search behavior is pathological, adjust client configuration carefully and measure.
If the name is wrong, fix the reference (service name, namespace, or FQDN).
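
For the policy case, explicit DNS egress could look like the sketch below. dns_egress_policy is a hypothetical helper that prints a NetworkPolicy manifest for a given namespace. The policy name allow-dns-egress, the k8s-app: kube-dns pod label, and the kubernetes.io/metadata.name namespace label are assumptions that hold on many clusters but should be checked against yours, especially if a node-local DNS cache is in use.

```shell
#!/bin/sh
# Hypothetical helper: print a NetworkPolicy allowing DNS egress from every
# pod in namespace $1 to the cluster DNS pods. Labels below are assumptions.
dns_egress_policy() {
  ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

Review the output first, then apply it with dns_egress_policy <ns> | kubectl apply -f -. Both UDP and TCP are allowed because large responses fall back to TCP.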
