Atlas Entry
Atlas: DNS Resolution Failures
Symptom → evidence → resolution.
Symptom
Pods cannot resolve service names; errors show NXDOMAIN, SERVFAIL, or timeouts.
Tags: DNS, Networking, Operations, Reliability
What this usually means
Either the name is wrong (NXDOMAIN), the DNS system is unhealthy (SERVFAIL), or the query cannot reach the DNS service (timeouts). The fix depends on which class you’re in.
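The three-way split above can be sketched as a small shell helper. This is a minimal illustration, not a definitive tool: the function name `classify_dns_failure` is hypothetical, and the matched strings are assumptions about what common resolvers (such as `nslookup` or `dig`) print, so adjust them to the output you actually see.

```shell
# Hypothetical helper: map resolver output to one of three failure classes.
# The matched substrings are assumptions about typical nslookup/dig output.
classify_dns_failure() (
  output="$1"
  case "$output" in
    *NXDOMAIN*)
      echo "name" ;;          # wrong service name, namespace, or FQDN
    *SERVFAIL*)
      echo "server" ;;        # CoreDNS unhealthy or upstream failing
    *"timed out"*|*"no servers could be reached"*)
      echo "reachability" ;;  # policy, routing, or DNS service unreachable
    *)
      echo "unknown" ;;       # inspect the raw output by hand
  esac
)

# Example: prints "name"
classify_dns_failure "** server can't find foo.default.svc.cluster.local: NXDOMAIN"
```

Feeding it the captured output of a failing lookup gives a first hint about which link in the chain to inspect; it does not replace reading the raw error.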
Likely causes
Treat DNS as a chain: application → network/policy → CoreDNS → upstream resolution.
- Name mismatch or wrong namespace (NXDOMAIN).
- NetworkPolicy default-deny blocks UDP/TCP 53 to kube-dns/CoreDNS.
- CoreDNS pods unhealthy or overloaded; upstream resolution failing.
- Search domains/ndots causing surprising query patterns and amplified load.
- Node/CNI routing issues prevent reaching the kube-dns service VIP.
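To make the search-domain/ndots cause concrete, the sketch below prints the order in which a glibc-style resolver would try candidate names. It is a simplified model under stated assumptions: the function name `expand_queries` is hypothetical, the search list and `ndots` value mirror what Kubernetes typically writes into a pod's `/etc/resolv.conf` (`ndots:5` and a `cluster.local` cluster domain), and other resolver libraries may behave differently.

```shell
# Hypothetical helper: expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints candidate names in the order a glibc-style resolver would try them.
expand_queries() (
  name="$1"; ndots="$2"; shift 2
  # A trailing dot marks the name as fully qualified: no search expansion.
  case "$name" in
    *.) echo "$name"; exit 0 ;;
  esac
  # Count the dots in the name.
  dots_only=$(printf '%s' "$name" | tr -cd '.')
  dots=${#dots_only}
  # Enough dots: the name is tried as-is before the search list.
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the name as-is is only tried last.
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
)

# An external name with 2 dots and ndots=5 produces 4 candidates,
# three of which are cluster-internal names that cannot exist.
expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Each candidate is usually queried for both A and AAAA records, so one external lookup can turn into several queries against CoreDNS. Appending a trailing dot (`api.example.com.`) skips the expansion entirely.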
What to inspect first
Classify by symptom, then prove from inside the affected namespace.
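- NXDOMAIN suggests a naming or scope problem (wrong service name or namespace).
- SERVFAIL suggests CoreDNS is unhealthy or upstream resolution is failing.
- Timeout suggests a routing, policy, or egress problem, or a CoreDNS outage.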
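```shell
# List DNS pods and services in kube-system (uses ripgrep; `|| true` keeps a script going when nothing matches).
kubectl get pods -n kube-system | rg -n "coredns|dns" || true
kubectl get svc -n kube-system | rg -n "dns" || true
```
```shell
# Start a throwaway pod in the affected namespace, wait until it is Ready, then resolve a known-good name.
kubectl run -n <ns> dns-debug --image=busybox:1.36 --restart=Never --command -- sh -c "sleep 3600"
kubectl wait -n <ns> --for=condition=Ready pod/dns-debug --timeout=60s
kubectl exec -n <ns> dns-debug -- nslookup kubernetes.default.svc.cluster.local
```
Resolution guidance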
Fix the first broken link in the chain, then verify stability under real traffic.
- If policy blocks DNS, add explicit egress to kube-dns and monitor query volume.
- If CoreDNS is unhealthy, restore it first; downstream debugging is wasted until DNS is stable.
- If ndots/search behavior is pathological, adjust client configuration carefully and measure.
- If the name is wrong, fix the reference (service name, namespace, or FQDN).
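For the policy case, the explicit egress rule could look roughly like the manifest printed by the sketch below. Treat it as an illustration under assumptions: the helper name `dns_egress_policy` and the policy name `allow-dns-egress` are hypothetical, and the selectors assume the DNS pods run in `kube-system` with the conventional `k8s-app: kube-dns` label and that namespaces carry the `kubernetes.io/metadata.name` label. Verify both against your cluster before applying.

```shell
# Hypothetical helper: print an egress NetworkPolicy allowing DNS on UDP/TCP 53.
# Selectors and names are assumptions; check them against your cluster.
dns_egress_policy() (
  ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
)

# Print the manifest for a namespace; review it before applying.
dns_egress_policy my-namespace
```

After review, the output can be piped into `kubectl apply -f -`. Both UDP and TCP are allowed because large responses fall back to TCP.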
Canonical link
Canonical URL: /atlas/dns-resolution-failures