Lab · Intermediate
Lab: DNS Triage Inside the Cluster
DNS failures masquerade as everything. This lab teaches a minimal sequence to prove whether the problem is DNS, routing, policy, or the application.
Prerequisites
What you should have before you begin.
- A cluster with CoreDNS/kube-dns
- kubectl installed
- Ability to run a debug pod
Lab text
Follow the sequence. Change one thing at a time.
Goal
You will learn to prove DNS behavior from inside the cluster: service names, FQDNs, search domains, and failure modes caused by policy or egress restrictions.
- Confirm DNS service and CoreDNS pods are healthy.
- Run queries from a debug pod.
- Distinguish NXDOMAIN from timeouts.
Check the DNS system
Start with the system components. If they are unhealthy, stop there.
- If CoreDNS is CrashLooping, fix that first.
- If the DNS service has no endpoints, you have a control-plane/label issue.
kubectl
shell
kubectl get svc -n kube-system | rg -n "dns"
kubectl get pods -n kube-system | rg -n "coredns|dns"Query from inside the cluster
Run queries from a pod in the affected namespace (policy differs by namespace).
- NXDOMAIN suggests name mismatch.
- Timeout suggests routing/policy/egress or CoreDNS overload.
kubectl
shell
kubectl run -n <ns> dns-debug --image=busybox:1.36 --restart=Never --command -- sh -c "sleep 3600"
kubectl exec -n <ns> dns-debug -- nslookup kubernetes.default.svc.cluster.localCommon causes
DNS symptoms often come from non-DNS sources.
- NetworkPolicy default-deny blocks UDP/TCP 53 to CoreDNS.
- Node-level DNS or CNI issues prevent reaching kube-dns service VIP.
- CoreDNS overload or upstream resolution failures.
- Search domain/ndots behavior causing surprising query patterns.
Verify and record
Write down what changed the outcome. DNS issues recur.
- Service FQDN resolves consistently.
- Timeouts stop.
- Applications stop retry storms (reduced load).
Related
Canonical link
Canonical URL: /labs/dns-triage-inside-the-cluster