Lab: DNS Triage Inside the Cluster

Goal

You will learn to prove DNS behavior from inside the cluster: service names, FQDNs, search domains, and failure modes caused by policy or egress restrictions.

Confirm DNS service and CoreDNS pods are healthy.
Run queries from a debug pod.
Distinguish NXDOMAIN from timeouts.

Check the DNS system

Start with the system components. If they are unhealthy, stop there.

If CoreDNS is CrashLooping, fix that first.
If the DNS service has no endpoints, you have a control-plane/label issue.

kubectl

shell

kubectl get svc -n kube-system | rg -n "dns"
kubectl get pods -n kube-system | rg -n "coredns|dns"

Query from inside the cluster

Run queries from a pod in the affected namespace (policy differs by namespace).

NXDOMAIN suggests name mismatch.
Timeout suggests routing/policy/egress or CoreDNS overload.

kubectl

shell

kubectl run -n <ns> dns-debug --image=busybox:1.36 --restart=Never --command -- sh -c "sleep 3600"
kubectl exec -n <ns> dns-debug -- nslookup kubernetes.default.svc.cluster.local

Common causes

DNS symptoms often come from non-DNS sources.

NetworkPolicy default-deny blocks UDP/TCP 53 to CoreDNS.
Node-level DNS or CNI issues prevent reaching kube-dns service VIP.
CoreDNS overload or upstream resolution failures.
Search domain/ndots behavior causing surprising query patterns.

Verify and record

Write down what changed the outcome. DNS issues recur.

Service FQDN resolves consistently.
Timeouts stop.
Applications stop retry storms (reduced load).

Prerequisites

Lab text

Goal

Check the DNS system

Query from inside the cluster

Common causes

Verify and record