Evidence First

Troubleshooting Atlas

A searchable index of common Kubernetes failures. Each entry follows the same doctrine: symptom → likely causes → what to inspect → commands → resolution → related reading.

How to use the atlas

A small protocol that prevents thrash.

  • Confirm the symptom precisely (don’t generalize).
  • Run the inspection commands and collect evidence.
  • Choose the smallest safe fix and verify convergence.
  • Follow the related reading to strengthen the underlying model.

Entries

13 diagnostic texts · built for search and speed

Atlas

Troubleshoot

Atlas: Pods in CrashLoopBackOff

CrashLoopBackOff is a symptom. This entry provides a canonical triage sequence and safe resolutions.

Atlas: ImagePullBackOff / ErrImagePull

Pull failures are usually naming, auth, or network. This entry gives the shortest path to truth.

Atlas: Service Has No Endpoints

If endpoints are empty, traffic cannot route. This entry teaches the endpoint-first diagnostic sequence.
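
The endpoint-first check can be rehearsed offline. The following is a minimal Python sketch, assuming the Service and Pod objects have already been loaded as dicts (for example from JSON output) and live in the same namespace. The helper names `selector_matches`, `is_ready`, and `ready_backends` are hypothetical, and the sample data is illustrative. It relies on two documented behaviors: a Service selects pods whose labels contain every key/value pair in `spec.selector`, and by default only pods whose `Ready` condition is `True` count as ready endpoints.

```python
def selector_matches(selector, labels):
    """True when every key/value pair in the selector appears in the labels."""
    # A Service without a selector gets no automatically managed endpoints,
    # so an empty selector is treated as matching nothing.
    if not selector:
        return False
    return all(labels.get(k) == v for k, v in selector.items())


def is_ready(pod):
    """True when the pod reports a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def ready_backends(service, pods):
    """Names of pods (assumed same namespace) that would back the Service."""
    selector = service.get("spec", {}).get("selector", {})
    return [
        p["metadata"]["name"]
        for p in pods
        if selector_matches(selector, p["metadata"].get("labels", {})) and is_ready(p)
    ]


# Illustrative data, not taken from a real cluster.
service = {"spec": {"selector": {"app": "web"}}}
pods = [
    {"metadata": {"name": "web-1", "labels": {"app": "web"}},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "web-2", "labels": {"app": "web"}},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "api-1", "labels": {"app": "api"}},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]
print(ready_backends(service, pods))
```

An empty result points at one of two causes: no pod carries the selector's labels, or the matching pods are not Ready.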

Atlas: Pods Pending (Scheduling)

Pending pods are placement failures. This entry teaches you to read scheduler testimony and fix the governing constraint.
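
One place the scheduler's testimony is recorded is the pod's `PodScheduled` condition. The following is a minimal Python sketch, assuming a Pod object loaded as a dict. The function name `scheduling_testimony` is hypothetical, and the sample message only imitates the general shape of a scheduler message rather than quoting a real cluster.

```python
def scheduling_testimony(pod):
    """Return (reason, message) if the scheduler has not placed the pod, else None."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        # PodScheduled=False means no node has been assigned yet.
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return cond.get("reason"), cond.get("message")
    # Pending but already scheduled: the delay lies elsewhere (e.g. image pull).
    return None


# Illustrative pod status, not taken from a real cluster.
pending_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient cpu.",
            }
        ],
    }
}
print(scheduling_testimony(pending_pod))
```

A `None` result for a Pending pod means a node was already assigned, so the scheduling entry is the wrong place to look.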

Atlas: Readiness Probe Failing

Readiness is the traffic gate. This entry teaches probe semantics that prevent silent outages and restart storms.

Atlas: Admission Webhook Timeouts

When admission fails, deploys stop. This entry teaches the shortest path to identifying the offending webhook and restoring admission.

Atlas: Deployment Rollout Stalled

A rollout is a control loop with gates. This entry teaches how to read the gates and restore forward motion safely.

Atlas: Liveness Probe Restarts

Liveness is the kill switch. When it is wrong, it creates outages that look like instability.

Atlas: Ingress Returns 502/503

When ingress returns 502/503, the edge is telling you upstream is missing, unhealthy, or too slow.

Atlas: PVC Pending (Storage)

PVC Pending is a binding failure. This entry teaches how to read storage events and unblock provisioning safely.

Atlas: Node NotReady

Node NotReady is a failure domain boundary. This entry teaches containment first, then root cause.

Atlas: OOMKilled and Evictions

Memory failures are accounting failures. This entry shows how to prove the killer and right-size with restraint.
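
Proving the killer usually means separating a container-level OOM kill from a pod-level eviction. The following is a minimal Python sketch, assuming a Pod object loaded as a dict. The function name `memory_kill_evidence` is hypothetical and the sample data is illustrative. It checks two status fields: a terminated container state with reason `OOMKilled`, and a failed pod with reason `Evicted`. An eviction can also stem from pressure on other resources such as disk, so the accompanying message still has to be read.

```python
def memory_kill_evidence(pod):
    """Collect (subject, reason, detail) tuples describing kills and evictions."""
    findings = []
    status = pod.get("status", {})

    # Pod-level: the kubelet evicted the whole pod.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        findings.append(("pod", "Evicted", status.get("message", "")))

    # Container-level: check the previous termination first, then the current state.
    for cs in status.get("containerStatuses", []):
        terminated = (
            cs.get("lastState", {}).get("terminated")
            or cs.get("state", {}).get("terminated")
        )
        if terminated and terminated.get("reason") == "OOMKilled":
            detail = f"exit code {terminated.get('exitCode')}"
            findings.append((cs.get("name"), "OOMKilled", detail))
    return findings


# Illustrative pod status, not taken from a real cluster.
oom_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "app",
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    }
}
print(memory_kill_evidence(oom_pod))
```

An empty list means neither signal is present in the pod status, so the restart or disappearance needs a different explanation.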

Atlas: HPA Not Scaling

When HPA does nothing, either metrics are missing or the signal is wrong. This entry teaches the proof path.
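
To judge whether the signal is wrong, it helps to recompute what the autoscaler would do with the observed numbers. The following is a minimal Python sketch of the documented core formula, desired = ceil(current replicas × current metric / target metric), with the default 10% tolerance band inside which no scaling happens. The function name `desired_replicas` is hypothetical, and the sketch deliberately ignores min/max replica bounds, stabilization windows, and the handling of missing metrics or not-yet-ready pods.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA replica calculation, without bounds or stabilization."""
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Illustrative numbers: 4 replicas, target of 50 units per pod.
print(desired_replicas(4, 90, 50))  # well above target
print(desired_replicas(4, 52, 50))  # inside the tolerance band
print(desired_replicas(4, 25, 50))  # well below target
```

If the recomputed value equals the current replica count, the HPA is behaving as specified and the metric or target needs revisiting. If it differs, check the replica bounds and the HPA's events and conditions.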