Troubleshooting Kubernetes Step-by-Step for DevOps Beginners

Before You Start Debugging

Don't restart, delete, or re-apply anything yet. Each of those actions destroys evidence of the problem.

The first step of good debugging is to observe first and fix later.

Mental Model: The Layers to Check

K8s is a stack of layers, and a problem can occur at any of them:

1. Application code   ← bug, exception, hang
2. Container          ← OOM, crash, wrong image
3. Pod                ← probe fail, init fail
4. Workload (Deploy)  ← replica count mismatch, stuck rollout
5. Service / Endpoint ← wrong selector
6. Ingress            ← wrong rule, expired cert
7. CNI / Network      ← policy block, DNS
8. Node               ← disk full, kubelet down
9. Cluster (control)  ← apiserver, etcd

Work from the top layer (closest to the user) down to the bottom layer (infrastructure). Most problems turn up in layers 1-5.

Step 1: Pin Down What the "Problem" Actually Is

Don't stop at "the site is down". Ask yourself specific questions:

Are all endpoints down, or only some?
Is it down 100% of the time, or only intermittently?
When did it start: before or after a deploy?
Are only some users affected, or everyone?
What error shows up: 500, 502, timeout?

If you can answer these, the debugging scope often shrinks to roughly a quarter of what it was.

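Step 2: Look at the Big Picture First

# list the main workload resources in the namespace
# (`get all` omits Ingress, ConfigMap, Secret, PVC and others)
kubectl get all -n production

# list pods in every namespace whose STATUS is not Running or Completed
# (pods stuck at 0/1 Running are filtered out too, so check the READY column separately)
kubectl get pods -A | grep -v Running | grep -v Completed

Things to look for:

CrashLoopBackOff: the container keeps restarting
Error: the container exited with an error
ImagePullBackOff: the image cannot be pulled
Pending: not scheduled onto a node yet
0/1 Running: started but not ready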
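A stricter filter helps here, because `grep -v Running` also drops pods that show Running but are stuck at 0/1 in the READY column. The following is a minimal sketch, assuming the default `kubectl get pods` column layout (with or without -A). `unhealthy_pods` is a hypothetical helper name, not a kubectl feature, and the sample rows are made up for illustration.

```shell
# Print the header plus every pod that is unhealthy:
# STATUS is not Running/Completed, or READY is x/y with x != y.
# Reads `kubectl get pods` output on stdin.
unhealthy_pods() {
  awk '
    NR == 1 {
      # Locate the READY and STATUS columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "READY")  ready = i
        if ($i == "STATUS") status = i
      }
      print
      next
    }
    {
      if ($status == "Completed") next
      split($ready, r, "/")
      if ($status != "Running" || r[1] != r[2]) print
    }
  '
}

# Illustrative sample instead of a live cluster.
printf '%s\n' \
  'NAMESPACE NAME READY STATUS RESTARTS AGE' \
  'prod web-1 1/1 Running 0 3d' \
  'prod web-2 0/1 Running 0 5m' \
  'prod api-1 0/1 CrashLoopBackOff 5 3m' \
  'prod job-1 0/1 Completed 0 1h' | unhealthy_pods
```

Against a real cluster the same function would be used as `kubectl get pods -A | unhealthy_pods`.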

Step 3: Drill Into the Unhealthy Pod

kubectl describe pod <pod-name> -n <namespace>

Read the key sections:

`Status`

Phase: Running
Conditions:
  Ready: False
  ContainersReady: False

`Containers`

State: Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason: Error
  Exit Code: 1
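
The Exit Code under Last State is worth decoding before moving on. By shell convention, a code above 128 means the process was killed by a signal (code minus 128). A small sketch of that mapping follows; `explain_exit_code` is a hypothetical helper, and the wording of each explanation is a rule of thumb rather than anything kubectl reports.

```shell
# Translate a container exit code into a likely meaning.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited normally" ;;
    1)   echo "application error, check logs --previous" ;;
    137) echo "killed by SIGKILL (128+9), often OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Signal number is the exit code minus 128.
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

explain_exit_code 137
```

For code 137, compare it with the Reason field in the same section: OOMKilled there confirms the memory limit was hit.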

`Events` (the most important section)

Events:
  Warning  Failed     2m  Error: ImagePullBackOff
  Warning  Unhealthy  1m  Liveness probe failed: HTTP 500

As a rule of thumb, about 90% of root causes show up in Events.

Step 4: Read the Container Logs

# current logs
kubectl logs <pod>

# logs from the previous container instance (the one that just crashed)
kubectl logs <pod> --previous

# follow in real time
kubectl logs -f <pod>

# logs from all pods with the same label
kubectl logs -l app=myapp --tail=100

# last 100 lines
kubectl logs <pod> --tail=100

If the pod has several containers:

kubectl logs <pod> -c <container-name>
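
When the log is long, grouping repeated error lines makes the dominant failure stand out. This is a rough sketch, assuming each line optionally starts with a single-token RFC 3339 timestamp (the format `kubectl logs --timestamps` prepends). `top_errors` is a hypothetical helper and the keyword list is only a starting point.

```shell
# Read a log on stdin and print the most frequent error-like lines.
# $1 = how many distinct lines to show (default 10).
top_errors() {
  grep -iE 'error|fatal|exception|panic' \
    | sed -E 's/^[0-9TZ:.+-]+ +//' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Illustrative sample instead of real pod logs.
printf '%s\n' \
  '2024-05-01T10:00:00Z Error: connect ECONNREFUSED postgres:5432' \
  '2024-05-01T10:00:05Z Error: connect ECONNREFUSED postgres:5432' \
  '2024-05-01T10:00:06Z listening on 8080' \
  '2024-05-01T10:00:07Z FATAL: out of retries' | top_errors 5
```

With a live pod this would be `kubectl logs <pod> --previous --timestamps | top_errors`.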

Step 5: Test From Inside the Pod

kubectl exec -it <pod> -- sh

Inside the shell:

# DNS
nslookup kubernetes.default

# connectivity
nc -zv postgres-svc 5432
curl http://other-service:8080

# process
ps aux

# port listening
ss -tlnp

# environment
env | grep DB

If the image has no shell or tools, use a debug pod:

kubectl run -it --rm debug \
  --image=nicolaka/netshoot \
  --restart=Never -- bash

Step 6: Check the Service / Endpoints

kubectl get svc -n production
kubectl get endpoints -n production

Inspect the endpoints of the affected service:

kubectl get endpoints myapp -o yaml

If the endpoints list is empty (kubectl shows <none> in the ENDPOINTS column), no ready pod matches the selector.

Compare the selector with the pod labels:

kubectl get svc myapp -o yaml | grep -A 3 selector
kubectl get pods --show-labels | grep myapp
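
A Service selector matches a pod only when every key=value pair in the selector is present on the pod; extra pod labels are fine. The sketch below mimics that rule on plain strings so the comparison can be done without eyeballing. `selector_matches` is a hypothetical helper, and both arguments are assumed to be comma-separated key=value lists like the LABELS column of `--show-labels`.

```shell
# Return 0 if every key=value in the selector appears in the pod labels.
# $1 = selector, e.g. "app=myapp,tier=web"
# $2 = pod labels, e.g. "app=myapp,pod-template-hash=5d4f8"
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                  # this pair is present on the pod
      *) IFS=$old_ifs; return 1 ;;     # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

if selector_matches "app=myapp" "app=myapp,pod-template-hash=5d4f8"; then
  echo "match"
else
  echo "no match"
fi
```

A single typo such as `app=my-app` in the selector is enough to leave the Service with no endpoints, which is why this check comes right after looking at the endpoints list.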

Step 7: Check the Ingress

kubectl get ingress -n production
kubectl describe ingress myapp -n production

Check:

Does the host match DNS?
Does the backend service it points to actually exist?
Is the TLS cert still valid?

Read the ingress controller logs:

kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

Step 8: Check Resources

# CPU/memory per node (kubectl top requires metrics-server)
kubectl top nodes

# CPU/memory per pod
kubectl top pods -A --sort-by=memory

# show limits/requests
kubectl describe pod <pod> | grep -A 3 -E "Limits|Requests"

Common problems:

Node full → pod stays Pending
OOM → pod restarts
CPU throttling → slow responses
Step 9: Check Events Across the Whole Cluster

kubectl get events -A --sort-by='.lastTimestamp' | tail -30

This surfaces larger events that describing a single pod won't show, for example:

Node NotReady
PVC binding failure
Scheduler rejections
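
On a busy cluster the last 30 events can be mostly noise, so counting Warning events by reason gives a quicker overview. The sketch below assumes the input has TYPE and REASON as its first two columns, for example from `kubectl get events -A --no-headers -o custom-columns=TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name`. `warning_summary` is a hypothetical helper and the sample rows are invented.

```shell
# Count Warning events per reason, most frequent first.
# Expects lines of the form: TYPE REASON OBJECT
warning_summary() {
  awk '
    $1 == "Warning" { count[$2]++ }
    END { for (reason in count) print count[reason], reason }
  ' | sort -rn
}

# Illustrative sample instead of a live cluster.
printf '%s\n' \
  'Warning FailedScheduling myapp-abc123' \
  'Normal Scheduled web-1' \
  'Warning FailedScheduling myapp-def456' \
  'Warning BackOff postgres-xyz' | warning_summary
```

A reason that dominates the count, such as FailedScheduling, points at a cluster-level cause rather than a single broken pod.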

Step 10: Check the Node

kubectl get nodes
kubectl describe node <node-name>

Check:

Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
Allocatable vs Allocated resources

If DiskPressure is True, the node's disk is nearly full:

ssh <node>
df -h
# on containerd-based nodes, check /var/lib/containerd instead of /var/lib/docker
sudo du -sh /var/lib/docker /var/lib/kubelet
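
Once on the node, the `df` output can be narrowed to the filesystems that are actually close to full. This is a small sketch, assuming POSIX `df -P` output and mount points without spaces. `full_filesystems` is a hypothetical helper, and the 90% default is an assumption based on kubelet's usual eviction threshold of about 10% free on the node filesystem, so check your own kubelet config.

```shell
# Print mount points whose usage is at or above a percentage.
# $1 = threshold in percent (default 90); reads `df -P` output on stdin.
full_filesystems() {
  threshold=${1:-90}
  awk -v limit="$threshold" '
    NR > 1 {
      pct = $5
      sub(/%$/, "", pct)        # "91%" -> "91"
      if (pct + 0 >= limit + 0) print $6, $5
    }
  '
}

# Illustrative sample instead of a real node.
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted on' \
  '/dev/sda1 102400000 93184000 9216000 91% /' \
  '/dev/sdb1 51200000 10240000 40960000 20% /data' | full_filesystems 85
```

On a real node it would be run as `df -P | full_filesystems 85`, followed by `du` on whichever mount point it reports.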

Cheat Sheet: The 30-Second Check

kubectl get pods -A | grep -v Running | grep -v Completed
kubectl describe pod <bad-pod> | tail -30
kubectl logs <bad-pod> --previous --tail=50
kubectl get endpoints <svc-name>
kubectl get events --sort-by='.lastTimestamp' | tail -10

These 5 commands cover roughly 80% of problems.

Tools That Make You 10x Faster

# k9s: a TUI for Kubernetes
brew install k9s

# stern: multi-pod log tailing
brew install stern
stern -l app=myapp

# kubectx + kubens: switch context/namespace quickly
brew install kubectx
kubens production

# kubectl plugins (requires krew)
kubectl krew install neat tree resource-capacity

Real-World Case: "Application down"

The sequence of steps:

# 1. get the overview
kubectl get pods -n prod

# myapp-abc123  0/1  CrashLoopBackOff  5  3m

# 2. why is it crashing?
kubectl describe pod myapp-abc123 -n prod
# Events: Liveness probe failed: HTTP 500

# 3. read the app logs
kubectl logs myapp-abc123 -n prod --previous
# Error: connect ECONNREFUSED postgres:5432

# 4. check postgres
kubectl get svc -n prod | grep postgres
# postgres   ClusterIP  10.x.x.x  ...

kubectl get endpoints postgres -n prod
# (no endpoints)

# 5. check the postgres pod
kubectl get pods -n prod -l app=postgres
# postgres-xyz  0/1  CrashLoopBackOff

# 6. why is postgres down?
kubectl logs postgres-xyz -n prod --previous
# FATAL: data directory permission denied

# 7. check the PVC
kubectl describe pvc postgres-pvc -n prod
# Bound, but...

→ root cause: the PVC is bound but its permissions are wrong. Fix: chown the data directory in an initContainer.

Mindset While Debugging

1. Change one thing at a time

Don't change 5 things at once. Make a change → test → if it doesn't help, revert → try something else.

2. Keep the evidence

kubectl get pod <bad> -o yaml > /tmp/bad-pod.yaml
kubectl logs <bad> --previous > /tmp/bad-pod.log
kubectl describe pod <bad> > /tmp/bad-pod-describe.txt

Copy these before any restart/delete so you can keep debugging afterwards.
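
The three commands above are easy to forget under pressure, so wrapping them in one function can help. This is a sketch under stated assumptions: `collect_evidence` is a hypothetical helper, the output directory layout is arbitrary, and the `KUBECTL` variable exists only so the function can be dry-run with a stand-in command.

```shell
# Save the pod spec, describe output and logs of a bad pod into one directory.
# $1 = namespace, $2 = pod name, $3 = output directory (optional)
collect_evidence() {
  ns=$1
  pod=$2
  dir=${3:-/tmp/evidence-$pod}
  kubectl_cmd=${KUBECTL:-kubectl}

  mkdir -p "$dir" || return 1
  "$kubectl_cmd" get pod "$pod" -n "$ns" -o yaml > "$dir/pod.yaml"
  "$kubectl_cmd" describe pod "$pod" -n "$ns" > "$dir/describe.txt"
  # --previous fails if the container never restarted, so keep going.
  "$kubectl_cmd" logs "$pod" -n "$ns" --previous > "$dir/previous.log" 2>&1 || true
  "$kubectl_cmd" logs "$pod" -n "$ns" > "$dir/current.log" 2>&1 || true
  echo "$dir"
}

# Usage against a real cluster (not run here):
#   collect_evidence prod myapp-abc123
```

Printing the directory at the end makes it easy to paste the location into an incident note or chat message.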

3. Use Google well

Paste the exact error message into Google and add "kubernetes". Almost every error has been hit by someone before.

4. Ask the team

If 30 minutes of debugging gets you nowhere, ask someone nearby. There's nothing to be embarrassed about, and you'll get fresh ideas.

Top 10 Most Common Problems

CrashLoopBackOff → check kubectl logs --previous
ImagePullBackOff → wrong image name/tag, or a private repo without an imagePullSecret
Pending pod → not enough resources / nodeSelector doesn't match / PVC not bound yet
0/1 Ready → readiness probe failing
Service has no endpoints → selector and pod labels don't match
OOMKilled → memory limit too low
DNS won't resolve → CoreDNS problem
Cert expired → renew Let's Encrypt / kubeadm certs
Node NotReady → kubelet down / disk full
Slow API → resource limits / DB queries / external dependencies

Summary

Troubleshooting K8s is a skill that takes practice. Every time you finish a debugging session, write the case up in your own markdown file. The next time the same thing happens, it takes a minute to resolve.

Key principles:

Observe first, fix later
Work through the layers from top to bottom
5 core commands: get pods → describe → logs --previous → endpoints → events
Keep the evidence before restarting

For specific symptoms, read the dedicated debugging articles: