Error 500 in Kubernetes: Tracking Down the Cause Fast

When you run into a 500 in K8s

A customer reports that the site "doesn't work". You run curl and see:

$ curl -I https://myapp.com
HTTP/2 500

Or:

HTTP/2 502   ← Bad Gateway
HTTP/2 503   ← Service Unavailable
HTTP/2 504   ← Gateway Timeout

What each 5xx code tells you:

500: application error (an unhandled exception in the code)
502: the proxy received an invalid response from the upstream, or the connection was refused or reset
503: no healthy backend (all pods dead or not ready)
504: the proxy timed out waiting for the upstream
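
As a rough sketch of the mapping above, the helper below turns a status code into the first place to look. The function name first_check and the wording of each hint are illustrative assumptions, not part of kubectl or curl.

```shell
#!/bin/sh
# Map an HTTP status code to the layer worth checking first.
# first_check is a hypothetical helper; the hints paraphrase the list above.
first_check() {
  case "$1" in
    500) echo "app: read the pod logs for an exception" ;;
    502) echo "pod/port: upstream refused or sent a bad response" ;;
    503) echo "service: no ready endpoints behind the Service" ;;
    504) echo "timeout: upstream slower than the proxy timeout" ;;
    *)   echo "not a 5xx covered here" ;;
  esac
}

# Usage with a live site (curl prints only the status code):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://myapp.com)
# first_check "$code"
first_check 503
```

The hint is only a starting point: a 502, for example, can still turn out to be caused by an app crash that the pod logs explain.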

The path a request takes through K8s

User → Ingress Controller (Nginx/Traefik)
     → Service (cluster IP)
     → Pod (your app)
     → Database / external API

If you get a 500, the problem is in one of these 4 layers. Working from the outside in is the fastest way to find it.

Step 1: High-level check of Ingress and Pods

# list the resources involved
kubectl get ingress,svc,pod -n production

Check:

Does the pod count match the desired replicas?
Is every pod's status Running?
Are any pods in CrashLoopBackOff or Pending?

Output that points to a problem:

NAME                       READY   STATUS             RESTARTS   AGE
myapp-7d4f5c-abc12         0/1     CrashLoopBackOff   5          3m
myapp-7d4f5c-def34         1/1     Running            0          2d
myapp-7d4f5c-ghi56         0/1     CrashLoopBackOff   5          3m

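→ 2 of 3 pods are down. Requests that hit a crashing pod before it is removed from the Service endpoints can return 502; if every pod goes down, the Ingress returns 503.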
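
With many replicas it is easy to miss a bad row by eye. The filter below is a minimal sketch that reads `kubectl get pods` text on stdin and prints only the pods that look unhealthy. The function name unhealthy_pods is hypothetical, and it assumes the default column layout (NAME READY STATUS ...).

```shell
#!/bin/sh
# Print "NAME STATUS" for pods whose READY count is short
# or whose STATUS is anything other than Running.
# Assumes the default `kubectl get pods` columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")              # "0/1" -> ready[1]=0, ready[2]=1
    if (ready[1] != ready[2] || $3 != "Running") print $1, $3
  }'
}

# Usage against a live cluster:
# kubectl get pods -n production -l app=myapp | unhealthy_pods
```

Pods from finished Jobs (STATUS Completed) are also printed by this filter, so narrow the input with a label selector when that matters.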

Step 2: Inspect the Pod in detail

kubectl describe pod myapp-7d4f5c-abc12

Look at the Events: section at the bottom, which gives the reason:

Events:
  Type     Reason     Age    Message
  ----     ------     ----   -------
  Warning  BackOff    1m     Back-off restarting failed container
  Warning  Unhealthy  2m     Liveness probe failed: HTTP probe failed with statuscode: 500

→ the liveness probe keeps failing, so the kubelet restarts the container over and over

Step 3: Read the Pod logs

# current logs
kubectl logs myapp-7d4f5c-abc12

# logs from the crashed container (the previous run)
kubectl logs myapp-7d4f5c-abc12 --previous

# only the last N lines
kubectl logs --tail=100 myapp-7d4f5c-abc12

# logs from every pod labeled app=myapp
kubectl logs -l app=myapp --tail=50

# follow real-time
kubectl logs -f myapp-7d4f5c-abc12

Look for the error message in the app log. It usually states the root cause directly:

Error: ECONNREFUSED 10.0.0.5:5432
   at TCP.<anonymous> (...)

→ the app cannot connect to the database

Error: secret "JWT_SECRET" not found

→ the env var was not injected into the pod
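
When the log is long, a quick scan for known signatures can save scrolling. This is a small sketch, not a complete detector: the function name scan_logs and the pattern list are assumptions to adapt to your own stack.

```shell
#!/bin/sh
# Scan log text on stdin for a few common root-cause signatures.
# Prints matching lines with their line numbers, or a fallback message.
# The pattern list is illustrative; extend it for your runtime and drivers.
scan_logs() {
  grep -E -i -n 'ECONNREFUSED|ETIMEDOUT|ENOTFOUND|not found|out of memory' \
    || echo "no known signature found"
}

# Usage against a crashed container:
# kubectl logs myapp-7d4f5c-abc12 --previous --tail=200 | scan_logs
```

A "no known signature found" result only means none of these patterns matched, so the full log is still worth reading.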

Step 4: Test from inside the pod

If the logs are not conclusive, get a shell in the pod and test from there:

kubectl exec -it myapp-7d4f5c-def34 -- sh

Inside the pod:

# DB connectivity
nc -zv postgres-service 5432

# DNS
nslookup postgres-service

# call own health endpoint
curl http://localhost:3000/health

# external API
curl https://api.external.com/status

Use a debug pod if the production image has no tools:

kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash

Step 5: Check the Service and Endpoints

kubectl get svc myapp
kubectl get endpoints myapp

Look at the endpoints:

NAME    ENDPOINTS
myapp   10.244.1.5:3000,10.244.2.7:3000

If the endpoints are empty:

NAME    ENDPOINTS
myapp   <none>

→ the selector matches no pod, or none of the matching pods is Ready → 503

Check the selector:

kubectl get svc myapp -o yaml | grep -A 3 selector
kubectl get pod -l <selector-from-above>
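
To avoid copying the selector by hand, the sketch below converts the selector map into a `-l` query. It assumes your kubectl prints the jsonpath result for a map as JSON (recent versions do; older ones may print a Go-style map instead), and the function name selector_to_query is hypothetical.

```shell
#!/bin/sh
# Convert a JSON selector map into a kubectl label query.
#   {"app":"myapp","tier":"web"}  ->  app=myapp,tier=web
# Relies on label keys and values never containing ':' , '"' or braces.
selector_to_query() {
  sed -e 's/[{}"]//g' -e 's/:/=/g'
}

# Usage against a live cluster:
# sel=$(kubectl get svc myapp -o jsonpath='{.spec.selector}' | selector_to_query)
# kubectl get pods -l "$sel" --show-labels
```

If the second command lists no pods, the Service selector and the pod labels disagree, which is the empty-endpoints case described above.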

Step 6: Ingress Controller Log

# find the ingress controller pod
kubectl get pods -n ingress-nginx

# view the log (last 200 lines)
kubectl logs -n ingress-nginx ingress-nginx-controller-xxx --tail=200

Look for:

502: connect() failed (111: Connection refused), meaning the pod refused the connection
503: no live upstreams, meaning there is no healthy backend
504: upstream timed out, meaning the pod responded too slowly
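
To see which of these three signatures dominates, a tally is often enough. The following is a minimal sketch that counts them in controller log text read from stdin. The function name tally_upstream_errors is an assumption, and the match strings are the ones listed above.

```shell
#!/bin/sh
# Count the three upstream error signatures in ingress controller log text.
# Unset awk counters print as 0 under %d, so missing patterns are reported too.
tally_upstream_errors() {
  awk '
    /Connection refused/  { refused++ }
    /no live upstreams/   { nolive++ }
    /upstream timed out/  { timedout++ }
    END {
      printf "refused(502)=%d no_live(503)=%d timed_out(504)=%d\n",
             refused, nolive, timedout
    }'
}

# Usage (pod name is the placeholder used above):
# kubectl logs -n ingress-nginx ingress-nginx-controller-xxx --tail=200 | tally_upstream_errors
```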

Step 7: Resource limit + node pressure

If new pods cannot start:

kubectl describe pod myapp-pending-pod

You may see:

Warning  FailedScheduling  ... Insufficient memory

→ no node has enough free memory to schedule the pod

kubectl top nodes
kubectl top pods -A --sort-by=memory

Compare the resources in use against what is available.

Step 8: Check Node and Cluster health

kubectl get nodes
# check whether any node is NotReady

kubectl describe node <node-with-issue>

# check the critical system pods
kubectl get pods -n kube-system

Patterns for each 5xx code

500: internal app error

Almost always an exception in the code. Run kubectl logs and read the stack trace.

Examples:

Bad database query
Null pointer
File not found
Out of memory

Fix it in the code.

502: proxy got a bad response

Most often:

The pod died mid-request
The app listens on the wrong port (the Service says 3000 but the app listens on 8080)
The app sent a malformed response

# check which port the app actually listens on
kubectl exec -it <pod-name> -- ss -tlnp

503: no upstream available

No pod is ready
The HPA scaled down too far
Readiness probes fail on every pod

# inspect readiness
kubectl describe pod | grep -A 10 "Readiness"

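504: upstream timeout

The app is slower than the proxy timeout
A long-running DB query
A hung external API

Set the timeouts on the Ingress (values are in seconds; 60 matches the ingress-nginx default, so raise it if your requests legitimately take longer):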

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"

Quick reference flow

500/502/503/504
  ↓
kubectl get pods → all Running?
  ↓ no
kubectl describe pod → Events
  ↓ ok
kubectl logs --previous → app error?
  ↓ ok
kubectl get endpoints → has IP?
  ↓ no
check selector and labels
  ↓ ok
kubectl exec ← test connectivity from inside the pod
  ↓ ok
kubectl logs -n ingress-nginx → upstream error?

Tools that speed things up

k9s: a TUI for Kubernetes. Browse resources, logs, and exec all in one place.

brew install k9s    # or download the binary
k9s

stern: tail logs from multiple pods at once

stern -l app=myapp --tail=100

kubectl debug: built into kubectl (no plugin needed); attaches an ephemeral debug container to a pod whose image has no shell

kubectl debug -it pod/myapp --image=nicolaka/netshoot

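Prevention

Good liveness and readiness probes, kept separate and used correctly
- Liveness = "is it dead?" → on failure the container is restarted
- Readiness = "is it ready for traffic?" → on failure the pod is removed from the Service
Resource requests/limits: keep one pod from starving the node and make OOM kills predictable
PodDisruptionBudget: keeps voluntary disruptions such as node drains from taking down every pod at once (for rolling updates, set maxUnavailable on the Deployment)
Centralized logging (Loki) + alerts
A health endpoint that actually checks its dependencies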

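// GET /health: db, redis and checkApi are the app's own clients/helpers
async function health() {
  const checks = {
    database: await db.ping(),
    redis: await redis.ping(),
    external_api: await checkApi(),
  };
  // report "ok" only when every dependency check is truthy
  const ok = Object.values(checks).every(Boolean);
  return { status: ok ? "ok" : "degraded", checks };
}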
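
To make the liveness/readiness split concrete, the sketch below writes a probe fragment to merge under the container spec of the Deployment. The /livez path is a hypothetical cheap endpoint and the timing values are placeholders; /health and port 3000 follow the examples in this article. Pointing liveness at a cheap endpoint rather than at /health is a deliberate choice here: a database outage then takes the pod out of rotation without also triggering restarts.

```shell
#!/bin/sh
# Write a probe fragment for the container spec of the Deployment.
# /livez is a hypothetical endpoint; periods and thresholds are placeholders.
probe_file="${TMPDIR:-/tmp}/myapp-probes.yaml"
cat > "$probe_file" <<'EOF'
livenessProbe:            # failure -> kubelet restarts the container
  httpGet:
    path: /livez
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:           # failure -> pod is removed from Service endpoints
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5
  failureThreshold: 2
EOF
echo "wrote $probe_file"
```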

Summary

Work through it in order: pod status → describe → logs → endpoints → ingress log

For most 500s, kubectl logs (with --previous if the container restarted) points straight at the root cause.

Keep k9s and stern installed. Debugging with them is much faster than with plain kubectl.