ContainersNotReady in Kubernetes: How to Fix It for Good

Symptoms

$ kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
myapp-abc123      0/1     Running   0          5m
myapp-def456      0/1     Running   0          5m

STATUS is Running but READY is 0/1: the container has started but is not "ready".

Impact:

The Service removes the pod from its endpoints, so no traffic reaches it
Ingress returns 503 (no live upstream)
Rolling deploys stall: the new pod never becomes ready, and the old pod stays around

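Understand Pod Phase First

Kubernetes distinguishes two things:

STATUS: usually the pod phase (.status.phase): Pending, Running, Succeeded, Failed, Unknown. kubectl may show a container state reason such as CrashLoopBackOff here instead.
READY: the number of ready containers out of the total. A container is ready when its readiness probe passes; a container with no readiness probe counts as ready once it is running.

Running with READY 0/1 means the container has started but its readiness probe has not passed yet.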
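
To make the READY column concrete, here is a minimal sketch of how the "ready/total" value can be derived from container statuses. The ContainerStatus interface is a hypothetical, trimmed-down shape modelled on .status.containerStatuses[] in the Pod API, not the full type.

```typescript
// Hypothetical minimal shape modelled on .status.containerStatuses[] in the Pod API.
interface ContainerStatus {
  name: string;
  ready: boolean;
}

// Builds the "ready/total" string shown in the READY column of `kubectl get pods`.
function readyColumn(statuses: ContainerStatus[]): string {
  const readyCount = statuses.filter((s) => s.ready).length;
  return `${readyCount}/${statuses.length}`;
}

// A single container that is running but has not passed its readiness probe.
const podStatuses: ContainerStatus[] = [{ name: "app", ready: false }];
console.log(readyColumn(podStatuses)); // 0/1
```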

Step 1: describe pod

kubectl describe pod myapp-abc123

Read the Conditions: and Events: sections

Conditions:
  Type              Status
  Initialized       True
  Ready             False     ← the root of the problem
  ContainersReady   False
  PodScheduled      True

Events:
  Warning  Unhealthy  2m  Readiness probe failed: HTTP probe failed with statuscode: 500

The Events section is the most explicit: the readiness probe is getting a 500.

Step 2: Check the Probe Config

kubectl get pod myapp-abc123 -o yaml | grep -A 10 "readinessProbe\|livenessProbe\|startupProbe"

Example:

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
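
When reading a probe block, remember that any omitted timing field falls back to a Kubernetes default. The sketch below fills in those defaults and estimates a rough upper bound on how long an unhealthy app takes to be marked unready. The names ProbeTiming, effectiveTiming and worstCaseDetectionSeconds are hypothetical helpers for illustration, and the bound is an approximation rather than exact kubelet behaviour.

```typescript
// Subset of the probe timing fields; every one of them is optional in a manifest.
interface ProbeTiming {
  initialDelaySeconds?: number;
  periodSeconds?: number;
  timeoutSeconds?: number;
  successThreshold?: number;
  failureThreshold?: number;
}

// Kubernetes defaults applied when a field is left out.
const PROBE_DEFAULTS: Required<ProbeTiming> = {
  initialDelaySeconds: 0,
  periodSeconds: 10,
  timeoutSeconds: 1,
  successThreshold: 1,
  failureThreshold: 3,
};

// Returns the probe timing with defaults filled in.
function effectiveTiming(probe: ProbeTiming): Required<ProbeTiming> {
  return { ...PROBE_DEFAULTS, ...probe };
}

// Rough upper bound (an approximation): failureThreshold consecutive checks,
// one period apart, plus one timeout for the last check to give up.
function worstCaseDetectionSeconds(probe: ProbeTiming): number {
  const t = effectiveTiming(probe);
  return t.failureThreshold * t.periodSeconds + t.timeoutSeconds;
}

// The readinessProbe from the example above: 3 × 5s + 1s.
console.log(
  worstCaseDetectionSeconds({ initialDelaySeconds: 5, periodSeconds: 5, timeoutSeconds: 1, failureThreshold: 3 }),
); // 16
```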

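Step 3: Common Causes

A. The app takes longer to start than initialDelaySeconds

The app opens DB connections or loads a large model and needs 30 seconds, but the probe starts checking at second 5. Failed readiness checks in that window only keep the pod out of rotation until the app is up, but a liveness probe with the same timing would restart the container before it finishes booting.

Fix:

# use a separate startupProbe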
startupProbe:
  httpGet: { path: /health, port: 3000 }
  failureThreshold: 30      # 30 × 10s = up to 5 minutes to start
  periodSeconds: 10

readinessProbe:
  httpGet: { path: /ready, port: 3000 }
  periodSeconds: 5
  failureThreshold: 3

The startupProbe runs first; the readiness and liveness probes only begin once it has passed.
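
The arithmetic behind the startup budget is simple enough to sketch. The two helpers below (hypothetical names) compute how long a startupProbe tolerates a slow boot, and the smallest failureThreshold that covers a known worst-case boot time.

```typescript
// Longest boot time a startupProbe tolerates before the kubelet restarts the container.
function startupBudgetSeconds(
  failureThreshold: number,
  periodSeconds: number,
  initialDelaySeconds: number = 0,
): number {
  return initialDelaySeconds + failureThreshold * periodSeconds;
}

// Smallest failureThreshold that covers a given worst-case boot time.
function thresholdFor(bootSeconds: number, periodSeconds: number): number {
  return Math.ceil(bootSeconds / periodSeconds);
}

console.log(startupBudgetSeconds(30, 10)); // 300 (5 minutes, as in the config above)
console.log(thresholdFor(30, 10)); // 3 (enough for an app that boots in 30 seconds)
```

In practice it is worth leaving headroom above the computed threshold, since boot time varies with node load.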

B. Wrong path or port

Try curl from inside the pod (this assumes the image includes curl):

kubectl exec -it myapp-abc123 -- curl -v http://localhost:3000/ready

A wrong path returns 404; a wrong port gives "Connection refused".

Fix: check the app code to confirm which endpoint and port it actually serves

C. The /ready endpoint checks a dependency that is slow or down

// app/health.ts
app.get('/ready', async (req, res) => {
  await db.$queryRaw`SELECT 1`
  await redis.ping()
  res.json({ status: 'ok' })
})

If the DB takes longer than timeoutSeconds (default 1s) to respond, the probe fails.

Fix:

Raise the timeout, e.g. timeoutSeconds: 5
Or slim down the health check so it does not depend on external services
Or split liveness from readiness:
- Liveness = "is the process still responsive?" (basic check only)
- Readiness = "is it ready to take traffic?" (checks dependencies)
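
One way to keep a slow dependency from stalling the whole /ready handler is to give each check its own deadline, shorter than the probe's timeoutSeconds. The helper below is a minimal sketch; passesWithin is a hypothetical name, not part of any library.

```typescript
// Resolves to true if the check settles successfully within `ms` milliseconds,
// and to false if it rejects or is still pending when the deadline passes.
function passesWithin(check: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    check.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      () => {
        clearTimeout(timer);
        resolve(false);
      },
    );
  });
}
```

Inside the /ready handler this could wrap each dependency call, for example passesWithin(redis.ping(), 800), so the endpoint answers with a 503 before the kubelet's own timeout fires. Note that the underlying call is not cancelled; it is only ignored after the deadline.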

D. The container starts, then crashes silently

STATUS is Running, but PID 1 may be about to die

kubectl logs myapp-abc123 --tail=100
kubectl logs myapp-abc123 --previous   # if it has already restarted

Look for errors in the log

E. Wrong image entrypoint (the server never started)

kubectl exec -it myapp-abc123 -- ps aux

Check whether the expected process is actually running

kubectl exec -it myapp-abc123 -- ss -tlnp

Check whether any process is listening on port 3000. If none is, the app never really started.

F. Resource limit too low: OOM

kubectl describe pod myapp-abc123 | grep -iE "OOMKilled|exit code: +137"

If either shows up, raise the memory:

resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi    # raised from 128Mi
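
The same check can be done programmatically against the pod status, for example on the JSON from kubectl get pod -o json. The sketch below uses hypothetical trimmed-down interfaces modelled on .status.containerStatuses[].lastState.terminated.

```typescript
// Hypothetical minimal shapes modelled on .status.containerStatuses[] in the Pod API.
interface TerminatedState {
  reason?: string;
  exitCode: number;
}

interface RestartedContainer {
  name: string;
  restartCount: number;
  lastState?: { terminated?: TerminatedState };
}

// True when the previous run of the container looks like an out-of-memory kill.
// Exit code 137 is 128 + SIGKILL, which usually (but not always) means OOM.
function wasOomKilled(container: RestartedContainer): boolean {
  const terminated = container.lastState?.terminated;
  if (terminated === undefined) return false;
  return terminated.reason === "OOMKilled" || terminated.exitCode === 137;
}
```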

G. ConfigMap / Secret missing or wrong key

kubectl describe pod | grep -i "configmap\|secret"

You may see:

Warning  Failed   ...  Error: configmap "myapp-config" not found

→ Apply the ConfigMap first, or check that it is in the right namespace
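
To know which ConfigMaps and Secrets must exist before the pod can start, you can walk the pod spec and collect every reference. This is a sketch over a hypothetical subset of the pod spec (envFrom, env valueFrom, and volumes); other reference types such as projected volumes are not covered.

```typescript
// Hypothetical subset of the pod spec fields that can reference ConfigMaps and Secrets.
interface EnvFromSource {
  configMapRef?: { name: string };
  secretRef?: { name: string };
}

interface EnvVar {
  name: string;
  valueFrom?: {
    configMapKeyRef?: { name: string; key: string };
    secretKeyRef?: { name: string; key: string };
  };
}

interface PodVolume {
  name: string;
  configMap?: { name: string };
  secret?: { secretName: string };
}

interface PodSpecSubset {
  containers: { name: string; envFrom?: EnvFromSource[]; env?: EnvVar[] }[];
  volumes?: PodVolume[];
}

// Collects the names of every ConfigMap and Secret the pod spec refers to.
function referencedConfig(spec: PodSpecSubset): { configMaps: string[]; secrets: string[] } {
  const configMaps = new Set<string>();
  const secrets = new Set<string>();
  for (const container of spec.containers) {
    for (const source of container.envFrom ?? []) {
      if (source.configMapRef) configMaps.add(source.configMapRef.name);
      if (source.secretRef) secrets.add(source.secretRef.name);
    }
    for (const envVar of container.env ?? []) {
      if (envVar.valueFrom?.configMapKeyRef) configMaps.add(envVar.valueFrom.configMapKeyRef.name);
      if (envVar.valueFrom?.secretKeyRef) secrets.add(envVar.valueFrom.secretKeyRef.name);
    }
  }
  for (const volume of spec.volumes ?? []) {
    if (volume.configMap) configMaps.add(volume.configMap.name);
    if (volume.secret) secrets.add(volume.secret.secretName);
  }
  return { configMaps: Array.from(configMaps).sort(), secrets: Array.from(secrets).sort() };
}
```

Each returned name can then be checked with kubectl get configmap or kubectl get secret in the pod's namespace.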

H. The image pulls fine but the binary does not run on this architecture

The container starts but exits immediately:

kubectl logs myapp-abc123
# exec /app/server: exec format error

This happens when an image built for x86 runs on an ARM node (or vice versa). Build a multi-arch image:

docker buildx build --platform linux/amd64,linux/arm64 -t myapp:1.0 --push .

I. NetworkPolicy blocks traffic into the pod

Probes travel from the kubelet to the pod over the node network and are normally not affected by NetworkPolicy, but a very strict policy or a misbehaving CNI may still block them

kubectl get networkpolicy --all-namespaces

Try removing the policy temporarily (on a dev cluster) and see whether the probe recovers

Step 4: Test the Probe Directly

# port-forward to test from your machine
kubectl port-forward myapp-abc123 3000:3000

# in another terminal
curl -v http://localhost:3000/ready

If it returns 500, check the app log. If it times out, the app is hanging.
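
The outcomes of a manual request map fairly directly onto causes. The sketch below encodes that mapping; ProbeAttempt and diagnose are hypothetical names, and the success range follows the HTTP probe rule that any status from 200 up to (but not including) 400 counts as a pass.

```typescript
// Outcome of a manual request against the probe endpoint (hypothetical shape).
type ProbeAttempt =
  | { kind: "response"; statusCode: number }
  | { kind: "refused" }
  | { kind: "timeout" };

// Maps the outcome of a manual probe request to the most likely next step.
function diagnose(attempt: ProbeAttempt): string {
  switch (attempt.kind) {
    case "refused":
      return "nothing is listening: check the port and whether the server started";
    case "timeout":
      return "app is hanging: check slow dependencies or raise timeoutSeconds";
    case "response":
      // HTTP probes pass on any status code in the range 200-399.
      if (attempt.statusCode >= 200 && attempt.statusCode < 400) return "probe would pass";
      if (attempt.statusCode === 404) return "wrong path: compare with the app's routes";
      return "app error: read the app log";
  }
}

console.log(diagnose({ kind: "response", statusCode: 500 })); // app error: read the app log
```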

Step 5: Check Events Across the Namespace

kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

This may surface events that describe pod does not show, such as node disk pressure or scheduler issues

A Good Probe Config Example

spec:
  containers:
    - name: app
      image: myapp:1.0
      ports:
        - containerPort: 3000

      # Phase 1: wait for the app to finish starting (up to 5 minutes allowed)
      startupProbe:
        httpGet:
          path: /health
          port: 3000
        failureThreshold: 30
        periodSeconds: 10

      # Phase 2: check that the process is still alive
      livenessProbe:
        httpGet:
          path: /health      # lightweight endpoint, no dependency checks
          port: 3000
        periodSeconds: 30
        timeoutSeconds: 3
        failureThreshold: 3

      # Phase 3: check that it is ready to take traffic
      readinessProbe:
        httpGet:
          path: /ready       # endpoint that checks DB, cache
          port: 3000
        periodSeconds: 5
        timeoutSeconds: 5
        failureThreshold: 3

Keep the two endpoints clearly separate:

/health → 200 if the process still responds (used for liveness)
/ready → 200 if it is truly ready to take traffic (used for readiness)

App Side: Write Good Health Endpoints

// /health: minimal, no external dependencies
app.get('/health', (req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() })
})

// /ready: checks every dependency
app.get('/ready', async (req, res) => {
  const checks = await Promise.allSettled([
    db.$queryRaw`SELECT 1`,
    redis.ping(),
  ])

  const healthy = checks.every((c) => c.status === 'fulfilled')

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ready' : 'not ready',
    checks: {
      database: checks[0].status === 'fulfilled',
      cache: checks[1].status === 'fulfilled',
    },
  })
})

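Benefit: the response body names the dependency that failed, so a manual curl of /ready shows the cause right away. Log the failure inside the handler if you also want it in the app's own log.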
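
A small helper can turn the Promise.allSettled results into both the response body and a line for the app log. This is a sketch only; Settled, ReadinessReport and summarize are hypothetical names, and Settled is a local type that is structurally compatible with the built-in PromiseSettledResult.

```typescript
// Local stand-in for PromiseSettledResult so the sketch has no lib requirements.
type Settled =
  | { status: "fulfilled"; value: unknown }
  | { status: "rejected"; reason: unknown };

interface ReadinessReport {
  healthy: boolean;
  checks: Record<string, boolean>;
  failures: string[];
}

// Pairs each settled result with its name (same order) and collects failure reasons.
function summarize(names: string[], results: Settled[]): ReadinessReport {
  const checks: Record<string, boolean> = {};
  const failures: string[] = [];
  results.forEach((result, index) => {
    const label = names[index] ?? `check-${index}`;
    checks[label] = result.status === "fulfilled";
    if (result.status === "rejected") {
      failures.push(`${label}: ${String(result.reason)}`);
    }
  });
  return { healthy: failures.length === 0, checks, failures };
}
```

In the /ready handler, the report's checks field can go into the JSON body, while report.failures can be passed to console.warn (or your logger) whenever report.healthy is false.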

Quick Debug Flow

READY 0/1
  ↓
kubectl describe pod → Events
  ↓
Look at the error message in Events
  │
  ├─ "Liveness/Readiness probe failed"
  │     ↓
  │   check the status code/error
  │     ↓
  │   curl the pod directly to test
  │
  ├─ "container exited"
  │     ↓
  │   kubectl logs --previous
  │
  ├─ "OOMKilled"
  │     ↓
  │   raise the memory limit
  │
  └─ "no endpoints"
        ↓
      check that the selector matches the pod labels
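
The flow above can also be written as a small lookup, which is handy if you want to annotate events in a script or a chat bot. This is a sketch; nextStep is a hypothetical helper and the matched phrases are the ones from the diagram, not an exhaustive list of real event messages.

```typescript
// Maps an event message to the next debugging step from the flow above.
function nextStep(eventMessage: string): string {
  if (/probe failed/i.test(eventMessage)) {
    return "check the status code, then curl the pod directly";
  }
  if (/container exited/i.test(eventMessage)) {
    return "kubectl logs --previous";
  }
  if (/OOMKilled/i.test(eventMessage)) {
    return "raise the memory limit";
  }
  if (/no endpoints/i.test(eventMessage)) {
    return "check that the Service selector matches the pod labels";
  }
  // Nothing matched: fall back to the namespace-wide event view.
  return "look at events for the whole namespace";
}

console.log(nextStep("Readiness probe failed: HTTP probe failed with statuscode: 500"));
// check the status code, then curl the pod directly
```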

Tools That Speed Things Up

# watch all events in the namespace in real time
kubectl get events -w -n production

# stern: tail logs from multiple pods at once
stern -l app=myapp

# k9s: a TUI for Kubernetes
k9s

# kubectl-tree (plugin): show the resource graph
kubectl tree deployment myapp

Prevention in CI

Add a check in CI before merge that fails when a manifest contains a container without a readinessProbe:

# kube-score flags containers without probes (kubeconform only validates manifests against the schema)
kube-score score deployment.yaml

Output:

[CRITICAL] Container Probes
   Container is missing a readinessProbe
   A readinessProbe should be defined.
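
If you would rather not add another tool, the same rule is easy to express in a few lines over an already-parsed manifest. This is a minimal sketch, not how kube-score works internally; the WorkloadManifest shape is a hypothetical subset of a Deployment-like resource, and parsing the YAML is left to whatever loader your pipeline already uses.

```typescript
// Hypothetical subset of a Deployment-like manifest, already parsed from YAML.
interface ContainerSpec {
  name: string;
  readinessProbe?: object;
}

interface WorkloadManifest {
  kind: string;
  metadata: { name: string };
  spec: { template: { spec: { containers: ContainerSpec[] } } };
}

// Returns one message per container that lacks a readinessProbe.
function containersMissingReadiness(manifest: WorkloadManifest): string[] {
  return manifest.spec.template.spec.containers
    .filter((container) => container.readinessProbe === undefined)
    .map(
      (container) =>
        `${manifest.kind}/${manifest.metadata.name}: container "${container.name}" has no readinessProbe`,
    );
}
```

A CI step can exit non-zero whenever the returned array is non-empty.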

TL;DR Checklist

✅ kubectl describe pod: read the Events
✅ kubectl logs plus --previous: look for errors
✅ Test the probe directly via kubectl exec or port-forward
✅ Check resource limits (memory)
✅ Verify the ConfigMaps / Secrets the pod uses
✅ Check that the image architecture matches the node
✅ Separate /health (lightweight) from /ready (checks dependencies)
✅ Use a startupProbe for apps that start slowly

Summary

ContainersNotReady = the readiness probe is not passing

About 90% of cases come down to:

a wrong probe path or port
the app starting more slowly than initialDelaySeconds allows
a dependency (DB/cache) that is slow or down

Designing probes well from the start lets the team debug quickly when something goes wrong

startupProbe → waits for the app to boot
livenessProbe → checks only the process
readinessProbe → checks dependencies

kubectl describe plus kubectl logs --previous resolves roughly 95% of cases

Read next: Error 500 in Kubernetes, a debug flow covering every 5xx