K8s Pod Probe Health Check Strategy – 테오의 저장소

What Is a Pod Probe?

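Kubernetes monitors container health with three kinds of Pod probes. Without probes, the kubelet treats a container as healthy as long as its process is running, so traffic keeps flowing to a Pod that is actually deadlocked or cannot reach its database. With probes configured correctly, Kubernetes can pull a failing Pod out of rotation or restart it automatically.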

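Probe type	Role	Action on failure
Startup Probe	Confirms app initialization has finished	Container restart
Liveness Probe	Confirms the app is still alive	Container restart
Readiness Probe	Confirms the Pod can accept traffic	Removed from Service endpoints (no restart)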

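Startup Probe: Protecting Slow Initialization

Spring Boot and other JVM-based apps can take 30 seconds or more to start. Without a Startup Probe, the Liveness Probe can fail while the app is still initializing, and the kubelet then keeps restarting the container before it ever finishes starting.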

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-api
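spec:
  selector:
    matchLabels:
      app: spring-api       # apps/v1 requires a selector; this label is illustrative
  template:
    metadata:
      labels:
        app: spring-api
    spec: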
      containers:
        - name: app
          image: spring-api:latest
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30    # 5 s × 30 attempts = up to 150 s of retries after the initial delay
            timeoutSeconds: 3

Until the Startup Probe succeeds, the Liveness and Readiness Probes are disabled. Make sure failureThreshold × periodSeconds is comfortably larger than the app's worst-case startup time.
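
To make that arithmetic concrete, here is a minimal sketch in plain Java that estimates the startup budget for the manifest above and compares it with a measured startup time. The class and method names (StartupBudget, maxStartupSeconds, coversStartup) are hypothetical, and the 90 s startup and 30 s margin are made-up sample values. The formula is an approximation of how long the kubelet waits, not an exact model of its scheduling.

```java
// Hypothetical helper for illustration; not part of Kubernetes or Spring.
final class StartupBudget {
    private StartupBudget() {}

    /** Rough number of seconds before the kubelet gives up on startup and restarts the container. */
    static int maxStartupSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    /** True when the budget covers the slowest observed startup plus a safety margin. */
    static boolean coversStartup(int budgetSeconds, int slowestStartupSeconds, int marginSeconds) {
        return budgetSeconds >= slowestStartupSeconds + marginSeconds;
    }

    public static void main(String[] args) {
        // initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 30 (from the manifest above)
        int budget = maxStartupSeconds(5, 5, 30);
        System.out.println("startup budget: " + budget + " s");
        // 90 s startup and 30 s margin are sample numbers, not measurements
        System.out.println("covers 90 s startup with 30 s margin: " + coversStartup(budget, 90, 30));
    }
}
```

If the check comes back false for your own measurements, raising failureThreshold is usually simpler than raising periodSeconds, because a short period still lets a fast start be detected quickly.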

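Liveness Probe: Detecting Deadlocks

The Liveness Probe detects an app that is still running but can no longer handle requests (a deadlock, an infinite loop, or a memory leak that has left it unresponsive):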

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
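  periodSeconds: 10          # check every 10 seconds
  failureThreshold: 3        # restart after 3 consecutive failures
  timeoutSeconds: 5          # no response within 5 seconds counts as a failure
  successThreshold: 1        # one success marks it healthy again (must be 1 for liveness)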

Important: do not put external dependencies (DB, Redis) in the Liveness Probe. If the database goes down briefly, every Pod restarts at once and the outage cascades.

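// ✅ Correct liveness: checks only the app process itself
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {

    @GetMapping("/health/live")
    public ResponseEntity<String> liveness() {
        // Only proves the app can answer an HTTP request
        return ResponseEntity.ok("OK");
    }
}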

// ❌ Wrong liveness: includes an external dependency
@GetMapping("/health/live")
public ResponseEntity<String> badLiveness() {
    // A failed DB connection throws here, the probe fails, and the Pod restarts: cascading failure
    jdbcTemplate.queryForObject("SELECT 1", Integer.class);
    return ResponseEntity.ok("OK");
}

Readiness Probe: Traffic Control

The Readiness Probe decides whether a Pod is ready to receive traffic. When it fails, the Pod is removed from the Service endpoints and stops receiving traffic, but it is not restarted:

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 3
  successThreshold: 1

The Readiness Probe should include external dependencies. A Pod that cannot reach its DB or Redis cannot serve requests, so it should not receive traffic:

# Spring Boot Actuator: health indicators for DB, Redis, etc. are auto-configured
# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true       # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        liveness:
          include: livenessState      # app process only
        readiness:
          include:
            - readinessState
            - db                      # DataSource connection
            - redis                   # Redis connection
            - diskSpace               # disk space
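
To show the idea behind a readiness group, the sketch below aggregates several dependency checks into one HTTP status: 200 only when every check passes, 503 otherwise. The kubelet treats any status from 200 to 399 as success, so a 503 takes the Pod out of the Service endpoints. This is plain Java for illustration, not how Actuator is implemented; ReadinessAggregator, register, and httpStatus are hypothetical names, and the lambdas stand in for real DB and Redis checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; Spring Boot Actuator performs this aggregation for you.
final class ReadinessAggregator {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    /** Registers a named dependency check and returns this aggregator for chaining. */
    ReadinessAggregator register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** 200 when every check passes, 503 as soon as one fails. */
    int httpStatus() {
        for (BooleanSupplier check : checks.values()) {
            if (!passes(check)) {
                return 503;
            }
        }
        return 200;
    }

    /** A check that throws (for example, connection refused) counts as a failure. */
    private static boolean passes(BooleanSupplier check) {
        try {
            return check.getAsBoolean();
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in checks: "db" succeeds, "redis" simulates a refused connection
        ReadinessAggregator readiness = new ReadinessAggregator()
                .register("db", () -> true)
                .register("redis", () -> { throw new IllegalStateException("connection refused"); });
        System.out.println("readiness status: " + readiness.httpStatus());
    }
}
```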

NestJS Terminus Integration

With NestJS Terminus, the probe endpoints look like this:

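import { Controller, Get } from '@nestjs/common';
import { Transport } from '@nestjs/microservices';
import {
  HealthCheck,
  HealthCheckService,
  MicroserviceHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';

@Controller('health')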
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: TypeOrmHealthIndicator,
    private redis: MicroserviceHealthIndicator,
  ) {}

  // Liveness: the app itself only
  @Get('live')
  @HealthCheck()
  liveness() {
    return this.health.check([]);
  }

  // Readiness: includes external dependencies
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database'),
      () => this.redis.pingCheck('redis', {
        transport: Transport.REDIS,
        options: { host: 'redis', port: 6379 },
      }),
    ]);
  }
}

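Four Probe Check Mechanisms

Mechanism	Field	Best suited for
HTTP GET	`httpGet`	Web apps and REST APIs (the most common choice)
TCP Socket	`tcpSocket`	Services without HTTP, such as a DB or Redis
Exec	`exec.command`	Cases that need a custom script
gRPC	`grpc`	gRPC services (K8s 1.24+)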

# TCP Socket: for a DB container
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3

# Exec: custom script
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "pg_isready -h localhost -p 5432 -U postgres"
  periodSeconds: 10
  failureThreshold: 3

# gRPC Health Check (K8s 1.24+)
livenessProbe:
  grpc:
    port: 50051
    service: "my.service.Health"
  periodSeconds: 10
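
As a rough picture of what a tcpSocket check amounts to, the sketch below tries to open a TCP connection within a timeout and reports success if the connection is established. It is an illustration in plain Java, not the kubelet's code; TcpCheck and canConnect are hypothetical names. It also shows the limit of this mechanism: an open port only proves that something is listening, not that the service can answer queries, which is why the Exec example above uses pg_isready.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration of a TCP-level check; not the kubelet's implementation.
final class TcpCheck {
    private TcpCheck() {}

    /** Returns true if a TCP connection to host:port is established within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, unreachable host, or timeout all count as a failed check
            return false;
        }
    }

    public static void main(String[] args) {
        // Port 5432 matches the tcpSocket example above; the result depends on what runs locally
        System.out.println("port 5432 reachable: " + canConnect("127.0.0.1", 5432, 1000));
    }
}
```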

A Complete Production Probe Configuration

A recommended configuration for Spring Boot on K8s in production:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
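      maxUnavailable: 0       # zero-downtime rollout: never drop below the desired replica count
  selector:
    matchLabels:
      app: order-service      # apps/v1 requires a selector; this label is illustrative
  template:
    metadata:
      labels:
        app: order-service
    spec: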
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: order-service:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

          # 1. Startup: wait for initialization (10 s initial delay + 24 × 5 s = up to about 130 s)
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 24
            timeoutSeconds: 3

          # 2. Liveness: process is alive (no external dependencies)
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 15
            failureThreshold: 3
            timeoutSeconds: 5

          # 3. Readiness: can accept traffic (includes DB and Redis)
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
            timeoutSeconds: 3
            successThreshold: 1

Linking Graceful Shutdown with Probes

To avoid dropping traffic during a Rolling Update, probes and graceful shutdown have to work together:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]
            # wait 10 s so the iptables updates have time to propagate

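Shutdown sequence:

Step 1: The Pod moves to the Terminating state
Step 2: The Pod is removed from the Service endpoints (new traffic stops); this happens asynchronously, in parallel with step 3
Step 3: The preStop hook runs (sleep 10 s while the iptables change propagates)
Step 4: SIGTERM is sent and the app finishes its in-flight requests
Step 5: If the container is still running when terminationGracePeriodSeconds expires (counted from step 1, so it includes the preStop sleep), SIGKILL is sent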

# Spring Boot graceful shutdown
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
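
Because the grace period also covers the preStop hook, the numbers above have to fit together: 10 s of preStop sleep plus up to 30 s of Spring Boot drain time must stay under the 60 s terminationGracePeriodSeconds. A minimal sketch of that check, with hypothetical names (ShutdownBudget, slackSeconds):

```java
// Hypothetical helper for illustration; not part of Kubernetes or Spring.
final class ShutdownBudget {
    private ShutdownBudget() {}

    /**
     * Seconds left between the end of the app's drain window and SIGKILL.
     * A negative value means in-flight requests may be cut off.
     */
    static int slackSeconds(int terminationGracePeriodSeconds, int preStopSleepSeconds, int drainTimeoutSeconds) {
        return terminationGracePeriodSeconds - (preStopSleepSeconds + drainTimeoutSeconds);
    }

    public static void main(String[] args) {
        // 60 s grace period, 10 s preStop sleep, 30 s timeout-per-shutdown-phase (values from this section)
        System.out.println("slack before SIGKILL: " + slackSeconds(60, 10, 30) + " s");
    }
}
```

Keeping a few seconds of slack is a reasonable habit, since SIGTERM delivery and process exit are not instantaneous.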

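Common Mistakes and Fixes

Mistake	Symptom	Fix
DB check in the Liveness Probe	Every Pod restarts during a DB outage	Liveness checks the app only; put the DB in Readiness
No Startup Probe	Slow app ends up in CrashLoopBackOff	Add a Startup Probe
timeoutSeconds too short	Probe fails during a GC pause	Set it to 3–5 seconds
Deploying without Readiness	Traffic reaches Pods that are still starting	Always configure Readiness
No preStop hook	502 errors during Rolling Update	preStop sleep of 5–10 seconds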

# Probe debugging commands
kubectl describe pod order-service-xxx | grep -A 20 "Conditions"
kubectl get events --field-selector involvedObject.name=order-service-xxx
kubectl logs order-service-xxx --previous  # logs from before the restart

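Summary

Pod probes are a basic part of running Kubernetes, but a misconfigured probe makes an outage worse. The core rules: keep Liveness light (the app itself only), make Readiness heavy (include dependencies), and give Startup plenty of room. A PodDisruptionBudget (PDB) complements this by keeping enough Pods available during voluntary disruptions such as node drains, while the probes and preStop hook keep Rolling Updates free of downtime.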