Kubernetes Probes: Liveness – 테오의 저장소

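What Are Kubernetes Probes? The Core of Container Health Checks

In Kubernetes, a Pod in the Running state is not necessarily able to serve traffic. The application may be deadlocked and unresponsive, the DB connection pool may be exhausted so requests cannot be handled, or the JVM may still be warming up and not yet ready. Probes are the mechanism that detects these situations automatically and responds to them.

Kubernetes provides three probes: Liveness, Readiness, and Startup. Each serves a different purpose, and a misconfigured probe can make an outage worse rather than better. This post covers how the three probes work and how they differ, how to choose among the HTTP/TCP/gRPC/Exec handlers, strategies for tuning parameters, and patterns for stable operation combined with Resource Management.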

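Comparing the Three Probes: Liveness vs Readiness vs Startup

Probe	Question	Action on failure	Main purpose
Liveness	Is it alive?	Restart the container	Detect deadlocks and infinite loops
Readiness	Is it ready to receive traffic?	Remove from Service endpoints (traffic blocked)	Waiting for initialization, dependency failures
Startup	Has startup completed?	Restart the container (once failureThreshold is reached)	Protect slow-starting apps (JVM, etc.)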

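Probe Execution Timeline

Container starts
│
├─ Startup Probe begins (if configured)
│   ├─ Liveness/Readiness stay disabled until it succeeds
│   ├─ Success → Startup Probe stops, Liveness/Readiness begin
│   └─ failureThreshold reached → container restart
│
├─ Liveness Probe begins
│   ├─ Success → healthy
│   └─ failureThreshold reached → container restart (subject to restartPolicy)
│
└─ Readiness Probe begins
    ├─ Success → added to Service endpoints (receives traffic)
    └─ Failure → removed from Service endpoints (traffic blocked)

Key difference: a Liveness failure kills the container and restarts it. A Readiness failure leaves the container running and only stops traffic from reaching it. Confusing the two leads to serious production incidents.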
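To make the difference concrete, here is a minimal sketch of how consecutive probe results turn into actions. It is a simplified model, not the kubelet's actual implementation: the function name `simulateProbe` and its return shape are made up for illustration, and it assumes that a success resets the failure counter and that counters start fresh after a restart.

```typescript
// Simplified model of probe bookkeeping (illustrative only, not kubelet code).
type ProbeKind = "liveness" | "readiness";

interface ProbeOutcome {
  restarts: number;  // container restarts triggered (liveness only)
  ready: boolean[];  // readiness state after each probe (readiness only)
}

function simulateProbe(
  kind: ProbeKind,
  results: boolean[], // true = the probe succeeded
  failureThreshold = 3,
  successThreshold = 1,
): ProbeOutcome {
  let failures = 0;
  let successes = 0;
  let ready = false;
  let restarts = 0;
  const readyLog: boolean[] = [];

  for (const ok of results) {
    // Only consecutive results count: a success clears the failure streak.
    if (ok) {
      successes += 1;
      failures = 0;
    } else {
      failures += 1;
      successes = 0;
    }

    if (kind === "liveness" && failures >= failureThreshold) {
      restarts += 1; // the container is killed and restarted
      failures = 0;  // assumption: a fresh container starts with clean counters
    }

    if (kind === "readiness") {
      if (successes >= successThreshold) ready = true;
      if (failures >= failureThreshold) ready = false; // traffic is cut, container keeps running
      readyLog.push(ready);
    }
  }
  return { restarts, ready: readyLog };
}

const sequence = [true, false, false, false, true];
const liveness = simulateProbe("liveness", sequence);
const readiness = simulateProbe("readiness", sequence);
// liveness.restarts is 1; readiness.restarts is 0 and
// readiness.ready is [true, true, true, false, true]
```

The same failing sequence costs a restart under Liveness but only a temporary removal from the endpoints under Readiness.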

Four Probe Handlers: HTTP, TCP, gRPC, Exec

1. HTTP GET (most common)

livenessProbe:
  httpGet:
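    path: /healthz         # health check path
    port: 8080
    httpHeaders:            # custom headers (optional)
    - name: X-Custom-Header
      value: probe
  initialDelaySeconds: 10   # wait before the first check
  periodSeconds: 15         # interval between checks
  timeoutSeconds: 3         # how long to wait for a response
  successThreshold: 1       # consecutive successes needed to count as healthy
  failureThreshold: 3       # consecutive failures tolerated

# HTTP 200-399 → success
# Anything else (400, 500, etc.) → failure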

2. TCP Socket

# Only checks whether the port accepts connections (DB, Redis, etc.)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3

3. gRPC (Kubernetes 1.27+ GA)

# Uses the gRPC Health Checking Protocol
livenessProbe:
  grpc:
    port: 50051
    service: "myapp"      # service name (optional)
  periodSeconds: 10

4. Exec (run a command)

# Command exit code 0 → success, anything else → failure
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "pg_isready -U postgres"
  periodSeconds: 10

# Or check whether a file exists
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy

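Choosing a Handler

Handler	Good fit	Caveat
HTTP GET	Web services, REST APIs	The endpoint must be lightweight
TCP Socket	DBs, caches, message queues	Open port ≠ healthy service
gRPC	gRPC services	Requires implementing the Health Protocol
Exec	Custom checks, legacy apps	Process fork overhead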

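Parameter Details and Tuning

Parameter	Default	Description	Tuning guideline
`initialDelaySeconds`	0	Delay before the first probe	Based on app startup time (0 when using a Startup Probe)
`periodSeconds`	10	Interval between checks	Liveness: 10-30s / Readiness: 5-10s
`timeoutSeconds`	1	How long to wait for a response	Endpoint response time plus headroom
`successThreshold`	1	Minimum consecutive successes to count as success	Liveness/Startup: must be 1 / Readiness: 1-3
`failureThreshold`	3	Consecutive failures tolerated	Liveness: 3-5 / Startup: startup time / period

Calculating Liveness detection time:

# Approximate upper bound from failure to restart
# (initialDelaySeconds only matters right after the container starts)
maxDetectionTime = initialDelaySeconds + (periodSeconds × failureThreshold)

# e.g. initialDelay=0, period=10, failure=3
# → restart after at most about 30 seconds

# e.g. initialDelay=0, period=15, failure=5
# → restart after at most about 75 seconds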
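The arithmetic above can be wrapped in a small helper so it can be checked against real settings. This is only a sketch of the article's formula: the name `detectionWindowSeconds` and the `includeInitialDelay` flag are illustrative, the defaults are the ones listed in the parameter table, and the result is an approximation that ignores probe timeouts and scheduling jitter.

```typescript
// Probe timing fields as they appear in a Pod spec (all optional).
interface ProbeTiming {
  initialDelaySeconds?: number;
  periodSeconds?: number;
  failureThreshold?: number;
}

// Approximate worst-case seconds from the start of a failure to the restart.
function detectionWindowSeconds(
  probe: ProbeTiming,
  includeInitialDelay = false, // true when reasoning about the first checks after start
): number {
  const period = probe.periodSeconds ?? 10;      // default from the table above
  const failures = probe.failureThreshold ?? 3;  // default from the table above
  const initial = includeInitialDelay ? (probe.initialDelaySeconds ?? 0) : 0;
  return initial + period * failures;
}

const steady = detectionWindowSeconds({ periodSeconds: 10, failureThreshold: 3 }); // 30
const relaxed = detectionWindowSeconds({ periodSeconds: 15, failureThreshold: 5 }); // 75
```

A longer window means fewer false restarts but a slower reaction to a real deadlock, so the two example values bracket a typical trade-off.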

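Startup Probe: Protecting Slow-Starting Applications

Spring Boot and other JVM-based apps can take 30 seconds to 2 minutes to start. If you rely on Liveness alone without a Startup Probe, Liveness can fail before startup has finished, trapping the container in an endless restart loop:

# ❌ Misconfigured: an app that takes 60 seconds to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # first check after 10 seconds
  periodSeconds: 10
  failureThreshold: 3
  # Restarts within 10 + (10 × 3) = 40 seconds → the app needs 60, so it restarts forever!

# ✅ Protect startup time with a Startup Probe
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30       # allows up to 5 × 30 = 150 seconds for startup
  # Liveness/Readiness stay disabled until Startup succeeds

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
  # Begins only after Startup succeeds → no initialDelaySeconds needed!

Key point: with a Startup Probe in place, initialDelaySeconds can stay at 0. The Startup Probe holds Liveness back until startup has completed, so Liveness only has to detect failures during normal operation.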
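Going the other way, from a measured startup time to a failureThreshold, is a one-line calculation. The sketch below assumes you know a worst-case startup time and want some headroom on top; the function names and the default safety factor of 1.5 are illustrative choices, not Kubernetes recommendations.

```typescript
// Total time a Startup Probe allows before the container is restarted.
function startupBudgetSeconds(periodSeconds: number, failureThreshold: number): number {
  return periodSeconds * failureThreshold;
}

// Smallest failureThreshold whose budget covers the worst-case startup time
// multiplied by a safety factor (assumed default: 1.5).
function startupFailureThreshold(
  worstCaseStartupSeconds: number,
  periodSeconds: number,
  safetyFactor = 1.5,
): number {
  if (periodSeconds <= 0) {
    throw new Error("periodSeconds must be positive");
  }
  return Math.ceil((worstCaseStartupSeconds * safetyFactor) / periodSeconds);
}

const budget = startupBudgetSeconds(5, 30);         // 150 seconds, as in the example above
const threshold = startupFailureThreshold(100, 5);  // 30 probes for a 100-second worst case
```

Because the Startup Probe stops as soon as it succeeds, a generous threshold costs nothing for instances that start quickly.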

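Practical Pattern 1: Designing Health Endpoints for NestJS/Express

// NestJS: using @nestjs/terminus
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  MikroOrmHealthIndicator,
} from '@nestjs/terminus';
// Assumption: RedisHealthIndicator is a custom indicator defined in this project
// (it is not exported by @nestjs/terminus); the import path is hypothetical.
import { RedisHealthIndicator } from './redis.health';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: MikroOrmHealthIndicator,
    private redis: RedisHealthIndicator,
  ) {}

  // Liveness: only confirms the process is alive (keep it light!)
  @Get('live')
  @HealthCheck()
  liveness() {
    return this.health.check([]);  // returns 200 without running any checks
  }

  // Readiness: also checks dependencies (ready to receive traffic)
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', { timeout: 1500 }),
      () => this.redis.pingCheck('redis', { timeout: 1000 }),
    ]);
  }

  // Startup: whether initialization has completed
  @Get('startup')
  @HealthCheck()
  startup() {
    return this.health.check([
      () => this.db.pingCheck('database', { timeout: 3000 }),
    ]);
  }
}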

# Pod spec
containers:
- name: app
  image: myapp:latest
  ports:
  - containerPort: 3000
  startupProbe:
    httpGet:
      path: /health/startup
      port: 3000
    periodSeconds: 5
    failureThreshold: 24     # wait up to 120 seconds for startup
  livenessProbe:
    httpGet:
      path: /health/live
      port: 3000
    periodSeconds: 15
    failureThreshold: 3
    timeoutSeconds: 3
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 3000
    periodSeconds: 5
    failureThreshold: 3
    timeoutSeconds: 3

Practical Pattern 2: Spring Boot Actuator Integration

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      probes:
        enabled: true         # enables /actuator/health/liveness and /readiness
      group:
        liveness:
          include: livenessState
        readiness:
          include: readinessState,db,redis
  health:
    db:
      enabled: true
    redis:
      enabled: true

# K8s probe configuration
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
  failureThreshold: 36     # up to 180 seconds (JVM + Spring initialization)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

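Practical Pattern 3: Graceful Shutdown and Probes

# Flow when a Pod is terminated:
# 1. Pod is marked Terminating → removal from Service endpoints begins (asynchronously)
# 2. preStop hook runs, then the container receives SIGTERM
# 3. In-flight requests finish (within terminationGracePeriodSeconds)
# 4. SIGKILL (forced kill) once the grace period expires

spec:
  terminationGracePeriodSeconds: 60  # default is 30 seconds
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]
          # buys time for the endpoint removal to propagate
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      periodSeconds: 5

Why combining preStop with Readiness matters: taking the Pod out of rotation the moment shutdown begins is ideal, but the Service endpoint update takes time to propagate to every node. A preStop sleep of 5 seconds buys that propagation time, which helps avoid 5xx errors during node drains.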
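On the application side, the readiness endpoint can start failing as soon as shutdown begins. The sketch below is one hedged way to model that: `ShutdownGate` and `remainingDrainSeconds` are hypothetical names, and the second function relies on the fact that time spent in the preStop hook counts against terminationGracePeriodSeconds.

```typescript
// Tracks whether the process has begun shutting down.
class ShutdownGate {
  private shuttingDown = false;

  // Call this from the shutdown signal handler.
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // Status code the readiness endpoint should return right now.
  readinessStatus(dependenciesHealthy: boolean): number {
    return !this.shuttingDown && dependenciesHealthy ? 200 : 503;
  }
}

// Seconds left for draining in-flight requests after the preStop sleep.
function remainingDrainSeconds(
  terminationGracePeriodSeconds: number,
  preStopSleepSeconds: number,
): number {
  return Math.max(0, terminationGracePeriodSeconds - preStopSleepSeconds);
}

const gate = new ShutdownGate();
const before = gate.readinessStatus(true); // 200 while serving normally
gate.beginShutdown();
const after = gate.readinessStatus(true);  // 503 once shutdown has begun
const drain = remainingDrainSeconds(60, 5); // 55 seconds with the spec above
```

In a Node.js service, `gate.beginShutdown()` would typically be called from a `process.on("SIGTERM", ...)` handler before the HTTP server stops accepting new connections.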

Six Common Mistakes and Anti-Patterns

1. Including external dependency checks in Liveness

# ❌ If the DB goes down, every Pod restarts → cascading failure!
livenessProbe:
  httpGet:
    path: /health       # checks DB + Redis + external APIs
    port: 8080

# ✅ Liveness checks only the process itself
livenessProbe:
  httpGet:
    path: /health/live  # plain 200 response only
    port: 8080

# Check external dependencies in Readiness
readinessProbe:
  httpGet:
    path: /health/ready # DB + Redis checks
    port: 8080

This is the most damaging mistake. When the DB fails and Liveness fails with it, every Pod restarts, the restarted Pods still cannot reach the DB, and they restart again. The result is a cascading failure.

2. Health endpoint doing heavy work

// ❌ Full table scan inside a health check
@Get('health')
async health() {
  const count = await this.userRepo.count();  // large table!
  return { status: 'ok', users: count };
}

// ✅ Lightweight query only
@Get('health/ready')
async ready() {
  await this.em.execute('SELECT 1');  // simple ping
  return { status: 'ok' };
}

3. timeoutSeconds too short

# ❌ Fails whenever a GC pause or a brief load spike pushes the response past 1 second
livenessProbe:
  timeoutSeconds: 1     # the default, but it can be too short

# ✅ Leave headroom
livenessProbe:
  timeoutSeconds: 3     # allows for JVM GC pauses

4. Using only initialDelaySeconds without a Startup Probe

# ❌ A fixed initialDelay cannot cover variable startup times
livenessProbe:
  initialDelaySeconds: 120   # always waits 120 seconds → inefficient

# ✅ Use a Startup Probe for flexibility
startupProbe:
  periodSeconds: 5
  failureThreshold: 30       # up to 150 seconds, passes sooner if the app starts sooner
livenessProbe:
  initialDelaySeconds: 0     # begins right after Startup passes

5. Using Liveness without Readiness

# ❌ Traffic arrives while the app is still starting → 5xx errors
containers:
- name: app
  livenessProbe: { ... }
  # no readinessProbe!

# ✅ Readiness blocks traffic until the app is ready
containers:
- name: app
  livenessProbe: { ... }
  readinessProbe: { ... }   # always configure both!

6. Setting failureThreshold: 1

# ❌ A single network blip restarts the container
livenessProbe:
  failureThreshold: 1    # one failure = restart

# ✅ Tolerate transient failures
livenessProbe:
  failureThreshold: 3    # restart only after 3 consecutive failures

Probe Configuration Guide by Workload

Workload	Startup time	Startup	Liveness	Readiness
NestJS/Express	~5s	Optional	HTTP /health/live	HTTP /health/ready (DB check)
Spring Boot	30-120s	Required	HTTP /actuator/health/liveness	HTTP /actuator/health/readiness
PostgreSQL	~10s	Optional	Exec pg_isready	Exec pg_isready
Redis	~2s	Not needed	TCP 6379	Exec redis-cli ping
gRPC service	Varies	Recommended	gRPC health	gRPC health

Debugging Probes: How to Diagnose Problems

# 1. Check Pod events for probe failures
kubectl describe pod myapp-xyz
# Events:
#   Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
#   Warning  Unhealthy  Readiness probe failed: connection refused
#   Normal   Killing    Container myapp failed liveness probe, will be restarted

# 2. Check the Pod restart count
kubectl get pods -o wide
# NAME         READY   STATUS    RESTARTS   AGE
# myapp-xyz    0/1     Running   5          10m    ← 5 restarts in 10 minutes: suspect a probe problem

# 3. Test the health check directly from inside the container
kubectl exec myapp-xyz -- wget -qO- http://localhost:3000/health/live
kubectl exec myapp-xyz -- wget -qO- http://localhost:3000/health/ready

# 4. Check the previous container's logs (from before the restart)
kubectl logs myapp-xyz --previous

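Summary: Kubernetes Probes Design Checklist

Item	Check
Liveness excludes external dependency checks (prevents cascading failure)	☐
Readiness includes dependency checks such as DB/Redis	☐
Slow-starting apps (JVM, etc.) have a Startup Probe	☐
Health endpoints run only lightweight checks (on the order of SELECT 1)	☐
failureThreshold ≥ 3 (tolerates transient failures)	☐
timeoutSeconds ≥ 2-3 (allows for GC pauses)	☐
Liveness and Readiness endpoints are separate (/live vs /ready)	☐
Graceful shutdown via preStop + terminationGracePeriodSeconds	☐
Failure detection time calculated: period × failureThreshold	☐
Probe failures diagnosed with describe/logs	☐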
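Several of these items can be checked mechanically. The following is a rough sketch of such a lint pass, not an existing tool: the `ProbeSpec` and `ContainerProbes` shapes are trimmed-down stand-ins for the real Pod spec, `slowStart` is a hypothetical flag you would set yourself, and the thresholds simply mirror the checklist.

```typescript
// Trimmed-down probe fields (a subset of the real Pod spec).
interface ProbeSpec {
  path?: string;
  timeoutSeconds?: number;
  failureThreshold?: number;
}

interface ContainerProbes {
  startupProbe?: ProbeSpec;
  livenessProbe?: ProbeSpec;
  readinessProbe?: ProbeSpec;
  slowStart?: boolean; // hypothetical hint: the app is known to start slowly (e.g. JVM)
}

// Returns one warning per checklist item the configuration violates.
function lintProbes(c: ContainerProbes): string[] {
  const warnings: string[] = [];

  if (!c.readinessProbe) {
    warnings.push("missing readinessProbe");
  }
  if (c.slowStart && !c.startupProbe) {
    warnings.push("slow-starting app without startupProbe");
  }

  const live = c.livenessProbe;
  if (live) {
    // Unset fields fall back to the Kubernetes defaults (3 and 1).
    if ((live.failureThreshold ?? 3) < 3) {
      warnings.push("liveness failureThreshold below 3");
    }
    if ((live.timeoutSeconds ?? 1) < 2) {
      warnings.push("liveness timeoutSeconds below 2");
    }
    if (live.path !== undefined && live.path === c.readinessProbe?.path) {
      warnings.push("liveness and readiness share an endpoint");
    }
  }
  return warnings;
}

const risky = lintProbes({ livenessProbe: { path: "/health", failureThreshold: 1 } });
// risky contains three warnings: missing readinessProbe,
// liveness failureThreshold below 3, liveness timeoutSeconds below 2
```

A check like this cannot tell whether a Liveness endpoint secretly queries the DB, so it complements a manual review of the handlers rather than replacing it.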

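Kubernetes Probes are the cluster's self-healing mechanism: they detect failures and recover from them automatically. The key is to keep Liveness light (process survival only), make Readiness thorough (dependencies included), and give Startup plenty of room (protecting startup time). Including external dependencies in Liveness is especially harmful, because it triggers cascading failures instead of healing anything, so the roles of Liveness and Readiness must be kept separate.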