K8s Argo Rollouts Deployment Strategies – Theo's Repository

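What Is Argo Rollouts?

Argo Rollouts is a Kubernetes controller that goes beyond the built-in Deployment rolling update and lets you declare canary, blue-green, and other progressive delivery strategies. It shifts traffic gradually and uses metric-based automated checks (Analysis) to promote or abort a release, which lowers the risk of each deployment.

This post covers operational patterns used in practice: canary configuration, blue-green cutover, metric verification with AnalysisTemplate, traffic management (Istio/NGINX), and rollback strategies.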

Basic Structure of the Rollout Resource

Argo Rollouts uses a Rollout resource in place of a Deployment. The spec is nearly identical to a Deployment's; the strategy section is where the two differ.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myregistry/api-server:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
  strategy:
    # Detailed canary settings are covered below
    canary: {}

Canary Deployment: Gradual Traffic Shifting

spec:
  strategy:
    canary:
      # Separate Services for canary and stable
      canaryService: api-server-canary
      stableService: api-server-stable

      # Traffic routing (using NGINX Ingress)
      trafficRouting:
        nginx:
          stableIngress: api-server-ingress
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: "true"

      # Step-by-step rollout plan
      steps:
        # Step 1: start with 5% of traffic
        - setWeight: 5
        - pause: { duration: 2m }

        # Step 2: run metric analysis
        - analysis:
            templates:
              - templateName: success-rate-analysis
            args:
              - name: service-name
                value: api-server-canary

        # Step 3: increase to 20%
        - setWeight: 20
        - pause: { duration: 5m }

        # Step 4: wait for manual approval
        - pause: {}  # no duration means manual approval is required

        # Step 5: increase to 50%
        - setWeight: 50
        - pause: { duration: 10m }

        # Step 6: final analysis
        - analysis:
            templates:
              - templateName: full-analysis
            args:
              - name: service-name
                value: api-server-canary

        # Step 7: 100% (automatic promotion)
        - setWeight: 100

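Within steps, setWeight sets the share of traffic sent to the canary, pause waits, and analysis verifies metrics before the rollout moves on. A pause: {} with no duration is a gate that holds the rollout until someone promotes it manually.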
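
The canaryService and stableService named in the strategy are ordinary Services that must already exist; the Rollout references them but does not create them. A minimal sketch of the pair, assuming the pods are exposed on Service port 80 (the port number is an assumption, not something the configuration above specifies):

```yaml
# Sketch only: the two Services referenced by canaryService / stableService.
# Port 80 is an assumed value; targetPort 8080 matches the containerPort above.
apiVersion: v1
kind: Service
metadata:
  name: api-server-stable
  namespace: production
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-server-canary
  namespace: production
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
```

Both Services start with the same selector. During a rollout the controller adds a rollouts-pod-template-hash selector to each one so that they point at the stable and canary ReplicaSets respectively.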

Blue-Green Deployment

spec:
  strategy:
    blueGreen:
      # Active Service (current production traffic)
      activeService: api-server-active
      # Preview Service (for testing the new version)
      previewService: api-server-preview

      # Preview replica count (default: same as spec.replicas)
      previewReplicaCount: 2

      # Whether automatic promotion is enabled
      autoPromotionEnabled: false

      # Delay before automatic promotion (when autoPromotionEnabled: true)
      # autoPromotionSeconds: 300

      # Run analysis before promotion
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
        args:
          - name: preview-url
            value: http://api-server-preview.production.svc.cluster.local

      # Analysis after promotion (decides whether to roll back)
      postPromotionAnalysis:
        templates:
          - templateName: success-rate-analysis
        args:
          - name: service-name
            value: api-server-active

      # How long to keep the previous ReplicaSet (for rollback)
      scaleDownDelaySeconds: 600

      # Maximum number of old ReplicaSets kept running while they wait out scaleDownDelaySeconds
      scaleDownDelayRevisionLimit: 2

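Blue-green cuts over all traffic at once, which makes it simpler than canary, but it needs up to twice the resources while both versions are running. Setting a readinessProbe, as described in the K8s Pod Probe health check strategy post, lets the controller judge accurately when the preview environment is ready.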

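AnalysisTemplate: Metric-Based Automated Decisions

What sets Argo Rollouts apart is Analysis. It queries metrics from providers such as Prometheus, Datadog, and CloudWatch and uses them to decide automatically whether a deployment succeeded or failed.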

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-analysis
spec:
  args:
    - name: service-name
  metrics:
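    # Success rate check (99% or higher)
    - name: success-rate
      interval: 30s
      count: 10           # take 10 measurements
      successCondition: result[0] >= 0.99
      failureLimit: 3     # tolerate up to 3 failed measurements; a 4th fails the analysis and aborts the rollout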
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))

    # P99 latency check (500 ms or less)
    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: result[0] <= 500
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_milliseconds_bucket{
                service="{{args.service-name}}"
              }[2m])) by (le)
            )

    # Error rate check (1% or less)
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: result[0] <= 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))
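
One case the queries above leave open is a measurement window with no samples, for example right after the canary starts and before it has served any requests. In that case the Prometheus result can be empty and result[0] cannot be evaluated. The following is a hedged variant of the success-rate metric, not a recommendation: it delays the first measurement with initialDelay and treats an empty result as a pass. Treating "no data" as success is an assumption that will not suit every service, and the 1m delay is an arbitrary illustrative value.

```yaml
# Sketch: success-rate metric that waits before measuring and
# does not fail when Prometheus returns no samples.
# "No data counts as success" is an assumption; adjust to your risk tolerance.
metrics:
  - name: success-rate
    initialDelay: 1m        # assumed value: give the canary time to receive traffic
    interval: 30s
    count: 10
    successCondition: len(result) == 0 || result[0] >= 0.99
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            service="{{args.service-name}}",
            status=~"2.."
          }[2m])) /
          sum(rate(http_requests_total{
            service="{{args.service-name}}"
          }[2m]))
```

A result that is present but not a number (a 0/0 division, for instance) is a separate case and would still need its own handling.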

Job-Based Analysis (E2E Tests)

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  args:
    - name: preview-url
  metrics:
    - name: smoke-test
      count: 1
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                containers:
                  - name: smoke-test
                    image: myregistry/smoke-test:latest
                    env:
                      - name: TARGET_URL
                        value: "{{args.preview-url}}"
                    command:
                      - /bin/sh
                      - -c
                      - |
                        # API health check
                        curl -sf "${TARGET_URL}/health" || exit 1
                        # Verify a core endpoint
                        curl -sf "${TARGET_URL}/api/v1/status" || exit 1
                        # Check response time; awk does the float comparison
                        # because (( )) is not available in POSIX sh
                        RESPONSE_TIME=$(curl -o /dev/null -s -w '%{time_total}' "${TARGET_URL}/api/v1/products")
                        if awk -v t="$RESPONSE_TIME" 'BEGIN { exit !(t > 2.0) }'; then
                          echo "Response too slow: ${RESPONSE_TIME}s"
                          exit 1
                        fi
                        echo "All smoke tests passed"
                restartPolicy: Never
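
When a single HTTP call is enough, the web metric provider can check an endpoint without launching a Job. The sketch below is illustrative only: the template name status-check is hypothetical, and it assumes the /api/v1/status endpoint returns JSON shaped like {"status": "ok"}, which the smoke test above does not confirm.

```yaml
# Sketch: HTTP-based check using the web provider.
# Assumptions: hypothetical template name, and a JSON body like {"status": "ok"}.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: status-check
spec:
  args:
    - name: preview-url
  metrics:
    - name: status-check
      count: 1
      successCondition: result == "ok"
      provider:
        web:
          url: "{{args.preview-url}}/api/v1/status"
          timeoutSeconds: 10
          jsonPath: "{$.status}"
```

The Job provider remains the better fit when the check needs several requests or custom logic, as in the smoke test above.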

Istio Traffic Routing Integration

spec:
  strategy:
    canary:
      canaryService: api-server-canary
      stableService: api-server-stable
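      trafficRouting:
        # Routes created by setHeaderRoute must be listed in managedRoutes
        managedRoutes:
          - name: internal-canary
        istio: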
          virtualServices:
            - name: api-server-vsvc
              routes:
                - primary
          destinationRule:
            name: api-server-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 10
        # Header-based canary (internal testers only)
        - setHeaderRoute:
            name: internal-canary
            match:
              - headerName: X-Canary-Test
                headerValue:
                  exact: "true"
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
---
# Istio VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-server-vsvc
spec:
  hosts:
    - api.example.com
  http:
    - name: primary
      route:
        - destination:
            host: api-server-stable
            subset: stable
          weight: 100
        - destination:
            host: api-server-canary
            subset: canary
          weight: 0
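
The Rollout above refers to a DestinationRule named api-server-destrule that is not shown. A minimal sketch of what it might look like for subset-level traffic splitting follows; the host api-server is a hypothetical Service name used only for illustration.

```yaml
# Sketch: DestinationRule referenced by trafficRouting.istio.destinationRule.
# The host value is an assumption; use the Service that fronts the Rollout's pods.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-server-destrule
  namespace: production
spec:
  host: api-server
  subsets:
    - name: canary      # matches canarySubsetName
      labels:
        app: api-server
    - name: stable      # matches stableSubsetName
      labels:
        app: api-server
```

The controller adds the rollouts-pod-template-hash label of the canary and stable ReplicaSets to the matching subsets, so the labels here only need to select the Rollout's pods in general. With subset-level splitting, both VirtualService destinations would normally name the same host as this DestinationRule. The VirtualService above routes to two different hosts, so each of those hosts would need the subsets defined for the routing to resolve.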

Operational Commands and Rollback

# Check rollout status (watch mode)
kubectl argo rollouts get rollout api-server -n production -w

# Manual promotion (approve a pause step)
kubectl argo rollouts promote api-server -n production

# Full promotion right away (skip the remaining steps)
kubectl argo rollouts promote api-server --full -n production

# Rollback: abort stops the in-progress update and returns traffic to the stable version,
# then undo reverts the Rollout spec to the previous revision
kubectl argo rollouts abort api-server -n production
kubectl argo rollouts undo api-server -n production

# Roll back to a specific revision
kubectl argo rollouts undo api-server --to-revision=3 -n production

# Retry an aborted rollout (the steps start over from the beginning)
kubectl argo rollouts retry rollout api-server -n production

# Check Analysis results
kubectl argo rollouts get rollout api-server -n production
kubectl get analysisrun -n production -l rollouts-pod-template-hash

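Strategy Comparison

Item	Canary	Blue-Green	Rolling (default)
Traffic control	Gradual (5→20→50→100%)	Immediate cutover	Per pod
Resource overhead	Low	2x	Minimal
Rollback speed	Immediate (weight 0)	Immediate (Service switch)	Slow (redeploy)
Metric analysis	At each step	Pre/post promotion	Not available
Best suited for	High-traffic APIs	Services that require zero downtime	Internal services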

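Wrapping Up

Argo Rollouts implements canary and blue-green deployments declaratively and uses AnalysisTemplate to make metric-based decisions automatically, which makes progressive delivery safer. Combined with K8s HPA autoscaling, a rollout can also absorb traffic increases that arrive in the middle of a deployment.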
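
For the HPA combination mentioned above, the autoscaler has to target the Rollout rather than a Deployment. A minimal sketch, in which the HPA name, the replica bounds, and the 70% CPU target are all illustrative assumptions:

```yaml
# Sketch: HPA that scales the Rollout directly.
# Name, min/max replicas, and the CPU target are assumed values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: api-server
  minReplicas: 5
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilization targets depend on the resources.requests.cpu value already set on the container, so the 500m request in the Rollout spec is what the 70% is measured against.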