K8s Logging: Fluent Bit + Loki – 테오의 저장소

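Why log collection is hard on K8s

Log management on Kubernetes differs fundamentally from log management on traditional servers. Pods can be terminated and recreated at any time and may be rescheduled onto other nodes, and container logs are only kept temporarily under /var/log/containers/ on each node rather than stored permanently. Without a systematic log collection pipeline, tracing the root cause of an incident becomes nearly impossible.

This post covers how to build a lightweight logging stack that collects per-node logs with Fluent Bit and stores and searches them centrally with Grafana Loki. It typically needs around a tenth of the resources of ELK (a rough figure that depends on log volume) while still offering search performance that is adequate for day-to-day operations.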

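Architecture: Fluent Bit → Loki → Grafana

Component	Role	Deployment
Fluent Bit	Log collection, parsing, forwarding	DaemonSet (every node)
Loki	Log storage, indexing, querying	StatefulSet or single Pod
Grafana	Log search, visualization, alerting	Deployment

Fluent Bit is an ultra-lightweight log processor written in C. It runs in roughly 1/50 of the memory of Fluentd (Ruby) and automatically enriches records with Kubernetes metadata (Pod name, namespace, labels).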
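
Part of that enrichment can be recovered from the log file name alone, because the kubelet names the symlinks under /var/log/containers/ as <pod>_<namespace>_<container>-<container id>.log. The following is a minimal sketch of that idea, not Fluent Bit's actual implementation; `parseContainerLogPath` and `ContainerLogMeta` are hypothetical names, and the real filter additionally queries the API server for labels and annotations.

```typescript
// Metadata that can be recovered from a kubelet log symlink name alone.
interface ContainerLogMeta {
  pod: string;
  namespace: string;
  container: string;
  containerId: string;
}

// <dir>/<pod>_<namespace>_<container>-<64 hex chars>.log
const CONTAINER_LOG_RE =
  /^(?:.*\/)?([^_\/]+)_([^_\/]+)_(.+)-([0-9a-f]{64})\.log$/;

// Returns null when the path does not follow the kubelet naming scheme.
function parseContainerLogPath(path: string): ContainerLogMeta | null {
  const m = CONTAINER_LOG_RE.exec(path);
  if (m === null) return null;
  const [, pod, namespace, container, containerId] = m;
  return { pod, namespace, container, containerId };
}
```

Pod and namespace names cannot contain underscores, which is what makes the underscore a safe separator in this naming scheme.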

Deploying the Fluent Bit DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - operator: Exists    # tolerate every taint so control-plane (master) nodes are covered too
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              # kept writable: the tail DB and the storage.path buffer configured below live under /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config

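Because this is a DaemonSet, one Pod is scheduled automatically on every eligible node in the cluster. The toleration with operator: Exists (and no key) tolerates every taint, so the collector also lands on control-plane (master) and other tainted nodes, leaving no blind spots in log collection.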
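
Why a bare operator: Exists covers every tainted node can be seen from the matching rules. The sketch below is a simplified model of those rules, not the scheduler's code: `tolerates` and `schedulableOn` are hypothetical helpers, and the model treats every taint as blocking (in a real cluster PreferNoSchedule taints do not block scheduling).

```typescript
interface Taint {
  key: string;
  value?: string;
  effect: string;
}

interface Toleration {
  key?: string;
  operator?: "Exists" | "Equal"; // Kubernetes defaults to "Equal"
  value?: string;
  effect?: string;
}

// True when a single toleration matches a single taint.
function tolerates(t: Toleration, taint: Taint): boolean {
  // An empty effect matches every effect.
  if (t.effect !== undefined && t.effect !== "" && t.effect !== taint.effect) {
    return false;
  }
  const op = t.operator ?? "Equal";
  // An empty key is only meaningful with Exists, and then matches every key.
  if (t.key === undefined || t.key === "") return op === "Exists";
  if (t.key !== taint.key) return false;
  return op === "Exists" || (t.value ?? "") === (taint.value ?? "");
}

// A Pod fits a node (in this model) when every taint is tolerated.
function schedulableOn(tolerations: Toleration[], taints: Taint[]): boolean {
  return taints.every((taint) => tolerations.some((t) => tolerates(t, taint)));
}
```

With the manifest's single `{ operator: "Exists" }` entry, `schedulableOn` is true for any taint list, which is the behaviour the DaemonSet relies on.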

Fluent Bit configuration: parsing, filtering, routing

# fluent-bit.conf
[SERVICE]
    Flush         5
    Log_Level     info
    Daemon        off
    Parsers_File  parsers.conf
    # metrics endpoint (comment kept on its own line: classic-mode config has no inline comments)
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
    storage.path  /var/log/flb-storage/
    storage.sync  normal
    storage.backlog.mem_limit 10M

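# Container log input
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # containerd CRI log format
    Parser            cri
    DB                /var/log/flb_kube.db
    DB.locking        true
    Mem_Buf_Limit     10MB
    # buffer chunks on disk under storage.path rather than in memory only
    storage.type      filesystem
    Skip_Long_Lines   On
    Refresh_Interval  5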

# Kubernetes metadata enrichment
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    # parse JSON log bodies automatically
    Merge_Log           On
    Merge_Log_Key       log
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Labels              On
    Annotations         Off

# Exclude the kube-system namespace (noise reduction)
[FILTER]
    Name    grep
    Match   kube.*
    Exclude $kubernetes['namespace_name'] kube-system

# Mask sensitive data: a line matching the pattern is replaced entirely with [REDACTED]
[FILTER]
    Name    modify
    Match   kube.*
    Condition Key_Value_Matches log password|secret|token
    Set      log [REDACTED]

# Send to Loki
[OUTPUT]
    Name        loki
    Match       kube.*
    Host        loki.logging.svc.cluster.local
    Port        3100
    Labels      job=fluent-bit, namespace=$kubernetes['namespace_name'], app=$kubernetes['labels']['app'], pod=$kubernetes['pod_name']
    Auto_Kubernetes_Labels Off
    Tenant_ID   default
    # Batching knobs (1 second wait, 1 MB batches); confirm that your loki output version accepts these keys
    Batch_Wait  1
    Batch_Size  1048576
    Line_Format json

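Key settings explained:

Setting	Role	Recommended value
`Mem_Buf_Limit`	In-memory buffer limit for the input; when it is reached the input pauses until data is flushed (chunks go to disk only with `storage.type filesystem`)	5~10MB
`DB`	Stores file read offsets so collection resumes where it left off after a restart	Required
`Merge_Log`	Parses JSON log lines automatically and extracts their fields	On
`storage.path`	Disk buffering path (prevents log loss while Loki is down)	Must be set
`Labels`	Loki labels (these are indexed, so fewer is better)	namespace, app, pod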
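
To make the Parser and Merge_Log rows concrete, the sketch below models what happens to one line: a CRI-formatted line is split into timestamp, stream, tag and body, and a JSON body is then merged into the record. This is an illustrative model only. The field names (in particular `log`) and the helper names are assumptions, and fields are merged at the top level for brevity, whereas with Merge_Log_Key the real filter nests them under that key.

```typescript
interface CriRecord {
  time: string;
  stream: "stdout" | "stderr";
  logtag: string; // "F" = full line, "P" = partial line
  log: string;
}

// <timestamp> <stream> <tag> <message>
const CRI_RE = /^(\S+) (stdout|stderr) (\S*) (.*)$/;

function parseCriLine(line: string): CriRecord | null {
  const m = CRI_RE.exec(line);
  if (m === null) return null;
  return { time: m[1], stream: m[2] as "stdout" | "stderr", logtag: m[3], log: m[4] };
}

// If the body is a JSON object, lift its fields into the record;
// otherwise return the record unchanged.
function mergeLog(record: CriRecord): Record<string, unknown> {
  try {
    const parsed: unknown = JSON.parse(record.log);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return { ...record, ...(parsed as Record<string, unknown>) };
    }
  } catch {
    // Not JSON: fall through and keep the raw line.
  }
  return { ...record };
}
```

A plain-text line passes through untouched, which is why unstructured application logs end up searchable only by substring or regex.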

Loki deployment: single-binary (monolithic) mode

# Deploy Loki with Helm (recommended)
# Note: recent chart versions (6.x) may also need deploymentMode=SingleBinary
# and read/write/backend replicas set to 0; check the chart's values.yaml.
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki -n logging --create-namespace \
  --set loki.auth_enabled=false \
  --set loki.commonConfig.replication_factor=1 \
  --set singleBinary.replicas=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.persistence.size=50Gi \
  --set loki.limits_config.retention_period=720h \
  --set loki.compactor.retention_enabled=true

# Or a hand-written manifest (small clusters)
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: logging
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:3.0.0
          args: ["-config.file=/etc/loki/config.yaml"]
          ports:
            - containerPort: 3100
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: data
              mountPath: /loki
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 2Gi
      volumes:
        - name: config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi

# loki-config.yaml (stored in the loki-config ConfigMap under the key config.yaml)
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 720h          # keep logs for 30 days
  max_query_series: 500
  max_query_parallelism: 2
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem   # Loki 3.x expects this when retention is enabled
  compaction_interval: 10m
  retention_delete_delay: 2h
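
Once Loki is running, one way to smoke-test it independently of Fluent Bit is to post a line to its push endpoint (/loki/api/v1/push). The sketch below only builds the JSON body that endpoint expects; `buildPushBody` and the `LogEntry` shape are hypothetical helpers introduced for illustration.

```typescript
interface LokiStream {
  stream: Record<string, string>; // label set identifying the stream
  values: [string, string][];     // [unix time in nanoseconds, log line]
}

interface LokiPushBody {
  streams: LokiStream[];
}

interface LogEntry {
  tsMillis: number;
  line: string;
}

function buildPushBody(
  labels: Record<string, string>,
  entries: LogEntry[]
): LokiPushBody {
  // Loki wants nanosecond timestamps as strings; appending six zeros to an
  // integer millisecond value avoids floating-point precision loss.
  const values = entries.map((e): [string, string] => [
    `${Math.trunc(e.tsMillis)}000000`,
    e.line,
  ]);
  return { streams: [{ stream: labels, values }] };
}
```

The resulting object can be sent as JSON with Content-Type: application/json to http://loki.logging.svc.cluster.local:3100/loki/api/v1/push, assuming the Service name used in the Fluent Bit output above.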

LogQL: the Loki query language

Loki's query language, LogQL, is inspired by PromQL. A label selector picks the log streams, and a pipeline then filters and parses them.

# Basic log search
{namespace="production", app="api-server"}

# Keyword filtering
{namespace="production", app="api-server"} |= "error"
{namespace="production"} != "healthcheck" |= "timeout"

# Regex filtering (backticks avoid having to double-escape \d)
{app="api-server"} |~ `status=[45]\d{2}`

# Parse JSON logs, then filter on fields
{app="api-server"} | json | response_time > 1000
{app="api-server"} | json | level="error" | line_format "{{.method}} {{.path}} - {{.message}}"

# Metric query: per-second rate of error lines, averaged over 5 minutes
rate({app="api-server"} |= "error" [5m])

# Top 5 error paths
topk(5, sum by(path) (
  rate({app="api-server"} | json | level="error" [1h])
))

# P99 response time (extracted from JSON logs)
quantile_over_time(0.99,
  {app="api-server"} | json | unwrap response_time [5m]
) by (path)

Operator	Purpose	Example
`|=`	Contains	`|= "error"`
`!=`	Does not contain	`!= "healthcheck"`
`|~`	Regex match	`|~ "5\\d{2}"`
`| json`	JSON parsing	`| json | level="error"`
`rate()`	Log entries per second	`rate({app="x"} [5m])`
`unwrap`	Extracts a field value as a number	`unwrap duration`
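
The same LogQL expressions can be run outside Grafana through Loki's HTTP API. As a rough sketch, the helper below builds a query_range URL; `buildQueryRangeUrl` is a hypothetical name, and the base URL and default limit are assumptions for illustration.

```typescript
// Builds a GET URL for Loki's /loki/api/v1/query_range endpoint.
// start/end are passed as nanosecond Unix timestamps.
function buildQueryRangeUrl(
  baseUrl: string,
  logql: string,
  startMs: number,
  endMs: number,
  limit: number = 100
): string {
  const params = new URLSearchParams({
    query: logql,
    start: `${Math.trunc(startMs)}000000`,
    end: `${Math.trunc(endMs)}000000`,
    limit: String(limit),
    direction: "backward", // newest entries first
  });
  return `${baseUrl}/loki/api/v1/query_range?${params.toString()}`;
}
```

URLSearchParams takes care of percent-encoding the braces, quotes and pipes that appear in almost every LogQL query.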

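Grafana alerting: detecting error spikes

# Grafana alert rule (LogQL metric query)
# Fire when error logs exceed 50 per minute, averaged over 5 minutes
# (rate() is per second, hence the * 60)
sum(rate({namespace="production"} | json | level="error" [5m])) * 60 > 50

# Slack/Discord notification setup
# Grafana → Alerting → Contact points: configure a webhook
# Alert rule → Labels: add severity=critical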
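
Because rate() reports entries per second, a threshold stated per minute needs a unit conversion. The arithmetic can be sketched as follows; the function names and the sample counts are made up for illustration.

```typescript
// What LogQL's rate() computes for a range: entries divided by window length.
function logRatePerSecond(entryCount: number, windowSeconds: number): number {
  return entryCount / windowSeconds;
}

// Compare against a per-minute threshold by scaling the per-second rate.
function exceedsPerMinuteThreshold(
  entryCount: number,
  windowSeconds: number,
  perMinuteThreshold: number
): boolean {
  return logRatePerSecond(entryCount, windowSeconds) * 60 > perMinuteThreshold;
}
```

For example, 300 error lines in a 5-minute window is a rate of 1 per second, or 60 per minute, so a 50-per-minute threshold is exceeded; comparing the raw per-second rate against 50 would require 15,000 lines in the same window.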

Application log format: structured JSON

For Fluent Bit's Merge_Log to work properly, the application has to emit its logs as structured JSON.

// NestJS: Pino JSON logger
import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.LOG_LEVEL || 'info',
        transport: process.env.NODE_ENV !== 'production'
          ? { target: 'pino-pretty' }  // development: human-readable format
          : undefined,                  // production: JSON
        serializers: {
          // keep only the request/response fields worth logging
          req: (req) => ({
            method: req.method,
            url: req.url,
            userAgent: req.headers['user-agent'],
          }),
          res: (res) => ({
            statusCode: res.statusCode,
          }),
        },
      },
    }),
  ],
})
export class AppModule {}

// Example output (production), illustrative only: by default Pino emits numeric levels (e.g. 30) unless a level formatter is configured
// {"level":"info","time":1709000000,"method":"GET","path":"/api/users","statusCode":200,"responseTime":45,"traceId":"abc-123"}

<!-- Spring Boot: Logback JSON encoder (logback-spring.xml); LogstashEncoder comes from the logstash-logback-encoder library -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeMdcKeyName>traceId</includeMdcKeyName>
      <includeMdcKeyName>userId</includeMdcKeyName>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
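
Whatever the framework, the contract that both configurations satisfy is one JSON object per line on stdout. A framework-free sketch of that contract is shown below; `formatLogLine` and its field names are illustrative assumptions, not part of Pino or Logback.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Serializes one log event as a single-line JSON object.
// JSON.stringify escapes embedded newlines, so a multi-line message
// still occupies exactly one line of output.
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return JSON.stringify({ ...fields, level, time: now.toISOString(), message });
}
```

Writing the returned string followed by a newline to stdout is enough for the container runtime to capture it, and fields such as `response_time` then become filterable after `| json` in LogQL.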

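ELK vs Loki: when to choose which

Criterion	ELK (Elasticsearch)	Loki
Indexing	Full-text indexing	Labels only (log bodies are not indexed)
Search speed	Very fast	Label queries are fast; body search is slow
Resources	High (several GB of RAM per node)	Low (512MB~2GB)
Storage	2~3x the raw size because of indexes	Compressed (0.3~0.5x the raw size)
Operational complexity	High (shard and index management)	Low (can run as a single binary)
Best suited for	Security log analysis, complex aggregations	K8s operational logs, cost-efficient logging

For roughly 80% of teams (a rule-of-thumb figure), Loki is the right answer. Most log searches are "recent errors from a specific Pod", not "how often does a given word appear across all logs".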
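
The storage row can be turned into a rough sizing check for the Loki volume. The sketch below simply multiplies daily raw volume, retention and compression ratio; the function name and the 4 GiB/day input are hypothetical, and the ratio is the approximate range from the table rather than a measured value.

```typescript
// Rough disk estimate: raw volume per day x days kept x compression ratio.
function estimateLokiStorageGiB(
  rawGiBPerDay: number,
  retentionDays: number,
  compressionRatio: number
): number {
  return rawGiBPerDay * retentionDays * compressionRatio;
}
```

For instance, 4 GiB of raw logs per day kept for 30 days at a ratio of 0.4 comes to about 48 GiB, which is close to the limit of the 50Gi volume used in the deployment examples.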

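Operations checklist

Label cardinality: keep Loki labels low-cardinality. Using userId or requestId as a label makes the index explode. Put such values in the log body and query them with | json
Retention: set retention_period and enable the compactor. Otherwise storage grows without bound
Buffering: set Fluent Bit's storage.path to prevent log loss during a Loki outage
Resources: set tight resource requests/limits for Fluent Bit. Because it is a DaemonSet, the cost is multiplied by the number of nodes
Multiline logs: multiline output such as Java stack traces is only grouped into a single log entry if Fluent Bit's multiline.parser is configured
Grafana dashboards: correlating Micrometer metrics and Loki logs on the same Grafana dashboard makes root-cause analysis much faster

The Fluent Bit + Loki stack is one of the most cost-efficient logging solutions for Kubernetes. A DaemonSet collects logs on every node, Loki's label-based indexing keeps searches fast, and LogQL metric queries extend the setup to alerting. The keys are keeping label cardinality low and having applications emit structured JSON logs.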
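
The label cardinality item can be quantified: each distinct combination of label values is a separate stream, so the product of the distinct values per label is an upper bound on the stream count. A small sketch of that arithmetic follows; `maxStreamCount` and the sample numbers are hypothetical.

```typescript
// Upper bound on the number of Loki streams: the product of the number of
// distinct values of each label. The real count is the number of
// combinations that actually occur, which is usually lower.
function maxStreamCount(distinctValuesPerLabel: Record<string, number>): number {
  return Object.keys(distinctValuesPerLabel).reduce(
    (acc, label) => acc * distinctValuesPerLabel[label],
    1
  );
}
```

With 5 namespaces, 20 apps and 60 pods the bound is 6,000 streams; adding a userId label with 10,000 distinct values raises it to 60,000,000, which is why such values belong in the log body.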