Linux perf Performance Profiling – Teo's Repository

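What Is Linux perf?

perf is a performance analysis tool that is developed in the Linux kernel source tree and built on the kernel's perf_events subsystem. It counts and samples hardware- and software-level performance events, covering CPU profiling, cache-miss analysis, system call tracing, and function call frequency. It can pinpoint CPU bottlenecks that strace or top cannot reveal.

This article covers practical operational patterns: perf's core subcommands (stat, record, report, top), hardware counters, Flame Graph generation, and profiling Java and Node.js applications.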

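Installation and Basic Checks

# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)

# RHEL/CentOS
sudo yum install perf

# Check the version
perf version

# Allow access to kernel symbols (needed to profile kernel code as a non-root user;
# these values are permissive, so prefer them on development or dedicated profiling hosts)
sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.kptr_restrict=0
# To persist: put the same settings in /etc/sysctl.d/99-perf.conf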
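The persistent configuration mentioned above can be sketched as follows. The function name perf_sysctl_conf is hypothetical; the two settings and the file path are the ones given in this article. The snippet only prints the file contents, and the commands that actually write the file are left as comments to run manually.

```shell
# Print the sysctl settings used in this article.
# perf_sysctl_conf is a hypothetical helper name, not part of perf.
perf_sysctl_conf() {
  printf '%s\n' \
    'kernel.perf_event_paranoid = -1' \
    'kernel.kptr_restrict = 0'
}

# Preview the file contents (writes nothing).
perf_sysctl_conf

# To apply, run manually:
#   perf_sysctl_conf | sudo tee /etc/sysctl.d/99-perf.conf
#   sudo sysctl --system
```

Keeping the settings in their own file under /etc/sysctl.d makes them easy to remove again once profiling is finished.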

perf stat: Event Counting

perf stat counts the hardware and software events that occur while a program runs. It gives a quick overview of CPU cycles, instruction counts, cache-miss ratios, and more.

# Basic statistics
perf stat ./my-program

# Example output:
#  5,234,567,890  cycles             # ~5.2B CPU cycles
#  3,456,789,012  instructions       # ~3.5B instructions (IPC: 0.66)
#     45,678,901  cache-references
#      2,345,678  cache-misses       # 5.14% of all cache refs
#        123,456  page-faults
#          2,345  context-switches
#
#  2.345678 seconds time elapsed

# Measure specific events only
perf stat -e cycles,instructions,cache-misses,branch-misses ./my-program

# Attach to a running process (for 10 seconds)
perf stat -p $(pidof node) -- sleep 10

# Per-CPU statistics (system-wide, without aggregating across CPUs)
perf stat -e cycles,instructions -A -a -- sleep 5

# Repeat the measurement (5 runs; reports averages and run-to-run variation)
perf stat -r 5 ./my-program

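Interpreting IPC (Instructions Per Cycle)

IPC is the instruction count divided by the cycle count. The thresholds below are rules of thumb, since the attainable IPC depends on the CPU microarchitecture.

IPC value	Meaning	Action
> 1.0	CPU used efficiently	Good; check for I/O bottlenecks
0.5 to 1.0	Moderate	Check cache misses and branch mispredictions
< 0.5	CPU used inefficiently	Likely memory-bound; cache optimization needed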
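As a minimal sketch of the arithmetic behind the table, the helpers below compute IPC from two counter values and map it to the table's buckets. The function names ipc and classify_ipc are hypothetical, and the bucket labels are shorthand for the table rows.

```shell
# ipc INSTRUCTIONS CYCLES: print instructions per cycle, two decimals.
ipc() {
  awk -v ins="$1" -v cyc="$2" 'BEGIN {
    if (cyc <= 0) { print "n/a"; exit 1 }
    printf "%.2f\n", ins / cyc
  }'
}

# classify_ipc IPC: map a value to the buckets of the table above.
classify_ipc() {
  awk -v v="$1" 'BEGIN {
    if (v > 1.0)       print "efficient"
    else if (v >= 0.5) print "moderate"
    else               print "inefficient"
  }'
}

# Counter values taken from the perf stat example output in this article.
value="$(ipc 3456789012 5234567890)"
echo "IPC=$value ($(classify_ipc "$value"))"
```

The counters can be copied from perf stat output by hand; the helpers only do the division and the threshold comparison.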

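perf record + report: Sampling Profiling

perf record samples CPU state at a regular interval and records which functions consume the most time.

# Profile a program while running it
perf record -g ./my-program
# -g: include call graphs

# Profile a running process (30 seconds)
perf record -g -p $(pidof java) -- sleep 30

# Profile the whole system
perf record -g -a -- sleep 10

# Set the sampling frequency explicitly (higher values give finer detail but more CPU overhead)
perf record -F 99 -g -p $(pidof node) -- sleep 30
# -F 99: 99 samples per second (99 Hz; used instead of 100 to avoid sampling in lockstep with periodic activity)

# Analyze the results
perf report
# Or print as text without the TUI
perf report --stdio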

Reading perf report Output

# perf report --stdio example
# Overhead  Command  Shared Object      Symbol
#  25.32%   node     node               [.] v8::internal::Compiler::Compile
#  18.45%   node     libc.so.6          [.] __memmove_avx_unaligned
#  12.78%   node     node               [.] v8::internal::Heap::AllocateRaw
#   8.91%   node     [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
#   6.23%   node     node               [.] v8::internal::MarkCompactCollector

# Overhead: share of all samples attributed to this function
# [.]: user space  [k]: kernel space
# → V8 compilation ~25%, memory copies ~18%, heap allocation ~13%, GC ~6%
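To see which shared object dominates, the Overhead column can be summed per Shared Object. The sketch below assumes the four-column layout of the example above (Overhead, Command, Shared Object, Symbol); reports that add Children/Self columns would need different field numbers. The function name overhead_by_dso is hypothetical.

```shell
# overhead_by_dso: read `perf report --stdio` text on stdin and print
# "<total percent> <shared object>", highest total first.
overhead_by_dso() {
  awk '
    /^[ \t]*#/ || NF < 3 { next }        # skip header comments and blank lines
    $1 ~ /^[0-9.]+%$/ {                  # data rows start with "NN.NN%"
      pct = $1
      sub(/%$/, "", pct)
      total[$3] += pct                   # field 3 is the shared object
    }
    END { for (dso in total) printf "%.2f %s\n", total[dso], dso }
  ' | sort -k1,1nr
}

# Sample rows copied from the example report above.
overhead_by_dso <<'EOF'
# Overhead  Command  Shared Object      Symbol
  25.32%   node     node               [.] v8::internal::Compiler::Compile
  18.45%   node     libc.so.6          [.] __memmove_avx_unaligned
  12.78%   node     node               [.] v8::internal::Heap::AllocateRaw
   8.91%   node     [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
   6.23%   node     node               [.] v8::internal::MarkCompactCollector
EOF
```

In practice the input would come from a pipe such as `perf report --stdio | overhead_by_dso`; perf's own `--sort dso` option gives a similar grouping without post-processing.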

perf top: Real-Time Profiling

# Monitor the whole system in real time (like top)
sudo perf top

# A specific process only
sudo perf top -p $(pidof java)

# Include call graphs
sudo perf top -g

# Rank by a specific event
sudo perf top -e cache-misses

Generating Flame Graphs

A Flame Graph visualizes perf data so that bottlenecks are easy to spot. Each box is a stack frame, and the wider the box, the more CPU time was sampled in that function and its callees.

# Install the FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git

# 1. Collect profiling data
perf record -F 99 -g -p $(pidof java) -- sleep 30

# 2. Convert to text with perf script
perf script > perf.out

# 3. Fold the stacks
./FlameGraph/stackcollapse-perf.pl perf.out > perf.folded

# 4. Generate the SVG
./FlameGraph/flamegraph.pl perf.folded > flamegraph.svg

# As a single pipeline:
perf script | ./FlameGraph/stackcollapse-perf.pl |
  ./FlameGraph/flamegraph.pl > flamegraph.svg

# Differential Flame Graph (before/after comparison)
# Before optimization
perf record -F 99 -g -o before.data -- ./my-program
perf script -i before.data | ./FlameGraph/stackcollapse-perf.pl > before.folded

# After optimization
perf record -F 99 -g -o after.data -- ./my-program
perf script -i after.data | ./FlameGraph/stackcollapse-perf.pl > after.folded

# Comparison SVG
./FlameGraph/difffolded.pl before.folded after.folded |
  ./FlameGraph/flamegraph.pl > diff.svg
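The folded files produced by stackcollapse-perf.pl are plain text, one line per unique stack in the form "frame1;frame2;leaf COUNT", so they can be inspected without rendering an SVG. The sketch below totals samples by leaf frame (self time). The function name self_samples and the stack names in the sample input are hypothetical.

```shell
# self_samples: read folded stacks on stdin and print
# "<samples> <leaf frame>", highest count first.
self_samples() {
  awk '
    NF >= 2 {
      count = $NF                        # last field is the sample count
      stack = $0
      sub(/[ \t]+[0-9]+$/, "", stack)    # drop the trailing count
      n = split(stack, frames, ";")
      self[frames[n]] += count           # credit samples to the leaf frame
    }
    END { for (f in self) printf "%d %s\n", self[f], f }
  ' | sort -k1,1nr -k2
}

# Hypothetical folded stacks for illustration.
self_samples <<'EOF'
main;parse;read_file 30
main;parse;tokenize 50
main;render 15
main;parse;tokenize;memcpy 5
EOF
```

A typical use would be `self_samples < perf.folded | head`, which lists the functions that were on-CPU most often regardless of who called them.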

Hardware Counters in Depth

# List available events (availability varies by CPU and kernel)
perf list

# L1 and last-level cache (LLC) analysis
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-load-misses,LLC-loads,LLC-load-misses ./my-program

# Branch prediction analysis
perf stat -e branches,branch-misses ./my-program

# TLB miss analysis (address translation overhead)
perf stat -e dTLB-loads,dTLB-load-misses,\
iTLB-loads,iTLB-load-misses ./my-program

# NUMA-related events (multi-socket servers)
perf stat -e node-loads,node-load-misses,\
node-stores,node-store-misses ./my-program

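Interpreting Miss Rates

These thresholds are rough guidelines; acceptable rates depend on the workload.

Metric	Normal miss rate	Warning miss rate
L1 Data	< 5%	> 10%
LLC (L3)	< 20%	> 40%
Branch	< 2%	> 5%
TLB	< 1%	> 5% (consider Huge Pages)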
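A miss rate is the miss counter divided by the matching access counter, for example L1-dcache-load-misses over L1-dcache-loads. The sketch below computes that percentage and compares it against a pair of thresholds from the table. The function names miss_rate and check_rate, the "borderline" label for values between the two thresholds, and the sample counter values are assumptions made for illustration.

```shell
# miss_rate MISSES TOTAL: print the miss percentage, two decimals.
miss_rate() {
  awk -v m="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit 1 }
    printf "%.2f\n", 100 * m / t
  }'
}

# check_rate RATE NORMAL_BELOW WARN_ABOVE: print ok, borderline, or warning.
check_rate() {
  awk -v r="$1" -v normal="$2" -v warn="$3" 'BEGIN {
    if (r < normal)    print "ok"
    else if (r > warn) print "warning"
    else               print "borderline"
  }'
}

# Hypothetical L1 data counters: 1,200,000 misses out of 10,000,000 loads.
rate="$(miss_rate 1200000 10000000)"
echo "L1 data miss rate: ${rate}% ($(check_rate "$rate" 5 10))"
```

The same two helpers cover every row of the table by passing that row's thresholds, such as `check_rate "$rate" 20 40` for the LLC.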

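Profiling Java Applications

The JVM generates JIT-compiled code at run time, so plain perf shows those frames as [unknown]. A symbol map is needed, either from perf-map-agent or from the JVM itself.

# Add JVM options
java -XX:+PreserveFramePointer \
     -XX:+UnlockDiagnosticVMOptions \
     -XX:+DebugNonSafepoints \
     -jar myapp.jar

# Generate the symbol map with perf-map-agent
# (on JDK 17+, -XX:+DumpPerfMapAtExit can be used instead; it is a diagnostic
#  option and writes /tmp/perf-<PID>.map when the JVM exits)
java -XX:+PreserveFramePointer \
     -XX:+UnlockDiagnosticVMOptions \
     -XX:+DumpPerfMapAtExit \
     -jar myapp.jar &

# Profile
perf record -F 99 -g -p $(pidof java) -- sleep 30
perf report   # with DumpPerfMapAtExit, run this after the JVM has exited so the map file exists

# async-profiler (an easier alternative)
# Java-only, but it builds on perf_events and generates Flame Graphs directly
./asprof -d 30 -f flamegraph.html $(pidof java)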

Profiling Node.js Applications

# Enable Node.js perf support
node --perf-basic-prof app.js &

# This creates /tmp/perf-<PID>.map,
# which contains the symbol mappings for V8 JIT code

# Profile
perf record -F 99 -g -p $(pidof node) -- sleep 30

# Generate the Flame Graph
perf script | ./FlameGraph/stackcollapse-perf.pl |
  ./FlameGraph/flamegraph.pl --colors=js > node-flamegraph.svg
# --colors=js: use the JavaScript color palette to distinguish frame types
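The perf map file is a plain-text table with one JIT symbol per line in the form "START SIZE name", where START and SIZE are hexadecimal without a 0x prefix. perf resolves a sampled address by finding the range that contains it. The sketch below does the same lookup by hand; the function name perf_map_lookup and the sample addresses and symbol names are hypothetical.

```shell
# perf_map_lookup ADDR_HEX MAP_FILE: print the symbol whose
# [START, START+SIZE) range contains the address, or return 1.
perf_map_lookup() {
  pml_addr=$((0x${1#0x}))
  while read -r pml_start pml_size pml_name; do
    [ -n "$pml_name" ] || continue              # skip blank or malformed lines
    pml_lo=$((0x$pml_start))
    pml_hi=$((pml_lo + 0x$pml_size))
    if [ "$pml_addr" -ge "$pml_lo" ] && [ "$pml_addr" -lt "$pml_hi" ]; then
      printf '%s\n' "$pml_name"
      return 0
    fi
  done < "$2"
  return 1
}

# Hypothetical map contents for illustration.
map_file="$(mktemp)"
printf '%s\n' \
  '7f0000001000 40 LazyCompile:*handleRequest /app/server.js:12' \
  '7f0000001040 80 LazyCompile:~parseBody /app/body.js:3' > "$map_file"

perf_map_lookup 0x7f0000001050 "$map_file"
rm -f "$map_file"
```

If frames still show up as [unknown] after enabling --perf-basic-prof, checking that the map file exists and contains the sampled address range is a quick first diagnostic.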

perf trace: System Call Tracing

# strace alternative (typically much lower overhead)
perf trace ./my-program

# Trace specific system calls only
perf trace -e read,write,open,close -p $(pidof node) -- sleep 10

# Summary statistics
perf trace -s ./my-program
#  syscall     calls  errors  total(ms)  min(ms)  avg(ms)  max(ms)
#  read         1234       0     45.678    0.001    0.037    2.345
#  write         567       0     12.345    0.002    0.021    1.234
#  futex         890      12    234.567    0.001    0.263   50.123
# → Many futex calls combined with a high max suggest lock contention
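The "high max" check can be automated with a small filter over the summary rows. The sketch below assumes the simplified seven-column layout shown in the example above (syscall, calls, errors, total, min, avg, max); actual perf trace -s output differs between versions, so the field numbers may need adjusting. The function name slow_syscalls is hypothetical.

```shell
# slow_syscalls THRESHOLD_MS: read summary rows on stdin and print the
# syscalls whose max latency (field 7) is at or above the threshold.
slow_syscalls() {
  awk -v limit="$1" '
    NF >= 7 && $2 ~ /^[0-9]+$/ && $7 + 0 >= limit + 0 {
      printf "%s max=%sms calls=%s\n", $1, $7, $2
    }'
}

# Rows copied from the example summary above.
slow_syscalls 10 <<'EOF'
syscall     calls  errors  total(ms)  min(ms)  avg(ms)  max(ms)
read         1234       0     45.678    0.001    0.037    2.345
write         567       0     12.345    0.002    0.021    1.234
futex         890      12    234.567    0.001    0.263   50.123
EOF
```

The numeric test on the second field is what skips the header row, so the filter can be fed the summary text unmodified.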

perf sched: Scheduler Analysis

# Record scheduling events
perf sched record -- sleep 10

# Analyze scheduling latency
perf sched latency
#  Task              | Runtime ms | Switches | Avg delay ms | Max delay ms
#  java:12345        |   4523.456 |     2345 |        0.045 |       12.345
#  node:23456        |   2345.678 |     1234 |        0.023 |        5.678

# Timeline view
perf sched timehist
# Prints, in time order, which task ran on each CPU and when
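Tasks with a large maximum scheduling delay are usually the first thing to look for in the latency table. The sketch below filters on the fifth pipe-separated column, assuming the simplified layout of the example above; real perf sched latency output carries extra text such as unit suffixes in its cells, so the parsing would need adapting. The function name high_delay_tasks is hypothetical.

```shell
# high_delay_tasks THRESHOLD_MS: read "Task | Runtime | Switches | Avg | Max"
# rows on stdin and print tasks whose max delay is at or above the threshold.
high_delay_tasks() {
  awk -F'|' -v limit="$1" '
    NF >= 5 && $5 ~ /[0-9]/ && $5 + 0 >= limit + 0 {
      task = $1
      gsub(/^[ \t]+|[ \t]+$/, "", task)   # trim padding around the task name
      printf "%s %.3f\n", task, $5
    }'
}

# Rows copied from the example table above.
high_delay_tasks 10 <<'EOF'
 Task              | Runtime ms | Switches | Avg delay ms | Max delay ms
 java:12345        |   4523.456 |     2345 |        0.045 |       12.345
 node:23456        |   2345.678 |     1234 |        0.023 |        5.678
EOF
```

A task flagged here is a candidate for closer inspection with perf sched timehist, which shows when the long wait happened.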

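Practical Troubleshooting Workflow

# Step 1: Get a system-wide overview
perf stat -a -- sleep 10

# Step 2: Identify CPU hotspots
perf record -F 99 -g -a -- sleep 30
perf report --stdio --sort comm,dso

# Step 3: Analyze a specific process in depth
perf record -F 99 -g -p $PID -- sleep 30
perf report

# Step 4: Visualize with a Flame Graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg

# Step 5: Drill down on a specific event
perf record -e cache-misses -g -p $PID -- sleep 30
perf report  # find the functions with the most cache misses

# Step 6: Code-level analysis
perf annotate -s function_name
# Shows the hot instructions in the disassembly (with source lines when debug info is available)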

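Wrap-up

perf is the Swiss Army knife of Linux performance analysis. A single tool covers CPU bottlenecks, cache misses, scheduling latency, and system call overhead.

Practical checklist:

Start with IPC from perf stat to distinguish CPU-bound from memory-bound workloads
Visualize hotspot functions with perf record + Flame Graph
JVM and Node.js require symbol-mapping options (PreserveFramePointer, --perf-basic-prof)
Use perf trace in place of strace to minimize overhead in production

Combined with "Linux eBPF Observability in Depth", deeper kernel-level analysis becomes possible, and "Linux cgroup v2 Resource Control" extends this to optimizing CPU allocation.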