[Feature] Prometheus + Grafana 모니터링 시스템 구축#138
Conversation
📝 Walkthrough

Adds a Prometheus/Grafana-based monitoring stack and the Micrometer Prometheus registry dependency, adjusts the application's Actuator/metrics configuration and profiles, and introduces a Docker Compose monitoring stack, Grafana provisioning (datasources, dashboards, alerting), Prometheus configuration, and a local/dev-only monitoring test controller.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as "Spring App\n(com.sopt.cherrish)"
    participant Prom as "Prometheus\n(prom/prometheus:9090)"
    participant Graf as "Grafana\n(grafana:3000)"
    participant Discord as "Discord\n(Webhook)"
    App->>Prom: exposes /actuator/prometheus (scrape target)
    Prom->>Prom: scrapes (15s interval)
    Graf->>Prom: dashboard/alert queries (evaluation)
    Graf->>Graf: evaluates alert rules (Error Rate, High Latency, Metrics Down)
    Graf-->>Discord: sends alerts via Discord webhook
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 10
🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Around line 27-28: The docker-compose uses a dangerous default admin password
via GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}; remove the hardcoded
fallback and require an explicit secret by changing to
GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD} (and consider the same for
GF_SECURITY_ADMIN_USER), and add startup validation or container healthcheck
that fails fast when GRAFANA_PASSWORD is empty so deployments without a strong
password do not start.
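A minimal sketch of that change — the `${VAR:?message}` form makes Compose fail fast when the variable is unset or empty (`GRAFANA_ADMIN_USER` is a hypothetical variable name):

```yaml
# docker-compose.monitoring.yml (sketch)
environment:
  - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:?GRAFANA_ADMIN_USER must be set}
  - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:?GRAFANA_PASSWORD must be set}
```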
- Around line 36-37: Update the compose so Grafana waits for Prometheus to be
healthy rather than just started: add a healthcheck block to the Prometheus
service (define a reliable check that verifies Prometheus readiness) and change
Grafana's depends_on from the simple list to the condition form that references
prometheus with condition: service_healthy (i.e., use depends_on: prometheus:
condition: service_healthy). Ensure the healthcheck command and interval/retries
are appropriate for Prometheus readiness.
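A sketch of the suggested wiring, probing Prometheus's /-/healthy endpoint with wget:

```yaml
services:
  prometheus:
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 10s
      timeout: 5s
      retries: 3

  grafana:
    depends_on:
      prometheus:
        condition: service_healthy   # waits for healthy, not merely started
```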
In `@monitoring/grafana/provisioning/alerting/rules.yml`:
- Line 7: The current group evaluation interval of interval: 1m, combined with the 5m range used in the PromQL queries, can cause overlapping evaluations. Review the alerting rules' 'interval' setting and either increase the value (e.g., interval: 2m) or, conversely, shorten the query range (e.g., 1m) to balance sensitivity against resource usage. The target is the 'interval' entry in provisioning/alerting/rules.yml; after the change, test the alert frequency and check for duplicate evaluations to confirm the result.
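A sketch of the first option (the group name is hypothetical; only `interval` changes):

```yaml
# monitoring/grafana/provisioning/alerting/rules.yml (sketch)
groups:
  - name: cherrish-alerts   # hypothetical group name
    interval: 2m            # was 1m; reduces overlapping evaluations of 5m-range queries
```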
- Around line 56-57: Change the alert rules so that missing metrics don’t
silently show OK: update noDataState (currently set to OK) to a non-OK state
(e.g., NoData or Alert) for the rules that define noDataState and execErrState,
and/or add a dedicated "Metrics Collection Health" alert that monitors
up{job="cherrish"} so scrapes/down metrics trigger alerts; modify the entries
referencing noDataState and execErrState in the rules.yml and add the
up{job="cherrish"} rule as suggested.
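A sketch of the suggested "Metrics Collection Health" rule as a rule entry in Grafana's alert-provisioning format (the UID, datasource UID, time range, and `for` duration are assumptions):

```yaml
- uid: metrics-collection-health   # hypothetical UID
  title: Metrics Collection Health
  condition: A
  noDataState: Alerting            # instead of OK, so silent scrape failures surface
  execErrState: Alerting
  for: 2m
  data:
    - refId: A
      datasourceUid: prometheus    # assumed datasource UID
      relativeTimeRange: { from: 300, to: 0 }
      model:
        expr: up{job="cherrish"} == 0
```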
In `@monitoring/grafana/provisioning/dashboards/dashboard.yml`:
- Around line 1-11: Change the provider block in dashboard.yml to set
disableDeletion: true to prevent accidental removal of provisioned dashboards
(look for the providers list and the entry with name: 'Cherrish Dashboards'),
and consider increasing updateIntervalSeconds from 30 to a higher value (e.g.,
300) for production; ensure options.path remains
/etc/grafana/provisioning/dashboards/json and keep orgId, folder and type
unchanged.
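The resulting provider block might look like this (orgId and folder values are assumptions based on typical defaults):

```yaml
# monitoring/grafana/provisioning/dashboards/dashboard.yml (sketch)
apiVersion: 1
providers:
  - name: 'Cherrish Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: true        # prevents accidental removal of provisioned dashboards
    updateIntervalSeconds: 300   # was 30; gentler for production
    options:
      path: /etc/grafana/provisioning/dashboards/json
```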
In `@monitoring/grafana/provisioning/dashboards/json/cherrish-overview.json`:
- Line 501: The dashboard's refresh interval is set too aggressively via the
JSON key "refresh": "5s"; update the "refresh" value in cherrish-overview.json
from "5s" to a less frequent interval such as "30s" or "1m" to reduce load on
Prometheus/Grafana and avoid unnecessary scraping; locate the "refresh" entry
(currently "refresh": "5s") and replace it with the chosen value, then validate
the JSON and reload the dashboard.
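The one-line edit in cherrish-overview.json:

```json
"refresh": "30s"
```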
- Around line 47-54: Update the threshold "steps" values to match each panel's
"unit" instead of the blanket 80: for the "CPU Usage" panel (unit "percentunit",
max 1) change the red threshold step value from 80 to 0.8; for the "JVM Heap
Memory" panel (unit "bytes") either convert the 80 to a byte value (e.g., 0.8 *
configured max heap in bytes) or switch that panel's unit to a percentage unit
and set the threshold value to 0.8; for the "GC Pause Time" panel (unit "s")
replace 80 with a realistic seconds threshold (e.g., 0.5 for 500ms or another
appropriate value). Edit the JSON objects under each panel's "thresholds.steps"
and/or "unit" properties accordingly.
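For the "CPU Usage" panel, the corrected fragment could look like this (the surrounding fieldConfig structure is assumed from standard Grafana dashboard JSON):

```json
"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
    "max": 1,
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "red", "value": 0.8 }
      ]
    }
  }
}
```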
In `@monitoring/prometheus/prometheus.yml`:
- Around line 10-16: The current Prometheus scrape job for job_name
'cherrish-server' hardcodes a local Docker target and leaves production config
commented out; instead, make the target configurable per environment by either
(a) splitting prometheus.yml into environment-specific files and loading the
correct one during deployment, or (b) parameterizing the static_configs target
using environment variables (referencing the job_name 'cherrish-server' and
metrics_path '/actuator/prometheus') so the target host and port come from
CHERRISH_SERVER_HOST/CHERRISH_SERVER_PORT (with sensible defaults) and remove
manual comment toggling.
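One way to do option (b): plain Prometheus does not expand environment variables inside scrape configs, so the file can be kept as a template and rendered at deploy time, e.g. with envsubst (the template filename is an assumption):

```yaml
# monitoring/prometheus/prometheus.yml.tpl (sketch) — rendered before container start
scrape_configs:
  - job_name: 'cherrish-server'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['${CHERRISH_SERVER_HOST}:${CHERRISH_SERVER_PORT}']
```

Rendered with something like `CHERRISH_SERVER_HOST=host.docker.internal CHERRISH_SERVER_PORT=8080 envsubst < prometheus.yml.tpl > prometheus.yml` as part of the deployment step.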
In `@src/main/resources/application-monitoring.yaml`:
- Around line 9-10: The management Prometheus access property management.endpoint.prometheus.access currently has an empty value; either assign it explicitly to one of 'none', 'read-only', or 'unrestricted' according to the intended access level, or delete the property line (access:) to fall back to the default behavior. Apply the change to the management.endpoint.prometheus.access entry in application-monitoring.yaml.
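The explicit form (which value to pick depends on the intended exposure):

```yaml
# src/main/resources/application-monitoring.yaml (sketch)
management:
  endpoint:
    prometheus:
      access: read-only   # or none / unrestricted
```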
In `@src/main/resources/application-prod.yaml`:
- Around line 12-17: The management endpoints are exposed without proper auth;
update production config or add security: either implement Spring Security to
protect management endpoints and enforce IP-based access (configure access
control for Prometheus via prometheus.access and secure endpoints under
management.* and endpoint.health.show-details), or remove the monitoring profile
from production by ensuring the "monitoring" profile is not active in production
application-prod.yaml and keep management.server.port: 8081 while
disabling/promoting safe defaults for management.endpoint.* to avoid
unauthenticated metric/health exposure.
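A sketch of the safer-defaults direction on the YAML side (whether this matches the project's intended exposure is for the authors to decide):

```yaml
# src/main/resources/application-prod.yaml (sketch)
management:
  server:
    port: 8081                        # keep management traffic off the public port
  endpoints:
    web:
      exposure:
        include: health,prometheus    # expose only what monitoring needs
  endpoint:
    health:
      show-details: never             # avoid leaking details unauthenticated
```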
Merge branch '129-feature/prometheus-grafana-monitoring-system' of https://github.com/TEAM-Cherrish/Cherrish-Server into 129-feature/prometheus-grafana-monitoring-system
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@monitoring/grafana/provisioning/alerting/policies.yml`:
- Around line 3-10: Current group_by uses only grafana_folder which can merge
unrelated alerts; update the policies block so group_by includes a distinct rule
identifier (e.g., add alertname or the rule UID label) in addition to
grafana_folder to ensure alerts are grouped per rule. Locate the policies entry
and modify the group_by array to include "alertname" (or your rule UID label)
alongside "grafana_folder", keeping group_wait/group_interval/repeat_interval
as-is.
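A sketch of the adjusted policy (the receiver name and timing values are assumptions):

```yaml
# monitoring/grafana/provisioning/alerting/policies.yml (sketch)
apiVersion: 1
policies:
  - orgId: 1
    receiver: discord-webhook                   # assumed receiver name
    group_by: ['grafana_folder', 'alertname']   # alertname added so alerts group per rule
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
```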
In `@monitoring/prometheus/prometheus.prod.yml`:
- Around line 10-13: The Prometheus scrape job for job_name 'cherrish-server'
uses a static target 'cherrish-server:8081' which may not resolve in production;
verify that the DNS name 'cherrish-server' resolves in your production runtime
(Kubernetes/Docker Swarm) and if not, switch this scrape config to use the
appropriate service discovery (e.g., kubernetes_sd_configs or
docker_sd_configs), a fully qualified domain name, or an environment-specific
variable so it matches how the service is exposed in production; check related
app config in application-prod.yaml to ensure port 8081 matches and update the
prometheus scrape target or discovery method accordingly.
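If the service is reachable via DNS rather than a Compose network alias, one alternative is DNS-based discovery (the FQDN below is hypothetical):

```yaml
# monitoring/prometheus/prometheus.prod.yml (sketch)
scrape_configs:
  - job_name: 'cherrish-server'
    metrics_path: '/actuator/prometheus'
    dns_sd_configs:
      - names: ['cherrish-server.internal.example.com']   # hypothetical FQDN
        type: A
        port: 8081
```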
In `@monitoring/prometheus/prometheus.yml`:
- Around line 10-13: The Prometheus scrape target uses host.docker.internal:8080
which may fail on Linux because docker-compose.monitoring.yml lacks the
extra_hosts mapping; update the prometheus service in
docker-compose.monitoring.yml to include extra_hosts:
["host.docker.internal:host-gateway"] so host.docker.internal resolves, or
alternatively change the target in monitoring/prometheus/prometheus.yml for
job_name 'cherrish-server' to use the internal service hostname (e.g.,
cherrish-server:8081 as in prometheus.prod.yml) to ensure cross-platform
scraping works.
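The first option is a one-stanza change in the compose file:

```yaml
services:
  prometheus:
    extra_hosts:
      - "host.docker.internal:host-gateway"   # makes host.docker.internal resolve on Linux
```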
♻️ Duplicate comments (1)
monitoring/grafana/provisioning/dashboards/json/cherrish-overview.json (1)
407-415: Response-time panel threshold is excessive for its unit
On the response-time panel with unit: "s", the red threshold is set to 80s, which is far out of line with any realistic SLO. Even though thresholdsStyle is currently off, it would give a misleading signal if enabled later, so adjust it to a reasonable SLO value (e.g., around 0.5–1s).

♻️ Suggested fix

```diff
- { "color": "red", "value": 80 }
+ { "color": "red", "value": 0.5 }
```
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Line 16: The '--web.enable-lifecycle' flag still present in the compose file can expose the /-/reload and /-/quit endpoints without authentication in production, so it should be removed or disabled per environment. To fix this, delete the '--web.enable-lifecycle' entry from docker-compose.monitoring.yml for production builds/deployments, or gate it behind an environment mechanism such as PROM_ENABLE_LIFECYCLE so it is only set in development (or block access to the /-/reload and /-/quit endpoints at the deployment layer with a reverse proxy/network policy), and verify the flag and the affected endpoints ('/-/reload', '/-/quit') before applying; a sketch of the override-file approach follows.
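Compose cannot conditionally include a single `command` list item, so one workable pattern is a dev-only override file that re-states the full command with the flag (the override filename is an assumption; note that `command` in an override replaces the base command entirely):

```yaml
# docker-compose.monitoring.dev.yml — hypothetical dev-only override
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'   # lifecycle endpoints enabled only in dev
```

Started with `docker compose -f docker-compose.monitoring.yml -f docker-compose.monitoring.dev.yml up -d`; the base file then omits the flag entirely.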
- Line 1: The top-level 'version' field in docker-compose.monitoring.yml is ignored by Docker Compose V2, so delete the version: '3.8' entry; that is, remove the top-level version key and leave the rest of the services/volumes/networks definitions as they are.
- Line 30: Update the Grafana image tag used in the docker-compose service (the
line containing image: grafana/grafana:10.0.0) to the current stable release by
replacing 10.0.0 with 12.3.1 (i.e., image: grafana/grafana:12.3.1) or with a
configurable variable (e.g., ${GRAFANA_IMAGE:-grafana/grafana:12.3.1}) so the
service uses the latest supported/stable Grafana release and can be updated
easily.
- Line 5: Update the Prometheus image tag in docker-compose.monitoring.yml:
replace the outdated image string "prom/prometheus:v2.45.0" with the current
stable release "prom/prometheus:3.9.1" (ensuring any related service name or
labels referencing Prometheus remain unchanged), then redeploy to pick up the
security-fixed version.
♻️ Duplicate comments (1)
docker-compose.monitoring.yml (1)
35-36: The default admin password security issue was discussed in an earlier review, and it was confirmed that the production environment is managed separately.
…tps://github.com/TEAM-Cherrish/Cherrish-Server into 129-feature/prometheus-grafana-monitoring-system
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Around line 1-25: Add explicit CPU and memory limits for the prometheus
service to prevent it from exhausting host resources: update the prometheus
service (service name "prometheus") to include resource constraints—for Docker
Compose v3 use deploy.resources.limits with cpu and memory (e.g., cpu: "1.0",
memory: "2G"), and if supporting older Compose formats add equivalent mem_limit
and cpus entries—so the container has bounded memory/CPU while retaining the
existing command, volumes, healthcheck, networks, and restart settings.
- Around line 27-45: The Grafana service lacks resource limits; update the
grafana service block (service named "grafana", image "grafana/grafana:11.6.9")
to include resource constraints by adding deploy.resources.limits (e.g., cpu and
memory) and deploy.resources.reservations to cap and reserve CPU/memory for the
container; if using plain docker-compose (non-swarm) add equivalent
mem_limit/cpu_shares or use compose v2/3 fields appropriate for your setup so
Grafana cannot exhaust host resources.
```yaml
services:
  prometheus:
    image: prom/prometheus:v3.5.1
    container_name: cherrish-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring
    restart: unless-stopped
```
🧹 Nitpick | 🔵 Trivial
Adding resource limits is recommended for production stability.
Without memory/CPU limits on the Prometheus container, host resources can be consumed excessively as metric data grows; combined with the 15-day retention in particular, memory usage can climb gradually.
♻️ Example: adding resource limits

```diff
 prometheus:
   image: prom/prometheus:v3.5.1
   container_name: cherrish-prometheus
+  deploy:
+    resources:
+      limits:
+        memory: 2G
+        cpus: '1.0'
+      reservations:
+        memory: 512M
   ports:
     - "9090:9090"
```
```yaml
  grafana:
    image: grafana/grafana:11.6.9
    container_name: cherrish-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
      - DISCORD_MONITORING_WEBHOOK_URL=${DISCORD_MONITORING_WEBHOOK_URL}
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      prometheus:
        condition: service_healthy
    restart: unless-stopped
```
🧹 Nitpick | 🔵 Trivial
Adding resource limits to the Grafana container is recommended as well.
Grafana's resource usage can also grow with dashboard complexity and the number of concurrent users.
♻️ Example: adding resource limits

```diff
 grafana:
   image: grafana/grafana:11.6.9
   container_name: cherrish-grafana
+  deploy:
+    resources:
+      limits:
+        memory: 512M
+        cpus: '0.5'
+      reservations:
+        memory: 128M
   ports:
     - "3000:3000"
```
🛠 Related issue 🛠
✏️ Work Description ✏️
Phase 1: Expose application metrics

- Add the `micrometer-registry-prometheus` dependency to `build.gradle`
- Create the `application-monitoring.yaml` profile (shared metrics settings)
- Include the monitoring profile in `application.yaml`
- Configure a separate Actuator port (8081) in `application-prod.yaml`
- Clean up duplicated settings in `application-dev.yaml`

Phase 2: Build the monitoring infrastructure

- Write `docker-compose.monitoring.yml` (Prometheus + Grafana)
- Configure `monitoring/prometheus/prometheus.yml`
- Provision datasources and dashboards under `monitoring/grafana/provisioning/`

Phase 3: Build the Grafana dashboard

Phase 4: Set up alerting

Misc

- Add `MonitoringTestController` (local/dev profiles only, for testing)

📸 Screenshot 📸

😅 Uncompleted Tasks 😅

- Set the `DISCORD_MONITORING_WEBHOOK_URL` environment variable

📢 To Reviewers 📢

- `application-monitoring.yaml` is newly added and is applied to every environment via `profiles.include`
- `MonitoringTestController` is disabled in production via `@Profile({"local", "dev"})`
- To run the monitoring stack: `DISCORD_MONITORING_WEBHOOK_URL=... docker-compose -f docker-compose.monitoring.yml up -d`