
[Feature] Build Prometheus + Grafana Monitoring System #138

Open
Kimgyuilli wants to merge 29 commits into develop from 129-feature/prometheus-grafana-monitoring-system

Conversation

@Kimgyuilli
Contributor

🛠 Related issue 🛠

✏️ Work Description ✏️

  • Phase 1: Expose application metrics

    • Add the micrometer-registry-prometheus dependency to build.gradle
    • Create the application-monitoring.yaml profile (shared metrics settings; a sketch of this kind of profile follows this list)
    • Include the monitoring profile in application.yaml
    • Configure a separate Actuator port (8081) in application-prod.yaml
    • Clean up duplicated settings in application-dev.yaml
  • Phase 2: Build the monitoring infrastructure

    • Write docker-compose.monitoring.yml (Prometheus + Grafana)
    • Configure monitoring/prometheus/prometheus.yml
    • Provision datasources and dashboards under monitoring/grafana/provisioning/
  • Phase 3: Build the Grafana dashboard

    • JVM metric panels (Heap Memory, GC Pause, Threads, CPU Usage)
    • HTTP request panels (Request Rate, Error Rate, Response Time Percentiles, Throughput)
    • Store the dashboard JSON for provisioning
  • Phase 4: Alerting

    • Configure the Discord contact point
    • Configure the alert rules (High Error Rate, High Latency)
    • Configure the notification policy
    • Customize the alert message format
  • Misc

    • Add MonitoringTestController (local/dev profiles only, for testing)
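For reference, a minimal sketch of what a shared Actuator/Prometheus profile of this kind typically looks like. The actual contents of application-monitoring.yaml are not shown in this PR, so the property values below are illustrative of a standard Spring Boot setup rather than the file itself:

management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus   # expose only what monitoring needs
  metrics:
    tags:
      application: cherrish-server          # common tag on all metrics; the value here is assumed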

📸 Screenshot 📸

Description | Image
Grafana dashboard | image
Discord alert example | image

😅 Uncompleted Tasks 😅

  • The DISCORD_MONITORING_WEBHOOK_URL environment variable must be set when deploying to production
  • Configure the Security Group (allow port 8081 from the Prometheus server only)
  • Add an IP restriction for Actuator once Spring Security is introduced

📢 To Reviewers 📢

  • application-monitoring.yaml is newly added and is applied to every environment via profiles.include
  • In production, Actuator is split onto a separate port (8081) to prevent external exposure
  • MonitoringTestController is disabled in production via @Profile({"local", "dev"})
  • To run the monitoring stack: DISCORD_MONITORING_WEBHOOK_URL=... docker-compose -f docker-compose.monitoring.yml up -d

@Kimgyuilli Kimgyuilli requested a review from ssyoung02 on January 22, 2026 at 10:18
@Kimgyuilli Kimgyuilli self-assigned this Jan 22, 2026
@Kimgyuilli Kimgyuilli added the ✨ Feature (feature development) and 규일🍊 (work assigned to Gyuil) labels Jan 22, 2026
@Kimgyuilli Kimgyuilli linked an issue Jan 22, 2026 that may be closed by this pull request
4 tasks
@coderabbitai

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Adds a Prometheus/Grafana-based monitoring stack and the Micrometer Prometheus registry dependency; adjusts the application's Actuator/metrics settings and profiles; and adds the Docker Compose monitoring stack, Grafana provisioning (datasources, dashboards, alerting), the Prometheus configuration, and a local/dev-only monitoring test controller.

Changes

Cohort / File(s) | Change summary
Build · application configuration
build.gradle, src/main/resources/application.yaml, src/main/resources/application-dev.yaml, src/main/resources/application-monitoring.yaml, src/main/resources/application-prod.yaml
Adds the io.micrometer:micrometer-registry-prometheus dependency; includes the monitoring profile; exposes Actuator/Prometheus/metrics and changes the management port and settings
Docker Compose (monitoring stack)
docker-compose.monitoring.yml
Defines the Prometheus and Grafana services (images, ports, volumes, networks, environment variables, healthcheck, depends_on)
Prometheus configuration
monitoring/prometheus/prometheus.yml, monitoring/prometheus/prometheus.prod.yml
Sets the global scrape/evaluation interval to 15s; adds prometheus and cherrish-server scrape targets (dev: host.docker.internal:8080, prod: cherrish-server:8081)
Grafana provisioning - datasources and dashboards
monitoring/grafana/provisioning/datasources/datasource.yml, monitoring/grafana/provisioning/dashboards/dashboard.yml, monitoring/grafana/provisioning/dashboards/json/cherrish-overview.json
Configures Prometheus as the default, non-editable datasource; adds dashboard provisioning and the cherrish-overview JSON (multiple JVM/HTTP/performance panels)
Grafana provisioning - alerting
monitoring/grafana/provisioning/alerting/contactpoints.yml, monitoring/grafana/provisioning/alerting/policies.yml, monitoring/grafana/provisioning/alerting/rules.yml
Adds the Discord webhook contact point, an alert grouping policy, and rules for high error rate, high latency, and metrics collection down
Monitoring test endpoints
src/main/java/com/sopt/cherrish/global/monitoring/MonitoringTestController.java
Adds local/dev-only test endpoints: /api/monitoring/test/error (throws an exception) and /api/monitoring/test/slow (2-second delay)

Sequence Diagram(s)

sequenceDiagram
  participant App as "Spring App\n(com.sopt.cherrish)"
  participant Prom as "Prometheus\n(prom/prometheus:9090)"
  participant Graf as "Grafana\n(grafana:3000)"
  participant Discord as "Discord\n(Webhook)"

  App->>Prom: exposes /actuator/prometheus (scrape target)
  Prom->>Prom: scrape (15s interval)
  Graf->>Prom: dashboard/alert queries (evaluation)
  Graf->>Graf: evaluate alert rules (Error Rate, High Latency, Metrics Down)
  Graf-->>Discord: send alerts via Discord webhook

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • ssyoung02
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title '[Feature] Prometheus + Grafana 모니터링 시스템 구축' clearly and concisely summarizes the main objective of implementing a Prometheus + Grafana monitoring system.
Description check ✅ Passed The pull request description is comprehensive and directly related to the changeset, detailing all four phases of implementation, affected files, configuration changes, and additional context for reviewers.
Linked Issues check ✅ Passed All coding objectives from issue #129 and its sub-issues (#130-#133) are met: metrics exposure via Micrometer (Phase 1), monitoring infrastructure with docker-compose/Prometheus/Grafana provisioning (Phase 2), JVM/HTTP dashboards (Phase 3), and Discord alerting (Phase 4).
Out of Scope Changes check ✅ Passed All changes are directly aligned with the monitoring system implementation scope. MonitoringTestController and profile configurations are necessary supporting changes, not out-of-scope additions.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Around lines 27-28: The compose file ships a dangerous default admin password via GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}. Remove the hardcoded fallback and require an explicit secret by changing it to GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD} (and consider the same for GF_SECURITY_ADMIN_USER), and add startup validation or a container healthcheck that fails fast when GRAFANA_PASSWORD is empty, so deployments without a strong password do not start.
- Around lines 36-37: Make Grafana wait for Prometheus to be healthy rather than merely started: add a healthcheck block to the Prometheus service that reliably verifies readiness, and change Grafana's depends_on from the simple list form to the condition form referencing prometheus with condition: service_healthy. Choose a healthcheck command, interval, and retry count appropriate for Prometheus readiness. A sketch of both fixes follows below.
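A minimal sketch of both fixes together, assuming the service names used in this PR (image tags omitted; the ${VAR:?message} form is standard Compose syntax that aborts startup when the variable is unset or empty):

services:
  prometheus:
    healthcheck:
      # Prometheus serves a readiness endpoint at /-/healthy
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 10s
      timeout: 5s
      retries: 3
  grafana:
    environment:
      # no ':-admin' fallback: Compose refuses to start without an explicit secret
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:?GRAFANA_PASSWORD must be set}
    depends_on:
      prometheus:
        condition: service_healthy   # wait until healthy, not merely started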

In `@monitoring/grafana/provisioning/alerting/rules.yml`:
- Line 7: The group evaluation interval of interval: 1m, combined with the 5m range in the PromQL queries, can cause overlapping evaluations. Review the 'interval' setting in the alerting rules and either increase it (e.g., interval: 2m) or shorten the query range (e.g., 1m) to balance sensitivity against resource usage. The entry to change is 'interval' in provisioning/alerting/rules.yml; after changing it, test the alert frequency and check for duplicate evaluations.
- Around lines 56-57: Change the alert rules so that missing metrics don't silently show OK: update noDataState (currently set to OK) to a non-OK state (e.g., NoData or Alerting) for the rules that define noDataState and execErrState, and/or add a dedicated "Metrics Collection Health" alert that monitors up{job="cherrish"} so failed scrapes trigger alerts. Modify the entries referencing noDataState and execErrState in rules.yml and add the up{job="cherrish"} rule; a sketch of such a rule follows below.
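A rough sketch of such a rule in Grafana's file-provisioning format (the group name, folder, UID, and datasource UID are placeholders; the job label must match the job_name in prometheus.yml, which is 'cherrish-server' in this PR):

apiVersion: 1
groups:
  - orgId: 1
    name: cherrish-health            # placeholder group name
    folder: Cherrish                 # placeholder folder
    interval: 1m
    rules:
      - uid: cherrish-metrics-down   # placeholder UID
        title: Metrics Collection Down
        condition: A
        data:
          - refId: A
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: prometheus            # must match the provisioned datasource UID
            model:
              # '== bool 0' returns 1 when the target is down, so the rule fires
              expr: up{job="cherrish-server"} == bool 0
              instant: true
        noDataState: Alerting        # missing data raises an alert instead of showing OK
        execErrState: Error
        for: 2m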

In `@monitoring/grafana/provisioning/dashboards/dashboard.yml`:
- Around line 1-11: Change the provider block in dashboard.yml to set
disableDeletion: true to prevent accidental removal of provisioned dashboards
(look for the providers list and the entry with name: 'Cherrish Dashboards'),
and consider increasing updateIntervalSeconds from 30 to a higher value (e.g.,
300) for production; ensure options.path remains
/etc/grafana/provisioning/dashboards/json and keep orgId, folder and type
unchanged.
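A sketch of the adjusted provider block, keeping the fields the comment says to leave unchanged (the folder value is shown empty as a placeholder for whatever the original file uses):

apiVersion: 1
providers:
  - name: 'Cherrish Dashboards'
    orgId: 1
    folder: ''                   # keep as in the original file
    type: file
    disableDeletion: true        # provisioned dashboards can no longer be deleted from the UI
    updateIntervalSeconds: 300   # poll less aggressively in production
    options:
      path: /etc/grafana/provisioning/dashboards/json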

In `@monitoring/grafana/provisioning/dashboards/json/cherrish-overview.json`:
- Line 501: The dashboard's refresh interval is set too aggressively via the
JSON key "refresh": "5s"; update the "refresh" value in cherrish-overview.json
from "5s" to a less frequent interval such as "30s" or "1m" to reduce load on
Prometheus/Grafana and avoid unnecessary scraping; locate the "refresh" entry
(currently "refresh": "5s") and replace it with the chosen value, then validate
the JSON and reload the dashboard.
- Around lines 47-54: Make the threshold "steps" values match each panel's "unit" instead of the blanket 80: for the "CPU Usage" panel (unit "percentunit", max 1), change the red threshold from 80 to 0.8; for the "JVM Heap Memory" panel (unit "bytes"), either convert the 80 to a byte value (e.g., 0.8 x the configured max heap in bytes) or switch the panel to a percentage unit with a threshold of 0.8; for the "GC Pause Time" panel (unit "s"), replace 80 with a realistic seconds threshold (e.g., 0.5 for 500 ms). Edit the "thresholds.steps" and/or "unit" properties under each panel accordingly.

In `@monitoring/prometheus/prometheus.yml`:
- Around line 10-16: The current Prometheus scrape job for job_name
'cherrish-server' hardcodes a local Docker target and leaves production config
commented out; instead, make the target configurable per environment by either
(a) splitting prometheus.yml into environment-specific files and loading the
correct one during deployment, or (b) parameterizing the static_configs target
using environment variables (referencing the job_name 'cherrish-server' and
metrics_path '/actuator/prometheus') so the target host and port come from
CHERRISH_SERVER_HOST/CHERRISH_SERVER_PORT (with sensible defaults) and remove
manual comment toggling.
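One caveat for option (b): Prometheus does not expand environment variables in its configuration file, so the substitution has to happen before startup, for example by rendering a template with envsubst in the container entrypoint. A minimal sketch under that assumption (the .tmpl file name and the variable names proposed above are illustrative):

# prometheus.yml.tmpl - render with: envsubst < prometheus.yml.tmpl > prometheus.yml
scrape_configs:
  - job_name: 'cherrish-server'
    metrics_path: '/actuator/prometheus'
    static_configs:
      # CHERRISH_SERVER_HOST / CHERRISH_SERVER_PORT come from the environment
      - targets: ['${CHERRISH_SERVER_HOST}:${CHERRISH_SERVER_PORT}']

Defaults can be supplied by the shell before rendering, e.g. CHERRISH_SERVER_HOST=${CHERRISH_SERVER_HOST:-host.docker.internal}.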

In `@src/main/resources/application-monitoring.yaml`:
- Around line 9-10: 관리용 Prometheus 접근 속성인 management.endpoint.prometheus.access에
현재 값이 비어 있으니 의도한 접근 수준에 따라 명시적으로 'none', 'read-only' 또는 'unrestricted' 중 하나를
할당하거나 기본 동작을 사용하려면 해당 속성 라인(access:)을 삭제하세요; 설정 변경은 application-monitoring.yaml의
management.endpoint.prometheus.access 항목을 찾아 적용하세요.
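For example, to state the intent explicitly (read-only is the usual choice for a scrape-only endpoint; this is a sketch, not the project's decided value):

management:
  endpoint:
    prometheus:
      access: read-only   # explicit; valid values are none, read-only, unrestricted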

In `@src/main/resources/application-prod.yaml`:
- Around line 12-17: The management endpoints are exposed without proper auth;
update production config or add security: either implement Spring Security to
protect management endpoints and enforce IP-based access (configure access
control for Prometheus via prometheus.access and secure endpoints under
management.* and endpoint.health.show-details), or remove the monitoring profile
from production by ensuring the "monitoring" profile is not active in production
application-prod.yaml and keep management.server.port: 8081 while
disabling/promoting safe defaults for management.endpoint.* to avoid
unauthenticated metric/health exposure.
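A sketch of what the hardened production block could look like under the second option (the property names are standard Spring Boot; the exact exposure list is an assumption):

management:
  server:
    port: 8081                        # keep Actuator off the public service port
  endpoints:
    web:
      exposure:
        include: health, prometheus   # expose only what Prometheus scrapes
  endpoint:
    health:
      show-details: never             # don't leak component details unauthenticated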


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@monitoring/grafana/provisioning/alerting/policies.yml`:
- Around line 3-10: Current group_by uses only grafana_folder which can merge
unrelated alerts; update the policies block so group_by includes a distinct rule
identifier (e.g., add alertname or the rule UID label) in addition to
grafana_folder to ensure alerts are grouped per rule. Locate the policies entry
and modify the group_by array to include "alertname" (or your rule UID label)
alongside "grafana_folder", keeping group_wait/group_interval/repeat_interval
as-is.
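A sketch of the adjusted policy (the receiver name is assumed to match the Discord contact point defined in contactpoints.yml; the timing fields are omitted since they stay as-is):

apiVersion: 1
policies:
  - orgId: 1
    receiver: discord      # assumed contact point name
    group_by:
      - grafana_folder
      - alertname          # group per rule so unrelated alerts don't merge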

In `@monitoring/prometheus/prometheus.prod.yml`:
- Around line 10-13: The Prometheus scrape job for job_name 'cherrish-server'
uses a static target 'cherrish-server:8081' which may not resolve in production;
verify that the DNS name 'cherrish-server' resolves in your production runtime
(Kubernetes/Docker Swarm) and if not, switch this scrape config to use the
appropriate service discovery (e.g., kubernetes_sd_configs or
docker_sd_configs), a fully qualified domain name, or an environment-specific
variable so it matches how the service is exposed in production; check related
app config in application-prod.yaml to ensure port 8081 matches and update the
prometheus scrape target or discovery method accordingly.
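For a plain Docker deployment, one way to make the name resolve is to attach the application container to the network the monitoring stack uses, under that service name. A sketch (the app's image name and the external network name are assumptions; Compose normally prefixes network names with the project name):

# In the application's own compose file (illustrative)
services:
  cherrish-server:                  # the service name doubles as the DNS name
    image: cherrish/server:latest   # placeholder image
    networks:
      - monitoring
networks:
  monitoring:
    external: true                  # the network created by docker-compose.monitoring.yml
    name: monitoring                # adjust to the actual (prefixed) network name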

In `@monitoring/prometheus/prometheus.yml`:
- Around line 10-13: The Prometheus scrape target uses host.docker.internal:8080
which may fail on Linux because docker-compose.monitoring.yml lacks the
extra_hosts mapping; update the prometheus service in
docker-compose.monitoring.yml to include extra_hosts:
["host.docker.internal:host-gateway"] so host.docker.internal resolves, or
alternatively change the target in monitoring/prometheus/prometheus.yml for
job_name 'cherrish-server' to use the internal service hostname (e.g.,
cherrish-server:8081 as in prometheus.prod.yml) to ensure cross-platform
scraping works.
♻️ Duplicate comments (1)
monitoring/grafana/provisioning/dashboards/json/cherrish-overview.json (1)

407-415: The response-time panel's threshold is excessive for its unit

On the response-time panel with unit: "s", the red threshold is set to 80s, far off any realistic SLO. Even though thresholdsStyle is currently off, it would give a misleading signal if enabled later, so adjust it to a reasonable SLO value (e.g., around 0.5 to 1s).

♻️ Suggested fix
-              { "color": "red", "value": 80 }
+              { "color": "red", "value": 0.5 }


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Line 16: The '--web.enable-lifecycle' flag remaining in the compose file can expose the /-/reload and /-/quit endpoints without authentication in production, so it should be removed or disabled per environment. To fix: delete the '--web.enable-lifecycle' entry from docker-compose.monitoring.yml for production builds/deployments, or gate it behind an environment variable such as PROM_ENABLE_LIFECYCLE so it is only set in development (or block access to /-/reload and /-/quit at the deployment layer with a reverse proxy or network policy), and verify the flag and the affected endpoints ('/-/reload', '/-/quit') after applying the change.
- Line 1: The top-level 'version' field in docker-compose.monitoring.yml is ignored by Docker Compose V2, so delete the 'version: '3.8'' entry; remove the top-level version key and leave the service/volumes/networks definitions as they are.
- Line 30: Update the Grafana image tag in the docker-compose service (the line image: grafana/grafana:10.0.0) to the current stable release by replacing 10.0.0 with 12.3.1 (i.e., image: grafana/grafana:12.3.1), or use a configurable variable (e.g., ${GRAFANA_IMAGE:-grafana/grafana:12.3.1}) so the service runs the latest supported stable release and can be updated easily.
- Line 5: Update the Prometheus image tag in docker-compose.monitoring.yml: replace the outdated "prom/prometheus:v2.45.0" with the current stable release "prom/prometheus:3.9.1" (leaving related service names or labels unchanged), then redeploy to pick up the security-fixed version.
♻️ Duplicate comments (1)
docker-compose.monitoring.yml (1)

35-36: The default admin password security issue was discussed in an earlier review, where it was confirmed that the production environment is managed separately.

@Kimgyuilli Kimgyuilli changed the title [Feature] Build Prometheus + Grafana Monitoring System [WIP][Feature] Build Prometheus + Grafana Monitoring System on Jan 22, 2026
@Kimgyuilli Kimgyuilli changed the title [WIP][Feature] Build Prometheus + Grafana Monitoring System [Feature] Build Prometheus + Grafana Monitoring System on Jan 23, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docker-compose.monitoring.yml`:
- Around lines 1-25: Add explicit CPU and memory limits to the prometheus service so it cannot exhaust host resources: with Docker Compose v3, use deploy.resources.limits with cpus and memory (e.g., cpus: "1.0", memory: "2G"); for older Compose formats, add the equivalent mem_limit and cpus entries. Keep the existing command, volumes, healthcheck, networks, and restart settings.
- Around lines 27-45: The Grafana service lacks resource limits; update the grafana service block (service "grafana", image "grafana/grafana:11.6.9") to add deploy.resources.limits (cpu and memory) and deploy.resources.reservations to cap and reserve CPU/memory. With plain docker-compose (non-swarm), use the equivalent mem_limit/cpu_shares or the Compose v2/v3 fields appropriate for the setup so Grafana cannot exhaust host resources.

Comment on lines +1 to +25
services:
  prometheus:
    image: prom/prometheus:v3.5.1
    container_name: cherrish-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring
    restart: unless-stopped


🧹 Nitpick | 🔵 Trivial

Adding resource limits is recommended for production stability.

Without memory/CPU limits, the Prometheus container can consume excessive host resources as metric data grows; combined with the 15-day retention, memory usage can climb gradually.

♻️ Example: adding resource limits
  prometheus:
    image: prom/prometheus:v3.5.1
    container_name: cherrish-prometheus
+   deploy:
+     resources:
+       limits:
+         memory: 2G
+         cpus: '1.0'
+       reservations:
+         memory: 512M
    ports:
      - "9090:9090"

Comment on lines +27 to +45
  grafana:
    image: grafana/grafana:11.6.9
    container_name: cherrish-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
      - DISCORD_MONITORING_WEBHOOK_URL=${DISCORD_MONITORING_WEBHOOK_URL}
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      prometheus:
        condition: service_healthy
    restart: unless-stopped


🧹 Nitpick | 🔵 Trivial

Adding resource limits to the Grafana container is also recommended.

Grafana's resource usage can also grow with dashboard complexity and the number of concurrent users.

♻️ Example: adding resource limits
  grafana:
    image: grafana/grafana:11.6.9
    container_name: cherrish-grafana
+   deploy:
+     resources:
+       limits:
+         memory: 512M
+         cpus: '0.5'
+       reservations:
+         memory: 128M
    ports:
      - "3000:3000"

Contributor

@ssyoung02 ssyoung02 left a comment


Great work!


Labels

✨ Feature (feature development) 규일🍊 (work assigned to Gyuil)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[✨ Feature] Build Prometheus + Grafana Monitoring System

2 participants
