@coderloganli
Owner

Summary

This PR adds comprehensive Prometheus monitoring to GoChat, instrumenting all services with metrics collection for QPS, latency, connections, and RPC calls. The implementation includes:

  • HTTP metrics: Request count, duration, in-flight requests by service/method/path
  • RPC metrics: Server/client request tracking, duration histograms, in-flight requests
  • Connection metrics: Active connections and total connection counts by type (WebSocket/TCP)
  • Message metrics: Message send/receive counters
  • User operation metrics: Login/register operation tracking
  • Redis metrics: Operation counts by command type

Key Changes

New Packages

  • pkg/metrics/metrics.go - Centralized Prometheus metric definitions (147 lines)
  • pkg/metrics/server.go - Standalone metrics HTTP server for RPC services (42 lines)
  • pkg/middleware/prometheus.go - Gin middleware for HTTP request instrumentation (35 lines)
  • pkg/middleware/rpcx.go - RPCX plugin for RPC server instrumentation using goroutine ID tracking (68 lines)

Service Instrumentation

  • API Service (api/) - Added Prometheus middleware and /metrics endpoint, instrumented user operations
  • Logic Service (logic/) - Added metrics server on port 9091, instrumented RPC server and Redis operations
  • Connect Services (connect/) - Metrics servers on ports 9092 (WebSocket) and 9093 (TCP), connection and message tracking
  • Task Service (task/) - Metrics server on port 9094, RPC client metrics for calls to Connect services
  • Site Service (site/) - Added /metrics endpoint on port 8080

Infrastructure

  • deployments/prometheus/prometheus.yml - Prometheus scrape configuration for all services (51 lines)
  • docker-compose.yml - Added Prometheus container, exposed metrics ports (9091-9094, 8080)
  • Updated all services to expose metrics endpoints

Dependency Management

  • Added github.com/prometheus/client_golang v1.14.0
  • Removed vendor directory (3,470 files) - now using Go modules exclusively
  • Updated .gitignore to exclude vendor/

Merged Changes from Master

  • Upgraded to Go 1.23 (from Go 1.18)
  • Updated Dockerfile to use golang:1.23 and debian:bookworm-slim
  • Updated CI/CD workflow to Go 1.23 with branch name sanitization
  • Fixed deprecated ioutil.ReadFile → os.ReadFile
  • Updated netcat package to netcat-openbsd

Metrics Exposed

HTTP Metrics

  • gochat_http_requests_total - Counter by service, method, path, status
  • gochat_http_request_duration_seconds - Histogram by service, method, path
  • gochat_http_requests_in_flight - Gauge by service

RPC Metrics

  • gochat_rpc_server_requests_total - Counter by service, method, status
  • gochat_rpc_server_duration_seconds - Histogram by service, method
  • gochat_rpc_server_requests_in_flight - Gauge by service
  • gochat_rpc_client_requests_total - Counter by service, method, status
  • gochat_rpc_client_duration_seconds - Histogram by service, target, method

Connection Metrics

  • gochat_connections_active - Gauge by service, type (websocket/tcp)
  • gochat_connections_total - Counter by service, type, status

Message Metrics

  • gochat_messages_total - Counter by service, direction (sent/received)

Application Metrics

  • gochat_user_operations_total - Counter by operation (login/register), status
  • gochat_redis_operations_total - Counter by service, command

Metrics Endpoints

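Each service exposes metrics on the following endpoints:

  • API Service - http://localhost:7070/metrics
  • Logic Service - http://localhost:9091/metrics
  • Connect Service (WebSocket) - http://localhost:9092/metrics
  • Connect Service (TCP) - http://localhost:9093/metrics
  • Task Service - http://localhost:9094/metrics
  • Site Service - http://localhost:8080/metrics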

Testing

Start Services

docker compose up -d

Verify Metrics

# Check individual service metrics
curl http://localhost:7070/metrics | grep gochat
curl http://localhost:9091/metrics | grep gochat
curl http://localhost:9092/metrics | grep gochat

# Access Prometheus UI
open http://localhost:9090

Example Queries

In Prometheus UI, try these queries:

# HTTP request rate by service
rate(gochat_http_requests_total[5m])

# P99 HTTP latency
histogram_quantile(0.99, rate(gochat_http_request_duration_seconds_bucket[5m]))

# Active connections
gochat_connections_active

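# RPC call success ratio by service (successful calls / all calls)
sum(rate(gochat_rpc_server_requests_total{status="success"}[5m])) by (service)
  / sum(rate(gochat_rpc_server_requests_total[5m])) by (service)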

Test User Registration

# Register a user and check metrics
curl -X POST http://localhost:7070/user/register \
  -H "Content-Type: application/json" \
  -d '{"userName": "test", "password": "test123"}'

# Check user operation metrics
curl http://localhost:7070/metrics | grep gochat_user_operations_total

Bug Fixes

During implementation, a critical bug in the RPCX plugin was fixed: PreCall returned the modified context in place of the original args, which caused reflection errors:

  • Error: reflect: Call using *context.valueCtx as type *proto.RegisterRequest
  • Fix: Changed to use sync.Map with goroutine ID for timing storage, returning args unchanged (see pkg/middleware/rpcx.go:40-47)
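The fix described above can be sketched as a timing store keyed by goroutine ID, so that the pre-call hook can leave args untouched and the post-call hook can still find the start time. This is a minimal illustration under stated assumptions, not the code in pkg/middleware/rpcx.go: `goroutineID`, `callTimer`, `Begin`, and `End` are hypothetical names, and the ID is obtained by parsing `runtime.Stack` output, since Go exposes no official goroutine ID API.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
	"time"
)

// goroutineID parses the current goroutine's ID from the first line of its
// stack trace, which has the form "goroutine 123 [running]:".
func goroutineID() uint64 {
	buf := make([]byte, 64)
	buf = buf[:runtime.Stack(buf, false)]
	buf = bytes.TrimPrefix(buf, []byte("goroutine "))
	if i := bytes.IndexByte(buf, ' '); i >= 0 {
		buf = buf[:i]
	}
	id, _ := strconv.ParseUint(string(buf), 10, 64)
	return id
}

// callTimer stores one start time per goroutine.
type callTimer struct {
	starts sync.Map // goroutine ID (uint64) -> time.Time
}

// Begin records the start time for the calling goroutine.
func (c *callTimer) Begin() {
	c.starts.Store(goroutineID(), time.Now())
}

// End returns the elapsed time since Begin on the same goroutine and removes
// the entry. It reports false if Begin was not called on this goroutine.
func (c *callTimer) End() (time.Duration, bool) {
	v, ok := c.starts.LoadAndDelete(goroutineID())
	if !ok {
		return 0, false
	}
	return time.Since(v.(time.Time)), true
}

func main() {
	var timer callTimer
	timer.Begin()
	time.Sleep(10 * time.Millisecond) // stands in for the RPC handler
	if elapsed, ok := timer.End(); ok {
		fmt.Println("handler took", elapsed)
	}
}
```

The approach only works when the pre-call and post-call hooks run on the same goroutine, and an entry leaks if the post-call hook never runs, so it trades some robustness for not having to change the hook's return values.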

Files Changed

Modified: 23 files (+568, -54)

  • Infrastructure: .github/workflows/ci-cd.yml, Makefile, docker-compose.yml, docker/Dockerfile
  • Dependencies: go.mod, go.sum, .gitignore
  • Services: api/, connect/, logic/, task/, site/
  • New packages: pkg/metrics/, pkg/middleware/
  • Configuration: deployments/prometheus/prometheus.yml

🤖 Generated with Claude Code

LockGit and others added 25 commits April 1, 2023 15:10
2. Add build files and commands

Signed-off-by: zjzjzjzj1874 <zjzjzjzj1874@gmail.com>
- Refactor from single-container to 8 independent services
- Enable horizontal scaling for Logic, Connect, Task, and API services
- Add automatic container IP registration to etcd for RPC communication
- Implement health checks and dependency management
- Create separate dev/prod Docker Compose configurations
- Add Makefile commands for easy deployment
- Update documentation with quick start guide

Improvements over original:
- 44MB runtime images (vs 1.93GB single container)
- Independent service scaling
- Process isolation and resource limits
- Health-based dependency management
- Production-ready deployment

Original project: https://github.com/LockGit/gochat
- Removed docker/dev/ and docker/prod/ (Supervisord configs)
- Removed run.sh and reload.sh (legacy deployment scripts)
- Updated README.md to remove legacy deployment section

The application now uses Docker Compose multi-container deployment exclusively.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ection

Replaced shell script preprocessing with clean Go implementation:

- Added GetContainerIP() and GetServiceAddress() to tools/network.go
- Updated logic/publish.go to auto-detect container IP for etcd registration
- Updated connect/rpc.go to auto-detect container IP for etcd registration
- Updated Dockerfile to remove entrypoint.sh and use direct CMD
- Updated docker-compose.yml to use direct command without environment preprocessing
- Removed docker/entrypoint.sh

This approach is cleaner and more maintainable:
- No shell preprocessing required
- Services auto-detect their container IP at runtime
- Proper multi-container deployment with correct service discovery
- All services register with actual container IPs (e.g., tcp@172.28.0.4:6900)

Tested and verified:
- All services start successfully
- Services register correctly in etcd with container IPs
- Multi-container deployment fully functional

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented comprehensive CI/CD pipeline for automated testing, building, and deployment:

**CI/CD Workflow (.github/workflows/ci-cd.yml):**
- Test job: go fmt, go vet, tests with Redis service container
- Build job: Docker image with layer caching
- Push job: Push to Docker Hub with multiple tags (latest, branch, git-sha)
- Deploy jobs: Optional deployment to dev/staging/prod environments (commented out)

**Testing Infrastructure:**
- Makefile: Added test, test-coverage, test-unit, test-integration targets
- Makefile: Added fmt, fmt-check, vet, lint for code quality
- Makefile: Added build-binary, build-image for building
- docker-compose.test.yml: Test environment with Redis and etcd
- docker/Dockerfile.test: Test runner Dockerfile

**Deployment:**
- scripts/deploy.sh: Manual deployment script for all environments
- config/staging/: Staging configuration (copied from dev)
- docker-compose.staging.yml: Staging environment overrides

**Documentation:**
- README.md: Added CI/CD Pipeline section with setup instructions
- README.md: Added GitHub Secrets configuration guide
- README.md: Added branch strategy and manual deployment commands
- CHANGELOG.md: Documented all CI/CD pipeline changes

**Removed:**
- .travis.yml: Replaced with GitHub Actions

**Branch Strategy:**
- dev → Development environment (auto-deploy)
- staging → Staging environment (auto-deploy)
- master → Production environment (manual approval)

**Image Tags:**
- latest - Latest from master
- <branch> - Latest from branch
- <git-sha> - Specific commit for rollback

GitHub Secrets required: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
Optional: Server SSH credentials for deployment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed os.Signal channels from unbuffered to buffered (capacity 1).
This fixes the error: 'misuse of unbuffered os.Signal channel as argument to signal.Notify'

Fixed in:
- main.go: Line 44
- api/chat.go: Line 55
Fixed linting error: 'logrus.Error call has possible formatting directive %s'
Changed to logrus.Errorf() to properly support format string in db/db.go:40
Fixed all logrus logging format issues:
- connect/server.go:71 - Changed logrus.Warn to logrus.Warnf
- connect/rpc.go:115 - Changed logrus.Info to logrus.Infof
- api/chat.go:63 - Added missing format directive %v to logrus.Errorf

Ran go fmt ./... to format all code.
Verified with go vet ./... - all checks pass.

Tested locally before pushing.
Moved environment-specific docker-compose files:
- docker-compose.dev.yml → deployments/
- docker-compose.prod.yml → deployments/
- docker-compose.staging.yml → deployments/
- docker-compose.test.yml → deployments/

Updated references in:
- Makefile (all compose-* targets)
- scripts/deploy.sh
- README.md (Quick Start section)

This improves project organization by separating deployment configs from root directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
… support

Problem:
- Integration tests were failing in CI (TCP server not running)
- Tests had infinite loops and poor error handling
- No separation between fast unit tests and slow integration tests

Solution:
Implemented proper test separation with two-stage CI pipeline:

**Unit Tests (Stage 1 - Fast)**:
- Run with `go test -short` flag
- Integration tests check `testing.Short()` and skip
- No service dependencies required
- Runs in ~2 seconds

**Integration Tests (Stage 2 - Full Stack)**:
- Start all services via docker-compose
- Run `go test` from host against exposed container ports
- Tests connect to localhost:7001 (TCP), localhost:6379 (Redis)
- Integration tests execute fully (not skipped!)

Changes:
- pkg/stickpackage/stickpackage_test.go:
  * Added `testing.Short()` check to skip in unit test stage
  * Removed infinite loops, added 10s timeout
  * Proper error handling with t.Fatalf/t.Logf
  * Connection timeout with graceful fallback

- task/queue_test.go:
  * Added `testing.Short()` check
  * Better error handling for empty queue

- deployments/docker-compose.test.yml:
  * Simplified to expose ports (2379, 6379, 7001, 7002)
  * Tests run from host, not in container
  * Removed test-runner service

- .github/workflows/ci-cd.yml:
  * Split "test" job into "unit-test" and "integration-test"
  * unit-test: runs `go test -short` (fast)
  * integration-test: starts services, runs `go test` (full)
  * build: depends on BOTH test jobs passing

- Makefile:
  * test-unit: `go test -short`
  * test-integration: starts services, runs tests, stops services

- Removed docker/Dockerfile.test (no longer needed)

CI Pipeline Flow:
```
Unit Tests → Integration Tests → Build → Push → Deploy
   (2s)          (~2min)         (3min)
```

This ensures:
✓ Fast feedback from unit tests
✓ Integration tests actually RUN (not skipped) in separate stage
✓ Multi-container deployment is properly tested
✓ No test hangs or infinite loops

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Problem:
GitHub Actions runners use Docker Compose V2 which uses the command
`docker compose` (without hyphen) instead of legacy `docker-compose`.
Integration tests were failing with: "docker-compose: command not found"

Changes:
- .github/workflows/ci-cd.yml: Updated all docker-compose → docker compose
- Makefile: Updated all compose commands (dev, prod, scale, logs, ps, clean, test-integration)
- scripts/deploy.sh: Updated deployment commands
- README.md: Updated example commands in documentation
- deployments/docker-compose.test.yml: Updated usage comment

Docker Compose V2 is the modern version bundled with Docker Desktop
and available by default in GitHub Actions runners.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Merged changes from master:
- Upgraded to Go 1.23
- Updated dependencies to latest versions
- Removed vendor directory (using Go modules)

Prometheus monitoring additions:
- Added metrics server endpoints on ports 9091-9094 and 8080
- Instrumented API service with HTTP metrics and user operation tracking
- Instrumented Logic service with RPC server metrics
- Instrumented Connect services (WebSocket/TCP) with metrics servers
- Added Prometheus service to docker-compose.yml
- Created Prometheus configuration for scraping all services

All conflicts resolved. Ready for deployment.
Logan added 3 commits December 31, 2025 17:59
- Reorder imports in logic/publish.go to follow gofmt standards
- Mark github.com/google/uuid as indirect dependency
- Update google.golang.org/protobuf to v1.28.1
- Removed 'promMetrics' import that was not being used
- Verified locally: all tests pass, build succeeds, formatting correct
- site/site.go: Added /metrics endpoint to HTTP mux
- task/task.go: Added metrics server on port 9094

These were missing after the merge with master. Verified locally:
- Build successful
- All tests pass (task: 7.247s)
- Formatting correct
@coderloganli coderloganli merged commit 32b5b82 into master Dec 31, 2025
4 checks passed
7 participants