Go-Vault

A high-performance containerized Go application that continuously exports PostgreSQL data to Parquet files with Hive-style partitioning, smart resume capabilities, and enterprise-grade monitoring.

🚀 Features

  • Continuous Export: Automated exports on configurable schedules (minutes to days)
  • Smart Resume: Automatically detects and resumes from the last exported timestamp
  • Hive Partitioning: Industry-standard layout: table=<name>/year=YYYY/month=MM/day=DD/
  • High Performance:
    • Batch processing with configurable batch sizes
    • Optimal Parquet compression (Snappy)
    • Concurrent table exports
    • Memory-efficient streaming
  • Data Integrity:
    • Atomic file writes prevent corruption
    • Timestamp-based filenames for precise tracking
    • Graceful error handling with continuation
  • Operational Excellence:
    • Health check endpoints for container orchestration
    • Comprehensive metrics API
    • Structured logging with operation tracking
    • Graceful shutdown with timeout
  • Production Ready:
    • Docker containerized with multi-stage builds
    • Non-root user execution
    • Resource limits support
    • Volume mounting for data persistence

📋 Table of Contents

  • 🏃 Quick Start
  • 🏗️ Architecture
  • ⚙️ Configuration
  • 📁 Data Organization
  • 📊 Monitoring
  • 🔌 API Reference
  • 🛠️ Development
  • 🚀 Production Deployment
  • 🔧 Troubleshooting
  • 🤝 Contributing
  • 📝 License

🏃 Quick Start

Using Docker Compose (Recommended)

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Configure your PostgreSQL connection in docker-compose.yml:
environment:
  - POSTGRES_HOST=your-postgres-host
  - POSTGRES_DATABASE=your-database
  - POSTGRES_USERNAME=your-username
  - POSTGRES_PASSWORD=your-password
  3. Start the service:
docker-compose up -d
  4. Verify it's running:
# Check health
curl http://localhost:8080/health

# View metrics
curl http://localhost:8080/metrics

Using Docker Directly

# Build the image
docker build -t go-vault .

# Run the container
docker run -d \
  --name go-vault \
  -e POSTGRES_HOST=postgres.example.com \
  -e POSTGRES_PORT=5432 \
  -e POSTGRES_DATABASE=metrics \
  -e POSTGRES_USERNAME=postgres \
  -e POSTGRES_PASSWORD=your-secure-password \
  -e EXPORT_INTERVAL=1h \
  -v $(pwd)/parquet-data:/data \
  -p 8080:8080 \
  --restart unless-stopped \
  go-vault

🏗️ Architecture

Component Overview

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   PostgreSQL    │────▶│   Exporter   │────▶│  Parquet Files  │
│    Database     │     │   Service    │     │  (Hive Format)  │
└─────────────────┘     └──────────────┘     └─────────────────┘
                               │
                               ▼
                        ┌────────────────┐
                        │ Health/Metrics │
                        │   Endpoints    │
                        └────────────────┘

Key Components

  1. Table Discovery: Automatically finds tables with timestamp columns
  2. Resume Logic: Scans existing Parquet files to determine starting point
  3. Batch Processor: Fetches data in configurable batches for memory efficiency
  4. Parquet Writer: Converts PostgreSQL data to compressed Parquet format
  5. Scheduler: Cron-based scheduling system for regular exports
  6. Health Server: Provides monitoring endpoints for operational visibility

⚙️ Configuration

All configuration is done through environment variables:

Variable            Description                    Default     Example
POSTGRES_HOST       PostgreSQL hostname            localhost   db.example.com
POSTGRES_PORT       PostgreSQL port                5432        5432
POSTGRES_DATABASE   Target database name           metrics     production_metrics
POSTGRES_USERNAME   Database username              postgres    readonly_user
POSTGRES_PASSWORD   Database password              (required)  secure-password
DATA_PATH           Directory for Parquet files    /data       /mnt/parquet
EXPORT_INTERVAL     Export frequency               1h          30m, 2h, 24h
HEALTH_PORT         Port for health/metrics API    8080        9090

Export Interval Examples

  • 5m - Every 5 minutes
  • 30m - Every 30 minutes
  • 1h - Every hour (default)
  • 6h - Every 6 hours
  • 24h - Daily

📁 Data Organization

The service uses Hive-style partitioning for optimal query performance:

/data/
├── table=users/
│   ├── year=2025/
│   │   ├── month=01/
│   │   │   ├── day=01/
│   │   │   │   ├── 1735689600000_1735693200000.parquet
│   │   │   │   └── 1735693200000_1735696800000.parquet
│   │   │   └── day=02/
│   │   │       └── 1735776000000_1735779600000.parquet
│   │   └── month=02/
│   │       └── day=01/
│   │           └── 1738454400000_1738458000000.parquet
│   └── year=2024/
│       └── month=12/
│           └── day=31/
│               └── 1735603200000_1735606800000.parquet
└── table=orders/
    └── year=2025/
        └── month=01/
            └── day=01/
                └── 1735689600000_1735696800000.parquet

File Naming Convention

Files are named with Unix timestamps (milliseconds):

  • Format: <start_timestamp>_<end_timestamp>.parquet
  • Example: 1735689600000_1735693200000.parquet
  • This represents data from 2025-01-01 00:00:00 to 2025-01-01 01:00:00 UTC

📊 Monitoring

Health Check Endpoint

GET http://localhost:8080/health

Response:

{
  "status": "healthy"
}

Status codes:

  • 200 OK - Service is healthy
  • 503 Service Unavailable - Service is unhealthy

Metrics Endpoint

GET http://localhost:8080/metrics

Response:

{
  "rows_processed": {
    "users": 1543210,
    "orders": 892341,
    "products": 45678
  },
  "files_created": 156,
  "error_count": 2,
  "last_export_time": "2025-01-22T10:30:00Z",
  "avg_export_duration": "45s",
  "last_errors": [
    "connection timeout to PostgreSQL",
    "disk space insufficient"
  ]
}

Ready Endpoint

GET http://localhost:8080/ready

Always returns 200 OK when the service is running.

🔌 API Reference

Health Endpoints

Endpoint   Method   Description                     Response Codes
/health    GET      Service health status           200, 503
/ready     GET      Service readiness               200
/metrics   GET      Export metrics and statistics   200

🛠️ Development

Prerequisites

  • Go 1.24 or higher
  • Docker and Docker Compose (optional)
  • PostgreSQL instance with data

Local Development Setup

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Install dependencies:
go mod download
  3. Set environment variables:
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DATABASE=testdb
export POSTGRES_USERNAME=postgres
export POSTGRES_PASSWORD=password
export DATA_PATH=./data
export EXPORT_INTERVAL=5m
  4. Run the application:
go run cmd/exporter/main.go

Building from Source

# Build for current platform
go build -o exporter cmd/exporter/main.go

# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o exporter cmd/exporter/main.go

Running Tests

# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Run specific package tests
go test ./internal/parquet

🚀 Production Deployment

Docker Deployment

  1. Build the production image:
docker build -t go-vault:latest .
  2. Create a dedicated network:
docker network create parquet-export
  3. Run with production settings:
docker run -d \
  --name go-vault \
  --network parquet-export \
  --restart unless-stopped \
  --memory="2g" \
  --cpus="2.0" \
  -e POSTGRES_HOST=prod-db.internal \
  -e POSTGRES_DATABASE=production \
  -e POSTGRES_USERNAME=readonly_user \
  -e POSTGRES_PASSWORD=${DB_PASSWORD} \
  -e EXPORT_INTERVAL=30m \
  -v /mnt/parquet-storage:/data \
  -p 8080:8080 \
  go-vault:latest

Kubernetes Deployment

Create a ConfigMap for configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: go-vault-config
data:
  POSTGRES_HOST: "postgres-service.default.svc.cluster.local"
  POSTGRES_PORT: "5432"
  POSTGRES_DATABASE: "metrics"
  EXPORT_INTERVAL: "1h"

Deploy the service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-vault
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-vault
  template:
    metadata:
      labels:
        app: go-vault
    spec:
      containers:
      - name: exporter
        image: go-vault:latest
        envFrom:
        - configMapRef:
            name: go-vault-config
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        volumeMounts:
        - name: parquet-data
          mountPath: /data
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: parquet-data
        persistentVolumeClaim:
          claimName: parquet-pvc

Performance Tuning

  1. Batch Size: Adjust batch size based on available memory and row size
  2. Export Interval: Balance between data freshness and system load
  3. Connection Pool: Configure PostgreSQL connection pool settings
  4. Resource Limits: Set appropriate CPU and memory limits

Security Considerations

  1. Database Credentials: Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
  2. Network Security: Restrict database access to the exporter service
  3. File Permissions: Ensure proper permissions on the data volume
  4. Non-root User: Container runs as non-root user by default

🔧 Troubleshooting

Common Issues

Service Won't Start

Check logs:

docker logs go-vault

Common causes:

  • Incorrect PostgreSQL credentials
  • Database unreachable
  • Insufficient permissions

No Data Being Exported

  1. Check if tables have timestamp columns:
SELECT table_name, column_name 
FROM information_schema.columns 
WHERE data_type LIKE 'timestamp%';
  2. Verify health endpoint:
curl http://localhost:8080/health
  3. Check metrics for errors:
curl http://localhost:8080/metrics | jq .last_errors

Disk Space Issues

Monitor disk usage:

df -h /path/to/parquet-data

Consider:

  • Implementing data retention policies
  • Compressing older partitions
  • Moving data to object storage

Debug Mode

Enable verbose logging:

docker run -e LOG_LEVEL=debug ...

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under The Unlicense - see the LICENSE file for details.
