Go-Vault

A high-performance containerized Go application that continuously exports PostgreSQL data to Parquet files with Hive-style partitioning, smart resume capabilities, and enterprise-grade monitoring.

🚀 Features

  • Continuous Export: Automated exports on configurable schedules (minutes to days)
  • Smart Resume: Automatically detects and resumes from the last exported timestamp
  • Hive Partitioning: Industry-standard layout: table=<name>/year=YYYY/month=MM/day=DD/
  • High Performance:
    • Batch processing with configurable batch sizes
    • Optimal Parquet compression (Snappy)
    • Concurrent table exports
    • Memory-efficient streaming
  • Data Integrity:
    • Atomic file writes prevent corruption
    • Timestamp-based filenames for precise tracking
    • Graceful error handling with continuation
  • Operational Excellence:
    • Health check endpoints for container orchestration
    • Comprehensive metrics API
    • Structured logging with operation tracking
    • Graceful shutdown with timeout
  • Production Ready:
    • Docker containerized with multi-stage builds
    • Non-root user execution
    • Resource limits support
    • Volume mounting for data persistence

📋 Table of Contents

  • 🏃 Quick Start
  • 🏗️ Architecture
  • ⚙️ Configuration
  • 📁 Data Organization
  • 📊 Monitoring
  • 🔌 API Reference
  • 🛠️ Development
  • 🚀 Production Deployment
  • 🔧 Troubleshooting
  • 🤝 Contributing
  • 📝 License

🏃 Quick Start

Using Docker Compose (Recommended)

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Configure your PostgreSQL connection in docker-compose.yml:
environment:
  - POSTGRES_HOST=your-postgres-host
  - POSTGRES_DATABASE=your-database
  - POSTGRES_USERNAME=your-username
  - POSTGRES_PASSWORD=your-password
  3. Start the service:
docker-compose up -d
  4. Verify it's running:
# Check health
curl http://localhost:8080/health

# View metrics
curl http://localhost:8080/metrics

Using Docker Directly

# Build the image
docker build -t go-vault .

# Run the container
docker run -d \
  --name go-vault \
  -e POSTGRES_HOST=postgres.example.com \
  -e POSTGRES_PORT=5432 \
  -e POSTGRES_DATABASE=metrics \
  -e POSTGRES_USERNAME=postgres \
  -e POSTGRES_PASSWORD=your-secure-password \
  -e EXPORT_INTERVAL=1h \
  -v $(pwd)/parquet-data:/data \
  -p 8080:8080 \
  --restart unless-stopped \
  go-vault

🏗️ Architecture

Component Overview

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   PostgreSQL    │────▶│   Exporter   │────▶│  Parquet Files  │
│    Database     │     │   Service    │     │  (Hive Format)  │
└─────────────────┘     └──────────────┘     └─────────────────┘
                               │
                               ▼
                        ┌────────────────┐
                        │ Health/Metrics │
                        │   Endpoints    │
                        └────────────────┘

Key Components

  1. Table Discovery: Automatically finds tables with timestamp columns
  2. Resume Logic: Scans existing Parquet files to determine starting point
  3. Batch Processor: Fetches data in configurable batches for memory efficiency
  4. Parquet Writer: Converts PostgreSQL data to compressed Parquet format
  5. Scheduler: Cron-based scheduling system for regular exports
  6. Health Server: Provides monitoring endpoints for operational visibility

⚙️ Configuration

All configuration is done through environment variables:

Variable            Description                    Default     Example
POSTGRES_HOST       PostgreSQL hostname            localhost   db.example.com
POSTGRES_PORT       PostgreSQL port                5432        5432
POSTGRES_DATABASE   Target database name           metrics     production_metrics
POSTGRES_USERNAME   Database username              postgres    readonly_user
POSTGRES_PASSWORD   Database password              (required)  secure-password
DATA_PATH           Directory for Parquet files    /data       /mnt/parquet
EXPORT_INTERVAL     Export frequency               1h          30m, 2h, 24h
HEALTH_PORT         Port for health/metrics API    8080        9090

Export Interval Examples

  • 5m - Every 5 minutes
  • 30m - Every 30 minutes
  • 1h - Every hour (default)
  • 6h - Every 6 hours
  • 24h - Daily

📁 Data Organization

The service uses Hive-style partitioning for optimal query performance:

/data/
├── table=users/
│   ├── year=2025/
│   │   ├── month=01/
│   │   │   ├── day=01/
│   │   │   │   ├── 1735689600000_1735693200000.parquet
│   │   │   │   └── 1735693200000_1735696800000.parquet
│   │   │   └── day=02/
│   │   │       └── 1735776000000_1735779600000.parquet
│   │   └── month=02/
│   │       └── day=01/
│   │           └── 1738454400000_1738458000000.parquet
│   └── year=2024/
│       └── month=12/
│           └── day=31/
│               └── 1735603200000_1735606800000.parquet
└── table=orders/
    └── year=2025/
        └── month=01/
            └── day=01/
                └── 1735689600000_1735696800000.parquet

File Naming Convention

Files are named with Unix timestamps (milliseconds):

  • Format: <start_timestamp>_<end_timestamp>.parquet
  • Example: 1735689600000_1735693200000.parquet
  • This represents data from 2025-01-01 00:00:00 to 2025-01-01 01:00:00 UTC

📊 Monitoring

Health Check Endpoint

GET http://localhost:8080/health

Response:

{
  "status": "healthy"
}

Status codes:

  • 200 OK - Service is healthy
  • 503 Service Unavailable - Service is unhealthy

Metrics Endpoint

GET http://localhost:8080/metrics

Response:

{
  "rows_processed": {
    "users": 1543210,
    "orders": 892341,
    "products": 45678
  },
  "files_created": 156,
  "error_count": 2,
  "last_export_time": "2025-01-22T10:30:00Z",
  "avg_export_duration": "45s",
  "last_errors": [
    "connection timeout to PostgreSQL",
    "disk space insufficient"
  ]
}

Ready Endpoint

GET http://localhost:8080/ready

Always returns 200 OK when the service is running.

🔌 API Reference

Health Endpoints

Endpoint   Method   Description                     Response Codes
/health    GET      Service health status           200, 503
/ready     GET      Service readiness               200
/metrics   GET      Export metrics and statistics   200

🛠️ Development

Prerequisites

  • Go 1.24 or higher
  • Docker and Docker Compose (optional)
  • PostgreSQL instance with data

Local Development Setup

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Install dependencies:
go mod download
  3. Set environment variables:
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DATABASE=testdb
export POSTGRES_USERNAME=postgres
export POSTGRES_PASSWORD=password
export DATA_PATH=./data
export EXPORT_INTERVAL=5m
  4. Run the application:
go run cmd/exporter/main.go

Building from Source

# Build for current platform
go build -o exporter cmd/exporter/main.go

# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o exporter cmd/exporter/main.go

Running Tests

# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Run specific package tests
go test ./internal/parquet

🚀 Production Deployment

Docker Deployment

  1. Build the production image:
docker build -t go-vault:latest .
  2. Create a dedicated network:
docker network create parquet-export
  3. Run with production settings:
docker run -d \
  --name go-vault \
  --network parquet-export \
  --restart unless-stopped \
  --memory="2g" \
  --cpus="2.0" \
  -e POSTGRES_HOST=prod-db.internal \
  -e POSTGRES_DATABASE=production \
  -e POSTGRES_USERNAME=readonly_user \
  -e POSTGRES_PASSWORD=${DB_PASSWORD} \
  -e EXPORT_INTERVAL=30m \
  -v /mnt/parquet-storage:/data \
  -p 8080:8080 \
  go-vault:latest

Kubernetes Deployment

Create a ConfigMap for configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: go-vault-config
data:
  POSTGRES_HOST: "postgres-service.default.svc.cluster.local"
  POSTGRES_PORT: "5432"
  POSTGRES_DATABASE: "metrics"
  EXPORT_INTERVAL: "1h"

Deploy the service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-vault
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-vault
  template:
    metadata:
      labels:
        app: go-vault
    spec:
      containers:
      - name: exporter
        image: go-vault:latest
        envFrom:
        - configMapRef:
            name: go-vault-config
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        volumeMounts:
        - name: parquet-data
          mountPath: /data
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: parquet-data
        persistentVolumeClaim:
          claimName: parquet-pvc

Performance Tuning

  1. Batch Size: Adjust batch size based on available memory and row size
  2. Export Interval: Balance between data freshness and system load
  3. Connection Pool: Configure PostgreSQL connection pool settings
  4. Resource Limits: Set appropriate CPU and memory limits

Security Considerations

  1. Database Credentials: Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
  2. Network Security: Restrict database access to the exporter service
  3. File Permissions: Ensure proper permissions on the data volume
  4. Non-root User: Container runs as non-root user by default

🔧 Troubleshooting

Common Issues

Service Won't Start

Check logs:

docker logs go-vault

Common causes:

  • Incorrect PostgreSQL credentials
  • Database unreachable
  • Insufficient permissions

No Data Being Exported

  1. Check if tables have timestamp columns:
SELECT table_name, column_name 
FROM information_schema.columns 
WHERE data_type LIKE 'timestamp%';
  2. Verify health endpoint:
curl http://localhost:8080/health
  3. Check metrics for errors:
curl http://localhost:8080/metrics | jq .last_errors

Disk Space Issues

Monitor disk usage:

df -h /path/to/parquet-data

Consider:

  • Implementing data retention policies
  • Compressing older partitions
  • Moving data to object storage

Debug Mode

Enable verbose logging:

docker run -e LOG_LEVEL=debug ...

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under The Unlicense - see the LICENSE file for details.
