🚀 Production-ready observability & cost monitoring for AI agents. Features real-time metrics, LangSmith tracing, and automated alerting.

josephsenior/agent-observability-platform

Agent Observability Platform

A production-ready observability and monitoring platform for AI agents. It provides real-time metrics tracking, cost monitoring, automatic alerting, LangSmith integration, and a monitoring dashboard. Built with FastAPI and LangChain.

What This Platform Does

This platform provides comprehensive observability for AI agent systems:

  • Real-Time Metrics: Track request rates, latency, success rates, and error rates
  • Cost Tracking: Monitor API costs with daily breakdowns and optimization suggestions
  • Alerting System: Automatic alerts for cost thresholds, error rates, and latency issues
  • LangSmith Integration: Seamless integration with LangSmith for detailed tracing
  • Performance Analytics: Percentile-based latency analysis (p50, p95, p99)
  • Rate Limiting: Built-in rate limiting to protect your API usage
  • Health Monitoring: System health checks and status endpoints
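To make the percentile-based latency analysis concrete, here is a minimal sketch of how p50, p95, and p99 could be computed from a list of request latencies. The function name `latency_percentiles` and the nearest-rank method are assumptions for illustration; the platform's own implementation may use a different interpolation.

```python
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50/p95/p99 using the nearest-rank method (an assumed choice)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    result = {}
    for p in (50, 95, 99):
        # Nearest rank = ceil(p * n / 100), done in integer arithmetic
        # to avoid floating-point rounding surprises.
        rank = (p * n + 99) // 100
        result[f"p{p}"] = ordered[rank - 1]
    return result


if __name__ == "__main__":
    print(latency_percentiles([120.0, 95.0, 310.0, 101.0, 2500.0]))
```

Percentiles are usually more informative than the mean here because a few slow model calls can dominate an average while leaving the median unchanged.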

Key Features

Observability

  • Request tracing with LangSmith
  • Real-time metrics collection
  • Historical data analysis
  • Performance percentile tracking
  • Error tracking and analysis

Cost Management

  • Real-time cost tracking per request
  • Daily and monthly cost projections
  • Cost optimization suggestions
  • Token usage monitoring
  • Model-specific cost calculations

Alerting

  • Configurable alert thresholds
  • Multiple alert types (cost, error rate, latency)
  • Alert severity levels
  • Alert resolution tracking
  • Real-time alert notifications

Production Features

  • Rate limiting per agent
  • Request queuing
  • Health check endpoints
  • Error handling and recovery
  • Scalable architecture

Installation

Prerequisites

  • Python 3.8 or higher
  • Gemini API key
  • (Optional) LangSmith API key for enhanced tracing

Setup

  1. Navigate to the project directory:
cd agent_observability_platform
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
cp .env.example .env

Edit .env and add your API keys:

GEMINI_API_KEY=your_gemini_api_key_here
LANGSMITH_API_KEY=your_langsmith_api_key_here  # Optional
LANGSMITH_PROJECT=agent-observability-platform
LANGSMITH_TRACING=true
  5. Create the metrics data directory:
mkdir -p data/metrics

Usage

Starting the Server

Run the application:

python main.py

The server will start on http://localhost:8000 by default.

Using the Web Dashboard

  1. Open your browser and navigate to http://localhost:8000
  2. View real-time metrics and statistics
  3. Monitor active alerts
  4. Review recent request metrics
  5. Track costs and performance

API Endpoints

Execute Agent Request

POST /api/agents/execute
Content-Type: application/json

{
  "agent_name": "my-agent",
  "prompt": "What is the capital of France?",
  "model": "gpt-3.5-turbo",
  "temperature": 0.7
}
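The request above could be sent from Python roughly as follows. This is a sketch using only the standard library; the helper names `build_execute_request` and `execute_agent` are hypothetical, and it assumes the server runs at the default address and answers with a JSON body, whose exact shape is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from "Starting the Server"


def build_execute_request(agent_name, prompt, model="gpt-3.5-turbo", temperature=0.7):
    """Build (but do not send) a POST request for /api/agents/execute."""
    payload = {
        "agent_name": agent_name,
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/agents/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute_agent(agent_name, prompt, timeout=30.0, **kwargs):
    """Send the request and decode the response, assuming it is JSON."""
    req = build_execute_request(agent_name, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server.
    print(execute_agent("my-agent", "What is the capital of France?"))
```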

Get Metrics Summary

GET /api/metrics/summary?agent_name=my-agent&hours=24

Get Recent Metrics

GET /api/metrics/recent?agent_name=my-agent&limit=100

Get Daily Cost

GET /api/cost/daily?agent_name=my-agent

Get Active Alerts

GET /api/alerts

Resolve Alert

POST /api/alerts/{alert_id}/resolve

Health Check

GET /api/health
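The read-only endpoints above differ only in path and query parameters, so a small client helper can cover all of them. The sketch below is an illustration, not part of the project: `endpoint_url` and `fetch_json` are hypothetical names, and the responses are assumed to be JSON.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # default address from "Starting the Server"


def endpoint_url(path, **params):
    """Join the base URL, a path, and any query parameters that are not None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def fetch_json(path, timeout=10.0, **params):
    """GET an endpoint and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(endpoint_url(path, **params), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running server.
    print(fetch_json("/api/health"))
    print(fetch_json("/api/metrics/summary", agent_name="my-agent", hours=24))
    print(fetch_json("/api/cost/daily", agent_name="my-agent"))
```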

Project Structure

agent_observability_platform/
├── backend/
│   ├── api/
│   │   └── main.py              # FastAPI application
│   ├── core/
│   │   ├── database.py          # Database management
│   │   ├── langsmith_client.py  # LangSmith integration
│   │   └── rate_limiter.py      # Rate limiting
│   ├── monitoring/
│   │   ├── metrics_collector.py # Metrics collection
│   │   ├── alert_manager.py     # Alert management
│   │   └── cost_tracker.py      # Cost tracking
│   └── models/
│       ├── metrics.py           # Pydantic models
│       └── database.py          # SQLAlchemy models
├── frontend/
│   └── templates/
│       └── index.html           # Monitoring dashboard
├── data/
│   └── metrics/                 # Metrics storage
├── tests/
│   ├── unit/                    # Unit tests
│   └── integration/             # Integration tests
├── main.py                      # Application entry point
├── requirements.txt             # Python dependencies
└── README.md                    # This file

Configuration

Environment variables in .env:

  • GEMINI_API_KEY: Required - Your Gemini API key
  • LANGSMITH_API_KEY: Optional - LangSmith API key for tracing
  • LANGSMITH_PROJECT: LangSmith project name
  • LANGSMITH_TRACING: Enable/disable LangSmith tracing (true/false)
  • DATABASE_URL: Database connection string
  • RATE_LIMIT_ENABLED: Enable rate limiting (true/false)
  • RATE_LIMIT_PER_MINUTE: Default rate limit per minute
  • ALERT_COST_THRESHOLD: Daily cost threshold for alerts (USD)
  • ALERT_ERROR_RATE_THRESHOLD: Error rate threshold (0.0-1.0)
  • ALERT_LATENCY_THRESHOLD_MS: Latency threshold in milliseconds
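As an illustration of how these variables might be read and validated at startup, here is a minimal sketch. The `Settings` class, the `load_settings` function, and every default value are assumptions made for the example; check .env.example for the values the project actually ships with.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    langsmith_api_key: Optional[str]
    langsmith_project: str
    langsmith_tracing: bool
    database_url: str
    rate_limit_enabled: bool
    rate_limit_per_minute: int
    alert_cost_threshold: float
    alert_error_rate_threshold: float
    alert_latency_threshold_ms: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse settings from a mapping; all defaults below are placeholders."""
    if not env.get("GEMINI_API_KEY"):
        raise RuntimeError("GEMINI_API_KEY is required")
    error_rate = float(env.get("ALERT_ERROR_RATE_THRESHOLD", "0.1"))
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("ALERT_ERROR_RATE_THRESHOLD must be between 0.0 and 1.0")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        langsmith_api_key=env.get("LANGSMITH_API_KEY") or None,
        langsmith_project=env.get("LANGSMITH_PROJECT", "agent-observability-platform"),
        langsmith_tracing=_as_bool(env.get("LANGSMITH_TRACING", "false")),
        database_url=env.get("DATABASE_URL", "sqlite:///data/metrics/metrics.db"),
        rate_limit_enabled=_as_bool(env.get("RATE_LIMIT_ENABLED", "true")),
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "60")),
        alert_cost_threshold=float(env.get("ALERT_COST_THRESHOLD", "10.0")),
        alert_error_rate_threshold=error_rate,
        alert_latency_threshold_ms=float(env.get("ALERT_LATENCY_THRESHOLD_MS", "5000")),
    )
```

Passing the environment in as a mapping keeps the parser easy to test without touching the real process environment.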

LangSmith Integration

The platform integrates with LangSmith for enhanced observability:

  1. Set LANGSMITH_API_KEY in your .env file
  2. Set LANGSMITH_TRACING=true
  3. All agent requests will be automatically traced
  4. View detailed traces in the LangSmith dashboard
  5. Trace IDs are stored with metrics for correlation

Alerting

The platform automatically monitors:

  • Cost Alerts: Triggered when daily cost exceeds ALERT_COST_THRESHOLD
  • Error Rate Alerts: Triggered when the error rate exceeds ALERT_ERROR_RATE_THRESHOLD
  • Latency Alerts: Triggered when average latency exceeds ALERT_LATENCY_THRESHOLD_MS

Alerts can be resolved through the API or dashboard.
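The three checks can be sketched as a single function over aggregated statistics. The `Thresholds` and `WindowStats` types and the `evaluate_alerts` function are hypothetical names for this example; the real alert manager may aggregate over different windows and attach severity levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    cost_usd: float      # corresponds to ALERT_COST_THRESHOLD
    error_rate: float    # corresponds to ALERT_ERROR_RATE_THRESHOLD (0.0-1.0)
    latency_ms: float    # corresponds to ALERT_LATENCY_THRESHOLD_MS


@dataclass
class WindowStats:
    daily_cost_usd: float
    total_requests: int
    failed_requests: int
    avg_latency_ms: float


def evaluate_alerts(stats: WindowStats, thresholds: Thresholds) -> List[str]:
    """Return the alert types whose threshold is strictly exceeded."""
    alerts = []
    if stats.daily_cost_usd > thresholds.cost_usd:
        alerts.append("cost")
    # Guard against division by zero when no requests were recorded.
    error_rate = stats.failed_requests / stats.total_requests if stats.total_requests else 0.0
    if error_rate > thresholds.error_rate:
        alerts.append("error_rate")
    if stats.avg_latency_ms > thresholds.latency_ms:
        alerts.append("latency")
    return alerts
```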

Cost Tracking

The platform tracks costs based on:

  • Model pricing (GPT-4, GPT-3.5-turbo, etc.)
  • Input and output token usage
  • Per-request cost calculation
  • Daily and monthly projections

Cost optimization suggestions are provided based on usage patterns.
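A per-request cost calculation of this kind typically multiplies token counts by a per-model price. The sketch below illustrates the arithmetic only: the model name, the prices in `PRICES_PER_1K_TOKENS`, and the flat 30-day projection are placeholders, not the platform's actual pricing table or projection method.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES_PER_1K_TOKENS: Dict[str, Tuple[float, float]] = {
    "example-model": (0.5, 1.5),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request; raises KeyError for models without a price entry."""
    input_price, output_price = PRICES_PER_1K_TOKENS[model]
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price


def monthly_projection(daily_cost_usd: float, days_in_month: int = 30) -> float:
    """Naive projection that assumes every day costs the same as today."""
    return daily_cost_usd * days_in_month


if __name__ == "__main__":
    cost = request_cost("example-model", input_tokens=2000, output_tokens=1000)
    print(f"request cost: ${cost:.4f}, monthly at this daily rate: ${monthly_projection(cost):.2f}")
```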

Production Deployment

For production deployment:

  1. Use PostgreSQL instead of SQLite
  2. Set up Redis for distributed rate limiting
  3. Configure proper CORS origins
  4. Set up monitoring and logging
  5. Use environment-specific configurations
  6. Enable HTTPS
  7. Set up backup strategies for metrics data

Limitations

  • Rate limiting is in-memory (use Redis for distributed systems)
  • SQLite database (upgrade to PostgreSQL for production)
  • Basic alerting (extend for email/Slack notifications)
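To show why in-memory rate limiting does not carry over to distributed deployments, here is a minimal sketch of a per-agent sliding-window limiter. The class name `InMemoryRateLimiter` and its interface are assumptions for illustration and may not match rate_limiter.py. Because the counters live in one process's memory, each worker or replica would enforce its own separate limit.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class InMemoryRateLimiter:
    """Allow at most `limit_per_minute` requests per agent in any 60-second window."""

    def __init__(self, limit_per_minute: int, clock: Callable[[], float] = time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, agent_name: str) -> bool:
        now = self.clock()
        hits = self._hits[agent_name]
        # Drop timestamps that have fallen out of the 60-second window.
        while hits and now - hits[0] >= 60.0:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = InMemoryRateLimiter(limit_per_minute=2)
    print([limiter.allow("my-agent") for _ in range(3)])
```

A shared store such as Redis replaces the per-process dictionary so that all instances see the same counters.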

Future Enhancements

  • Email/Slack alert notifications
  • Advanced analytics and reporting
  • Multi-tenant support
  • Custom dashboards
  • Export metrics to external systems
  • Machine learning-based anomaly detection
  • A/B testing framework
  • Performance benchmarking

Tech Stack

  • FastAPI: Modern async web framework
  • LangChain: Agent framework integration
  • LangSmith: Tracing and observability
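  • SQLite/PostgreSQL: Metrics storage (SQLite by default, PostgreSQL recommended for production)
  • Redis: Recommended for distributed rate limiting in production (the built-in limiter is in-memory)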
  • Python 3.8+: Core language

Use Cases

  • Production Monitoring: Real-time monitoring of agent systems
  • Cost Management: Track and optimize API costs
  • Performance Analytics: Identify bottlenecks and optimize
  • Alert Management: Proactive issue detection
  • Observability: Comprehensive system visibility

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.
