A production-ready observability platform for monitoring, tracking, and optimizing AI agent performance. The system provides comprehensive metrics collection, cost tracking, alerting, and real-time dashboards.
The core of the platform collects and stores metrics for every agent request:
- Request Tracking: Unique request IDs, timestamps, agent names
- Performance Metrics: Latency (p50, p95, p99), success/failure rates
- Cost Metrics: Token usage, cost per request, daily totals
- Error Tracking: Error messages, failure reasons, stack traces
- LangSmith Integration: Trace IDs for correlation with LangSmith
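
As a rough illustration of the metrics listed above, the sketch below models one per-request record and derives latency percentiles and a success rate from a batch of records. The `RequestMetric` fields, the `percentile` helper, and `summarize` are hypothetical names chosen for this example, and the nearest-rank percentile method is an assumption; the platform's actual schema and aggregation may differ.

```python
import math
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RequestMetric:
    """One per-request record (field names are illustrative assumptions)."""
    agent_name: str
    latency_ms: float
    success: bool
    total_tokens: int = 0
    cost_usd: float = 0.0
    error_message: Optional[str] = None
    trace_id: Optional[str] = None  # LangSmith trace ID, when tracing is enabled
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(metrics: list[RequestMetric]) -> dict[str, float]:
    """Aggregate latency percentiles and the success rate for a batch of records."""
    latencies = [m.latency_ms for m in metrics]
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "success_rate": sum(m.success for m in metrics) / len(metrics),
    }
```

Nearest-rank is used here because it always returns an observed latency; an interpolating method would be an equally reasonable choice.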
SQLite database (upgradeable to PostgreSQL) stores:
- Request Metrics: All request data with timestamps
- Alerts: Alert history and resolution tracking
- Agent Configurations: Agent settings and metadata
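
The three kinds of stored data could map onto tables along these lines. This is a minimal sketch using the standard-library `sqlite3` module so that it runs without dependencies; the platform itself lists SQLAlchemy as its ORM, and every table and column name here is an assumption rather than the real schema.

```python
import sqlite3

# Hypothetical schema: one table per kind of stored data.
SCHEMA = """
CREATE TABLE IF NOT EXISTS request_metrics (
    request_id    TEXT PRIMARY KEY,
    agent_name    TEXT NOT NULL,
    created_at    TEXT NOT NULL,      -- ISO 8601 timestamp
    latency_ms    REAL NOT NULL,
    success       INTEGER NOT NULL,   -- 1 = success, 0 = failure
    total_tokens  INTEGER NOT NULL DEFAULT 0,
    cost_usd      REAL NOT NULL DEFAULT 0,
    error_message TEXT,
    trace_id      TEXT
);
CREATE INDEX IF NOT EXISTS idx_metrics_agent_time
    ON request_metrics (agent_name, created_at);

CREATE TABLE IF NOT EXISTS alerts (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_name  TEXT NOT NULL,
    alert_type  TEXT NOT NULL,
    message     TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    resolved_at TEXT                  -- NULL while the alert is still active
);

CREATE TABLE IF NOT EXISTS agent_configs (
    agent_name TEXT PRIMARY KEY,
    settings   TEXT NOT NULL          -- JSON-encoded settings and metadata
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Calling `init_db("metrics.db")` would create a single-file database; the default `":memory:"` path is convenient for tests.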
The Metrics Collector:
- Records all request metrics
- Aggregates statistics
- Provides summary views
The Cost Tracker:
- Calculates costs based on model pricing
- Tracks daily/monthly costs
- Provides optimization suggestions
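
A cost calculation based on model pricing might look like the following sketch. The model names and per-1,000-token prices in `PRICING` are placeholders invented for illustration, not real vendor rates, and `request_cost` and `daily_totals` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens; placeholders, not real vendor pricing.
PRICING = {
    "example-small": {"input": Decimal("0.0005"), "output": Decimal("0.0015")},
    "example-large": {"input": Decimal("0.01"), "output": Decimal("0.03")},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one request: tokens times the per-1K price, for input and output separately."""
    price = PRICING[model]
    return (price["input"] * input_tokens + price["output"] * output_tokens) / 1000


def daily_totals(costs: list[tuple[date, Decimal]]) -> dict[date, Decimal]:
    """Sum per-request costs by calendar day."""
    totals: dict[date, Decimal] = defaultdict(Decimal)
    for day, cost in costs:
        totals[day] += cost
    return dict(totals)
```

`Decimal` is used instead of `float` so that many small per-request amounts add up without rounding drift; monthly totals would follow the same grouping pattern keyed on year and month.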
The Alert Manager:
- Monitors thresholds
- Creates alerts automatically
- Manages alert lifecycle
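
One way to picture threshold monitoring and the alert lifecycle is the sketch below. The `Alert` and `AlertManager` classes, the two thresholds (error rate and p95 latency), and their default values are all assumptions made for this example; the source does not say which thresholds the platform checks.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Alert:
    agent_name: str
    alert_type: str
    message: str
    created_at: datetime
    resolved_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.resolved_at is None


class AlertManager:
    """Creates alerts when a threshold is breached and resolves them once it clears."""

    def __init__(self, max_error_rate: float = 0.1, max_p95_latency_ms: float = 5000.0):
        self.max_error_rate = max_error_rate
        self.max_p95_latency_ms = max_p95_latency_ms
        self.alerts: list[Alert] = []

    def active_alerts(self) -> list[Alert]:
        return [a for a in self.alerts if a.active]

    def check(self, agent_name: str, error_rate: float, p95_latency_ms: float) -> list[Alert]:
        now = datetime.now(timezone.utc)
        breaches: dict[str, str] = {}
        if error_rate > self.max_error_rate:
            breaches["error_rate"] = f"error rate {error_rate:.1%} exceeds {self.max_error_rate:.1%}"
        if p95_latency_ms > self.max_p95_latency_ms:
            breaches["latency"] = f"p95 latency {p95_latency_ms:.0f} ms exceeds {self.max_p95_latency_ms:.0f} ms"

        # Index the currently active alerts for this agent by type.
        active = {a.alert_type: a for a in self.active_alerts() if a.agent_name == agent_name}

        created = []
        for alert_type, message in breaches.items():
            if alert_type not in active:  # do not duplicate an alert that is already open
                alert = Alert(agent_name, alert_type, message, created_at=now)
                self.alerts.append(alert)
                created.append(alert)

        # Resolve open alerts whose condition is no longer breached.
        for alert_type, alert in active.items():
            if alert_type not in breaches:
                alert.resolved_at = now
        return created
```

Resolved alerts stay in `alerts`, which matches the idea of keeping an alert history alongside the active set.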
LangSmith integration:
- Automatic tracing for all agent requests
- Trace ID correlation with metrics
- Direct links to LangSmith dashboard
- Optional integration (works without LangSmith)
The Rate Limiter:
- Per-agent rate limiting
- Configurable limits
- In-memory implementation (Redis-ready)
- Request queuing support
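
A per-agent, in-memory rate limiter with configurable limits can be sketched as a sliding window over recent request timestamps. The class name `InMemoryRateLimiter`, the sliding-window algorithm, and the default of 60 requests per 60 seconds are assumptions for illustration; request queuing is not shown.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class InMemoryRateLimiter:
    """Sliding-window limiter: at most `limit` requests per agent in any `window_seconds` span."""

    def __init__(
        self,
        default_limit: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.default_limit = default_limit
        self.window_seconds = window_seconds
        self.limits: dict[str, int] = {}  # per-agent overrides
        self._clock = clock               # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def set_limit(self, agent_name: str, limit: int) -> None:
        self.limits[agent_name] = limit

    def allow(self, agent_name: str) -> bool:
        now = self._clock()
        hits = self._hits[agent_name]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limits.get(agent_name, self.default_limit):
            return False
        hits.append(now)
        return True
```

Because the state lives in one process's memory, this only limits traffic on a single server; a shared store such as Redis would be needed to enforce the same limits across several instances.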
FastAPI-based REST API with:
- Agent execution endpoints
- Metrics retrieval endpoints
- Alert management endpoints
- Health check endpoints
- Real-time dashboard data
Modern web interface showing:
- Real-time statistics
- Active alerts
- Recent metrics table
- Cost breakdowns
- Performance charts
Agent Request
↓
[Rate Limiter] → Check rate limits
↓
[Agent Execution] → Execute with LangSmith tracing
↓
[Metrics Collector] → Record metrics
↓
[Cost Tracker] → Calculate costs
↓
[Database] → Store metrics
↓
[Alert Manager] → Check thresholds, create alerts
↓
[Dashboard] → Display real-time data
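
The flow above can be sketched as a single async function, with each stage reduced to a few lines. Everything here is a simplified stand-in: `handle_request`, the flat `PRICE_PER_1K_TOKENS`, the `MAX_LATENCY_MS` threshold, and the use of plain lists for the database and alert store are assumptions, and the LangSmith tracing step is left out.

```python
import asyncio
import time
import uuid
from typing import Awaitable, Callable

PRICE_PER_1K_TOKENS = 0.002  # hypothetical flat price in USD
MAX_LATENCY_MS = 5000.0      # hypothetical alert threshold


async def handle_request(
    agent_name: str,
    prompt: str,
    run_agent: Callable[[str], Awaitable[tuple[str, int]]],  # returns (output, tokens used)
    allow: Callable[[str], bool],                            # rate-limit check
    db: list[dict],                                          # stand-in for the metrics table
    alerts: list[str],                                       # stand-in for the alert store
) -> dict:
    # [Rate Limiter] reject before doing any work
    if not allow(agent_name):
        raise RuntimeError(f"rate limit exceeded for {agent_name}")

    # [Agent Execution] time the call and capture failures instead of propagating them
    started = time.perf_counter()
    output, tokens, error = None, 0, None
    try:
        output, tokens = await run_agent(prompt)
    except Exception as exc:
        error = str(exc)
    latency_ms = (time.perf_counter() - started) * 1000

    # [Metrics Collector] and [Cost Tracker] build the record
    metric = {
        "request_id": str(uuid.uuid4()),
        "agent_name": agent_name,
        "output": output,
        "latency_ms": latency_ms,
        "success": error is None,
        "error_message": error,
        "total_tokens": tokens,
        "cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS,
    }

    # [Database] store the record
    db.append(metric)

    # [Alert Manager] check thresholds
    if error is not None:
        alerts.append(f"{agent_name}: request failed ({error})")
    elif latency_ms > MAX_LATENCY_MS:
        alerts.append(f"{agent_name}: latency {latency_ms:.0f} ms over threshold")
    return metric
```

A dashboard would then read from `db` and `alerts`; failed requests are recorded as metrics rather than raised so that failure rates can be computed later.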
Backend:
- FastAPI: Modern async web framework
- SQLAlchemy: Database ORM with async support
- LangSmith: Observability and tracing
- LangChain: Agent framework integration
Frontend:
- HTML/CSS/JavaScript: Modern dashboard
- Real-time Updates: Auto-refresh every 30 seconds
Database:
- SQLite: Development database
- PostgreSQL: Production-ready option
Monitoring:
- LangSmith: External observability platform
- Custom Metrics: Internal tracking system
All database operations and API endpoints are async, so a slow query or agent call does not block other requests.
Separate components for metrics, alerts, and cost tracking allow independent scaling.
The platform works with or without LangSmith, so it fits setups where external tracing is unavailable or unwanted.
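
One common way to make such an integration optional is to fall back to a no-op decorator when the package is not installed. The sketch below assumes the `traceable` decorator exported by the `langsmith` package; `run_agent` is a hypothetical function, and the source does not say whether the platform uses this exact pattern.

```python
try:
    # Decorator provided by the langsmith package, if it is installed.
    from langsmith import traceable
    LANGSMITH_AVAILABLE = True
except ImportError:
    LANGSMITH_AVAILABLE = False

    def traceable(func=None, **_kwargs):
        """No-op stand-in supporting both @traceable and @traceable(...) forms."""
        if func is None:
            return lambda f: f
        return func


@traceable(name="example_agent")
def run_agent(prompt: str) -> str:
    # Hypothetical agent body; a real agent would call a model here.
    return prompt.upper()
```

With this pattern the rest of the code decorates functions the same way in both cases, and `LANGSMITH_AVAILABLE` can be used to decide whether to show trace links.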
Rate limiting uses a simple in-memory implementation that can be upgraded to Redis for distributed systems.
SQLite keeps setup easy, but the platform is designed to work with PostgreSQL in production.
Current limitations:
- In-memory rate limiting (single server)
- SQLite database (single file)
- No distributed tracing aggregation
Scaling options:
- Redis for distributed rate limiting
- PostgreSQL for scalable database
- Message queue for async processing
- CDN for static assets
- Load balancer for multiple instances
Security:
- API key management via environment variables
- Rate limiting to prevent abuse
- Input validation on all endpoints
- SQL injection protection via ORM
- CORS configuration for frontend
Performance:
- Async database operations
- Indexed database queries
- Efficient metric aggregation
- Cached summary calculations
- Minimal frontend dependencies
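
Cached summary calculations could be as simple as a time-to-live cache in front of the aggregation query. This is a sketch under assumptions: the `SummaryCache` class is hypothetical, and the 30-second default is chosen only to match the dashboard's auto-refresh interval; the source does not describe how caching is implemented.

```python
import time
from typing import Any, Callable


class SummaryCache:
    """Caches computed summaries for `ttl_seconds` so repeated dashboard loads skip the query."""

    def __init__(self, ttl_seconds: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: reuse the cached value
        value = compute()    # missing or expired: recompute and store
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)
```

The trade-off is staleness: a summary can lag behind the stored metrics by up to the TTL.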
Monitoring and observability:
- Application Metrics: Tracked internally
- External Observability: LangSmith integration
- Health Checks: Built-in health endpoints
- Alerting: Automatic threshold monitoring
- Cost Tracking: Real-time cost analysis
Possible future enhancements:
- Microservices architecture
- Event-driven alerting
- Real-time WebSocket updates
- Advanced analytics engine
- Machine learning anomaly detection
- Multi-region deployment
- GraphQL API option
- Plugin system for custom metrics