Monitoring with Prometheus and Grafana

This guide explains how to set up Prometheus and Grafana monitoring for the Claude Development Pipeline agent workflow metrics.

Overview

The monitoring system tracks:

- Agent invocation counts and rates
- Execution duration metrics (with percentiles)
- Success/failure rates
- Phase distribution
- Parallel execution patterns
- Session metrics
- Database statistics

Prerequisites

- Python dependencies:
  ```
  pip install -r requirements.txt
  ```
- Prometheus installed (see the official Prometheus download page)
- Grafana installed (see the official Grafana download page)

Quick Start

1. Start the Prometheus Exporter

```bash
# Start the exporter (default port: 9090)
python3 tools/prometheus_exporter.py

# Or with custom options
python3 tools/prometheus_exporter.py --port 9091 --interval 10
```

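The exporter will:

- Connect to the SQLite database at logs/agent_workflow.db
- Expose metrics at http://localhost:9090/metrics
- Update metrics every 15 seconds (configurable with --interval)

Note that Prometheus itself also listens on port 9090 by default. When both run on the same host, start one of them on a different port (for example, the exporter with --port 9091) and make sure the scrape target in your Prometheus configuration points at the exporter's port.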
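
To confirm that the exporter is serving data, the /metrics endpoint can be read and parsed from Python. The following is a minimal sketch using only the standard library. The function names `parse_metrics` and `fetch_metrics` are hypothetical, and the parser assumes the exporter emits plain `name{labels} value` lines without trailing timestamps.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition output into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5):
    """Fetch and parse the exporter's metrics page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` while the exporter is running should return a dictionary whose keys include series such as agent_invocation_total with their label sets.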

2. Configure Prometheus

Copy or merge the provided configuration:

```bash
# If Prometheus is installed via Homebrew (macOS); adjust the prefix if
# Homebrew lives elsewhere, e.g. /opt/homebrew on Apple Silicon
cp config/prometheus.yml /usr/local/etc/prometheus/prometheus.yml

# Or specify config when starting Prometheus
prometheus --config.file=config/prometheus.yml
```

Start Prometheus:

```bash
prometheus --config.file=config/prometheus.yml
```

Verify at http://localhost:9090/targets that the agent_workflow target is UP.
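
The same check can be scripted against the Prometheus HTTP API, which lists scrape targets at /api/v1/targets. This is a rough sketch: it assumes the scrape job in config/prometheus.yml is named agent_workflow, and the helper names `target_health` and `fetch_targets` are made up for illustration.

```python
import json
import urllib.request


def target_health(targets_response, job="agent_workflow"):
    """Return {scrape_url: health} for the active targets of one job."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        target.get("scrapeUrl", ""): target.get("health", "unknown")
        for target in active
        # Assumption: the job label matches the job name in prometheus.yml.
        if target.get("labels", {}).get("job") == job
    }


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """Download the target list from a running Prometheus server."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With Prometheus running, `target_health(fetch_targets())` should map the exporter's scrape URL to "up" once the first scrape has succeeded.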

3. Import Grafana Dashboard

1. Access Grafana at http://localhost:3000 (default credentials: admin/admin)
2. Add Prometheus data source:
   - Navigate to Configuration → Data Sources
   - Add New Data Source → Prometheus
   - URL: http://localhost:9090
   - Save & Test
3. Import the dashboard:
   - Navigate to Dashboards → Import
   - Upload config/grafana-dashboard.json
   - Select your Prometheus datasource
   - Import
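
If you prefer to script the data source step, Grafana also accepts it over its HTTP API (POST /api/datasources). The sketch below only builds the request. The function name `datasource_request` and the data source name "Prometheus" are assumptions, and the credentials are the defaults mentioned above.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url="http://localhost:3000",
                       prometheus_url="http://localhost:9090",
                       user="admin", password="admin"):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Prometheus",      # display name, chosen here for illustration
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",         # Grafana's backend queries Prometheus
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(datasource_request())` against a running Grafana instance. Change the default password before using this anywhere beyond a local test setup.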

Available Metrics

Counter Metrics

- agent_invocation_total: Total invocations by agent, phase, status, model
- agent_error_total: Total errors by agent and phase
- phase_distribution_total: Distribution across workflow phases

Histogram Metrics

- agent_duration_seconds: Execution duration, recorded in buckets from which percentiles are estimated with histogram_quantile
- session_duration_seconds: Session duration distribution
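
A histogram does not store percentiles directly. It stores cumulative bucket counts, and PromQL's histogram_quantile estimates a quantile by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that calculation for classic buckets, not Prometheus's actual code. The bucket boundaries in the usage note are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with at least two entries and the last upper bound equal to +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0)
    if count == below:
        return upper
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For example, with buckets `[(1, 10), (5, 60), (10, 90), (math.inf, 100)]` the median lands in the 1 to 5 second bucket, while the 95th percentile falls into the +Inf bucket and is reported as 10. This is why bucket boundaries should cover the durations you care about.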

Gauge Metrics

- agent_last_execution_timestamp: Last execution time per agent
- agent_success_rate: Success rate percentage per agent
- active_sessions_count: Currently active sessions
- parallel_executions_current: Current parallel executions
- total_unique_agents: Total unique agents in system
- total_sessions: Total logged sessions
- avg_agents_per_session: Average agents per session

Info Metrics

- agent_workflow_database_info: Database metadata

Dashboard Panels

The provided Grafana dashboard includes:

- Overview Stats: Total invocations, unique agents, sessions, success rate
- Agent Performance Timeline: 95th percentile execution times
- Agent Performance Table: Invocation counts and success rates
- Phase Distribution: Donut chart of workflow phases
- Average Execution Time: Bar chart by agent
- Agent Invocation Rate: Time series of invocation rates
- System Metrics: Parallel executions, active sessions, avg agents/session

Prometheus Queries

Useful PromQL queries for custom panels:

```promql
# Top 5 slowest agents (95th percentile)
topk(5, histogram_quantile(0.95, rate(agent_duration_seconds_bucket[5m])))

# Agents with a success rate below 80%
agent_success_rate < 80

# Phase distribution percentage
sum by (phase) (phase_distribution_total) / sum(phase_distribution_total) * 100

# Parallel execution efficiency
parallel_executions_current / active_sessions_count

# Error rate by agent (errors per second, averaged over the last 5 minutes)
rate(agent_error_total[5m])

# Session completion rate (sessions per second, averaged over the last hour)
rate(session_duration_seconds_count[1h])
```
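
These queries can also be run outside Grafana through the Prometheus HTTP API's instant query endpoint (/api/v1/query). Below is a minimal sketch. The helper names are hypothetical, and `vector_values` only handles instant-vector results, which is what the queries above return.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr, base_url="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def vector_values(response):
    """Flatten an instant-vector response into [(labels, value), ...]."""
    if response.get("status") != "success":
        raise RuntimeError(response.get("error", "query failed"))
    return [
        # Each sample is [timestamp, "value-as-string"].
        (item["metric"], float(item["value"][1]))
        for item in response["data"]["result"]
    ]


def run_query(expr, base_url="http://localhost:9090", timeout=5):
    """Run a PromQL instant query against a live Prometheus server."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=timeout) as resp:
        return vector_values(json.load(resp))
```

For instance, `run_query("agent_success_rate < 80")` would return one entry per agent currently below the threshold, or an empty list when none are.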

Alert Rules

Create agent_workflow_rules.yml for alerting, and list it under rule_files in your Prometheus configuration so that Prometheus loads it:

```yaml
groups:
  - name: agent_workflow_alerts
    interval: 30s
    rules:
      - alert: AgentExecutionSlow
        expr: histogram_quantile(0.95, rate(agent_duration_seconds_bucket[5m])) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} is slow"
          description: "95th percentile > 5 minutes"

      - alert: HighErrorRate
        expr: rate(agent_error_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.agent_name }}"

      - alert: LowSuccessRate
        expr: agent_success_rate < 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low success rate for {{ $labels.agent_name }}"
```
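
When tuning the HighErrorRate threshold, keep in mind that rate() reports a per-second value, so 0.1 corresponds to roughly 30 errors over a five-minute window. The sketch below illustrates the arithmetic on raw counter samples. It is a simplification: it handles counter resets but does not extrapolate to the window edges the way Prometheus's rate() does, and the function name `simple_rate` is made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over [(timestamp_seconds, value), ...].

    Returns None when there are fewer than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop in value means the counter restarted from zero.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

A counter that goes from 100 to 130 across 300 seconds gives `simple_rate([(0, 100), (300, 130)])`, which is 0.1 errors per second and sits right at the alert threshold.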

Running as a Service

systemd (Linux)

Create /etc/systemd/system/agent-workflow-exporter.service:

```ini
[Unit]
Description=Agent Workflow Prometheus Exporter
After=network.target

[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/agent-workflow
ExecStart=/usr/bin/python3 /path/to/agent-workflow/tools/prometheus_exporter.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start:

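```bash
# Make systemd pick up the new unit file, then enable and start it
sudo systemctl daemon-reload
sudo systemctl enable agent-workflow-exporter
sudo systemctl start agent-workflow-exporter
```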

launchd (macOS)

Create ~/Library/LaunchAgents/com.agent-workflow.exporter.plist:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.agent-workflow.exporter</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/agent-workflow/tools/prometheus_exporter.py</string>
    </array>
    <!-- Mirrors WorkingDirectory in the systemd unit, on the assumption that
         logs/agent_workflow.db is resolved relative to the project root -->
    <key>WorkingDirectory</key>
    <string>/path/to/agent-workflow</string>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/tmp/agent-workflow-exporter.err</string>
    <key>StandardOutPath</key>
    <string>/tmp/agent-workflow-exporter.out</string>
</dict>
</plist>
```

Load the service:

```bash
launchctl load ~/Library/LaunchAgents/com.agent-workflow.exporter.plist
```

Troubleshooting

Exporter Issues

- Database not found:
  - Ensure logs/agent_workflow.db exists
  - Run an agent workflow to create the database
- Port already in use:
  - Use a different port: --port 9091
  - Check what's using the port: lsof -i :9090
- No metrics appearing:
  - Check exporter logs for errors
  - Verify database has data: sqlite3 logs/agent_workflow.db "SELECT COUNT(*) FROM agent_invocations"
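
The database checks above can be combined into a small Python helper, which is handy on machines without the sqlite3 command-line tool. This is a sketch built on the standard sqlite3 module. The table name agent_invocations comes from the query above, while the function name `count_invocations` is hypothetical.

```python
import os
import sqlite3


def count_invocations(db_path="logs/agent_workflow.db"):
    """Return the number of rows in agent_invocations, or raise if the file is missing."""
    # sqlite3.connect() would silently create an empty file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(f"Database not found: {db_path}")
    connection = sqlite3.connect(db_path)
    try:
        row = connection.execute("SELECT COUNT(*) FROM agent_invocations").fetchone()
        return row[0]
    finally:
        connection.close()
```

A result of 0 means the database exists but no agent workflow has logged anything yet, which would explain an empty metrics page.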

Prometheus Issues

- Target DOWN in Prometheus:
  - Check exporter is running
  - Verify network connectivity
  - Check Prometheus logs
- No data in queries:
  - Wait for scrape interval (15s)
  - Check metric names match
  - Verify time range in query

Grafana Issues

- No data in dashboard:
  - Verify Prometheus datasource is working
  - Check time range selector
  - Refresh dashboard (F5)
- Panels showing errors:
  - Check datasource variable is set
  - Verify metrics exist in Prometheus
  - Check panel query syntax

Advanced Configuration

Custom Metrics

To add custom metrics, edit tools/prometheus_exporter.py:

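```python
# Needed only if the exporter does not already import Gauge
from prometheus_client import Gauge

# Add new metric; `registry` is assumed to be the registry object
# already defined in tools/prometheus_exporter.py
custom_metric = Gauge(
    'custom_metric_name',
    'Description of metric',
    ['label1', 'label2'],
    registry=registry
)

# In _collect_custom_metrics method: set the current value for one label combination
custom_metric.labels(label1='value1', label2='value2').set(42)
```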

Performance Tuning

For large databases or high-frequency updates:

- Increase update interval: --interval 30
- Add database indexes for frequently queried columns
- Use connection pooling in the exporter
- Consider using PostgreSQL instead of SQLite

Multi-Instance Setup

To monitor multiple projects:

- Run exporters on different ports
- Add multiple targets in Prometheus config
- Use instance labels to distinguish in Grafana
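
One way to manage several targets is Prometheus's file-based service discovery (file_sd_configs), which reads target groups from a JSON file. The sketch below generates such a file from a mapping of project names to exporter ports. The `project` label, the output file name, and both function names are assumptions for illustration; Prometheus sets the instance label to host:port on its own.

```python
import json


def target_groups(projects, host="localhost"):
    """Build file_sd target groups from {project_name: exporter_port}."""
    return [
        {
            "targets": [f"{host}:{port}"],
            # Hypothetical extra label to tell projects apart in Grafana.
            "labels": {"project": name},
        }
        for name, port in sorted(projects.items())
    ]


def write_targets(projects, path="agent_workflow_targets.json"):
    """Write the target groups to a JSON file for file_sd_configs to read."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(target_groups(projects), handle, indent=2)
```

Calling `write_targets({"project-a": 9091, "project-b": 9092})` produces a file that a scrape job can reference under file_sd_configs, so adding a project does not require editing the main Prometheus configuration.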

Integration with CI/CD

Add monitoring to your CI/CD pipeline:

```yaml
# GitHub Actions example
- name: Start Monitoring
  run: |
    python3 tools/prometheus_exporter.py --port 9090 &
    echo $! > exporter.pid

- name: Run Workflow
  run: /dev-orchestrator TICKET-123

- name: Collect Metrics
  run: |
    curl -s http://localhost:9090/metrics | grep agent_invocation_total
    kill $(cat exporter.pid)
```
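
Because the exporter refreshes its metrics on an interval, a single curl right after the workflow may run before the latest data is published. A possible refinement is to poll the endpoint until the expected metric appears. The following is a sketch with hypothetical names. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.request


def _http_get(url):
    """Return the response body of a GET request as text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def wait_for_metrics(url="http://localhost:9090/metrics",
                     needle="agent_invocation_total",
                     timeout=60, poll=2, fetch=_http_get, sleep=time.sleep):
    """Poll url until needle appears in the body; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            body = fetch(url)
            if needle in body:
                return body
        except OSError:
            # Covers connection refused and urllib.error.URLError while starting up.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{needle} not found at {url} within {timeout}s")
        sleep(poll)
```

In a CI step this could replace the curl and grep pair, for example by running a short script that calls `wait_for_metrics()` and prints the matching lines before the exporter is stopped.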
