This guide explains how to set up Prometheus and Grafana monitoring for the Claude Development Pipeline agent workflow metrics.
The monitoring system tracks:
- Agent invocation counts and rates
- Execution duration metrics (with percentiles)
- Success/failure rates
- Phase distribution
- Parallel execution patterns
- Session metrics
- Database statistics
Prerequisites:
- Python dependencies installed:
  ```bash
  pip install -r requirements.txt
  ```
- Prometheus installed (download from prometheus.io)
- Grafana installed (download from grafana.com)
Start the exporter:

```bash
# Start the exporter (default port: 9090)
python3 tools/prometheus_exporter.py

# Or with custom options
python3 tools/prometheus_exporter.py --port 9091 --interval 10
```

The exporter will:
- Connect to the SQLite database at `logs/agent_workflow.db`
- Expose metrics at `http://localhost:9090/metrics`
- Update metrics every 15 seconds (configurable)
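At its core, the exporter turns SQLite rows into Prometheus samples. A minimal sketch of that idea, using an in-memory stand-in for the database (the table and column names here are illustrative assumptions, not the actual schema):

```python
import sqlite3

# In-memory stand-in for logs/agent_workflow.db
# (table and column names are assumptions for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_invocations (
        agent_name TEXT,
        status TEXT,
        duration_seconds REAL
    )
""")
conn.executemany(
    "INSERT INTO agent_invocations VALUES (?, ?, ?)",
    [
        ("planner", "success", 12.5),
        ("planner", "failure", 3.1),
        ("coder", "success", 48.0),
    ],
)

# Compute a per-agent success rate: the kind of value the exporter
# publishes as the agent_success_rate gauge
rows = conn.execute("""
    SELECT agent_name,
           100.0 * SUM(status = 'success') / COUNT(*) AS success_rate
    FROM agent_invocations
    GROUP BY agent_name
""").fetchall()

# Print in Prometheus exposition format
for agent, rate in rows:
    print(f'agent_success_rate{{agent_name="{agent}"}} {rate:.1f}')
```

The real exporter wraps values like this in `prometheus_client` metric objects and serves them over HTTP; this sketch only shows the SQL-to-sample step.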
Configure and start Prometheus:

1. Copy or merge the provided configuration:

   ```bash
   # If Prometheus is installed via Homebrew (macOS)
   cp config/prometheus.yml /usr/local/etc/prometheus/prometheus.yml

   # Or specify the config when starting Prometheus
   prometheus --config.file=config/prometheus.yml
   ```

2. Start Prometheus:

   ```bash
   prometheus --config.file=config/prometheus.yml
   ```

3. Verify at `http://localhost:9090/targets` that the `agent_workflow` target is UP.
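If you are merging into an existing Prometheus configuration rather than copying the shipped file, the relevant part is a scrape job along these lines (the job name and interval are assumptions about the provided `config/prometheus.yml`). Note that Prometheus itself listens on port 9090 by default, so if you run Prometheus and the exporter on the same host, start the exporter on another port (e.g. `--port 9091`) and point the target there:

```yaml
scrape_configs:
  - job_name: agent_workflow
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
```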
Set up Grafana:

1. Access Grafana at `http://localhost:3000` (default credentials: admin/admin)
2. Add the Prometheus data source:
   - Navigate to Configuration → Data Sources
   - Add New Data Source → Prometheus
   - URL: `http://localhost:9090`
   - Save & Test
3. Import the dashboard:
   - Navigate to Dashboards → Import
   - Upload `config/grafana-dashboard.json`
   - Select your Prometheus datasource
   - Import
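As an alternative to clicking through the UI, the data source can be provisioned from a file that Grafana reads from its `provisioning/datasources` directory (the exact path varies by install); a minimal sketch:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

This is convenient for reproducible setups, since the data source exists as soon as Grafana starts.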
Exported metrics:

Counters:
- `agent_invocation_total`: Total invocations by agent, phase, status, model
- `agent_error_total`: Total errors by agent and phase
- `phase_distribution_total`: Distribution across workflow phases

Histograms:
- `agent_duration_seconds`: Execution duration with percentiles
- `session_duration_seconds`: Session duration distribution

Gauges:
- `agent_last_execution_timestamp`: Last execution time per agent
- `agent_success_rate`: Success rate percentage per agent
- `active_sessions_count`: Currently active sessions
- `parallel_executions_current`: Current parallel executions
- `total_unique_agents`: Total unique agents in system
- `total_sessions`: Total logged sessions
- `avg_agents_per_session`: Average agents per session

Info:
- `agent_workflow_database_info`: Database metadata
The provided Grafana dashboard includes:
- Overview Stats: Total invocations, unique agents, sessions, success rate
- Agent Performance Timeline: 95th percentile execution times
- Agent Performance Table: Invocation counts and success rates
- Phase Distribution: Donut chart of workflow phases
- Average Execution Time: Bar chart by agent
- Agent Invocation Rate: Time series of invocation rates
- System Metrics: Parallel executions, active sessions, avg agents/session
Useful PromQL queries for custom panels:

```promql
# Top 5 slowest agents (95th percentile)
topk(5, histogram_quantile(0.95, rate(agent_duration_seconds_bucket[5m])))

# Agents with a low success rate (below 80%)
agent_success_rate < 80

# Phase distribution percentage
sum by (phase) (phase_distribution_total) / sum(phase_distribution_total) * 100

# Parallel execution efficiency
parallel_executions_current / active_sessions_count

# Error rate by agent (last 5 minutes)
rate(agent_error_total[5m])

# Session completion rate
rate(session_duration_seconds_count[1h])
```
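If `histogram_quantile` looks opaque, this rough Python sketch shows the underlying idea: Prometheus histograms are cumulative buckets, and the quantile is estimated by linear interpolation inside the bucket containing the target rank (simplified here; the real function also handles `rate()` windows and edge cases):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    roughly as PromQL's histogram_quantile does. The last bucket must be +Inf;
    quantiles landing in it are clamped to the highest finite bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # clamp to highest finite upper bound
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative duration buckets: 40 observations <= 10s, 90 <= 60s, 100 total
buckets = [(10.0, 40), (60.0, 90), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))   # → 20.0
```

The median lands in the 10–60s bucket, 10 observations past its lower edge out of 50 in the bucket, hence 10 + 50 × 0.2 = 20 seconds.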
Create `agent_workflow_rules.yml` for alerting:

```yaml
groups:
  - name: agent_workflow_alerts
    interval: 30s
    rules:
      - alert: AgentExecutionSlow
        expr: histogram_quantile(0.95, rate(agent_duration_seconds_bucket[5m])) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} is slow"
          description: "95th percentile execution time exceeds 5 minutes"

      - alert: HighErrorRate
        expr: rate(agent_error_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.agent_name }}"

      - alert: LowSuccessRate
        expr: agent_success_rate < 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low success rate for {{ $labels.agent_name }}"
```

Reference the file from the `rule_files:` section of your Prometheus configuration so the rules are loaded.

Create `/etc/systemd/system/agent-workflow-exporter.service`:
```ini
[Unit]
Description=Agent Workflow Prometheus Exporter
After=network.target

[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/agent-workflow
ExecStart=/usr/bin/python3 /path/to/agent-workflow/tools/prometheus_exporter.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start:
```bash
sudo systemctl enable agent-workflow-exporter
sudo systemctl start agent-workflow-exporter
```

Create `~/Library/LaunchAgents/com.agent-workflow.exporter.plist`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.agent-workflow.exporter</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/agent-workflow/tools/prometheus_exporter.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/tmp/agent-workflow-exporter.err</string>
    <key>StandardOutPath</key>
    <string>/tmp/agent-workflow-exporter.out</string>
</dict>
</plist>
```

Load the service:

```bash
launchctl load ~/Library/LaunchAgents/com.agent-workflow.exporter.plist
```
Database not found:
- Ensure `logs/agent_workflow.db` exists
- Run an agent workflow to create the database

Port already in use:
- Use a different port: `--port 9091`
- Check what's using the port: `lsof -i :9090`

No metrics appearing:
- Check exporter logs for errors
- Verify the database has data: `sqlite3 logs/agent_workflow.db "SELECT COUNT(*) FROM agent_invocations"`
Target DOWN in Prometheus:
- Check that the exporter is running
- Verify network connectivity
- Check Prometheus logs

No data in queries:
- Wait for the scrape interval (15s)
- Check that metric names match
- Verify the time range in the query
No data in dashboard:
- Verify the Prometheus datasource is working
- Check the time range selector
- Refresh the dashboard (F5)

Panels showing errors:
- Check that the datasource variable is set
- Verify the metrics exist in Prometheus
- Check panel query syntax
To add custom metrics, edit `tools/prometheus_exporter.py`:

```python
# Add new metric
custom_metric = Gauge(
    'custom_metric_name',
    'Description of metric',
    ['label1', 'label2'],
    registry=registry
)

# In the _collect_custom_metrics method
custom_metric.labels(label1='value1', label2='value2').set(42)
```

For large databases or high-frequency updates:
- Increase the update interval: `--interval 30`
- Add database indexes for frequently queried columns
- Use connection pooling in the exporter
- Consider using PostgreSQL instead of SQLite
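Indexes can be added once with a few lines of `sqlite3`. A sketch using an in-memory stand-in (in practice, connect to `logs/agent_workflow.db`; the column names indexed here are assumptions, so match them to the columns your exporter queries actually filter or group by):

```python
import sqlite3

# Stand-in for sqlite3.connect("logs/agent_workflow.db");
# table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_invocations (agent_name TEXT, phase TEXT, started_at REAL)"
)

# Index the columns the exporter filters and groups by
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_inv_agent ON agent_invocations(agent_name)"
)
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_inv_phase ON agent_invocations(phase)"
)

# Confirm the indexes exist
indexes = [row[1] for row in conn.execute("PRAGMA index_list('agent_invocations')")]
print(indexes)
```

`CREATE INDEX IF NOT EXISTS` makes the statement safe to re-run, so it can live in a setup script.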
To monitor multiple projects:
- Run exporters on different ports
- Add multiple targets in Prometheus config
- Use instance labels to distinguish in Grafana
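The multi-project setup sketched above maps onto the Prometheus config roughly like this (ports, job name, and the `project` label are illustrative assumptions):

```yaml
scrape_configs:
  - job_name: agent_workflow
    static_configs:
      - targets: ['localhost:9091']
        labels:
          project: project-a
      - targets: ['localhost:9092']
        labels:
          project: project-b
```

In Grafana, the `project` label can then back a dashboard variable for switching between projects.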
Add monitoring to your CI/CD pipeline:

```yaml
# GitHub Actions example
- name: Start Monitoring
  run: |
    python3 tools/prometheus_exporter.py --port 9090 &
    echo $! > exporter.pid

- name: Run Workflow
  run: /dev-orchestrator TICKET-123

- name: Collect Metrics
  run: |
    curl -s http://localhost:9090/metrics | grep agent_invocation_total
    kill $(cat exporter.pid)
```