
docs: add research on API agent integration approaches for WAA#11

Closed
abrichr wants to merge 22 commits into main from docs/api-agent-integration-approaches

Conversation

@abrichr
Member

@abrichr abrichr commented Jan 23, 2026

Summary

This PR documents the analysis of integrating API-backed agents (Claude Sonnet 4.5 / GPT-5.1) into Windows Agent Arena's benchmark framework.

Problem

The current Dockerfile attempts to patch WAA's run.py at build time using sed/Python string replacement, but this fails due to:

  1. Order dependencies (Python not installed when patch runs)
  2. String matching fragility (whitespace sensitivity)
  3. No verification of patch success
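
To illustrate the verification gap, a build-time patch should fail loudly when its target string is absent rather than silently producing an unpatched image. A minimal sketch (function and file names hypothetical, not WAA's actual build code):

```python
from pathlib import Path


def patch_file(path: Path, old: str, new: str) -> None:
    """Replace `old` with `new` in `path`, failing the build if `old` is absent.

    Unlike a bare sed/str.replace, a missing target raises instead of
    passing silently -- addressing fragility point 3 above.
    """
    text = path.read_text()
    if old not in text:
        raise RuntimeError(f"patch target not found in {path}: {old!r}")
    path.write_text(text.replace(old, new))
```

This does not fix the ordering or whitespace problems, but it turns a silent failure into an immediate build error.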

Solution Analysis

The document analyzes 6 alternative approaches:

| Approach | Complexity | Fragility | Recommended? |
|---|---|---|---|
| 1. Runtime patch (docker exec) | Low | High | Maybe |
| 2. Pre-patched file | Very Low | Medium | Short-term |
| 3. Volume mount | Medium | Medium | No |
| 4. Fork upstream | High (init) | Low | Long-term |
| 5. Import hook | High | Very High | No |
| 6. Wrapper script | Medium | Low | **Yes** |

Recommendation

The wrapper script approach (Approach 6) is recommended because:

  • Zero modification to WAA's original code
  • Clear separation of concerns
  • Future-proof against upstream changes
  • Easy to test and extend
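
As a sketch of the wrapper idea (module paths and registration mechanism are hypothetical assumptions, not WAA's actual API): a thin entry point makes the API-backed agents resolvable, then delegates to WAA's unmodified run.py with the same argv.

```python
# run_with_api_agents.py -- hypothetical wrapper; WAA's real entry points may differ.
import runpy
import sys


def register_api_agents() -> None:
    """Make api-claude / api-openai importable before run.py parses its args.

    A real implementation might instead monkey-patch WAA's agent registry;
    shown here as a sys.path stub (path is an assumption).
    """
    sys.path.insert(0, "/path/to/api_agents")


def main() -> None:
    register_api_agents()
    # Delegate to WAA's original, unmodified run.py with the same argv.
    runpy.run_path("run.py", run_name="__main__")
```

Because run.py is executed rather than edited, upstream changes to WAA require no re-patching, which is the "future-proof" property claimed above.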

Test plan

  • Document created at docs/research/api_agent_integration_approaches.md
  • Review and discuss approach with team
  • Implement recommended approach in follow-up PR

Generated with Claude Code

abrichr and others added 22 commits January 18, 2026 19:06
Phase 1 of viewer consolidation plan: Foundation

Changes:
- Add openadapt-viewer as local file dependency in pyproject.toml
- Create openadapt_ml/training/viewer_components.py adapter module
  * screenshot_with_predictions() - Screenshot with human/AI overlays
  * training_metrics() - Training stats metrics grid
  * playback_controls() - Playback UI controls
  * correctness_badge() - Pass/fail badge component
  * generate_comparison_summary() - Model comparison summary
- Add tests/test_viewer_screenshots.py with component validation tests
- Add openadapt_ml/training/viewer_migration_example.py validation example

Design:
- Zero breaking changes to existing viewer.py code
- Adapter pattern wraps openadapt-viewer with ML-specific context
- Functions accept openadapt-ml data structures
- Can be incrementally adopted in future phases

Next steps (Phase 2):
- Gradually migrate viewer.py to use these adapters
- Replace inline HTML generation with component calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Restored and enhanced the workflow segmentation system from commit dd9a393
with new integration for openadapt-capture format.

## What's Added

### Core Segmentation Pipeline (4 stages):

1. **Stage 1 - Frame Description (VLM)**:
   - Converts screenshots + actions into semantic descriptions
   - Supports Gemini, Claude, GPT-4o backends
   - Automatic caching for efficiency
   - File: openadapt_ml/segmentation/frame_describer.py

2. **Stage 2 - Episode Extraction (LLM)**:
   - Identifies coherent workflow boundaries
   - Few-shot prompting for better quality
   - Confidence-based filtering
   - File: openadapt_ml/segmentation/segment_extractor.py

3. **Stage 3 - Deduplication (Embeddings)**:
   - Finds similar workflows across recordings
   - Agglomerative clustering with cosine similarity
   - Supports OpenAI or local HuggingFace embeddings
   - File: openadapt_ml/segmentation/deduplicator.py

4. **Stage 4 - Annotation (VLM Quality Control)**:
   - Auto-annotates episodes for training data quality
   - Detects failures, boundary issues, incompleteness
   - Human-in-the-loop review workflow
   - File: openadapt_ml/segmentation/annotator.py

### Integration Features:

- **CaptureAdapter**: Loads recordings from openadapt-capture SQLite format
  - File: openadapt_ml/segmentation/adapters/capture_adapter.py
  - Automatically used when capture.db is detected
  - Converts events to segmentation format

- **Unified Pipeline**: Run all stages with single API
  - File: openadapt_ml/segmentation/pipeline.py
  - Automatic intermediate result caching
  - Resume support for interrupted runs

- **CLI Interface**: Full command-line interface for all stages
  - File: openadapt_ml/segmentation/cli.py
  - Commands: describe, extract, deduplicate, annotate, review, export-gold

- **Comprehensive Documentation**:
  - File: openadapt_ml/segmentation/README.md
  - 20+ code examples
  - Complete API reference
  - Integration guide
  - Cost estimates and performance benchmarks

## Use Cases

1. **Training Data Curation**: Extract and filter high-quality demonstration episodes
2. **Demo Retrieval**: Build searchable libraries for demo-conditioned prompting
3. **Workflow Documentation**: Auto-generate step-by-step guides from recordings

## Data Schemas

All schemas use Pydantic for type safety (openadapt_ml/segmentation/schemas.py):
- ActionTranscript: Frame-by-frame semantic descriptions
- Episode: Coherent workflow segment with boundaries
- CanonicalEpisode: Deduplicated workflow definition
- EpisodeAnnotation: Quality assessment for training data

## Example Usage

```python
from openadapt_ml.segmentation import SegmentationPipeline, PipelineConfig

config = PipelineConfig(
    vlm_model="gemini-2.0-flash",
    llm_model="gpt-4o",
    similarity_threshold=0.85
)

pipeline = SegmentationPipeline(config)
result = pipeline.run(
    recordings=["/path/to/recording1", "/path/to/recording2"],
    output_dir="workflow_library"
)

print(f"Found {result.unique_episodes} unique workflows")
```

## Next Steps

See openadapt_ml/segmentation/README.md for:
- P0: Integration tests with real openadapt-capture recordings
- P0: Visualization generator for segment boundaries
- P1: Improved prompt engineering and cost optimization
- P2: Active learning and multi-modal features

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Features added:
- Azure ML job tracking: Shows recent jobs from last 7 days with status
- Cost tracking: Real-time uptime, hourly rate, and cost estimation
- VM activity detection: Identifies what VM is currently doing
- Evaluation history: Past benchmark runs and success rates (--details flag)
- Enhanced UI: Structured dashboard with clear sections and icons

New utility functions in vm_monitor.py:
- fetch_azure_ml_jobs(): Fetch recent Azure ML jobs with filtering
- calculate_vm_costs(): Calculate VM costs with hourly/daily/weekly rates
- get_vm_uptime_hours(): Get VM uptime from Azure activity logs
- detect_vm_activity(): Detect current VM activity (idle, running, setup)
- get_evaluation_history(): Load past evaluation runs from results dir
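
The cost arithmetic behind calculate_vm_costs() is simple projection from uptime and an hourly rate; a sketch with an illustrative (assumed) rate, not the real function's signature:

```python
def project_vm_costs(uptime_hours: float, hourly_rate: float) -> dict:
    """Project VM spend (USD) from uptime hours and an hourly rate."""
    return {
        "uptime_hours": uptime_hours,
        "hourly_rate": hourly_rate,
        "total_cost": round(uptime_hours * hourly_rate, 2),
        "daily_rate": round(hourly_rate * 24, 2),
        "weekly_rate": round(hourly_rate * 24 * 7, 2),
    }

# Example: a VM at an assumed ~$0.45/hour, up for 36 hours:
# project_vm_costs(36, 0.45)
```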

CLI enhancements:
- Added --details flag for extended information
- Improved output formatting with sections and separators
- Better error handling and status icons
- Preserved existing SSH tunnel and dashboard functionality

Documentation:
- Updated CLAUDE.md with new features and usage examples
- Added detailed docstrings to all new functions

This consolidates VM monitoring into a single enhanced command rather than
creating duplicate dashboards, following the viewer consolidation strategy.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update CaptureAdapter to work with actual openadapt-capture database
format. Key changes:

- Use screen.frame events instead of generic event types
- Pair action events (mouse.down + mouse.up → single click)
- Map frame events to screenshots via timestamp matching
- Update event type filtering to match openadapt-capture schema
- Improve frame-to-action association logic

This enables the segmentation pipeline to process real capture recordings
from openadapt-capture instead of requiring simulated data.
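
The pairing and timestamp-matching logic above can be sketched like this (simplified; field names are illustrative, not the exact openadapt-capture schema):

```python
import bisect


def match_frame(action_ts: float, frame_timestamps: list[float]) -> int:
    """Index of the last screen.frame at or before the action timestamp.

    frame_timestamps must be sorted ascending; falls back to the first
    frame if the action precedes all frames.
    """
    i = bisect.bisect_right(frame_timestamps, action_ts) - 1
    return max(i, 0)


def pair_clicks(events: list[dict]) -> list[dict]:
    """Collapse mouse.down + mouse.up sequences into single click events."""
    clicks, pending_down = [], None
    for ev in events:
        if ev["type"] == "mouse.down":
            pending_down = ev
        elif ev["type"] == "mouse.up" and pending_down is not None:
            clicks.append({
                "type": "click",
                "timestamp": pending_down["timestamp"],
                "x": pending_down["x"],
                "y": pending_down["y"],
            })
            pending_down = None
    return clicks
```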

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enhance vm monitor command to provide complete VM usage tracking:
- Real-time VM status (size, IP, power state)
- Activity detection (idle, benchmark running, setup)
- Cost tracking (uptime hours, hourly rate, total cost)
- Azure ML jobs list (last 7 days with status)
- Evaluation history (with --details flag)
- Mock mode for testing without VM (--mock flag)

Add new API endpoints to local.py dashboard server:
- /api/benchmark/status - current job status with ETA
- /api/benchmark/costs - cost breakdown (Azure VM, API, GPU)
- /api/benchmark/metrics - performance metrics by domain
- /api/benchmark/workers - worker status and utilization
- /api/benchmark/runs - list all benchmark runs
- /api/benchmark/tasks/{run}/{task} - task execution details

Update README with VM monitor section including screenshots and
usage examples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive test plan and results for workflow segmentation pipeline:
- Test plan with 8 stages from environment setup to documentation
- Test results documenting real capture processing outcomes
- Test files for CaptureAdapter and segmentation pipeline

Add VM monitor screenshot generation scripts and documentation:
- Scripts for automated dashboard screenshot generation
- Implementation plan for VM monitor screenshot feature
- Analysis of screenshot capture approaches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Archive OpenAdapter (incomplete pre-refactor cloud deployment POC)
- Document key takeaways and lessons learned
- Reference modern cloud infrastructure in openadapt-ml
- Add guidelines for when to archive repositories

OpenAdapter was an incomplete proof-of-concept from October 2024
with only 165 lines of code and no ecosystem usage. Cloud deployment
is now production-ready in openadapt_ml/cloud/ and benchmarks/azure.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add search bar to viewer controls with Ctrl+F / Cmd+F keyboard shortcut
- Implement advanced token-based search across step indices, action types, and text
- Search filters step list in real-time with result count display
- Clear button and Escape key support for resetting search
- Consistent UI styling with existing viewer components
- Integrates with existing step list filtering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove non-existent openadapt_ml.shared_ui import from viewer.py
- Skip anthropic test when anthropic package not installed (optional dependency)
- Skip viewer_components test when openadapt-viewer not installed (optional dependency)

All tests now pass (334 passed, 6 skipped).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add azure_ops_tracker.py for real-time status tracking via SSE
- Add azure_ops_viewer.py with live VNC iframe embed
- Add /api/azure-ops-status and /api/azure-ops-sse endpoints
- Add progress bar, cost tracking, elapsed time display
- Add copy logs button and auto-scroll controls

feat(cli): add new VM management commands

- Add vm start-windows command
- Add vm restart-windows command
- Add vm check-build command
- Add vm screenshot command for capturing dashboards
- Fix container restart to always use --cap-add NET_ADMIN

feat(infra): add screenshot capture infrastructure

- Add capture_screenshots.py script
- Configure BuildKit GC with 30GB limit
- Fix Dockerfile OEM path and networking

docs: add Azure dashboard spec and update CLI documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… install

- Add automatic Docker build cache cleanup before waa-auto builds
- Fix all VERSION=11e → VERSION=11 for fully unattended Windows install
  (Enterprise Evaluation shows edition picker dialog; Pro does not)
- Update CLAUDE.md documentation with disk space management solution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: CLI used VERSION=11 but Dockerfile uses VERSION=11e.
This caused XML patches (applied for 11e) to be ignored at runtime.

Enterprise Eval (11e) has built-in GVLK key - never prompts for product key.

Fixes: openadapt-evals-b3l

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
VERSION=11e (Enterprise Eval) has built-in GVLK - never prompts.
VERSION=11 (Pro) may prompt for product key.

Previous documentation was backwards, causing confusion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous approach copied windowsarena's autounattend.xml over
dockurr/windows's version, which broke the OOBE flow.

Changes:
- Remove COPY commands that replaced the base image's XML files
- Add conditional sed patch that only adds InstallFrom element if needed
- Reorder Dockerfile to install Python deps before running python3 commands
- Add clear comments explaining the OEM mechanism

This fixes Windows installation failures where the OOBE would hang
or show incorrect dialogs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major cleanup of benchmarks CLI:

Removed deprecated handlers (~1200 lines):
- setup-waa: Replaced by top-level 'waa' command
- run-waa: Replaced by top-level 'waa' command
- prepare-windows: Replaced by top-level 'waa' command
- waa-native: Replaced by scripts/waa_bootstrap_local.sh

Added features:
- cleanup_waa_resources(): Auto-cleanup leftover Azure resources
  (NICs, VNETs, NSGs, PublicIPs, disks) before VM creation
- Updated default VM size to Standard_D8ds_v5 (300GB temp storage)
- Updated help text with temp storage sizes for each VM option
- Added deprecation notice to legacy viewer command

The cleanup function prevents "resource already exists" errors when
previous VM deletion was incomplete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive guidelines for Claude Code sessions:

CLI-First Rule:
- Never use raw az/ssh commands that require permission
- Always use or extend the CLI for VM operations
- Example pattern for adding new CLI functionality

Standard VM Configuration Workflow:
- Delete VM, update code, recreate (vs. trying to resize)
- Current VM defaults (D8ds_v5, eastus, Ubuntu 22.04)

This reduces friction by documenting the pre-approved command
patterns and standard operating procedures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Close unclosed code block (lines 33-41)
- Remove hardcoded absolute path, use relative description

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key fixes to waa_deploy/Dockerfile:
- Don't replace dockurr/windows autounattend.xml, only patch with InstallFrom
  element to prevent "Select the operating system" dialog
- Use sed instead of python3 for run.py patching (Python installed later)
- Fix entrypoint: use /run/entry.sh instead of non-existent /copy-oem.sh

This enables fully automated Windows 11 Enterprise Eval installation with
VERSION=11e, no manual intervention required. WAA server starts automatically
via FirstLogonCommands.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Analyze the problem of integrating api-claude and api-openai agents
into Windows Agent Arena's run.py. The current build-time patching
via sed/Python fails due to order dependencies and string matching
fragility.

Document 6 alternative approaches with tradeoffs:
1. Runtime patching via docker exec
2. Ship pre-patched run.py
3. Volume mount at runtime
4. Fork WAA upstream
5. Python import hook (experimental)
6. Wrapper script (recommended)

Recommend the wrapper script approach for its zero modification to
WAA code and clear separation of concerns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@abrichr abrichr closed this Jan 24, 2026
@abrichr
Member Author

abrichr commented Jan 24, 2026

Closing as superseded by #10, which was merged with more recent commits (through Jan 24). The API agent integration research doc (docs/research/waa_api_agent_integration.md) can be added in a separate PR if needed.
