CLAUDE.md - OmniMCP Implementation Guide

Overview

This document describes how to implement OmniMCP, a system for UI automation through visual understanding and the Model Context Protocol (MCP).

Core Architecture

The system consists of these essential components:

  1. VisualState - Current screen state
  2. MCP Server - Protocol implementation
  3. Input Control - UI actions
  4. UI Parser Integration - Visual analysis

Implementation Approach

1. Start with VisualState

class VisualState:
    def __init__(self):
        self.elements = []
        self.timestamp = None
        self.screen_dimensions = None
        
    def update(self, screenshot):
        """Update visual state from screenshot.
        
        Critical function that maintains screen state.
        Must handle:
        - Screenshot capture
        - UI element parsing
        - State updates
        - Coordinate normalization
        """

2. Implement Core MCP Server

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("omnimcp")

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""
    
@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""

@mcp.tool() 
async def type_text(text: str) -> TypeResult:
    """Type text"""

3. Build Element Targeting

def find_element(description: str) -> Element:
    """Find UI element matching description.
    
    Critical for action reliability.
    Consider:
    - Text matching
    - Element type
    - Location/context
    - Confidence scores
    """

Implementation Order

  1. Visual State Management

    • Screenshot capture
    • UI parsing
    • State updates
    • Basic caching
  2. MCP Protocol

    • Observe endpoint
    • Simple actions
    • Rich responses
    • Error handling
  3. Action System

    • Element targeting
    • Input simulation
    • Action verification
    • Error recovery

Key Considerations

State Management

  • Always update before actions
  • Cache intelligently
  • Track history when needed
  • Clear invalidation

Error Handling

  • Rich error context
  • Recovery strategies
  • Debug information
  • Verification

Performance

  • Minimize updates
  • Smart caching
  • Async where beneficial
  • Efficient targeting
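The caching point above can be sketched as a simple time-to-live gate around the expensive screenshot-and-parse step. The 250 ms default TTL is an arbitrary illustration:

```python
import time

class CachedState:
    """Skip re-parsing the screen when the last capture is fresh enough."""

    def __init__(self, capture_fn, ttl_seconds: float = 0.25):
        self.capture_fn = capture_fn   # expensive: screenshot + UI parse
        self.ttl = ttl_seconds
        self._state = None
        self._captured_at = 0.0

    def get(self, force: bool = False):
        now = time.monotonic()
        if force or self._state is None or now - self._captured_at > self.ttl:
            self._state = self.capture_fn()
            self._captured_at = now
        return self._state

    def invalidate(self) -> None:
        """Call after any action that changes the screen (clear invalidation)."""
        self._state = None
```

Calling `invalidate()` from every action handler gives the "always update before actions" guarantee while still avoiding redundant captures between actions.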

MCP Protocol Details

Observe

@dataclass
class UIElement:
    content: str
    type: str
    bounds: Bounds
    confidence: float

@dataclass
class ScreenState:
    elements: List[UIElement]
    dimensions: tuple[int, int]
    timestamp: float

@dataclass
class ActionResult:
    success: bool
    element: Optional[UIElement]
    error: Optional[str] = None
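`Bounds` is referenced above but never defined. One plausible definition, including the center point that click actions need; the choice of normalized 0-1 coordinates is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Bounds:
    """Normalized element bounds: all values in the 0-1 range."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self) -> tuple:
        """Center point of the element, e.g. the click target."""
        return (self.x + self.width / 2, self.y + self.height / 2)

    def to_pixels(self, screen_w: int, screen_h: int) -> tuple:
        """Convert to absolute pixel coordinates (left, top, right, bottom)."""
        return (
            int(self.x * screen_w),
            int(self.y * screen_h),
            int((self.x + self.width) * screen_w),
            int((self.y + self.height) * screen_h),
        )
```

Keeping bounds normalized makes element positions stable across screen resolutions; conversion to pixels happens only at input time.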

Code Structure

Current implementation:

./
├── omnimcp/             # Main package directory
│   ├── omnimcp.py       # Core implementation with OmniMCP class and VisualState
│   ├── input.py         # Input controller for UI interactions
│   ├── types.py         # Type definitions (Bounds, UIElement, etc.)
│   ├── utils.py         # Utilities for screenshots, coordinates, etc.
│   ├── config.py        # Centralized configuration
│   └── omniparser/      # UI parsing functionality
│       ├── client.py    # Parser client and provider
│       └── server.py    # Parser deployment and management
├── tests/               # Test directory
│   ├── test_synthetic_ui.py  # Synthetic UI generation for testing
│   └── test_omnimcp.py       # Core functionality tests
└── run_omnimcp.py       # Command-line entry point

Planned expansion:

./
├── utils.py              # Core utilities and input control
├── omniparser/          # UI parsing functionality
│   ├── client.py        # Parser client and provider
│   └── server.py        # Parser deployment and management
├── core/               # Future: Core state management
│   ├── visual_state.py
│   └── element.py
└── mcp/                # Future: MCP implementation
    └── server.py

Package Management

OmniMCP uses uv for dependency management. When adding new dependencies, use:

uv add <package-name>       # Add a regular dependency
uv add --dev <package-name> # Add a development dependency
uv pip install -e .         # Install all dependencies

This ensures dependencies are properly recorded in pyproject.toml.

Configuration System

OmniMCP uses a centralized configuration system with:

  • Settings loaded from environment variables and .env file
  • Default values for all settings
  • Support for various configuration types:
    • Claude API settings
    • OmniParser connection settings
    • AWS deployment configuration
    • Debug and logging settings

To configure OmniMCP, create a .env file in the project root with your settings.
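For example, a .env file might look like this (the variable names are illustrative assumptions, not the actual settings schema):

```
# Claude API settings
ANTHROPIC_API_KEY=sk-...

# OmniParser connection settings
OMNIPARSER_URL=http://localhost:8000

# Debug and logging settings
DEBUG=true
LOG_LEVEL=INFO
```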

Implementation Notes

Core Principles

  1. Visual state is always current
  2. Every action verifies completion
  3. Rich error context always available
  4. Debug information accessible

Critical Functions

  1. VisualState.update()
  2. MCPServer.observe()
  3. find_element()
  4. verify_action()

Error Handling

@dataclass
class ToolError:
    message: str
    visual_context: Optional[bytes]  # Screenshot
    attempted_action: str
    element_description: str
    recovery_suggestions: List[str]

Testing Requirements

  1. Unit tests for core logic
  2. Integration tests for flows
  3. Visual verification
  4. Performance benchmarks

Synthetic UI Testing

OmniMCP includes tools for generating synthetic test UIs with:

  • Predefined UI elements (buttons, text fields, checkboxes)
  • Before/after image pairs for action verification
  • Element visualization for debugging

This approach offers several advantages:

  • Works across all platforms
  • Runs in any environment (including CI)
  • Provides deterministic results
  • Doesn't require actual displays
  • Enables testing different scenarios

Example Implementation Flow

  1. Setup Visual State
visual_state = VisualState()
visual_state.update(take_screenshot())
  2. Find Target Element
element = visual_state.find_element_by_content("Submit")
if not element:
    raise MCPError("Element not found", context=visual_state.to_dict())
  3. Take Action
success = await input_controller.click(element.center)
if not success:
    raise MCPError("Click failed", context={"element": element})
  4. Verify Result
@dataclass
class ActionVerification:
    success: bool
    before_state: bytes  # Screenshot
    after_state: bytes
    changes_detected: List[BoundingBox]
    confidence: float

async def verify_tool_execution(
    action_result: ActionResult,
    verification: ActionVerification
) -> bool:
    """Verify tool executed successfully"""

Remember

  1. Focus on core functionality first
  2. Build incrementally
  3. Test thoroughly
  4. Keep it simple but robust
  5. Always verify actions
  6. Maintain current state
  7. Provide rich error context

This implementation guide focuses on the essential components needed for effective UI automation through visual understanding and action. Follow the implementation order strictly and ensure each component is solid before moving to the next.

===

Here's a high-level description of the ideal OmniMCP system:

OmniMCP System Design

Core Purpose

OmniMCP is a Model Context Protocol (MCP) server that enables AI models (particularly Claude) to:

  1. Understand UI elements on screen through visual analysis
  2. Take actions through mouse and keyboard control
  3. Get rich visual context about UI elements using Claude's vision capabilities

Key Components

1. MCP Server

class MCPServer:
    """Core MCP server implementing the Model Context Protocol.
    
    Primary interface for AI models to interact with the UI.
    """
    
    async def get_screen_state() -> Dict:
        """Get current screen state with UI elements."""
        
    async def analyze_ui(query: str, max_elements: int = 5) -> Dict:
        """Analyze UI elements matching a natural language query."""
        
    async def click_element(descriptor: str) -> Dict:
        """Click UI element by description."""
        
    async def type_text(text: str) -> Dict:
        """Type text using keyboard."""
        
    async def press_key(key: str) -> Dict:
        """Press a keyboard key."""

2. Visual Analysis

class VisualState:
    """Represents current screen state with UI elements."""
    
    def update_from_parser(self, parser_result: Dict):
        """Update state from UI parser results."""
        
    def find_element_by_content(self, content: str) -> Optional[Element]:
        """Find UI element by content."""
        
    def to_mcp_description(self) -> Dict:
        """Convert state to MCP format."""

3. UI Parser Integration

class OmniParserClient:
    """Client for interacting with the OmniParser API."""
    
    def parse_image(self, image: Image.Image) -> Dict[str, Any]:
        """Parse an image using the OmniParser service."""
        
    def check_server_available(self) -> bool:
        """Check if the OmniParser server is available."""

class OmniParserProvider:
    """Provider for OmniParser services with deployment capabilities."""
    
    def deploy(self) -> bool:
        """Deploy OmniParser if not already running."""
    
    def is_available(self) -> bool:
        """Check if parser is available."""

4. Input Control

class InputController:
    """Handles mouse and keyboard input."""
    
    def click(self, x: float, y: float):
        """Click at coordinates."""
        
    def type_text(self, text: str):
        """Type text."""
        
    def press_key(self, key: str):
        """Press keyboard key."""

5. Claude Vision Integration

class ClaudeVision:
    """Handles visual analysis using Claude."""
    
    async def describe_elements(
        elements: List[Element],
        context: Optional[Image] = None
    ) -> List[str]:
        """Get detailed descriptions of UI elements."""
        
    async def analyze_visual_query(
        query: str,
        screenshot: Image,
        elements: List[Element]
    ) -> Dict:
        """Answer questions about UI using Claude's vision."""

MCP Tools Interface

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""
    state = await visual_state.capture()
    return state

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description"""
    state = await get_screen_state()
    return semantic_element_search(state.elements, description)

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""
    element = await find_element(description)
    if not element:
        return ClickResult(success=False, error="Element not found")
    return await perform_click(element)

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard"""
    try:
        await keyboard.type_text(text)
        return TypeResult(success=True, text_entered=text)
    except Exception as e:
        return TypeResult(success=False, error=str(e))

@mcp.tool()
async def press_key(
    key: str,
    modifiers: Optional[List[str]] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers"""

Key Features

  1. Smart UI Analysis

    • Visual element detection
    • Natural language queries
    • Rich context through Claude vision
    • Element relationships and hierarchy
  2. Robust Actions

    • Smart element targeting
    • Coordinate normalization
    • Input verification
    • Action confirmation
  3. Development Support

    • Debug visualizations
    • Action logging
    • Error diagnostics
    • Performance metrics
  4. Deployment Options

    • Local parser
    • Remote parser service
    • Auto-deployment
    • Service management

===

OmniMCP Implementation Approach

Core Design Principles

  1. MCP server is the primary interface
  2. Visual state is always current
  3. Errors are descriptive and actionable
  4. Debug information is always available

Implementation Path

1. Foundation (Based on proven code)

class OmniMCP:
    def __init__(self):
        self.visual_state = VisualState()
        self.ui_parser = UIParserProvider()
        self.keyboard = KeyboardController()
        self.mouse = MouseController()

    def update_visual_state(self):
        screenshot = take_screenshot()
        parser_result = self.ui_parser.parse_screenshot(screenshot)
        self.visual_state.update_from_parser(parser_result)

2. MCP Server First

  • Implement core MCP tools based on our working server.py
  • Each tool updates visual state before acting
  • All tools return structured responses
  • Debug screenshots for each action

3. Visual Analysis Pipeline

  1. Screenshot capture
  2. UI element parsing
  3. State management
  4. Claude vision integration for rich context

4. Action System

  1. Element targeting
  2. Coordinate handling
  3. Input simulation
  4. Action verification

5. Debug Infrastructure

  • Visual state snapshots
  • Action logging
  • Error context
  • Performance metrics

Key Implementation Details

MCP Server

  • Use FastMCP for protocol compatibility
  • Structured responses for all actions
  • Visual state always updated before actions
  • Rich error context in responses

Visual State

  • Keep normalized and absolute coordinates
  • Track element confidence scores
  • Maintain element relationships
  • Cache recent states for context

UI Parser Integration

  • Start with local parser
  • Remote parser as fallback
  • Smart deployment management
  • Connection recovery

Input Control

  • Use proven pynput implementation
  • Coordinate normalization
  • Action verification
  • Error recovery

Critical Considerations

  1. Error Handling

    • Clear error messages
    • Recovery strategies
    • Debug context
    • User feedback
  2. Performance

    • Minimize visual state updates
    • Cache when possible
    • Async where beneficial
    • Smart retries
  3. Reliability

    • Verify actions
    • Handle edge cases
    • Recover from failures
    • Maintain state consistency
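The "smart retries" point above can be sketched as a small exponential-backoff helper. The attempt count and delays are illustrative, and `sleep` is injectable so tests can avoid real waiting:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as e:  # in practice, catch narrower action errors
            last_error = e
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Wrapping flaky steps (parser calls, element lookups after a screen transition) this way keeps recovery logic out of the individual MCP tool handlers.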

===

OmniMCP Core Protocol

Core Concept

MCP for OmniMCP is fundamentally about enabling AI models to:

  1. Understand what's on screen through rich context
  2. Take actions using natural language descriptions

Essential Tools

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements

    Returns:
        ScreenState containing all visible UI elements with their properties
    """

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description

    Args:
        description: Natural language description of element (e.g. "the submit button")
    """

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description

    Args:
        description: Natural language description of element to click
    """

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard

    Args:
        text: Text to type
    """

@mcp.tool()
async def press_key(
    key: str,
    modifiers: Optional[List[str]] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers

    Args:
        key: Key to press (e.g. "enter", "tab")
        modifiers: Optional modifier keys (e.g. ["ctrl", "shift"])
    """

Key Design Points

  1. Simplicity

    • Two core endpoints: observe and act
    • Analysis as enhancement of observation
    • Clear, consistent response structure
  2. Stateful Context

    • Server maintains current visual state
    • Actions update state automatically
    • Historical context available when needed
  3. Natural Language Interface

    • Element targeting by description
    • Rich analysis of visual state
    • Error messages in natural language
  4. Verification

    • Actions confirm completion
    • Visual state updates verify changes
    • Clear error reporting

This represents the minimal, essential MCP interface needed for effective UI automation through visual understanding and action.

Prompt Templates

Use template utilities for clean, maintainable prompts:

from omnimcp.utils import create_prompt_template, render_prompt

# Create reusable template
analyze_template = create_prompt_template("""
    Analyze this UI element:
    {{ element.description }}
    
    Location: {{ element.bounds }}
    Type: {{ element.type }}
    
    Suggest interactions based on:
    {% for attr in element.attributes %}
    - {{ attr }}
    {% endfor %}
""")

# Render with data
prompt = analyze_template.render(
    element=ui_element
)

# Or one-step helper
prompt = render_prompt("""
    Quick analysis: {{ element.description }}
""", element=ui_element)

## Implementation Status

Note: The current implementation in `omnimcp.py` represents the API design based on MCP specifications but has not been tested with actual MCP server implementations yet. The types and tools are defined but require:

1. Integration testing with MCP SDK
2. Verification of tool definitions
3. Testing with Claude and other MCP clients
4. Implementation of actual tool logic

This design serves as a starting point for implementing a compliant MCP server for UI understanding.

## Testing Strategy

### Synthetic UI Testing

For testing visual understanding without relying on real UIs or displays, we'll use programmatically generated images:

```python
def generate_test_ui():
    """Generate synthetic UI image with known elements."""
    from PIL import Image, ImageDraw
    
    # Create blank canvas
    img = Image.new('RGB', (800, 600), color='white')
    draw = ImageDraw.Draw(img)
    
    # Draw UI elements with known positions
    elements = []
    
    # Button
    draw.rectangle([(100, 100), (200, 150)], fill='blue', outline='black')
    draw.text((110, 115), "Submit", fill="white")
    elements.append({
        "type": "button",
        "content": "Submit",
        "bounds": {"x": 100, "y": 100, "width": 100, "height": 50},
        "confidence": 1.0
    })
    
    # Text field
    draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
    draw.text((310, 115), "Username", fill="gray")
    elements.append({
        "type": "text_field",
        "content": "Username",
        "bounds": {"x": 300, "y": 100, "width": 200, "height": 50},
        "confidence": 1.0
    })
    
    return img, elements
```

### Action Verification Testing

For testing action verification, we'll generate before/after image pairs:

```python
def generate_action_test_pair(action_type="click"):
    """Generate before/after UI image pair for a specific action."""
    before_img, elements = generate_test_ui()
    after_img = before_img.copy()
    after_draw = ImageDraw.Draw(after_img)
    
    if action_type == "click":
        # Show button in pressed state
        after_draw.rectangle([(100, 100), (200, 150)], fill='darkblue', outline='black')
        after_draw.text((110, 115), "Submit", fill="white")
        # Add success message
        after_draw.text((100, 170), "Form submitted!", fill="green")
    
    elif action_type == "type":
        # Show text entered in field
        after_draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
        after_draw.text((310, 115), "testuser", fill="black")
    
    return before_img, after_img, elements
```

### Test Implementation

Testing Claude integration with synthetic images:

```python
from unittest.mock import patch

async def test_element_finding():
    """Test Claude's ability to find elements in synthetic UI."""
    # Generate test image with known elements
    test_img, elements = generate_test_ui()
    
    # Mock screenshot capture to return test image
    with patch('omnimcp.utils.take_screenshot', return_value=test_img):
        # Setup OmniMCP with mock parser that returns our elements
        # ... 
        
        # Test with various descriptions
        descriptions = [
            "submit button",
            "blue button",
            "the username field",
            "textbox in the middle",
        ]
        
        for desc in descriptions:
            # Call find_element with each description
            element = await mcp._visual_state.find_element(desc)
            # Verify the correct element was found
            # ...
```

This testing approach:

  • Works across all platforms
  • Runs in any environment (including CI)
  • Provides deterministic results
  • Doesn't require actual displays or UI
  • Allows testing a variety of scenarios

For real UI action testing, we'll start with manual verification while developing more sophisticated test environments.

Focus on implementing the core functionality first, then expand the testing framework.