This document describes how to implement OmniMCP, a system for UI automation through visual understanding and the Model Context Protocol (MCP).
The system consists of these essential components:
- VisualState - Current screen state
- MCP Server - Protocol implementation
- Input Control - UI actions
- UI Parser Integration - Visual analysis
```python
class VisualState:
    def __init__(self):
        self.elements = []
        self.timestamp = None
        self.screen_dimensions = None

    def update(self, screenshot):
        """Update visual state from screenshot.

        Critical function that maintains screen state.
        Must handle:
        - Screenshot capture
        - UI element parsing
        - State updates
        - Coordinate normalization
        """
```

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("omnimcp")

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text"""
```

```python
def find_element(description: str) -> Element:
    """Find UI element matching description.

    Critical for action reliability.
    Consider:
    - Text matching
    - Element type
    - Location/context
    - Confidence scores
    """
```
1. **Visual State Management**
   - Screenshot capture
   - UI parsing
   - State updates
   - Basic caching

2. **MCP Protocol**
   - Observe endpoint
   - Simple actions
   - Rich responses
   - Error handling

3. **Action System**
   - Element targeting
   - Input simulation
   - Action verification
   - Error recovery
**State management**
- Always update before actions
- Cache intelligently
- Track history when needed
- Clear invalidation

**Error handling**
- Rich error context
- Recovery strategies
- Debug information
- Verification

**Performance**
- Minimize updates
- Smart caching
- Async where beneficial
- Efficient targeting
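The caching and invalidation bullets above can be sketched as a small wrapper around the visual state. This is an illustrative sketch, not the actual implementation: `CachedVisualState` and `parse_fn` are assumed names standing in for screenshot capture plus parsing.

```python
import time

class CachedVisualState:
    """Sketch: cache the parsed screen state and only re-parse when the
    cache is older than a TTL or has been explicitly invalidated."""

    def __init__(self, ttl_seconds=1.0):
        self.elements = []
        self.timestamp = 0.0
        self.ttl = ttl_seconds

    def is_stale(self):
        # Cache is stale once the TTL has elapsed since the last parse.
        return (time.monotonic() - self.timestamp) > self.ttl

    def invalidate(self):
        # Force the next update() call to re-parse the screen.
        self.timestamp = 0.0

    def update(self, parse_fn):
        # Only re-parse when stale; parse_fn stands in for
        # screenshot capture + UI parser analysis.
        if self.is_stale():
            self.elements = parse_fn()
            self.timestamp = time.monotonic()
        return self.elements
```

Calling `invalidate()` right after an action implements the "always update before actions" rule without paying for redundant parses between actions.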
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UIElement:
    content: str
    type: str
    bounds: Bounds
    confidence: float

@dataclass
class ScreenState:
    elements: List[UIElement]
    dimensions: tuple[int, int]
    timestamp: float

@dataclass
class ActionResult:
    success: bool
    element: Optional[UIElement]
    error: Optional[str] = None
```

Current implementation:
```
./
├── omnimcp/                  # Main package directory
│   ├── omnimcp.py            # Core implementation with OmniMCP class and VisualState
│   ├── input.py              # Input controller for UI interactions
│   ├── types.py              # Type definitions (Bounds, UIElement, etc.)
│   ├── utils.py              # Utilities for screenshots, coordinates, etc.
│   ├── config.py             # Centralized configuration
│   └── omniparser/           # UI parsing functionality
│       ├── client.py         # Parser client and provider
│       └── server.py         # Parser deployment and management
├── tests/                    # Test directory
│   ├── test_synthetic_ui.py  # Synthetic UI generation for testing
│   └── test_omnimcp.py       # Core functionality tests
└── run_omnimcp.py            # Command-line entry point
```
Planned expansion:
```
./
├── utils.py            # Core utilities and input control
├── omniparser/         # UI parsing functionality
│   ├── client.py       # Parser client and provider
│   └── server.py       # Parser deployment and management
├── core/               # Future: Core state management
│   ├── visual_state.py
│   └── element.py
└── mcp/                # Future: MCP implementation
    └── server.py
```
OmniMCP uses uv for dependency management. When adding new dependencies, use:

```bash
uv add <package-name>        # Add a regular dependency
uv add --dev <package-name>  # Add a development dependency
uv pip install -e .          # Install all dependencies
```

This ensures dependencies are properly recorded in pyproject.toml.
OmniMCP now uses a centralized configuration system with:
- Settings loaded from environment variables and a `.env` file
- Default values for all settings
- Support for various configuration types:
  - Claude API settings
  - OmniParser connection settings
  - AWS deployment configuration
  - Debug and logging settings
To configure OmniMCP, create a .env file in the project root with your settings:
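The exact variable names are defined in `omnimcp/config.py`; the names below are illustrative assumptions, not the actual settings:

```ini
# Hypothetical example -- check omnimcp/config.py for the real variable names
ANTHROPIC_API_KEY=sk-...
OMNIPARSER_URL=http://localhost:8000
AWS_REGION=us-east-1
DEBUG=false
```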
- Visual state is always current
- Every action verifies completion
- Rich error context always available
- Debug information accessible
- `VisualState.update()`
- `MCPServer.observe()`
- `find_element()`
- `verify_action()`
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToolError:
    message: str
    visual_context: Optional[bytes]  # Screenshot
    attempted_action: str
    element_description: str
    recovery_suggestions: List[str]
```

- Unit tests for core logic
- Integration tests for flows
- Visual verification
- Performance benchmarks
OmniMCP includes tools for generating synthetic test UIs with:
- Predefined UI elements (buttons, text fields, checkboxes)
- Before/after image pairs for action verification
- Element visualization for debugging
This approach offers several advantages:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays
- Enables testing different scenarios
1. Setup Visual State

```python
visual_state = VisualState()
visual_state.update(take_screenshot())
```

2. Find Target Element

```python
element = visual_state.find_element_by_content("Submit")
if not element:
    raise MCPError("Element not found", context=visual_state.to_dict())
```

3. Take Action

```python
success = await input_controller.click(element.center)
if not success:
    raise MCPError("Click failed", context={"element": element})
```

4. Verify Result

```python
@dataclass
class ActionVerification:
    success: bool
    before_state: bytes  # Screenshot
    after_state: bytes
    changes_detected: List[BoundingBox]
    confidence: float

async def verify_tool_execution(
    action_result: ActionResult,
    verification: ActionVerification
) -> bool:
    """Verify tool executed successfully"""
```

- Focus on core functionality first
- Build incrementally
- Test thoroughly
- Keep it simple but robust
- Always verify actions
- Maintain current state
- Provide rich error context
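One simple way to detect the screen changes that feed `ActionVerification` is a pixel diff between before/after screenshots. This is a minimal sketch using Pillow, not the actual implementation; the helper names are assumptions.

```python
from PIL import Image, ImageChops

def diff_region(before: Image.Image, after: Image.Image):
    """Return the bounding box of pixels that changed between two
    screenshots, or None if the images are identical (sketch only)."""
    diff = ImageChops.difference(before.convert("RGB"), after.convert("RGB"))
    return diff.getbbox()  # (left, upper, right, lower) or None

def screen_changed(before: Image.Image, after: Image.Image) -> bool:
    # In this naive model, an action "did something" if any pixel changed.
    return diff_region(before, after) is not None
```

A real verifier would restrict the diff to the region near the targeted element and set a change threshold, since cursors, clocks, and animations also move pixels.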
This implementation guide focuses on the essential components needed for effective UI automation through visual understanding and action. Follow the implementation order strictly and ensure each component is solid before moving to the next.
Here's a high-level description of the ideal OmniMCP system:
OmniMCP is a Model Context Protocol (MCP) server that enables AI models (particularly Claude) to:
- Understand UI elements on screen through visual analysis
- Take actions through mouse and keyboard control
- Get rich visual context about UI elements using Claude's vision capabilities
```python
class MCPServer:
    """Core MCP server implementing the Model Context Protocol.

    Primary interface for AI models to interact with the UI.
    """

    async def get_screen_state() -> Dict:
        """Get current screen state with UI elements."""

    async def analyze_ui(query: str, max_elements: int = 5) -> Dict:
        """Analyze UI elements matching a natural language query."""

    async def click_element(descriptor: str) -> Dict:
        """Click UI element by description."""

    async def type_text(text: str) -> Dict:
        """Type text using keyboard."""

    async def press_key(key: str) -> Dict:
        """Press a keyboard key."""
```

```python
class VisualState:
    """Represents current screen state with UI elements."""

    def update_from_parser(self, parser_result: Dict):
        """Update state from UI parser results."""

    def find_element_by_content(self, content: str) -> Optional[Element]:
        """Find UI element by content."""

    def to_mcp_description(self) -> Dict:
        """Convert state to MCP format."""
```

```python
class OmniParserClient:
    """Client for interacting with the OmniParser API."""

    def parse_image(self, image: Image.Image) -> Dict[str, Any]:
        """Parse an image using the OmniParser service."""

    def check_server_available(self) -> bool:
        """Check if the OmniParser server is available."""

class OmniParserProvider:
    """Provider for OmniParser services with deployment capabilities."""

    def deploy(self) -> bool:
        """Deploy OmniParser if not already running."""

    def is_available(self) -> bool:
        """Check if parser is available."""
```

```python
class InputController:
    """Handles mouse and keyboard input."""

    def click(self, x: float, y: float):
        """Click at coordinates."""

    def type_text(self, text: str):
        """Type text."""

    def press_key(self, key: str):
        """Press keyboard key."""
```

```python
class ClaudeVision:
    """Handles visual analysis using Claude."""

    async def describe_elements(
        elements: List[Element],
        context: Optional[Image] = None
    ) -> List[str]:
        """Get detailed descriptions of UI elements."""

    async def analyze_visual_query(
        query: str,
        screenshot: Image,
        elements: List[Element]
    ) -> Dict:
        """Answer questions about UI using Claude's vision."""
```

```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""
    state = await visual_state.capture()
    return state

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description"""
    state = await get_screen_state()
    return semantic_element_search(state.elements, description)

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""
    element = await find_element(description)
    if not element:
        return ClickResult(success=False, error="Element not found")
    return await perform_click(element)

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard"""
    try:
        await keyboard.type_text(text)
        return TypeResult(success=True, text_entered=text)
    except Exception as e:
        return TypeResult(success=False, error=str(e))

@mcp.tool()
async def press_key(
    key: str,
    modifiers: List[str] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers"""
```
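`semantic_element_search` is referenced above but never defined in this guide. As one possible sketch (the scoring scheme and weights are assumptions, not the actual algorithm), it could rank elements by word overlap with the description, weighted by parser confidence:

```python
def semantic_element_search(elements, description):
    """Sketch: rank parsed elements by word overlap with the natural
    language description, weighted by confidence; return the best match."""
    desc_words = set(description.lower().split())

    def score(element):
        content_words = set(element["content"].lower().split())
        # Fraction of description words found in the element's text...
        overlap = len(desc_words & content_words) / max(len(desc_words), 1)
        # ...weighted by how confident the parser was about this element.
        return overlap * element.get("confidence", 1.0)

    best = max(elements, key=score, default=None)
    return best if best is not None and score(best) > 0 else None
```

A production matcher would also use element type and location (per the `find_element` docstring's "Consider" list), or delegate the ranking to Claude.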
1. **Smart UI Analysis**
   - Visual element detection
   - Natural language queries
   - Rich context through Claude vision
   - Element relationships and hierarchy

2. **Robust Actions**
   - Smart element targeting
   - Coordinate normalization
   - Input verification
   - Action confirmation

3. **Development Support**
   - Debug visualizations
   - Action logging
   - Error diagnostics
   - Performance metrics

4. **Deployment Options**
   - Local parser
   - Remote parser service
   - Auto-deployment
   - Service management
- MCP server is the primary interface
- Visual state is always current
- Errors are descriptive and actionable
- Debug information is always available
```python
class OmniMCP:
    def __init__(self):
        self.visual_state = VisualState()
        self.ui_parser = UIParserProvider()
        self.keyboard = KeyboardController()
        self.mouse = MouseController()

    def update_visual_state(self):
        screenshot = take_screenshot()
        parser_result = self.ui_parser.parse_screenshot(screenshot)
        self.visual_state.update_from_parser(parser_result)
```

- Implement core MCP tools based on our working server.py
- Each tool updates visual state before acting
- All tools return structured responses
- Debug screenshots for each action
- Screenshot capture
- UI element parsing
- State management
- Claude vision integration for rich context
- Element targeting
- Coordinate handling
- Input simulation
- Action verification
- Visual state snapshots
- Action logging
- Error context
- Performance metrics
- Use FastMCP for protocol compatibility
- Structured responses for all actions
- Visual state always updated before actions
- Rich error context in responses
- Keep normalized and absolute coordinates
- Track element confidence scores
- Maintain element relationships
- Cache recent states for context
- Start with local parser
- Remote parser as fallback
- Smart deployment management
- Connection recovery
- Use proven pynput implementation
- Coordinate normalization
- Action verification
- Error recovery
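The coordinate bullets can be illustrated with a short sketch (helper names are assumptions): normalized coordinates in [0, 1] make element positions resolution-independent, while the input controller ultimately needs absolute pixels.

```python
def to_absolute(norm_x: float, norm_y: float, width: int, height: int):
    """Convert normalized [0, 1] coordinates to absolute pixel coordinates."""
    return round(norm_x * width), round(norm_y * height)

def to_normalized(abs_x: int, abs_y: int, width: int, height: int):
    """Convert absolute pixel coordinates to normalized [0, 1] coordinates."""
    return abs_x / width, abs_y / height
```

Keeping both forms (as the bullet suggests) lets the parser work in normalized space while clicks are issued in pixels for the current screen dimensions.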
1. **Error Handling**
   - Clear error messages
   - Recovery strategies
   - Debug context
   - User feedback

2. **Performance**
   - Minimize visual state updates
   - Cache when possible
   - Async where beneficial
   - Smart retries

3. **Reliability**
   - Verify actions
   - Handle edge cases
   - Recover from failures
   - Maintain state consistency
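The "smart retries" and "recover from failures" points can be sketched as a retry helper with exponential backoff. This is an illustrative sketch, not the actual implementation; names and defaults are assumptions.

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.5):
    """Sketch: retry a flaky operation with exponential backoff,
    re-raising the last exception if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Back off exponentially: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
```

For UI actions, the retried `operation` would typically re-capture the visual state before each attempt so stale element coordinates are not reused.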
MCP for OmniMCP is fundamentally about enabling AI models to:
- Understand what's on screen through rich context
- Take actions using natural language descriptions
```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements

    Returns:
        ScreenState containing all visible UI elements with their properties
    """

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description

    Args:
        description: Natural language description of element (e.g. "the submit button")
    """

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description

    Args:
        description: Natural language description of element to click
    """

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard

    Args:
        text: Text to type
    """

@mcp.tool()
async def press_key(
    key: str,
    modifiers: List[str] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers

    Args:
        key: Key to press (e.g. "enter", "tab")
        modifiers: Optional modifier keys (e.g. ["ctrl", "shift"])
    """
```
1. **Simplicity**
   - Two core endpoints: observe and act
   - Analysis as enhancement of observation
   - Clear, consistent response structure

2. **Stateful Context**
   - Server maintains current visual state
   - Actions update state automatically
   - Historical context available when needed

3. **Natural Language Interface**
   - Element targeting by description
   - Rich analysis of visual state
   - Error messages in natural language

4. **Verification**
   - Actions confirm completion
   - Visual state updates verify changes
   - Clear error reporting
This represents the minimal, essential MCP interface needed for effective UI automation through visual understanding and action.
Use template utilities for clean, maintainable prompts:
```python
from omnimcp.utils import create_prompt_template, render_prompt

# Create reusable template
analyze_template = create_prompt_template("""
Analyze this UI element:
{{ element.description }}
Location: {{ element.bounds }}
Type: {{ element.type }}

Suggest interactions based on:
{% for attr in element.attributes %}
- {{ attr }}
{% endfor %}
""")

# Render with data
prompt = analyze_template.render(element=ui_element)

# Or one-step helper
prompt = render_prompt("""
Quick analysis: {{ element.description }}
""", element=ui_element)
```
## Implementation Status
Note: The current implementation in `omnimcp.py` represents the API design based on MCP specifications but has not been tested with actual MCP server implementations yet. The types and tools are defined but require:
1. Integration testing with MCP SDK
2. Verification of tool definitions
3. Testing with Claude and other MCP clients
4. Implementation of actual tool logic
This design serves as a starting point for implementing a compliant MCP server for UI understanding.
## Testing Strategy
### Synthetic UI Testing
For testing visual understanding without relying on real UIs or displays, we'll use programmatically generated images:
```python
def generate_test_ui():
    """Generate synthetic UI image with known elements."""
    from PIL import Image, ImageDraw

    # Create blank canvas
    img = Image.new('RGB', (800, 600), color='white')
    draw = ImageDraw.Draw(img)

    # Draw UI elements with known positions
    elements = []

    # Button
    draw.rectangle([(100, 100), (200, 150)], fill='blue', outline='black')
    draw.text((110, 115), "Submit", fill="white")
    elements.append({
        "type": "button",
        "content": "Submit",
        "bounds": {"x": 100, "y": 100, "width": 100, "height": 50},
        "confidence": 1.0
    })

    # Text field
    draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
    draw.text((310, 115), "Username", fill="gray")
    elements.append({
        "type": "text_field",
        "content": "Username",
        "bounds": {"x": 300, "y": 100, "width": 200, "height": 50},
        "confidence": 1.0
    })

    return img, elements
```

For testing action verification, we'll generate before/after image pairs:
```python
def generate_action_test_pair(action_type="click"):
    """Generate before/after UI image pair for a specific action."""
    from PIL import ImageDraw

    before_img, elements = generate_test_ui()
    after_img = before_img.copy()
    after_draw = ImageDraw.Draw(after_img)

    if action_type == "click":
        # Show button in pressed state
        after_draw.rectangle([(100, 100), (200, 150)], fill='darkblue', outline='black')
        after_draw.text((110, 115), "Submit", fill="white")
        # Add success message
        after_draw.text((100, 170), "Form submitted!", fill="green")
    elif action_type == "type":
        # Show text entered in field
        after_draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
        after_draw.text((310, 115), "testuser", fill="black")

    return before_img, after_img, elements
```

Testing Claude integration with synthetic images:
```python
from unittest.mock import patch

async def test_element_finding():
    """Test Claude's ability to find elements in synthetic UI."""
    # Generate test image with known elements
    test_img, elements = generate_test_ui()

    # Mock screenshot capture to return test image
    with patch('omnimcp.utils.take_screenshot', return_value=test_img):
        # Setup OmniMCP with mock parser that returns our elements
        # ...

        # Test with various descriptions
        descriptions = [
            "submit button",
            "blue button",
            "the username field",
            "textbox in the middle",
        ]

        for desc in descriptions:
            # Call find_element with each description
            element = await mcp._visual_state.find_element(desc)
            # Verify the correct element was found
            # ...
```

This testing approach:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays or UI
- Allows testing a variety of scenarios
For real UI action testing, we'll start with manual verification while developing more sophisticated test environments.
Focus on implementing the core functionality first, then expand the testing framework.