This document describes how to implement OmniMCP, a system for UI automation through visual understanding and the Model Context Protocol (MCP).
The system consists of these essential components:
- VisualState - Current screen state
- MCP Server - Protocol implementation
- Input Control - UI actions
- UI Parser Integration - Visual analysis
```python
class VisualState:
    def __init__(self):
        self.elements = []
        self.timestamp = None
        self.screen_dimensions = None

    def update(self, screenshot):
        """Update visual state from screenshot.

        Critical function that maintains screen state.
        Must handle:
        - Screenshot capture
        - UI element parsing
        - State updates
        - Coordinate normalization
        """
```

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("omnimcp")

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text"""
```

```python
def find_element(description: str) -> Element:
    """Find UI element matching description.

    Critical for action reliability.
    Consider:
    - Text matching
    - Element type
    - Location/context
    - Confidence scores
    """
```
1. **Visual State Management**
   - Screenshot capture
   - UI parsing
   - State updates
   - Basic caching

2. **MCP Protocol**
   - Observe endpoint
   - Simple actions
   - Rich responses
   - Error handling

3. **Action System**
   - Element targeting
   - Input simulation
   - Action verification
   - Error recovery
**State management**
- Always update before actions
- Cache intelligently
- Track history when needed
- Clear invalidation

**Error handling**
- Rich error context
- Recovery strategies
- Debug information
- Verification

**Performance**
- Minimize updates
- Smart caching
- Async where beneficial
- Efficient targeting
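The caching and invalidation bullets above can be sketched as a small wrapper around the visual state. This is an illustrative sketch, not the actual implementation: `CachedVisualState` and `parse_fn` are assumed names standing in for screenshot capture plus parsing.

```python
import time

class CachedVisualState:
    """Sketch: cache the parsed screen state and only re-parse when the
    cache is older than a TTL or has been explicitly invalidated."""

    def __init__(self, ttl_seconds=1.0):
        self.elements = []
        self.timestamp = 0.0
        self.ttl = ttl_seconds

    def is_stale(self):
        # Cache is stale once the TTL has elapsed since the last parse.
        return (time.monotonic() - self.timestamp) > self.ttl

    def invalidate(self):
        # Force the next update() call to re-parse the screen.
        self.timestamp = 0.0

    def update(self, parse_fn):
        # Only re-parse when stale; parse_fn stands in for
        # screenshot capture + UI parser analysis.
        if self.is_stale():
            self.elements = parse_fn()
            self.timestamp = time.monotonic()
        return self.elements
```

Calling `invalidate()` right after an action implements the "always update before actions" rule without paying for redundant parses between actions.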
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UIElement:
    content: str
    type: str
    bounds: Bounds
    confidence: float

@dataclass
class ScreenState:
    elements: List[UIElement]
    dimensions: tuple[int, int]
    timestamp: float

@dataclass
class ActionResult:
    success: bool
    element: Optional[UIElement]
    error: Optional[str] = None
```

Current implementation:
```
./
├── omnimcp/                  # Main package directory
│   ├── omnimcp.py            # Core implementation with OmniMCP class and VisualState
│   ├── input.py              # Input controller for UI interactions
│   ├── types.py              # Type definitions (Bounds, UIElement, etc.)
│   ├── utils.py              # Utilities for screenshots, coordinates, etc.
│   ├── config.py             # Centralized configuration
│   └── omniparser/           # UI parsing functionality
│       ├── client.py         # Parser client and provider
│       └── server.py         # Parser deployment and management
├── tests/                    # Test directory
│   ├── test_synthetic_ui.py  # Synthetic UI generation for testing
│   └── test_omnimcp.py       # Core functionality tests
└── run_omnimcp.py            # Command-line entry point
```
Planned expansion:
```
./
├── utils.py            # Core utilities and input control
├── omniparser/         # UI parsing functionality
│   ├── client.py       # Parser client and provider
│   └── server.py       # Parser deployment and management
├── core/               # Future: Core state management
│   ├── visual_state.py
│   └── element.py
└── mcp/                # Future: MCP implementation
    └── server.py
```
OmniMCP uses uv for dependency management. When adding new dependencies, use:

```bash
uv add <package-name>        # Add a regular dependency
uv add --dev <package-name>  # Add a development dependency
uv pip install -e .          # Install all dependencies
```

This ensures dependencies are properly recorded in pyproject.toml.
OmniMCP now uses a centralized configuration system with:
- Settings loaded from environment variables and a `.env` file
- Default values for all settings
- Support for various configuration types:
  - Claude API settings
  - OmniParser connection settings
  - AWS deployment configuration
  - Debug and logging settings
To configure OmniMCP, create a .env file in the project root with your settings:
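The exact variable names are defined in `omnimcp/config.py`; the names below are illustrative assumptions, not the actual settings:

```ini
# Hypothetical example -- check omnimcp/config.py for the real variable names
ANTHROPIC_API_KEY=sk-...
OMNIPARSER_URL=http://localhost:8000
AWS_REGION=us-east-1
DEBUG=false
```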
- Visual state is always current
- Every action verifies completion
- Rich error context always available
- Debug information accessible
- `VisualState.update()`
- `MCPServer.observe()`
- `find_element()`
- `verify_action()`
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToolError:
    message: str
    visual_context: Optional[bytes]  # Screenshot
    attempted_action: str
    element_description: str
    recovery_suggestions: List[str]
```

- Unit tests for core logic
- Integration tests for flows
- Visual verification
- Performance benchmarks
OmniMCP includes tools for generating synthetic test UIs with:
- Predefined UI elements (buttons, text fields, checkboxes)
- Before/after image pairs for action verification
- Element visualization for debugging
This approach offers several advantages:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays
- Enables testing different scenarios
1. Setup Visual State

```python
visual_state = VisualState()
visual_state.update(take_screenshot())
```

2. Find Target Element

```python
element = visual_state.find_element_by_content("Submit")
if not element:
    raise MCPError("Element not found", context=visual_state.to_dict())
```

3. Take Action

```python
success = await input_controller.click(element.center)
if not success:
    raise MCPError("Click failed", context={"element": element})
```

4. Verify Result

```python
@dataclass
class ActionVerification:
    success: bool
    before_state: bytes  # Screenshot
    after_state: bytes
    changes_detected: List[BoundingBox]
    confidence: float

async def verify_tool_execution(
    action_result: ActionResult,
    verification: ActionVerification
) -> bool:
    """Verify tool executed successfully"""
```

- Focus on core functionality first
- Build incrementally
- Test thoroughly
- Keep it simple but robust
- Always verify actions
- Maintain current state
- Provide rich error context
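One simple way to detect the screen changes that feed `ActionVerification` is a pixel diff between before/after screenshots. This is a minimal sketch using Pillow, not the actual implementation; the helper names are assumptions.

```python
from PIL import Image, ImageChops

def diff_region(before: Image.Image, after: Image.Image):
    """Return the bounding box of pixels that changed between two
    screenshots, or None if the images are identical (sketch only)."""
    diff = ImageChops.difference(before.convert("RGB"), after.convert("RGB"))
    return diff.getbbox()  # (left, upper, right, lower) or None

def screen_changed(before: Image.Image, after: Image.Image) -> bool:
    # In this naive model, an action "did something" if any pixel changed.
    return diff_region(before, after) is not None
```

A real verifier would restrict the diff to the region near the targeted element and set a change threshold, since cursors, clocks, and animations also move pixels.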
This implementation guide focuses on the essential components needed for effective UI automation through visual understanding and action. Follow the implementation order strictly and ensure each component is solid before moving to the next.
Here's a high-level description of the ideal OmniMCP system:
OmniMCP is a Model Context Protocol (MCP) server that enables AI models (particularly Claude) to:
- Understand UI elements on screen through visual analysis
- Take actions through mouse and keyboard control
- Get rich visual context about UI elements using Claude's vision capabilities
```python
class MCPServer:
    """Core MCP server implementing the Model Context Protocol.

    Primary interface for AI models to interact with the UI.
    """

    async def get_screen_state() -> Dict:
        """Get current screen state with UI elements."""

    async def analyze_ui(query: str, max_elements: int = 5) -> Dict:
        """Analyze UI elements matching a natural language query."""

    async def click_element(descriptor: str) -> Dict:
        """Click UI element by description."""

    async def type_text(text: str) -> Dict:
        """Type text using keyboard."""

    async def press_key(key: str) -> Dict:
        """Press a keyboard key."""
```

```python
class VisualState:
    """Represents current screen state with UI elements."""

    def update_from_parser(self, parser_result: Dict):
        """Update state from UI parser results."""

    def find_element_by_content(self, content: str) -> Optional[Element]:
        """Find UI element by content."""

    def to_mcp_description(self) -> Dict:
        """Convert state to MCP format."""
```

```python
class OmniParserClient:
    """Client for interacting with the OmniParser API."""

    def parse_image(self, image: Image.Image) -> Dict[str, Any]:
        """Parse an image using the OmniParser service."""

    def check_server_available(self) -> bool:
        """Check if the OmniParser server is available."""

class OmniParserProvider:
    """Provider for OmniParser services with deployment capabilities."""

    def deploy(self) -> bool:
        """Deploy OmniParser if not already running."""

    def is_available(self) -> bool:
        """Check if parser is available."""
```

```python
class InputController:
    """Handles mouse and keyboard input."""

    def click(self, x: float, y: float):
        """Click at coordinates."""

    def type_text(self, text: str):
        """Type text."""

    def press_key(self, key: str):
        """Press keyboard key."""
```

```python
class ClaudeVision:
    """Handles visual analysis using Claude."""

    async def describe_elements(
        elements: List[Element],
        context: Optional[Image] = None
    ) -> List[str]:
        """Get detailed descriptions of UI elements."""

    async def analyze_visual_query(
        query: str,
        screenshot: Image,
        elements: List[Element]
    ) -> Dict:
        """Answer questions about UI using Claude's vision."""
```

```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""
    state = await visual_state.capture()
    return state

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description"""
    state = await get_screen_state()
    return semantic_element_search(state.elements, description)

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""
    element = await find_element(description)
    if not element:
        return ClickResult(success=False, error="Element not found")
    return await perform_click(element)

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard"""
    try:
        await keyboard.type_text(text)
        return TypeResult(success=True, text_entered=text)
    except Exception as e:
        return TypeResult(success=False, error=str(e))

@mcp.tool()
async def press_key(
    key: str,
    modifiers: List[str] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers"""
```
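`semantic_element_search` is referenced above but never defined in this guide. As one possible sketch (the scoring scheme and weights are assumptions, not the actual algorithm), it could rank elements by word overlap with the description, weighted by parser confidence:

```python
def semantic_element_search(elements, description):
    """Sketch: rank parsed elements by word overlap with the natural
    language description, weighted by confidence; return the best match."""
    desc_words = set(description.lower().split())

    def score(element):
        content_words = set(element["content"].lower().split())
        # Fraction of description words found in the element's text...
        overlap = len(desc_words & content_words) / max(len(desc_words), 1)
        # ...weighted by how confident the parser was about this element.
        return overlap * element.get("confidence", 1.0)

    best = max(elements, key=score, default=None)
    return best if best is not None and score(best) > 0 else None
```

A production matcher would also use element type and location (per the `find_element` docstring's "Consider" list), or delegate the ranking to Claude.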
1. **Smart UI Analysis**
   - Visual element detection
   - Natural language queries
   - Rich context through Claude vision
   - Element relationships and hierarchy

2. **Robust Actions**
   - Smart element targeting
   - Coordinate normalization
   - Input verification
   - Action confirmation

3. **Development Support**
   - Debug visualizations
   - Action logging
   - Error diagnostics
   - Performance metrics

4. **Deployment Options**
   - Local parser
   - Remote parser service
   - Auto-deployment
   - Service management
- MCP server is the primary interface
- Visual state is always current
- Errors are descriptive and actionable
- Debug information is always available
```python
class OmniMCP:
    def __init__(self):
        self.visual_state = VisualState()
        self.ui_parser = UIParserProvider()
        self.keyboard = KeyboardController()
        self.mouse = MouseController()

    def update_visual_state(self):
        screenshot = take_screenshot()
        parser_result = self.ui_parser.parse_screenshot(screenshot)
        self.visual_state.update_from_parser(parser_result)
```

- Implement core MCP tools based on our working server.py
- Each tool updates visual state before acting
- All tools return structured responses
- Debug screenshots for each action
- Screenshot capture
- UI element parsing
- State management
- Claude vision integration for rich context
- Element targeting
- Coordinate handling
- Input simulation
- Action verification
- Visual state snapshots
- Action logging
- Error context
- Performance metrics
- Use FastMCP for protocol compatibility
- Structured responses for all actions
- Visual state always updated before actions
- Rich error context in responses
- Keep normalized and absolute coordinates
- Track element confidence scores
- Maintain element relationships
- Cache recent states for context
- Start with local parser
- Remote parser as fallback
- Smart deployment management
- Connection recovery
- Use proven pynput implementation
- Coordinate normalization
- Action verification
- Error recovery
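The coordinate bullets can be illustrated with a short sketch (helper names are assumptions): normalized coordinates in [0, 1] make element positions resolution-independent, while the input controller ultimately needs absolute pixels.

```python
def to_absolute(norm_x: float, norm_y: float, width: int, height: int):
    """Convert normalized [0, 1] coordinates to absolute pixel coordinates."""
    return round(norm_x * width), round(norm_y * height)

def to_normalized(abs_x: int, abs_y: int, width: int, height: int):
    """Convert absolute pixel coordinates to normalized [0, 1] coordinates."""
    return abs_x / width, abs_y / height
```

Keeping both forms (as the bullet suggests) lets the parser work in normalized space while clicks are issued in pixels for the current screen dimensions.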
1. **Error Handling**
   - Clear error messages
   - Recovery strategies
   - Debug context
   - User feedback

2. **Performance**
   - Minimize visual state updates
   - Cache when possible
   - Async where beneficial
   - Smart retries

3. **Reliability**
   - Verify actions
   - Handle edge cases
   - Recover from failures
   - Maintain state consistency
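The "smart retries" and "recover from failures" points can be sketched as a retry helper with exponential backoff. This is an illustrative sketch, not the actual implementation; names and defaults are assumptions.

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.5):
    """Sketch: retry a flaky operation with exponential backoff,
    re-raising the last exception if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Back off exponentially: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
```

For UI actions, the retried `operation` would typically re-capture the visual state before each attempt so stale element coordinates are not reused.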
MCP for OmniMCP is fundamentally about enabling AI models to:
- Understand what's on screen through rich context
- Take actions using natural language descriptions
```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements

    Returns:
        ScreenState containing all visible UI elements with their properties
    """

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description

    Args:
        description: Natural language description of element (e.g. "the submit button")
    """

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description

    Args:
        description: Natural language description of element to click
    """

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard

    Args:
        text: Text to type
    """

@mcp.tool()
async def press_key(
    key: str,
    modifiers: List[str] = None
) -> ActionResult:
    """Press keyboard key with optional modifiers

    Args:
        key: Key to press (e.g. "enter", "tab")
        modifiers: Optional modifier keys (e.g. ["ctrl", "shift"])
    """
```
1. **Simplicity**
   - Two core endpoints: observe and act
   - Analysis as enhancement of observation
   - Clear, consistent response structure

2. **Stateful Context**
   - Server maintains current visual state
   - Actions update state automatically
   - Historical context available when needed

3. **Natural Language Interface**
   - Element targeting by description
   - Rich analysis of visual state
   - Error messages in natural language

4. **Verification**
   - Actions confirm completion
   - Visual state updates verify changes
   - Clear error reporting
This represents the minimal, essential MCP interface needed for effective UI automation through visual understanding and action.
Use template utilities for clean, maintainable prompts:
```python
from omnimcp.utils import create_prompt_template, render_prompt

# Create reusable template
analyze_template = create_prompt_template("""
Analyze this UI element:
{{ element.description }}
Location: {{ element.bounds }}
Type: {{ element.type }}

Suggest interactions based on:
{% for attr in element.attributes %}
- {{ attr }}
{% endfor %}
""")

# Render with data
prompt = analyze_template.render(element=ui_element)

# Or one-step helper
prompt = render_prompt("""
Quick analysis: {{ element.description }}
""", element=ui_element)
```
## Implementation Status
Note: The current implementation in `omnimcp.py` represents the API design based on MCP specifications but has not been tested with actual MCP server implementations yet. The types and tools are defined but require:
1. Integration testing with MCP SDK
2. Verification of tool definitions
3. Testing with Claude and other MCP clients
4. Implementation of actual tool logic
This design serves as a starting point for implementing a compliant MCP server for UI understanding.
## Testing Strategy
### Synthetic UI Testing
For testing visual understanding without relying on real UIs or displays, we'll use programmatically generated images:
```python
def generate_test_ui():
    """Generate synthetic UI image with known elements."""
    from PIL import Image, ImageDraw

    # Create blank canvas
    img = Image.new('RGB', (800, 600), color='white')
    draw = ImageDraw.Draw(img)

    # Draw UI elements with known positions
    elements = []

    # Button
    draw.rectangle([(100, 100), (200, 150)], fill='blue', outline='black')
    draw.text((110, 115), "Submit", fill="white")
    elements.append({
        "type": "button",
        "content": "Submit",
        "bounds": {"x": 100, "y": 100, "width": 100, "height": 50},
        "confidence": 1.0
    })

    # Text field
    draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
    draw.text((310, 115), "Username", fill="gray")
    elements.append({
        "type": "text_field",
        "content": "Username",
        "bounds": {"x": 300, "y": 100, "width": 200, "height": 50},
        "confidence": 1.0
    })

    return img, elements
```

For testing action verification, we'll generate before/after image pairs:
```python
def generate_action_test_pair(action_type="click"):
    """Generate before/after UI image pair for a specific action."""
    from PIL import ImageDraw

    before_img, elements = generate_test_ui()
    after_img = before_img.copy()
    after_draw = ImageDraw.Draw(after_img)

    if action_type == "click":
        # Show button in pressed state
        after_draw.rectangle([(100, 100), (200, 150)], fill='darkblue', outline='black')
        after_draw.text((110, 115), "Submit", fill="white")
        # Add success message
        after_draw.text((100, 170), "Form submitted!", fill="green")
    elif action_type == "type":
        # Show text entered in field
        after_draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
        after_draw.text((310, 115), "testuser", fill="black")

    return before_img, after_img, elements
```

Testing Claude integration with synthetic images:
```python
from unittest.mock import patch

async def test_element_finding():
    """Test Claude's ability to find elements in synthetic UI."""
    # Generate test image with known elements
    test_img, elements = generate_test_ui()

    # Mock screenshot capture to return test image
    with patch('omnimcp.utils.take_screenshot', return_value=test_img):
        # Setup OmniMCP with mock parser that returns our elements
        # ...

        # Test with various descriptions
        descriptions = [
            "submit button",
            "blue button",
            "the username field",
            "textbox in the middle",
        ]

        for desc in descriptions:
            # Call find_element with each description
            element = await mcp._visual_state.find_element(desc)
            # Verify the correct element was found
            # ...
```

This testing approach:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays or UI
- Allows testing a variety of scenarios
For real UI action testing, we'll start with manual verification while developing more sophisticated test environments.
Focus on implementing the core functionality first, then expand the testing framework.