Vision-based desktop control for AI agents – interact with any application through visual understanding, not DOM inspection.
Unlike browser automation tools (Playwright, Puppeteer, Chrome MCP), Vision Control:
- Works with ANY application – desktop apps, games, legacy software, not just web browsers
- No DOM/API required – controls apps that lack programmatic interfaces
- Visual reasoning – AI sees what users see through grid-annotated screenshots
- True desktop automation – native OS-level mouse/keyboard control across all windows
Use Playwright/Chrome MCP when: You need to automate web browsers with DOM access, network interception, or headless testing.
Use Vision Control when: You need to automate desktop applications, legacy software, or any GUI that lacks APIs.
- Capture – Screenshot with a 26×15 grid overlay (A–Z columns, 1–15 rows)
- Locate – AI identifies targets by grid coordinates (e.g., S3, M8/2)
- Act – Execute mouse clicks, drags, scrolls, and keyboard input at precise locations
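As a concrete illustration of the Capture → Locate → Act flow, the sketch below maps a grid coordinate such as S3 to a pixel position. The function name `grid_to_pixel` and the assumption of a uniform full-screen grid are hypothetical; this is not the project's actual implementation.

```python
# Hypothetical sketch: convert a grid coordinate like "S3" to the pixel
# center of that cell, assuming a 26x15 grid spanning the full screen.
import string

GRID_COLS, GRID_ROWS = 26, 15  # A-Z columns, 1-15 rows (per the grid system above)

def grid_to_pixel(coord: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Return the pixel center of a grid cell such as 'S3'."""
    col = string.ascii_uppercase.index(coord[0].upper())  # 'A' -> 0 ... 'Z' -> 25
    row = int(coord[1:]) - 1                              # '1' -> 0 ... '15' -> 14
    cell_w = screen_w / GRID_COLS
    cell_h = screen_h / GRID_ROWS
    return (int(col * cell_w + cell_w / 2), int(row * cell_h + cell_h / 2))
```

An action script could then feed the resulting point to whatever native input backend it uses.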
```bash
# Capture screen with grid
python scripts/capture_grid.py

# Click at grid coordinate
python scripts/mouse_action.py click S3

# Type text
python scripts/keyboard_input.py type "Hello World"
```

The screen is divided into a 26×15 grid (A–Z columns × 1–15 rows). Coordinates like S3 target cell centers; S3/2 targets quadrant 2 (top-right).
Quadrants:

```
┌───┬───┐
│ 1 │ 2 │
├───┼───┤
│ 3 │ 4 │
└───┴───┘
```
Mouse: click, right, double, move, drag, scroll
Keyboard: type, press, hotkey, click-type
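A minimal sketch of how the command vocabulary above might be validated before dispatching to a native input backend; `plan_action` and the record format are assumptions, since the scripts' internals aren't shown here.

```python
# Hypothetical sketch: validate an action command against the vocabulary
# listed above before handing it to an input backend (not shown).

MOUSE_ACTIONS = {"click", "right", "double", "move", "drag", "scroll"}
KEYBOARD_ACTIONS = {"type", "press", "hotkey", "click-type"}

def plan_action(device: str, action: str, *args: str) -> dict:
    """Return a normalized action record, or raise on an unknown command."""
    vocab = {"mouse": MOUSE_ACTIONS, "keyboard": KEYBOARD_ACTIONS}[device]
    if action not in vocab:
        raise ValueError(f"unknown {device} action: {action!r}")
    return {"device": device, "action": action, "args": list(args)}
```

For example, `plan_action("mouse", "click", "S3")` mirrors the CLI call `python scripts/mouse_action.py click S3` shown earlier.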
See references/ for detailed command syntax.
MIT License – see LICENSE file.