Skip to content

agent skill that let AI see and control your screen - Grid-based visual automation for LLM agents

License

Notifications You must be signed in to change notification settings

askiichan/vision-control-skill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Vision Control Skill

Python 3.8+ License: MIT

Vision-based desktop control for AI agents β€” interact with any application through visual understanding, not DOM inspection.

Why Vision Control?

Unlike browser automation tools (Playwright, Puppeteer, Chrome MCP), Vision Control:

  • Works with ANY application β€” desktop apps, games, legacy software, not just web browsers
  • No DOM/API required β€” controls apps that lack programmatic interfaces
  • Visual reasoning β€” AI sees what users see through grid-annotated screenshots
  • True desktop automation β€” native OS-level mouse/keyboard control across all windows

Use Playwright/Chrome MCP when: You need to automate web browsers with DOM access, network interception, or headless testing.

Use Vision Control when: You need to automate desktop applications, legacy software, or any GUI that lacks APIs.

How It Works

  1. Capture β€” Screenshot with 26Γ—15 grid overlay (A-Z columns, 1-15 rows)
  2. Locate β€” AI identifies targets by grid coordinates (S3, M8/2)
  3. Act β€” Execute mouse clicks, drags, scrolls, keyboard input at precise locations
# Capture screen with grid
python scripts/capture_grid.py

# Click at grid coordinate
python scripts/mouse_action.py click S3

# Type text
python scripts/keyboard_input.py type "Hello World"

Grid Coordinates

Screen divided into 26Γ—15 grid (A-Z columns Γ— 1-15 rows). Coordinates like S3 target cell centers, S3/2 targets quadrant 2 (top-right).

Quadrants:  β”Œβ”€β”€β”€β”¬β”€β”€β”€β”
            β”‚ 1 β”‚ 2 β”‚
            β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€
            β”‚ 3 β”‚ 4 β”‚
            β””β”€β”€β”€β”΄β”€β”€β”€β”˜

Actions

Mouse: click, right, double, move, drag, scroll
Keyboard: type, press, hotkey, click-type

See references/ for detailed command syntax.

Safety

⚠️ Performs real system actions β€” verify coordinates before executing. Screenshots auto-delete for privacy.

License

MIT License β€” see LICENSE file.

About

agent skill that let AI see and control your screen - Grid-based visual automation for LLM agents

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages