Safe desktop control for any AI agent. Reads the screen through the accessibility tree (screenshots as fallback),
verifies its own actions, and gates everything through one safety checkpoint. Local · cross-OS · any model.
Quickstart · Why it's different · The engine · How it works · Tools · Platforms · Changelog
Clawd Cursor is a local MCP server that gives any tool-calling agent — Claude Code, Cursor, Windsurf, OpenClaw, the Claude Agent SDK, or your own loop — safe control of the real desktop. It clicks, types, reads the screen, opens apps, and drives any GUI the way a human would: native apps, the browser, even a canvas.
Most "let an agent use the computer" tools take a screenshot and feed it to a vision model — slow, expensive, and brittle. Clawd Cursor reads the accessibility tree first (structured text, near-free, no vision model), falls back to OCR, and only reaches for pixels as a last resort. The result is cheaper, faster, private, and — uniquely — it checks that each action actually did what it claimed.
If a human can do it on a screen, your agent can too. No API, no integration, no problem — only the right sequence of reads, clicks, keys, and waits. Use it as the last-mile fallback: native API exists? Use it. CLI? Use it. Clawd Cursor is for the click, the legacy app, the GUI with no public surface.
The desktop-agent space is crowded. The closest install-and-go peers are Windows-MCP and Terminator (desktop MCP servers); browser-only tools (browser-use, Playwright MCP) are adjacent; and OmniParser / UI-TARS are vision-centric parsing approaches you'd build an agent around, not products you install. Here's the honest comparison across those approaches — what Clawd Cursor does that the popular options don't:
| Clawd Cursor | browser-use | Playwright MCP | OmniParser / UI-TARS | computer-use | |
|---|---|---|---|---|---|
| Any desktop app, not just the web | ✅ | web only | web only | ✅ | ✅ |
| Cross-OS (Windows + macOS + Linux) | ✅ | — | — | varies | sandbox |
| Perception without a vision model | ✅ a11y → OCR → vision | DOM | a11y tree | ❌ vision-centric | ❌ vision |
| Verifies its own actions (deviation) | ✅ | — | — | — | — |
| Single safety chokepoint (allow/confirm/block) | ✅ | — | — | — | — |
| Any model / vendor | ✅ | ✅ | not an agent | model-specific | Claude only |
| MCP-native (one config, any host) | ✅ | library | test framework | — | tool-use API |
| Local-only, no cloud required | ✅ | ✅ | ✅ | needs a model | screens → cloud |
Three things here are genuinely rare:
- Cheapest-tier-first perception, fully local. Accessibility tree (free) → OCR (cheap) → screenshot (expensive — the only tier that puts pixels in the model's context; "screenshot" and "vision" are the same step). The agent climbs only when it must, so token cost tracks task difficulty — and with a local model, nothing leaves the machine. Vision-centric agents (OmniParser, UI-TARS) need a screenshot in the model for every observation.
- It verifies. Pass
expecton a consequential action and Clawd Cursor re-checks the live screen (with a short settle window for async UIs) and reports a DEVIATION instead of a hollow "success." A completed task can't be marked done on evidence that was already true before it acted. - One safety gate. Every call — from an editor over stdio, an external agent over HTTP, or the built-in loop — routes through a single
safety.evaluate()chokepoint (allow / confirm / block) before it touches the desktop. The agent cannot bypass it.
Plus: an on-screen "desktop control in progress" banner with a blinking red dot whenever an agent is driving — double-click it to stop. A human at the machine always knows, and always has a kill switch.
Install (any OS):
npm i -g clawdcursorOr one line per OS (clones, builds, handles the macOS native build)
# Windows (PowerShell)
powershell -c "irm https://clawdcursor.com/install.ps1 | iex"# macOS / Linux
curl -fsSL https://clawdcursor.com/install.sh | bashSet up — this is the whole thing for the common case (your agent drives over MCP):
clawdcursor consent --accept # one-time consent (required) — also registers the
# clawdcursor *skill* in your agents (Claude Code,
# OpenClaw, Codex, Cursor); re-run: clawdcursor register-skill
clawdcursor grant # macOS only — approve Accessibility + Screen RecordingWire it into your editor (Claude Code, Cursor, Windsurf, Zed):
That's it. Ask your agent to "open Outlook and reply to the latest email from Sarah" and watch it run.
You never run
clawdcursor mcpyourself — the editor spawns it over stdio on demand.clawdcursor doctoris not part of MCP setup; it only configures the built-in LLM for the autonomousagentdaemon. On macOS, Accessibility is required (the primary control path); Screen Recording is optional (only the vision fallback needs it).
Editor permission allowlists: use the server-level wildcard
"mcp__clawdcursor"rather than per-tool entries — it covers every tool and survives tool renames across versions.
If you use Claude Code, you can skip the manual mcpServers block above. This
repo ships a plugin (.claude-plugin/plugin.json) that registers the MCP server
and bundles the usage skill in one step. It launches the server with
npx -y clawdcursor mcp --compact, so there's nothing to install first —
npx fetches clawdcursor on demand (or uses your global install if you have one), and
because it resolves the package's bin (never a hard-coded dist/ path) it can't be
broken by an entry-point change on upgrade.
# load the plugin for one session straight from a checkout…
claude --plugin-dir /path/to/clawdcursor
# …or add this repo to a plugin marketplace for a persistent install.
# one-time desktop-control consent (npx fetches the bin if you don't have it):
npx -y clawdcursor consent --acceptRequires Node.js 20+ (for
npx, which ships with Node). The first launch downloads clawdcursor into npx's cache; later launches reuse it — no global install and noPATHshim to resolve.
The perception + verification core (the UI State Compiler, since v1.5.0):
compile_uifuses the accessibility tree and OCR into one confidence-scored map of the screen, every element tagged with a stableel_NNid. Act on an element by{element_id, snapshot_id}instead of pixels — near-free in tokens, and it survives DPI, resize, and layout shifts.find_button/find_fieldlocate a target by meaning and hand you the id.- Reactive verification.
expecton an action → Clawd Cursor confirms the outcome on the live screen and returns a DEVIATION when the UI didn't obey. - Cross-platform parity. The compiler, secure-field redaction, and coordinate handling run on Windows, macOS, and Linux; the external-agent (MCP) surface resolves
el_NNrefs through the safety gate and discloses when it attached to your existing browser.
Set-of-Mark-style element IDs and a11y/OCR fusion aren't new ideas on their own — what's rare is doing them locally, a11y-first (no vision model required), with a built-in verification gate and one safety chokepoint, across three operating systems, behind a single MCP config.
See the changelog for the full release history (latest: v1.5.2 — perception reliability, honest verification, the control banner).
Where the brain lives decides how you run it. Both modes can run side-by-side.
| Brain lives… | Mode | Command | What you call |
|---|---|---|---|
| In your editor (Claude Code, Cursor, Windsurf, Zed) | Direct tools | clawdcursor mcp |
Each tool, via stdio MCP |
| In a headless agent with its own LLM (OpenClaw, Agent SDK, your loop) | Direct tools | clawdcursor agent --no-llm |
Same, over HTTP MCP |
| Inside Clawd Cursor itself (scheduled / "submit and walk away") | Thin agent loop | clawdcursor agent + doctor-configured LLM |
task / submit_task |
| External brain that delegates sub-tasks to the built-in loop | Direct + delegation | clawdcursor agent + your client |
task({instruction:…}) to hand off |
Read the a11y tree (cheap) → act on named targets → verify from fresh observations → escalate perception only when needed (OCR → screenshot, the one tier that sends pixels to the model). Sparse a11y tree? system.detect_webview switches Electron/WebView2 apps to browser.* over CDP. Canvas-only (Paint, Figma, games)? Screenshot + coordinate click.
flowchart TB
task["User task"] --> loop["Agent LLM loop<br/>plans · chooses tools · verifies"]
loop --> observe{"Cheapest observation<br/>that answers the question"}
observe -- "obs·a11y — free" --> a11y["A11y tree<br/>(structured text + el_NN handles)"]
observe -- "obs·ocr — cheap" --> ocr["OCR (OS-level, no vision LLM)"]
observe -- "obs·dom — medium" --> dom["Browser DOM (CDP)"]
observe -- "obs·vision — expensive" --> vision["Screenshot (image into context)"]
a11y --> act
ocr --> act
dom --> act
vision --> act
act["Act<br/>click/type/key/drag · invoke/set_value · open_app · batch"] --> safety
safety["Single safety gate<br/>safety.evaluate() → allow / confirm / block"] -- allowed --> tools["Tool registry<br/>98 granular + 7 compound"]
safety -- needs user --> confirm["Human confirmation"] --> tools
safety -- denied --> blocked["blocked"]
tools --> desktop["Real desktop"]
desktop --> verify{"expect → does state match?"}
verify -- pass --> done["done"]
verify -- "DEVIATION" --> loop
classDef agentNode fill:#dbeafe,stroke:#2563eb,color:#0f172a;
classDef gate fill:#ede9fe,stroke:#7c3aed,color:#0f172a;
classDef obsNode fill:#fef9c3,stroke:#ca8a04,color:#0f172a;
classDef actNode fill:#ffedd5,stroke:#ea580c,color:#0f172a;
classDef stop fill:#fee2e2,stroke:#dc2626,color:#0f172a;
class loop,verify agentNode;
class safety,confirm,tools gate;
class observe,a11y,ocr,dom,vision obsNode;
class act actNode;
class blocked stop;
batch for deterministic stretches. When the next N steps are known, collapse them into one call — each step still routes through the safety gate; on any guard miss or error the batch halts with a per-step trace.
Task delegation. With an LLM configured on the daemon, an external agent can hand off at any point: task({"instruction":"…"}). The built-in loop takes the wheel and reports back — offload grunt work to a cheaper model without burning your own context.
Two catalogs ship side-by-side. The toolbox is 7 compound tools, each with an action enum covering ~10–20 verbs (~1,500 tokens total — about 12× smaller than granular, the computer_20250124 shape editor hosts already know). The granular surface is the 98 underlying primitives, one schema per verb (for runtimes that need top-level tools, or for debugging). Both run through the same safety.evaluate() chokepoint; the full catalog is always visible via MCP tools/list.
| Toolbox | Actions |
|---|---|
computer |
screenshot, click, double_click, right_click, triple_click, hover, move, scroll, scroll_horizontal, drag, drag_path, type, key, wait |
accessibility |
read_tree, find, get_element, focused, invoke, focus, set_value, get_value, expand, collapse, toggle, select, state, list_children, wait_for, compile_ui, find_button, find_field, smart_click, smart_type, smart_read |
window |
list, active, focus, maximize, minimize, restore, close, resize, list_displays, screen_size, open_app, open_file, open_url, switch_tab, navigate |
system |
clipboard_read, clipboard_write, system_time, ocr, undo, shortcuts_list, shortcuts_run, delegate, detect_webview, relaunch_with_cdp, system_prompt, build_uri, open_uri, open_app, open_file, open_url, detect_app, app_guide, learn_app |
browser |
connect, page_context, read_text, click, type, select_option, evaluate, wait_for, list_tabs, switch_tab, scroll |
task |
run (default; bounded-sync — waits up to timeouts, returns {status:"running"} + progress if longer, re-call to keep waiting), status, abort. Delegates to the built-in loop. Requires clawdcursor agent with an LLM. |
batch |
{steps:[…]} — collapse N calls into one round-trip; each step {name, arguments, expect?}, re-perceived and safety-gated, halts with a trace on any miss. |
computer({ action: "key", combo: "mod+s" }) // Cmd+S / Ctrl+S, resolved per-OS
accessibility({ action: "invoke", name: "Send" }) // click by name, not pixels
window({ action: "open_app", name: "Outlook" })
task({ instruction: "open Notepad and type hello" }) // hand off to the thin loopEvery observation has a cost. Start at the cheapest rung that works; climb only when it fails. The live log (CLAWD_LOG=pretty, default on a TTY) shows the ladder in real time via per-call badges.
| Tier | Badge | Cost | Source | When |
|---|---|---|---|---|
| T1 structured | obs·a11y |
~free | accessibility.*, window.*, browser.read_text, clipboard |
Default. Text + bounds, no image, no vision LLM. |
| T2 OCR | obs·ocr |
cheap | system.ocr, smart_read / smart_click / smart_type |
A11y tree empty/sparse. OS-level text out, no image bytes. |
| T3 DOM | obs·dom |
medium | browser.read_text / page_context (CDP) |
WebView / Electron / Chrome content. |
| T4 screenshot (vision) | obs·vision |
expensive | computer.screenshot |
The only tier that puts pixels in the model's context. Canvas-only apps or spatial reasoning. Last resort. |
Acting tools log act. Watching obs·a11y → act → obs·a11y on a normal turn — and the rare climb to obs·vision — is the whole efficiency model, visible.
One protocol — MCP — two transports, same catalog and JSON-RPC envelope. Both stateless; no session handshake.
| Transport | When | Client config |
|---|---|---|
| stdio MCP | Editor hosts. Tools appear on demand — no daemon. | {"command":"clawdcursor","args":["mcp","--compact"]} |
| HTTP MCP | Headless agents, daemons, orchestration, Agent SDK. POST JSON-RPC to http://127.0.0.1:3847/mcp. |
Run clawdcursor agent. Bearer token at ~/.clawdcursor/token. |
# HTTP MCP — list tools
curl -s -X POST http://127.0.0.1:3847/mcp \
-H "Authorization: Bearer $(cat ~/.clawdcursor/token)" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'Platform code lives behind a single PlatformAdapter interface (src/platform/{windows,macos,linux}.ts + wayland-backend.ts). Business logic never reads process.platform.
| Platform | UI Automation | OCR | Browser (CDP) | Input |
|---|---|---|---|---|
| Windows 10/11 (x64 / ARM64) | UIA via PowerShell bridge | Windows.Media.Ocr |
Chrome / Edge | nut-js |
| macOS 12+ (Intel / Apple Silicon) | JXA + System Events (TCC-safe) | Apple Vision | Chrome / Edge | nut-js + System Events |
| Linux X11 | AT-SPI via python3-gi |
Tesseract | Chrome / Edge | nut-js |
| Linux Wayland | AT-SPI via python3-gi |
Tesseract | Chrome / Edge | ydotool / wtype |
- Windows — no setup; the PowerShell bridge spawns on demand.
- macOS — first run needs Accessibility (required) + Screen Recording (optional);
clawdcursor grantwalks the dialogs. Retina/HiDPI handled in-adapter — don't pre-scale coordinates. - Linux X11 —
apt install tesseract-ocr python3-gi gir1.2-atspi-2.0. - Linux Wayland — same, plus
ydotool+ydotoold(preferred) orwtype(keyboard only).
| Tier | Actions | Behavior |
|---|---|---|
| Allow | Reading, opening apps, navigation, typing into non-sensitive fields, minimize | Executes immediately |
| Confirm | Sends, deletes, purchases, transfers, close-window/quit-app & show-desktop key combos, sensitive apps | Pauses for approval (batch({allowConfirm:true}) to authorize) |
| Block | Ctrl+Alt+Del, lock / log-out / force-quit / shutdown key sequences |
Refused outright (no path) |
- Network isolation. Binds to
127.0.0.1. Verify:netstat -an | findstr 3847(Windows) /| grep 3847(Unix). - Bearer-token auth on every HTTP request (
~/.clawdcursor/token). - Sensitive-app policy. Email, banking, password managers, private messaging auto-elevate to Confirm.
- No telemetry by default. Nothing phones home. Screenshots stay in RAM; with a local model nothing leaves the machine; with a cloud provider, screenshots go only to the endpoint you configured.
clawdcursor reportis opt-in and previews exactly what it sends. - Prompt-injection defense. Screen text is returned inside
<untrusted-screen-content>tags — data, never instructions. - Log privacy. Logs redact password-field values (
AXSecureTextField, UIAIsPassword=true).
See SECURITY.md for private vulnerability reporting.
| Directory | What lives here |
|---|---|
src/core/ |
Thin agent loop (runAgent), sense layer (a11y / snapshot / fingerprint / UI compiler), reactive verification, focus guard, safety gate. |
src/tools/ |
98 granular tools + 7 compound aggregators + batch, playbooks, registry, dispatch. |
src/platform/ |
PlatformAdapter + Windows / macOS / Linux / Wayland, OCR engine, CDP driver, URI handler. |
src/llm/ |
Provider clients (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), credentials, model config. |
src/surface/ |
CLI, MCP server (stdio + HTTP), dashboard, doctor, onboarding, control banner. |
The PlatformAdapter is the only thing platform code talks to; safety.evaluate() is the only way tools execute. Those two seams are the whole point.
For humans diagnosing an install. Agents connect via MCP.
clawdcursor consent Manage desktop-control consent (--accept / --revoke / --status)
clawdcursor grant Grant macOS permissions (interactive, macOS only)
clawdcursor doctor Configure the AI provider for `agent` mode (+ diagnostics)
clawdcursor status Readiness check (consent, permissions, AI config)
clawdcursor mcp stdio MCP server — editor hosts spawn this; you don't
clawdcursor agent Daemon: HTTP MCP on :3847, optional built-in thin loop
clawdcursor agent --no-llm Daemon, tool surface only (no built-in brain)
clawdcursor stop Stop every running mode
clawdcursor uninstall Remove all config and data
Options: --port <n> (default 3847) · --compact · --no-banner · --provider <name> · --accept
git clone https://github.com/AmrDab/clawdcursor.git && cd clawdcursor
npm install
npm run build # tsc + postbuild → dist/surface/cli.js
npm test # vitest (1,000+ tests)
npm run lint # eslint
npm link # global `clawdcursor` shim (Admin shell on Windows)Tests run on Node 20 & 22 against Ubuntu, macOS, and Windows in CI, plus a coverage ratchet, a perf tripwire, and an npm audit gate.
Tech: TypeScript · Node 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander.
PRs welcome — see CONTRIBUTING.md for the dev loop, branch conventions, and the test matrix every change clears. Bugs and features in issues; private security reports via SECURITY.md.
MIT — see LICENSE.
Built on the Model Context Protocol SDK, nut-js, Playwright, the Anthropic computer_20250124 tool shape, and the AT-SPI / UIA / AX trees that make app-agnostic GUI automation possible at all.
clawdcursor.com · Discord · Changelog · npm