Computer Use Agents
Building agents that interact with desktop applications using Anthropic's Computer Use: the screenshot-analyze-act loop, safety boundaries, sandboxing, and practical use cases like legacy app automation.
Quick Reference
- Computer use loop: take screenshot → send to Claude → receive coordinate-based action → execute → repeat
- Actions: mouse_move, left_click, right_click, double_click, type, key (keyboard shortcuts), screenshot
- Always sandbox: run in a VM or Docker container — the agent can interact with anything visible on screen
- Permission boundaries: whitelist allowed applications and screen regions, block sensitive areas (password managers, admin panels)
- Speed: expect 3-8 seconds per action — screenshot capture + LLM inference + action execution
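The loop and the permission-boundary check above can be sketched as a small driver function. This is an illustrative skeleton, not the real Anthropic SDK: `take_screenshot`, `call_claude`, and `execute` are stand-in callables you would wire to your screenshot tool, the Messages API, and an input backend, and the whitelisted region is an assumed example value.

```python
# Sketch of the screenshot → analyze → act loop. All names here are
# illustrative stubs, not the actual Anthropic SDK surface.

ALLOWED_REGION = (0, 0, 1024, 768)  # example whitelisted screen area (x, y, w, h)

def in_allowed_region(x, y):
    """Permission boundary: reject any action targeting pixels outside the whitelist."""
    rx, ry, rw, rh = ALLOWED_REGION
    return rx <= x < rx + rw and ry <= y < ry + rh

def run_agent(task, take_screenshot, call_claude, execute, max_steps=20):
    """Drive the loop: capture screen, ask the model for an action, execute, repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        shot = take_screenshot()                # step 1: capture current screen
        action = call_claude(history, shot)     # steps 2-3: model analyzes and decides
        if action["type"] == "done":
            return action.get("result")
        if "coordinate" in action and not in_allowed_region(*action["coordinate"]):
            raise PermissionError(f"blocked action outside sandbox region: {action}")
        execute(action)                         # step 4: perform the click/keystroke
        history.append({"role": "assistant", "content": str(action)})
    raise TimeoutError("agent exceeded max_steps without finishing")
```

Capping `max_steps` matters in practice: a confused agent that loops forever burns real money at per-action API prices.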
How Computer Use Works
Anthropic's Computer Use gives Claude the ability to interact with a full desktop environment — clicking buttons, typing text, using keyboard shortcuts, and reading screen content. Unlike browser agents that parse HTML, computer use works with raw pixels: Claude sees a screenshot and specifies pixel coordinates for where to click. This makes it universal (works with any application) but slower and less precise than API-based automation.
| Step | What Happens | Latency | Cost |
|---|---|---|---|
| 1. Screenshot | Capture current screen state as PNG | 100-500ms | Free |
| 2. Send to Claude | Image + conversation history sent to API | 50-200ms upload | Image tokens (~1000-2000) |
| 3. Claude analyzes | Identifies UI elements, decides action | 2-5s inference | Output tokens |
| 4. Execute action | Mouse move/click, keyboard input | 100-500ms | Free |
| 5. Wait for UI | Application processes the action | 500ms-5s | Free |
Each action in the loop takes 3-8 seconds and costs $0.02-0.05 in API calls. A 20-step workflow takes 1-3 minutes and costs $0.40-1.00. This is expensive compared to direct API automation, but it can automate tasks that have no API alternative.
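The arithmetic behind those workflow figures is just the per-step ranges multiplied out, which makes it easy to budget longer workflows before running them:

```python
# Back-of-envelope cost/latency estimate from the per-step ranges above.
# Pure arithmetic — no API calls. Default ranges match the figures in the text.

def workflow_estimate(steps, cost_per_step=(0.02, 0.05), secs_per_step=(3, 8)):
    """Return ((min_cost, max_cost) in USD, (min_time, max_time) in minutes)."""
    lo_cost = steps * cost_per_step[0]
    hi_cost = steps * cost_per_step[1]
    lo_time = steps * secs_per_step[0] / 60
    hi_time = steps * secs_per_step[1] / 60
    return (lo_cost, hi_cost), (lo_time, hi_time)

# 20 steps → $0.40-1.00 and roughly 1-2.7 minutes
costs, minutes = workflow_estimate(20)
```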