Advanced · 11 min

Computer Use Agents

Building agents that interact with desktop applications using Anthropic's Computer Use: the screenshot-analyze-act loop, safety boundaries, sandboxing, and practical use cases like legacy app automation.

Quick Reference

  • Computer use loop: take screenshot → send to Claude → receive coordinate-based action → execute → repeat
  • Actions include: mouse_move, left_click, right_click, double_click, type, key (keyboard shortcuts), screenshot
  • Always sandbox: run in a VM or Docker container — the agent can interact with anything visible on screen
  • Permission boundaries: whitelist allowed applications and screen regions, block sensitive areas (password managers, admin panels)
  • Speed: expect 3-8 seconds per action — screenshot capture + LLM inference + action execution
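
One way to enforce the screen-region part of the permission boundary above is to check every coordinate before the action is executed. The following is a minimal sketch, not part of Anthropic's API: the Region class, is_click_allowed, and the pixel values in ALLOWED and BLOCKED are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned screen rectangle in pixels; right and bottom edges are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


def is_click_allowed(x: int, y: int, allowed: list, blocked: list) -> bool:
    """Deny by default: blocked regions always win, and a point must fall
    inside at least one allowed region to pass."""
    if any(region.contains(x, y) for region in blocked):
        return False
    return any(region.contains(x, y) for region in allowed)


# Hypothetical layout: the automated application occupies most of the screen,
# and a strip in the top-right corner (e.g. a password manager icon) is blocked.
ALLOWED = [Region(0, 0, 1024, 700)]
BLOCKED = [Region(900, 0, 1024, 40)]
```

Checking the blocked list first means an overlap between an allowed and a blocked region resolves to "deny", which is the safer default for an agent that can click anything visible.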

How Computer Use Works

Anthropic's Computer Use gives Claude the ability to interact with a full desktop environment — clicking buttons, typing text, using keyboard shortcuts, and reading screen content. Unlike browser agents that parse HTML, computer use works with raw pixels: Claude sees a screenshot and specifies pixel coordinates for where to click. This makes it universal (works with any application) but slower and less precise than API-based automation.
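
The screenshot-analyze-act cycle can be sketched as a small driver loop. This is an illustrative skeleton under stated assumptions, not Anthropic's client code: take_screenshot, ask_model, and execute are hypothetical callables you would supply (screen capture, the API call to Claude, and the mouse/keyboard executor), and the action dictionary shape is assumed for the example.

```python
from typing import Callable, Optional

# Assumed action shape for this sketch, e.g.
# {"action": "left_click", "coordinate": [x, y]} or {"action": "type", "text": "..."}
Action = dict


def run_agent_loop(
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[bytes, list], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> list:
    """Repeat screenshot -> model -> execute until the model returns None
    (task finished) or the step budget is exhausted. Returns executed actions."""
    history: list = []
    for _ in range(max_steps):
        png = take_screenshot()           # 1. capture current screen state
        action = ask_model(png, history)  # 2-3. send image + history, get next action
        if action is None:                # model signals that the task is done
            break
        execute(action)                   # 4. perform the mouse or keyboard action
        history.append(action)
    return history
```

Passing the three steps in as callables keeps the loop testable with scripted fakes, and the max_steps cap bounds both runtime and API spend if the model never signals completion.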

Step              | What Happens                              | Latency          | Cost
1. Screenshot     | Capture current screen state as PNG       | 100-500 ms       | Free
2. Send to Claude | Image + conversation history sent to API  | 50-200 ms upload | Image tokens (~1000-2000)
3. Claude analyzes| Identifies UI elements, decides action    | 2-5 s inference  | Output tokens
4. Execute action | Mouse move/click, keyboard input          | 100-500 ms       | Free
5. Wait for UI    | Application processes the action          | 500 ms-5 s       | Free

Total per action: 3-8 seconds, $0.02-0.05

At 3-8 seconds and $0.02-0.05 per action, a 20-step workflow takes roughly 1-3 minutes and costs $0.40-1.00. That is slow and expensive compared with calling an application's API directly, but it can automate tasks that have no API alternative.
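
The workflow estimate is simple multiplication of the per-action ranges by the step count. A small helper makes it easy to budget other workflow lengths; estimate_workflow is a hypothetical name, and its defaults are just the per-action figures quoted in this section.

```python
def estimate_workflow(
    steps: int,
    seconds_per_action: tuple = (3.0, 8.0),
    dollars_per_action: tuple = (0.02, 0.05),
) -> dict:
    """Scale the per-action latency and cost ranges by the number of steps."""
    low_s, high_s = seconds_per_action
    low_d, high_d = dollars_per_action
    return {
        "seconds": (steps * low_s, steps * high_s),
        "dollars": (round(steps * low_d, 2), round(steps * high_d, 2)),
    }


estimate = estimate_workflow(20)
# estimate["seconds"] -> (60.0, 160.0), i.e. about 1 to 2.7 minutes
# estimate["dollars"] -> (0.4, 1.0)
```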