Computer Use Agents
Building agents that interact with desktop applications using Anthropic's Computer Use: the screenshot-analyze-act loop, safety boundaries, sandboxing, and practical use cases like legacy app automation.
Quick Reference
- Computer use loop: take screenshot → send to Claude → receive coordinate-based action → execute → repeat
- Actions: mouse_move, left_click, right_click, double_click, type, key (keyboard shortcuts), screenshot
- Always sandbox: run in a VM or Docker container — the agent can interact with anything visible on screen
- Permission boundaries: whitelist allowed applications and screen regions, block sensitive areas (password managers, admin panels)
- Speed: expect 3-8 seconds per action — screenshot capture + LLM inference + action execution
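The loop and the permission-boundary check above can be sketched as a small driver function. This is an illustrative skeleton, not the real Anthropic SDK: `take_screenshot`, `call_claude`, and `execute` are stand-in callables you would wire to your screenshot tool, the Messages API, and an input backend, and the whitelisted region is an assumed example value.

```python
# Sketch of the screenshot → analyze → act loop. All names here are
# illustrative stubs, not the actual Anthropic SDK surface.

ALLOWED_REGION = (0, 0, 1024, 768)  # example whitelisted screen area (x, y, w, h)

def in_allowed_region(x, y):
    """Permission boundary: reject any action targeting pixels outside the whitelist."""
    rx, ry, rw, rh = ALLOWED_REGION
    return rx <= x < rx + rw and ry <= y < ry + rh

def run_agent(task, take_screenshot, call_claude, execute, max_steps=20):
    """Drive the loop: capture screen, ask the model for an action, execute, repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        shot = take_screenshot()                # step 1: capture current screen
        action = call_claude(history, shot)     # steps 2-3: model analyzes and decides
        if action["type"] == "done":
            return action.get("result")
        if "coordinate" in action and not in_allowed_region(*action["coordinate"]):
            raise PermissionError(f"blocked action outside sandbox region: {action}")
        execute(action)                         # step 4: perform the click/keystroke
        history.append({"role": "assistant", "content": str(action)})
    raise TimeoutError("agent exceeded max_steps without finishing")
```

Capping `max_steps` matters in practice: a confused agent that loops forever burns real money at per-action API prices.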
How Computer Use Works
Anthropic's Computer Use gives Claude the ability to interact with a full desktop environment — clicking buttons, typing text, using keyboard shortcuts, and reading screen content. Unlike browser agents that parse HTML, computer use works with raw pixels: Claude sees a screenshot and specifies pixel coordinates for where to click. This makes it universal (works with any application) but slower and less precise than API-based automation.
| Step | What Happens | Latency | Cost |
|---|---|---|---|
| 1. Screenshot | Capture current screen state as PNG | 100-500ms | Free |
| 2. Send to Claude | Image + conversation history sent to API | 50-200ms upload | Image tokens (~1000-2000) |
| 3. Claude analyzes | Identifies UI elements, decides action | 2-5s inference | Output tokens |
| 4. Execute action | Mouse move/click, keyboard input | 100-500ms | Free |
| 5. Wait for UI | Application processes the action | 500ms-5s | Free |
Each action in the loop takes 3-8 seconds and costs $0.02-0.05 in API calls. A 20-step workflow takes 1-3 minutes and costs $0.40-1.00. This is expensive compared to direct API automation, but it can automate tasks that have no API alternative.
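The arithmetic behind those workflow figures is just the per-step ranges multiplied out, which makes it easy to budget longer workflows before running them:

```python
# Back-of-envelope cost/latency estimate from the per-step ranges above.
# Pure arithmetic — no API calls. Default ranges match the figures in the text.

def workflow_estimate(steps, cost_per_step=(0.02, 0.05), secs_per_step=(3, 8)):
    """Return ((min_cost, max_cost) in USD, (min_time, max_time) in minutes)."""
    lo_cost = steps * cost_per_step[0]
    hi_cost = steps * cost_per_step[1]
    lo_time = steps * secs_per_step[0] / 60
    hi_time = steps * secs_per_step[1] / 60
    return (lo_cost, hi_cost), (lo_time, hi_time)

# 20 steps → $0.40-1.00 and roughly 1-2.7 minutes
costs, minutes = workflow_estimate(20)
```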