Advanced · 30 min

Design a Computer-Use Agent

A hellointerview-style system design deep dive into computer-use agents like OpenAI Operator (CUA), Anthropic Computer Use, and Google Project Mariner. Unlike API-based agents, these systems process raw screenshots via vision models and control computers through virtual mouse and keyboard actions. Covers requirements, core entities, the perception-action loop, and three production deep dives: screen understanding, action planning and error recovery, and sandboxing and security. Each deep dive walks through naive, better, and production-grade approaches with trade-offs.

Quick Reference

  • The agent processes raw screenshots via vision models and acts through virtual mouse and keyboard — fundamentally different from API-based tool-calling agents
  • Screen understanding uses multimodal vision with Set-of-Marks prompting and UI tree grounding to reliably identify interactive elements
  • Action planning uses a plan-execute-verify loop with stuck detection and automatic replanning when consecutive screenshots are identical
  • Sandboxing uses tiered permissions with ephemeral VMs, network allowlists, and action allowlists to prevent data exfiltration
  • Current SOTA on OSWorld is roughly 30-40 percent task completion — this is still an early and hard problem where errors compound across steps
  • The action space is effectively infinite: any pixel can be clicked, any text typed — making planning and error recovery far harder than finite-tool agents
  • Coordinate prediction errors account for roughly 30 percent of failures — Set-of-Marks prompting reduces this by labeling elements with numbered overlays
  • Prefer API-based agents when available — computer-use fills the gap for legacy software, desktop apps, and websites without APIs
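The stuck detection mentioned above — triggering a replan when consecutive screenshots are identical — can be sketched with a simple digest comparison. This is an illustrative implementation, not any vendor's actual code; the class and method names are invented for the example:

```python
import hashlib

def screenshot_digest(png_bytes: bytes) -> str:
    """Hash a screenshot so identical frames can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()

class StuckDetector:
    """Flags the agent as stuck once N consecutive screenshots are
    byte-identical, which signals the planner to replan."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_digest = None
        self.repeats = 0

    def observe(self, png_bytes: bytes) -> bool:
        digest = screenshot_digest(png_bytes)
        if digest == self.last_digest:
            self.repeats += 1
        else:
            self.last_digest = digest
            self.repeats = 0
        return self.repeats >= self.threshold
```

In practice a production system would use a perceptual hash rather than an exact byte hash, so that a blinking cursor or clock does not mask a genuinely stuck agent.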

Understanding the Problem

A computer-use agent is a system that receives a task in natural language and autonomously completes it by operating a computer the way a human would — looking at the screen, moving the mouse, clicking buttons, typing text. Unlike API-based agents that call structured endpoints, a computer-use agent perceives the world through raw screenshots and acts through virtual mouse and keyboard inputs.

This is a fundamentally different paradigm. The agent does not know what is on the screen until it looks. It does not know where a button is until it processes the pixels. It cannot call a structured API to submit a form — it must find the form fields, click into them, type the values, and click submit.

Products like OpenAI Operator (the Computer-Using Agent architecture), Anthropic Computer Use, and Google Project Mariner have demonstrated that modern vision models are capable enough to understand GUIs from raw pixels. But the problem remains extraordinarily hard because errors compound: a 95 percent per-step success rate over a 30-step task yields only a 21 percent end-to-end success rate.

From a system design perspective, this touches perception (understanding what is on screen), planning (deciding what to do next across many steps), execution (translating intent into precise pixel-level actions), and security (the agent has full computer access, including potentially sensitive data).
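The compounding math is worth internalizing: if step failures are independent, end-to-end success is just the per-step rate raised to the number of steps. A quick check of the 95-percent/30-step figure:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent
    per-step failures (a simplification; real failures correlate)."""
    return per_step ** steps

# 95 percent per-step reliability over a 30-step task
print(round(end_to_end_success(0.95, 30), 2))  # → 0.21
```

The same formula shows why per-step reliability matters so much: pushing per-step success to 99 percent lifts the 30-step figure to roughly 74 percent.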

Real project

OpenAI Operator unified Computer Use and Deep Research into a shared CUA (Computer-Using Agent) architecture that operates a full browser within a sandboxed VM. Anthropic Computer Use sends screenshots to Claude and receives coordinate-based actions, running inside a Docker container with a virtual desktop. Google Project Mariner focuses specifically on browser automation, overlaying an agent on Chrome tabs. The key insight across all products: vision models are good enough to understand GUIs, but error recovery and multi-step planning remain the hard problems. Current best results on the OSWorld benchmark sit at roughly 30 to 40 percent task completion.

The Core Framing

This is fundamentally about building a system that can perceive, plan, act, and recover on a real computer via raw pixels and virtual input devices. The three hardest sub-problems are: (1) reliably understanding what is on screen and where interactive elements are located, (2) planning and recovering across multi-step tasks where errors compound exponentially, and (3) ensuring the agent cannot access sensitive data or exfiltrate information from the host environment.
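The perceive-plan-act-recover cycle described above can be sketched as a loop. Everything here is illustrative: `capture_screenshot`, `plan_next_action`, and `execute` are hypothetical callables standing in for the sandboxed VM, the vision model, and the virtual input driver respectively:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(task, capture_screenshot, plan_next_action, execute,
              max_steps: int = 30):
    """Minimal perception-action loop: look at the screen, ask the
    model for the next action, perform it, repeat until done or the
    step budget runs out."""
    history = []
    for _ in range(max_steps):
        frame = capture_screenshot()                     # perceive: raw pixels
        action = plan_next_action(task, frame, history)  # plan: vision model
        if action.kind == "done":
            return history
        execute(action)                                  # act: virtual input
        history.append(action)
    return history  # budget exhausted — caller decides whether to replan
```

A production loop adds the pieces the deep dives cover: verification of each action's effect, stuck detection feeding a replanner, and an action allowlist enforced inside `execute`.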