Design an AI Coding Agent
A hellointerview-style system design deep dive into autonomous AI coding agents like Devin, Cursor, and Claude Code. Covers requirements, core entities, the ReAct loop, and three production deep dives: codebase indexing and retrieval, context window management, and sandboxed execution and safety. Each deep dive walks through naive, better, and production-grade approaches with trade-offs.
Quick Reference
- The core loop is Plan, Edit, Run, Observe, Iterate — the agent decides what to do next based on execution results
- Codebase indexing uses hybrid retrieval (AST + embeddings + dependency graph) to find relevant code across 100K+ file repos
- Context window management uses a four-tier progressive compression system to stay within token budgets
- Sandboxed execution in Docker containers is non-negotiable — the agent has arbitrary shell access
- Retrieval quality is the number one factor in agent performance — most SWE-bench failures are retrieval failures, not reasoning failures
- Every coding agent should operate on a git branch with frequent checkpoint commits for rollback safety
- Token budget allocation matters: retrieved code gets the lion's share (100K tokens) because the agent must see the code it modifies
- The Plan entity is critical — agents that plan first and revise as they learn produce significantly better code
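The core loop above can be sketched in a few dozen lines. This is a minimal, hypothetical illustration — the tool names (`edit_file`, `run_command`) and the `decide`/`execute` callbacks are stand-ins for an LLM-driven planner and a real tool executor, not any specific product's API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # e.g. "edit_file" or "run_command" (illustrative names)
    args: dict
    observation: str = ""

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)
    done: bool = False

def run_agent(task, decide, execute, max_steps=25):
    """decide(state) -> Step or None (None means the agent judges itself done);
    execute(step) -> observation string fed back into the next decision."""
    state = AgentState(task=task)
    for _ in range(max_steps):
        step = decide(state)              # Plan: pick the next action
        if step is None:
            state.done = True             # agent decides the task is complete
            break
        step.observation = execute(step)  # Edit/Run, then Observe the result
        state.history.append(step)        # Iterate on accumulated observations
    return state
```

The key structural point is that `decide` sees the full `history` of actions and observations, so execution results (test failures, compiler errors) directly shape the next step — that feedback loop is what separates an agent from single-shot code generation.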
Understanding the Problem
An AI coding agent is a system that receives a task in natural language — a GitHub issue, a feature request, a bug report — and autonomously produces a working implementation. It edits multiple files across a real codebase, runs tests and build tools, diagnoses errors, and iterates until the code works. This is not autocomplete or inline suggestion. This is full autonomy: the agent decides which files to read, what to change, what commands to run, and when the job is done. Products like Devin (Cognition Labs), Cursor Agent Mode, Claude Code (Anthropic), and GitHub Copilot Agent Mode have made this a mainstream product category.

From a system design perspective, this is a rich problem because it touches retrieval (finding relevant code in massive repos), execution safety (the agent has shell access), resource management (context windows are finite), and iterative planning (the agent must recover from mistakes). It is also a problem where the trade-offs are sharp and consequential — getting retrieval wrong means the agent edits the wrong files, getting safety wrong means data loss, and getting context management wrong means the agent loses coherence mid-task.
Devin by Cognition Labs was the first widely-demonstrated fully autonomous coding agent, operating in a complete virtual environment with browser, terminal, and editor. Cursor embeds the agent inside the IDE so it can leverage LSP diagnostics, inline diffs, and user-guided corrections. Claude Code operates from the terminal with deep filesystem and git integration, prioritizing transparency and user control. GitHub Copilot Agent Mode works within pull requests, automatically proposing fixes for issues. Each product makes different trade-offs on autonomy, safety, and user interaction, but all share the same fundamental architecture: a ReAct loop over a set of code-editing and execution tools.
This is fundamentally about building a system that can autonomously plan, edit, execute, and iterate on code in a sandboxed environment. The three hardest sub-problems are: (1) finding the right code in a codebase too large to fit in context, (2) managing the context window as actions accumulate, and (3) ensuring the agent cannot cause irreversible damage.
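On the third sub-problem, the safety baseline is to give the agent shell access only inside a disposable container. A hedged sketch of what that wrapping might look like — the image name, resource limits, and mount path here are illustrative assumptions, not production values:

```python
import subprocess

def sandbox_argv(command, workdir="/tmp/workspace", image="agent-sandbox:latest"):
    """Build a `docker run` invocation that confines an agent shell command:
    no network, capped memory/CPU/processes, writes limited to the mounted
    workspace, container discarded afterward via --rm."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no outbound network access
        "--memory", "2g",              # cap memory
        "--cpus", "2",                 # cap CPU
        "--pids-limit", "256",         # prevent fork bombs
        "-v", f"{workdir}:/workspace", # only the workspace is writable
        "-w", "/workspace",
        image,
        "sh", "-c", command,
    ]

def run_in_sandbox(command, timeout=120):
    # The timeout kills runaway commands the agent launches; combined with
    # checkpoint commits on a git branch, any damage is confined and revertible.
    return subprocess.run(sandbox_argv(command), capture_output=True,
                          text=True, timeout=timeout)
```

The design point is that the blast radius of an arbitrary command is bounded by construction: the worst case is a trashed workspace copy, which the git checkpointing described above makes recoverable.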