Design an AI Coding Agent
A hellointerview-style system design deep dive into autonomous AI coding agents like Devin, Cursor, and Claude Code. Covers requirements, core entities, the ReAct loop, and three production deep dives: codebase indexing and retrieval, context window management, and sandboxed execution and safety. Each deep dive walks through naive, better, and production-grade approaches with trade-offs.
Quick Reference
- The core loop is Plan, Edit, Run, Observe, Iterate — the agent decides what to do next based on execution results
- Codebase indexing uses hybrid retrieval (AST + embeddings + dependency graph) to find relevant code across 100K+ file repos
- Context window management uses a four-tier progressive compression system to stay within token budgets
- Sandboxed execution in Docker containers is non-negotiable — the agent has arbitrary shell access
- Retrieval quality is the number one factor in agent performance — most SWE-bench failures are retrieval failures, not reasoning failures
- Every coding agent should operate on a git branch with frequent checkpoint commits for rollback safety
- Token budget allocation matters: retrieved code gets the lion's share (100K tokens) because the agent must see the code it modifies
- The Plan entity is critical — agents that plan first and revise as they learn produce significantly better code
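The core loop above can be sketched in a few dozen lines. This is a minimal, hypothetical illustration — the tool names (`edit_file`, `run_command`) and the `decide`/`execute` callbacks are stand-ins for an LLM-driven planner and a real tool executor, not any specific product's API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # e.g. "edit_file" or "run_command" (illustrative names)
    args: dict
    observation: str = ""

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)
    done: bool = False

def run_agent(task, decide, execute, max_steps=25):
    """decide(state) -> Step or None (None means the agent judges itself done);
    execute(step) -> observation string fed back into the next decision."""
    state = AgentState(task=task)
    for _ in range(max_steps):
        step = decide(state)              # Plan: pick the next action
        if step is None:
            state.done = True             # agent decides the task is complete
            break
        step.observation = execute(step)  # Edit/Run, then Observe the result
        state.history.append(step)        # Iterate on accumulated observations
    return state
```

The key structural point is that `decide` sees the full `history` of actions and observations, so execution results (test failures, compiler errors) directly shape the next step — that feedback loop is what separates an agent from single-shot code generation.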
Understanding the Problem
An AI coding agent is a system that receives a task in natural language — a GitHub issue, a feature request, a bug report — and autonomously produces a working implementation. It edits multiple files across a real codebase, runs tests and build tools, diagnoses errors, and iterates until the code works. This is not autocomplete or inline suggestion. This is full autonomy: the agent decides which files to read, what to change, what commands to run, and when the job is done. Products like Devin (Cognition Labs), Cursor Agent Mode, Claude Code (Anthropic), and GitHub Copilot Agent Mode have made this a mainstream product category.

From a system design perspective, this is a rich problem because it touches retrieval (finding relevant code in massive repos), execution safety (the agent has shell access), resource management (context windows are finite), and iterative planning (the agent must recover from mistakes). It is also a problem where the trade-offs are sharp and consequential — getting retrieval wrong means the agent edits the wrong files, getting safety wrong means data loss, and getting context management wrong means the agent loses coherence mid-task.
Devin by Cognition Labs was the first widely-demonstrated fully autonomous coding agent, operating in a complete virtual environment with browser, terminal, and editor. Cursor embeds the agent inside the IDE so it can leverage LSP diagnostics, inline diffs, and user-guided corrections. Claude Code operates from the terminal with deep filesystem and git integration, prioritizing transparency and user control. GitHub Copilot Agent Mode works within pull requests, automatically proposing fixes for issues. Each product makes different trade-offs on autonomy, safety, and user interaction, but all share the same fundamental architecture: a ReAct loop over a set of code-editing and execution tools.
This is fundamentally about building a system that can autonomously plan, edit, execute, and iterate on code in a sandboxed environment. The three hardest sub-problems are: (1) finding the right code in a codebase too large to fit in context, (2) managing the context window as actions accumulate, and (3) ensuring the agent cannot cause irreversible damage.
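On the third sub-problem, the safety baseline is to give the agent shell access only inside a disposable container. A hedged sketch of what that wrapping might look like — the image name, resource limits, and mount path here are illustrative assumptions, not production values:

```python
import subprocess

def sandbox_argv(command, workdir="/tmp/workspace", image="agent-sandbox:latest"):
    """Build a `docker run` invocation that confines an agent shell command:
    no network, capped memory/CPU/processes, writes limited to the mounted
    workspace, container discarded afterward via --rm."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no outbound network access
        "--memory", "2g",              # cap memory
        "--cpus", "2",                 # cap CPU
        "--pids-limit", "256",         # prevent fork bombs
        "-v", f"{workdir}:/workspace", # only the workspace is writable
        "-w", "/workspace",
        image,
        "sh", "-c", command,
    ]

def run_in_sandbox(command, timeout=120):
    # The timeout kills runaway commands the agent launches; combined with
    # checkpoint commits on a git branch, any damage is confined and revertible.
    return subprocess.run(sandbox_argv(command), capture_output=True,
                          text=True, timeout=timeout)
```

The design point is that the blast radius of an arbitrary command is bounded by construction: the worst case is a trashed workspace copy, which the git checkpointing described above makes recoverable.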