Design a Deep Research Agent
A hellointerview-style system design deep dive into autonomous deep research agents like OpenAI Deep Research, Anthropic multi-agent research, and Gemini Deep Research. Covers requirements, core entities, the orchestrator-worker architecture, and three production deep dives: research orchestration with adaptive budget allocation, hierarchical memory with progressive summarization, and iterative report synthesis with source verification. Each deep dive walks through naive, better, and production-grade approaches with trade-offs.
Quick Reference
- The orchestrator-worker pattern is the core architecture: a lead agent creates a research plan and delegates to 3-5 parallel worker agents
- Token usage explains approximately 80 percent of quality variance in deep research — spending more tokens on more sources and more iterations is the number one quality lever
- Hierarchical memory with progressive summarization lets research sessions survive far beyond context window limits
- Adaptive termination criteria prevent both premature stopping (incomplete research) and runaway spending (diminishing returns)
- Each worker operates independently with its own context window, tools, and research focus — inter-worker communication adds coordination overhead for little benefit
- Iterative report drafting with source verification catches contradictions, fills gaps, and ensures every claim is cited before the report is finalized
- Rainbow deployments keep old instances alive to finish multi-hour research tasks while new instances handle new requests
- Cost per deep research task ranges from 5 to 50 dollars depending on depth — implement configurable budgets with hard caps
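To make the orchestrator-worker pattern concrete, here is a minimal sketch of a lead agent fanning out to parallel workers under a hard token budget. All names (`Finding`, `run_worker`, `orchestrate`) are hypothetical, and the LLM and tool calls are stubbed out; a real implementation would plan sub-queries with a model and allocate budget adaptively rather than splitting it evenly.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical sketch of the orchestrator-worker pattern: a lead agent
# splits the research question into sub-queries, runs several workers in
# parallel, and enforces a hard token budget across the whole task.

@dataclass
class Finding:
    sub_query: str
    summary: str
    tokens_used: int

async def run_worker(sub_query: str, token_budget: int) -> Finding:
    """Stand-in for a worker agent with its own context window and tools.
    A real worker would loop over search -> read -> summarize until its
    slice of the budget is exhausted or its sub-question is answered."""
    await asyncio.sleep(0)  # placeholder for tool calls / LLM calls
    return Finding(sub_query, f"summary of: {sub_query}", tokens_used=token_budget)

async def orchestrate(question: str, total_budget: int = 100_000,
                      num_workers: int = 4) -> list[Finding]:
    # The lead agent would normally plan sub-queries with an LLM; hardcoded here.
    sub_queries = [f"{question} (angle {i + 1})" for i in range(num_workers)]
    # Naive equal split; adaptive budget allocation would re-weight per sub-query.
    per_worker = total_budget // num_workers
    return await asyncio.gather(*(run_worker(q, per_worker) for q in sub_queries))

findings = asyncio.run(orchestrate("impact of rate limiting on API design"))
```

Because each worker is an independent coroutine with its own budget slice, adding or removing workers changes breadth without any inter-worker coordination, matching the "no inter-worker communication" principle above.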
Understanding the Problem
A deep research agent is a system that autonomously conducts multi-hour research on a complex topic and produces a comprehensive, cited report. Unlike quick AI search (Perplexity-style answers in 10 seconds), deep research spends minutes to hours investigating a topic from multiple angles, reading dozens to hundreds of sources, cross-referencing claims, identifying contradictions, and synthesizing findings into a structured document comparable to what a junior research analyst would produce. Products like OpenAI Deep Research, Anthropic's multi-agent research system, and Gemini Deep Research have made this a mainstream product category.

From a system design perspective, this is among the most challenging AI system design problems because it combines long-running computation (tasks run for hours, not seconds), massive information processing (500K to 2M tokens of raw content per task), multi-agent coordination (an orchestrator delegates to parallel workers), memory management (findings exceed any context window), and high-stakes synthesis (the final report must be accurate, cited, and coherent).

The trade-offs are sharp and the stakes are high: a shallow research report wastes the user's time and money, an inaccurate report is worse than no report, and a report that costs 50 dollars but could have been adequate at 10 dollars erodes user trust in the system's economics.
OpenAI Deep Research launched in 2025 using an orchestrator that spawns subagents for parallel investigation, capable of multi-hour research sessions that produce comprehensive reports. Anthropic published 'How we built our multi-agent research system' revealing two critical findings: token usage explains approximately 80 percent of quality variance, and parallel workers reduce research time by up to 90 percent compared to sequential investigation. Gemini Deep Research operates within Google's ecosystem, leveraging its search index directly for retrieval and integrating with Google Docs for report delivery. All three systems share the same core insight: deep research quality scales with compute (tokens processed), not just model capability.
This is fundamentally about building a system that can autonomously plan a research strategy, execute it across multiple parallel workers, manage findings that far exceed any context window, and synthesize everything into a coherent report with verifiable citations. The three hardest sub-problems are: (1) orchestrating research across parallel workers with adaptive budget allocation and intelligent termination criteria, (2) maintaining coherent memory across a session that accumulates more tokens than any context window can hold, and (3) synthesizing a final report that is structured, accurate, internally consistent, and fully cited.
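Sub-problem (2), hierarchical memory with progressive summarization, can be sketched as a two-tier store: recent findings are kept verbatim, and when the working set exceeds a token limit, the oldest entries are compressed one level up. The class below is a hypothetical illustration; `count_tokens` is a crude whitespace proxy for a real tokenizer, and `_summarize` stands in for an LLM summarization call.

```python
# Hypothetical sketch of hierarchical memory with progressive summarization:
# raw findings accumulate in a full-detail working tier; once that tier
# exceeds its token limit, the oldest findings are compressed into summaries,
# letting the session keep growing far beyond a single context window.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

class HierarchicalMemory:
    def __init__(self, working_limit: int = 50_000, compress_batch: int = 4):
        self.working: list[str] = []    # recent findings, full detail
        self.summaries: list[str] = []  # compressed older findings
        self.working_limit = working_limit
        self.compress_batch = compress_batch

    def _summarize(self, chunks: list[str]) -> str:
        # A real system would call an LLM; here we keep a few words per chunk.
        return " | ".join(" ".join(c.split()[:3]) for c in chunks)

    def add(self, finding: str) -> None:
        self.working.append(finding)
        # Progressively compress oldest findings until the working tier fits.
        while sum(count_tokens(f) for f in self.working) > self.working_limit:
            batch = self.working[:self.compress_batch]
            self.working = self.working[self.compress_batch:]
            self.summaries.append(self._summarize(batch))

    def context(self) -> str:
        # What gets packed into the model's context: summaries first, then detail.
        return "\n".join(self.summaries + self.working)
```

The design choice here is lossy by intent: older findings trade detail for space, which is acceptable because the synthesis phase can re-fetch any source it needs to verify a claim.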