Design an AI Recommendation Agent
A hellointerview-style system design deep dive into AI recommendation systems like Netflix, Spotify DJ, TikTok For You, and Amazon. Covers requirements, core entities, the multi-stage retrieval pipeline, and three production deep dives: multi-stage retrieval architecture, conversational recommendation layer, and exploration versus exploitation. Each deep dive walks through naive, better, and production-grade approaches with trade-offs.
Quick Reference
- Multi-stage retrieval pipeline: candidate generation (1M to 10K), then ranking (10K to 100), then re-ranking (100 to 20), then diversity filtering
- Two-tower model separates user and item encoding for efficient approximate nearest neighbor retrieval at serving time
- The conversational LLM layer lets users adjust preferences in real time through natural language interaction
- Contextual bandits with Thompson sampling balance exploration and exploitation with principled uncertainty estimation
- Pre-compute item embeddings offline and compute user embeddings in real time for sub-100ms end-to-end retrieval
- Cold-start handling combines popularity fallback, onboarding signals, and progressive personalization as interaction history builds
- Explanation generation increases engagement by 10 to 15 percent: attributing recommendations builds user trust
- Optimize for long-term retention and session diversity, not just click-through rate: filter bubbles reduce lifetime value
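The split between offline item embeddings and a per-request user embedding is what makes the latency budget achievable. A minimal sketch of that serving path, with hypothetical item IDs and a tiny embedding dimension for illustration:

```python
# Sketch of two-tower serving: item embeddings are precomputed offline by the
# item tower, the user embedding is computed at request time by the user tower,
# and retrieval is a dot-product top-k over the item index.
# All names, values, and dimensions here are illustrative.

# Precomputed offline (hypothetical catalog; real systems use 64-256 dims)
ITEM_EMBEDDINGS = {
    "item_a": [0.9, 0.1, 0.0, 0.2],
    "item_b": [0.1, 0.8, 0.3, 0.0],
    "item_c": [0.7, 0.2, 0.1, 0.4],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_embedding, k=2):
    """Score every item against the user embedding and keep the top k.
    Production systems replace this linear scan with an approximate
    nearest neighbor index (e.g. HNSW, ScaNN, or Faiss) so the lookup
    stays sub-10ms over millions of items."""
    scored = sorted(ITEM_EMBEDDINGS.items(),
                    key=lambda kv: dot(user_embedding, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]

user_emb = [0.8, 0.1, 0.0, 0.3]  # user-tower output at request time
print(retrieve_top_k(user_emb))  # → ['item_a', 'item_c']
```

Because the two towers only interact through a dot product, the expensive item side can be indexed ahead of time; only the cheap user side runs on the request path.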
Understanding the Problem
A recommendation system surfaces the most relevant items from a catalog of millions to each individual user. The system must balance what it knows the user likes (exploitation) with helping users discover new interests (exploration), all within a latency budget that makes the experience feel instantaneous.

Products like Netflix (movie and show recommendations), Spotify DJ (music with conversational AI transitions), TikTok For You (video feed personalization), and Amazon (product recommendations) serve billions of recommendations daily, and the quality of these recommendations directly determines user engagement, retention, and revenue.

From a system design perspective, this is a rich problem because it combines large-scale information retrieval (finding relevant items from millions), real-time personalization (incorporating the latest user signals within seconds), conversational AI (letting users express preferences in natural language), and multi-objective optimization (balancing engagement, diversity, freshness, and long-term retention). The trade-offs are sharp: optimizing purely for clicks creates filter bubbles that reduce long-term retention, optimizing purely for exploration reduces short-term engagement, and adding conversational capabilities introduces latency in a path where every millisecond matters.
Netflix uses a multi-stage retrieval pipeline with candidate generation via two-tower models, ranking via a deep neural network, and re-ranking for diversity and row-level optimization across the homepage. Spotify DJ combines traditional collaborative filtering with an LLM that generates natural language transitions between recommendation segments, creating a radio-DJ experience that feels personal. TikTok For You uses an exceptionally aggressive exploration strategy with real-time feedback loops — the system learns preferences from watch time on each video and adapts within a single session. Amazon combines item-to-item collaborative filtering with session-based context to recommend products related to what you are currently browsing. All four systems use multi-stage retrieval, but they differ sharply in how they balance exploration versus exploitation and whether they incorporate conversational interaction.
This is fundamentally about building a multi-stage retrieval system that narrows millions of items to a personalized set of 20 within 100ms, enhanced with a conversational layer that lets users steer recommendations in natural language. The three hardest sub-problems are: (1) building a retrieval pipeline that maintains recall at each narrowing stage without losing good candidates, (2) enabling real-time conversational preference adjustment without disrupting the low-latency serving path, and (3) balancing exploration and exploitation to maximize long-term user retention rather than just short-term clicks.
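The narrowing funnel described above can be sketched end to end. The scoring functions below are stand-ins for the real models (a two-tower retriever for stage one, a deep ranker for stage two), the catalog is scaled down from millions to 100K for illustration, and the genre-cap diversity pass is one simple stand-in for production re-ranking logic:

```python
# Multi-stage retrieval sketch: candidate generation -> ranking -> re-ranking
# with a diversity filter, narrowing a large catalog to a final slate of 20.
import random

random.seed(42)
# Hypothetical catalog: a cheap retrieval score (ANN similarity stand-in),
# a richer ranking score (DNN ranker stand-in), and a genre for diversity.
CATALOG = [{"id": i,
            "retrieval_score": random.random(),
            "ranking_score": random.random(),
            "genre": random.choice(["drama", "comedy", "sci-fi", "doc"])}
           for i in range(100_000)]

def candidate_generation(catalog, k=10_000):
    # Stage 1: cheap score narrows the full catalog to thousands
    return sorted(catalog, key=lambda x: x["retrieval_score"], reverse=True)[:k]

def ranking(candidates, k=100):
    # Stage 2: the expensive model scores only the surviving candidates
    return sorted(candidates, key=lambda x: x["ranking_score"], reverse=True)[:k]

def rerank_with_diversity(ranked, k=20, max_per_genre=8):
    # Stage 3: greedy pass that caps any single genre in the final slate
    slate, per_genre = [], {}
    for item in ranked:
        g = item["genre"]
        if per_genre.get(g, 0) < max_per_genre:
            slate.append(item)
            per_genre[g] = per_genre.get(g, 0) + 1
        if len(slate) == k:
            break
    return slate

slate = rerank_with_diversity(ranking(candidate_generation(CATALOG)))
print(len(slate))  # → 20
```

The key property is that each stage's cost per item rises while its candidate count falls, so the expensive ranker never touches the full catalog; the recall risk the text mentions lives entirely in stage one, which is why candidate generation usually fans out across several retrieval sources.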