Advanced · 11 min
Browser Agents
Building AI agents that navigate and interact with websites: Playwright + LLM for web tasks, page understanding strategies, action spaces, and error recovery patterns.
Quick Reference
- Browser agents combine Playwright (or Puppeteer) for browser control with an LLM for decision-making
- Page understanding methods: screenshots (visual), DOM extraction (structural), accessibility tree (semantic); each has trade-offs
- Define a bounded action space: click, type, scroll, navigate, extract. Fewer actions generally means better accuracy, because the LLM has fewer ways to choose wrongly
- Error recovery: detect stale elements, handle navigation failures, retry with alternative selectors
- Use headless browsers in production; headed mode is only for debugging
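One way to enforce the bounded action space from the list above is to validate every LLM reply before it reaches the browser. The sketch below is illustrative only: the JSON reply format, the field names (`target`, `text`, `direction`, `url`), and the `parse_action` helper are assumptions, not part of Playwright or any LLM API.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The five actions named in the quick reference; anything else is rejected.
ALLOWED_ACTIONS = {"click", "type", "scroll", "navigate", "extract"}

# Hypothetical schema: the fields each action must carry.
REQUIRED_FIELDS = {
    "click": ("target",),
    "type": ("target", "text"),
    "scroll": ("direction",),
    "navigate": ("url",),
    "extract": ("target",),
}


@dataclass(frozen=True)
class Action:
    name: str
    target: Optional[str] = None
    text: Optional[str] = None
    direction: Optional[str] = None
    url: Optional[str] = None


def parse_action(raw: str) -> Action:
    """Parse an LLM reply into an Action; raise ValueError if it is out of bounds."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    name = data.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")

    # Reject actions that lack the fields needed to execute them.
    missing = [f for f in REQUIRED_FIELDS[name] if not data.get(f)]
    if missing:
        raise ValueError(f"{name} is missing fields: {missing}")

    fields = {f: str(data[f]) for f in REQUIRED_FIELDS[name]}
    return Action(name=name, **fields)
```

A rejected reply can be sent back to the LLM along with the error message, which gives it a chance to correct itself without the browser ever executing an unexpected action.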
Architecture: Browser + LLM Loop
A browser agent follows a perception-decision-action loop: observe the current page state, ask the LLM what to do next, execute the action in the browser, observe the result, and repeat. The key architectural decision is how to represent the page to the LLM — this determines cost, speed, and accuracy.
| Page Representation | Pros | Cons | Best For |
|---|---|---|---|
| Screenshot (image) | Sees exactly what user sees, handles dynamic content | Expensive (image tokens), slow, LLM may misread text | Visually complex pages, canvas/SVG content |
| DOM extraction | Cheap (text tokens), fast, precise element references | HTML is verbose, hard to parse complex layouts | Forms, data tables, known page structures |
| Accessibility tree | Semantic, compact, includes roles and labels | Missing visual layout, not all sites have good a11y | Most general-purpose browsing tasks |
| Hybrid (a11y tree + selective screenshots) | Best accuracy, reasonable cost | More complex implementation | Production browser agents |
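The perception-decision-action loop can be sketched independently of the page representation chosen from the table. In this minimal sketch, the `Browser` protocol, the `decide` callable (standing in for the LLM call), the `done` action, and the `FakeBrowser` test double are all hypothetical names; a real agent would presumably wrap a Playwright page behind `observe` and `execute`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class Browser(Protocol):
    def observe(self) -> str: ...               # page state as text for the LLM
    def execute(self, action: dict) -> None: ...  # perform one action


# decide(goal, page_state, history) -> next action; stands in for an LLM call.
Decide = Callable[[str, str, List[dict]], dict]


@dataclass
class RunResult:
    done: bool
    steps: int
    history: List[dict] = field(default_factory=list)


def run_agent(browser: Browser, decide: Decide, goal: str, max_steps: int = 10) -> RunResult:
    """Observe, decide, act, and repeat until the model says done or the budget runs out."""
    history: List[dict] = []
    for step in range(max_steps):
        page_state = browser.observe()               # perception
        action = decide(goal, page_state, history)   # decision
        if action.get("action") == "done":
            return RunResult(True, step, history)
        try:
            browser.execute(action)                  # action
            outcome = "ok"
        except Exception as exc:
            # Record the failure so the next decision can react to it.
            outcome = f"error: {exc}"
        history.append({**action, "outcome": outcome})
    return RunResult(False, max_steps, history)


class FakeBrowser:
    """In-memory stand-in for a real browser, used only to exercise the loop."""

    def __init__(self) -> None:
        self.url = "about:blank"

    def observe(self) -> str:
        return f"url={self.url}"

    def execute(self, action: dict) -> None:
        if action["action"] != "navigate":
            raise RuntimeError(f"unsupported action {action['action']}")
        self.url = action["url"]


def scripted_decide(goal: str, page_state: str, history: List[dict]) -> dict:
    # Stand-in for the LLM: navigate once, then declare the task finished.
    if "example.com" in page_state:
        return {"action": "done"}
    return {"action": "navigate", "url": "https://example.com"}


if __name__ == "__main__":
    print(run_agent(FakeBrowser(), scripted_decide, "open example.com"))
```

The `max_steps` cap is a design choice worth keeping in any variant: without it, a confused model can loop indefinitely and run up cost.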
Start with the accessibility tree
The accessibility tree gives you semantic element names, roles, and states in a compact text format. It's cheaper than screenshots and more meaningful than raw DOM. Add targeted screenshots only when the a11y tree doesn't capture the information you need.
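To illustrate how compact this representation can be, the following sketch flattens an accessibility tree into numbered text lines for the prompt. It assumes the tree has already been obtained as nested dicts with `role`, `name`, and `children` keys; how you get it depends on your tooling, and the `INTERACTIVE_ROLES` set and the line format are hypothetical choices.

```python
from typing import List, Optional

# Roles kept even when unnamed; a hypothetical selection for this sketch.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def flatten_a11y_tree(
    node: dict, depth: int = 0, lines: Optional[List[str]] = None
) -> List[str]:
    """Render a nested accessibility tree as indented, numbered text lines."""
    if lines is None:
        lines = []
    role = node.get("role", "generic")
    name = node.get("name", "")

    # Skip unnamed structural wrappers, but still descend into their children.
    if name or role in INTERACTIVE_ROLES:
        states = [k for k in ("checked", "disabled", "expanded") if node.get(k)]
        suffix = f" ({', '.join(states)})" if states else ""
        indent = "  " * depth
        # The bracketed index gives the LLM a short handle to refer back to.
        lines.append(f'{indent}[{len(lines)}] {role} "{name}"{suffix}')
        depth += 1

    for child in node.get("children", []):
        flatten_a11y_tree(child, depth, lines)
    return lines


if __name__ == "__main__":
    tree = {
        "role": "WebArea",
        "name": "Login",
        "children": [
            {"role": "generic", "children": [
                {"role": "textbox", "name": "Email"},
                {"role": "button", "name": "Sign in", "disabled": True},
            ]},
        ],
    }
    print("\n".join(flatten_a11y_tree(tree)))
```

The numbered handles let the LLM answer with something like "click element 2" instead of inventing a selector, and the agent can map the index back to the underlying node.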