Advanced · 11 min

Browser Agents

Building AI agents that navigate and interact with websites: Playwright + LLM for web tasks, page understanding strategies, action spaces, and error recovery patterns.

Quick Reference

  • Browser agents combine Playwright (or Puppeteer) for browser control with an LLM for decision-making
  • Page understanding methods: screenshots (visual), DOM extraction (structural), accessibility tree (semantic) — each has trade-offs
  • Define a bounded action space: click, type, scroll, navigate, extract. A smaller action set leaves the LLM fewer ways to choose a wrong or malformed action
  • Error recovery: detect stale elements, handle navigation failures, retry with alternative selectors
  • Always use headless browsers in production — headed mode is only for debugging
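One way to enforce a bounded action space is to validate every LLM reply against a closed set before anything touches the browser. The sketch below assumes the model replies with a single JSON object; the `Action` class, the `parse_action` helper, and the required-field names (`target`, `text`, `direction`, `url`) are hypothetical and only illustrate the idea.

```python
import json
from dataclasses import dataclass

# Closed set of action names, taken from the quick-reference list.
ALLOWED_ACTIONS = {"click", "type", "scroll", "navigate", "extract"}

# Hypothetical required fields per action; adapt to your own schema.
REQUIRED_FIELDS = {
    "click": {"target"},
    "type": {"target", "text"},
    "scroll": {"direction"},
    "navigate": {"url"},
    "extract": {"target"},
}


@dataclass(frozen=True)
class Action:
    name: str
    args: dict


def parse_action(raw: str) -> Action:
    """Parse an LLM reply into an Action, rejecting anything outside the action space."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    name = data.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")

    # Everything except the action name is treated as an argument.
    args = {key: value for key, value in data.items() if key != "action"}
    missing = REQUIRED_FIELDS[name] - set(args)
    if missing:
        raise ValueError(f"{name} is missing fields: {sorted(missing)}")
    return Action(name=name, args=args)
```

A rejected reply can be fed back to the model as an error message, which keeps invalid actions from ever reaching the browser.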

Architecture: Browser + LLM Loop

A browser agent follows a perception-decision-action loop: observe the current page state, ask the LLM what to do next, execute the action in the browser, observe the result, and repeat. The key architectural decision is how to represent the page to the LLM — this determines cost, speed, and accuracy.
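The loop described above can be sketched independently of any particular browser or model. In this minimal version, `observe`, `decide`, and `act` are hypothetical callables that you supply (for example, thin wrappers around a Playwright page and an LLM client), and the `"done"` action is an assumed stop signal, not part of any library.

```python
from typing import Callable

Observation = str   # text representation of the page, e.g. an accessibility-tree dump
Decision = dict     # e.g. {"action": "click", "target": "Submit"}


def run_agent(
    task: str,
    observe: Callable[[], Observation],
    decide: Callable[[str, Observation, list], Decision],
    act: Callable[[Decision], None],
    max_steps: int = 20,
) -> list:
    """Run the perception-decision-action loop.

    Stops when the model returns the (assumed) "done" action or when the
    step budget is exhausted. Returns the list of executed decisions.
    """
    history: list = []
    for _ in range(max_steps):
        page_state = observe()                          # perception
        decision = decide(task, page_state, history)    # decision (LLM call)
        if decision.get("action") == "done":            # hypothetical stop signal
            break
        act(decision)                                   # action (browser call)
        history.append(decision)
    return history
```

The `max_steps` budget is a design choice worth keeping: without it, a model that never signals completion would loop (and bill tokens) indefinitely. Passing the history back into `decide` lets the model see what it has already tried.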

| Page Representation | Pros | Cons | Best For |
|---|---|---|---|
| Screenshot (image) | Sees exactly what user sees, handles dynamic content | Expensive (image tokens), slow, LLM may misread text | Visually complex pages, canvas/SVG content |
| DOM extraction | Cheap (text tokens), fast, precise element references | HTML is verbose, hard to parse complex layouts | Forms, data tables, known page structures |
| Accessibility tree | Semantic, compact, includes roles and labels | Missing visual layout, not all sites have good a11y | Most general-purpose browsing tasks |
| Hybrid (a11y tree + selective screenshots) | Best accuracy, reasonable cost | More complex implementation | Production browser agents |

Start with the accessibility tree

The accessibility tree gives you semantic element names, roles, and states in a compact text format. It's cheaper than screenshots and more meaningful than raw DOM. Add targeted screenshots only when the a11y tree doesn't capture the information you need.
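To illustrate what "compact text format" can look like, here is a minimal sketch that flattens a nested accessibility node into indented lines for the LLM prompt. It assumes each node is a dict with `role`, `name`, optional boolean state keys, and a `children` list; that shape is an assumption for this example, and `render_a11y_tree` is a hypothetical helper, not a Playwright function.

```python
# State keys to surface in the output; an assumed subset, extend as needed.
STATE_KEYS = ("checked", "disabled", "expanded", "focused")


def render_a11y_tree(node: dict, depth: int = 0) -> list[str]:
    """Flatten a nested accessibility node into indented 'role "name" [states]' lines."""
    role = node.get("role", "unknown")
    name = node.get("name", "")
    states = [key for key in STATE_KEYS if node.get(key)]

    line = "  " * depth + role
    if name:
        line += f' "{name}"'
    if states:
        line += " [" + ", ".join(states) + "]"

    lines = [line]
    # Recurse into children, indenting one level per depth.
    for child in node.get("children", []):
        lines.extend(render_a11y_tree(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = {
        "role": "WebArea",
        "name": "Login",
        "children": [
            {"role": "textbox", "name": "Email", "focused": True},
            {"role": "button", "name": "Sign in", "disabled": True},
        ],
    }
    print("\n".join(render_a11y_tree(tree)))
```

Each element costs one short line of text rather than a block of HTML, which is where the savings over raw DOM come from. How you obtain the tree depends on your browser library and its version, so check the current documentation for the snapshot call it provides.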