Metadata Filtering & Pre-Retrieval
Use metadata to narrow the search scope before vector similarity is computed. This section covers attaching metadata during indexing, pre-filtering, self-querying retrievers, and combining filters with semantic search.
Quick Reference
- Attach metadata (source, date, author, category, version) to every chunk during indexing
- Pre-filtering narrows the search space before vector similarity is computed, so fewer vectors are compared and off-scope chunks cannot appear in the results
- Self-querying retriever: an LLM automatically extracts filters from natural language questions
- Metadata filtering is essential for multi-tenant RAG: filter by tenant_id before search
- Store metadata as structured fields (not in the text) for efficient filtering
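As a rough illustration of the pre-filtering and tenant_id points above, here is a minimal sketch using a toy in-memory list instead of a real vector store. The function names (`cosine`, `filtered_search`), the record layout, and the sample data are all assumptions made for this example; production vector databases expose their own filter syntax.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_vec, filters, k=3):
    """Hypothetical helper: exact-match metadata filter, then similarity ranking."""
    # Pre-filter: keep only chunks whose metadata matches every filter.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Rank only the surviving candidates by similarity to the query.
    candidates.sort(key=lambda c: cosine(c["vector"], query_vec), reverse=True)
    return candidates[:k]


# Illustrative data: two tenants with near-identical content.
chunks = [
    {"text": "Acme refund policy", "vector": [1.0, 0.0], "metadata": {"tenant_id": "acme"}},
    {"text": "Globex refund policy", "vector": [1.0, 0.0], "metadata": {"tenant_id": "globex"}},
    {"text": "Acme shipping times", "vector": [0.0, 1.0], "metadata": {"tenant_id": "acme"}},
]

results = filtered_search(chunks, [0.9, 0.1], {"tenant_id": "acme"}, k=2)
```

Because the tenant filter runs before ranking, the Globex chunk is never a candidate, even though its vector is identical to the top Acme result.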
Attaching Metadata During Indexing
Every chunk in your vector store should carry rich metadata beyond just the text content. Metadata enables filtering at query time (find only recent documents), source attribution in answers (this came from the 2024 employee handbook, page 12), and debugging (which document was this chunk extracted from?). The quality of your metadata directly determines the quality of your filtered retrieval.
Every chunk should have at minimum: source (filename/URL), page/section number, indexed_at timestamp, and content_hash. Add domain-specific fields: department, product, version, language, access_level. Design metadata as if you'll need to filter on any combination of these fields.