Metadata Filtering & Pre-Retrieval

Metadata filtering is the difference between searching your entire corpus and searching the relevant 2% of it. This article covers when to add it, how to design metadata schemas that survive production, and the failure modes that will burn you silently.

Quick Reference

  • Pre-filtering narrows the candidate set before ANN search, so you get K results whenever at least K chunks match the filter; post-filtering can return fewer than K, or none
  • Metadata cardinality matters: a field with 10M unique values can't be indexed efficiently in most vector DBs
  • Self-querying retriever adds one LLM call per query (~$0.001) — skip it for programmatic interfaces with known filters
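  • Standalone BM25 libraries often lack native pre-filtering (search engines such as Elasticsearch do support filtered BM25 queries); where pre-filtering is unavailable, apply the metadata filter as a post-filter on the BM25 leg of hybrid search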
  • Multi-tenant isolation requires mandatory tenant_id filtering — not optional, not a performance hint
  • Metadata filtering is a relevance and performance tool, not a security boundary — sensitive data needs separate collections
  • Schema drift is the silent killer: if 20% of your corpus lacks a field, a match filter on that field skips those chunks without raising an error, so you get fewer results than expected
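
The first and last bullets can be seen in a toy example. This is a minimal sketch, not any vector database's API: it uses brute-force cosine similarity in place of ANN, and the chunk layout (`id`, `vec`, `meta`) and function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches(meta, flt):
    # A chunk that lacks the field never matches. No error is raised,
    # which is how schema drift silently shrinks result sets.
    return all(meta.get(key) == value for key, value in flt.items())

def pre_filter_search(chunks, query, flt, k):
    """Filter first, then rank only the matching chunks."""
    candidates = [c for c in chunks if matches(c["meta"], flt)]
    ranked = sorted(candidates, key=lambda c: cosine(c["vec"], query), reverse=True)
    return ranked[:k]

def post_filter_search(chunks, query, flt, k):
    """Rank everything, take the top k, then drop non-matching chunks."""
    ranked = sorted(chunks, key=lambda c: cosine(c["vec"], query), reverse=True)
    return [c for c in ranked[:k] if matches(c["meta"], flt)]

# Hypothetical corpus: chunk "e" is missing tenant_id entirely.
chunks = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"tenant_id": "acme"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"tenant_id": "acme"}},
    {"id": "c", "vec": [0.8, 0.2], "meta": {"tenant_id": "globex"}},
    {"id": "d", "vec": [0.0, 1.0], "meta": {"tenant_id": "globex"}},
    {"id": "e", "vec": [0.1, 0.9], "meta": {}},
]
query = [1.0, 0.0]
flt = {"tenant_id": "globex"}

pre = pre_filter_search(chunks, query, flt, k=2)
post = post_filter_search(chunks, query, flt, k=2)
print([c["id"] for c in pre])   # ['c', 'd']
print([c["id"] for c in post])  # []
```

Here the two nearest neighbours both belong to another tenant, so post-filtering returns nothing, while pre-filtering still returns two results. Chunk "e" would be invisible to any `tenant_id` filter regardless of how relevant it is.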

When Metadata Filtering Matters (and When It Doesn't)

Metadata filtering solves a real problem, but it's not free. Adding filters means designing and maintaining a metadata schema, indexing those fields in your vector database (which consumes memory), and writing filter construction logic. For small or homogeneous corpora, this cost exceeds the benefit. Before adding metadata filtering, check your use case against this decision table.

| Scenario | Use Filtering? | Reason |
| --- | --- | --- |
| Corpus < 1,000 chunks | No | ANN search over the full set is fast enough; filters add complexity without measurable precision gain |
| All documents share the same metadata (same source, date, category) | No | Filters on identical values narrow nothing; they're a no-op |
| Users ask vague, open-ended questions ('tell me about X') | No for self-querying, yes for hard-coded filters | LLM can't extract meaningful filters from structureless intent |
| Metadata schema drifts across ingestion batches | Fix schema first | Stale or missing fields silently return fewer results than expected |
| Multi-tenant with a shared vector index | Mandatory | tenant_id filter is the only isolation mechanism in a shared index |
| Time-sensitive corpus (news, docs, changelogs) | Yes | Recency filter prevents stale answers from dominating retrieval |
| Role-based access control requirement | Yes (not sole boundary) | Filters prevent unauthorized documents from surfacing in results |
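
For the "fix schema first" row, a coverage audit is one way to find drifted fields before building filters on them. The sketch below is an assumption-laden illustration: the chunk layout (a `meta` dict per chunk), the function names, and the 0.99 coverage threshold are all hypothetical choices, not values from any particular vector database.

```python
from collections import Counter

def field_coverage(chunks, fields):
    """Fraction of chunks that carry a non-null value for each field."""
    total = len(chunks)
    present = Counter()
    for chunk in chunks:
        meta = chunk.get("meta", {})
        for field in fields:
            if meta.get(field) is not None:
                present[field] += 1
    return {f: (present[f] / total if total else 0.0) for f in fields}

def fields_needing_backfill(coverage, min_coverage=0.99):
    """Fields whose coverage is below the (assumed) acceptable threshold."""
    return sorted(f for f, frac in coverage.items() if frac < min_coverage)

# Hypothetical corpus: published_at is missing or null on two of five chunks.
corpus = [
    {"meta": {"tenant_id": "acme", "published_at": "2024-01-05"}},
    {"meta": {"tenant_id": "acme"}},
    {"meta": {"tenant_id": "globex", "published_at": "2024-03-10"}},
    {"meta": {"tenant_id": "globex", "published_at": None}},
    {"meta": {"tenant_id": "globex", "published_at": "2024-02-01"}},
]

coverage = field_coverage(corpus, ["tenant_id", "published_at"])
print(coverage)
print(fields_needing_backfill(coverage))  # ['published_at']
```

Running a check like this per ingestion batch, rather than once over the whole corpus, makes it easier to see which batch introduced the drift.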

Quick test before you build

Run 20 representative queries on your corpus without filters. For each, check whether the top-5 results come from the subset the query should have targeted. If they do for at least 18 of the 20 queries, your corpus is homogeneous enough that filtering adds complexity without benefit. If they don't for 8 or more, you have a signal that filters will meaningfully improve precision.
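
The tally can be scripted along these lines. This is a rough sketch under stated assumptions: `search` stands for whatever unfiltered retrieval function you already have, each result is assumed to expose a `subset` label, a query counts as a hit only when all of its top-5 results are in the expected subset, and the "inconclusive" label for the range between the two thresholds is an addition made here for completeness.

```python
def subset_hit(results, expected_subset, top_n=5):
    """True when every one of the top-n results belongs to the expected subset."""
    top = results[:top_n]
    return bool(top) and all(r["subset"] == expected_subset for r in top)

def filtering_verdict(hits, total=20, homogeneous_at=18, misses_for_filtering=8):
    """Apply the 18/20 and 8+ thresholds to a hit count."""
    misses = total - hits
    if hits >= homogeneous_at:
        return "skip filtering"
    if misses >= misses_for_filtering:
        return "add filtering"
    return "inconclusive"  # assumed label for the range in between

def run_quick_test(labelled_queries, search):
    """labelled_queries: (query, expected_subset) pairs; search: unfiltered retrieval."""
    hits = sum(subset_hit(search(q), expected) for q, expected in labelled_queries)
    return hits, filtering_verdict(hits, total=len(labelled_queries))

# Stand-in for a real retriever: the first 11 queries return clean results,
# the remaining 9 leak two chunks from an unrelated subset.
def fake_search(query):
    index = int(query[1:])
    if index < 11:
        return [{"subset": "billing"}] * 5
    return [{"subset": "billing"}] * 3 + [{"subset": "legal"}] * 2

queries = [(f"q{i}", "billing") for i in range(20)]
hits, verdict = run_quick_test(queries, fake_search)
print(hits, verdict)  # 11 add filtering
```

Labelling the expected subset for each query by hand is the slow part; the script only keeps the counting honest.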