Metadata Filtering & Pre-Retrieval
Metadata filtering is the difference between searching your entire corpus and searching the relevant 2% of it. This article covers when to add it, how to design metadata schemas that survive production, and the failure modes that will burn you silently.
Quick Reference
- Pre-filtering narrows the candidate set before ANN search, so it returns K results whenever at least K documents match the filter; post-filtering can return fewer
- Metadata cardinality matters: a field with 10M unique values can't be indexed efficiently in most vector DBs
- A self-querying retriever adds one LLM call per query (~$0.001); skip it for programmatic interfaces with known filters
- Many BM25 implementations don't support native pre-filtering; where yours doesn't, apply the metadata filter as a post-filter on the BM25 leg of hybrid search
- Multi-tenant isolation requires mandatory tenant_id filtering: not optional, not a performance hint
- Metadata filtering is a relevance and performance tool, not a security boundary; sensitive data needs separate collections
- Schema drift is the silent killer: if 20% of your corpus lacks a field, filters on that field silently exclude those documents and return fewer results
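To make the first point concrete, here is a minimal sketch of pre-filtering versus post-filtering. It uses an exhaustive cosine search as a stand-in for ANN, and the toy corpus, the `year` field, and all function names are illustrative assumptions rather than any vector database's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, docs, k):
    """Exhaustive nearest-neighbour search (a stand-in for a real ANN index)."""
    return sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]

def pre_filter_search(query, docs, k, predicate):
    """Filter on metadata first, then rank only the surviving candidates."""
    candidates = [d for d in docs if predicate(d["meta"])]
    return top_k(query, candidates, k)

def post_filter_search(query, docs, k, predicate):
    """Rank the whole corpus first, then drop results that fail the filter."""
    return [d for d in top_k(query, docs, k) if predicate(d["meta"])]

# Hypothetical toy corpus: the closest vectors are mostly from 2021.
docs = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"year": 2021}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"year": 2021}},
    {"id": "c", "vec": [0.8, 0.2], "meta": {"year": 2024}},
    {"id": "d", "vec": [0.5, 0.5], "meta": {"year": 2024}},
    {"id": "e", "vec": [0.1, 0.9], "meta": {"year": 2024}},
    {"id": "f", "vec": [0.7, 0.3], "meta": {"year": 2021}},
]
query = [1.0, 0.0]

def is_recent(meta):
    return meta.get("year", 0) >= 2024

pre = pre_filter_search(query, docs, 3, is_recent)
post = post_filter_search(query, docs, 3, is_recent)

print([d["id"] for d in pre])   # ['c', 'd', 'e']
print([d["id"] for d in post])  # ['c']
```

With K = 3, the pre-filtered search fills all three slots from the 2024 subset, while the post-filtered search keeps only the one 2024 chunk that happened to land in the unfiltered top 3.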
When Metadata Filtering Matters (and When It Doesn't)
Metadata filtering solves a real problem, but it's not free. Adding filters means designing and maintaining a metadata schema, indexing those fields in your vector database (which consumes memory), and writing filter construction logic. For small or homogeneous corpora, this cost exceeds the benefit. Before adding metadata filtering, check your use case against this decision table.
| Scenario | Use Filtering? | Reason |
|---|---|---|
| Corpus < 1,000 chunks | No | ANN search over the full set is fast enough; filters add complexity without measurable precision gain |
| All documents share the same metadata (same source, date, category) | No | Filters on identical values narrow nothing; they're a no-op |
| Users ask vague, open-ended questions ('tell me about X') | No for self-querying, yes for hard-coded filters | An LLM can't extract meaningful filters from a query with no structured intent |
| Metadata schema drifts across ingestion batches | Fix schema first | Filters on stale or missing fields silently return fewer results than expected |
| Multi-tenant with a shared vector index | Mandatory | tenant_id filter is the only isolation mechanism in a shared index |
| Time-sensitive corpus (news, docs, changelogs) | Yes | Recency filter prevents stale answers from dominating retrieval |
| Role-based access control requirement | Yes (not sole boundary) | Filters prevent unauthorized documents from surfacing in results |
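The "fix schema first" row can be checked before any filter ships. The sketch below audits how many chunks carry each required field. It assumes chunks are dicts with a `metadata` key, and the field names `tenant_id` and `published_at` are hypothetical examples, not a prescribed schema.

```python
from collections import Counter

def field_coverage(chunks, required_fields):
    """Return, per field, the fraction of chunks with a non-empty value."""
    total = len(chunks)
    present = Counter()
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        for field in required_fields:
            # Treat missing keys, None, and empty strings as "absent".
            if meta.get(field) not in (None, ""):
                present[field] += 1
    return {f: (present[f] / total if total else 0.0) for f in required_fields}

def drifting_fields(coverage, min_coverage=1.0):
    """List fields whose coverage falls below the required minimum."""
    return sorted(f for f, c in coverage.items() if c < min_coverage)

# Hypothetical sample: two of five chunks lack a usable published_at value.
chunks = [
    {"metadata": {"tenant_id": "t1", "published_at": "2024-01-05"}},
    {"metadata": {"tenant_id": "t1", "published_at": "2024-02-11"}},
    {"metadata": {"tenant_id": "t2", "published_at": ""}},
    {"metadata": {"tenant_id": "t2"}},
    {"metadata": {"tenant_id": "t1", "published_at": "2023-12-01"}},
]

coverage = field_coverage(chunks, ["tenant_id", "published_at"])
print(coverage)                   # {'tenant_id': 1.0, 'published_at': 0.6}
print(drifting_fields(coverage))  # ['published_at']
```

Any field that appears in the drift list is one where a filter would silently hide part of the corpus, so it is worth backfilling or defaulting before filtering on it.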
Run 20 representative queries on your corpus without filters. For each, check whether the top-5 results come from the right subset. If they do for at least 18 of the 20 queries, your corpus is homogeneous enough that filtering adds complexity without benefit. If they don't for 8 or more, that is a signal that filters will meaningfully improve precision.
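This probe is easy to script. The sketch below assumes you have already run the queries and labeled each retrieved chunk with the subset it came from. It treats a query as passing only when all of its top-5 results come from the expected subset, which is one reading of the check. The "inconclusive" verdict for the range between the two thresholds is an added assumption, as are the function name and the sample data.

```python
def probe_filter_need(results_by_query, expected_subset, top_n=5,
                      pass_threshold=18, fail_threshold=8):
    """Score an unfiltered retrieval probe.

    results_by_query: {query: [subset label of each ranked result]}
    expected_subset:  {query: subset label the results should come from}
    The default thresholds assume a probe of 20 queries.
    """
    passes = 0
    for query, labels in results_by_query.items():
        top = labels[:top_n]
        # A query passes only if every top-N result is from the right subset.
        if top and all(label == expected_subset[query] for label in top):
            passes += 1
    failures = len(results_by_query) - passes

    if passes >= pass_threshold:
        verdict = "skip filtering"
    elif failures >= fail_threshold:
        verdict = "add filtering"
    else:
        verdict = "inconclusive"  # assumption: the in-between range needs a closer look
    return passes, failures, verdict

# Hypothetical probe: 11 clean queries, 9 polluted by other subsets.
results = {f"q{i}": ["billing"] * 5 for i in range(11)}
results.update({
    f"q{i}": ["billing", "legal", "billing", "hr", "billing"]
    for i in range(11, 20)
})
expected = {q: "billing" for q in results}

passes, failures, verdict = probe_filter_need(results, expected)
print(passes, failures, verdict)  # 11 9 add filtering
```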