Advanced · 14 min

Metadata Filtering & Pre-Retrieval

Metadata filtering is the difference between searching your entire corpus and searching the relevant 2% of it. This article covers when to add it, how to design metadata schemas that survive production, and the failure modes that will burn you silently.

Quick Reference

→Pre-filter narrows the candidate set before ANN search, so it returns K results whenever at least K documents match the filter; post-filtering can return fewer
→Metadata cardinality matters: a field with 10M unique values can't be indexed efficiently in most vector DBs
→Self-querying retriever adds one LLM call per query (~$0.001) — skip it for programmatic interfaces with known filters
→BM25 doesn't support native pre-filtering; apply the metadata filter as a post-filter on the BM25 leg of hybrid search
→Multi-tenant isolation requires mandatory tenant_id filtering — not optional, not a performance hint
→Metadata filtering is a relevance and performance tool, not a security boundary — sensitive data needs separate collections
→Schema drift is the silent killer: if 20% of your corpus lacks a field, filters on that field silently return fewer results
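
The schema-drift point in the last bullet can be checked with a small coverage audit before any filter goes live. The sketch below is illustrative only: the chunk layout (a dict with a `metadata` key) and the field names are assumptions, not something a particular vector database prescribes.

```python
from collections import Counter

def field_coverage(chunks, fields):
    """Return, per field, the fraction of chunks carrying a non-empty value."""
    total = len(chunks)
    present = Counter()
    for chunk in chunks:
        metadata = chunk.get("metadata", {})
        for field in fields:
            # Treat a missing key, None, and "" as "field absent".
            if metadata.get(field) not in (None, ""):
                present[field] += 1
    return {f: (present[f] / total if total else 0.0) for f in fields}

# Hypothetical corpus: one chunk in five lacks `published_at`.
chunks = [
    {"id": 1, "metadata": {"tenant_id": "a", "published_at": "2024-01-01"}},
    {"id": 2, "metadata": {"tenant_id": "a", "published_at": "2024-02-01"}},
    {"id": 3, "metadata": {"tenant_id": "b", "published_at": "2024-03-01"}},
    {"id": 4, "metadata": {"tenant_id": "b", "published_at": "2024-04-01"}},
    {"id": 5, "metadata": {"tenant_id": "b"}},
]

coverage = field_coverage(chunks, ["tenant_id", "published_at"])
print(coverage)  # {'tenant_id': 1.0, 'published_at': 0.8}
```

Any field whose coverage is well below 1.0 is a candidate for backfilling before it is exposed as a filter, since chunks without the field simply never match.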

When Metadata Filtering Matters (and When It Doesn't)

Metadata filtering solves a real problem, but it's not free. Adding filters means designing and maintaining a metadata schema, indexing those fields in your vector database (which consumes memory), and writing filter construction logic. For small or homogeneous corpora, this cost exceeds the benefit. Before adding metadata filtering, run this decision table against your use case.

Scenario	Use Filtering?	Reason
Corpus < 1,000 chunks	No	ANN search over the full set is fast enough; filters add complexity without measurable precision gain
All documents share the same metadata (same source, date, category)	No	Filters on identical values narrow nothing; they're a no-op
Users ask vague, open-ended questions ('tell me about X')	No for self-querying, yes for hard-coded filters	LLM can't extract meaningful filters from structureless intent
Metadata schema drifts across ingestion batches	Fix schema first	Stale or missing fields silently return fewer results than expected
Multi-tenant with a shared vector index	Mandatory	tenant_id filter is the only isolation mechanism in a shared index
Time-sensitive corpus (news, docs, changelogs)	Yes	Recency filter prevents stale answers from dominating retrieval
Role-based access control requirement	Yes (not sole boundary)	Filters prevent unauthorized documents from surfacing in results
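
For the multi-tenant row, one way to make the tenant filter mandatory rather than optional is to route every query through a single filter builder that refuses to run without a tenant. This is a minimal sketch: `build_filter` and the plain-dict filter format are hypothetical and would need translating into whatever filter syntax your vector database uses.

```python
def build_filter(tenant_id, extra=None):
    """Build a filter dict that always carries tenant_id (hypothetical format)."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query on a shared index")
    conditions = dict(extra or {})
    # Refuse caller-supplied conditions that try to target another tenant.
    if conditions.get("tenant_id", tenant_id) != tenant_id:
        raise ValueError("extra conditions must not override tenant_id")
    conditions["tenant_id"] = tenant_id
    return conditions

print(build_filter("acme", {"category": "billing"}))
# {'category': 'billing', 'tenant_id': 'acme'}
```

Centralizing the check means a forgotten tenant argument fails loudly at query-construction time instead of silently searching every tenant's data.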

Quick test before you build

Run 20 representative queries on your corpus without filters. For each, check: are the top-5 results from the right subset? If the answer is yes for at least 18 of the 20, your corpus is homogeneous enough that filtering adds complexity without benefit. If the answer is no for 8 or more, you have a signal that filters will meaningfully improve precision.
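
The quick test can be scored mechanically once each query is labeled with the subset it should hit. The sketch below assumes each result carries a `source` field naming its subset; that field name, the helper names, and the "inconclusive" verdict for the range between the two thresholds are assumptions added for illustration.

```python
def top_k_in_subset(results, expected_source, k=5):
    """True if every one of the top-k results comes from the expected subset."""
    return all(r["source"] == expected_source for r in results[:k])

def verdict(passes):
    """Apply the 18-of-20 / 8-or-more rule to a list of per-query booleans."""
    yes = sum(passes)
    no = len(passes) - yes
    if yes >= 18:
        return "skip_filtering"
    if no >= 8:
        return "add_filtering"
    return "inconclusive"  # assumption: the rule does not cover this range

# Hypothetical ranked results for one query, expected to come from "billing".
results = [{"source": "billing"}] * 4 + [{"source": "legal"}]
print(top_k_in_subset(results, "billing"))   # False
print(verdict([True] * 12 + [False] * 8))    # add_filtering
```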

Designing Your Metadata Schema

The fields you attach at indexing time determine what you can filter on at query time. You can't add a filter on a field that doesn't exist, and you can't change what's indexed without re-indexing everything. Design the schema before you ingest a single document.
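
One way to settle the schema before ingestion is to define it in code and validate every chunk against it. The following is a sketch under assumptions: the four fields are examples chosen to match the scenarios discussed here, not a recommended schema, and `to_payload` is a hypothetical helper.

```python
from dataclasses import dataclass, asdict
from datetime import date

@dataclass(frozen=True)
class ChunkMetadata:
    tenant_id: str
    source: str
    category: str
    published_at: date

    def __post_init__(self):
        # Reject empty or non-string values so no chunk is indexed with gaps.
        for name in ("tenant_id", "source", "category"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if not isinstance(self.published_at, date):
            raise ValueError("published_at must be a datetime.date")

    def to_payload(self):
        """Serialize to a plain dict suitable for storing alongside a vector."""
        payload = asdict(self)
        payload["published_at"] = self.published_at.isoformat()
        return payload

meta = ChunkMetadata("acme", "changelog", "release-notes", date(2024, 5, 1))
print(meta.to_payload()["published_at"])  # 2024-05-01
```

Validating at construction time pushes schema errors to ingestion, where they are cheap to fix, rather than to query time, where they show up as missing results.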

Pre-Filtering vs Post-Filtering

Pre-filtering applies metadata conditions before vector similarity search. The filter narrows the candidate set, and then ANN search runs only over the matching documents. Post-filtering runs ANN search over the full index, retrieves the top K candidates, and then discards those that don't match the filter. The difference in guarantees is significant.
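
The difference can be seen in a toy example. The sketch below uses exact brute-force cosine ranking as a stand-in for ANN search, and the documents, vectors, and `year` field are made up for illustration. With K = 3 and a filter on `year == 2024`, pre-filtering returns three matches while post-filtering returns only one, because two of the three nearest neighbors overall fail the filter.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank(query, docs):
    """Exact nearest-neighbor ranking (stand-in for an ANN index)."""
    return sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)

def pre_filter_search(query, docs, predicate, k):
    # Filter first, then rank only the matching documents.
    return rank(query, [d for d in docs if predicate(d)])[:k]

def post_filter_search(query, docs, predicate, k):
    # Rank everything, keep the top k, then drop non-matching documents.
    return [d for d in rank(query, docs)[:k] if predicate(d)]

docs = [
    {"id": 1, "vector": [1.0, 0.0], "year": 2023},
    {"id": 2, "vector": [0.9, 0.1], "year": 2024},
    {"id": 3, "vector": [0.8, 0.2], "year": 2024},
    {"id": 4, "vector": [0.5, 0.5], "year": 2024},
    {"id": 5, "vector": [0.2, 0.8], "year": 2024},
    {"id": 6, "vector": [0.95, 0.05], "year": 2023},
]
query = [1.0, 0.0]
is_2024 = lambda d: d["year"] == 2024

print([d["id"] for d in pre_filter_search(query, docs, is_2024, 3)])   # [2, 3, 4]
print([d["id"] for d in post_filter_search(query, docs, is_2024, 3)])  # [2]
```

Note that pre-filtering can still return fewer than K results when fewer than K documents match the filter at all.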
