Knowledge Base Lifecycle
The engineering decision guide for production knowledge bases: when to build vs buy, the six lifecycle stages, 2026 chunking approaches (contextual retrieval, late chunking), hybrid retrieval as the serving default, evaluation metrics, monitoring for drift, and cost math at scale.
Quick Reference
- Start with a managed KB pipeline (LlamaCloud, LangChain indexing API); build custom only when you outgrow it
- Knowledge base lifecycle: ingest → transform → index → serve → refresh → deprecate; each stage needs its own monitoring
- 2026 chunking default: recursive splitting as the baseline; the leading alternatives are contextual retrieval (prepend LLM-generated context to each chunk before embedding) and late chunking (embed the full document at the token level, then pool the token embeddings per chunk)
- Hybrid retrieval (BM25 + vector search, merged with reciprocal rank fusion, RRF) is the production default; vector-only retrieval often misses exact matches on error codes, IDs, and proper nouns
- Refresh math: at 50 docs/sec, a full rebuild of 100K docs takes ≈ 33 minutes (2,000 seconds) and an incremental pass over 200 changed docs takes ≈ 4 seconds, but incremental refresh requires lineage tracking to handle deletions
- Evaluate retrieval continuously: precision@5 > 0.7, faithfulness > 0.8, answer relevancy > 0.75 as starting thresholds
- Embedding cost with text-embedding-3-small at $0.02 per 1M tokens: 1M docs × 5 chunks × 500 tokens = 2.5B tokens ≈ $50 standard, $25 via the Batch API
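To make the hybrid retrieval bullet concrete, here is a minimal sketch of reciprocal rank fusion in Python. It assumes each retriever has already returned a ranked list of document IDs; the function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a commonly used default for RRF, not something this guide specifies) are illustrative assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a conventional default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical results for a query containing an exact error code.
# BM25 finds the exact-match doc; the vector retriever misses it entirely.
bm25_hits = ["doc_err_4012", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_d"]

fused = rrf_fuse([bm25_hits, vector_hits])
```

In this toy case the exact-match document survives into the top two fused results even though the vector retriever never returned it, which is the failure mode hybrid retrieval is meant to cover. RRF uses only ranks, so the BM25 and vector scores do not need to be normalized against each other.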
Should You Build a Custom KB Pipeline?
Most teams build custom ingestion and chunking pipelines when they don't need to. The question isn't whether you can build a custom pipeline — it's whether the engineering cost is justified given what managed services now offer.
Build vs buy · most teams below 50K docs should start with managed services
| Signal | Use managed (LlamaCloud, Unstructured, LangChain indexing API) | Build custom |
|---|---|---|
| Document count | < 50K documents | > 50K or growing rapidly |
| Source types | 1–3 standard sources (PDFs, Confluence, Notion) | > 3 sources or proprietary systems |
| Freshness requirement | Hourly or slower is fine | Sub-minute (real-time) required |
| Chunking needs | Standard fixed-size or semantic | Domain-specific (code, legal, medical) |
| Multimodal content | Not needed | Tables, diagrams, PDFs with complex layouts |
| Team capacity | No dedicated data infra engineer | Dedicated data infra team |
Teams that start with a managed pipeline and migrate to custom only when they hit real limits save weeks of engineering time. The signals to migrate are specific: you need change-data-capture (CDC) level freshness, your sources are too proprietary for off-the-shelf connectors, or your chunking requirements are domain-specific. 'We might need this later' is not a signal.
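One way to keep this decision honest is to encode the table's signals as explicit checks and record which ones actually fire today. The sketch below is illustrative only: `KBProfile`, its field names, and `custom_signals` are hypothetical, and the thresholds are copied from the table above. The team capacity row is left out because it describes what you can afford rather than what you need.

```python
from dataclasses import dataclass


@dataclass
class KBProfile:
    """Current, measured requirements (hypothetical structure)."""
    doc_count: int
    source_count: int
    proprietary_sources: bool = False
    sub_minute_freshness: bool = False
    domain_specific_chunking: bool = False
    complex_multimodal: bool = False


def custom_signals(p: KBProfile) -> list[str]:
    """Return the build-custom signals from the table that apply right now."""
    checks = [
        ("more than 50K documents", p.doc_count > 50_000),
        ("more than 3 sources or proprietary systems",
         p.source_count > 3 or p.proprietary_sources),
        ("sub-minute freshness", p.sub_minute_freshness),
        ("domain-specific chunking", p.domain_specific_chunking),
        ("complex multimodal content", p.complex_multimodal),
    ]
    return [label for label, hit in checks if hit]


# No signals fire: start with a managed pipeline.
small_team = KBProfile(doc_count=20_000, source_count=2)

# Two signals fire: a custom pipeline may be worth evaluating.
growing_team = KBProfile(doc_count=80_000, source_count=2,
                         sub_minute_freshness=True)
```

An empty list means there is no present-tense reason to build custom. How many signals should trigger a migration is a judgment call this sketch deliberately leaves open.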