Knowledge Base Lifecycle

The engineering decision guide for production knowledge bases: when to build vs buy, the six lifecycle stages, 2026 chunking approaches (contextual retrieval, late chunking), hybrid retrieval as the serving default, evaluation metrics, monitoring for drift, and cost math at scale.

Quick Reference

  • Start with a managed KB pipeline (LlamaCloud, LangChain indexing API) — build custom only when you outgrow it
  • Knowledge base lifecycle: ingest → transform → index → serve → refresh → deprecate — each stage needs its own monitoring
  • 2026 chunking default: recursive splitting as the baseline; contextual retrieval (prepend LLM-generated context to each chunk before embedding) and late chunking (run the full document through the embedding model, then pool token embeddings per chunk) are the leading alternatives
  • Hybrid retrieval (BM25 + vector with RRF fusion) is the production default — vector-only misses exact matches on error codes, IDs, and proper nouns
  • Refresh math: at 50 docs/sec, a full rebuild of 100K docs takes ≈ 33 minutes (100,000 ÷ 50 = 2,000 s), while an incremental refresh of 200 changed docs takes ≈ 4 seconds. Incremental refresh requires lineage tracking to handle deletions
  • Evaluate retrieval continuously: precision@5 > 0.7, faithfulness > 0.8, answer relevancy > 0.75 as starting thresholds
  • Embedding cost at text-embedding-3-small: 1M docs × 5 chunks × 500 tokens ≈ 2.5B tokens ≈ $50 standard, $25 Batch API
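Two of the items above can be sketched in a few lines of Python: RRF fusion of a BM25 ranking with a vector ranking, and the embedding cost arithmetic. This is a minimal illustration, not a production implementation. The function names, the sample document IDs, and the constant k = 60 (a commonly used RRF default) are assumptions. The $0.02 per 1M tokens rate and the 50% batch discount are the values implied by the $50 and $25 figures above.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


def embedding_cost_usd(num_docs, chunks_per_doc, tokens_per_chunk,
                       usd_per_million_tokens=0.02, batch_discount=0.0):
    """Estimate embedding cost; rate and discount are assumptions (see lead-in)."""
    total_tokens = num_docs * chunks_per_doc * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens * (1 - batch_discount)


if __name__ == "__main__":
    # Hypothetical results: BM25 ranks the exact error-code match first,
    # the vector search ranks a semantically similar doc first.
    bm25_hits = ["err-4012", "doc-a", "doc-b"]
    vector_hits = ["doc-a", "doc-c", "err-4012"]
    print(rrf_fuse([bm25_hits, vector_hits]))
    # ['doc-a', 'err-4012', 'doc-c', 'doc-b']

    print(round(embedding_cost_usd(1_000_000, 5, 500), 2))                      # 50.0
    print(round(embedding_cost_usd(1_000_000, 5, 500, batch_discount=0.5), 2))  # 25.0
```

Note that the exact-match document stays near the top after fusion even though the vector search ranked it last, which is the behavior the hybrid retrieval bullet relies on.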

Should You Build a Custom KB Pipeline?

Most teams build custom ingestion and chunking pipelines when they don't need to. The question isn't whether you can build a custom pipeline — it's whether the engineering cost is justified given what managed services now offer.

Decision flow: building a KB pipeline?

  • Fewer than 50K documents (or 3 or fewer source types)?
      • Yes: use a managed service (LlamaCloud, Unstructured, LangChain indexing API)
      • No: real-time freshness needed?
          • Yes: build a CDC pipeline (Debezium 2.5+, Kafka); complex, high infra cost
          • No: custom chunking or multi-modal?
              • Yes: build a batch pipeline (custom + managed hybrid)
              • No: use a managed service (saves weeks of eng time)

Build vs buy · most teams below 50K docs should start with managed services
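The decision flow above can also be written as a small function, which makes the order of the checks explicit. This is a sketch under assumptions: the class name, field names, and return labels are hypothetical, and the branch order is one reading of the flowchart (size and source count first, then freshness, then chunking needs).

```python
from dataclasses import dataclass


@dataclass
class KBRequirements:
    """Hypothetical summary of a team's knowledge base requirements."""
    doc_count: int
    source_types: int
    needs_realtime_freshness: bool
    needs_custom_chunking_or_multimodal: bool


def recommend_pipeline(req: KBRequirements) -> str:
    """Return 'managed', 'cdc', or 'batch' following the build-vs-buy flow."""
    # Small corpora or few source types: a managed service is enough.
    if req.doc_count < 50_000 or req.source_types <= 3:
        return "managed"
    # Real-time freshness is the one requirement that forces a CDC pipeline.
    if req.needs_realtime_freshness:
        return "cdc"
    # Domain-specific chunking or multi-modal content: custom batch pipeline.
    if req.needs_custom_chunking_or_multimodal:
        return "batch"
    # Large corpus but otherwise standard needs: stay on a managed service.
    return "managed"


if __name__ == "__main__":
    print(recommend_pipeline(KBRequirements(20_000, 2, False, False)))   # managed
    print(recommend_pipeline(KBRequirements(200_000, 6, True, False)))   # cdc
    print(recommend_pipeline(KBRequirements(200_000, 6, False, True)))   # batch
```

Encoding the flow this way keeps "we might need this later" out of the inputs: only requirements that hold today change the recommendation.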

| Signal | Use managed (LlamaCloud, Unstructured, LangChain indexing API) | Build custom |
| --- | --- | --- |
| Document count | < 50K documents | > 50K or growing rapidly |
| Source types | 1–3 standard sources (PDFs, Confluence, Notion) | > 3 sources or proprietary systems |
| Freshness requirement | Hourly or slower is fine | Sub-minute (real-time) required |
| Chunking needs | Standard fixed-size or semantic | Domain-specific (code, legal, medical) |
| Multimodal content | Not needed | Tables, diagrams, PDFs with complex layouts |
| Team capacity | No dedicated data infra engineer | Dedicated data infra team |

Start managed, migrate when you have pain

Teams that start with a managed pipeline and migrate to custom when they hit real limits save weeks of engineering time. The signals to migrate are specific: you need CDC-level freshness, your sources are too proprietary for connectors, or your chunking requirements are domain-specific. 'We might need this later' is not a signal.