Ingestion Pipelines

How to choose between batch, streaming, and CDC ingestion; what ingestion actually costs; and the specific ways production pipelines fail — with a complete reference implementation.

Quick Reference

→Batch ingestion runs on a schedule; streaming processes documents as they arrive; CDC listens for change events.
→Incremental updates hash document content and re-index only what changed — avoiding the cost of full-corpus rebuilds.
→Embedding cost for 1M docs at OpenAI text-embedding-3-small: $50 (standard) or $25 (Batch API, 24h completion).
→Cohere embed-v4 costs 5× more than OpenAI small at scale; self-hosted is 20–100× cheaper for large corpora.
→Always delete old chunks before re-indexing a changed document — skipping this step creates duplicate, contradictory results.
→A dead letter queue prevents one failing document from stalling the entire pipeline.
→Airflow 3.x (2026) is a ground-up rewrite with a new Task Execution API — Airflow 2.x DAG patterns no longer apply.

Batch, Streaming, or CDC: Choosing Your Strategy

Most articles about ingestion pipelines describe all three strategies (batch, streaming, CDC) as equally valid options, leaving you to figure out which one fits. In practice, the decision follows directly from one question: how stale can your index be before it causes a user-visible problem? For the majority of RAG systems, an hour-old index is fine. For a support chatbot that needs to reflect tickets closed five minutes ago, it's not.

start with batch · add streaming only when sources demand it · CDC for sub-minute freshness

Strategy	Index lag	Infra cost	Error model	When to choose
Batch	Minutes to hours	Low — cron job or DAG	Retry whole job or failed items	90% of RAG systems: documents updated daily or hourly
Streaming	Seconds to minutes	Medium — message queue + consumers	Dead letter queue, per-message retries	Live sources: edit streams, support tickets, news feeds
CDC	Milliseconds to seconds	Low (webhooks) or high (Debezium+Kafka)	At-least-once, idempotent required	Database-backed sources: Postgres, Notion, Confluence

Start with batch, add the others only when forced to

Batch is the only strategy you can test end-to-end locally in 30 seconds. Streaming needs a queue. CDC needs a source system that emits events. Start with batch, get it correct and monitored, then layer in streaming or CDC for specific sources that actually need lower latency. Two strategies in one pipeline is manageable; three is a debugging nightmare.

A 2026 development worth knowing: RisingWave (a streaming database) can connect directly to PostgreSQL's write-ahead log and materialized the embeddings via a built-in `openai_embedding()` function. The result is an embedding index that auto-updates on document changes — no orchestration layer needed. It's still a niche tool, but it demonstrates where CDC-based ingestion is heading.

Ingestion Cost Math

The most common missing section in ingestion articles is a cost breakdown. Here it is. Assumption: average document produces 2,500 embedding tokens (roughly five 500-token chunks). This is conservative — a short wiki page might be 800 tokens; a long PDF might be 8,000.

Incremental Updates and Change Detection

Re-indexing an entire corpus when one document changes wastes money and time. Incremental updates hash document content, store the hash in a checkpoint file, and on subsequent runs only re-process documents whose hash has changed. The critical correctness invariant: when a document changes, you must delete its old chunks from the index before adding new ones. Skipping the delete creates contradictory search results that are hard to debug — they appear randomly depending on which chunk the retriever returns.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.