Advanced · 18 min

Ingestion Pipelines

How to choose between batch, streaming, and CDC ingestion; what ingestion actually costs; and the specific ways production pipelines fail — with a complete reference implementation.

Quick Reference

  • Batch ingestion runs on a schedule; streaming processes documents as they arrive; CDC listens for change events.
  • Incremental updates hash document content and re-index only what changed — avoiding the cost of full-corpus rebuilds.
  • Embedding cost for 1M docs at OpenAI text-embedding-3-small: $50 (standard) or $25 (Batch API, 24h completion).
  • Cohere embed-v4 costs 5× more than OpenAI small at scale; self-hosted is 20–100× cheaper for large corpora.
  • Always delete old chunks before re-indexing a changed document — skipping this step creates duplicate, contradictory results.
  • A dead letter queue prevents one failing document from stalling the entire pipeline.
  • Airflow 3.x (3.0 was released in April 2025) is a major re-architecture with a new Task Execution API; review Airflow 2.x DAG patterns before reusing them, because some no longer apply.
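
The incremental-update and delete-before-re-index points above can be sketched as follows. This is a minimal illustration, not a reference implementation: `InMemoryIndex`, `chunk`, and `sync` are hypothetical names, the index is a plain dictionary standing in for a real vector store, and the embedding call is left out entirely.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a vector store (assumption, not a real client)."""
    chunks: dict[str, list[str]] = field(default_factory=dict)  # doc_id -> chunk texts
    hashes: dict[str, str] = field(default_factory=dict)        # doc_id -> content hash

    def delete_document(self, doc_id: str) -> None:
        # Remove every chunk that belongs to this document.
        self.chunks.pop(doc_id, None)

    def upsert_chunks(self, doc_id: str, chunks: list[str]) -> None:
        # A real pipeline would embed the chunks here before writing them.
        self.chunks[doc_id] = chunks


def content_hash(text: str) -> str:
    """Stable fingerprint of the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-width chunking, used only to keep the sketch short."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sync(index: InMemoryIndex, docs: dict[str, str]) -> dict[str, list[str]]:
    """Re-index only documents whose content hash changed."""
    report: dict[str, list[str]] = {"indexed": [], "skipped": []}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if index.hashes.get(doc_id) == digest:
            report["skipped"].append(doc_id)  # unchanged: no embedding cost
            continue
        # Delete old chunks first so stale text cannot sit beside the new text.
        index.delete_document(doc_id)
        index.upsert_chunks(doc_id, chunk(text))
        index.hashes[doc_id] = digest
        report["indexed"].append(doc_id)
    return report
```

Running `sync` twice over the same documents should report everything as skipped on the second pass, which is where the savings over a full-corpus rebuild come from.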

Batch, Streaming, or CDC: Choosing Your Strategy

Most articles about ingestion pipelines describe all three strategies (batch, streaming, CDC) as equally valid options, leaving you to figure out which one fits. In practice, the decision follows directly from one question: how stale can your index be before it causes a user-visible problem? For the majority of RAG systems, an hour-old index is fine. For a support chatbot that needs to reflect tickets closed five minutes ago, it's not.

Figure: Ingestion Strategy Decision Tree (how fresh must indexed content be?)
  • > 1 hour: Batch. Schedule-based (Airflow / Prefect). Most RAG systems.
  • 5–60 min: Streaming. Queue-based (Kafka / SQS). Live wikis, tickets.
  • < 5 min: CDC / Webhook. Event-driven (webhooks / DB WAL). DB / CMS sources.

start with batch · add streaming only when sources demand it · CDC for sub-minute freshness
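
The decision tree reduces to a single comparison on the staleness budget. The sketch below encodes the thresholds from the figure; `choose_strategy` is a hypothetical helper, and treating exactly one hour as streaming is an assumption about where the boundaries fall.

```python
from datetime import timedelta


def choose_strategy(max_staleness: timedelta) -> str:
    """Map the tolerable index lag to an ingestion strategy."""
    if max_staleness > timedelta(hours=1):
        return "batch"       # schedule-based, e.g. Airflow or Prefect
    if max_staleness >= timedelta(minutes=5):
        return "streaming"   # queue-based, e.g. Kafka or SQS
    return "cdc"             # event-driven: webhooks or the database WAL
```

For example, `choose_strategy(timedelta(hours=6))` selects batch, while a 30-second budget selects CDC.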

| Strategy | Index lag | Infra cost | Error model | When to choose |
|---|---|---|---|---|
| Batch | Minutes to hours | Low (cron job or DAG) | Retry whole job or failed items | 90% of RAG systems: documents updated daily or hourly |
| Streaming | Seconds to minutes | Medium (message queue + consumers) | Dead letter queue, per-message retries | Live sources: edit streams, support tickets, news feeds |
| CDC | Milliseconds to seconds | Low (webhooks) or high (Debezium + Kafka) | At-least-once delivery, idempotent handlers required | Database-backed sources: Postgres, Notion, Confluence |

Start with batch, add the others only when forced to

Batch is the only strategy you can test end-to-end locally in 30 seconds. Streaming needs a queue. CDC needs a source system that emits events. Start with batch, get it correct and monitored, then layer in streaming or CDC for specific sources that actually need lower latency. Two strategies in one pipeline is manageable; three is a debugging nightmare.
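
To make "test end-to-end locally" concrete, here is a minimal batch runner with per-document retries and a dead letter list, so one failing document does not stall the rest. `run_batch` and the `process` callback are hypothetical; a real pipeline would add backoff between attempts and persist the dead letters somewhere durable.

```python
from typing import Callable


def run_batch(
    docs: dict[str, str],
    process: Callable[[str, str], None],
    max_attempts: int = 3,
) -> tuple[list[str], list[dict]]:
    """Process every document; park repeated failures instead of aborting the job."""
    succeeded: list[str] = []
    dead_letters: list[dict] = []
    for doc_id, text in docs.items():
        last_error = None
        for _attempt in range(max_attempts):
            try:
                process(doc_id, text)
                succeeded.append(doc_id)
                break
            except Exception as exc:  # broad on purpose: any failure is parked
                last_error = exc
        else:
            # All attempts failed: record the document and move on to the next one.
            dead_letters.append(
                {"doc_id": doc_id, "error": repr(last_error), "attempts": max_attempts}
            )
    return succeeded, dead_letters
```

Passing a `process` function that raises on empty text is a quick way to confirm that the remaining documents still get indexed and the bad one ends up in the dead letter list.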

A 2026 development worth knowing: RisingWave (a streaming database) can connect directly to PostgreSQL's write-ahead log and materialize embeddings via a built-in `openai_embedding()` function. The result is an embedding index that updates automatically when documents change, with no separate orchestration layer. It's still a niche tool, but it demonstrates where CDC-based ingestion is heading.
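
Whichever tool emits the change events, the comparison table above notes that CDC delivery is at-least-once, so the consumer has to tolerate duplicates. One common way to do that is to track the last applied version per document. The sketch below assumes each event carries a monotonically increasing `version` (for instance a log position); `ChangeEvent` and `IdempotentApplier` are hypothetical names, and the dictionary stands in for the real index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    doc_id: str
    version: int   # assumed to increase with every change to the document
    op: str        # "upsert" or "delete"
    text: str = ""


class IdempotentApplier:
    """Applies change events so that redelivered events have no effect."""

    def __init__(self) -> None:
        self.applied_version: dict[str, int] = {}
        self.documents: dict[str, str] = {}  # stand-in for the vector index

    def apply(self, event: ChangeEvent) -> bool:
        """Return True if the event changed state, False if it was a duplicate or stale."""
        if event.version <= self.applied_version.get(event.doc_id, -1):
            return False
        if event.op == "delete":
            self.documents.pop(event.doc_id, None)
        else:
            self.documents[event.doc_id] = event.text
        self.applied_version[event.doc_id] = event.version
        return True
```

Replaying the same event, or an older one that arrives late, returns `False` and leaves the index untouched.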