Ingestion Pipelines
How to choose between batch, streaming, and CDC ingestion; what ingestion actually costs; and the specific ways production pipelines fail — with a complete reference implementation.
Quick Reference
- Batch ingestion runs on a schedule; streaming processes documents as they arrive; CDC listens for change events.
- Incremental updates hash document content and re-index only what changed, avoiding the cost of full-corpus rebuilds.
- Embedding cost for 1M docs at OpenAI text-embedding-3-small: $50 (standard) or $25 (Batch API, 24h completion).
- Cohere embed-v4 costs 5× more than OpenAI small at scale; self-hosted is 20–100× cheaper for large corpora.
- Always delete old chunks before re-indexing a changed document; skipping this step leaves stale chunks behind and creates duplicate, contradictory results.
- A dead letter queue prevents one failing document from stalling the entire pipeline.
- Airflow 3.x (2026) is a major rearchitecture with a new Task Execution API, so Airflow 2.x DAG patterns should be re-checked rather than assumed to carry over.
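The embedding-cost figures in the list above can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes a list price of $0.02 per million tokens for text-embedding-3-small and a 50% Batch API discount (check both against current pricing), plus an average of 2,500 tokens per document. That last number is an assumption chosen because it makes the $50 figure work out; the document length behind the original figure is not stated.

```python
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       usd_per_million_tokens: float,
                       discount: float = 0.0) -> float:
    """Estimate the one-off cost of embedding a corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens * (1.0 - discount)


# Assumed values, not taken from a price sheet in this article.
PRICE_SMALL_USD_PER_M = 0.02   # assumed list price per 1M tokens
AVG_TOKENS_PER_DOC = 2_500     # assumed average document length

standard = embedding_cost_usd(1_000_000, AVG_TOKENS_PER_DOC, PRICE_SMALL_USD_PER_M)
batch = embedding_cost_usd(1_000_000, AVG_TOKENS_PER_DOC, PRICE_SMALL_USD_PER_M,
                           discount=0.5)
print(f"standard: ${standard:.2f}, batch: ${batch:.2f}")
```

Cost scales linearly with average document length, so measuring token counts on a sample of your own corpus is worth doing before trusting any per-million-docs estimate.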
Batch, Streaming, or CDC: Choosing Your Strategy
Most articles about ingestion pipelines describe all three strategies (batch, streaming, CDC) as equally valid options, leaving you to figure out which one fits. In practice, the decision follows directly from one question: how stale can your index be before it causes a user-visible problem? For the majority of RAG systems, an hour-old index is fine. For a support chatbot that needs to reflect tickets closed five minutes ago, it's not.
Start with batch · add streaming only when sources demand it · use CDC for sub-minute freshness
| Strategy | Index lag | Infra cost | Error model | When to choose |
|---|---|---|---|---|
| Batch | Minutes to hours | Low — cron job or DAG | Retry whole job or failed items | 90% of RAG systems: documents updated daily or hourly |
| Streaming | Seconds to minutes | Medium — message queue + consumers | Dead letter queue, per-message retries | Live sources: edit streams, support tickets, news feeds |
| CDC | Milliseconds to seconds | Low (webhooks) or high (Debezium+Kafka) | At-least-once delivery; idempotent handlers required | Database-backed sources: Postgres, Notion, Confluence |
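The error-model column in the table can be made concrete with a small consumer. This is a minimal in-memory sketch, not a production design: `ChangeEvent`, `Consumer`, the 200-character chunking, and the retry count are all hypothetical names and values chosen for illustration. It shows per-message retries, a dead letter queue for events that keep failing, and an idempotent apply step so that a redelivered event (expected under at-least-once delivery) is a no-op.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    doc_id: str
    content: str


@dataclass
class Consumer:
    # doc_id -> (content_hash, chunks); stands in for a real vector index
    index: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)
    max_attempts: int = 3

    def _apply(self, event: ChangeEvent) -> None:
        if not event.content.strip():
            raise ValueError("empty document")
        digest = hashlib.sha256(event.content.encode("utf-8")).hexdigest()
        current = self.index.get(event.doc_id)
        if current is not None and current[0] == digest:
            return  # duplicate delivery of the same content: nothing to do
        chunks = [event.content[i:i + 200]
                  for i in range(0, len(event.content), 200)]
        # Overwriting the entry replaces every old chunk for this document.
        self.index[event.doc_id] = (digest, chunks)

    def handle(self, event: ChangeEvent) -> bool:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self._apply(event)
                return True
            except Exception as exc:  # a real consumer would back off here
                last_error = exc
        # Park the event instead of blocking everything queued behind it.
        self.dead_letters.append((event, repr(last_error)))
        return False


consumer = Consumer()
events = [
    ChangeEvent("a", "hello"),
    ChangeEvent("a", "hello"),   # redelivered duplicate
    ChangeEvent("b", "   "),     # always fails, ends up in the DLQ
    ChangeEvent("c", "world"),   # still processed after the failure
]
results = [consumer.handle(e) for e in events]
print(results, len(consumer.dead_letters))
```

Retrying a deterministic failure such as an empty document is wasted work; a real consumer would typically distinguish transient errors (timeouts, rate limits) from permanent ones and send the latter straight to the dead letter queue.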
Batch is the only strategy you can test end-to-end locally in 30 seconds. Streaming needs a queue. CDC needs a source system that emits events. Start with batch, get it correct and monitored, then layer in streaming or CDC for specific sources that actually need lower latency. Two strategies in one pipeline is manageable; three is a debugging nightmare.
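To illustrate how little a locally testable batch pass needs, here is a minimal sketch using only the standard library. The function names, the 40-character chunk size, and the dict-based `chunk_store` are assumptions for illustration; the embedding call is omitted entirely, so the stored chunk text stands in for a vector. It hashes each document, skips unchanged ones, and deletes a changed document's old chunks before writing new ones.

```python
import hashlib


def chunk(text: str, size: int = 40) -> list:
    """Naive fixed-width chunking, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def run_batch(docs: dict, seen_hashes: dict, chunk_store: dict) -> list:
    """Index new or changed documents; return the ids that were re-indexed.

    docs:        doc_id -> current text
    seen_hashes: doc_id -> content hash from the previous run
    chunk_store: (doc_id, chunk_number) -> chunk text
    """
    reindexed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) == digest:
            continue  # unchanged since the last run
        # Delete old chunks first so a shorter new version leaves no stale tail.
        for key in [k for k in chunk_store if k[0] == doc_id]:
            del chunk_store[key]
        for n, piece in enumerate(chunk(text)):
            chunk_store[(doc_id, n)] = piece
        seen_hashes[doc_id] = digest
        reindexed.append(doc_id)
    return reindexed


seen, store = {}, {}
first = run_batch({"faq": "x" * 100, "tos": "y" * 30}, seen, store)
second = run_batch({"faq": "x" * 50, "tos": "y" * 30}, seen, store)
print(first, second, sorted(store))
```

In the second run the "faq" document shrinks from three chunks to two. Without the delete step, the third chunk from the first run would remain in the store and could still be retrieved alongside the current content.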
A 2026 development worth knowing: RisingWave (a streaming database) can connect directly to PostgreSQL's write-ahead log and materialize embeddings via a built-in `openai_embedding()` function. The result is an embedding index that updates automatically when documents change, with no separate orchestration layer. It's still a niche tool, but it shows where CDC-based ingestion is heading.