Ingestion Pipelines
Building production ingestion pipelines for RAG: batch vs streaming, incremental updates, change detection, and pipeline orchestration with Airflow and Prefect.
Quick Reference
- Batch ingestion: process all documents on a schedule (hourly, daily). Simple, predictable, good for most use cases.
- Streaming ingestion: process documents as they arrive. Lower latency but more complex infrastructure.
- Incremental updates: detect changed documents and re-index only those, not the entire corpus.
- Content hashing: compute a hash of each document to detect changes without re-processing.
- Pipeline orchestration: use Airflow or Prefect for scheduling, retries, monitoring, and alerting.
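The content-hashing and incremental-update ideas above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names `content_hash` and `detect_changes`, the choice of SHA-256, and the in-memory dictionaries standing in for a persistent hash store are all assumptions made for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(documents: dict[str, str], known_hashes: dict[str, str]):
    """Compare current documents against hashes stored by the previous run.

    documents:    doc_id -> current text (hypothetical source snapshot)
    known_hashes: doc_id -> hash recorded at the last ingestion
    Returns (changed_ids, deleted_ids, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or modified documents: no stored hash, or the hash differs.
    changed = [doc_id for doc_id, h in new_hashes.items() if known_hashes.get(doc_id) != h]
    # Documents that disappeared from the source should be removed from the index.
    deleted = [doc_id for doc_id in known_hashes if doc_id not in new_hashes]
    return changed, deleted, new_hashes
```

Only the IDs in `changed` would need to be re-chunked and re-embedded, and the IDs in `deleted` removed from the index. In a real pipeline the hash map would typically live in a database or in the vector store's metadata rather than in memory.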
Batch vs Streaming Ingestion
Batch ingestion processes documents on a schedule (every hour, every day) or on demand. It's simpler to build, test, and debug. Streaming ingestion processes documents as soon as they're created or updated, typically using a message queue. It provides lower latency (documents are searchable within seconds to minutes) but adds infrastructure complexity. Most production RAG systems start with batch and add streaming only for time-sensitive sources.
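The difference between the two modes is mostly in how documents reach the processing step, not in the step itself. The sketch below assumes a hypothetical `ingest_document` function (a real one would chunk, embed, and upsert) and uses Python's standard `queue.Queue` as a stand-in for a real message queue; the `None` sentinel exists only so the example terminates.

```python
import queue


def ingest_document(doc: dict, index: dict) -> None:
    """Hypothetical per-document step; a placeholder for chunk/embed/upsert."""
    index[doc["id"]] = doc["text"].strip().lower()


def run_batch(documents: list[dict], index: dict) -> int:
    """Batch mode: process everything fetched for this scheduled run."""
    for doc in documents:
        ingest_document(doc, index)
    return len(documents)


def run_stream(messages: queue.Queue, index: dict) -> int:
    """Streaming mode: process each document as soon as it arrives."""
    processed = 0
    while True:
        doc = messages.get()  # blocks until a message is available
        if doc is None:       # sentinel used only to stop this example
            break
        ingest_document(doc, index)
        processed += 1
    return processed
```

Because both runners call the same `ingest_document`, a pipeline built this way can start as a scheduled batch job and later attach a streaming consumer for selected sources without duplicating the processing logic.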
| Aspect | Batch Ingestion | Streaming Ingestion |
|---|---|---|
| Latency | Minutes to hours (depends on schedule) | Seconds to minutes |
| Complexity | Low — simple script or DAG | High — message queues, consumers, error handling |
| Throughput | High — processes in bulk | Variable — depends on arrival rate |
| Error handling | Retry entire batch or failed items | Dead letter queues, individual retries |
| Monitoring | Check job completion, count processed | Monitor queue depth, consumer lag |
| Best for | Daily document updates, bulk imports | Live documents (wiki edits, support tickets) |
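The "individual retries" and "dead letter queues" entries in the error-handling row can be illustrated as follows. This is a simplified in-process sketch: `consume_with_dlq`, the `max_attempts` parameter, and the plain list used as the dead letter queue are assumptions for the example, and a real message broker would provide its own retry and dead-letter mechanisms.

```python
from collections import deque


def consume_with_dlq(messages, handler, max_attempts=3):
    """Process messages one by one, retrying failures individually.

    A message that still fails after max_attempts is moved to a dead letter
    list instead of blocking the messages behind it.
    Returns (succeeded_count, dead_letters).
    """
    pending = deque((msg, 1) for msg in messages)  # (message, attempt number)
    dead_letters = []
    succeeded = 0
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            succeeded += 1
        except Exception as exc:
            if attempt < max_attempts:
                # Requeue at the back so other messages keep flowing.
                pending.append((msg, attempt + 1))
            else:
                # Give up and keep the message with its error for inspection.
                dead_letters.append((msg, repr(exc)))
    return succeeded, dead_letters
```

The dead letter list is what an operator would inspect or replay later; alerting on its size is one simple way to surface documents that consistently fail to ingest.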
Build your ingestion pipeline as a batch job first. Get it correct, monitored, and stable. Then add streaming for specific sources that need low latency (e.g., support tickets, live documentation). Most RAG systems don't need sub-minute freshness; hourly batch ingestion is usually sufficient.