Designing Data Pipeline Architecture for Real-Time Analytics

Batch processing gives you accurate results hours later. Stream processing gives you approximate results in seconds. The right architecture depends on which trade-off your business can live with — and whether "approximate" is close enough.

Lambda Architecture

Lambda runs parallel batch and streaming pipelines. The batch layer processes all historical data for accuracy. The speed layer processes recent data for low latency. A serving layer merges results from both.

Advantage: Accurate results (batch) with low latency (speed). The batch layer corrects any approximations from the speed layer.

Disadvantage: You maintain two codebases doing essentially the same computation in different paradigms. This is the reason Lambda has fallen out of favor — the operational burden of two parallel pipelines is substantial.

Kappa Architecture

Kappa simplifies Lambda by eliminating the batch layer. Everything runs through the streaming pipeline. If you need to reprocess historical data, replay the event log.

Advantage: One codebase, one pipeline, simpler operations. Reprocessing means deploying a new version of the processor and replaying from Kafka.

Disadvantage: Replay can be slow for large historical datasets. Kafka retention limits how far back you can replay (unless using tiered storage).

When to choose Kappa over Lambda: When your streaming logic can achieve batch-level accuracy (using event-time processing and watermarks), Kappa is the simpler choice. Most modern stream processors (Flink, Kafka Streams) handle this well.

Stream Processing with Kafka Streams

Kafka Streams is a library (not a framework) that runs inside your application. No separate cluster to manage.

java

Kafka Streams strengths: Embedded in your app (no external cluster), exactly-once processing, stateful operations with local RocksDB state stores, interactive queries on local state.

Kafka Streams limitations: Can only read from and write to Kafka. If you need to enrich from a database or call an external API, you're on your own (typically via a lookup table loaded from a compacted Kafka topic).

Stream Processing with Apache Flink

Flink is a distributed stream processing framework. It runs on its own cluster (or Kubernetes), giving you more processing power but more operational overhead.

python

Flink strengths: True distributed processing, event-time semantics with watermarks, exactly-once with two-phase commit sinks, complex event processing (CEP), can read from and write to anything (not just Kafka).

Flink limitations: Operational complexity (JobManager, TaskManagers, state backends, checkpointing), cluster management, and debugging distributed state.

Handling Late-Arriving Data

In real-time systems, events don't always arrive in order. A mobile app might buffer events offline and send them hours later. How do you handle a click event that arrives 30 minutes after the window closed?

Watermarks define the boundary. A watermark says "I believe all events with timestamp <= W have arrived." Events arriving after the watermark are "late."

Strategies for late data:

Drop late events. Simplest. Works when losing a few events is acceptable (analytics, non-critical metrics).
Allowed lateness. Keep the window open for an additional period. Kafka Streams and Flink support this:

java

Side output for late events. Process late events separately — write them to a "late events" topic for batch reconciliation:

java

Backpressure Management

When a downstream sink (database, API) can't keep up with the processing rate, events pile up. Without backpressure handling, the pipeline either drops events or runs out of memory.

Kafka Streams: Backpressure is implicit. If the consumer can't keep up, it stops polling. Kafka retains the events. The lag grows, but no data is lost.

Flink: Uses a credit-based flow control mechanism. When a downstream operator is slow, upstream operators stop sending data. This propagates backpressure all the way to the source.

Monitoring backpressure:

Kafka consumer lag (events in the topic minus consumer offset)
Flink's backpressure metrics (task backpressure ratio)
Processing latency (time between event timestamp and processing time)

Sink Selection

Where processed data lands determines query performance:

Sink	Best For	Query Latency	Write Throughput
ClickHouse	Analytics queries, aggregations	50-500ms	500K rows/s
Apache Druid	Time-series, slicing & dicing	100-500ms	200K rows/s
PostgreSQL	Mixed workloads, < 100M rows	1-50ms	50K rows/s
Elasticsearch	Full-text search + analytics	10-100ms	100K docs/s
Redis	Real-time dashboards, counters	< 1ms	500K ops/s

For real-time dashboards, a common pattern: stream processor writes aggregated results to Redis (for sub-millisecond dashboard queries) and detailed results to ClickHouse (for ad-hoc analytics).

End-to-End Architecture

Real-time data pipeline architecture is about matching the processing model to your latency and accuracy requirements. Kappa (streaming-only) handles most modern use cases. Lambda (batch + streaming) is justified when batch-level accuracy is legally or financially required. Start with Kafka Streams for simple transformations, move to Flink when you need distributed processing, event-time semantics, or complex windowing. And always plan for late data — it's not an edge case, it's a certainty.

Designing Data Pipeline Architecture for Real-Time Analytics

Designing Data Pipeline Architecture for Real-Time Analytics

Lambda Architecture

Kappa Architecture

Stream Processing with Kafka Streams

We build this end-to-end in the cohort.

Stream Processing with Apache Flink

Handling Late-Arriving Data

Backpressure Management

Sink Selection

End-to-End Architecture

More in Architecture

The Strangler Fig Pattern: Migrating Legacy Systems Incrementally

Zero-Downtime Deployments: Blue-Green, Canary, and Rolling Strategies

become an engineering leader