Blog / Architecture
Architecture

Designing Data Pipeline Architecture for Real-Time Analytics

Real-time data pipeline design covering Lambda vs Kappa architecture, stream processing with Kafka Streams and Flink, and handling late-arriving data.

Akhil Sharma

Akhil Sharma

March 18, 2026

11 min read

Designing Data Pipeline Architecture for Real-Time Analytics

Batch processing gives you accurate results hours later. Stream processing gives you approximate results in seconds. The right architecture depends on which trade-off your business can live with — and whether "approximate" is close enough.

Lambda Architecture

Lambda runs parallel batch and streaming pipelines. The batch layer processes all historical data for accuracy. The speed layer processes recent data for low latency. A serving layer merges results from both.

Advantage: Accurate results (batch) with low latency (speed). The batch layer corrects any approximations from the speed layer.

Disadvantage: You maintain two codebases doing essentially the same computation in different paradigms. This is the reason Lambda has fallen out of favor — the operational burden of two parallel pipelines is substantial.

Kappa Architecture

Kappa simplifies Lambda by eliminating the batch layer. Everything runs through the streaming pipeline. If you need to reprocess historical data, replay the event log.

Advantage: One codebase, one pipeline, simpler operations. Reprocessing means deploying a new version of the processor and replaying from Kafka.

Disadvantage: Replay can be slow for large historical datasets. Kafka retention limits how far back you can replay (unless using tiered storage).

When to choose Kappa over Lambda: When your streaming logic can achieve batch-level accuracy (using event-time processing and watermarks), Kappa is the simpler choice. Most modern stream processors (Flink, Kafka Streams) handle this well.

Stream Processing with Kafka Streams

Kafka Streams is a library (not a framework) that runs inside your application. No separate cluster to manage.

java

Kafka Streams strengths: Embedded in your app (no external cluster), exactly-once processing, stateful operations with local RocksDB state stores, interactive queries on local state.

Advanced System Design Cohort

We build this end-to-end in the cohort.

Live sessions, real systems, your questions answered in real time. Next cohort starts 2nd July 2026 — 20 seats.

Reserve your spot →

Kafka Streams limitations: Can only read from and write to Kafka. If you need to enrich from a database or call an external API, you're on your own (typically via a lookup table loaded from a compacted Kafka topic).

Stream Processing with Apache Flink

Flink is a distributed stream processing framework. It runs on its own cluster (or Kubernetes), giving you more processing power but more operational overhead.

python

Flink strengths: True distributed processing, event-time semantics with watermarks, exactly-once with two-phase commit sinks, complex event processing (CEP), can read from and write to anything (not just Kafka).

Flink limitations: Operational complexity (JobManager, TaskManagers, state backends, checkpointing), cluster management, and debugging distributed state.

Handling Late-Arriving Data

In real-time systems, events don't always arrive in order. A mobile app might buffer events offline and send them hours later. How do you handle a click event that arrives 30 minutes after the window closed?

Watermarks define the boundary. A watermark says "I believe all events with timestamp <= W have arrived." Events arriving after the watermark are "late."

Strategies for late data:

  1. Drop late events. Simplest. Works when losing a few events is acceptable (analytics, non-critical metrics).

  2. Allowed lateness. Keep the window open for an additional period. Kafka Streams and Flink support this:

java
  1. Side output for late events. Process late events separately — write them to a "late events" topic for batch reconciliation:
java

Backpressure Management

When a downstream sink (database, API) can't keep up with the processing rate, events pile up. Without backpressure handling, the pipeline either drops events or runs out of memory.

Kafka Streams: Backpressure is implicit. If the consumer can't keep up, it stops polling. Kafka retains the events. The lag grows, but no data is lost.

Flink: Uses a credit-based flow control mechanism. When a downstream operator is slow, upstream operators stop sending data. This propagates backpressure all the way to the source.

Monitoring backpressure:

  • Kafka consumer lag (events in the topic minus consumer offset)
  • Flink's backpressure metrics (task backpressure ratio)
  • Processing latency (time between event timestamp and processing time)

Sink Selection

Where processed data lands determines query performance:

SinkBest ForQuery LatencyWrite Throughput
ClickHouseAnalytics queries, aggregations50-500ms500K rows/s
Apache DruidTime-series, slicing & dicing100-500ms200K rows/s
PostgreSQLMixed workloads, < 100M rows1-50ms50K rows/s
ElasticsearchFull-text search + analytics10-100ms100K docs/s
RedisReal-time dashboards, counters< 1ms500K ops/s

For real-time dashboards, a common pattern: stream processor writes aggregated results to Redis (for sub-millisecond dashboard queries) and detailed results to ClickHouse (for ad-hoc analytics).

End-to-End Architecture

Real-time data pipeline architecture is about matching the processing model to your latency and accuracy requirements. Kappa (streaming-only) handles most modern use cases. Lambda (batch + streaming) is justified when batch-level accuracy is legally or financially required. Start with Kafka Streams for simple transformations, move to Flink when you need distributed processing, event-time semantics, or complex windowing. And always plan for late data — it's not an edge case, it's a certainty.

Data Pipelines Streaming Analytics Kafka

become an engineering leader

Advanced System Design Cohort