System Design: Stream Processing System (Kafka + Flink)
Design a high-throughput stream processing system using Apache Kafka and Apache Flink to process millions of events per second with exactly-once semantics, stateful computations, and sub-second latency. Covers windowing, state management, and fault tolerance.
Requirements
Functional Requirements:
- Ingest event streams from producers at up to 5 million events/second
- Apply stateful transformations: aggregations, joins between streams, enrichment with side inputs
- Support tumbling, sliding, and session windows with configurable sizes
- Detect patterns across event sequences (e.g., fraud signals within a 5-minute window)
- Emit processed results to multiple sinks: downstream Kafka topics, Cassandra, Elasticsearch, and data warehouse
- Support late-arriving events with configurable watermark tolerance
Non-Functional Requirements:
- End-to-end processing latency under 500ms at the 99th percentile
- Exactly-once processing semantics for financial and transactional events
- Horizontal scalability to 100 Flink task managers without architecture changes
- Recovery time objective (RTO) under 30 seconds after a task manager failure
- Kafka retention sufficient to replay 7 days of events for backfill and debugging
Scale Estimation
At 5M events/second with an average event size of 500 bytes, ingestion throughput is 2.5 GB/s. Kafka requires ~2.5 GB/s * 3 replicas = 7.5 GB/s of write throughput across brokers. With 7-day retention: 2.5 GB/s * 86400 * 7 = ~1.5 PB of raw storage before replication (~4.5 PB with 3x replication). Flink state for 5-minute session windows with 10 million active sessions at 1 KB of state per session comes to 10 GB of keyed state per topology; with headroom for growth and key skew, this calls for a RocksDB-backed state backend on SSD storage.
High-Level Architecture
Producers publish events to Kafka topics partitioned by a natural sharding key (e.g., user_id or device_id). Partitioning by the aggregation key ensures that all events for a given key land on the same Flink task, enabling local state management and eliminating distributed lookups. Kafka brokers run with a replication factor of 3 and min.insync.replicas=2 to guarantee durability without sacrificing too much write latency.
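A minimal producer sketch of this keying and durability setup, assuming a hypothetical "events" topic, string keys, and byte-array payloads; the broker address and payload are placeholders. With Kafka's default partitioner, records that share a key always hash to the same partition:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);  // no duplicates on retry

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // Keying by user_id routes every event for that user to one partition,
            // and therefore to one Flink sub-task downstream.
            String userId = "user-42";
            byte[] payload = "{\"type\":\"click\"}".getBytes(StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("events", userId, payload));
        }
    }
}
```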
Flink jobs consume from Kafka using Flink's Kafka source connector (KafkaSource; the older FlinkKafkaConsumer API is deprecated), which assigns each Kafka partition to a Flink source sub-task. The Flink topology consists of: Source (Kafka consumer) → Watermark assigner → Keyed transformations (grouped by sharding key) → Window operator → Aggregation function → Sink. The checkpoint mechanism persists Flink state to S3 every 30 seconds, enabling recovery by replaying Kafka events from the offsets recorded in the last checkpoint.
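A compact sketch of that topology, assuming Flink 1.15+ with the KafkaSource API, events encoded as "key,value,timestampMillis" CSV strings, and a 1-minute tumbling sum per key; topic, group, and broker names are illustrative:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000); // checkpoint every 30s, as described above

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")                // placeholder
                .setTopics("events")                              // placeholder topic
                .setGroupId("event-aggregation")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source,
                        // Tolerate up to 30s of out-of-orderness (see watermark section).
                        WatermarkStrategy
                                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner(
                                        (line, ts) -> Long.parseLong(line.split(",")[2])),
                        "kafka-source")
                // Parse "key,value,timestampMillis" into (key, value).
                .map(line -> {
                    String[] f = line.split(",");
                    return Tuple2.of(f[0], Long.parseLong(f[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)                                     // the sharding key
                .window(TumblingEventTimeWindows.of(Time.minutes(1))) // window operator
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))       // aggregation
                .print();                                             // stand-in for real sinks

        env.execute("event-aggregation");
    }
}
```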
A job management layer handles Flink application lifecycle: deploying new job versions, scaling parallelism, and triggering savepoints before upgrades. Flink's Application Mode on Kubernetes deploys each job as an isolated cluster, preventing resource contention between different streaming pipelines. KEDA autoscaling adjusts Flink task manager replica counts based on Kafka consumer group lag.
Core Components
Apache Kafka Cluster
A Kafka cluster of 30 brokers (10 per AZ across 3 AZs) handles 2.5 GB/s throughput. Topics are created with a partition count equal to 3x the maximum desired Flink parallelism (e.g., 300 partitions for 100-task parallelism) to allow scaling without partition reassignment. Kafka's acks=all and idempotent producer configuration ensure no data loss or duplicates at the producer layer. Schema Registry enforces Avro schemas, rejecting malformed events before they enter the topic.
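One way to express that topic layout is through the Kafka AdminClient; this hedged sketch mirrors the numbers above (300 partitions, replication factor 3, min.insync.replicas=2, 7-day retention), with the broker address and topic name as placeholders:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateEventsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 300 partitions = 3x the 100-task maximum Flink parallelism.
            NewTopic events = new NewTopic("events", 300, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(events)).all().get();
        }
    }
}
```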
Apache Flink State Backend
Flink operators maintain keyed state (ValueState, MapState, ListState) scoped to each key. For state larger than available JVM heap, RocksDB state backend spills to local NVMe SSDs, providing state sizes of hundreds of GB per task manager with consistent read/write latency under 1ms. Incremental checkpoints to S3 transfer only changed RocksDB SST files, keeping checkpoint overhead under 5% of processing throughput.
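A minimal sketch of wiring this up programmatically (the same settings can live in flink-conf.yaml); the S3 path is a placeholder:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keyed state on local disk; `true` enables incremental checkpoints,
        // so only changed SST files are uploaded on each checkpoint.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        env.enableCheckpointing(30_000); // every 30 seconds
        env.getCheckpointConfig()
           .setCheckpointStorage("s3://checkpoint-bucket/event-aggregation"); // placeholder
    }
}
```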
Watermark & Late Data Handling
Flink's event-time processing uses watermarks to track the progress of event time. A bounded-out-of-orderness watermark strategy tolerates events arriving up to 30 seconds late before a window closes. Events arriving after the allowed lateness are routed to a side output for separate handling (typically written to a correction topic for downstream reconciliation). Watermarks are generated per source partition and combined as the minimum across partitions, so a skewed or lagging partition cannot prematurely advance event time for the whole job.
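A sketch of the late-data routing, assuming a (key, value) stream like the one in the topology sketch above; allowed lateness and the side-output tag are the two moving parts:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataRouting {
    // Anonymous subclass so Flink can capture the element type despite erasure.
    static final OutputTag<Tuple2<String, Long>> LATE =
            new OutputTag<Tuple2<String, Long>>("late-events") {};

    static DataStream<Tuple2<String, Long>> sumPerWindow(
            DataStream<Tuple2<String, Long>> events) {
        SingleOutputStreamOperator<Tuple2<String, Long>> sums = events
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                .allowedLateness(Time.seconds(30))  // re-fire windows for 30s past the watermark
                .sideOutputLateData(LATE)           // anything later lands on the side output
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

        // In the real job this would be a sink to a correction topic.
        sums.getSideOutput(LATE).print();
        return sums;
    }
}
```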
Database Design
Processed aggregations are written to Apache Cassandra for low-latency serving. The schema is keyed by (entity_id, window_start) with TTL matching the business retention requirement. For session state, a Redis cluster stores active session objects with an expiry matching the session timeout; Flink broadcasts session close events to trigger Redis cleanup. The Flink metadata store (ZooKeeper or Kubernetes ConfigMaps) tracks job IDs, checkpoint locations, and parallelism configurations.
API Design
POST /jobs — Submit a new Flink job JAR with topology configuration and parallelism settings.
POST /jobs/{job_id}/savepoints — Trigger a savepoint for zero-downtime job upgrade or migration.
GET /jobs/{job_id}/metrics — Return throughput (events/s), processing lag (ms), checkpoint duration, and backpressure ratio per operator.
GET /consumer-groups/{group_id}/lag — Return per-partition Kafka consumer lag for a Flink job's consumer group.
Scaling & Bottlenecks
Backpressure is the primary scaling signal: when a downstream operator cannot keep up, it propagates pressure upstream, eventually throttling Kafka consumption. Flink's backpressure visualization identifies the bottleneck operator. Solutions include increasing operator parallelism, optimizing the bottleneck function (switching to async I/O for external lookups), or splitting the topology into separate jobs connected by intermediate Kafka topics.
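A hedged sketch of the async I/O mitigation mentioned above; the enrichment call is a stand-in for a real non-blocking client (HTTP, Cassandra, Redis), and the capacity and timeout values are illustrative:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichment {
    static class Enricher extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> out) {
            // Stand-in for a non-blocking lookup against an external store.
            CompletableFuture.supplyAsync(() -> key + ":enriched")
                    .thenAccept(v -> out.complete(Collections.singleton(v)));
        }
    }

    static DataStream<String> enrich(DataStream<String> input) {
        // Up to 100 in-flight requests per sub-task, 50ms timeout each; unordered
        // emission avoids head-of-line blocking on slow lookups.
        return AsyncDataStream.unorderedWait(
                input, new Enricher(), 50, TimeUnit.MILLISECONDS, 100);
    }
}
```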
State size growth can cause checkpoint timeouts and task manager OOM errors. Mitigation strategies: TTL-based state expiry (Flink's StateTtlConfig, sketched below), switching the state backend from heap to RocksDB once state approaches roughly 1 GB, and periodic state compaction. For very large state (>1 TB), co-locating Flink with an external state store (Aerospike or Redis Cluster) and using Flink only for flow control reduces state management overhead at the cost of additional network hops.
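A minimal sketch of TTL-based state expiry with StateTtlConfig; the 24-hour TTL and descriptor name are illustrative:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class StateTtlSetup {
    static ValueStateDescriptor<Long> sessionCountDescriptor() {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(24)) // illustrative TTL
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                // Drop expired entries during RocksDB compaction, refreshing the
                // filter's view of current time every 1000 processed entries.
                .cleanupInRocksdbCompactFilter(1000)
                .build();

        ValueStateDescriptor<Long> desc =
                new ValueStateDescriptor<>("session-count", Long.class);
        desc.enableTimeToLive(ttl);
        return desc;
    }
}
```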
Key Trade-offs
- Exactly-once vs. at-least-once: Exactly-once requires two-phase commits between Flink checkpoints and Kafka sink transactions, adding 100–500ms of latency per checkpoint interval; at-least-once with idempotent sinks achieves near-identical correctness at lower latency (a transactional sink sketch follows this list).
- Event time vs. processing time: Event-time processing with watermarks gives accurate window semantics for out-of-order events but adds latency equal to the watermark lag; processing-time windows have lower latency but give incorrect results when events arrive out of order.
- Micro-batch (Spark Structured Streaming) vs. true streaming (Flink): Spark micro-batching achieves higher throughput for batch-oriented transforms; Flink's continuous processing achieves lower latency for record-at-a-time operations.
- Stateful vs. stateless operators: Stateless operators scale linearly and have no checkpoint overhead; stateful operators enable powerful aggregations but require careful state size management and add recovery time proportional to state size.
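A minimal sketch of the transactional Kafka sink the exactly-once trade-off refers to, assuming Flink's KafkaSink API (1.15+); the topic, transactional ID prefix, and broker address are placeholders. EXACTLY_ONCE enrolls the sink in the two-phase commit driven by checkpoints, and the transactional ID prefix is mandatory in that mode:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class TransactionalSink {
    static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")                     // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("results")                           // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Kafka transactions commit atomically with Flink checkpoints;
                // consumers must read with isolation.level=read_committed to see
                // only committed results.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("event-aggregation")         // required for EXACTLY_ONCE
                .build();
    }
}
```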