
Apache Kafka Interview Questions for Senior Engineers (2026)

15 advanced Apache Kafka interview questions with detailed answer frameworks covering Kafka architecture, partitions, consumer groups, exactly-once semantics, Kafka Streams, Kafka Connect, topic design, and offset management patterns used at LinkedIn, Uber, Netflix, and other top companies.

20 min read · Updated Apr 25, 2026
Tags: interview-questions · apache-kafka · distributed-systems · event-streaming · senior-engineer

Why Apache Kafka Dominates Senior Engineering Interviews

Apache Kafka has become the backbone of event-driven architectures at nearly every large-scale technology company. Originally built at LinkedIn to handle trillions of messages per day, Kafka is now the de facto standard for real-time data pipelines, event sourcing, stream processing, and inter-service communication. In senior engineering interviews, Kafka questions separate candidates who have operated distributed messaging systems in production from those who have only read the documentation.

At companies like LinkedIn, Uber, and Netflix, Kafka is not just a message queue. It is the central nervous system connecting hundreds of microservices, feeding real-time analytics, powering machine learning pipelines, and ensuring data consistency across geographically distributed systems. When an interviewer asks about Kafka, they are assessing whether you understand the distributed systems principles it embodies: partitioned logs, consumer group coordination, exactly-once delivery guarantees, and the fundamental trade-offs between throughput, latency, and durability.

The questions in this guide go beyond surface-level definitions. Each one is framed around the real intent behind the interviewer's question, followed by a structured answer framework that demonstrates both architectural depth and operational experience. Whether you are preparing for a system design interview that involves event streaming or a deep-dive technical round on distributed messaging, these questions will prepare you to demonstrate senior-level expertise.

For foundational context, review our guide on how message queues work and our comparison of Kafka vs RabbitMQ.


Question 1: Explain Kafka's architecture from the ground up. What makes it different from traditional message queues?

What the interviewer is really asking: Do you understand that Kafka is fundamentally a distributed commit log, not a message queue? Can you explain how this architectural difference leads to Kafka's unique properties around replay, retention, and consumer independence?

Answer framework:

Kafka's architecture is built around four core concepts: brokers, topics, partitions, and consumer groups.

A Kafka cluster consists of multiple broker nodes. Each broker is a server that stores data and serves client requests. Brokers do not manage per-consumer delivery state: each consumer tracks its own read position, and committed offsets live in an internal topic rather than in per-message delivery records.

A topic is a named feed of records, analogous to a database table. Topics are the primary abstraction for organizing data.

Each topic is divided into partitions, which are the unit of parallelism and ordering. A partition is an ordered, immutable, append-only log of records. Each record within a partition gets a sequential offset that uniquely identifies it.

What makes Kafka different from traditional queues:

  1. Retention-based, not consumption-based. Traditional queues delete messages after they are consumed. Kafka retains messages for a configurable period (or indefinitely with compacted topics) regardless of consumption. This enables replay, new consumers reading historical data, and debugging production issues by re-reading the event stream.

  2. Consumer pull model. Consumers pull data at their own pace. A slow consumer does not block other consumers or the brokers. Each consumer tracks its own offset.

  3. Multiple independent consumers. Multiple consumer groups can read the same topic independently, each maintaining their own offsets. This is fundamentally different from a traditional queue where a message is delivered to exactly one consumer.

  4. Partitioned parallelism. Throughput scales horizontally by adding partitions. Each partition can be consumed by a different consumer in the group, enabling parallel processing.

  5. Replication for durability. Each partition is replicated across multiple brokers. One replica is the leader (handles reads and writes), and the others are followers (replicate from the leader). If the leader fails, a follower is elected as the new leader.

The ISR (In-Sync Replica) set is critical: only replicas that are fully caught up with the leader are in the ISR. A message is considered committed only after all ISR members have acknowledged it (with acks=all). This is the foundation of Kafka's durability guarantee.


Question 2: How do you choose the right number of partitions for a Kafka topic?

What the interviewer is really asking: Can you reason about the trade-offs between parallelism, ordering, resource usage, and operational complexity? Partition count is one of the most impactful decisions in Kafka and one of the hardest to change later.

Answer framework:

Partition count determines the maximum parallelism of consumption: you can have at most as many active consumers in a group as there are partitions. But more partitions is not always better.

Factors that push for more partitions:

  • Target throughput. If a single consumer can process 10 MB/s and you need 100 MB/s total throughput, you need at least 10 partitions.
  • Consumer parallelism. If you have 20 consumer instances, you need at least 20 partitions for full utilization.
  • Future growth. Increasing partition count is possible but disruptive (it breaks key-based ordering guarantees). Plan for 2-3x current needs.

Factors that push for fewer partitions:

  • End-to-end latency. Each partition adds overhead to the producer batching cycle and broker replication. More partitions generally mean slightly higher end-to-end latency.
  • Broker memory and file handles. Each partition-replica maps to a directory on disk with multiple segment files. Thousands of partitions per broker consume significant memory for index caching and file handles.
  • Leader election time. When a broker fails, all partitions where it was leader need new leader elections. More partitions means longer recovery time (though KRaft mode in Kafka 3.x significantly improves this).
  • Consumer rebalance time. When a consumer joins or leaves, partitions are reassigned. More partitions means longer rebalance pauses.

Practical guidelines:

  • Start with max(desired_throughput / per_partition_throughput, expected_consumer_count).
  • For most topics, 6-12 partitions is a reasonable default.
  • High-throughput topics (clickstreams, logs) might need 50-100+ partitions.
  • Low-volume topics (configuration changes, admin events) can use 1-3 partitions.
  • Always benchmark with production-like data before committing to a partition count.

Key insight to convey: Partition count is effectively a one-way door decision because reducing partition count requires creating a new topic and migrating data. Increasing partitions is possible but breaks key ordering for existing keys. This is why the decision deserves careful upfront analysis.


Question 3: Explain consumer groups and partition assignment. What happens during a rebalance?

What the interviewer is really asking: Do you understand Kafka's coordination mechanism for distributing work across consumers? Rebalances are one of the most operationally impactful aspects of Kafka, and understanding them is essential for building reliable consumers.

Answer framework:

A consumer group is a set of consumer instances that cooperate to consume a topic. Kafka guarantees that each partition is assigned to exactly one consumer within a group at any given time. This ensures that records within a partition are processed in order by a single consumer.

A rebalance is triggered whenever group membership changes: a consumer joins, leaves gracefully, or fails to heartbeat within session.timeout.ms. During a rebalance:

  1. All consumers in the group pause consumption.
  2. The group coordinator (a designated broker) revokes all partition assignments.
  3. A new assignment is computed based on the partition assignment strategy.
  4. Partitions are redistributed among the remaining consumers.
  5. Consumers commit their current offsets and resume from the newly assigned partitions.

The problem with eager rebalancing: In the default eager protocol, all consumers stop processing during the rebalance. This creates a "stop-the-world" pause that can last seconds to minutes, depending on the group size and the time consumers take to join.

Cooperative rebalancing (incremental): Introduced in Kafka 2.4, cooperative rebalancing only revokes partitions that need to move. Consumers that keep their partitions continue processing without interruption. This dramatically reduces rebalance impact.

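A sketch of opting a consumer into cooperative rebalancing (a configuration fragment, assuming Kafka clients 2.4+; the broker address and group name are illustrative):

```java
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
// Opt in to incremental (cooperative) rebalancing
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());
```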

Assignment strategies:

  • RangeAssignor: Assigns contiguous partition ranges to consumers. Can create uneven distribution across multiple topics.
  • RoundRobinAssignor: Distributes partitions evenly but does not consider locality.
  • StickyAssignor: Minimizes partition movement during rebalances, preserving consumer affinity.
  • CooperativeStickyAssignor: Combines sticky assignment with cooperative rebalancing for minimal disruption.

Static group membership: By assigning a group.instance.id to each consumer, you can prevent rebalances during short-lived failures (restarts, deployments). The broker waits for session.timeout.ms before triggering a rebalance, giving the consumer time to rejoin with the same assignment.

For more on distributed coordination patterns, see our guide on consensus algorithms and how they apply to distributed systems interviews.


Question 4: What are the different delivery guarantees in Kafka, and how do you achieve exactly-once semantics?

What the interviewer is really asking: This is the core distributed systems question in Kafka interviews. Can you explain the spectrum from at-most-once to exactly-once, and do you understand the mechanisms (idempotent producers, transactions) that make exactly-once possible? Do you know its limitations?

Answer framework:

At-most-once delivery: The producer sends a message and does not retry on failure. If the broker does not acknowledge, the message is lost. On the consumer side, offsets are committed before processing. If the consumer crashes after committing but before processing, the message is skipped.

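A fragment illustrating at-most-once settings on both sides (process() is an illustrative placeholder):

```java
// Producer: fire-and-forget
props.put(ProducerConfig.ACKS_CONFIG, "0");     // don't wait for any broker ack
props.put(ProducerConfig.RETRIES_CONFIG, "0");  // a failed send is simply dropped

// Consumer: commit before processing
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
consumer.commitSync();                          // offsets advance even if processing fails
for (ConsumerRecord<String, String> record : records) {
    process(record);                            // a crash here skips these records forever
}
```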

At-least-once delivery: The producer retries on failure, which can cause duplicates if the broker received the first attempt but the acknowledgment was lost. On the consumer side, offsets are committed after processing. If the consumer crashes after processing but before committing, the message is reprocessed.

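The at-least-once counterpart (again a fragment; process() is illustrative):

```java
// Producer: retry until acknowledged by all in-sync replicas
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, String.valueOf(Integer.MAX_VALUE));

// Consumer: process first, commit afterwards
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
    process(record);    // a crash after this but before commit => reprocessed on restart
}
consumer.commitSync();
```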

Exactly-once semantics (EOS): Kafka provides exactly-once through two mechanisms:

  1. Idempotent producer. Each producer instance is assigned a Producer ID (PID), and each message gets a sequence number. The broker deduplicates messages with the same PID and sequence number, eliminating duplicates from producer retries.
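Enabling it is a one-line producer setting (fragment):

```java
// Broker deduplicates retries using (producer ID, sequence number)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
// Implies acks=all, retries > 0, and max.in.flight.requests.per.connection <= 5
```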
  2. Transactional producer/consumer. For exactly-once processing that spans reading from one topic and writing to another (the consume-transform-produce pattern), Kafka supports transactions:
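A sketch of the consume-transform-produce loop (assumes a producer configured with a transactional.id; running, transform(), and currentOffsets() are illustrative):

```java
producer.initTransactions();
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output-topic",
                    record.key(), transform(record.value())));
        }
        // Commit the consumed offsets inside the same transaction
        producer.sendOffsetsToTransaction(currentOffsets(records), consumer.groupMetadata());
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction();  // output messages and offsets roll back together
    }
}
```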

The transaction ensures that the output messages and offset commits are atomic: either all are committed or none are. Consumers configured with isolation.level=read_committed will only see committed transactional messages.

Limitations of EOS:

  • Exactly-once only applies within the Kafka ecosystem. If your consumer writes to an external database, you need an idempotency mechanism at the database level (e.g., upserts with a deduplication key).
  • Transactions add latency (typically 50-100ms per transaction commit).
  • Transaction throughput is limited by the transaction coordinator, which can become a bottleneck.

The key insight: exactly-once in Kafka is really "effectively once" achieved through idempotency and atomic commits, not through preventing duplicates at the network level (which is impossible in distributed systems).


Question 5: How do you design Kafka topics for a microservices architecture? What are the key decisions?

What the interviewer is really asking: Can you make practical topic design decisions that balance the needs of producers and consumers, support schema evolution, and avoid common anti-patterns? This tests your ability to think about Kafka as an organizational tool, not just a transport layer.

Answer framework:

Topic design decisions fall into four categories:

1. Topic granularity: one topic per event type vs. one topic per domain

Fine-grained (one topic per event type):

  • user.created, user.updated, user.deleted
  • Pros: Consumers subscribe only to events they care about, simpler per-topic schemas.
  • Cons: Topic proliferation (hundreds of topics), harder to maintain ordering across related events.

Coarse-grained (one topic per domain):

  • user-events containing all user-related events
  • Pros: Ordering is preserved across related events (user created before user updated), fewer topics to manage.
  • Cons: Consumers must filter events they don't care about, schemas become union types.

Recommendation: Use domain-level topics with a type field in the message header or payload. This preserves ordering for related events while allowing consumers to filter efficiently.

2. Key selection

The message key determines which partition a message lands in (via hashing). Choose keys that:

  • Group related messages together for ordered processing (e.g., user_id for user events ensures all events for a user go to the same partition and are processed in order).
  • Distribute evenly across partitions to avoid hotspots.
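For example (fragment; topic and accessor names are illustrative):

```java
// Keyed by user ID: all events for a user hash to the same partition, preserving order
producer.send(new ProducerRecord<>("user-events", event.getUserId(), serialize(event)));

// Null key: records spread across partitions for balance, at the cost of per-key ordering
producer.send(new ProducerRecord<>("user-events", null, serialize(event)));
```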

Beware of hot keys: if 10% of traffic comes from one user (a bot or power user), that user's partition becomes a bottleneck.

3. Schema management

Use a schema registry (Confluent Schema Registry, Apicurio) with Avro, Protobuf, or JSON Schema. Enforce schema compatibility modes:

  • BACKWARD: New schemas can read old data (safe for consumer upgrades).
  • FORWARD: Old schemas can read new data (safe for producer upgrades).
  • FULL: Both backward and forward compatible (safest, most restrictive).

4. Retention and compaction

  • Time-based retention (retention.ms): For event logs and transactional data. Delete messages older than the retention period.
  • Log compaction (cleanup.policy=compact): For state snapshots. Kafka keeps the latest value for each key, deleting older values. Useful for maintaining a materialized view of current state (e.g., the latest config for each service).

For a comprehensive treatment of event-driven architecture, see our concepts guide on event-driven architecture and microservices communication patterns.


Question 6: Explain Kafka's offset management. How do you handle offset commits in failure scenarios?

What the interviewer is really asking: Offset management is where theory meets production reality. Can you reason about the failure modes of automatic vs. manual offset commits and design a strategy that matches your application's durability requirements?

Answer framework:

An offset is a sequential identifier for a record within a partition. Kafka stores committed offsets in a special internal topic called __consumer_offsets. When a consumer restarts, it resumes from its last committed offset.

Auto-commit (default):

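A configuration fragment:

```java
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");  // the default interval
```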

Kafka automatically commits offsets every 5 seconds (by default). Problem: if the consumer crashes after processing records but before the next auto-commit, those records are reprocessed on restart. If it crashes after an auto-commit but before finishing processing of the polled records, those records are lost.

Manual synchronous commit:

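A sketch (running and process() are illustrative):

```java
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);
    }
    consumer.commitSync();  // blocks until the broker confirms the commit
}
```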

Pros: Precise control over when offsets are committed. Cons: commitSync() blocks, adding latency.

Manual asynchronous commit:

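A sketch combining async commits with a synchronous safety net on shutdown:

```java
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);
        }
        consumer.commitAsync((offsets, exception) -> {
            if (exception != null) {
                log.warn("Async offset commit failed for {}", offsets, exception);
            }
        });
    }
} finally {
    try {
        consumer.commitSync();   // one final blocking commit before shutdown
    } finally {
        consumer.close();
    }
}
```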

Pros: Non-blocking. Cons: If an async commit fails, a later commit might succeed with a higher offset, effectively skipping the failed range. Use async for performance with a synchronous commit in the shutdown hook as a safety net.

Per-partition offset commit:

For fine-grained control, commit offsets per partition as each partition's batch is processed:

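A sketch of the per-partition pattern:

```java
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        process(record);
    }
    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    // The committed offset is the *next* offset to read, hence +1
    consumer.commitSync(Collections.singletonMap(
            partition, new OffsetAndMetadata(lastOffset + 1)));
}
```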

External offset storage: For exactly-once processing to an external system, store offsets alongside the output data in a single atomic transaction:

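A sketch (db, offsetFromDb(), and the transaction API are illustrative placeholders for your data access layer):

```java
consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Resume from offsets stored in the database, not __consumer_offsets
        for (TopicPartition tp : partitions) {
            consumer.seek(tp, offsetFromDb(tp));
        }
    }
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
});

while (running) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
        db.inTransaction(tx -> {
            tx.insertResult(process(record));                             // output row
            tx.upsertOffset(record.topic(), record.partition(), record.offset() + 1);
        });  // both writes commit atomically, or neither does
    }
}
```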

This pattern eliminates the gap between processing and committing because both happen in the same database transaction.


Question 7: How does Kafka handle broker failures and leader election? What is the role of the ISR?

What the interviewer is really asking: Can you explain Kafka's availability and durability model? The ISR mechanism is central to understanding Kafka's consistency guarantees, and misunderstanding it leads to data loss or availability problems in production.

Answer framework:

Every partition has one leader replica and zero or more follower replicas. All reads and writes go to the leader. Followers replicate data from the leader by fetching messages.

The In-Sync Replica (ISR) set contains replicas that are "in sync" with the leader. A replica is considered in-sync if it has fetched messages from the leader within replica.lag.time.max.ms (default 30 seconds). If a follower falls behind, it is removed from the ISR.

Durability guarantees based on acks:

  • acks=0: Producer does not wait for any acknowledgment. Fastest but messages can be lost.
  • acks=1: Producer waits for the leader to write to its local log. If the leader crashes before followers replicate, the message is lost.
  • acks=all (or -1): Producer waits for all ISR replicas to acknowledge. Combined with min.insync.replicas=2 (on a 3-replica topic), this guarantees that at least 2 replicas have the message before the producer receives an acknowledgment.

Leader election process:

  1. The controller (a designated broker, or the KRaft quorum in Kafka 3.3+) detects that a broker has failed.
  2. For each partition where the failed broker was leader, the controller selects a new leader from the ISR.
  3. The new leader begins serving reads and writes. Followers start replicating from the new leader.

Unclean leader election: If all ISR replicas fail and unclean.leader.election.enable=true, Kafka can elect an out-of-sync replica as leader. This preserves availability but risks data loss (the new leader is missing messages). The default is false, prioritizing data integrity over availability.

KRaft mode (replacing ZooKeeper): Starting with Kafka 3.3 (GA in 3.5), Kafka uses its own Raft-based consensus protocol for metadata management, eliminating the ZooKeeper dependency. The KRaft controller quorum handles broker registration, topic metadata, and leader election with improved performance and simpler operations.

For more on consensus and leader election in distributed systems, see our concepts guide on Raft consensus and interview questions on distributed systems.


Question 8: What is Kafka Streams and how does it differ from other stream processing frameworks?

What the interviewer is really asking: Do you understand the architectural difference between Kafka Streams (a library) and frameworks like Apache Flink or Spark Streaming (cluster-based systems)? Can you articulate when Kafka Streams is the right choice and when you need a more powerful framework?

Answer framework:

Kafka Streams is a client library for building stream processing applications. Unlike Flink or Spark, it does not require a separate cluster or resource manager. Your application is a regular Java/Kotlin process that reads from Kafka, processes data, and writes back to Kafka.

Key architectural properties:

  1. Library, not a framework. You deploy it as part of your application using your existing deployment infrastructure (Kubernetes, ECS, etc.). No separate processing cluster to manage.

  2. Parallel processing via partitions. Each application instance processes a subset of partitions. Scaling is done by running more instances of your application, up to the number of partitions.

  3. State stores. Kafka Streams provides local state stores (backed by RocksDB) for stateful operations like aggregations, joins, and windowing. State is backed up to Kafka changelog topics for fault tolerance.

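A short DSL example (topic names are illustrative; assumes default String serdes):

```java
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> views = builder.stream("page-views");

// Count views per user in 5-minute tumbling windows; state lives in a local
// RocksDB store, backed up to a changelog topic in Kafka for fault tolerance
views.groupByKey()
     .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
     .count()
     .toStream()
     .map((windowedUser, count) -> KeyValue.pair(windowedUser.key(), count.toString()))
     .to("view-counts");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
```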

When to choose Kafka Streams:

  • Your input and output are both Kafka topics.
  • You want simple deployment without managing a separate processing cluster.
  • Your processing logic is moderate complexity (filtering, mapping, aggregations, simple joins).
  • You need exactly-once processing within the Kafka ecosystem.

When to choose Flink or Spark:

  • You need advanced windowing, complex event processing, or sophisticated watermark handling.
  • You need to read from or write to non-Kafka sources (databases, files, APIs) as a first-class operation.
  • You need processing capabilities that go beyond what Kafka Streams provides (e.g., iterative algorithms, batch-stream unification).
  • You need independent scaling of processing resources from your application.

Kafka Streams DSL vs. Processor API: The DSL (shown above) provides a high-level functional API for common operations. The Processor API provides low-level access for custom processing logic, custom state stores, and fine-grained control over scheduling and punctuation.

For comparisons with other processing frameworks, see our comparison of stream processing frameworks and concepts guide on stream processing.


Question 9: Explain Kafka Connect and how you would use it to build a CDC pipeline.

What the interviewer is really asking: Can you describe Kafka Connect's architecture and articulate how it fits into a data pipeline that captures changes from databases and propagates them to downstream systems? This tests your understanding of practical data integration patterns.

Answer framework:

Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom producer/consumer code. It provides:

  • Source connectors: Read data from external systems into Kafka (databases, files, APIs).
  • Sink connectors: Write data from Kafka to external systems (databases, search indexes, data warehouses).
  • Workers: JVM processes that execute connector tasks. Can run in standalone mode (single process) or distributed mode (cluster of workers with automatic task distribution and fault tolerance).

CDC pipeline architecture:

Change Data Capture reads the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and streams every row-level change as a Kafka event.

Debezium configuration example:

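A sketch for a PostgreSQL source (hostnames, credentials, and table names are illustrative; the RegexRouter transform strips the schema segment so topic names take the cdc.orders form described below):

```json
{
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "example-password",
    "database.dbname": "orders",
    "plugin.name": "pgoutput",
    "topic.prefix": "cdc",
    "table.include.list": "public.orders,public.customers",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "cdc\\.public\\.(.*)",
    "transforms.route.replacement": "cdc.$1"
  }
}
```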

This creates Kafka topics like cdc.orders and cdc.customers, each receiving insert, update, and delete events with before/after snapshots of the row.

Key operational considerations:

  1. Schema evolution. Use the schema registry with Debezium's Avro converter to handle DDL changes gracefully.
  2. Snapshotting. On first start, Debezium takes a consistent snapshot of the existing data before switching to streaming mode. For large tables, this can take hours.
  3. Single Message Transforms (SMTs). Kafka Connect supports lightweight transformations (filtering, routing, field renaming) without requiring a separate stream processing step.
  4. Dead letter queues. Configure errors.tolerance=all with a dead letter queue topic to capture records that fail transformation or serialization, preventing one bad record from blocking the pipeline.
  5. Exactly-once in Connect. Kafka Connect supports exactly-once source connector delivery (Kafka 3.3+) by committing source offsets transactionally with the produced messages.

For more on CDC patterns, see our concepts guide on change data capture and how event sourcing works.


Question 10: How would you handle message ordering guarantees across multiple partitions?

What the interviewer is really asking: Kafka guarantees ordering within a partition but not across partitions. Can you design around this constraint for use cases that require ordering across logically related events that might span partitions?

Answer framework:

The fundamental constraint: Kafka guarantees that messages within a single partition are consumed in the order they were produced. There is no ordering guarantee across partitions. This means that if related messages land in different partitions, a consumer may see them out of order.

Strategy 1: Use the right partition key.

The simplest and most effective approach. If all events for a given entity use the entity's ID as the Kafka key, they will always go to the same partition and be processed in order.

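For example (fragment; names illustrative):

```java
// Every event for a given order carries the order ID as its key, so all of
// them land on the same partition and are consumed in order
producer.send(new ProducerRecord<>("order-events", event.getOrderId(), serialize(event)));
```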

Strategy 2: Sequence numbers in the payload.

When ordering across partitions is unavoidable, embed a sequence number in each message. Consumers buffer out-of-order messages and process them in sequence order.

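A sketch of a per-entity resequencing buffer (Event and process() are illustrative):

```java
Map<String, Long> nextSeq = new HashMap<>();
Map<String, TreeMap<Long, Event>> pending = new HashMap<>();

void onEvent(String entityId, long seq, Event event) {
    pending.computeIfAbsent(entityId, k -> new TreeMap<>()).put(seq, event);
    TreeMap<Long, Event> buffer = pending.get(entityId);
    long expected = nextSeq.getOrDefault(entityId, 0L);
    // Drain the buffer as long as the next expected sequence number is present
    while (!buffer.isEmpty() && buffer.firstKey() == expected) {
        process(buffer.pollFirstEntry().getValue());
        expected++;
    }
    nextSeq.put(entityId, expected);
}
```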

Strategy 3: Single-partition topics for strict global ordering.

If you absolutely need global ordering (e.g., a financial ledger), use a single-partition topic. This sacrifices parallelism entirely and limits throughput to what a single consumer can handle.

Strategy 4: Kafka Streams with timestamp-based processing.

Kafka Streams assigns every record an event timestamp, and windowed operations group records by that timestamp rather than by arrival order. For many use cases, "events aggregated into the correct event-time window" is sufficient even if they arrive out of order:

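For example, a windowed aggregation with a grace period (fragment; assumes a KStream named orders):

```java
// Records up to 1 minute late by event time still fall into the window they belong to
orders.groupByKey()
      .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
      .count();
```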

Strategy 5: Outbox pattern for transactional ordering.

When a microservice needs to publish events in the same order as database transactions, use the transactional outbox pattern: write events to an outbox table within the same database transaction, then publish them to Kafka in order from that table using a single-threaded poller or Debezium.

The key insight: most ordering problems are solved by choosing the right partition key. If you find yourself needing cross-partition ordering, reconsider whether your topic design or key selection is wrong before reaching for complex reordering solutions.


Question 11: What is log compaction and when would you use it?

What the interviewer is really asking: Can you explain how compacted topics provide a different retention model and when that model is the right choice? This tests whether you understand Kafka as a storage system, not just a transport.

Answer framework:

Log compaction is a retention policy where Kafka keeps the latest value for each key in a topic, removing older values for the same key. Unlike time-based retention (which deletes all messages older than a threshold), compaction preserves at least one record per key indefinitely.

Note that the latest record for each key is retained, and offsets are preserved (not renumbered).

Deletion with tombstones: To delete a key from a compacted topic, produce a record with that key and a null value (a tombstone). The compactor will eventually remove the tombstone and all prior records for that key.

Use cases for log compaction:

  1. Changelog topics for Kafka Streams state stores. Each key is a state store key, and the latest value is the current state. On recovery, a Kafka Streams application replays the compacted topic to rebuild its local state store.

  2. Configuration distribution. A topic where each key is a service name and the value is its current configuration. New service instances consume the entire topic to get the latest config for every service.

  3. Materialized views. A compacted topic can serve as a key-value store that consumers use to build lookup tables. For example, a topic keyed by user_id with the latest user profile as the value.

  4. CDC snapshots. Debezium can produce to compacted topics so that new consumers get the current state of every row without needing an initial database snapshot.

Configuration:

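A sketch using the stock CLI (topic name and thresholds are illustrative):

```bash
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic service-configs \
  --partitions 6 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5 \
  --config delete.retention.ms=86400000
```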
  • min.cleanable.dirty.ratio: Compaction triggers when the ratio of "dirty" (uncompacted) to total log size exceeds this threshold.
  • delete.retention.ms: How long tombstone records are retained after compaction (to ensure slow consumers see deletions).

Pitfall: Compaction is not immediate. There is always a "dirty" section of the log (recent writes) that has not yet been compacted. Consumers reading from the tail of a compacted topic will see all records, including duplicates for the same key. Only the compacted portion guarantees one record per key.


Question 12: How would you monitor a Kafka cluster in production? What are the key metrics?

What the interviewer is really asking: Have you actually operated Kafka in production? Can you identify the metrics that predict problems before they cause outages, and do you know the difference between symptoms (high latency) and causes (ISR shrinkage, under-replicated partitions)?

Answer framework:

Kafka monitoring falls into four categories: broker health, producer metrics, consumer metrics, and topic/partition metrics.

Broker health metrics:

  • Under-replicated partitions (kafka.server:UnderReplicatedPartitions): The number of partitions where the ISR is smaller than the replication factor. This is the single most important broker metric. Non-zero values indicate broker issues, network problems, or disk I/O bottlenecks.
  • Active controller count (kafka.controller:ActiveControllerCount): Exactly one broker should be the controller. If zero, there is a leadership vacuum. If more than one, there is a split-brain (critical).
  • Request handler idle ratio (kafka.server:RequestHandlerAvgIdlePercent): Measures how much time request handler threads spend idle. Below 20% indicates the broker is CPU-saturated.
  • Log flush latency (kafka.log:LogFlushRateAndTimeMs): Spikes indicate disk I/O problems that will affect producer latency and replication lag.

Producer metrics:

  • Record error rate (kafka.producer:record-error-rate): Non-zero indicates the producer is failing to deliver messages.
  • Request latency (kafka.producer:request-latency-avg): p99 latency for produce requests. Increases correlate with broker load or replication delays.
  • Batch size (kafka.producer:batch-size-avg): Small batches indicate the producer is sending too frequently. Tune linger.ms and batch.size to improve throughput.

Consumer metrics:

  • Consumer lag (kafka.consumer:records-lag-max): The difference between the latest offset in the partition and the consumer's committed offset. This is the most critical consumer metric. Growing lag means the consumer cannot keep up.
  • Commit latency (kafka.consumer:commit-latency-avg): High commit latency can indicate __consumer_offsets topic issues.
  • Rebalance rate and duration: Frequent rebalances indicate unstable consumers (crashes, long GC pauses, heartbeat timeouts).

Topic/partition metrics:

  • Messages in per second per topic: Track ingestion rate to detect traffic spikes or drops.
  • Bytes in/out per broker: Ensure even distribution. Uneven distribution indicates hot partitions.
  • Partition count per broker: Should be roughly even across brokers.

Alerting thresholds (starting points):

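A sketch in Prometheus alerting-rule form (metric names depend on your JMX exporter mapping and lag exporter; treat these as illustrative):

```yaml
groups:
  - name: kafka-alerts
    rules:
      - alert: UnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
      - alert: NoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
      - alert: ConsumerLagGrowing
        expr: max by (group) (kafka_consumergroup_group_lag) > 100000
        for: 10m
```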

Use tools like Prometheus with the JMX exporter, Grafana dashboards, and Burrow (LinkedIn's consumer lag monitoring tool) for comprehensive Kafka observability.


Question 13: How do you handle schema evolution in Kafka? What happens when a producer starts sending a new version of a message?

What the interviewer is really asking: Schema evolution is where data engineering meets software engineering. Can you design a schema management strategy that allows producers and consumers to evolve independently without breaking each other?

Answer framework:

Without schema management, Kafka is just moving bytes. Schema evolution requires three components: a serialization format, a schema registry, and compatibility rules.

Serialization formats:

  • Avro: Schema is defined separately from data. Compact binary encoding. Excellent schema evolution support. The most common choice in the Kafka ecosystem.
  • Protobuf: Strong typing, code generation, good evolution support. Growing adoption, especially in gRPC-heavy environments.
  • JSON Schema: Human-readable, easier to debug, but larger message size and less mature evolution tooling.

Schema Registry:

The Confluent Schema Registry (or alternatives like Apicurio) stores schemas versioned by subject (typically <topic>-value or <topic>-key). Producers register schemas before sending messages. Consumers fetch schemas to deserialize messages.

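A producer configuration fragment (assumes Confluent's Avro serializer; the registry URL is illustrative):

```java
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");
// The serializer registers the schema (if new) and embeds its ID in each message
```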

Compatibility modes:

Mode     | Rule                         | Use case
---------|------------------------------|------------------------
BACKWARD | New schema can read old data | Consumer upgrades first
FORWARD  | Old schema can read new data | Producer upgrades first
FULL     | Both backward and forward    | Independent upgrades
NONE     | No compatibility check       | Development only

Safe schema evolution operations (under FULL compatibility):

  • Adding a field with a default value
  • Removing a field that has a default value

Unsafe operations (break compatibility):

  • Removing a required field
  • Changing a field's type
  • Renaming a field (Avro treats this as remove + add)

Practical workflow:

  1. Developer proposes a schema change.
  2. CI/CD pipeline tests the new schema against the registry's compatibility rules.
  3. If compatible, the new schema version is registered.
  4. Producer deploys with the new schema. Old consumers continue reading (BACKWARD compatible).
  5. Consumers deploy with support for the new schema at their own pace.
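For example, a CI step against the registry's REST API (subject and file names are illustrative; the payload file wraps the candidate schema as a JSON string):

```bash
curl -s -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data @new-schema.json \
  http://schema-registry:8081/compatibility/subjects/user-events-value/versions/latest
# => {"is_compatible": true}
```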

For more on schema management in distributed systems, see our concepts guide on schema evolution and interview questions on API versioning.


Question 14: Describe a scenario where Kafka can lose messages. How do you prevent it?

What the interviewer is really asking: This is a trick question in some sense, because Kafka is often described as durable. Can you identify the specific configurations and failure modes where message loss can occur, demonstrating that you understand the durability model rather than just trusting defaults?

Answer framework:

Kafka can lose messages in several scenarios, all of which are preventable through proper configuration:

Scenario 1: Producer with acks=1 and leader failure.

The producer gets an acknowledgment after the leader writes the message but before followers replicate it. If the leader crashes immediately after, the message exists only on the failed broker. When a follower is elected leader, the message is gone.

Prevention: acks=all + min.insync.replicas=2 on a topic with replication factor 3.

Scenario 2: Unclean leader election.

All ISR replicas fail, and unclean.leader.election.enable=true allows an out-of-sync replica to become leader. This replica is missing messages that the previous leader had accepted.

Prevention: unclean.leader.election.enable=false (default since Kafka 0.11). Accept unavailability over data loss.

Scenario 3: Producer fire-and-forget.

The producer sends a message with acks=0 or does not check the future returned by send(). Network errors, serialization errors, or broker rejections are silently lost.

Prevention: Always handle the send() callback or future:

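A fragment showing both styles:

```java
// Asynchronous: react to failures in the callback
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // Surface the failure: log, alert, or divert to a local retry/DLQ path
        log.error("Failed to produce to {}", record.topic(), exception);
    }
});

// Synchronous: block on the future and let the exception propagate
RecordMetadata metadata = producer.send(record).get();  // throws on delivery failure
```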

Scenario 4: Consumer auto-commit with crash.

Auto-commit advances the offset before the consumer finishes processing. On crash, the unprocessed messages are skipped.

Prevention: Disable auto-commit. Commit offsets only after successful processing.

Scenario 5: Disk failure on all replicas.

If all replicas (including ISR) lose data due to correlated failures (same rack, same storage system), messages are lost regardless of Kafka configuration.

Prevention: Spread replicas across failure domains using broker.rack configuration. Use min.insync.replicas across racks. Consider tiered storage to S3/GCS for long-term durability.

Scenario 6: Log retention deletes unprocessed messages.

If a consumer is offline longer than the topic's retention period, messages are deleted before the consumer can read them.

Prevention: Monitor consumer lag aggressively. Set retention periods that exceed your maximum expected consumer downtime. Use tiered storage for extended retention without broker disk pressure.

The production-hardened configuration:

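One reasonable baseline (mixing producer, topic/broker, and consumer settings; tune for your workload):

```properties
# Producer
acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5

# Topic / broker
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Consumer
enable.auto.commit=false
isolation.level=read_committed
```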

Question 15: You need to migrate a high-throughput system from RabbitMQ to Kafka. What is your approach?

What the interviewer is really asking: Can you plan a complex migration that involves changing the fundamental messaging paradigm (push vs. pull, queues vs. logs) while maintaining system availability? This tests architectural thinking, risk management, and operational maturity.

Answer framework:

This migration is not just swapping one broker for another. RabbitMQ and Kafka have fundamentally different models: RabbitMQ pushes messages to consumers and deletes them after acknowledgment; Kafka retains messages and consumers pull at their own pace. The migration must account for these semantic differences.

Phase 1: Assessment and design (2-4 weeks)

  • Inventory all RabbitMQ exchanges, queues, bindings, and consumers. Map each to a Kafka topic and consumer group.
  • Identify patterns that do not have direct Kafka equivalents: RabbitMQ's routing keys map to Kafka headers or message keys; fanout exchanges map to multiple consumer groups reading the same topic; priority queues have no Kafka equivalent (consider separate topics per priority).
  • Design the Kafka topic structure, partition counts, and key selection based on the actual message volumes and ordering requirements.

Phase 2: Dual-write bridge (2-3 weeks)

  • Deploy Kafka alongside RabbitMQ. Build a bridge that writes every message to both systems.
  • Option A: Modify producers to write to both RabbitMQ and Kafka.
  • Option B: Use a RabbitMQ-to-Kafka bridge (a Kafka Connect RabbitMQ source connector, or a custom bridge consumer that reads from RabbitMQ and produces to Kafka).
  • Validate that message counts, ordering, and content match between systems.

Phase 3: Shadow consumers (2-4 weeks)

  • Deploy Kafka consumers in shadow mode: they read and process messages from Kafka but do not perform side effects (no database writes, no API calls). Compare their outputs with the production RabbitMQ consumers.
  • Fix any discrepancies. Common issues: message ordering differences, serialization format changes, consumer group rebalancing behavior.

Phase 4: Incremental migration (4-8 weeks)

  • Migrate consumers one at a time from RabbitMQ to Kafka. Start with the least critical consumers.
  • For each consumer: switch it from RabbitMQ to Kafka, monitor for one week, then move to the next.
  • Keep the dual-write bridge active throughout this phase.

Phase 5: Producer migration (2-4 weeks)

  • Once all consumers are reading from Kafka, migrate producers to write directly to Kafka.
  • Remove the dual-write bridge.
  • Decommission RabbitMQ.

Risk mitigation:

  • Maintain the ability to roll back to RabbitMQ at every phase.
  • Use feature flags to switch individual services between RabbitMQ and Kafka consumers.
  • Monitor consumer lag, error rates, and end-to-end latency at each phase.

For more on messaging system comparisons, see our detailed comparison of Kafka vs RabbitMQ and our guide on system design interviews. To explore structured preparation plans, visit our pricing page.


How to Practice

Kafka proficiency comes from hands-on experience with real clusters. Here is a structured approach:

  1. Run a local cluster. Use Docker Compose to run a 3-broker Kafka cluster with KRaft mode, Schema Registry, and Kafka Connect. Practice creating topics, producing and consuming messages, and triggering rebalances by stopping containers.
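A trimmed Docker Compose sketch (one broker shown; duplicate the service with new node IDs and ports for brokers 2 and 3; the image tag and cluster ID are illustrative):

```yaml
services:
  kafka-1:
    image: apache/kafka:3.7.0
    ports:
      - "9092:9092"
    environment:
      CLUSTER_ID: "abcdefghijklmnopqrstuw"   # any valid 22-character base64 ID
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093"
      KAFKA_LISTENERS: "PLAINTEXT://:9092,CONTROLLER://:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka-1:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
  # Schema Registry and Kafka Connect services omitted for brevity
```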
  2. Build a consume-transform-produce pipeline. Read from one topic, enrich the data, write to another topic. Implement exactly-once semantics with transactions. Deliberately inject failures (kill the consumer mid-processing) and verify that no messages are lost or duplicated.

  3. Experiment with partition rebalancing. Start a consumer group with 3 consumers on a 6-partition topic. Add and remove consumers to observe rebalance behavior. Compare eager vs. cooperative rebalancing.

  4. Practice monitoring. Set up Prometheus + Grafana with the JMX exporter. Create dashboards for under-replicated partitions, consumer lag, and producer latency. Simulate problems (slow disk, network partition) and observe how metrics respond.

  5. Read production post-mortems. Uber, LinkedIn, and Netflix have published detailed accounts of Kafka incidents. Study what went wrong and how it was detected and resolved.

For structured interview preparation with real-time feedback, check out our interview preparation plans and system design interview guide.


Common Mistakes to Avoid

  1. Treating Kafka like a traditional message queue. Kafka is a distributed log, not a queue. If you describe Kafka in terms of "pushing messages to consumers" or "messages being deleted after consumption," the interviewer will doubt your production experience.

  2. Ignoring partition key design. Saying "we will just use random partitioning" shows a lack of thought about ordering and consumer affinity. Always have a rationale for your key selection.

  3. Claiming exactly-once is free. Exactly-once semantics require specific configuration (idempotent producers, transactions, read_committed isolation), add latency, and only work within the Kafka ecosystem. Acknowledging these limitations demonstrates maturity.

  4. Overlooking consumer group rebalancing. Many candidates describe producing and consuming but cannot explain what happens during a rebalance or how to minimize its impact. This is a red flag for production readiness.

  5. Not mentioning monitoring. A Kafka design without a monitoring strategy is incomplete. Always discuss how you would detect consumer lag, under-replicated partitions, and broker health issues.

  6. Conflating partitions with replicas. Partitions provide parallelism and ordering. Replicas provide durability and availability. They are orthogonal concepts. Mixing them up signals a fundamental misunderstanding.

  7. Using default configurations without justification. Saying "we will use the defaults" for retention, replication factor, or acks shows a lack of intentionality. In an interview, explicitly state your configuration choices and why they are appropriate for the use case.

  8. Forgetting about schema evolution. Producing messages without schema management is a ticking time bomb. Any serious Kafka deployment needs a serialization strategy and compatibility rules. Bring this up before the interviewer has to ask.

GO DEEPER

Master this topic in our 12-week cohort

Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.