Distributed Systems Interview Questions for Senior Engineers (2026)
Top distributed systems interview questions with detailed answer frameworks covering consensus, replication, partitioning, consistency models, and failure handling for FAANG interviews.
Why Distributed Systems Knowledge Matters in Interviews
Distributed systems form the backbone of every large-scale application. When you interact with any major platform, your request touches dozens of services running across multiple data centers. Senior engineering candidates are expected to understand the fundamental principles that govern these systems: how data is replicated, how consensus is achieved, how failures are detected and handled, and what trade-offs exist between consistency, availability, and partition tolerance.
Google, Meta, Amazon, and Netflix all heavily test distributed systems knowledge. The questions in this domain separate engineers who have read about concepts from those who have built and operated real systems. Interviewers look for precise understanding of failure modes, concrete numbers for latency and throughput, and the ability to reason about edge cases that only appear at scale.
This guide covers the most frequently asked distributed systems questions with frameworks that help you structure your thinking. For foundational concepts, review our distributed systems fundamentals. For hands-on preparation, explore our learning paths and the system design interview guide.
1. Explain the CAP theorem and its practical implications
What the interviewer is really asking: Do you truly understand CAP beyond the textbook definition, and can you apply it to real system design decisions?
Answer framework:
Start with the precise definition: in a distributed system experiencing a network partition (P), you must choose between consistency (C, every read returns the most recent write) and availability (A, every request receives a response). You cannot have both during a partition.
The key insight interviewers want: CAP is often misunderstood. It does not say you pick two of three. Partitions are not optional in distributed systems; they will happen. So the real choice is between CP (consistent but may be unavailable during partitions) and AP (available but may return stale data during partitions). When there is no partition, you can have both C and A.
Discuss real examples. Amazon's Dynamo (the design behind DynamoDB) is AP: during a partition, it accepts writes on both sides and reconciles later using vector clocks. ZooKeeper is CP: during a partition, the minority side stops accepting writes, ensuring consistency but sacrificing availability for those nodes. A banking system chooses CP because showing an incorrect balance is worse than temporary unavailability. A social media feed chooses AP because showing a slightly stale feed is better than an error page.
Discuss the PACELC extension: even when there is no partition (the normal case), there is a trade-off between latency and consistency. This is often more relevant in practice since partitions are rare but the latency-consistency trade-off affects every request.
Common mistake: saying a system "is CA." In a distributed system, partitions will occur. A single-node database is CA but it is not distributed. Another mistake: treating CAP as a binary choice for the entire system. Different parts of the same system can make different CAP trade-offs. Your user profile might be CP while your news feed is AP.
Follow-up questions:
- How does the PACELC theorem extend CAP?
- Can you give an example of a system that switches between CP and AP behavior?
- How do cloud providers handle CAP in multi-region deployments?
2. How does Raft consensus work and why is it preferred over Paxos?
What the interviewer is really asking: Can you explain a consensus protocol clearly enough that the interviewer learns something, and do you understand the practical considerations of implementing it?
Answer framework:
Raft is a consensus algorithm designed for understandability. It decomposes consensus into three sub-problems: leader election, log replication, and safety.
Leader election: one node is the leader at any time. Nodes start as followers with a randomized election timeout (150-300ms). If a follower does not receive a heartbeat before timeout, it becomes a candidate and requests votes. It needs a majority to win. The randomized timeout prevents split votes (though they can still happen, triggering a new election round). Each election increments the term number, which acts as a logical clock.
Log replication: the leader receives client requests, appends them to its log, and replicates to followers via AppendEntries RPCs. A log entry is committed when a majority of nodes acknowledge it. Once committed, the leader applies it to its state machine and responds to the client. Followers apply committed entries in order.
Safety: a candidate can only win an election if its log is at least as up-to-date as a majority of nodes. This ensures the elected leader always has all committed entries. This is the key safety property that prevents data loss.
Why Raft over Paxos: Paxos is notoriously difficult to understand and implement correctly. Raft achieves the same safety guarantees with a design that is easier to reason about. The strong leader model simplifies log replication (no concurrent proposals). The clear state machine (follower, candidate, leader) makes implementation straightforward. In practice, most modern systems use Raft: etcd, CockroachDB, TiKV, and Consul.
Discuss practical considerations: cluster size (typically 3 or 5 nodes for the trade-off between fault tolerance and performance), performance impact of consensus (every write requires a majority round-trip), and read scaling (follower reads with lease-based consistency or linearizable reads that go through the leader).
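A minimal sketch of the follower-to-candidate transition with randomized election timeouts, assuming an in-process cluster where nodes call each other directly; the Node class and request_vote method are illustrative, not taken from any real Raft implementation, and the log up-to-date check is omitted for brevity.

```python
import random
import time

ELECTION_TIMEOUT_RANGE = (0.150, 0.300)  # seconds, the range suggested in the Raft paper

class Node:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                # other Node objects in the cluster
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        self.last_heartbeat = time.monotonic()
        self.election_timeout = random.uniform(*ELECTION_TIMEOUT_RANGE)

    def on_heartbeat(self, term):
        # A heartbeat from a current or newer leader resets the election timer.
        if term >= self.current_term:
            self.current_term = term
            self.state = "follower"
            self.last_heartbeat = time.monotonic()
            self.election_timeout = random.uniform(*ELECTION_TIMEOUT_RANGE)

    def request_vote(self, term, candidate_id):
        # Grant at most one vote per term (the log up-to-date check is omitted here).
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def maybe_start_election(self):
        # Followers convert to candidates when no heartbeat arrives before the timeout.
        if self.state == "leader":
            return
        if time.monotonic() - self.last_heartbeat < self.election_timeout:
            return
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # vote for self
        for peer in self.peers:
            if peer.request_vote(self.current_term, self.node_id):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:   # strict majority of the full cluster
            self.state = "leader"
```

The randomized timeout is what makes simultaneous candidacies (and therefore split votes) unlikely; the term increment on each candidacy is what lets nodes discard messages from stale leaders.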
Follow-up questions:
- What happens during a network partition in a 5-node Raft cluster?
- How does Raft handle log compaction and snapshotting?
- What is the latency impact of Raft consensus on write operations?
3. Explain consistency models from strongest to weakest
What the interviewer is really asking: Do you understand the spectrum of consistency guarantees and can you choose the right one for a given use case?
Answer framework:
Walk through the spectrum from strongest to weakest, with examples for each.
Linearizability (strongest): every operation appears to execute atomically at some point between its invocation and response. All clients see the same ordering. Use case: distributed locks (you need to know exactly who holds the lock). Cost: high latency since every read must go to the leader or a quorum.
Sequential consistency: all clients see the same ordering, and each client's operations appear in the order they were issued. But the global ordering does not need to respect real-time. Slightly cheaper than linearizability.
Causal consistency: if operation A causally precedes operation B (A happened before B and B could depend on A), then all clients see A before B. But concurrent operations (neither caused the other) can be seen in different orders by different clients. Use case: social media comments (a reply should always appear after the comment it replies to). Implementations use vector clocks or Lamport timestamps to track causality.
Eventual consistency (weakest): if no new writes occur, all replicas will eventually converge to the same value. No guarantee on ordering or timing. Use case: DNS, CDN caches, shopping cart item counts. Very low latency since reads can go to any replica.
Discuss read-your-writes consistency as a practical middle ground: a user always sees their own writes, even if other users see stale data. This can be achieved by routing a user's reads to the same replica that handled their writes, or by using a version token.
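A minimal sketch of read-your-writes using a version token, assuming a single primary that assigns monotonically increasing versions; Primary, Replica, and read_your_writes are hypothetical names for illustration.

```python
class Primary:
    """Accepts all writes and assigns a monotonically increasing version number."""
    def __init__(self):
        self.data = {}
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.data[key] = (value, self.version)
        return self.version            # version token handed back to the client

    def read(self, key):
        return self.data.get(key, (None, 0))

class Replica:
    """Lags behind the primary; applied_version tracks how far it has caught up."""
    def __init__(self):
        self.data = {}
        self.applied_version = 0

    def read(self, key):
        return self.data.get(key, (None, 0))

def read_your_writes(key, client_token, replica, primary):
    # Serve from the replica only if it has applied everything up to the client's
    # last write; otherwise fall back to the primary (or a sufficiently fresh replica).
    value, _ = replica.read(key)
    if replica.applied_version >= client_token:
        return value
    value, _ = primary.read(key)
    return value

primary, replica = Primary(), Replica()
token = primary.write("profile:42", "new bio")                   # client keeps this token
print(read_your_writes("profile:42", token, replica, primary))   # replica is stale, so the primary serves "new bio"
```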
Explain that different parts of the same application often use different consistency models. Account balances need strong consistency. Social media likes can use eventual consistency. Product inventory needs strong consistency at checkout but eventual consistency for display pages.
Follow-up questions:
- How would you implement read-your-writes consistency in a multi-region setup?
- What consistency model does DynamoDB use by default and how can you upgrade it?
- How do you test for consistency violations in a distributed system?
4. How would you design a system to detect failures in a distributed environment?
What the interviewer is really asking: Do you understand the impossibility results around failure detection and the practical approaches used in production systems?
Answer framework:
Start with the fundamental challenge: in an asynchronous distributed system, you cannot distinguish between a crashed node and a slow node. This is related to the FLP impossibility result. All failure detectors are imperfect and must trade off between completeness (eventually detecting all failures) and accuracy (not falsely suspecting healthy nodes).
Heartbeat-based detection: the simplest approach. Each node periodically sends a heartbeat to a monitoring service (or to peers). If no heartbeat is received within a timeout, the node is suspected as failed. The challenge is setting the timeout: too short causes false positives (a busy node might miss a heartbeat), too long delays detection. Use adaptive timeouts based on historical heartbeat intervals (similar to TCP retransmission timeout estimation).
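A hedged sketch of an adaptive heartbeat timeout in the spirit of the TCP retransmission-timeout estimator (an exponentially weighted moving average of the interval plus a multiple of its deviation); HeartbeatMonitor and its parameters are illustrative, not from any specific library.

```python
import time

class HeartbeatMonitor:
    """Suspects a node when the gap since its last heartbeat exceeds an adaptive
    threshold of mean interval + k * deviation, both tracked as EWMAs."""

    def __init__(self, alpha=0.125, beta=0.25, k=4, initial_interval=1.0):
        self.alpha = alpha                     # smoothing factor for the mean interval
        self.beta = beta                       # smoothing factor for the deviation
        self.k = k                             # how many deviations of slack to allow
        self.mean = initial_interval
        self.deviation = initial_interval / 2
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        now = time.monotonic()
        sample = now - self.last_heartbeat
        self.last_heartbeat = now
        # Update the deviation before the mean, as the TCP RTO estimator does.
        self.deviation = (1 - self.beta) * self.deviation + self.beta * abs(sample - self.mean)
        self.mean = (1 - self.alpha) * self.mean + self.alpha * sample

    def is_suspected(self):
        timeout = self.mean + self.k * self.deviation
        return time.monotonic() - self.last_heartbeat > timeout
```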
Gossip-based detection (used by Cassandra and many peer-to-peer systems): each node periodically gossips with random peers about which nodes it has heard from recently. This has several advantages: no single point of failure (unlike a centralized monitor), scales well (each node only gossips with a few peers per round), and provides probabilistic guarantees. The phi accrual failure detector used by Cassandra computes a suspicion level rather than a binary up/down decision.
SWIM protocol: combines gossip-style dissemination with direct and indirect probing. When node A's direct probe of node B times out, A asks a few randomly chosen peers to probe B on its behalf. Only if those indirect probes also fail is B suspected and eventually marked as failed. This reduces false positives caused by network issues between specific node pairs.
Discuss what happens after detection: in a consensus-based system, the surviving members run an election. In a load balancing context, the failed node is removed from the rotation. In a storage system, data is re-replicated to maintain the replication factor.
Discuss fault tolerance of the failure detector itself: if the monitoring service crashes, who monitors the monitor? Use a decentralized approach (gossip) or a replicated monitoring service.
Follow-up questions:
- How do you handle gray failures where a node is partially functioning?
- What is the impact of false positives in failure detection?
- How does Kubernetes detect and handle node failures?
5. Explain different data partitioning strategies and their trade-offs
What the interviewer is really asking: Can you reason about data distribution, hot spots, and rebalancing in real systems?
Answer framework:
Data partitioning (sharding) is how you distribute data across multiple nodes to scale beyond a single machine. Three main strategies exist.
Range partitioning: assign contiguous key ranges to partitions (e.g., A-F on node 1, G-L on node 2). Advantages: efficient range queries (all keys in a range are on the same node), natural ordering. Disadvantages: prone to hot spots if access patterns are skewed (e.g., if most users have names starting with S). Used by HBase and BigTable. Mitigate hot spots by choosing keys carefully or adding a hash prefix.
Hash partitioning: hash the key and use the hash to determine the partition (hash(key) mod N, or consistent hashing). Advantages: even distribution regardless of key patterns, no hot spots for random access patterns. Disadvantages: range queries are impossible since nearby keys are scattered. Used by DynamoDB, Cassandra (with the Murmur3 partitioner), and Redis Cluster.
Composite partitioning: combine hash and range. Hash the first part of the key (partition key) for distribution, then use the second part (sort key) for ordering within a partition. This gives even distribution across partitions while supporting range queries within a partition. Used by DynamoDB and Cassandra (compound primary keys).
Discuss rebalancing: when you add or remove nodes, data must move. With modulo hashing, adding a node remaps most keys, which is catastrophic. With consistent hashing, only K/N keys move on average. But even K/N can be a lot of data. Discuss strategies: virtual nodes (many small partitions that can be reassigned individually), pre-splitting (create many more partitions than nodes from the start, rebalance by moving whole partitions), and incremental rebalancing (move data gradually while serving reads from both old and new locations).
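A compact sketch of consistent hashing with virtual nodes, using MD5 as the ring hash purely for illustration; the ConsistentHashRing class and vnodes parameter are hypothetical, not any particular database's implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each physical node owns many points (virtual nodes) on the ring, so adding
    or removing a node only moves roughly K/N keys instead of remapping most keys."""

    def __init__(self, nodes=None, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash_point, node)
        for node in nodes or []:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def get_node(self, key):
        if not self.ring:
            raise ValueError("empty ring")
        point = self._hash(key)
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self.ring, (point, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:1234"))   # deterministic owner for this key
```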
Discuss secondary indexes on partitioned data: local indexes (each partition indexes only its own data, queries must fan out) vs global indexes (a separate partition scheme for the index, updates must propagate across partitions). This is a key database design decision.
Follow-up questions:
- How does DynamoDB handle hot partitions?
- How would you rebalance data without downtime?
- What happens to joins when data is partitioned across nodes?
6. What are vector clocks and how do they solve the causality tracking problem?
What the interviewer is really asking: Do you understand logical time, why physical clocks are insufficient in distributed systems, and the practical implications of causality tracking?
Answer framework:
Physical clocks cannot establish ordering in distributed systems because clock skew is unavoidable (even with NTP, clocks can differ by milliseconds). Two events on different nodes with timestamps 1ms apart could actually have occurred in either order.
Lamport timestamps provide a partial solution: each node maintains a counter, increments it on each event, and includes it in messages. On receiving a message, a node updates its counter to max(local, received) + 1. This gives a happened-before relationship: if event A causally precedes event B, then timestamp(A) < timestamp(B). But the converse is not true: a lower timestamp does not mean the event happened first. Concurrent events cannot be distinguished from causally ordered ones.
Vector clocks solve this. Each node maintains a vector of counters, one per node. Node i increments its own counter on each event. When sending a message, it includes its entire vector. On receiving, a node takes the element-wise maximum of its vector and the received vector, then increments its own counter. To compare two events: if every element of vector A is less than or equal to the corresponding element of vector B, then A happened before B. If neither vector dominates, the events are concurrent.
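A small sketch of the vector clock rules described above; the VectorClock class and compare helper are illustrative names, not from a specific library.

```python
class VectorClock:
    """Per-node vector of counters; comparing two vectors yields happened-before or concurrent."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counters = {}

    def tick(self):
        # Local event: increment this node's own counter.
        self.counters[self.node_id] = self.counters.get(self.node_id, 0) + 1

    def merge(self, other_counters):
        # On message receipt: element-wise maximum, then increment own counter.
        for node, count in other_counters.items():
            self.counters[node] = max(self.counters.get(node, 0), count)
        self.tick()

def compare(a, b):
    """a and b are counter dicts; returns 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

a, b = VectorClock("node-a"), VectorClock("node-b")
a.tick()                                   # event on node-a
b.merge(a.counters)                        # node-b receives a message from node-a
a.tick()                                   # concurrent event on node-a
print(compare(a.counters, b.counters))     # "concurrent"
```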
Practical implications: Amazon Dynamo used vector clocks to detect conflicting writes. When a client reads and gets multiple versions with incomparable vector clocks, the client must resolve the conflict (application-level reconciliation). Dynamo's shopping cart merged conflicting versions with a set union; other applications might use last-writer-wins as a simpler but lossy strategy.
Limitations of vector clocks: they grow linearly with the number of nodes. For systems with thousands of nodes, this is impractical. Solutions include dotted version vectors (more compact) and bounded vector clocks (truncate old entries at the cost of false conflicts).
Discuss alternatives: hybrid logical clocks (HLC) combine physical timestamps with logical counters. They provide the benefits of both: causality tracking like vector clocks and timestamps that are close to real time. Used by CockroachDB and MongoDB.
Follow-up questions:
- When would you choose vector clocks over hybrid logical clocks?
- How does CockroachDB use timestamps for serializable transactions?
- What is the difference between logical time and physical time in practice?
7. How does distributed transaction management work across services?
What the interviewer is really asking: Do you understand two-phase commit, its limitations, and the modern alternatives like sagas that are used in microservices?
Answer framework:
Two-phase commit (2PC): a coordinator asks all participants to prepare (vote yes or no). If all vote yes, the coordinator sends commit. If any votes no, the coordinator sends abort. This provides ACID transactions across multiple databases.
Limitations of 2PC: the coordinator is a single point of failure. If the coordinator crashes after sending prepare but before sending commit or abort, participants are blocked, holding locks indefinitely. This is the blocking problem. 2PC also requires all participants to be available, reducing system availability. In practice, 2PC works within a single data center but is problematic across data centers due to latency and partition risk.
Three-phase commit (3PC): adds a pre-commit phase to make the protocol non-blocking. In theory, this solves the blocking problem. In practice, 3PC is rarely used because it makes assumptions about network timing that do not hold in real systems.
Sagas (the modern approach for microservices): break a distributed transaction into a sequence of local transactions. Each local transaction updates a service and publishes an event. If a step fails, execute compensating transactions for all previously completed steps. Two coordination patterns exist.
Choreography: each service listens for events and reacts. No central coordinator. Simple for small flows but becomes hard to understand and debug as complexity grows. You lose visibility into the overall transaction state.
Orchestration: a central saga orchestrator defines the transaction flow and tells each service what to do. Easier to understand and monitor. The orchestrator maintains the state machine of the saga. More common in practice.
Discuss the limitations of sagas: they provide eventual consistency, not ACID. Compensating transactions can be complex (how do you un-send an email?). You need to handle cases where the compensating transaction itself fails. Design services to be idempotent since messages might be delivered more than once.
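An illustrative sketch of the orchestration pattern for a hypothetical order flow; SagaStep, SagaOrchestrator, and the step names are assumptions for demonstration, and a production orchestrator would also persist saga state and retry failed compensations.

```python
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # callable that performs the local transaction
        self.compensation = compensation  # callable that undoes it

class SagaOrchestrator:
    """Runs steps in order; on failure, runs compensations for completed steps in reverse."""

    def __init__(self, steps):
        self.steps = steps

    def execute(self, context):
        completed = []
        for step in self.steps:
            try:
                step.action(context)
                completed.append(step)
            except Exception:
                for done in reversed(completed):
                    # Compensations must themselves be idempotent and retried until they succeed.
                    done.compensation(context)
                return False
        return True

# Illustrative order flow: reserve inventory, charge payment, create shipment.
saga = SagaOrchestrator([
    SagaStep("reserve_inventory", lambda ctx: ctx.update(reserved=True),
             lambda ctx: ctx.update(reserved=False)),
    SagaStep("charge_payment", lambda ctx: ctx.update(charged=True),
             lambda ctx: ctx.update(charged=False)),
    SagaStep("create_shipment", lambda ctx: ctx.update(shipped=True),
             lambda ctx: ctx.update(shipped=False)),
])
order = {}
print(saga.execute(order), order)
```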
Reference event-driven architecture patterns: use an event store or message queue like Kafka for reliable event delivery between saga participants.
Follow-up questions:
- How would you implement a saga for an e-commerce order flow?
- What happens if a compensating transaction fails?
- How do you handle the lack of isolation in sagas (dirty reads of intermediate states)?
8. Explain the difference between leader-based and leaderless replication
What the interviewer is really asking: Can you articulate the trade-offs between these two fundamental replication approaches and reason about when to use each?
Answer framework:
Leader-based (single-leader) replication: one node (the leader) accepts all writes, applies them to its local storage, and replicates them to followers. Followers serve read requests. This is the most common model, used by PostgreSQL, MySQL, MongoDB (replica sets), and Redis.
Advantages: simple to reason about consistency (all writes are ordered by the leader), no write conflicts, well-understood failure modes. Disadvantages: the leader is a bottleneck for writes, leader failure requires election (brief unavailability), cross-region latency for writes (writes must go to the leader's region).
Multi-leader replication: multiple nodes accept writes, each replicating to the others. Used for multi-data-center deployments where each data center has its own leader. PostgreSQL supports this through extensions such as BDR, and MySQL through group replication or circular replication. CockroachDB uses a variant with Raft groups per range.
Advantages: writes are fast in every region (local leader), no single point of failure. Disadvantages: write conflicts when two leaders modify the same data simultaneously. Conflict resolution is hard: last-writer-wins (simple but data loss), custom resolution logic (complex), or CRDTs (limited data types).
Leaderless replication: any node accepts reads and writes. The client sends writes to multiple nodes simultaneously and reads from multiple nodes, using quorum logic. Used by Dynamo, Cassandra, and Riak.
Advantages: no leader election needed, no single point of failure, every node is equal. Disadvantages: quorum reads and writes add latency (must wait for W or R responses), complex conflict resolution, no total ordering of writes.
Discuss quorum parameters: for N replicas, require W write confirmations and R read confirmations where W + R > N ensures overlap (at least one node has the latest value). Typical configs: W=2, R=2, N=3 (balanced), W=1, R=3 (fast writes, slow reads), W=3, R=1 (slow writes, fast reads).
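A toy sketch of quorum writes and reads over in-memory replicas, assuming a single client assigns versions and conflicts resolve by highest version; Replica and QuorumClient are illustrative names, not Cassandra's or Dynamo's actual code paths.

```python
import random

class Replica:
    def __init__(self):
        self.store = {}   # key -> (version, value)

    def write(self, key, version, value):
        current_version, _ = self.store.get(key, (0, None))
        if version > current_version:
            self.store[key] = (version, value)
        return True

    def read(self, key):
        return self.store.get(key, (0, None))

class QuorumClient:
    """W + R > N guarantees the read set overlaps the most recent successful write."""

    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "quorum overlap requires W + R > N"
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        acks = sum(replica.write(key, self.version, value) for replica in self.replicas)
        # A real client would return success after W acks and fail below W.
        return acks >= self.w

    def get(self, key):
        # Read from R replicas and keep the highest-versioned value.
        responses = [replica.read(key) for replica in random.sample(self.replicas, self.r)]
        return max(responses)[1]

replicas = [Replica() for _ in range(3)]
client = QuorumClient(replicas, w=2, r=2)
client.put("session:9", "active")
print(client.get("session:9"))   # "active"
```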
Discuss CAP theorem implications: leader-based is typically CP (unavailable during leader election), leaderless is typically AP (available but may return stale data if quorum is not met).
Follow-up questions:
- How does Cassandra handle write conflicts with leaderless replication?
- What are sloppy quorums and hinted handoff?
- When would you choose multi-leader over single-leader replication?
9. How would you implement exactly-once message delivery in a distributed system?
What the interviewer is really asking: Do you understand why exactly-once is impossible to guarantee at the network level and how to achieve it effectively through idempotency and deduplication?
Answer framework:
Start with the impossibility: in a distributed system with unreliable networks, true exactly-once delivery is impossible. If a sender sends a message and does not receive an acknowledgment, it cannot know whether the receiver processed the message or not. It must retry, potentially causing duplicate processing. The two practical options are at-most-once (never retry, risk message loss) and at-least-once (always retry, risk duplicates).
The industry solution is at-least-once delivery with idempotent processing, which achieves the effect of exactly-once. Two approaches:
Idempotent consumers: design your message handlers so that processing the same message multiple times produces the same result as processing it once. For database writes, use upsert operations with a unique message ID. For API calls, use idempotency keys. For counter increments, store the set of processed message IDs and skip duplicates.
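A minimal sketch of an idempotent consumer using a processed-messages table, shown with SQLite only so the example is self-contained; the table and message names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
    CREATE TABLE account_balances (account_id TEXT PRIMARY KEY, balance INTEGER);
""")
conn.execute("INSERT INTO account_balances VALUES ('acct-1', 100)")

def handle_payment(message_id, account_id, amount):
    try:
        with conn:  # one transaction: the dedup insert and the business update commit together
            conn.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            conn.execute(
                "UPDATE account_balances SET balance = balance + ? WHERE account_id = ?",
                (amount, account_id),
            )
    except sqlite3.IntegrityError:
        # Duplicate message_id: already processed, safe to acknowledge and skip.
        pass

handle_payment("msg-42", "acct-1", 25)
handle_payment("msg-42", "acct-1", 25)   # redelivery: balance stays at 125
print(conn.execute("SELECT balance FROM account_balances").fetchone())
```

The important property is that the deduplication record and the business change are atomic: if either fails, both roll back, so a retry can safely try again.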
Transactional outbox pattern: instead of directly sending a message, write it to an outbox table in the same database transaction as your business logic. A separate process reads the outbox and publishes to the message queue. If publishing fails, the process retries. The consumer deduplicates using the message ID from the outbox. This ensures that the business logic and message publishing are atomic.
Kafka's approach: Kafka achieves effectively exactly-once for its internal processing using idempotent producers (each message has a sequence number, the broker deduplicates), transactional writes (atomically write to multiple topic-partitions), and consumer offset commits in the same transaction as the output. This is Kafka's "exactly-once semantics" but it only works end-to-end within Kafka. For external systems, you still need idempotent consumers.
Discuss the cost: deduplication requires storing processed message IDs, which uses memory/storage. You need to decide how long to keep deduplication records. Too short, and late retries cause duplicates. Too long, and storage grows unboundedly. Common approach: keep deduplication records for 7 days (matching the maximum retry window).
Follow-up questions:
- How does Kafka implement exactly-once semantics internally?
- How would you implement deduplication at 100K messages per second?
- What is the relationship between idempotency and exactly-once delivery?
10. Explain the split-brain problem and how to prevent it
What the interviewer is really asking: Do you understand one of the most dangerous failure modes in distributed systems and the mechanisms used to prevent data corruption?
Answer framework:
Split-brain occurs when a network partition divides a cluster into two groups, and both groups believe they are the active primary. Both accept writes, leading to data divergence and potential corruption. This is catastrophic for systems that require strong consistency.
Scenario: a 2-node primary-secondary database setup. A network partition separates them. The secondary, unable to reach the primary, promotes itself to primary. Now both accept writes. When the partition heals, conflicting writes must be reconciled, and some data will be lost.
Prevention mechanisms:
Quorum-based approach: require a majority (N/2 + 1) of nodes to agree before any operation. In a 3-node cluster, you need 2 nodes. If a partition splits the cluster into a group of two and a group of one, only the two-node side can operate. The isolated node knows it cannot form a majority and refuses writes. This is why consensus algorithms like Raft use odd-numbered clusters.
Fencing tokens: when a new leader is elected, it receives a monotonically increasing fencing token (epoch number). All write operations include this token. The storage layer rejects writes with a token lower than the highest it has seen. Even if an old leader does not know it has been replaced, its writes will be rejected because it has a stale token.
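A small sketch of fencing-token enforcement at the storage layer; FencedStorage and the epoch values are illustrative.

```python
class FencedStorage:
    """Rejects writes carrying a token older than the highest token seen, so a
    deposed leader that does not yet know it was replaced cannot corrupt data."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token:
            raise PermissionError(
                f"stale fencing token {fencing_token} < {self.highest_token}"
            )
        self.highest_token = fencing_token
        self.data[key] = value

storage = FencedStorage()
storage.write("config", "v1", fencing_token=5)   # current leader, epoch 5
storage.write("config", "v2", fencing_token=6)   # new leader elected with epoch 6
try:
    storage.write("config", "stale", fencing_token=5)   # old leader, unaware it was replaced
except PermissionError as err:
    print(err)
```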
STONITH (Shoot The Other Node In The Head): when a node believes it should become the primary, it first ensures the old primary is actually dead by powering it off through an out-of-band mechanism (IPMI, cloud API). This guarantees only one primary exists. Used by Pacemaker and many high-availability solutions.
Disk-based heartbeats: in addition to network heartbeats, nodes write heartbeats to a shared disk (SAN). If a node can see the other's disk heartbeats but not network heartbeats, it knows the other node is alive and the network is partitioned, so it should not promote.
Discuss high availability implications: preventing split-brain often means sacrificing availability. If a partition leaves a minority group, those nodes become unavailable. This is the fundamental CAP trade-off.
Follow-up questions:
- How does ZooKeeper prevent split-brain in its leader election?
- What happens if a node receives a fencing token but the old leader has already written data?
- How do cloud providers handle split-brain in managed database services?
11. How does distributed garbage collection and resource cleanup work?
What the interviewer is really asking: Can you think about the harder operational challenges of distributed systems beyond the initial design?
Answer framework:
Distributed garbage collection is more complex than single-process GC because objects may be referenced across node boundaries, and you cannot take a global snapshot atomically.
Reference counting across services: each service tracks references to shared resources (database connections, cached objects, distributed locks). When the reference count reaches zero, the resource can be cleaned up. Challenge: if a node crashes while holding a reference, the count never reaches zero. Solution: use leases instead of permanent references. Each reference has a TTL and must be renewed. If not renewed (because the holder crashed), the lease expires and the resource is freed.
Lease-based cleanup: assign every distributed resource a lease (TTL). The holder must renew the lease periodically. If the lease expires, any node can reclaim the resource. This is how distributed locks work in ZooKeeper and etcd. The challenge is choosing the lease duration: too short causes unnecessary renewals and false expirations, too long delays cleanup.
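A hedged sketch of lease-based resource cleanup with TTL renewal; LeaseTable and its methods are illustrative, not etcd's or ZooKeeper's API.

```python
import time

class LeaseTable:
    """Resources are held under leases with a TTL; a holder must renew before expiry,
    and anyone may reclaim a resource whose lease has lapsed (e.g. the holder crashed)."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self.leases = {}   # resource_id -> (holder_id, expiry_time)

    def acquire(self, resource_id, holder_id):
        holder, expiry = self.leases.get(resource_id, (None, 0.0))
        if holder is None or time.monotonic() >= expiry:
            self.leases[resource_id] = (holder_id, time.monotonic() + self.ttl)
            return True
        return False

    def renew(self, resource_id, holder_id):
        holder, expiry = self.leases.get(resource_id, (None, 0.0))
        if holder == holder_id and time.monotonic() < expiry:
            self.leases[resource_id] = (holder_id, time.monotonic() + self.ttl)
            return True
        return False   # lease already expired; the holder must not assume it still owns the resource

    def expired_resources(self):
        now = time.monotonic()
        return [rid for rid, (_, expiry) in self.leases.items() if now >= expiry]
```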
Tombstones for deleted data: in replicated systems, you cannot simply delete a record because replicas that have not seen the delete will re-replicate it back. Instead, mark the record as deleted (tombstone) and propagate the tombstone to all replicas. After a sufficient time (the gc_grace_seconds in Cassandra, typically 10 days), the tombstones themselves are garbage collected.
Discuss the compaction problem in log-structured systems: append-only logs and LSM trees (how LSM trees work) continuously accumulate data. Background compaction processes merge and remove deleted entries. This must be balanced against live traffic: too aggressive compaction steals I/O from queries, too lazy compaction wastes disk space.
Follow-up questions:
- How does Cassandra handle tombstone accumulation?
- What problems can occur if GC pauses happen in a distributed system node?
- How would you clean up orphaned resources in a microservices architecture?
12. Explain the concept of linearizability and how to implement it
What the interviewer is really asking: Can you go deep on the strongest consistency model and understand both its guarantees and its costs?
Answer framework:
Linearizability means every operation appears to execute atomically at some point between its invocation and its response (the linearization point). This gives the illusion that there is only one copy of the data, even though it is replicated.
To test if a system is linearizable: given a history of operations (with start and end times), can you find a total ordering of operations such that (1) each read returns the value of the most recent write in the ordering, and (2) the ordering respects real-time: if operation A completes before operation B starts, A appears before B in the ordering.
Implementation approaches:
Single leader with synchronous replication: the leader processes all reads and writes. Writes are replicated synchronously to a quorum before acknowledgment. Reads go to the leader. This provides linearizability but the leader is a bottleneck and a single point of failure. Used by etcd and ZooKeeper.
Consensus-based: use a consensus algorithm like Raft for every write. For reads, either route all reads through the leader (simple but bottleneck) or use a lease mechanism (the leader knows it is still the leader because its lease has not expired, so it can serve reads locally).
Quorum reads and writes: with N replicas, W + R > N. But quorum alone is not sufficient for linearizability. You also need either version numbers and read-repair, or a consensus protocol for conflicting writes. Cassandra uses quorum but is not linearizable because concurrent writes to different replicas can be ordered differently.
Cost of linearizability: increased latency (every operation requires coordination), reduced availability (cannot serve requests during partitions in a CP system), and lower throughput (the coordination bottleneck limits operations per second). Benchmarks show linearizable reads in etcd at around 10K per second per node versus millions per second for eventually consistent reads.
When to use: distributed locks, leader election, unique constraint enforcement, financial transactions. When to avoid: analytics, caching, content feeds, anything that tolerates staleness.
Follow-up questions:
- How does Spanner achieve linearizability across data centers using TrueTime?
- What is the performance difference between linearizable and eventually consistent reads?
- Can you have linearizability without a leader?
13. How do you handle data migration in a live distributed system?
What the interviewer is really asking: Can you plan and execute a risky operational task without downtime, showing mature engineering judgment?
Answer framework:
Data migration in a live system requires a multi-phase approach. The dual-write pattern, usually paired with shadow reads for verification, is the foundation.
Phase 1 - Dual writes: modify the application to write to both the old and new data stores. Reads still go to the old store. This ensures new data is in both systems. Use async writes to the new store to avoid impacting latency. If the write to the new store fails, log it for retry but do not fail the request.
Phase 2 - Backfill: migrate existing data from the old store to the new store. Run this as a background process at a controlled rate (do not overwhelm either system). Use checksums to verify data integrity. Handle conflicts where the dual-write has a newer version than the backfill source.
Phase 3 - Verification: run a consistency checker that compares data between old and new stores. Sample randomly or check exhaustively depending on data volume. Fix any discrepancies. This phase builds confidence that the migration is correct.
Phase 4 - Cutover reads: switch reads to the new store. Use a feature flag for gradual rollout (1 percent, 10 percent, 50 percent, 100 percent). Monitor error rates, latency, and correctness at each step. Keep the ability to revert instantly.
Phase 5 - Stop dual writes: once reads are fully on the new store and stable, stop writing to the old store. Keep the old store in read-only mode for a rollback window (typically 1-2 weeks).
Phase 6 - Decommission: after the rollback window, decommission the old store.
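A sketch of the Phase 1 dual-write wrapper described above, under the assumption that both stores expose simple get/put methods; DualWriteStore and retry_queue are hypothetical names.

```python
import logging

logger = logging.getLogger("migration")

class DualWriteStore:
    """Phase 1 wrapper: every write goes to both stores, reads still come from the
    old store, and a failed write to the new store is queued for retry instead of
    failing the request."""

    def __init__(self, old_store, new_store, retry_queue):
        self.old = old_store
        self.new = new_store
        self.retry_queue = retry_queue

    def put(self, key, value):
        self.old.put(key, value)          # source of truth until cutover
        try:
            self.new.put(key, value)      # best-effort shadow write
        except Exception:
            logger.exception("dual write to new store failed for key=%s", key)
            self.retry_queue.append((key, value))   # replayed later by a background worker

    def get(self, key):
        return self.old.get(key)          # reads switch to the new store only in Phase 4
```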
Discuss database design considerations: schema differences between old and new stores (you often migrate to take advantage of a better schema), data transformation during migration, and handling the increased load of dual writes.
Discuss using event-driven architecture: instead of dual writes, capture changes from the old store using CDC (Change Data Capture) and apply them to the new store. This decouples the migration from the application code.
Follow-up questions:
- How would you migrate a 10TB database with zero downtime?
- What if the new store has a different data model?
- How do you handle data that is written to the old store but not yet replicated to the new store during cutover?
14. Explain Byzantine fault tolerance and when it matters
What the interviewer is really asking: Do you understand the difference between crash faults and Byzantine faults, and can you reason about when the complexity of BFT is justified?
Answer framework:
Crash faults: a node either works correctly or stops completely. This is the failure model assumed by most distributed systems (Raft, Paxos, ZooKeeper). You need 2f + 1 nodes to tolerate f crash faults.
Byzantine faults: a node can behave arbitrarily, including sending incorrect or contradictory messages. This covers bugs, hardware corruption, and malicious actors. You need 3f + 1 nodes to tolerate f Byzantine faults, and the protocols are significantly more complex and slower.
Practical Byzantine Fault Tolerance (PBFT): the best-known BFT protocol. A leader proposes a value, and nodes go through three phases: pre-prepare (leader broadcasts proposal), prepare (nodes echo the proposal), commit (nodes confirm). A value is committed when 2f + 1 nodes agree in each phase. The three-phase protocol ensures that even if f nodes lie, the honest majority reaches consensus.
When BFT matters: blockchain systems (nodes do not trust each other), aerospace systems (hardware can have transient faults from radiation), and multi-party computation (mutually distrusting organizations sharing a system).
When BFT is overkill: most internal distributed systems at companies trust their own nodes. A server sending incorrect data is almost always a software bug, which BFT cannot fully solve (if all nodes run the same buggy code, they all produce the same wrong answer). Crash fault tolerance (Raft) is sufficient and much more performant.
Discuss the performance cost: PBFT has O(n²) message complexity per consensus round (every node sends to every other node). This limits practical cluster sizes to tens of nodes. Compare with Raft's O(n) complexity. This is why blockchains are orders of magnitude slower than traditional databases.
Modern improvements: HotStuff (used by Meta's Diem/Libra) achieves O(n) message complexity for BFT using threshold signatures and a pipelining approach.
Follow-up questions:
- How does Bitcoin solve the Byzantine generals problem differently than PBFT?
- Can you give an example of a Byzantine fault that is not malicious?
- What is the latency overhead of BFT compared to crash fault tolerance?
15. How do you design a system for exactly-once state machine replication?
What the interviewer is really asking: Can you combine consensus, state machines, and idempotency into a coherent replicated system design?
Answer framework:
State machine replication (SMR) is the foundation of many fault-tolerant systems. The idea: if multiple nodes start in the same state and apply the same operations in the same order, they will end in the same state. The challenge is ensuring all nodes agree on the order of operations. This is exactly what consensus algorithms solve.
Architecture: clients send operations to the leader. The leader proposes the operation to the consensus group (Raft, Paxos). Once committed (majority agreement), all nodes apply the operation to their state machine. Because all nodes apply the same operations in the same order, their states are identical.
Handling client retries: if a client sends a request and the leader crashes before responding, the client does not know if the operation was committed. It retries, potentially causing the operation to be applied twice. Solution: assign each client a unique session ID and sequence number. The state machine tracks the latest sequence number per client and ignores duplicate operations. This provides exactly-once semantics for client operations.
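A minimal sketch of per-client session deduplication in a replicated state machine, assuming entries arrive already committed and in order from the consensus layer; ReplicatedStateMachine and its session bookkeeping are illustrative.

```python
class ReplicatedStateMachine:
    """Applies committed log entries in order and deduplicates client retries using
    a per-client session of (last applied sequence number, last response)."""

    def __init__(self):
        self.state = {}
        self.sessions = {}   # client_id -> (last_applied_seq, last_response)

    def apply(self, client_id, seq, op, key, value=None):
        last_seq, last_response = self.sessions.get(client_id, (0, None))
        if seq <= last_seq:
            return last_response          # duplicate delivery of an already-applied operation
        if op == "put":
            self.state[key] = value
            response = "ok"
        elif op == "get":
            response = self.state.get(key)
        else:
            response = "unknown op"
        self.sessions[client_id] = (seq, response)
        return response

sm = ReplicatedStateMachine()
sm.apply("client-7", 1, "put", "x", 42)
sm.apply("client-7", 1, "put", "x", 42)   # retried request with the same sequence number is ignored
print(sm.state["x"])                       # 42
```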
Handling leader changes: when a new leader is elected, it must discover which operations from the previous leader were committed and which were not. In Raft, the new leader's log contains all committed entries (guaranteed by the election safety property). The new leader replicates any uncommitted entries from its log to followers.
Snapshotting: the replicated log grows unboundedly. Periodically, take a snapshot of the state machine and truncate the log up to that point. New nodes or nodes that fall behind can bootstrap from the snapshot plus the remaining log entries.
Practical examples: etcd (Raft-replicated key-value store), ZooKeeper (Zab-replicated configuration store), CockroachDB (Raft-replicated SQL database with ranges as the unit of replication). Each database layer sits on top of a replicated state machine.
Performance considerations: batching (group multiple operations into one consensus round), pipelining (start the next consensus round before the previous one completes), and read leases (serve reads from the leader without consensus, using a time-bound lease).
Follow-up questions:
- How does Raft handle the case where the new leader has uncommitted entries from the old leader?
- What is the latency profile of a replicated state machine?
- How would you scale reads in a state machine replication system?
Common Mistakes in Distributed Systems Interviews
- Confusing consistency models. Linearizability, serializability, and sequential consistency are different guarantees. Using them interchangeably signals shallow understanding.
- Assuming reliable networks. Networks partition, messages are lost, and packets arrive out of order. Every design must account for network failures.
- Ignoring the cost of coordination. Consensus, distributed locks, and synchronous replication all add latency. Know the performance implications of your consistency choices.
- Handwaving about failure handling. Saying the system retries on failure is not enough. Discuss idempotency, deduplication, compensating transactions, and the specific failure modes of each component.
- Not understanding impossibility results. FLP (no deterministic consensus in async systems), CAP, and the two generals problem set fundamental limits. Understanding what is impossible helps you appreciate why practical systems make the trade-offs they do.
How to Prepare for Distributed Systems Interviews
Start with the foundational papers: Lamport's "Time, Clocks, and the Ordering of Events" and Brewer's CAP conjecture. Then study Raft (the paper is designed to be readable) and the Dynamo paper for leaderless replication.
Build something: implement a simple key-value store with Raft consensus using a library like etcd's Raft. The act of implementing teaches you edge cases that reading alone cannot.
Study production systems: read engineering blogs about how companies like Google handle distributed systems challenges. Spanner's TrueTime, Dynamo's eventual consistency, and Kafka's exactly-once semantics are common interview reference points.
For a structured learning path, explore our distributed systems course and the system design interview guide. Those targeting staff roles should review the senior to staff engineer transition for the depth expected at that level.