Two-Phase Commit Protocol Explained: Coordinating Distributed Transactions
Understand the Two-Phase Commit (2PC) protocol — how it coordinates atomic transactions across distributed nodes, its blocking problem, and alternatives.
Two-Phase Commit Protocol
Two-phase commit (2PC) is a distributed coordination protocol that ensures all participants in a distributed transaction either commit or abort together, providing atomicity across multiple databases or services.
What It Really Means
When a single transaction spans multiple databases — debit Account A on Database 1, credit Account B on Database 2 — you need both operations to succeed or both to fail. If Database 1 commits the debit but Database 2 crashes before crediting, money disappears. Two-phase commit prevents this.
The protocol works in two phases. In Phase 1 (Prepare), the coordinator asks all participants: "Can you commit?" Each participant does all the work (writes to disk, acquires locks) and responds "Yes" (vote commit) or "No" (vote abort). In Phase 2 (Commit/Abort), if all participants voted "Yes," the coordinator sends "Commit" and everyone makes the changes permanent. If any participant voted "No," the coordinator sends "Abort" and everyone rolls back.
The key property is that once a participant votes "Yes" in Phase 1, it has promised it can commit — it must hold its locks and resources until the coordinator tells it the final decision. This promise is what makes the protocol blocking: if the coordinator crashes after collecting votes but before sending the decision, all participants that voted "Yes" are stuck, unable to commit or abort, holding their locks indefinitely.
2PC is used in traditional database systems (XA transactions), Google Spanner (with modifications), and PostgreSQL's PREPARE TRANSACTION. Most modern microservice architectures avoid 2PC in favor of the saga pattern due to its blocking nature.
How It Works in Practice
The Protocol in Detail
Phase 1: Prepare (Voting Phase)
- Coordinator sends
PREPAREto all participants - Each participant:
- Writes all changes to a write-ahead log (WAL)
- Acquires necessary locks
- Responds
VOTE_COMMITif ready, orVOTE_ABORTif unable
- Coordinator collects all votes
Phase 2: Commit/Abort (Decision Phase)
- If all votes are
VOTE_COMMIT: Coordinator writesCOMMITto its own log, sendsCOMMITto all participants - If any vote is
VOTE_ABORT(or timeout): Coordinator writesABORTto its log, sendsABORTto all participants - Each participant executes the decision and acknowledges
The Blocking Problem
Consider this failure scenario:
- Coordinator sends PREPARE to participants A, B, C
- All three vote COMMIT
- Coordinator writes COMMIT to its log
- Coordinator sends COMMIT to A (A commits)
- Coordinator crashes before sending COMMIT to B and C
Now B and C are in limbo. They voted COMMIT and are holding locks, but they do not know the final decision. They cannot commit (the coordinator might have decided to abort). They cannot abort (the coordinator might have decided to commit, and A already committed). They must wait for the coordinator to recover.
This blocking problem is why 2PC is avoided in systems that require high availability.
Real-World: Google Spanner
Spanner uses a modified 2PC where the coordinator is a Paxos group (not a single node), eliminating the single-point-of-failure problem. Each participant is also a Paxos group. If the coordinator leader fails, a new leader is elected from the Paxos group and can complete the protocol. This makes Spanner's 2PC non-blocking in practice, though it adds latency from the consensus protocol.
Real-World: XA Transactions
The XA standard implements 2PC for coordinating transactions across multiple databases. A Java application might use JTA (Java Transaction API) to atomically update both MySQL and PostgreSQL:
XA transactions are supported by most relational databases but are rarely used in modern systems due to performance overhead and the blocking problem.
Implementation
Trade-offs
Advantages
- Atomicity guarantee: All participants commit or all abort — no partial failures
- Straightforward protocol: The two-phase structure is easy to understand
- Database support: Built into most relational databases via XA
- Strong consistency: No intermediate states visible to other transactions
Disadvantages
- Blocking protocol: Coordinator failure blocks all participants holding locks
- Performance overhead: Two round trips plus WAL writes; locks held for the entire protocol duration
- Availability reduction: The system's availability is the product of all participants' availabilities. With 3 services at 99.9% each, the combined availability is 99.7%.
- Does not scale: Adding more participants increases latency and failure probability
- Not partition-tolerant: Network partitions between coordinator and participants cause indefinite blocking
Common Misconceptions
-
"2PC guarantees the transaction will eventually complete" — 2PC guarantees safety (all commit or all abort) but not liveness. If the coordinator crashes after the prepare phase, the transaction can be blocked indefinitely until the coordinator recovers. Three-phase commit (3PC) addresses this but is rarely used in practice.
-
"2PC is the same as Paxos/Raft" — 2PC coordinates a distributed transaction (should we commit this change?). Consensus protocols like Paxos and Raft ensure replicas agree on a value. They solve different problems. Spanner combines both: 2PC for cross-shard transactions, Paxos for intra-shard replication.
-
"Modern systems don't use 2PC at all" — Microservices avoid 2PC in favor of sagas, but databases internally use 2PC extensively. PostgreSQL uses 2PC for PREPARE TRANSACTION. Google Spanner uses 2PC (backed by Paxos). CockroachDB uses a variant of 2PC for distributed transactions.
-
"The saga pattern is strictly better than 2PC" — Sagas sacrifice isolation: intermediate states are visible to concurrent transactions. For use cases requiring strict atomicity and isolation (e.g., double-entry accounting), 2PC is still the correct choice.
How This Appears in Interviews
2PC is a fundamental distributed systems concept tested in interviews:
- "How do you implement a distributed transaction across microservices?" — Discuss 2PC and its limitations, then explain why sagas are preferred for microservices. Show you understand the trade-off between atomicity and availability. See interview questions on distributed transactions.
- "What happens if the coordinator crashes?" — Explain the blocking problem: participants hold locks, cannot decide independently, and must wait for coordinator recovery. Discuss how Spanner solves this with Paxos-backed coordinators.
- "Compare 2PC and saga pattern" — 2PC gives atomicity and isolation but blocks on failure; sagas give availability and performance but sacrifice isolation. See our system design interview guide.
- "Design a money transfer system" — discuss whether you need 2PC (same database) or sagas (cross-service), and justify your choice.
Related Concepts
- Saga Pattern — The non-blocking alternative to 2PC for microservices
- Raft Consensus — Consensus within a shard, often combined with 2PC across shards
- Paxos — How Spanner makes 2PC non-blocking
- Eventual Consistency — The consistency model when you abandon 2PC
- CAP Theorem — 2PC chooses consistency over availability
- Algoroq Pricing — Practice distributed transaction interview questions
GO DEEPER
Learn from senior engineers in our 12-week cohort
Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.