You’re about to design concurrency control for a distributed system that handles money, inventory, bookings, or collaborative edits. You have two archetypal strategies:
- Pessimistic locking: “Assume collisions will happen - reserve access up front.”
- Optimistic locking: “Assume collisions are rare - detect conflicts at commit time.”
This article turns those into interactive decision games, failure drills, and mental models you can reuse when you’re choosing between row locks, compare-and-swap, leases, version checks, and distributed transactions.
Discussions of distributed locking become misleading unless we state our assumptions up front.
Scenario or challenge You run a busy coffee shop. There is one last croissant in the display case. Two baristas take orders at the same time:
Interactive question (pause and think) If you do nothing special, what are the possible outcomes?
Pause for 10 seconds.
Explanation with analogy (progressive reveal) The realistic outcomes are (1) and (2). Without concurrency control, you can oversell.
Real-world parallel Replace “croissant” with:
In distributed systems, we need a way to ensure that concurrent operations don’t violate invariants.
Key insight: Locking is about protecting invariants under concurrency. The “right” strategy depends on contention, latency, failure modes, and what you can tolerate: blocking, retries, or occasional aborts.
Challenge question If customers mostly order coffee (no contention) and croissants rarely run out (rare contention), which style sounds better: optimistic or pessimistic?
Scenario or challenge You need a mental picture that generalizes from “croissant” to “distributed state.”
Interactive question (pause and think) Which is more like your workload: a crowded checkout line (frequent collisions) or a mostly empty store (rare collisions)?
Explanation with analogy Two mental pictures:
Pessimistic locking (reserve first)
Optimistic locking (work first, validate later)
[IMAGE: Side-by-side diagram. Left: Pessimistic path with lock acquisition -> critical section -> release. Right: Optimistic path with read version -> do work -> compare version on commit; if mismatch, abort/retry.]
Real-world parallel
Key insight: Pessimistic trades waiting for certainty. Optimistic trades retries/aborts for parallelism.
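The two mental pictures can be sketched in a few lines of Python. This is a single-process toy: the shared dict stands in for distributed state, and the brief `lock` in the optimistic path models the store's internal atomic commit, not an application-level lock.

```python
import threading

# Shared state: one croissant left. "version" supports the optimistic path.
state = {"qty": 1, "version": 0}
lock = threading.Lock()

def sell_pessimistic():
    # Reserve access up front: other sellers block until we release.
    with lock:
        if state["qty"] > 0:
            state["qty"] -= 1
            return True
        return False

def sell_optimistic():
    # Work on a snapshot first, validate at commit time, retry on conflict.
    while True:
        snapshot = dict(state)  # read without blocking anyone
        if snapshot["qty"] == 0:
            return False
        with lock:  # stands in for the store's internal atomic commit
            if state["version"] == snapshot["version"]:
                state["qty"] = snapshot["qty"] - 1
                state["version"] += 1
                return True
        # Version moved on since our read: someone else won; retry.
```

Note how the optimistic seller never holds the lock while "working", only for the instant of the commit check.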
Challenge question In a geo-distributed system where round trips are expensive, which cost hurts more: waiting for locks or retrying on conflicts?
Scenario or challenge A user checks out a cart:
You need the invariant: don’t charge if you can’t reserve inventory.
Interactive question (pause and think) Where should locking happen?
Pause for 15 seconds.
Explanation with analogy (progressive reveal) It depends on architecture:
One common choice is to lock the inventory row in the database (e.g., SELECT ... FOR UPDATE). Production insight: even if you “lock inventory”, you still need idempotency for payment and order creation, because timeouts create ambiguous outcomes.
Real-world parallel This is like a delivery service that must coordinate between:
If the warehouse can’t pick, charging is a bad customer experience.
Key insight: In distributed systems, “locking” is not just a DB feature - it can be a protocol spanning services.
Challenge question If the payment provider is external and slow, do you want to hold an inventory lock while waiting for payment?
Scenario or challenge Teams hear “optimistic” and assume “no synchronization.”
Interactive question (pause and think) If no one ever blocks, how can the system prevent two writers from committing incompatible updates?
Explanation with analogy Optimistic locking is usually:
Example: UPDATE ... WHERE version = ? is “optimistic,” but the database still uses internal latching/locking to apply the update atomically.
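Here is a runnable version of that pattern against SQLite (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, qty INTEGER, version INTEGER)")
conn.execute("INSERT INTO item VALUES (1, 5, 7)")

def decrement(conn, item_id, read_version):
    # The UPDATE applies only if the row still carries the version we read;
    # bumping the version invalidates every other in-flight reader.
    cur = conn.execute(
        "UPDATE item SET qty = qty - 1, version = version + 1 "
        "WHERE id = ? AND version = ? AND qty > 0",
        (item_id, read_version),
    )
    return cur.rowcount == 1  # 0 rows -> stale version; caller must re-read and retry
```

A caller that gets False re-reads the row to learn the new version, then decides whether to retry or surface a conflict.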
Real-world parallel Two people can both fill out the last reservation form, but the host accepts only the one with the latest ticket/sequence.
Key insight: Optimistic locking avoids long-held locks, not all synchronization.
Challenge question What is the “lock” in optimistic locking? (Hint: it’s the commit-time atomicity.)
Scenario or challenge Reserving the last table at a restaurant: the host says, “I’ll put your name on it and hold the table.” Others must wait.
Interactive question (pause and think) In distributed locking, which component must be strongly consistent?
Explanation with analogy (progressive reveal) Answer: B. If the lock service isn’t strongly consistent, you can get split-brain locks (two owners).
[IMAGE: Sequence diagram of distributed lock acquisition using etcd: client -> etcd put lock key with lease -> success; other client blocks/watches key; release or lease expiry.]
Real-world parallel A restaurant’s reservation book must be authoritative. If two hosts keep separate books that don’t sync reliably, you’ll double-book tables.
Key insight: Distributed pessimistic locking requires a strongly consistent coordinator (or an equivalent single-writer mechanism).
Challenge question If your lock coordinator is a Raft cluster, what happens during a network partition - can clients still acquire locks?
Clarification (CAP): a correctly implemented Raft-based lock service will typically refuse lock acquisition if it cannot reach a quorum. That preserves consistency (no split-brain) at the cost of availability.
Scenario or challenge
A worker acquires a lock on inventory:sku123, then crashes mid-operation.
Interactive question (pause and think) How does the system avoid deadlock (lock never released)?
Explanation with analogy (progressive reveal) Answer: B. The standard answer is leases.
TCP disconnect can help but isn’t sufficient in partitions: TCP may stay open or fail late.
Production insight: leases solve “stuck locks” but introduce “lock loss” (a paused process can lose its lease). To be safe you often need fencing tokens (covered later).
Real-world parallel A restaurant holds your table for 10 minutes. If you don’t show up, they give it away.
Key insight: Pessimistic locks in distributed systems must be time-bounded to tolerate crashes and partitions.
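A toy lease illustrates the time-bound, with an injectable clock so the expiry behavior is easy to see. This is single-process; real lease services must also worry about clock skew and paused holders.

```python
import time

class LeaseLock:
    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who, now=None):
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires_at:
            # Free, or the previous holder's lease expired (e.g., it crashed).
            self.holder, self.expires_at = who, now + self.ttl
            return True
        return False

    def renew(self, who, now=None):
        now = time.monotonic() if now is None else now
        if self.holder == who and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False  # too late: the lease was already lost
```

The failed `renew` at the end is exactly the “lock loss” case that fencing tokens (covered later) must protect against.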
Challenge question What’s the trade-off of short leases vs long leases?
Scenario or challenge You must decide whether to block or to abort.
Interactive question (pause and think) Pick the true statement(s):
Pause.
Explanation with analogy (progressive reveal) Answers:
Real-world parallel A single host controlling a waitlist reduces double-booking but can create a long line.
Key insight: Pessimistic locking often trades throughput for predictable correctness - but not necessarily predictable latency.
Challenge question In a system with strict SLOs (p99 latency), when might you prefer abort/retry over waiting?
Scenario or challenge Two people editing the same Google Doc line: you type a sentence, someone else edits the same line. The system merges or asks you to resolve.
Interactive question (pause and think) Which of these is a classic optimistic locking pattern?
- A) SELECT ... FOR UPDATE
- B) UPDATE item SET qty=qty-1 WHERE id=? AND version=?
- C) SETNX lockkey

Explanation with analogy (progressive reveal) Answer: B. It updates only if the version matches.
[IMAGE: Timeline diagram showing two clients read version v=7; client A commits to v=8; client B tries commit with v=7 fails and must retry.]
Real-world parallel Two customers fill out a form; the system accepts the first one that reaches the clerk and rejects the stale one.
Key insight: Optimistic locking is “validate on write.” Pessimistic locking is “block before write.”
Challenge question If conflicts are frequent, what happens to throughput under optimistic locking?
Scenario or challenge You see it in JPA or SQL and assume it’s a DB-only technique.
Interactive question (pause and think) Where else have you seen: “If version matches, apply; else reject”?
Explanation with analogy Optimistic concurrency is everywhere:
- HTTP conditional requests (ETags with If-Match)

Real-world parallel A delivery app updates your address only if you are editing the latest version of the order.
Key insight: Optimistic concurrency is a protocol pattern, not a DB feature.
Challenge question Where in your stack could you add version checks to prevent lost updates?
Scenario or challenge A popular SKU goes viral. Thousands of clients attempt to decrement stock using optimistic locking.
Interactive question (pause and think) What failure mode can occur?
Explanation with analogy (progressive reveal) Answer: B. Under high contention, optimistic locking can cause:
Production mitigations:
Real-world parallel It’s like telling everyone, “Just walk to the counter and if the croissant is gone, go back and try again.” The crowd blocks the entrance.
Key insight: Optimistic locking shifts contention from waiting to wasted work.
Challenge question Which is easier to control: waiting queues (pessimistic) or retry loops (optimistic)?
| Dimension | Pessimistic locking | Optimistic locking |
|---|---|---|
| Assumption | Conflicts likely | Conflicts rare |
| Coordination | Up front (acquire lock) | At commit (validate version/CAS) |
| Latency profile | Can block; tail latency from lock waits | Fast on success; spikes under conflict due to retries |
| Failure handling | Needs leases/heartbeats; risk of lock loss | Needs retry/backoff; risk of livelock under contention |
| Partition behavior | Often stops granting locks to preserve safety (C over A) | Reads can continue; writes depend on store’s consistency + leader/quorum |
| Implementation | Lock manager/leader/DB locks | Version fields, CAS, MVCC, conditional writes |
| Good for | High contention, strong invariants, short critical sections | Read-heavy, low contention, high latency networks |
| Bad for | Long operations, external calls while holding lock | Hot keys, bursty contention, expensive retries |
Scenario or challenge Your checkout flow:
Interactive question (pause and think) What’s the worst thing about this design?
Explanation with analogy (progressive reveal) Answer: D. Holding locks across slow, failure-prone external calls is a classic distributed systems foot-gun.
Production pattern: split into two phases:
This is a saga/reservation workflow, not a single lock.
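A sketch of the two-phase shape: reserve with a TTL, then confirm after payment. The in-memory store and names are illustrative; a real system persists reservations and runs expiry as a background job.

```python
import uuid

class Inventory:
    def __init__(self, qty):
        self.qty = qty
        self.reservations = {}  # reservation_id -> expiry time

    def reserve(self, ttl, now):
        # Phase 1: a soft hold, not a lock. It expires on its own if the
        # payment call hangs or the client disappears.
        self._expire(now)
        if self.qty - len(self.reservations) <= 0:
            return None
        rid = str(uuid.uuid4())
        self.reservations[rid] = now + ttl
        return rid

    def confirm(self, rid, now):
        # Phase 2: after payment succeeds, convert the hold into a decrement.
        self._expire(now)
        if rid not in self.reservations:
            return False  # hold expired: void or refund the payment instead
        del self.reservations[rid]
        self.qty -= 1
        return True

    def _expire(self, now):
        self.reservations = {r: t for r, t in self.reservations.items() if t > now}
```

No lock is held during the slow external payment call; the TTL bounds how long a crashed client can tie up stock.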
Real-world parallel It’s like a host holding the only table while you step outside to call your friend to decide what to order.
Key insight: In distributed systems, time is the enemy of locks.
Challenge question If you must guarantee “no charge without inventory,” what compensation or idempotency mechanisms do you need?
Scenario or challenge You have a single strongly consistent database for inventory.
Interactive question (pause and think) What happens to other writers while you hold a row lock?
Explanation with analogy Others wait in line; you are holding the “croissant behind the counter.”
Real-world parallel A single clerk handles one customer at a time for a specific ticket.
Key insight: Keep pessimistic critical sections short; do not call external systems while holding locks.
Important clarification: the following TCP lock server is a teaching demo to illustrate blocking waiters. It is not a correct distributed lock implementation (no persistence, no fencing, no quorum, no crash recovery).
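A minimal version of such a demo, using only the Python standard library. One hard-coded lock, line-based LOCK/UNLOCK commands; per the disclaimer above, this only illustrates blocking waiters.

```python
import socket
import socketserver
import threading

_lock = threading.Lock()  # one global lock; a real server would track many named locks

class LockHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:
            cmd = line.strip().decode()
            if cmd == "LOCK":
                _lock.acquire()  # blocks this connection's thread: the waiter queues here
                self.wfile.write(b"GRANTED\n")
            elif cmd == "UNLOCK":
                _lock.release()
                self.wfile.write(b"RELEASED\n")
        # Note: if a client disconnects while holding the lock, it is stuck
        # forever - exactly the failure mode leases fix.

def start_server():
    srv = socketserver.ThreadingTCPServer(("127.0.0.1", 0), LockHandler)
    srv.daemon_threads = True
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

def run_client(port, commands):
    replies = []
    with socket.create_connection(("127.0.0.1", port)) as sock:
        f = sock.makefile("rwb")
        for cmd in commands:
            f.write(cmd.encode() + b"\n")
            f.flush()
            replies.append(f.readline().strip().decode())
    return replies
```

A second client issuing LOCK while the first holds it simply blocks until the first sends UNLOCK, which is the behavior this section is about.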
Scenario or challenge You want high read concurrency and can tolerate retries.
Interactive question (pause and think) If two writers race, how does the loser learn it lost?
Explanation with analogy The loser’s commit sees “someone updated this since you read it,” like arriving at the counter with a stale ticket.
Real-world parallel A reservation form is accepted only if the table is still free.
Key insight: Optimistic locking needs a retry policy (backoff, jitter, max attempts) and good UX for conflict errors.
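One possible shape for that retry policy, using exponential backoff with full jitter. The sleep function is injectable so the policy is testable without real waiting.

```python
import random

def with_retries(attempt_commit, max_attempts=5, base_delay=0.01, sleep=None):
    # attempt_commit() returns True on success, False on a version conflict.
    sleep = sleep or (lambda seconds: None)
    for attempt in range(max_attempts):
        if attempt_commit():
            return True
        # Full jitter: a random delay in [0, base * 2^attempt) de-synchronizes
        # competing clients so they stop colliding in lockstep.
        sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return False  # out of attempts: surface a conflict error to the caller
```

In production you would pass `time.sleep` (or an async equivalent) and pick `max_attempts`/`base_delay` from measured conflict rates.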
Scenario or challenge You store state in a KV store with compare-and-swap support.
Interactive question (pause and think) Why is “read then write” dangerous without a compare primitive?
Explanation with analogy Without compare, you can overwrite someone else’s update (“lost update”). Compare makes it a single atomic decision.
Real-world parallel The host checks the reservation book at the moment of writing your name, not 2 minutes earlier.
Key insight: If your store supports transactional compare (e.g., etcd Txn), you can avoid extra coordination round trips.
Important clarification: the following single-threaded server demonstrates the API shape of CAS. Real CAS correctness in distributed systems requires a strongly consistent store (single leader/quorum) or a linearizable primitive.
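In place of a server, here is a single-process sketch of the same API shape. The internal lock stands in for the linearizable commit a real store (e.g., etcd's Txn) provides.

```python
import threading

class CASStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Read, compare, and write happen as one atomic step; a separate
        # get() followed by a put() would reintroduce the lost update.
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone committed since our read
            self._data[key] = new
            return True
```

Callers loop: read the current value, compute the new one, CAS, and re-read on failure.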
Challenge question If your store is only eventually consistent, does CAS still work across replicas?
Answer: not in the linearizable sense. You might get “successful” CAS on two replicas during a partition (split brain). For correctness you need a single-writer leader or quorum/consensus providing linearizable compare-and-set.
Scenario or challenge You must coordinate access to a shared resource not protected by a single DB (e.g., singleton job, migration, cross-store invariant).
Interactive question (pause and think) What’s the minimum requirement for correctness if locks can be lost?
Explanation with analogy You need a way to prevent a previous holder from continuing after losing ownership.
Real-world parallel A reservation ticket expires; you can’t show up later and claim the table.
Key insight: Leases prevent deadlocks; fencing tokens prevent stale owners from corrupting state.
Challenge question What should your worker do if it loses the lock mid-operation?
Production answer: stop making side effects immediately; if possible, roll back/compensate; and ensure downstream writes are protected by fencing token checks so stale workers are rejected.
Scenario or challenge You want both perfect correctness and perfect availability.
Interactive question (pause and think) Why is distributed locking hard?
Explanation with analogy (progressive reveal) Because you’re negotiating between:
In partitions, you must choose:
[IMAGE: CAP-style diagram annotated with “lock service chooses C over A during partitions” and “optimistic validation depends on store consistency model”.]
Real-world parallel Two restaurant hosts on opposite sides of a building with no communication: do they keep seating (availability) or stop seating to avoid double-booking (consistency)?
Key insight: Locking is coordination, and coordination is where distributed systems pay latency and availability costs.
Challenge question If your product can tolerate occasional oversell but not downtime, how does that change your locking choice?
Scenario or challenge You’re considering “no waiting, just retry.”
Interactive question (pause and think) Pick the true statement(s):
Explanation with analogy (progressive reveal) Answers:
Clarification: for invariants like “counter must not go below zero”, multi-leader geo writes are hard without coordination; many systems choose single-writer per key/shard.
Real-world parallel Letting everyone walk up to the counter works if the store is empty; it fails if it’s a stampede.
Key insight: Optimistic locking is a bet: “conflicts are rare enough that retries are cheaper than waiting.”
Challenge question What makes retries “cheap” in your system: fast storage, small payloads, or idempotent operations?
Scenario or challenge You must map workload shape to coordination strategy.
Interactive question (pause and think) Match the scenario (left) to the best default approach (right). Pause before the reveal.
Scenarios:
- A) Updating a user profile (name, avatar) in a read-heavy app
- B) Reserving seats for a concert that’s about to sell out
- C) Running a once-per-day billing job (must not run twice)
- D) Collaborative editing with frequent concurrent edits
- E) Decrementing a rate-limit counter per API call
Approaches:
Explanation with analogy (progressive reveal) One reasonable mapping: A->2, B->1, C->3, D->4, E->5
Real-world parallel
Key insight: The “best” approach depends on whether you can merge, retry, wait, or approximate.
Challenge question Which of your system’s invariants can be approximated (rate limits) and which cannot (money)?
Scenario or challenge You attempt an update. The database commits, but the client times out and retries.
Interactive question (pause and think) What do you need regardless of optimistic vs pessimistic locking?
Explanation with analogy (progressive reveal) Answer: A. You need idempotency to handle ambiguous outcomes.
Production insight: idempotency must be enforced at the side-effect boundary (e.g., payment charge, order creation). If you only dedupe at the edge but not at the DB/payment provider, you can still double-charge.
Real-world parallel You place a delivery order; your phone loses signal after you hit “Pay.” You try again. Without an idempotency key, you might pay twice.
Key insight: Locking controls concurrency, but idempotency controls retries under uncertainty.
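A toy charge endpoint with idempotency-key deduplication makes the mechanism concrete. This is in-memory only; a real implementation persists the key and result atomically with the side effect (e.g., in the same DB transaction).

```python
class PaymentProcessor:
    def __init__(self):
        self._results = {}  # idempotency_key -> first result
        self.charges = []   # the actual side effects performed

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # A retry of an ambiguous outcome: replay the stored result,
            # do not perform the side effect again.
            return self._results[idempotency_key]
        self.charges.append(amount)  # the real side effect
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical operation (e.g., per checkout), not per HTTP attempt, so retries of the same operation dedupe.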
Challenge question Where should idempotency be enforced: API gateway, service layer, or database?
Scenario or challenge You add a lock and expect multi-service atomicity.
Interactive question (pause and think) If Service A uses the lock but Service B doesn’t, is the invariant protected?
Explanation with analogy A lock only protects what all participants agree to protect.
Real-world parallel One host uses the reservation book; the other seats walk-ins without checking it.
Key insight: Locks are coordination agreements with failure modes.
Challenge question What’s the scope of your lock - one row, one shard, one service, or the whole business invariant?
Scenario or challenge Worker W1 acquires a lease-based lock and starts processing. A network hiccup delays renewals; the lease expires. Worker W2 acquires the lock and starts processing too. W1 comes back and continues, unaware it lost the lock.
Interactive question (pause and think) How do you prevent W1 from corrupting state after losing the lock?
Explanation with analogy (progressive reveal) Use fencing tokens:
- The lock service issues a monotonically increasing token t with each successful acquisition.
- The protected resource rejects any write carrying a token lower than the highest t it has already seen.

[IMAGE: Diagram showing W1 token=41, lease expires; W2 token=42; W1 tries write with 41 and is rejected by resource that tracks max token.]
Real-world parallel A restaurant gives numbered tickets. Only the highest ticket number is allowed to claim the reservation.
Key insight: Leases prevent deadlocks, but fencing tokens prevent stale owners from causing damage.
Challenge question Where do you store and enforce the “max token seen” - in the database row, the service, or a proxy?
Production answer: enforce it at the resource that is being protected (often the DB row or a single-writer service). If enforcement is only in the lock client, it’s not safe.
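The W1/W2 scenario above, with enforcement living at the resource. The token numbers follow the diagram; everything else is a toy.

```python
class TokenLock:
    # Lock service: every successful acquisition hands out a strictly
    # increasing fencing token.
    def __init__(self, start=41):
        self._next = start

    def acquire(self):
        token = self._next
        self._next += 1
        return token

class FencedResource:
    # The protected resource remembers the highest token it has seen
    # and rejects writes carrying an older one.
    def __init__(self):
        self.max_token = -1
        self.value = None

    def write(self, token, value):
        if token < self.max_token:
            return False  # stale owner: its lease expired and someone else took over
        self.max_token = token
        self.value = value
        return True
```

The crucial design point: the check lives in `FencedResource.write`, not in the lock client, so a paused-and-resumed W1 is rejected no matter what it believes about its lease.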
Scenario or challenge You run an e-commerce platform with multiple regions. Inventory for a hot SKU is shared globally.
Interactive question (pause and think) Which approach is most realistic?
Explanation with analogy (progressive reveal) Often B or D:
Production insight: another common approach is regional reservations (allocate each region a quota, periodically rebalance). This reduces cross-region coordination.
Real-world parallel One store manager controls the last-item decisions; other branches forward requests.
Key insight: For counters and decrements, you often end up with single-writer or reservation tokens rather than pure optimistic multi-writer.
Challenge question If you must support writes in every region with low latency, what invariant would you relax to avoid global coordination?
Scenario or challenge Pure optimistic vs pure pessimistic is rarely the whole story.
Interactive question (pause and think) Where can you reduce coordination scope: by key, by shard, by time, or by semantics?
Explanation with analogy Hybrid strategies:
Optimistic reads + pessimistic writes
Reservation systems (soft locks)
Escalation
Partitioned ownership
[IMAGE: Architecture diagram: request router -> shard leader -> local DB; shows ownership boundaries reducing coordination scope.]
Real-world parallel A restaurant uses walk-in seating (optimistic) on quiet days and a reservation list (pessimistic) on busy nights.
Key insight: The best systems minimize the scope and duration of coordination.
Challenge question What’s the smallest “unit” you can lock - order, SKU, user, or shard?
Scenario or challenge You have:
Interactive question (pause and think) Pick a strategy:
Write down your choice and why.
Explanation with analogy (progressive reveal) A typical industry answer is (3) Pre-mint tokens:
But:
Failure scenarios to plan for:
Real-world parallel Hand out 10,000 numbered wristbands at the door; only wristband holders can buy.
Key insight: When contention is extreme, neither naive pessimistic nor naive optimistic locking scales well. You redesign the invariant.
Challenge question What’s the failure mode if token expiry is too short? Too long?
Scenario or challenge You deployed your locking strategy. Now production traffic arrives.
Interactive question (pause and think) What would you alert on first: lock wait time or conflict rate?
Explanation with analogy Signals you should be measuring:
If you chose pessimistic locking:
If you chose optimistic locking:
[IMAGE: Dashboard mockup showing lock wait histogram vs optimistic conflict rate chart.]
Real-world parallel
Key insight: Your locking strategy is a control loop: measure contention, tune backoff/leases, and redesign hotspots.
Challenge question What threshold indicates trouble sooner in your system: “conflict rate > X%” or “lock wait p99 > Y ms”? Why?
Scenario or challenge You avoided locks and feel safe.
Interactive question (pause and think) If everyone keeps retrying and no one makes progress, what is that called?
Explanation with analogy This is livelock: everyone keeps acting, yet no one makes progress.
Real-world parallel In a narrow hallway, two people keep stepping aside in the same direction repeatedly.
Key insight: Pessimistic -> deadlock risk. Optimistic -> livelock/retry storm risk.
Challenge question What’s your plan for livelock: backoff, jitter, queueing, or switching strategy?
Scenario or challenge You want atomic updates across:
Interactive question (pause and think) Can a lock make this atomic?
Explanation with analogy (progressive reveal) Not by itself.
Production insight: if you need “DB update + event publish” reliably, the transactional outbox is usually the pragmatic choice.
Real-world parallel Reserving a table doesn’t automatically reserve a taxi and charge your card; you still need coordination across parties.
Key insight: Locking is about concurrency. Transactions are about atomicity. They overlap but aren’t the same.
Challenge question If you use an outbox, where would optimistic vs pessimistic locking apply?
Scenario or challenge You implement optimistic locking for account withdrawal:
Interactive question (pause and think) What’s missing for correctness in a distributed environment?
Explanation with analogy (progressive reveal) Two common missing pieces:
Real-world parallel A customer calls twice because the line cut out; without a “same order” identifier, you charge twice.
Key insight: Optimistic locking protects against concurrent writers, not duplicate requests or multi-entity invariants.
Challenge question How would you implement transfers between two accounts without deadlocks or invariant violations?
Production answer (sketch):
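One possible sketch: take per-account locks in a globally consistent order (sorted by account id), so two opposite transfers can never deadlock each other. Names and the in-memory balances are illustrative; idempotency for retried requests still needs to be layered on top.

```python
import threading

_registry_lock = threading.Lock()
_account_locks = {}  # hypothetical per-account lock registry

def _lock_for(account_id):
    with _registry_lock:
        return _account_locks.setdefault(account_id, threading.Lock())

def transfer(balances, src, dst, amount):
    if src == dst:
        return False  # a self-transfer is a no-op (and would double-lock)
    # Always lock the lower account id first: a->b and b->a transfers then
    # acquire in the same order, so neither holds one lock while waiting
    # forever on the other.
    first, second = sorted([src, dst])
    with _lock_for(first), _lock_for(second):
        if balances[src] < amount:
            return False  # invariant: balances never go negative
        balances[src] -= amount
        balances[dst] += amount
        return True
```

The same ordering trick applies to database row locks: update rows in a fixed key order within the transaction.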
Scenario or challenge You must choose a default strategy and defend it in design review.
Interactive question (pause and think) Answer these in order:
Explanation with analogy You’re deciding whether to:
Key insight: Pick the strategy that fails in the way you can tolerate.
Challenge question Which failure is worse for your product: occasional “please retry” errors or occasional oversells?
Scenario or challenge You’re building a shared whiteboard:
Interactive question (pause and think) Design concurrency control for shape position updates.
Choose one approach and justify:
Also answer:
Explanation with analogy (progressive reveal) One strong solution outline:
Production insight: for “dragging”, you often want intent-based operations (deltas) and smoothing on the client, not strict locking.
Real-world parallel A shared dry-erase board in a meeting room: people can write simultaneously, but you need conventions for overwriting and erasing.
Key insight: The most scalable “locking” strategy is often: don’t lock - change the data type and semantics so concurrent updates can safely merge.
Final challenge question What invariant in your system is truly non-negotiable - and how much coordination are you willing to pay to protect it?