Challenge: Two replicas, one truth… but whose?
You run a globally distributed service: user profiles stored in multiple regions. A network partition hits. Region A accepts updates for 20 minutes, Region B accepts different updates. When the partition heals, you need to synchronize the datasets.
You have constraints: cross-region bandwidth is costly, and a key-by-key comparison of both datasets would be slow and expensive.
Pause and think: If each region has 10 million keys, what’s the naïve approach to find differences?
(…pause…)
[ANSWER]
This is where Merkle trees show up: a compact way to summarize large sets so you can locate differences with logarithmic-ish network cost when divergence is small.
Production note: Merkle trees reduce network transfer for diff discovery. They do not eliminate the cost of reading/hashing data to build/maintain the tree, which is often the real bottleneck.
Imagine two coffee shops (Region A and Region B) that both keep a daily ledger of orders. At the end of the day, you want to confirm the ledgers match.
Each ledger has thousands of orders.
You don’t want to read every order over the phone.
Interactive question (pause and think)
If you could exchange just one number to confirm the ledgers match, what would it be?
(…pause…)
You’d exchange a digest: a hash of the entire ledger.
Merkle trees are like taking the ledger, splitting it into pages, hashing each page, then hashing groups of pages, until you get one top hash (the root). That root is the “receipt fingerprint.” If roots differ, you can zoom in to find the mismatching page(s).
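The page-hashing analogy can be sketched directly. This is a minimal illustration (SHA-256; duplicating the last node on odd levels is one common convention, not the only one):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(pages: list[bytes]) -> bytes:
    """Hash each page, then repeatedly hash adjacent pairs until one root remains."""
    level = [h(p) for p in pages]           # assumes at least one page
    while len(level) > 1:
        if len(level) % 2 == 1:             # odd level: duplicate the last node
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

ledger_a = [b"page1: 3 lattes", b"page2: 2 mochas", b"page3: 1 espresso"]
ledger_b = [b"page1: 3 lattes", b"page2: 2 mochas", b"page3: 1 espresso"]
assert merkle_root(ledger_a) == merkle_root(ledger_b)   # matching ledgers, one root

ledger_b[1] = b"page2: 9 mochas"                        # one changed page...
assert merkle_root(ledger_a) != merkle_root(ledger_b)   # ...changes the root
```

Comparing the two roots is the "exchange one number" step; the rest of the tree is what lets you zoom in when they differ.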
Real-world parallel
Git uses Merkle-DAG hashing for objects/trees/commits.
Blockchains use Merkle trees to summarize transactions.
Dynamo-style databases use Merkle trees for replica synchronization (anti-entropy/repair).
Key insight: A Merkle tree lets you answer: “Do we match?” and if not, “Where do we differ?” with minimal data transfer.
If you only compare a single top-level hash, what important capability do you lose compared to Merkle trees?
(…pause…)
[ANSWER] You lose localization: you can’t efficiently narrow down which subset differs, so you fall back to expensive full scans or ad-hoc sampling.
You’re in a design review and someone says: “If we use Merkle trees, replicas become consistent automatically.”
What do Merkle trees actually provide: consistency guarantees, or something else?
(…pause…)
Merkle trees are detection and localization tools. They tell you: whether two replicas match and, if not, which key ranges differ.
They do not decide: which version of a key is correct, or how concurrent writes should be merged.
Those require one (or more) of:
version vectors / vector clocks / dotted version vectors
CRDTs
last-write-wins (LWW) timestamps (unsafe if clocks are not tightly bounded; even with NTP you still need a clear skew bound)
application-level reconciliation
Decision game: Which statement is true?
A) Merkle trees guarantee linearizability.
B) Merkle trees help identify which partitions of data differ between replicas.
C) Merkle trees prevent split-brain.
(…pause…)
[ANSWER] B.
Key insight: Merkle trees are about efficient diff discovery, not about conflict semantics.
Name one mechanism you’d pair with Merkle-tree-based sync to resolve concurrent updates.
(…pause…)
[ANSWER] Vector clocks / dotted version vectors, or CRDTs, or application-specific merge logic.
You’re a delivery manager verifying a warehouse inventory against a store inventory.
You want a hierarchy of receipts:
At the bottom: a hash for each box of items.
In the middle: a hash for each shelf, computed from its boxes’ hashes.
At the top: hash(warehouse).
If the top hashes differ, what’s the next step?
(…pause…)
[ANSWER] Compare children hashes. Only descend into subtrees whose hashes differ.
A Merkle tree is a hash tree: leaves hash chunks of data (keys or key ranges), each internal node hashes its children’s hashes, and the root summarizes the entire dataset.
If two replicas have the same root hash, they (almost certainly) have the same data for the same snapshot/version.
[IMAGE: A Merkle tree diagram showing leaves as key ranges (e.g., 0-99, 100-199, etc.), internal nodes hashing children, and a root hash. Two replicas shown side-by-side; one subtree differs, highlighted.]
Key insight: Merkle trees enable progressive refinement: compare coarse summaries first, then zoom in.
What property must the hashing process have so that two replicas compute the same root when they store the same data?
(…pause…)
[ANSWER] Determinism: same partitioning, same ordering, same canonicalization, and same hash function/encoding.
Two regions store a map: key -> value. Keys are strings. Values are JSON blobs.
You want to build a Merkle tree.
If you hash “all key-value pairs,” what subtle problem appears if replicas iterate keys in different orders?
(…pause…)
[ANSWER] If ordering differs, concatenation differs, hashes differ—even if the set is the same.
To build a deterministic Merkle tree, you need deterministic structure.
Option 1: Sort keys
Option 2: Partition by key ranges
[0000-0FFF], [1000-1FFF], etc.
Option 3: Partition by consistent hashing ring positions
What to hash at a leaf?
H( H(key) || H(value_bytes) || version_metadata || tombstone_flag )
Canonicalization matters
JSON must be canonicalized (field order, whitespace) or hashed as the exact bytes stored.
If you compress values, hash the uncompressed canonical bytes or the stored bytes consistently on all replicas.
Where code helps
[CODE: Python, demonstrate deterministic leaf hashing with canonical JSON and sorted keys]
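One possible sketch for the placeholder above, hashing leaves per the H(H(key) || H(value) || version || tombstone) recipe; the canonical_json helper and the 8-byte version encoding are illustrative choices, not a standard:

```python
import hashlib
import json

def canonical_json(value: dict) -> bytes:
    # sort_keys + fixed separators eliminate field-order and whitespace drift
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")

def leaf_hash(entries: dict) -> bytes:
    """entries: key -> (json_value, version, tombstone). Iterated in sorted key
    order so every replica concatenates the same bytes."""
    acc = hashlib.sha256()
    for key in sorted(entries):
        value, version, tombstone = entries[key]
        acc.update(hashlib.sha256(key.encode("utf-8")).digest())
        acc.update(hashlib.sha256(canonical_json(value)).digest())
        acc.update(version.to_bytes(8, "big"))          # illustrative version encoding
        acc.update(b"\x01" if tombstone else b"\x00")   # tombstones are hashed too
    return acc.digest()

replica_a = {"user:1": ({"name": "Ada", "city": "Paris"}, 3, False)}
replica_b = {"user:1": ({"city": "Paris", "name": "Ada"}, 3, False)}  # other field order
assert leaf_hash(replica_a) == leaf_hash(replica_b)   # canonicalization equalizes them
```

Without canonical_json, the two replicas above would report a phantom difference despite storing identical data.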
Key insight: Merkle trees require deterministic partitioning and hashing or you’ll detect “differences” that aren’t real.
If values are large blobs (e.g., images), do you hash the whole blob at leaves? What trade-off does that create?
(…pause…)
[ANSWER] Hashing full blobs increases CPU/IO; hashing only metadata risks missing corruption. Common approach: store/compute a content hash at write time (or chunked hashes) and reuse it during Merkle maintenance.
You’re implementing anti-entropy between replicas.
You must choose:
Tree arity (binary? 16-way?)
Leaf granularity (per key? per range? per page?)
Rebuild frequency
Decision game (pause and think)
Which statement is true?
1) “More leaves always means faster synchronization.”
2) “Higher arity reduces tree depth but increases per-node fanout comparison cost.”
3) “Merkle trees eliminate the need for read repair.”
(…pause…)
[ANSWER] Statement 2 (higher arity reduces depth but increases per-node comparison cost).
Explanation
More leaves: better localization, but bigger tree, more memory, more CPU to maintain, and potentially more leaf-level metadata exchange.
Higher arity: fewer levels (fewer RTTs), but each comparison step may involve sending many child hashes.
Merkle trees don’t eliminate read repair; they complement it.
Comparison table
| Design choice | Benefit | Cost | When it shines |
|---|---|---|---|
| Binary tree | Simple, small child list | Deeper tree (more RTTs) | Low RTT, simpler implementations |
| 16-way tree | Shallower tree (fewer RTTs) | More hashes per node | Cross-region/high RTT, very large datasets |
| Leaves per key | Precise diffs | Huge tree | Small datasets or very small ranges |
| Leaves per range/page | Compact | Less precise diffs | Big datasets, LSM/B-tree pages |
Key insight: Merkle trees are a tunable index over differences: you trade maintenance cost for diff precision.
If your network RTT is high (cross-region), which might matter more: tree depth or per-node payload size?
(…pause…)
[ANSWER] Often depth/round trips dominate; higher arity can reduce RTTs, but watch payload size and MTU/fragmentation.
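A quick back-of-envelope sketch, assuming one round trip per tree level descended (the worst case for locating a single differing leaf):

```python
import math

def sync_rounds(num_leaves: int, arity: int) -> int:
    """Levels below the root ~ round trips needed to reach one differing leaf."""
    return math.ceil(math.log(num_leaves, arity))

leaves = 1_000_000
binary = sync_rounds(leaves, 2)      # ~20 levels -> ~20 RTTs
wide   = sync_rounds(leaves, 16)     # ~5 levels, but 16 child hashes per message
print(binary, wide)   # -> 20 5
```

At 100 ms cross-region RTT, that's roughly 2 s of traversal for the binary tree versus 0.5 s for the 16-way tree, which is why high-RTT deployments favor wider trees.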
Replica A and Replica B periodically synchronize.
You can think of this like two restaurants comparing inventory:
Fill in the blanks (pause and think):
A sends ___________ to B.
If B’s value differs, B requests/returns ___________.
Continue until reaching _________.
(…pause…)
[ANSWER]
A sends root hash (or root + metadata like tree version).
B requests/returns child hashes for the differing node.
Continue until reaching leaves, then exchange actual key/value versions.
A typical exchange:
A -> B: (tree_id, snapshot_id, root_hash)
B compares with its root.
A -> B: list of child hashes
B compares each child; for mismatches, repeat.
At leaves: exchange keys/versions/value-hashes (and tombstones), then fetch missing versions.
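The exchange above can be simulated in one process with two in-memory trees. A sketch assuming a power-of-two leaf count and nested-dict nodes (not a wire protocol):

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build(leaves: list[bytes]) -> dict:
    """Binary hash tree as nested dicts; assumes len(leaves) is a power of two."""
    nodes = [{"hash": h(x), "leaf": i} for i, x in enumerate(leaves)]
    while len(nodes) > 1:
        nodes = [{"hash": h(l["hash"] + r["hash"]), "kids": [l, r]}
                 for l, r in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]

def diff(a: dict, b: dict, out: list) -> None:
    """Guided search: descend only where hashes mismatch; collect differing leaves."""
    if a["hash"] == b["hash"]:
        return                        # identical subtree: prune, no descent
    if "leaf" in a:
        out.append(a["leaf"])         # reached a mismatching leaf
        return
    for ka, kb in zip(a["kids"], b["kids"]):
        diff(ka, kb, out)

replica_a = [b"k0=1", b"k1=2", b"k2=3", b"k3=4"]
replica_b = [b"k0=1", b"k1=2", b"k2=9", b"k3=4"]
mismatches: list[int] = []
diff(build(replica_a), build(replica_b), mismatches)
print(mismatches)   # -> [2]
```

The pruning step (`return` on equal hashes) is the entire bandwidth win: matching subtrees cost one hash comparison regardless of how many keys they cover.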
Where visuals help
[IMAGE: Sequence diagram of Merkle sync: root compare, child compare, descend, leaf exchange.]
Key insight: Merkle sync is a guided search over the keyspace: hashes steer you to the minimal set of data to exchange.
What happens if the dataset changes while you’re mid-sync? How could you make the comparison consistent?
(…pause…)
[ANSWER] Without a stable snapshot/version, you can get churn and non-convergence. Use MVCC snapshots or include a snapshot/version ID and restart on mismatch.
While A and B are comparing trees, writes keep arriving.
It’s like comparing two restaurant inventories while customers are actively buying ingredients.
If A builds a Merkle tree at time T1 and B compares against its current state at time T2, what can happen?
(…pause…)
[ANSWER] You can chase differences forever or miss a stable convergence point. You may also repeatedly detect mismatches caused by concurrent writes rather than replication lag.
Anti-entropy runs continuously.
Differences due to concurrent writes will be corrected in later rounds.
Trade-off table
| Approach | Pros | Cons |
|---|---|---|
| Snapshot-based | Deterministic, converges | Snapshot creation/retention cost; can increase storage/compaction pressure |
| Incremental maintenance | Lower rebuild cost | Complexity; correctness bugs; needs careful locking/versioning |
| Fuzzy continuous | Simple, robust | More bandwidth/CPU; slower convergence under high write rates |
Key insight: In distributed systems, “compare states” often means “compare versions of states.” Merkle trees need a stable reference point.
If you can’t take snapshots, what’s one practical mitigation to avoid infinite churn during sync?
(…pause…)
[ANSWER] Bound the work: cap retries/time per range, and if too many version mismatches occur, restart from root later (or temporarily skip hot ranges).
You run anti-entropy over an unreliable network:
packets drop
connections reset
one node restarts mid-sync
Decision game (pause and think)
Which failure is the most dangerous if unhandled?
A) A message containing child hashes is duplicated.
B) A message containing child hashes is lost.
C) A node sends child hashes computed from a different tree version than the root hash it previously advertised.
(…pause…)
[ANSWER] C.
Explanation
Duplication: usually harmless with idempotent handling.
Loss: retry fixes it.
Version mismatch: you can descend into the wrong subtree and fetch incorrect diffs. You may “repair” data incorrectly or waste time.
Robustness techniques
Include tree version / snapshot ID in every message.
Use request IDs and idempotency.
Timeouts + retry with exponential backoff.
If version mismatch detected: restart sync from root.
Where code helps
[CODE: Python, protocol structs including snapshotID/version and requestID]
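A sketch matching the placeholder above. Field names (request_id, range_id, snapshot_id, node_path) are illustrative; the tree is a nested dict where every node carries "hash" and internal nodes carry "kids":

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HashRequest:
    request_id: str        # idempotency key: duplicates/retries are safe
    range_id: str          # which token range's tree we are comparing
    snapshot_id: int       # hashes are only meaningful within one snapshot
    node_path: tuple = ()  # child indices from the root, e.g. (3, 1)

@dataclass(frozen=True)
class HashResponse:
    request_id: str
    snapshot_id: int
    child_hashes: list

def handle(req: HashRequest, local_snapshot_id: int, local_tree: dict) -> HashResponse:
    if req.snapshot_id != local_snapshot_id:
        # Peer is comparing against a different tree version: restart from root
        raise ValueError("snapshot mismatch: restart sync from root")
    node = local_tree
    for idx in req.node_path:                 # walk down to the requested node
        node = node["kids"][idx]
    return HashResponse(req.request_id, local_snapshot_id,
                        [child["hash"] for child in node.get("kids", [])])

tree = {"hash": b"root", "kids": [{"hash": b"left"}, {"hash": b"right"}]}
resp = handle(HashRequest("req-1", "range-42", snapshot_id=7), 7, tree)
print(resp.child_hashes)   # -> [b'left', b'right']
```

The snapshot check implements the "most dangerous failure" fix from the decision game: hashes from mismatched tree versions are rejected rather than compared.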
Key insight: The protocol must be versioned: hashes are only meaningful relative to the dataset snapshot they summarize.
What’s the simplest safe behavior if a peer can’t provide a consistent snapshot for the duration of the sync?
(…pause…)
[ANSWER] Abort that range’s sync and retry later (or fall back to a different mechanism like log catch-up/snapshot transfer).
Someone worries: “If two different subtrees collide, we’ll think data matches when it doesn’t.”
Is collision risk the main practical issue in Merkle synchronization?
(…pause…)
[ANSWER] Usually no. With cryptographic hashes, collisions are negligible; real issues are determinism, canonicalization, partitioning, and bugs.
Cryptographic hashes (SHA-256, BLAKE3) make collisions so unlikely that distributed systems treat them as impossible for practical purposes.
But there are more realistic failure modes:
inconsistent canonicalization
inconsistent partition boundaries
bugs in incremental updates
using non-cryptographic hashes where adversarial input exists
Pause-and-think
If your system is exposed to untrusted clients (multi-tenant), is a fast non-cryptographic hash (xxHash) safe for Merkle synchronization?
(…pause…)
[ANSWER] Often no. An adversary might craft collisions to hide differences or force pathological behavior. Use a cryptographic hash or a keyed hash (HMAC) depending on threat model.
Key insight: Collisions are rarely the practical problem; determinism and adversarial robustness are.
Name one reason to prefer BLAKE3 over SHA-256 in high-throughput Merkle maintenance.
(…pause…)
[ANSWER] BLAKE3 is typically much faster on modern CPUs (SIMD/parallelism), reducing CPU cost for hashing large datasets.
You use consistent hashing with tokens/vnodes. Each node owns multiple token ranges.
You build a Merkle tree per range.
Why ranges help:
Replica sets are defined by ranges.
Anti-entropy can run per range in parallel.
Leaves correspond to contiguous segments in storage.
Interactive matching exercise
Match the concept to its role:
1) Token range
2) Merkle leaf
3) Internal node
4) Root
Roles: A) Summarizes an entire range’s contents.
B) Summarizes a subset of the range.
C) The unit of ownership/replication in a consistent-hash ring.
D) The smallest summarized chunk; points to actual keys/versions.
(…pause…)
[ANSWER] 1->C, 2->D, 3->B, 4->A
[IMAGE: Consistent hash ring with token ranges; each range has an associated Merkle tree.]
Key insight: In Dynamo-like systems, Merkle trees are often scoped to token ranges, aligning synchronization with replication topology.
If a node changes ownership of a range due to rebalancing, what happens to the corresponding Merkle tree?
(…pause…)
[ANSWER] The tree must be rebuilt/reattached for the new owner’s local data; in-flight repairs for that range should be paused/restarted to avoid comparing different ownership epochs.
You identified a mismatching leaf range [k1..k2]. Now what?
You could exchange:
full values
hashes of values
version metadata (vector clocks)
Interactive question (pause and think)
Why might exchanging full values at the leaf be wasteful?
(…pause…)
[ANSWER] Many keys might already match; you only need to transfer the keys that differ.
At a mismatching leaf: exchange (key, version, value-hash) tuples, then transfer only the keys whose versions or value hashes actually differ.
Production note: if you use multi-version concurrency (siblings) you may need to exchange multiple versions per key (e.g., vector-clock siblings) rather than a single “latest”.
[CODE: Python, example of leaf diff: compare (key, version, valueHash) tuples and request missing keys]
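One possible shape for the placeholder above. With scalar versions we can only flag mismatches; a real system would substitute vector clocks or dotted versions to distinguish stale from concurrent:

```python
def leaf_diff(local: dict, remote: dict):
    """local/remote map key -> (version, value_hash) for one leaf range.
    Returns (keys we lack, keys the peer lacks, keys needing resolution)."""
    fetch, push, mismatched = [], [], []
    for key in sorted(set(local) | set(remote)):
        mine, theirs = local.get(key), remote.get(key)
        if mine is None:
            fetch.append(key)          # only the peer has it
        elif theirs is None:
            push.append(key)           # only we have it
        elif mine != theirs:
            mismatched.append(key)     # versions/hashes differ: resolve, don't guess
    return fetch, push, mismatched

local  = {"a": (1, "h1"), "b": (2, "h2"), "c": (3, "h3")}
remote = {"a": (1, "h1"), "b": (3, "h9"), "d": (1, "h4")}
print(leaf_diff(local, remote))   # -> (['d'], ['c'], ['b'])
```

Key "a" matches and costs nothing beyond its tuple; only the three differing keys trigger value transfer or conflict resolution.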
Key insight: Merkle trees narrow the search; per-key metadata prevents shipping full values unnecessarily.
What if both replicas have different versions for the same key (concurrent writes)? What must your system do next?
(…pause…)
[ANSWER] Fetch both versions (or all siblings) and apply your conflict-resolution strategy (CRDT merge, app merge, or store siblings for later resolution).
Replica A and B store 1 million keys. Only 3 keys differ.
You choose:
16-way Merkle tree
leaves cover 4096 keys each (by sorted order)
Progressive reveal walkthrough
Step 1: Compare roots. They differ.
Step 2: Compare 16 children; suppose 2 of them mismatch.
Step 3: Descend into those 2 subtrees.
Step 4: Eventually reach 2 mismatching leaves.
Now you only exchange metadata for ~8192 keys, not 1 million.
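The walkthrough's arithmetic as a sketch; exact hash counts depend on tree shape and how child hashes are batched, so treat these as order-of-magnitude figures:

```python
import math

total_keys, keys_per_leaf, arity, bad_leaves = 1_000_000, 4096, 16, 2

num_leaves = math.ceil(total_keys / keys_per_leaf)     # ~245 leaves
depth = math.ceil(math.log(num_leaves, arity))         # 2 levels below the root
# root hash + root's children + children of each mismatching subtree
hashes_exchanged = 1 + arity + (depth - 1) * bad_leaves * arity
leaf_metadata_keys = bad_leaves * keys_per_leaf

print(num_leaves, depth, hashes_exchanged)   # -> 245 2 49
print(leaf_metadata_keys)                    # -> 8192, not 1,000,000
```

About 49 hashes of traversal overhead to narrow 1 million keys down to 8,192 candidates: the cost scales with the differences, not the dataset.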
Pause-and-think
If you instead had leaves per key, what changes?
(…pause…)
[ANSWER] You’d localize differences perfectly, but tree size and maintenance overhead explode.
[IMAGE: A “zoom-in” diagram showing root mismatch -> two child mismatches -> two leaf ranges highlighted, with counts of hashes exchanged at each step.]
Key insight: The power of Merkle trees is bandwidth proportional to differences (plus tree traversal overhead), not dataset size.
What happens to bandwidth if differences are widespread (e.g., after long partition)? Does Merkle still help?
(…pause…)
[ANSWER] Benefit diminishes: you end up descending into many subtrees and exchanging lots of leaf metadata/values. At that point, log catch-up or snapshot bootstrap may be cheaper.
A replica was offline for weeks. It missed millions of updates.
Will Merkle trees still save bandwidth?
(…pause…)
[ANSWER] Some, but less: if most subtrees differ, you descend everywhere and transfer lots of data.
Practical alternatives / complements:
Streaming catch-up via logs (WAL replication, CDC): if you have an ordered history.
Bootstrap via snapshots: ship a compact snapshot then resume incremental.
Hybrid: use Merkle for detection + logs for bulk catch-up.
Comparison table
| Method | Best when | Weak when |
|---|---|---|
| Merkle anti-entropy | Small/medium divergence, continuous sync | Massive divergence, cold start |
| Log-based replication | You have durable ordered history | Missing logs, retention limits |
| Full snapshot transfer | New node bootstrap | Frequent small drifts |
Key insight: Merkle trees are an anti-entropy tool, not a universal replication mechanism.
If you have both logs and Merkle trees, when would you prefer Merkle over logs?
(…pause…)
[ANSWER] When logs are incomplete/expired, when you suspect silent divergence/corruption, or when you need periodic full coverage for cold keys.
You don’t want to rebuild the tree from scratch every sync cycle.
In an LSM-tree storage engine, what makes incremental maintenance tricky?
(…pause…)
[ANSWER] Compactions rewrite SSTables, moving keys between files/pages; if leaves map to physical layout, boundaries change and invalidate hashes.
Two approaches: periodic full rebuild (rescan the range and recompute every leaf) or incremental maintenance (update hashes as writes land).
Practical designs: mark a leaf dirty on each write and recompute it lazily before the next sync round, or recompute per-segment hashes during compaction.
Important correction: a “rolling hash” updated as H(prev || entry) is order-dependent and not a true Merkle leaf summary unless the entry order is stable and you can replay it deterministically. In production, leaf hashes are typically computed from a deterministic ordered iteration of entries in that leaf range (or from a stable per-entry hash multiset with a commutative combiner).
[CODE: Python, sketch of incremental maintenance using per-leaf recomputation trigger]
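A sketch of the placeholder above: writes mark a leaf dirty, and leaf hashes are recomputed lazily from a deterministic sorted iteration. The key-to-leaf mapping here is illustrative; production systems use stable token/key ranges:

```python
import hashlib

class RangeMerkle:
    """Fixed leaf ranges; puts mark a leaf dirty, root() recomputes only dirty leaves."""

    def __init__(self, num_leaves: int):
        self.num_leaves = num_leaves
        self.data = [dict() for _ in range(num_leaves)]   # leaf index -> {key: value}
        self.leaf_hashes = [b""] * num_leaves
        self.dirty = set(range(num_leaves))               # everything dirty at start

    def _leaf_for(self, key: str) -> int:
        # Stable mapping (unlike Python's per-process salted hash()); illustrative
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.num_leaves

    def put(self, key: str, value: bytes) -> None:
        i = self._leaf_for(key)
        self.data[i][key] = value
        self.dirty.add(i)                                 # defer rehash until next sync

    def root(self) -> bytes:
        for i in self.dirty:                              # recompute only what changed
            acc = hashlib.sha256()
            for k in sorted(self.data[i]):                # deterministic iteration order
                acc.update(k.encode("utf-8"))
                acc.update(self.data[i][k])
            self.leaf_hashes[i] = acc.digest()
        self.dirty.clear()
        return hashlib.sha256(b"".join(self.leaf_hashes)).digest()

t1, t2 = RangeMerkle(8), RangeMerkle(8)
for t in (t1, t2):
    t.put("alpha", b"1"); t.put("beta", b"2")
assert t1.root() == t2.root()      # same data -> same root
t2.put("beta", b"3")
assert t1.root() != t2.root()      # one write -> one dirty leaf -> new root
```

The dirty set is the "per-leaf recomputation trigger": steady-state cost is proportional to the write rate, not the dataset size.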
Key insight: Incremental Merkle maintenance must align with your storage engine’s notion of stable partitions.
If you maintain per-range hashes, how do you handle tombstones (deletes) so replicas converge?
(…pause…)
[ANSWER] Include tombstones (and their version metadata) in the hashed state until they are safe to GC.
Replica A deleted key x and keeps a tombstone. Replica B never saw the key.
If B doesn’t store tombstones, could A and B compute the same Merkle root?
(…pause…)
[ANSWER] Yes, if both treat x as absent—but that can be dangerous because B might later resurrect x via repair from an old replica.
In eventually consistent stores, deletes are typically represented as tombstones with version metadata.
During sync, tombstones must be included in hashing (until GC safe).
Otherwise, you get resurrection during repair.
Common misconception
“Deletes are just missing keys.”
In distributed replication, deletes are events that must be ordered/merged like writes.
Key insight: Merkle sync must include tombstones (or equivalent delete markers) in the hashed state until safe to purge.
What condition must hold before a tombstone can be safely garbage-collected in a replicated system?
(…pause…)
[ANSWER] You must ensure all replicas that could serve/repair the key have observed the tombstone (or the tombstone is older than a proven bound like gc_grace_seconds and you accept the trade-off), otherwise you risk resurrection.
You use quorum reads/writes (N=3, R=2, W=2).
Which statement is true?
A) With R+W>N, Merkle trees are unnecessary.
B) Merkle trees and read repair both fix divergence, but on different triggers.
C) Read repair guarantees full convergence without anti-entropy.
(…pause…)
[ANSWER] B.
Explanation
Quorums reduce inconsistency probability but don’t eliminate it (failures, hinted handoff, sloppy quorums).
Read repair: opportunistic, driven by client reads.
Merkle anti-entropy: background, ensures convergence even for cold keys.
Comparison table
| Mechanism | Trigger | Scope | Strength |
|---|---|---|---|
| Read repair | Client reads | Only read keys | Low overhead, incomplete coverage |
| Merkle anti-entropy | Background schedule | Whole dataset/range | Complete coverage, higher background cost |
| Hinted handoff | Failure recovery | Missed writes | Fast catch-up, limited retention |
Key insight: Merkle trees are your “night shift inventory audit,” while read repair is “fix it when a customer complains.”
If your workload has many cold keys never read, which mechanism becomes essential for convergence?
(…pause…)
[ANSWER] Background anti-entropy/repair (Merkle-based or equivalent).
You run a multi-tenant system. Nodes in different trust domains synchronize.
If you send Merkle hashes to a peer, are you leaking information?
(…pause…)
[ANSWER] Potentially yes: hashes can leak presence/absence patterns (especially with small leaves and guessable keys). Low-entropy values can be brute-forced.
Mitigations:
Use keyed hashes (HMAC) so hashes are not meaningful outside the cluster.
Encrypt transport.
Increase leaf granularity to reduce inference.
Consider privacy-preserving set reconciliation if threat model requires (more complex).
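A minimal sketch of the HMAC mitigation above; how the shared cluster_key is provisioned is out of scope here and simply assumed:

```python
import hashlib
import hmac

def keyed_leaf_hash(cluster_key: bytes, entries: list) -> bytes:
    """HMAC-SHA256 over sorted (key, value) pairs: without cluster_key, an
    observer can't brute-force guessable keys/values from the digest."""
    mac = hmac.new(cluster_key, digestmod=hashlib.sha256)
    for k, v in sorted(entries):
        mac.update(hashlib.sha256(k).digest())
        mac.update(hashlib.sha256(v).digest())
    return mac.digest()

secret = b"shared-cluster-secret"   # assumption: provisioned out of band
d1 = keyed_leaf_hash(secret, [(b"user:1", b"Ada")])
d2 = keyed_leaf_hash(b"other-key", [(b"user:1", b"Ada")])
assert d1 != d2                     # same data, different key -> unlinkable digests
```

With a plain hash, an observer who can guess "user:1" can confirm its presence by hashing it themselves; the keyed variant removes that oracle.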
Key insight: Merkle trees are compact summaries—but summaries can still leak structure. Choose hashes and granularity with your threat model.
When would you use HMAC-based Merkle hashing instead of plain SHA-256?
(…pause…)
[ANSWER] When peers are not fully trusted or when you want to prevent offline guessing/brute-force of low-entropy keys/values from observed hashes.
You’re choosing a synchronization strategy for a replicated key-value store.
Where Merkle trees show up:
Apache Cassandra / Dynamo-style systems: anti-entropy repair uses Merkle trees to find inconsistent ranges.
Git: object graph hashing enables efficient synchronization.
Blockchains: Merkle roots commit to sets of transactions.
Distributed filesystems: chunk trees for verifying and syncing.
Pause-and-think
Why are Merkle trees especially compatible with content-addressed storage?
(…pause…)
[ANSWER] Because the hash is the identity. Synchronization becomes “do you have object X?” and Merkle paths prove inclusion.
[IMAGE: Example of content-addressed DAG vs Merkle tree; show how hashes name objects and enable sync.]
Key insight: Merkle trees pair naturally with systems that already have stable chunking and hashing.
Name one reason a relational database might not use Merkle trees for replica sync.
(…pause…)
[ANSWER] Many RDBMS replication schemes are log-based (WAL/binlog) with strict ordering/transactions; Merkle trees don’t naturally capture transactional semantics or predicate-based queries, and building them over rows can be expensive without a clear stable partitioning.
You need to justify Merkle trees to your performance team.
What dominates cost when the dataset is huge but differences are tiny?
(…pause…)
[ANSWER] Often CPU/IO to build or read the tree (if you rebuild) dominates, not network.
Costs: CPU to hash data, IO to read it, memory to hold the tree, and network to exchange hashes during traversal.
Trade-offs:
Rebuild vs incremental maintenance.
Leaf size: bigger leaves reduce tree size but increase leaf diff payload.
Hash function choice: SHA-256 vs BLAKE3.
Comparison table: leaf size
| Leaf size | Tree size | Diff precision | Leaf diff cost |
|---|---|---|---|
| Small | Large | High | Low per leaf, but more leaves |
| Large | Small | Low | High per leaf |
Key insight: Merkle trees shift the problem from “send everything” to “compute smart summaries.” Computation can become the bottleneck.
If your bottleneck is disk IO, what Merkle-tree design choice helps most?
(…pause…)
[ANSWER] Avoid full rebuild scans: use incremental maintenance or dirty-range recomputation; also align leaves with storage iteration (range scans) to minimize random IO.
You and a teammate each have a set of key-value pairs. You want to reconcile.
Given: you both hold keys a, b, c, d (sorted), but your values for key c differ.
If you build a binary Merkle tree over sorted keys [a,b,c,d], which leaf differs?
(…pause…)
[ANSWER] Leaf for c differs, and only the path from that leaf to root differs.
[CODE: Python, build a Merkle tree from sorted (key,value) pairs and compute diff path]
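A possible sketch for the placeholder above, assuming a power-of-two number of pre-sorted pairs; it builds every tree level and reports which node indices differ per level:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def tree_levels(pairs: list) -> list:
    """All levels, leaves first; pairs are (key, value) pre-sorted by key,
    with a power-of-two count for simplicity."""
    level = [h(k.encode() + b"\x00" + v.encode()) for k, v in pairs]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

mine   = tree_levels([("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")])
theirs = tree_levels([("a", "1"), ("b", "2"), ("c", "X"), ("d", "4")])

# Only c's leaf-to-root path differs: exactly one node per level.
diff_path = [[i for i, (x, y) in enumerate(zip(la, lb)) if x != y]
             for la, lb in zip(mine, theirs)]
print(diff_path)   # -> [[2], [1], [0]]
```

One differing key perturbs one node per level: leaf 2, its parent (node 1), and the root, i.e. O(log n) hashes.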
Key insight: A single key difference only perturbs O(log n) hashes (along its path), enabling efficient localization.
If two keys differ in different halves of the tree, how many internal nodes will differ?
(…pause…)
[ANSWER] All internal nodes on both leaf-to-root paths differ. Because the keys sit in different halves, the two paths share only the root, so about 2·(depth − 1) + 1 internal nodes differ (counting the shared root once).
Your repair job runs nightly. Sometimes it spikes CPU, sometimes it never finishes.
Which is more dangerous: repair that’s slow, or repair that’s fast but starves client traffic?
(…pause…)
[ANSWER] Fast-but-starving is often worse: it can cause cascading timeouts and trigger failovers, increasing divergence.
Common gotchas: unthrottled repair saturating CPU and disk IO, overlapping repairs on the same ranges, repairs racing compactions or topology changes, and comparisons across different tree snapshots.
Mitigations: rate-limit repair bandwidth and concurrency, schedule off-peak, pause repair during topology changes, and tag every run with a snapshot/version ID so stale comparisons abort.
Observability (production-grade):
Per-range: last repaired time, bytes compared, bytes transferred, mismatch rate.
Per-node: CPU, disk read IOPS, compaction backlog, p99 latency during repair.
Protocol: retries, timeouts, snapshot/version mismatch counts.
Key insight: Merkle sync is a background protocol; in production, resource governance matters as much as correctness.
How would you detect that repair is causing user-visible latency regressions?
(…pause…)
[ANSWER] Correlate repair concurrency/bandwidth with p95/p99 read/write latency, queue depths, disk IO saturation, and error rates; add automated canaries that pause repair when SLOs degrade.
You’re integrating Merkle anti-entropy into a system that already has quorums and conflict resolution.
Think of Merkle trees as:
They connect to:
replication topology (ranges/vnodes)
versioning semantics (vector clocks, CRDTs)
storage engine constraints (snapshots, compaction)
operational realities (rate limits, topology churn)
Final decision game (pause and think)
You’re designing a geo-replicated KV store.
Pick the best combination for strong eventual convergence with manageable cost:
A) LWW timestamps + no anti-entropy
B) Vector clocks + read repair only
C) Vector clocks (or CRDTs) + Merkle-tree anti-entropy + rate-limited repair
(…pause…)
[ANSWER] C is the robust design pattern.
Key insight: Merkle trees make finding divergence cheap; versioning makes resolving divergence correct.
[TARGET] Challenge scenario
You operate a 12-node cluster across 3 regions. Each key is replicated to 3 nodes (N=3). You use sloppy quorums during outages. Writes continue during partitions.
You need a repair strategy that:
guarantees eventual convergence for cold keys
doesn’t overwhelm inter-region links
handles node restarts and topology changes
Your tasks (pause and think): pick a tree/leaf granularity, decide how to get a stable comparison point, define what gets exchanged at leaves, plan for mid-sync failures, and set operational guardrails.
(…pause…)
Suggested solution outline (progressive reveal)
Granularity: per token range, leaves sized to balance diff precision vs tree size (e.g., 4K–64K keys depending on key distribution and RTT).
Snapshot: prefer MVCC snapshot IDs for deterministic runs; otherwise fuzzy but bounded (restart if too many version mismatches; skip hot ranges).
Leaf exchange: (key, versionVector/dottedVersion, valueHash, tombstoneFlag) then fetch missing versions; include tombstones.
Failures: every message includes (snapshotID, rangeID, requestID); if snapshotID mismatch, restart from root; idempotent requests; timeouts/backoff.
Guardrails: rate-limit repair bandwidth, pause on topology change, track repair lag per range, alert on CPU/IO saturation.
Final challenge question
If you could only implement one improvement to make Merkle-based repair safer in production, what would you choose and why?
(…pause…)
[ANSWER] Version/snapshot IDs end-to-end (and restart-on-mismatch). It prevents incorrect comparisons and “repairing the wrong thing” under concurrent writes/topology churn.