You're about to change the shape of your data while the system is live, traffic is unpredictable, and multiple services are reading/writing concurrently.
The goal: no user-visible downtime and no data loss.
This is an interactive, progressive-reveal guide. You'll repeatedly see a scenario, a challenge question, a prompt to pause, and then the answer.
If you're reading this as a team, assign roles (e.g., one person proposes answers, another challenges them).
At the end you'll face a final synthesis challenge that combines everything.
You run a global e-commerce platform. Your orders table is central: every checkout writes to it; fulfillment reads it; customer support queries it; analytics streams it.
A product manager asks: "We need to support multi-currency with per-line exchange rates. Can you add currency, fx_rate, and change total_amount semantics?"
Traffic is constant. You can't take a maintenance window.
If you change the schema in place (e.g., ALTER TABLE orders ADD COLUMN currency ...), what can go wrong in a distributed system?
Write down 5 failure modes.
(Keep reading when you've got your list.)
You're renovating a coffee shop while it stays open. Customers keep ordering. If you move the espresso machine, you must ensure orders keep flowing during the move and nothing in progress gets lost.
Zero-downtime migrations are not a single DDL statement. They're a coordinated, multi-step protocol across schema, application versions, data backfills, and operational safeguards.
Challenge question: What's the one thing you must assume in distributed deployments?
Pause and think.
Answer: You will run mixed versions concurrently (for minutes to hours), and they will interact with the same data.
You have 12 microservices. 4 write to the same table. Deploys are rolling; some pods update slower. Meanwhile, read replicas serve 60% of reads.
What does "backward compatibility" mean for a schema?
A) Old application versions keep working against the new schema
B) New application versions keep working against the old schema
C) Both
Hold your answer.
It depends on the migration phase, but you must design for both directions across time.
So in practice: C) both, but not always simultaneously for all operations—hence phased protocols.
Think of schema + application as a client-server protocol: both sides must agree on the data's shape at every moment, even mid-rollout.
A safe migration is like evolving an API: additive changes first, destructive changes last.
A delivery platform changes its address format. If drivers' apps update gradually, the backend must accept both old and new address formats until everyone catches up.
Treat schema changes like API versioning. Your DB is a shared dependency with slow, partial rollouts.
Challenge question: What's the schema equivalent of "breaking an API"? List 3.
Examples:
- Renaming or dropping a column that old code still reads
- Changing a column's type or semantics in place
- Adding constraints (NOT NULL, CHECK) that old writers violate

During a migration you often choose between consistency and availability: you rarely get both everywhere at once.
Most production systems choose AP-ish behavior during partitions (keep serving), which means: readers may see stale or partially migrated data, and writes may temporarily land in only one representation.
So your migration protocol must tolerate inconsistency windows and reconcile.
Production insight: If you require strong consistency during cutover, you must route reads to the primary (or use synchronous replication / quorum reads), which reduces availability and increases latency.
"If we do it without downtime, users won't notice and we're safe."
Zero-downtime migrations often increase complexity and risk because: there are more phases, longer windows of mixed versions, and more places for dual-write and backfill bugs to hide.
Which is riskier: (1) a brief scheduled maintenance window, or (2) a long-running zero-downtime migration?
Don't answer "it depends" yet—pick one.
Often (2) is riskier operationally, but (1) may be unacceptable product-wise.
The engineering goal becomes: reduce the risk of (2) with disciplined patterns.
"Zero downtime" is a product requirement; "low risk" is an engineering requirement. You must optimize both.
Challenge question: What observability signals become more important during dual-write?
Examples: write failures on either representation, the mismatch rate between old and new values, dual-write transaction latency, and backfill progress.
This is the foundational protocol used across many systems.
You need to rename orders.total_amount to orders.total_amount_minor and change semantics.
Why is "contract" last?
Think in terms of rollback.
Because once you drop old columns/tables, old application versions can no longer function, and rollback becomes impossible without restoring schema and possibly data.
You introduce a new menu item (expand), train staff and shift customers to it (migrate), then remove the old item (contract). Removing the old item first breaks orders during the transition.
Expand/Contract makes schema evolution monotonic during rollout: you only add capabilities until you're sure you don't need the old ones.
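The monotonic phase ordering can be sketched as a tiny state machine (illustrative Python, not from any real migration tool): rollback stays cheap until Contract runs.

```python
# Sketch of Expand -> Migrate -> Contract ordering, showing why rollback
# is only cheap before the Contract phase removes the old capabilities.
PHASES = ["expand", "migrate", "contract"]

class Migration:
    def __init__(self):
        self.phase_index = -1  # not started

    def advance(self):
        """Move to the next phase; phases only move forward (monotonic)."""
        self.phase_index += 1
        return PHASES[self.phase_index]

    def can_roll_back(self):
        # Before Contract, old columns and old code paths still work, so
        # rollback is just routing/flags. After Contract they are gone.
        return self.phase_index < PHASES.index("contract")

m = Migration()
m.advance()                 # expand: add new columns (additive only)
assert m.can_roll_back()
m.advance()                 # migrate: dual-write, backfill, shift reads
assert m.can_roll_back()
m.advance()                 # contract: drop old columns
assert not m.can_roll_back()
```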
Challenge question: Name 2 schema changes that are "expand-safe" and 2 that are "contract-only."
Examples: expand-safe: adding a nullable column, adding a new table or index. Contract-only: dropping a column, tightening a constraint that old writers would violate.
You have service checkout writing to orders. You want to add a NOT NULL column currency.
Choose the safest sequence:
Option A: run `ALTER TABLE orders ADD COLUMN currency TEXT NOT NULL;`, then deploy code that writes `currency`.

Option B:
1. `ALTER TABLE orders ADD COLUMN currency TEXT NULL;`
2. Deploy code that writes `currency` (and defaults it if absent)
3. Backfill existing rows
4. Add the `NOT NULL` constraint

Option C: deploy code that writes `currency` before the column exists.

Pause and think.
Option B is the classic safe approach.
- `NOT NULL` without a default will fail on existing rows and can take stronger locks.
- Prefer: add nullable -> dual-write -> backfill -> enforce constraint.
Challenge question: What's the hidden risk in step 4 (adding NOT NULL) on large tables?
Answer: it can require a table scan and take locks that block writes (Postgres requires a table scan to verify no NULLs; lock level and duration depend on version and concurrent activity). Plan it like a production rollout.
Your primary DB is PostgreSQL with streaming replicas. You run ALTER TABLE ... at peak traffic.
What does "online DDL" actually mean?
A) No locks are taken at all
B) Zero impact on running queries
C) It depends on the operation, engine, and version

Pick one.
C) depends.
Even "online" operations can: consume heavy IO, inflate replication lag, and briefly take strong locks at the start or end.
Diagram description (include in your visuals):
At time t1, an `ALTER TABLE` acquires a lock (label the lock type, e.g., ACCESS EXCLUSIVE vs SHARE UPDATE EXCLUSIVE depending on the operation); queries queue up behind it.

Your database is a busy bridge. Some maintenance can happen on the shoulder (online), but certain repairs require closing a lane (locking). Traffic jams propagate backward (timeouts, retries, queue buildup).
DDL isn't just a schema event; it's a cluster-wide performance and consistency event.
Challenge question: How can replica lag create "Schrödinger's schema" for read traffic?
Answer: some requests routed to replicas see old schema/data while others routed to primary see new schema/data, producing inconsistent behavior across requests.
You deploy checkout v1 -> v2 while the DB migrates from schema S1 -> S2.
For each cell, mark safe or unsafe.
App v1 reads and writes `total_amount` only. App v2 writes `total_amount_minor` and reads it if present, otherwise falls back to `total_amount`. Schema S2 contains both `total_amount` and `total_amount_minor`.

| | Schema S1 (old only) | Schema S2 (old+new) |
|---|---|---|
| App v1 | ? | ? |
| App v2 | ? | ? |
Pause and think.
| | Schema S1 (old only) | Schema S2 (old+new) |
|---|---|---|
| App v1 | safe | safe |
| App v2 | safe (if coded to tolerate missing new column) | safe |
If v2 assumes the new column exists, then v2 x S1 becomes unsafe.
Your application must be explicitly coded for schema feature detection (or tolerant reads/writes) during the migration window.
Challenge question: What's the equivalent of "feature flags" at the schema level?
Answer: capability detection (e.g., checking column existence, or using a migration state table), plus runtime flags controlling read/write paths.
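A minimal sketch of schema capability detection, using Python's sqlite3 so it's runnable; in Postgres you'd query the system catalogs (e.g. `pg_attribute` or `to_regclass`) instead of `PRAGMA table_info`.

```python
# Capability detection: branch the read path on whether the new column
# exists yet, so the same code is safe against schema S1 and S2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_amount INTEGER)")

def has_column(conn, table, column):
    # sqlite3 catalog query; Postgres would use pg_attribute instead.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return column in cols

def read_total(conn, order_id):
    # Tolerant read path: use the new column only if the schema has it.
    col = ("total_amount_minor"
           if has_column(conn, "orders", "total_amount_minor")
           else "total_amount")
    row = conn.execute(f"SELECT {col} FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row[0]

conn.execute("INSERT INTO orders VALUES (1, 500)")
assert read_total(conn, 1) == 500          # schema S1: falls back to old column

conn.execute("ALTER TABLE orders ADD COLUMN total_amount_minor INTEGER")
conn.execute("UPDATE orders SET total_amount_minor = 500")
assert read_total(conn, 1) == 500          # schema S2: new column detected
```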
Add currency and fx_rate to order_lines.
- Set `lock_timeout` to avoid long blocking.
- Set `statement_timeout` to avoid runaway DDL.
- Use `ADD COLUMN` without a default (metadata-only in modern Postgres) and backfill separately.
- Add `CHECK ... NOT VALID`, then `VALIDATE CONSTRAINT`.

Why can NOT VALID constraints (Postgres) be safer than immediate validation?
Because you can add the constraint without scanning the whole table immediately, then validate asynchronously, reducing lock time and IO spikes.
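The lock-conscious sequence above can be written out as an ordered runbook. This is a sketch assuming Postgres and the guide's `order_lines` example; the statements are shown as strings rather than executed.

```python
# Postgres-specific DDL runbook for adding currency/fx_rate safely.
# Constraint and column names are illustrative.
SAFE_DDL = [
    ("session setup", "SET lock_timeout = '2s'; SET statement_timeout = '30s';"),
    ("expand",        "ALTER TABLE order_lines ADD COLUMN currency TEXT;"),   # no default: metadata-only
    ("expand",        "ALTER TABLE order_lines ADD COLUMN fx_rate NUMERIC;"),
    ("backfill",      "-- application-driven batched UPDATE (see the backfill sections)"),
    ("constraint",    "ALTER TABLE order_lines ADD CONSTRAINT currency_present "
                      "CHECK (currency IS NOT NULL) NOT VALID;"),             # no full-table scan yet
    ("validate",      "ALTER TABLE order_lines VALIDATE CONSTRAINT currency_present;"),  # scan, weaker lock
]

# The invariant that matters: nothing that scans or strongly locks the
# table runs before the additive, metadata-only steps and the backfill.
phases = [p for p, _ in SAFE_DDL]
assert phases.index("validate") > phases.index("constraint") > phases.index("expand")
assert any("NOT VALID" in sql for _, sql in SAFE_DDL)
```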
Backfill is a distributed workload: it competes with production traffic and must be rate-limited, observable, and restartable.
Challenge question: What makes a backfill "restartable" without double-writing errors?
Answer: idempotent updates + durable checkpoints + safe retry semantics.
"We'll write a one-off script that updates all rows."
In production distributed systems, backfill is closer to a data pipeline: batched, rate-limited, checkpointed, observable, and restartable.
If your backfill updates 500M rows, what's your plan for: pacing (rows/sec), pausing under load, resuming after a crash, and verifying the result?
Treat backfills like production jobs: SLOs, throttles, and abort paths.
Challenge question: What's the simplest checkpoint key you can use for a table backfill?
Answer: a monotonic, indexed key (often the primary key), plus a job name/version.
You're moving from orders.total_amount to orders.total_amount_minor and want to keep them consistent.
Dual-write is like writing the same message to two delivery companies at once. If one delivery fails, you must detect and reconcile.
Dual-write sounds safe. What's the hardest part?
A) Extra write latency
B) Partial failure: one write succeeds and the other doesn't
C) Code complexity
Pick one.
Usually B.
If you write two columns in the same row in the same transaction, you get atomicity at the row level.
But if you dual-write across two tables or two databases, atomicity becomes distributed.
Diagram description (include in your visuals):
- Actors: `checkout v1`, `checkout v2`, DB.
- Happy path: v2 writes both columns in one transaction (success).
- Failure path: v2 writes the old column, then crashes before writing the new one (possible if the writes are not in the same transaction, or span systems).

Dual-write is safe only when you have a reconciliation strategy (read-repair, async repair, or periodic consistency checks).
Challenge question: What's the difference between read-repair and backfill?
Answer: read-repair fixes individual rows lazily on the read path (low, spread-out cost, but only touches rows that are actually read); backfill fixes all rows proactively in bulk (complete coverage, but a heavy batch workload). Most migrations need backfill, with read-repair as a safety net.
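Here's a runnable sketch of same-row dual-write plus read-repair for the `total_amount` -> `total_amount_minor` move, with sqlite3 standing in for the primary.

```python
# Dual-write both columns in ONE transaction (row-level atomicity), and
# read-repair legacy rows that predate the dual-write deployment.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    total_amount INTEGER,        -- old representation
    total_amount_minor INTEGER   -- new representation
)""")

def write_order(order_id, amount):
    with conn:  # both columns committed together
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                     (order_id, amount, amount))

def read_order(order_id):
    old, new = conn.execute(
        "SELECT total_amount, total_amount_minor FROM orders WHERE id = ?",
        (order_id,)).fetchone()
    if new is None:      # row written before dual-write started
        new = old
        with conn:       # read-repair: fix this row lazily, on the read path
            conn.execute("UPDATE orders SET total_amount_minor = ? WHERE id = ?",
                         (new, order_id))
    return new

# A legacy row that predates dual-write:
conn.execute("INSERT INTO orders (id, total_amount) VALUES (1, 999)")
assert read_order(1) == 999   # repaired on first read
row = conn.execute("SELECT total_amount_minor FROM orders WHERE id = 1").fetchone()
assert row[0] == 999          # the repair persisted

write_order(2, 450)
assert read_order(2) == 450
```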
You're migrating from Postgres to a new distributed SQL cluster. For months you will write to both.
Which statement is true?
1. Dual-writes to two databases are atomic if issued from the same code block.
2. Dual-writes across databases are not atomic; you need reconciliation (or distributed transactions, with their own costs).
3. 2PC makes the problem disappear.
Pause and think.
2 is the most accurate.
Challenge question: When is 2PC acceptable, and when is it a trap?
Guideline: 2PC can be acceptable inside a single, tightly controlled system with short-lived transactions and a reliable coordinator. It's a trap across heterogeneous systems or long-running operations, where coordinator failures and held locks turn into availability incidents.
You need to change a table's primary key or partitioning scheme—operations that are expensive in place.
1. Create `orders_v2` with the new schema
2. Dual-write to `orders` and `orders_v2` (or capture changes via CDC)
3. Backfill `orders_v2`
4. Cut reads (and eventually writes) over to `orders_v2`

You build a new prep line next to the old one. For a while, every order is prepared on both lines to validate the new process. Once proven, you route all orders to the new line.
What's your cutover risk if you "swap tables" instantly?
Instant cutover can cause: writes landing in the wrong table during the swap, sudden cache and query-plan invalidation, and no window to compare results before full exposure.
A safer cutover is progressive: route a small percentage of reads first.
Shadow table migrations are powerful, but you must treat cutover like a traffic migration (canary, progressive rollout, rollback plan).
Challenge question: How do you rollback after cutover if writes already went to the new table?
Answer: you usually rollback by routing (feature flag / traffic shift) while keeping dual-write on, not by reverting the DB. If you must revert, you need a reverse replication/backfill plan.
You add an index to support a new query, but index build can be heavy. In distributed systems, index builds can also impact replication and storage.
Which is safer for large Postgres tables?
A) `CREATE INDEX ...;`
B) `CREATE INDEX CONCURRENTLY ...;`

Usually B: `CONCURRENTLY` reduces blocking but takes longer and has different failure modes (a failed build leaves an invalid index you must drop and retry).
Index changes are performance migrations. Treat them like production rollouts with metrics and rollback.
Challenge question: What metrics would you watch immediately after an index appears?
Examples: write latency (p99) on the table, replication lag, lock waits, query-plan changes on hot queries, storage and IO consumption.
Your migration plan looks perfect in staging. Production fails anyway.
Match each failure mode to a mitigation.
Failures:
A. Replica lag causes reads to miss new column
B. Old service writes rows without new required field
C. Backfill job overloads DB
D. CDC consumer crashes on schema change
E. Partial dual-write across databases
Mitigations:
1. Route schema-sensitive reads to the primary until lag clears
2. Keep the column nullable (with app-side defaults) until every writer has deployed
3. Throttle the backfill, monitor load, and keep an abort switch
4. Deploy tolerant consumers / schema-registry compatibility checks before the change
5. Run a reconciliation job that detects and repairs divergence
Pause and think.
A->1, B->2, C->3, D->4, E->5
In distributed systems, schema changes propagate at different speeds to different components.
Challenge question: Which component is often forgotten in migration plans: OLTP DB, replicas, caches, CDC, or analytics?
Answer: often CDC + downstream consumers (and caches).
"We use an ORM. Schema migrations are safe."
ORMs help generate DDL, but they don't automatically solve: lock behavior, phased rollouts, backfill strategy, or mixed-version compatibility.
What's one ORM behavior that can be dangerous during migration?
Auto-generated migrations might:
- add `NOT NULL` / `UNIQUE` / `FOREIGN KEY` constraints without a phased plan
- build indexes without `CONCURRENTLY` (Postgres)
- rely on `SELECT *` in critical queries

You publish OrderCreated events to Kafka. Multiple teams consume them.
You add currency and change amount semantics.
Is a DB migration enough?
No. In distributed systems, data contracts exist at multiple layers:
Diagram description (include in your visuals):
DB -> Service -> Event Stream -> Consumers -> Cache / Search / Warehouse.

If using Avro/Protobuf: run schema-registry compatibility checks (backward/forward) before publishing the changed schema.
Important Protobuf note: never reuse tag numbers; removing fields requires reserving tags.
Zero-downtime migrations require contract evolution across all data planes, not just the database.
Challenge question: If you change semantics (not just fields), what kind of compatibility is that?
Answer: it’s a semantic breaking change even if the schema is backward-compatible. You need versioning at the meaning level (new field name, new event version, or explicit amount_type).
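A sketch of versioning at the meaning level. The field names here (`schema_version`, `amount_minor`) are illustrative assumptions, not taken from the text.

```python
# Versioned OrderCreated events: the new semantics get a new field and an
# explicit version; the old field is kept so v1 consumers still work.
def make_order_created_v2(order_id, amount_minor, currency):
    return {
        "event": "OrderCreated",
        "schema_version": 2,          # explicit version for the new semantics
        "order_id": order_id,
        "amount_minor": amount_minor, # NEW field; meaning: minor units
        "currency": currency,
        "amount": amount_minor,       # OLD field kept for v1 consumers (USD cents)
    }

def consume(event):
    # A tolerant consumer: branch on the declared version, never guess meaning.
    if event.get("schema_version", 1) >= 2:
        return event["amount_minor"], event["currency"]
    return event["amount"], "USD"     # v1 semantics: implicit USD cents

evt = make_order_created_v2("o-1", 1250, "EUR")
assert consume(evt) == (1250, "EUR")
assert consume({"event": "OrderCreated", "order_id": "o-0", "amount": 500}) == (500, "USD")
```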
You store order_total in Redis for fast reads, and also in Elasticsearch for search.
You change the meaning of totals.
What's the risk if you backfill DB but forget Redis?
You get "split brain" at the application layer:
- Some reads return the old meaning (from cache) while others return the new (from DB).
- Mitigation: version your cache keys, e.g. `order:v1:{id}` -> `order:v2:{id}`.

Caches are schemas too, just implicit ones.
Challenge question: Why is "delete all cache keys" often not an acceptable migration plan?
Answer: it can cause cache stampedes, overload downstream systems, and create user-visible latency spikes.
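Versioned cache keys can be sketched with a dict standing in for Redis; old and new readers never collide because they address different keys, and v1 entries simply age out.

```python
# Versioned cache keys: the version lives in the key, so the two total
# semantics never mix, and no mass deletion (stampede risk) is needed.
cache = {}

def v1_key(order_id):
    return f"order:v1:{order_id}"   # value: pre-migration total semantics

def v2_key(order_id):
    return f"order:v2:{order_id}"   # value: new total semantics

cache[v1_key(42)] = {"total": 10.00}          # written by old code
cache[v2_key(42)] = {"total_minor": 1000}     # written by new code

assert v1_key(42) != v2_key(42)               # distinct namespaces
assert cache[v1_key(42)] != cache[v2_key(42)] # no silent mixing of meanings
```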
You've added new columns and deployed dual-write. Now you need to turn on "read-new" safely.
What's safer?
A) Flip read-new for all traffic at once
B) Ramp read-new gradually behind a flag, with fallback to the old path
B.
Treat flags as part of the migration protocol: they need safe defaults, a named owner, and a removal date after contract.
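A sketch of a percentage-rollout read-new flag with fallback. The flag store here is a plain dict and the flag name is made up; in production this would be your flag service.

```python
# Percentage rollout of "read-new" with a stable per-order bucket, so a
# given order consistently sees one path, and rollback is a flag change.
import zlib

FLAGS = {"orders.read_new_pct": 5}   # start at 5%, dial up gradually

def read_new_enabled(order_id):
    pct = FLAGS["orders.read_new_pct"]
    bucket = zlib.crc32(str(order_id).encode()) % 100   # stable per-order bucket
    return bucket < pct

def read_total(order_id, old_value, new_value):
    if read_new_enabled(order_id) and new_value is not None:
        return new_value          # new representation
    return old_value              # safe fallback

assert read_total(7, 500, None) == 500        # missing new data -> fallback
FLAGS["orders.read_new_pct"] = 0              # instant rollback lever: no deploy
assert read_total(7, 500, 999) == 500
FLAGS["orders.read_new_pct"] = 100
assert read_total(7, 500, 999) == 999
```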
You must backfill currency for 800M order lines.
Design a backfill loop. What do you page by?
A) `LIMIT/OFFSET` pagination
B) Primary key ranges
C) No paging: one giant UPDATE
Choose one.
Prefer primary key range (or another indexed, monotonic key). OFFSET becomes slower as you progress.
Key fixes vs the earlier naive version:
- Don't trust `cursor.rowcount` after multiple statements (it only reflects the last statement)
- Filter with `... WHERE new_col IS NULL AND id BETWEEN ...` so retries are idempotent
- Use `FOR UPDATE SKIP LOCKED` if multiple workers claim work from a queue/table

Backfills must be safe to retry and safe to run in parallel.
Challenge question: What does "exactly-once" mean for a backfill update, and do you actually need it?
Answer: exactly-once would mean each row is updated once and only once. For most backfills you don’t need exactly-once; you need idempotent at-least-once (safe retries) with correctness invariants.
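The fixes above can be combined into a runnable sketch of a restartable, idempotent backfill: page by primary key range, checkpoint after each batch, and guard the UPDATE so reruns are no-ops. sqlite3 stands in for the production database; table and job names are from the running example.

```python
# Restartable, idempotent backfill: PK-range batches + durable checkpoint.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (id INTEGER PRIMARY KEY, amount INTEGER, currency TEXT)")
conn.execute("CREATE TABLE backfill_checkpoint (job TEXT PRIMARY KEY, last_id INTEGER)")
conn.executemany("INSERT INTO order_lines (id, amount) VALUES (?, ?)",
                 [(i, i * 100) for i in range(1, 1001)])

JOB, BATCH = "currency-backfill-v1", 100

def run_backfill():
    row = conn.execute("SELECT last_id FROM backfill_checkpoint WHERE job = ?", (JOB,)).fetchone()
    last_id = row[0] if row else 0
    while True:
        with conn:
            # Idempotent: the IS NULL guard makes retries and overlaps harmless.
            conn.execute(
                "UPDATE order_lines SET currency = 'USD' "
                "WHERE currency IS NULL AND id > ? AND id <= ?",
                (last_id, last_id + BATCH))
            last_id += BATCH
            # Checkpoint commits atomically with the batch it describes.
            conn.execute("INSERT OR REPLACE INTO backfill_checkpoint VALUES (?, ?)",
                         (JOB, last_id))
        max_id = conn.execute("SELECT max(id) FROM order_lines").fetchone()[0]
        if last_id >= max_id:
            break
        # In production: sleep here (throttle) and export last_id as a metric.

run_backfill()
remaining = conn.execute(
    "SELECT count(*) FROM order_lines WHERE currency IS NULL").fetchone()[0]
assert remaining == 0
run_backfill()   # safe to rerun from the checkpoint: nothing double-applied
```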
Which statement is true?
1. `ADD COLUMN` is always a metadata-only, instant change.
2. `DROP COLUMN` is always instant and safe.
3. Renaming a column is safe because no data moves.
4. Even "safe" DDL must be rolled out as a coordinated change across clients.
Pause and think.
4 is reliably true.
1 and 2 depend on engine/version and may still take locks.
3 can break clients that use cached query plans, `SELECT *`, or ORM field-mapping assumptions.
The risk is not just the DB; it's every client's assumptions.
Challenge question: Why is `SELECT *` a migration hazard?
Answer: column order and presence can change; clients may deserialize positionally or assume a fixed set of fields.
"If something goes wrong, we'll roll back."
Rollbacks become hard when: old columns or tables have been dropped, new-format data has already been written, or semantics changed so old code would misinterpret the data.
What's the rollback plan for a semantic change?
Example: total_amount used to mean "pre-tax," now means "post-tax."
You may need:
- new explicit fields (`total_pre_tax`, `total_post_tax`)
- a way to recompute whichever representation is missing

Rollback might mean forward-fixing rather than reverting.
In data migrations, rollback is often another migration.
Challenge question: What must you preserve to make rollback possible: old schema, old data, or old semantics?
Answer: usually old semantics (or a way to recompute them). Schema alone is not enough.
You want to:
A. Rename `user_id` -> `customer_id`
B. Change `amount` from INT cents to DECIMAL
C. Change the meaning of an existing column

Which is easiest to do zero-downtime?
Often A looks easy but can still break clients; B can be expensive; C is the hardest because it affects correctness.
For the rename (A): add `customer_id`, dual-write `user_id` and `customer_id`, backfill, migrate readers, then drop `user_id` last.

Production fixes vs naive approach:
- `information_schema` can be slow; prefer `to_regclass` / `pg_attribute` in Postgres.
- Avoid in-place destructive transformations; prefer a parallel representation.
Challenge question: When would you still choose an in-place type change?
Answer: when the table is small, traffic is low, you can tolerate a maintenance window, or the DB provides a truly online type change with bounded locks—and you’ve tested it under production-like load.
Three services write to the same table:
- `checkout` writes orders
- `refunds` updates status and amounts
- `fraud` annotates risk fields

Who should drive the migration timeline?
A) Whichever team needs the change first
B) Each team independently
C) A single coordinated owner across all writers
C.
You need a single migration plan with: one owner, an ordered timeline for every writer, and explicit compatibility windows for each phase.
Multiple airlines (services) use the same runway (table). Maintenance requires a coordinated schedule, not unilateral changes.
Shared databases create organizational coupling. Zero-downtime migrations are as much coordination as engineering.
Challenge question: What artifact should exist for every migration: a ticket, a runbook, or a design doc?
Answer: all three, but minimally a design doc + runbook for anything that can impact availability/correctness.
You start a backfill and dual-write. How do you know it's safe?
Diagram description (include in your visuals):
A dashboard with: `replication_lag_seconds`, `lock_waits`, `p99_latency`, `backfill_rows_per_sec`, `backfill_checkpoint`, `mismatch_rate`.

What's the single most important correctness metric during dual-write?
A mismatch rate (or checksum discrepancy) between old and new representations.
You can't observe correctness indirectly via latency. You must instrument it.
Challenge question: Where do you compute mismatches: in the app, in SQL queries, or in offline jobs?
Answer: ideally all three at different cadences—fast signal in-app, authoritative checks in SQL/offline.
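The authoritative SQL-side check can be sketched like this (sqlite3 stands in for the primary; for USD legacy rows, the conversion is just the identity).

```python
# SQL-side mismatch check: count rows where the old and new representations
# disagree, and turn it into a mismatch-rate metric for the dashboard.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "total_amount INTEGER, total_amount_minor INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 500, 500), (2, 710, 710), (3, 999, 998)])  # row 3 diverged

total, mismatched = conn.execute("""
    SELECT count(*),
           sum(CASE WHEN total_amount_minor IS NULL
                      OR total_amount != total_amount_minor THEN 1 ELSE 0 END)
    FROM orders
""").fetchone()

mismatch_rate = mismatched / total
assert mismatched == 1
assert round(mismatch_rate, 3) == 0.333
# Alert on any nonzero rate during dual-write; investigate before cutover.
```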
You have orders_v2 shadow table fully backfilled.
Which is safer first: dark reads or canary?
Dark reads first: you validate correctness without user impact.
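A dark read can be sketched as a wrapper that always serves the old path, silently exercises the new one, and records mismatches as a metric. `read_old` and `read_new` are stand-ins for the real `orders` / `orders_v2` read paths.

```python
# Dark read: the user always gets the old answer; the new path is exercised
# silently so correctness problems surface as metrics, not incidents.
mismatch_count = 0

def read_old(order_id):
    return {"total_minor": 1000}        # stand-in for the orders read path

def read_new(order_id):
    return {"total_minor": 1000}        # stand-in for the orders_v2 read path

def get_order(order_id):
    global mismatch_count
    result = read_old(order_id)         # served to the user unconditionally
    try:
        shadow = read_new(order_id)     # dark read
        if shadow != result:
            mismatch_count += 1         # emit a metric in production
    except Exception:
        pass                            # new-path errors are observed, not served
    return result

assert get_order(1) == {"total_minor": 1000}
assert mismatch_count == 0
```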
Treat data cutovers like traffic routing: validate silently, then gradually expose.
Challenge question: What's your rollback lever during canary: routing, feature flag, or DB revert?
Answer: almost always routing/feature flag.
"Our replicas will catch up quickly; it's fine."
During migrations, replication lag can spike due to: large backfill transactions, index builds, and DDL that is expensive to replay on replicas.
Applications reading from replicas may see: columns that don't exist yet, or rows the backfill hasn't reached.
What's the safest read strategy during critical migration phases?
Often: route migration-critical reads to the primary, and gate each cutover step on replica lag staying below a threshold.
Replicas are not just "slower primaries"; they're different timelines.
Challenge question: How could you detect schema mismatch errors caused by replica lag?
Answer: track error signatures like column does not exist, correlate with replica host, and monitor lag metrics; add structured logs with DB endpoint identity.
Your orders table is sharded by customer_id. Some shards are busier.
You need to add a column and backfill.
What's different vs single-node? Each shard migrates on its own timeline, so schema state, backfill progress, and lag vary per shard, and your tooling must track them per shard.
Diagram description (include in your visuals):
Shards A..N, each labeled with its state: S1, S1+S2, or S2.

In sharded systems, the mixed-versions problem happens at the shard level too.
Challenge question: Would you rather migrate all shards at once or one shard at a time? Why?
Answer: usually one shard at a time (or small batches) to limit blast radius and learn from early shards.
You run MySQL with large tables. Some ALTER operations copy tables.
What's a common zero-downtime approach in MySQL ecosystems?
Tools/patterns like:
- `pt-online-schema-change`
- `gh-ost`

They create a shadow table, copy data, and apply changes via triggers/binlog.
Trigger-based copy under heavy write load can: amplify every write, fall behind the source, and accumulate drift before cutover.
Some databases require external tooling to approximate online DDL safely.
Challenge question: What's the main risk of trigger-based copy during heavy write load?
Answer: it amplifies write cost and can fall behind, creating lag and potentially inconsistent cutover if drift accumulates.
You dual-write amounts in old and new formats.
How do you prove correctness before contract?
- Checksums and counts, e.g. `COUNT(*) WHERE old != convert(new)` equal to zero
- Row-level comparison on samples and on recently written data

"Looks fine" is not a correctness proof. You need explicit invariants.
Challenge question: What's the trade-off between full-table checksums and sampling?
Answer: full checksums are expensive but comprehensive; sampling is cheap but can miss rare edge cases.
Match the migration goal to the best pattern.
Goals:
1. Add a required column safely
2. Change a table's primary key or partitioning
3. Migrate data across databases
4. Change a column's semantics
5. Add an index for a new query pattern
Patterns:
A. Add nullable + backfill + enforce constraint
B. Shadow table + cutover
C. Dual-write across systems + reconciliation
D. Parallel fields + versioned semantics
E. Online index build + canary query plan
Pause and think.
1->A, 2->B, 3->C, 4->D, 5->E
Different migration goals require different risk envelopes.
Challenge question: Which of these patterns is hardest to roll back cleanly?
Answer: cross-system dual-write (C) and semantic changes (D) are typically hardest.
You've been dual-writing for weeks. You want to drop old columns.
What must be true before contract?
Write a checklist.
Contract is where you pay down complexity—but it's also where you can permanently break rollback.
Challenge question: Why do teams often postpone contract forever, and what's the cost?
Answer: fear of breaking something + lack of ownership. Cost: schema bloat, ongoing dual-write bugs, higher storage/index costs, slower queries, and cognitive load.
"We'll keep old and new columns indefinitely. No need to drop."
Long-lived dual schemas cause: schema bloat, perpetual dual-write bugs, extra storage and index cost, slower queries, and cognitive load for every new engineer.
Contract is not optional; it's how you restore simplicity.
Challenge question: What governance process ensures contract actually happens?
Answer: a migration RFC/runbook with an explicit contract date, plus an owner and an SLO/OKR to remove deprecated schema.
You must change from orders.total_amount (INT cents) to:
- `orders.total_amount_minor` (BIGINT)
- `orders.currency` (TEXT)

And you must update events and caches.
Invariants to enforce:
- `total_amount_minor >= 0`
- `currency` in (supported set)
- `total_amount_minor == total_amount` for USD legacy orders (if applicable)
Add these as `NOT VALID` constraints first, then validate.

Diagram description (include in your visuals): the phase timeline across DB, services, events, caches, and ops.
A "simple" column change becomes a distributed choreography across DB, services, events, caches, and ops.
Challenge question: Which phase would you schedule during the lowest traffic period, and why?
Answer: backfill and constraint validation (and sometimes index builds) because they are IO-heavy and can increase lag.
You run a ride-sharing platform.
You need to migrate from:
trips table storing pickup_lat, pickup_lng as floatsto:
- `pickup_location` as a geospatial type

Constraints:
- heavy, constant write traffic to `trips`
- multiple teams consuming `TripCreated` events

Write a plan using Expand -> Migrate -> Contract.
Include: concrete steps in order, the invariants you'll verify at each step, and your rollback lever for each phase.
Pause and think. Draft it.
(Compare to your plan; adjust yours.)
Expand
- Add a nullable `pickup_location` column
- Build any needed index on `pickup_location` online

Deploy tolerant code (phase A)
- Write both representations (lat/lng and `pickup_location`)
- Read old (lat/lng) by default; add dark-read from new

Backfill
- Batch by primary key where `pickup_location IS NULL`, throttled and checkpointed

Validate correctness
Cutover reads: ramp read-new by percentage behind a flag; watch the mismatch rate.

Enforce constraints: add validity checks as `NOT VALID`, then validate once backfill is complete.

Contract: stop writing lat/lng, then drop the old columns after a deprecation window.

Rollback: before contract, flip the read flag back; dual-write keeps both representations current, so no data is lost.
A good migration plan is a protocol with verification, not a sequence of hopeful commands.
Final challenge question: If you could add only one thing to reduce risk, would you add (a) dark reads, (b) mismatch metrics, or (c) throttled backfill? Defend your choice.
Production answer: mismatch metrics. Dark reads and throttling help, but mismatch metrics are the earliest, most actionable signal that you’re corrupting data or diverging semantics.
If you implement only one habit: instrument correctness. Latency tells you you're slow; correctness tells you you're right.