Additional Database Concepts · Chapter 33 of 42

Database Sharding

Akhil Sharma

 20 min 

← → to navigate

Database Sharding

Splitting your database horizontally across multiple servers — the technique Instagram, Discord, and Uber use when a single database can't handle the load.

Database Sharding: Breaking Up Your Data for Massive Scale (Instagram's Growth Secret) 🎯 Challenge 1: The Phone Book Problem Imagine this scenario: You're managing a phone book for an entire country - 300 million people.

Single Book Approach (Traditional Database):

yaml

Sharded Approach (Distributed Database):

yaml

Pause and think: What if you could split your massive database across multiple servers, with each server handling only a portion of the data?

The Answer: Database Sharding splits your data horizontally across multiple databases! It's like: ✅ Each shard = independent database with subset of data ✅ Distribute load across multiple servers (no single bottleneck) ✅ Scale linearly (add shards = add capacity) ✅ Queries hit only relevant shards (faster lookups) ✅ Each shard can be optimized independently

Key Insight: Sharding trades simplicity for scalability - you get massive throughput but with added complexity!

🎬 Interactive Exercise: Vertical vs Horizontal Scaling

Before Sharding - Vertical Scaling (Bigger Server):

With Sharding - Horizontal Scaling (More Servers):

The Trade-off:

Real-world parallel: Vertical scaling is like building a taller building (expensive, has limits). Horizontal scaling is like building more buildings (cheaper, unlimited).

🏗️ Sharding Strategies: How to Split Your Data

Strategy 1: Range-Based Sharding

yaml

Strategy 2: Hash-Based Sharding

javascript

Strategy 3: Consistent Hashing

yaml

Strategy 4: Directory-Based Sharding

yaml

Code Example (Hash-based Sharding):

python

Real-world parallel:

Range-based = Library sections (A-F, G-M, N-Z)
Hash-based = Random assignment (ensures even distribution)
Consistent hashing = Circle of responsibility (easy to add/remove)
Directory = Phone book with map (flexible but need to check map)

🎮 Decision Game: Choose Your Sharding Key

Context: You're sharding different types of data. What should be the shard key?

Scenarios: A. Social media posts table (need to show user's posts) B. E-commerce orders table (need order history per user) C. Log events table (time-series data) D. Product catalog (need to search by category) E. Messages table (conversation threads) F. Analytics events (billions of events)

Options:

user_id (group by user)
created_at (group by time)
category (group by type)
conversation_id (group by thread)
event_id (random distribution)

Think about: How is data accessed most often?

Answers:

The Golden Rules for Shard Key Selection:

🚨 Common Misconception: "Sharding Solves All Scaling Problems... Right?"

You might think: "Just shard my database and infinite scale!"

The Reality: Sharding introduces significant complexity!

Problems Sharding Creates:

Problem 1: Cross-Shard Queries

sql

Problem 2: Distributed Transactions

python

Problem 3: Auto-increment IDs Don't Work

Problem 4: Schema Changes

Problem 5: Resharding (Changing Number of Shards)

Real-world parallel: Sharding is like opening multiple branch offices:

More capacity (good!)
But coordination meetings are harder (cross-shard queries)
Transactions across branches are complex (distributed transactions)
Reorganizing offices is disruptive (resharding)

⚡ Sharding in Practice: Real-World Architecture

Instagram's Sharding Strategy:

Application Code with Sharding:

python

Vitess (MySQL Sharding Framework):

yaml

Real-world parallel: Sharding framework is like a postal system. You write address (user_id), postal system (Vitess) figures out which post office (shard) to route to.

🔧 Best Practices for Sharding

Denormalize to Avoid Cross-Shard Queries:

sql

Use Consistent Hashing or Add Shards Carefully:

python

Monitor Shard Balance:

python

Have a Resharding Plan:

💡 Final Synthesis Challenge: The Library System

Complete this comparison: "A single library building is simple but limited. A sharded database is like..."

Your answer should include:

Data distribution strategy
Query routing
Trade-offs and complexity
When to shard vs when not to

Take a moment to formulate your complete answer...

The Complete Picture: A sharded database is like a library system with multiple branches across the city:

✅ Distribution: Books split across branches (by author, genre, or location) ✅ Capacity: Each branch manageable size, unlimited growth (add more branches) ✅ Parallel access: Multiple people can search simultaneously (different branches) ✅ Local optimization: Each branch optimized for its collection ✅ Routing: Need directory to know which branch has which books ✅ Complex queries: "Find all science books" requires checking all branches (slow) ✅ Coordination: Moving books between branches is expensive (resharding) ✅ Trade-off: More capacity but more complexity

When to shard:

Single database can't handle load (>10K qps)
Data too large for one server (>1TB)
Clear sharding key (user_id, tenant_id)
Mostly single-shard queries

When NOT to shard:

Can still scale vertically (< 1TB, < 10K qps)
Lots of cross-entity queries (JOINs everywhere)
No clear sharding key
Team too small (sharding requires expertise)

Real-world examples:

Instagram: Sharded by user_id (billions of users)
Slack: Sharded by workspace_id (millions of workspaces)
GitHub: Sharded by repository_id
Twitter: Custom sharding (Snowflake IDs)

Sharding transforms single-server limits into distributed scalability - but only when the complexity is worth it!

🎯 Quick Recap: Test Your Understanding Without looking back, can you explain:

What's the difference between vertical and horizontal scaling?
How does hash-based sharding differ from range-based?
Why are cross-shard queries problematic?
When should you shard your database?

Mental check: If you can design a sharding strategy, you understand database sharding!

🚀 Your Next Learning Adventure Now that you understand sharding, explore:

Advanced Sharding:

Vitess (MySQL sharding)
Citus (PostgreSQL sharding)
MongoDB sharding architecture
YugabyteDB distributed SQL

Sharding Challenges:

Resharding strategies
Cross-shard transactions
Global secondary indexes
Consistent hashing deep-dive

Related Concepts:

Database replication
Partitioning vs sharding
Multi-tenancy strategies
NewSQL databases

Real-World Case Studies:

How Discord scaled to billions of messages
Uber's sharding evolution
Pinterest's sharding journey
Shopify's multi-tenant sharding

Common Mistakes

Mistake	Why it's wrong	Correct approach
Sharding too early	Adds massive complexity before it's needed	Exhaust vertical scaling, read replicas, and caching first
Choosing the wrong shard key	Causes hotspots or makes common queries cross-shard	Analyze query patterns and data distribution before choosing
Not planning for resharding	Adding shards later requires painful data migration	Design for growth — use consistent hashing or a shard mapping table
Cross-shard joins in application code	Extremely slow and error-prone	Denormalize data so related records live on the same shard
Ignoring shard-local indexes	Queries without shard key hit every shard (scatter-gather)	Always include the shard key in queries when possible

Key Takeaways

Sharding splits data horizontally across multiple databases — each shard holds a subset of the data and operates independently
Choose shard keys carefully — a bad key causes hotspots; a good key distributes load evenly and avoids cross-shard queries
Range-based sharding enables range queries but risks hotspots — hash-based sharding distributes evenly but loses ordering
Cross-shard queries are expensive — design your data model to keep related data on the same shard whenever possible
Resharding is painful — plan your sharding strategy for growth, because migrating data between shards causes downtime

Previous Heartbeats and Healthchecks Up next Master Slave Archiecture

Chapter complete!

Up next Master Slave Archiecture

Continue

Database Sharding

Database Sharding

Common Mistakes

Key Takeaways

Course Complete!