Courses 0%
33
Additional Database Concepts · Chapter 33 of 42

Database Sharding

Akhil
Akhil Sharma
20 min

Database Sharding

Splitting your database horizontally across multiple servers — the technique Instagram, Discord, and Uber use when a single database can't handle the load.

Database Sharding: Breaking Up Your Data for Massive Scale (Instagram's Growth Secret) 🎯 Challenge 1: The Phone Book Problem Imagine this scenario: You're managing a phone book for an entire country - 300 million people.

Single Book Approach (Traditional Database):

yaml

Sharded Approach (Distributed Database):

yaml

Pause and think: What if you could split your massive database across multiple servers, with each server handling only a portion of the data?

The Answer: Database Sharding splits your data horizontally across multiple databases! It's like: ✅ Each shard = independent database with subset of data ✅ Distribute load across multiple servers (no single bottleneck) ✅ Scale linearly (add shards = add capacity) ✅ Queries hit only relevant shards (faster lookups) ✅ Each shard can be optimized independently

Key Insight: Sharding trades simplicity for scalability - you get massive throughput but with added complexity!

🎬 Interactive Exercise: Vertical vs Horizontal Scaling

Before Sharding - Vertical Scaling (Bigger Server):

With Sharding - Horizontal Scaling (More Servers):

The Trade-off:

Real-world parallel: Vertical scaling is like building a taller building (expensive, has limits). Horizontal scaling is like building more buildings (cheaper, unlimited).

🏗️ Sharding Strategies: How to Split Your Data

Strategy 1: Range-Based Sharding

yaml

Strategy 2: Hash-Based Sharding

javascript

Strategy 3: Consistent Hashing

yaml

Strategy 4: Directory-Based Sharding

yaml

Code Example (Hash-based Sharding):

python

Real-world parallel:

  • Range-based = Library sections (A-F, G-M, N-Z)
  • Hash-based = Random assignment (ensures even distribution)
  • Consistent hashing = Circle of responsibility (easy to add/remove)
  • Directory = Phone book with map (flexible but need to check map)

🎮 Decision Game: Choose Your Sharding Key

Context: You're sharding different types of data. What should be the shard key?

Scenarios: A. Social media posts table (need to show user's posts) B. E-commerce orders table (need order history per user) C. Log events table (time-series data) D. Product catalog (need to search by category) E. Messages table (conversation threads) F. Analytics events (billions of events)

Options:

  1. user_id (group by user)
  2. created_at (group by time)
  3. category (group by type)
  4. conversation_id (group by thread)
  5. event_id (random distribution)

Think about: How is data accessed most often?

Answers:

The Golden Rules for Shard Key Selection:

🚨 Common Misconception: "Sharding Solves All Scaling Problems... Right?"

You might think: "Just shard my database and infinite scale!"

The Reality: Sharding introduces significant complexity!

Problems Sharding Creates:

Problem 1: Cross-Shard Queries

sql

Problem 2: Distributed Transactions

python

Problem 3: Auto-increment IDs Don't Work

Problem 4: Schema Changes

Problem 5: Resharding (Changing Number of Shards)

Real-world parallel: Sharding is like opening multiple branch offices:

  • More capacity (good!)
  • But coordination meetings are harder (cross-shard queries)
  • Transactions across branches are complex (distributed transactions)
  • Reorganizing offices is disruptive (resharding)

⚡ Sharding in Practice: Real-World Architecture

Instagram's Sharding Strategy:

Application Code with Sharding:

python

Vitess (MySQL Sharding Framework):

yaml

Real-world parallel: Sharding framework is like a postal system. You write address (user_id), postal system (Vitess) figures out which post office (shard) to route to.

🔧 Best Practices for Sharding

  1. Denormalize to Avoid Cross-Shard Queries:
sql
  1. Use Consistent Hashing or Add Shards Carefully:
python
  1. Monitor Shard Balance:
python
  1. Have a Resharding Plan:

💡 Final Synthesis Challenge: The Library System

Complete this comparison: "A single library building is simple but limited. A sharded database is like..."

Your answer should include:

  • Data distribution strategy
  • Query routing
  • Trade-offs and complexity
  • When to shard vs when not to

Take a moment to formulate your complete answer...

The Complete Picture: A sharded database is like a library system with multiple branches across the city:

Distribution: Books split across branches (by author, genre, or location) ✅ Capacity: Each branch manageable size, unlimited growth (add more branches) ✅ Parallel access: Multiple people can search simultaneously (different branches) ✅ Local optimization: Each branch optimized for its collection ✅ Routing: Need directory to know which branch has which books ✅ Complex queries: "Find all science books" requires checking all branches (slow) ✅ Coordination: Moving books between branches is expensive (resharding) ✅ Trade-off: More capacity but more complexity

When to shard:

  • Single database can't handle load (>10K qps)
  • Data too large for one server (>1TB)
  • Clear sharding key (user_id, tenant_id)
  • Mostly single-shard queries

When NOT to shard:

  • Can still scale vertically (< 1TB, < 10K qps)
  • Lots of cross-entity queries (JOINs everywhere)
  • No clear sharding key
  • Team too small (sharding requires expertise)

Real-world examples:

  • Instagram: Sharded by user_id (billions of users)
  • Slack: Sharded by workspace_id (millions of workspaces)
  • GitHub: Sharded by repository_id
  • Twitter: Custom sharding (Snowflake IDs)

Sharding transforms single-server limits into distributed scalability - but only when the complexity is worth it!

🎯 Quick Recap: Test Your Understanding Without looking back, can you explain:

  1. What's the difference between vertical and horizontal scaling?
  2. How does hash-based sharding differ from range-based?
  3. Why are cross-shard queries problematic?
  4. When should you shard your database?

Mental check: If you can design a sharding strategy, you understand database sharding!

🚀 Your Next Learning Adventure Now that you understand sharding, explore:

Advanced Sharding:

  • Vitess (MySQL sharding)
  • Citus (PostgreSQL sharding)
  • MongoDB sharding architecture
  • YugabyteDB distributed SQL

Sharding Challenges:

  • Resharding strategies
  • Cross-shard transactions
  • Global secondary indexes
  • Consistent hashing deep-dive

Related Concepts:

  • Database replication
  • Partitioning vs sharding
  • Multi-tenancy strategies
  • NewSQL databases

Real-World Case Studies:

  • How Discord scaled to billions of messages
  • Uber's sharding evolution
  • Pinterest's sharding journey
  • Shopify's multi-tenant sharding

Common Mistakes

MistakeWhy it's wrongCorrect approach
Sharding too earlyAdds massive complexity before it's neededExhaust vertical scaling, read replicas, and caching first
Choosing the wrong shard keyCauses hotspots or makes common queries cross-shardAnalyze query patterns and data distribution before choosing
Not planning for reshardingAdding shards later requires painful data migrationDesign for growth — use consistent hashing or a shard mapping table
Cross-shard joins in application codeExtremely slow and error-proneDenormalize data so related records live on the same shard
Ignoring shard-local indexesQueries without shard key hit every shard (scatter-gather)Always include the shard key in queries when possible

Key Takeaways

  1. Sharding splits data horizontally across multiple databases — each shard holds a subset of the data and operates independently
  2. Choose shard keys carefully — a bad key causes hotspots; a good key distributes load evenly and avoids cross-shard queries
  3. Range-based sharding enables range queries but risks hotspots — hash-based sharding distributes evenly but loses ordering
  4. Cross-shard queries are expensive — design your data model to keep related data on the same shard whenever possible
  5. Resharding is painful — plan your sharding strategy for growth, because migrating data between shards causes downtime
Chapter complete!

Course Complete!

You've finished all 42 chapters of

System Design Indermediate

Browse courses
Up next Master Slave Archiecture
Continue