System Design: Object Storage System

Design a general-purpose object storage system with strong consistency, multi-region replication, and support for exabyte-scale data. Covers bucket sharding, metadata indexing, and storage backend architecture.

Requirements

Functional Requirements:

  • PUT, GET, DELETE, LIST objects in named buckets
  • Objects from 1 byte to 10 TB; multipart upload for large objects
  • Bucket-level and object-level access control policies
  • Server-side encryption at rest
  • Configurable replication: local (single AZ), regional (3 AZ), cross-region
  • Object tagging and metadata for lifecycle management

Non-Functional Requirements:

  • 11 nines of durability for regional replication class
  • 4 nines of availability
  • Sub-50ms latency for small object (<1 MB) GET operations
  • Support 1 million QPS aggregate across operations
  • Horizontal scalability to exabyte capacity

Scale Estimation

With 1 exabyte of raw data and a 128 KB average object size, the system stores ~8 trillion objects. Object metadata (key, size, etag, ACL, tags) at 200 bytes per object comes to ~1.6 PB, and it must be indexed and searchable (LIST operations by prefix). At 1 million aggregate QPS, split roughly into 670,000 reads/sec, 300,000 writes/sec, and 30,000 deletes/sec, the metadata layer handles the full 1 million operations/sec, since every operation touches metadata. A metadata service cluster of 100 shards, each handling 10,000 ops/sec on an LSM-tree storage engine, meets this requirement.
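
A quick sanity check of these figures in runnable form (the constants simply restate the assumptions above):

    package main

    import "fmt"

    func main() {
        const (
            rawBytes   = 1e18  // 1 EB of raw data
            avgObjSize = 128e3 // 128 KB average object size
            metaBytes  = 200.0 // metadata bytes per object
            totalQPS   = 1e6   // aggregate operations/sec
            shards     = 100.0 // metadata shards
        )
        objects := rawBytes / avgObjSize
        fmt.Printf("objects: %.1e\n", objects)                    // ~7.8e12 (~8 trillion)
        fmt.Printf("metadata: %.2f PB\n", objects*metaBytes/1e15) // ~1.56 PB
        fmt.Printf("ops per shard: %.0f/sec\n", totalQPS/shards)  // 10,000/sec
    }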

High-Level Architecture

The system is a three-tier architecture: API gateway, metadata service, and storage engine. The API gateway handles authentication, request parsing, rate limiting, and routing. The metadata service manages the namespace (bucket → object key → object location mapping) using a distributed key-value store. The storage engine handles actual byte storage using a log-structured format, distributing object chunks across a fleet of storage nodes.

PUT flow: client sends PUT to the API gateway → gateway authenticates via the IAM service → gateway selects a placement group of storage nodes (consistent hashing on the object key) → gateway writes object data to the storage nodes in parallel (quorum write: a majority must ack) → gateway writes metadata (bucket, key, etag, locations) to the metadata service → returns 200 OK to the client. The metadata write is the final commit step; until it completes, the object is not readable.

GET flow: gateway reads metadata (key → chunk locations) → fetches chunks from storage nodes in parallel → streams the reassembled object to the client.
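
A minimal sketch of the PUT control flow in Go: a quorum write to the placement group followed by the metadata commit. The StorageNode and MetadataStore interfaces are illustrative stand-ins for the components above, not a prescribed API:

    package gateway

    import (
        "context"
        "errors"
    )

    // Illustrative collaborator interfaces; names are not from the design.
    type StorageNode interface {
        WriteChunk(ctx context.Context, data []byte) error
    }

    type MetadataStore interface {
        Put(ctx context.Context, bucket, key, etag string, locations []string) error
    }

    var ErrQuorumFailed = errors.New("quorum write failed")

    // PutObject writes object data to all placement-group nodes in parallel,
    // waits for a majority to ack, then commits by writing metadata. The
    // object becomes readable only after the metadata write succeeds.
    func PutObject(ctx context.Context, nodes []StorageNode, meta MetadataStore,
        bucket, key, etag string, data []byte, locations []string) error {

        acks := make(chan error, len(nodes)) // buffered: stragglers never block
        for _, n := range nodes {
            go func(n StorageNode) { acks <- n.WriteChunk(ctx, data) }(n)
        }
        ok, needed := 0, len(nodes)/2+1
        for i := 0; i < len(nodes) && ok < needed; i++ {
            if err := <-acks; err == nil {
                ok++
            }
        }
        if ok < needed {
            return ErrQuorumFailed
        }
        // Final commit step: until this returns, the object is not readable.
        return meta.Put(ctx, bucket, key, etag, locations)
    }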

The metadata service is built on a distributed key-value store with range query support (for LIST by prefix). B-tree or LSM-tree based engines (RocksDB, Apache Cassandra, or a purpose-built system like Bigtable) are appropriate, since sorted key storage natively supports scanning all keys with a given prefix. Large buckets (millions of keys) use pagination (continuation tokens based on the last-seen key) to bound the work done in any single LIST request.
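
A minimal sketch of prefix listing with continuation tokens, assuming a sorted-key cursor like the one an LSM- or B-tree engine exposes (the Iterator interface here is illustrative, not RocksDB's actual API):

    package metadata

    import "strings"

    // Iterator is an illustrative sorted-key cursor over the metadata store.
    type Iterator interface {
        Seek(key string)
        Valid() bool
        Key() string
        Next()
    }

    // ListPage returns up to maxKeys keys sharing prefix, resuming after the
    // continuation token (the last key returned by the previous page).
    func ListPage(it Iterator, prefix, token string, maxKeys int) (keys []string, next string) {
        start := prefix
        if token != "" {
            start = token + "\x00" // seek strictly past the last-seen key
        }
        for it.Seek(start); it.Valid() && len(keys) < maxKeys; it.Next() {
            k := it.Key()
            if !strings.HasPrefix(k, prefix) {
                break // iterator has left the prefix range
            }
            keys = append(keys, k)
        }
        if len(keys) == maxKeys {
            next = keys[len(keys)-1] // more results may remain
        }
        return keys, next
    }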

Core Components

Metadata Service

The metadata service stores (bucket_name, object_key) → (object_id, size, etag, storage_locations, acl_ref, tags, created_at, version_id). It is sharded by consistent hash of (bucket_name + object_key), distributing load across 100–1,000 metadata shards. Each shard is a RocksDB instance on an SSD-equipped server, with 3-way Raft replication so that each shard's data survives node failures. Within a shard, LIST executes a sorted range scan; because hash sharding scatters keys with a common prefix across shards, a LIST typically fans out to multiple shards (scatter-gather), merge-sorts the partial results, and returns a paginated, lexicographically ordered result set.
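
A minimal consistent-hash ring for shard selection might look like the following; the sketch omits virtual nodes, which a production ring would add for smoother balancing:

    package metadata

    import (
        "hash/fnv"
        "sort"
    )

    // Ring maps (bucket, key) to the shard owning that portion of the keyspace.
    type Ring struct {
        points []uint32          // sorted hash points on the ring
        owner  map[uint32]string // hash point -> shard ID
    }

    func hash32(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func NewRing(shardIDs []string) *Ring {
        r := &Ring{owner: make(map[uint32]string)}
        for _, id := range shardIDs {
            p := hash32(id)
            r.points = append(r.points, p)
            r.owner[p] = id
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // ShardFor returns the shard owning (bucket, key): the first ring point
    // at or after the key's hash, wrapping around to the ring's start.
    func (r *Ring) ShardFor(bucket, key string) string {
        h := hash32(bucket + "/" + key)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0
        }
        return r.owner[r.points[i]]
    }

Adding or removing a shard remaps only the keys between adjacent ring points, which is what keeps resharding incremental.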

Storage Engine

The storage engine uses a log-structured object store. Objects are written to append-only segment files on storage nodes. Each segment file holds many objects concatenated sequentially. An in-memory index maps (object_id, chunk_index) → (segment_file_id, offset, length). On GET, the storage node seeks to the offset in the segment file and streams the bytes. Compaction runs periodically to merge fragmented segments (caused by deletes leaving holes) and free up disk space. Erasure coding is applied at segment write time: a segment of 14 data stripes + 4 parity stripes is distributed across 18 storage nodes in different failure domains.
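
The append path and in-memory index can be sketched as follows; types and field names are illustrative:

    package storage

    import (
        "os"
        "sync"
    )

    // ChunkLoc records where a chunk lives inside an append-only segment file.
    type ChunkLoc struct {
        SegmentID uint64
        Offset    int64
        Length    int64
    }

    // Segment is one append-only file plus the in-memory index mapping
    // (objectID, chunkIndex) to a byte range within the file.
    type Segment struct {
        mu     sync.Mutex
        id     uint64
        file   *os.File
        offset int64
        index  map[[2]uint64]ChunkLoc
    }

    // Append concatenates a chunk at the current tail and records its location.
    func (s *Segment) Append(objectID, chunkIndex uint64, data []byte) (ChunkLoc, error) {
        s.mu.Lock()
        defer s.mu.Unlock()
        n, err := s.file.WriteAt(data, s.offset)
        if err != nil {
            return ChunkLoc{}, err
        }
        loc := ChunkLoc{SegmentID: s.id, Offset: s.offset, Length: int64(n)}
        s.index[[2]uint64{objectID, chunkIndex}] = loc
        s.offset += int64(n)
        return loc, nil
    }

    // Read serves a GET by seeking to the recorded offset in the segment file.
    func (s *Segment) Read(objectID, chunkIndex uint64) ([]byte, error) {
        s.mu.Lock()
        loc, ok := s.index[[2]uint64{objectID, chunkIndex}]
        s.mu.Unlock()
        if !ok {
            return nil, os.ErrNotExist
        }
        buf := make([]byte, loc.Length)
        _, err := s.file.ReadAt(buf, loc.Offset)
        return buf, err
    }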

Replication Modes

Three replication modes offer different durability/cost trade-offs. Single-AZ: 3x replication within one availability zone; cheapest but vulnerable to AZ outage. Multi-AZ (regional): erasure coding (14+4) across 3 AZs; 11 nines durability, survives 1 full AZ failure. Cross-region: asynchronous replication to a secondary region; RPO of ~1 minute; protects against full region outage at 2x storage cost. The replication mode is set per bucket and can be applied retroactively (triggers background replication jobs). Cross-region replication uses a replication log (Kafka-backed) to track and replay writes to the target region asynchronously.
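
The cross-region replay reduces to draining the replication log into the target region. The Log interface below abstracts the Kafka-backed log; it is an illustrative interface, not a real Kafka client API:

    package replication

    import "context"

    // Event is one committed write in the replication log.
    type Event struct {
        Bucket, Key string
        Op          string // "PUT" or "DELETE"
        ObjectID    uint64
    }

    // Log abstracts the Kafka-backed replication log.
    type Log interface {
        Next(ctx context.Context) (Event, error)   // blocks until an event arrives
        Commit(ctx context.Context, e Event) error // advances the consumer offset
    }

    // TargetRegion applies a replicated write in the secondary region.
    type TargetRegion interface {
        Apply(ctx context.Context, e Event) error
    }

    // Replay drains the log into the target region. The lag of this loop is
    // what bounds the cross-region RPO (~1 minute in this design).
    func Replay(ctx context.Context, log Log, target TargetRegion) error {
        for {
            e, err := log.Next(ctx)
            if err != nil {
                return err
            }
            if err := target.Apply(ctx, e); err != nil {
                return err // retry and backoff are omitted in this sketch
            }
            if err := log.Commit(ctx, e); err != nil {
                return err
            }
        }
    }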

Database Design

Metadata uses RocksDB (LSM-tree) on each shard server, optimized for write-heavy workloads (PUT-heavy object storage). Column families separate hot metadata (key, size, locations — frequently read) from cold metadata (full ACL, tags, user-defined headers — less frequently read). Bloom filters per SSTable file speed up point lookups for GET operations. Object version history is stored as separate records with a version_id suffix in the key, allowing efficient range scans for ListObjectVersions. A soft-delete mechanism uses a tombstone record (key with a delete_marker flag) for versioned deletes, enabling undelete within the retention period before garbage collection.
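
One plausible key layout for versioned objects and tombstones; the encoding is illustrative, not prescribed by the design:

    package metadata

    import "fmt"

    // versionKey inverts the version number so the newest version sorts
    // first, making ListObjectVersions a simple forward range scan.
    const maxVersion = ^uint64(0)

    func versionKey(bucket, key string, version uint64) string {
        return fmt.Sprintf("%s/%s#%020d", bucket, key, maxVersion-version)
    }

    // Record is the value stored per version; a versioned DELETE writes a
    // new record with DeleteMarker set (a tombstone) instead of removing
    // data, so the object can be undeleted until garbage collection runs.
    type Record struct {
        DeleteMarker bool
        ETag         string
        Size         int64
    }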

API Design
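
The external API mirrors the S3-style operations from the requirements. A representative REST surface (paths and query parameters are illustrative):

    PUT    /{bucket}/{key}                          upload object; returns ETag
    GET    /{bucket}/{key}                          download object; supports Range header
    DELETE /{bucket}/{key}                          delete; versioned buckets write a delete marker
    GET    /{bucket}?prefix=P&max-keys=N
           &continuation-token=T                    LIST with prefix filtering and pagination
    POST   /{bucket}/{key}?uploads                  initiate multipart upload
    PUT    /{bucket}/{key}?partNumber=N&uploadId=I  upload one part
    POST   /{bucket}/{key}?uploadId=I               complete multipart upload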

Scaling & Bottlenecks

LIST performance is the primary UX bottleneck. With millions of objects in a bucket, LIST requires efficient prefix range queries. RocksDB's sorted key iteration is fast (~1M keys/sec per shard), but a LIST spanning 10 shards with 1,000 keys each still requires 10 parallel shard scans plus a merge-sort. Pre-sharding by bucket name (pinning a bucket to a fixed shard group) simplifies LIST but creates hot-shard problems for popular buckets. A hybrid approach uses per-bucket range partitioning within a shard group, distributing LIST load across the group's nodes while keeping all of a bucket's keys co-located within that group.
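
The merge step of a scatter-gather LIST is a k-way merge of already-sorted shard results; a min-heap keeps each output key at log-of-shard-count cost. A sketch:

    package metadata

    import "container/heap"

    type item struct {
        key        string
        shard, pos int
    }

    type minHeap []item

    func (h minHeap) Len() int           { return len(h) }
    func (h minHeap) Less(i, j int) bool { return h[i].key < h[j].key }
    func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
    func (h *minHeap) Push(x any)        { *h = append(*h, x.(item)) }
    func (h *minHeap) Pop() any {
        old := *h
        n := len(old)
        x := old[n-1]
        *h = old[:n-1]
        return x
    }

    // mergeSorted combines per-shard sorted key lists into one ordered,
    // limit-bounded page, as the LIST coordinator does after fan-out.
    func mergeSorted(shardResults [][]string, limit int) []string {
        h := &minHeap{}
        for s, keys := range shardResults {
            if len(keys) > 0 {
                heap.Push(h, item{keys[0], s, 0})
            }
        }
        var out []string
        for h.Len() > 0 && len(out) < limit {
            it := heap.Pop(h).(item)
            out = append(out, it.key)
            if next := it.pos + 1; next < len(shardResults[it.shard]) {
                heap.Push(h, item{shardResults[it.shard][next], it.shard, next})
            }
        }
        return out
    }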

Write throughput scales linearly with storage nodes. The main bottleneck for writes is the quorum requirement: all 3 replicas (or a sufficient set of erasure-coded nodes) must acknowledge before the client gets a 200 OK. Within a data center, network latency (0.1–0.5 ms) dominates write latency; for cross-AZ writes (multi-AZ mode), inter-AZ latency (1–5 ms) is the bottleneck. An optimization is to ack the client once 14 of the 18 erasure-coded stripes are written (the minimum needed to reconstruct the data) and let the remaining writes complete in the background, cutting p99 write latency by not waiting on the slowest stripes, at the cost of briefly reduced redundancy until all 18 land.
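
A sketch of the early-ack write path; the write callback stands in for a per-node stripe write:

    package storage

    import (
        "context"
        "errors"
    )

    var errTooManyFailures = errors.New("erasure write: too many stripe failures")

    // writeStripes returns success once k of the n stripe writes have landed
    // (k = stripes needed to reconstruct) and lets the rest finish in the
    // background. Full redundancy is reached only when all n complete;
    // background repair handles any stragglers that ultimately fail.
    func writeStripes(ctx context.Context, stripes [][]byte, k int,
        write func(ctx context.Context, i int, data []byte) error) error {

        n := len(stripes)
        acks := make(chan error, n) // buffered: background writers never block
        for i, s := range stripes {
            go func(i int, s []byte) { acks <- write(ctx, i, s) }(i, s)
        }
        ok, failed := 0, 0
        for ok < k {
            if err := <-acks; err != nil {
                failed++
                if failed > n-k { // k successes are no longer possible
                    return errTooManyFailures
                }
            } else {
                ok++
            }
        }
        return nil // ack the client; remaining writes complete asynchronously
    }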

Key Trade-offs

  • Strong consistency vs. availability: Strong read-after-write consistency requires all metadata writes to be serialized through a Raft leader; relaxing to eventual consistency enables higher availability but risks stale reads immediately after PUT
  • Erasure coding vs. replication: Erasure coding is 1.3x storage overhead vs. 3x for replication, but requires reconstruction CPU for degraded reads and adds complexity to the storage engine
  • Flat namespace vs. hierarchical: Flat namespaces (bucket + key with / as delimiter) are simpler and scale better than true directories; hierarchical directories require distributed locking for rename operations
  • Metadata co-location vs. separation: Co-locating metadata with data (on the same storage nodes) simplifies architecture but couples metadata performance to data I/O; separate metadata service scales independently
