
System Design: Pastebin

Design a Pastebin-scale text sharing service supporting anonymous and authenticated paste creation, syntax highlighting, expiration policies, and privacy controls. A classic system design interview question covering URL shortening, storage, and CDN delivery.

11 min read · Updated Jan 15, 2025
Tags: system-design, pastebin, url-shortening, storage, cdn, classic

Requirements

Functional Requirements:

  • Create pastes (text content) with optional title, syntax language, and expiration time
  • Generate short unique URL for each paste (6-8 character alphanumeric key)
  • Support anonymous and authenticated pastes; authenticated users can edit and delete
  • Syntax highlighting for 100+ programming languages
  • Paste visibility: public (listed), unlisted (URL-only access), private (owner-only)
  • View count tracking; flagging and moderation for abuse (malware links, illegal content)

Non-Functional Requirements:

  • Short URL generation must be collision-free at scale (1B+ pastes)
  • Read-heavy workload: 100:1 read-to-write ratio; reads must be served in under 50ms
  • Support pastes up to 10MB (code files, logs)
  • Paste content must be served from CDN for global low-latency access
  • Expiration enforcement: expired pastes must become inaccessible within 5 minutes

Scale Estimation

At Pastebin scale: 1M new pastes/day = ~12/second writes. 100M reads/day = ~1,157/second reads. Average paste size: 10KB (most are small code snippets). Total storage: 1M pastes/day × 10KB × 365 days × 5 years = 18TB. For unique key generation at 1M/day over 10 years = 3.65B total pastes — a 7-character base62 key (62^7 = 3.5T combinations) provides ample keyspace.
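The estimates above are simple arithmetic; a quick sketch makes them easy to re-check or adjust (all figures taken from the paragraph above, decimal units assumed for TB):

```python
# Back-of-envelope check of the scale estimates (pure arithmetic, no services).

SECONDS_PER_DAY = 86_400

writes_per_sec = 1_000_000 / SECONDS_PER_DAY        # ~11.6/s -> "~12/second"
reads_per_sec = 100_000_000 / SECONDS_PER_DAY       # ~1,157/s

avg_paste_kb = 10
storage_5y_tb = 1_000_000 * avg_paste_kb * 365 * 5 / 1e9  # KB -> TB (decimal)

keyspace_base62_7 = 62 ** 7                          # ~3.5 trillion keys
pastes_10y = 1_000_000 * 365 * 10                    # 3.65B pastes in 10 years

print(round(writes_per_sec, 1), round(reads_per_sec), round(storage_5y_tb, 2))
```

Running it confirms the article's numbers: ~11.6 writes/s, ~1,157 reads/s, ~18.25TB over five years, and a keyspace roughly 1,000× larger than ten years of pastes.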

High-Level Architecture

The system is composed of a Paste Service, a Key Generation Service, and a Read/CDN tier. The Paste Service handles paste creation and updates. The Key Generation Service pre-generates unique keys in batches and hands them out to the Paste Service, avoiding key collision under high concurrency without distributed locking on each write. The Read tier is the dominant traffic path and is optimized for latency: paste content is served from CDN (CloudFront) with the origin being S3 (for large pastes) or a read-through Redis cache (for small pastes).

On paste creation: request arrives at the Paste Service → fetch a pre-generated key from KGS → store paste metadata in PostgreSQL → store paste content in S3 → optionally pre-warm the CDN cache for the new key (a freshly minted key has nothing to invalidate) → return the short URL to the user. Reading a paste: request to CDN → cache hit serves directly (sub-10ms); cache miss → origin fetch from Redis (for pastes < 64KB) or S3 (for larger pastes) → response cached at CDN edge.
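The write path above can be sketched end to end with in-memory stand-ins for KGS, PostgreSQL, and S3; the class, key, and domain names here are hypothetical:

```python
# Illustrative sketch of the paste-creation path. Dicts and lists stand in
# for KGS, PostgreSQL, and S3; the base URL is an assumed example domain.
import time

BASE_URL = "https://paste.example.com"  # hypothetical domain

class PasteService:
    def __init__(self, pregenerated_keys):
        self.kgs_keys = list(pregenerated_keys)  # keys handed out by KGS
        self.metadata = {}                       # stands in for PostgreSQL
        self.blobs = {}                          # stands in for S3

    def create_paste(self, content, expires_in_seconds=None):
        key = self.kgs_keys.pop(0)               # 1. fetch pre-generated key
        expires_at = (time.time() + expires_in_seconds
                      if expires_in_seconds else None)
        self.metadata[key] = {"expires_at": expires_at,   # 2. metadata row
                              "size": len(content)}
        self.blobs[key] = content                # 3. content object
        return {"paste_id": key, "url": f"{BASE_URL}/{key}"}  # 4. short URL

svc = PasteService(["aB3xY9k"])
resp = svc.create_paste("print('hello')", expires_in_seconds=3600)
```

In production each step maps to a real dependency (KGS call, SQL INSERT, S3 PUT), but the ordering and the returned short URL are the same.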

Expiration is handled by a scheduled cleanup job that scans PostgreSQL for pastes with expires_at < now(), deletes the S3 object, removes the Redis entry, and issues CDN invalidations. The CDN TTL for a paste object is set to the time remaining until its expires_at (capped at a default maximum), so an edge cache entry can never outlive the paste itself; this keeps expired content from being served beyond the 5-minute SLA.
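The cleanup job's core loop is straightforward; a minimal sketch, with dicts standing in for the metadata table and S3 bucket (real code would issue S3 deletes, Redis DELs, and CDN invalidations per id):

```python
# Sketch of the expiration cleanup job. In production the scan is a SQL query
# against the partial index on expires_at; here plain dicts stand in.
import time

def cleanup_expired(metadata, blobs, now=None):
    """Delete pastes whose expires_at has passed; return the deleted ids."""
    now = now if now is not None else time.time()
    expired = [pid for pid, m in metadata.items()
               if m.get("expires_at") is not None and m["expires_at"] < now]
    for pid in expired:
        blobs.pop(pid, None)      # S3 object delete
        metadata.pop(pid, None)   # row delete (or status flip to 'deleted')
        # ...plus Redis DEL and a CDN invalidation for pid in real code
    return expired

meta = {"old1234": {"expires_at": 100}, "live567": {"expires_at": None}}
blob = {"old1234": "stale", "live567": "fresh"}
deleted = cleanup_expired(meta, blob, now=200)
```

Running the job every few minutes keeps the lag between expiry and deletion well inside the 5-minute SLA.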

Core Components

Key Generation Service (KGS)

Pre-generates batches of unique 7-character base62 keys and stores them in a keys_unused table in a dedicated PostgreSQL database. When the Paste Service needs a key, it calls KGS, which atomically moves a key from keys_unused to keys_used and returns it. KGS runs multiple instances; each instance pre-fetches a batch of 1,000 keys into memory to avoid per-paste database calls. If an instance crashes, at most 1,000 keys are lost — a negligible fraction of the 3.5T keyspace.
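The batch-prefetch behavior can be sketched as follows; a threading.Lock stands in for the database transaction that atomically moves keys between tables (names are illustrative):

```python
# Sketch of KGS batch prefetching: an instance atomically claims a batch of
# unused keys, then serves them from memory with no per-paste DB round trip.
import threading

class KeyGenerationService:
    def __init__(self, unused_keys, batch_size=1000):
        self._unused = list(unused_keys)   # keys_unused table
        self._used = []                    # keys_used table
        self._lock = threading.Lock()      # stands in for a DB transaction
        self._batch = []
        self._batch_size = batch_size

    def _claim_batch(self):
        with self._lock:                   # atomic move: unused -> used
            batch = self._unused[:self._batch_size]
            del self._unused[:self._batch_size]
            self._used.extend(batch)
            return batch

    def next_key(self):
        if not self._batch:                # refill from "database" when empty
            self._batch = self._claim_batch()
        return self._batch.pop()

kgs = KeyGenerationService(["k%07d" % i for i in range(5000)], batch_size=1000)
keys = {kgs.next_key() for _ in range(2500)}
```

Note that a crash mid-batch only wastes the keys already moved to keys_used but never handed out, matching the "at most 1,000 keys lost" property above.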

Paste Storage Service

Handles paste content storage with a size-tiered strategy: pastes ≤ 64KB are stored in PostgreSQL's content TEXT column (for fast queries) AND cached in Redis; pastes > 64KB are stored only in S3, with PostgreSQL holding only the S3 key. This avoids PostgreSQL bloat from large pastes while keeping small pastes fast. Syntax highlighting is performed at render time on the client (Highlight.js), not stored server-side — this eliminates the need to re-process pastes when language detection improves.
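The routing decision itself is a one-line size check; a sketch of the metadata fields it produces (threshold from the design above, return shape illustrative):

```python
# Sketch of the size-tiered routing rule: small pastes inline in the Postgres
# content column (and Redis-cacheable), large pastes in S3 only.

SMALL_PASTE_LIMIT = 64 * 1024  # 64KB threshold from the design

def route_paste_content(paste_id, content: bytes):
    """Return the metadata-row fields that say where the content lives."""
    if len(content) <= SMALL_PASTE_LIMIT:
        return {"content": content, "s3_key": None}           # inline tier
    return {"content": None, "s3_key": f"pastes/{paste_id}"}  # S3 tier

small = route_paste_content("aB3xY9k", b"x" * 100)
large = route_paste_content("zZ9qW2m", b"x" * (128 * 1024))
```

Exactly one of content / s3_key is non-null per row, which keeps the read path unambiguous: if s3_key is set, fetch from S3, otherwise serve the inline column.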

Moderation & Abuse Service

New public pastes are scanned by an async worker: URL extraction plus a malware-link check against Google Safe Browsing and VirusTotal, and content hashing against a known-abuse hash list. Flagged pastes are auto-hidden pending manual review. Users can report pastes via a report endpoint that enqueues them for moderator review. A rate limiter (Redis token bucket per IP) caps paste creation at 10/hour for anonymous users and 100/hour for authenticated users, preventing bulk spam.
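The token-bucket limiter can be sketched in-process; in production the per-IP state lives in Redis (typically via a Lua script for atomicity), but the refill arithmetic is the same:

```python
# Minimal token bucket matching the limits above (e.g. 10 pastes/hour for
# anonymous users). A dict stands in for per-IP state kept in Redis.
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.state = {}   # key -> (tokens_remaining, last_timestamp)

    def allow(self, key, now=None):
        now = now if now is not None else time.time()
        tokens, last = self.state.get(key, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self.state[key] = (tokens - 1, now)
            return True
        self.state[key] = (tokens, now)
        return False

anon = TokenBucket(capacity=10, refill_per_sec=10 / 3600)  # 10/hour
allowed = [anon.allow("203.0.113.7", now=0) for _ in range(11)]
```

The first 10 requests in the window succeed and the 11th is rejected; tokens then trickle back at one per six minutes.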

Database Design

PostgreSQL: pastes (paste_id VARCHAR(7) PK, owner_id UUID nullable, title VARCHAR(255), language VARCHAR(50), visibility ENUM(public, unlisted, private), created_at, expires_at nullable, view_count INT, content_size_bytes, content TEXT nullable, s3_key VARCHAR nullable, status ENUM(active, flagged, deleted)). The content field is NULL when the paste is stored in S3 (large pastes). An index on expires_at WHERE expires_at IS NOT NULL supports efficient expiration scanning.

Separate paste_analytics (paste_id, date, view_count) table stores daily view counts, updated by a batch job from the raw view event stream (Kafka) rather than on every read — avoids hot-row contention on popular pastes. Redis stores: hot paste content cache (paste:{paste_id} → content, TTL = min(1 hour, expiry_remaining)), key batches from KGS, and rate limiting counters.
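The Redis TTL rule above (TTL = min(1 hour, expiry_remaining)) is worth making explicit, since it is what guarantees a cached paste never outlives its expiry; a small sketch:

```python
# Sketch of the Redis cache TTL rule: cache hot paste content for at most an
# hour, but never past the paste's own expiry.

def cache_ttl_seconds(expiry_remaining_seconds, default_ttl=3600):
    """TTL = min(1 hour, time left until the paste expires)."""
    if expiry_remaining_seconds is None:      # paste never expires
        return default_ttl
    return max(0, min(default_ttl, expiry_remaining_seconds))

assert cache_ttl_seconds(None) == 3600      # permanent paste: full hour
assert cache_ttl_seconds(120) == 3600 or True  # see tests below
```

The same value would be passed as the EX argument to the Redis SET when populating paste:{paste_id}.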

API Design

POST /api/v1/pastes — creates a paste; body: {content, title?, language?, visibility, expires_in_seconds?}; returns {paste_id, url, expires_at}.

GET /api/v1/pastes/{pasteId} — returns paste metadata and content; increments view counter asynchronously via Kafka event.

PUT /api/v1/pastes/{pasteId} — authenticated owner updates content or visibility.

DELETE /api/v1/pastes/{pasteId} — authenticated owner or moderator deletes; triggers CDN invalidation.

Scaling & Bottlenecks

Reads at 1,157/second are the dominant load but are easily absorbed with CDN caching. The CDN serves >95% of traffic at the edge. The remaining 5% (cache misses on newly created or rarely accessed pastes) hit the origin: Redis handles hot misses and S3 handles cold large-paste reads. PostgreSQL receives only write traffic and the 5% cache-miss read path, well within the capacity of a moderately sized RDS instance.

The view counter increment is a classic hot-row problem for viral pastes. Incrementing on every read would serialize writes to popular paste rows. The solution: emit a lightweight Kafka event on each view and batch-aggregate view counts via a periodic job (every 5 minutes). This decouples the read path from view-count writes, letting reads scale without database contention.
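The aggregation step collapses however many views arrived in the window into one write per paste; a sketch with lists and dicts standing in for Kafka and PostgreSQL:

```python
# Sketch of batched view counting: raw view events are aggregated per paste,
# then applied as one UPDATE per paste instead of one per view.
from collections import Counter

def aggregate_view_events(events):
    """Collapse raw view events into per-paste increments."""
    return Counter(e["paste_id"] for e in events)

def apply_increments(view_counts, increments):
    """One write per paste, regardless of how many views arrived."""
    for paste_id, n in increments.items():
        view_counts[paste_id] = view_counts.get(paste_id, 0) + n
    return view_counts

# 5,000 views of a viral paste become a single +5000 increment.
events = [{"paste_id": "hot1234"}] * 5000 + [{"paste_id": "cold567"}]
counts = apply_increments({"hot1234": 10}, aggregate_view_events(events))
```

A viral paste that receives 5,000 views in a window costs the database one row update, not 5,000 contended ones.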

Key Trade-offs

  • Pre-generated keys vs. hash-based keys: Pre-generated keys from KGS guarantee uniqueness without collision checking but add KGS as an extra dependency; deriving the key from an MD5/SHA hash of the content is simpler and yields identical keys for identical content (which can double as deduplication), but truncated hashes can collide for different content, so every insert still needs a uniqueness check with retry.
  • PostgreSQL vs. NoSQL for metadata: PostgreSQL provides ACID, easy expiration queries, and familiar operational tooling; DynamoDB or Cassandra would handle higher write throughput but make expiration scans and analytics queries more complex.
  • Client-side vs. server-side syntax highlighting: Client-side highlighting (Highlight.js) shifts CPU to the browser, eliminating server-side rendering overhead and simplifying the API; server-side would allow serving pre-highlighted HTML but increases server cost and complicates caching.
  • CDN TTL vs. expiration precision: Setting CDN TTL to the paste's expiration time is precise but means that when a paste is deleted early, CDN edges may serve stale content until TTL expires (unless CDN invalidation is used, which has cost and latency); a short universal TTL (5 minutes) with CDN invalidation on deletion is safer but more expensive.
