System Design: Event Sourcing System
Design an event sourcing system that stores state as an immutable sequence of domain events, enabling temporal queries, audit trails, event replay for projections, and eventual consistency across microservices.
Requirements
Functional Requirements:
- Persist domain events (immutable facts) as the system's source of truth rather than current state
- Reconstruct entity state by replaying its event sequence from the beginning or a snapshot
- Support temporal queries: what was the state of entity X at time T?
- Enable event replay to build new projections (read models) from historical events
- Guarantee event ordering: events for the same aggregate are totally ordered
- Support subscriptions: consumers receive new events in real time
Non-Functional Requirements:
- Append-only event writes under 5ms p99
- Reconstruct entity state from latest snapshot + subsequent events in under 50ms
- Retain all events indefinitely (events are never deleted — the log is the truth)
- Support 100,000 events/sec ingestion with 100 billion stored events
Scale Estimation
- Write throughput: 100,000 events/sec × 500 bytes average ≈ 50 MB/sec
- 10-year retention: 100,000 events/sec × ~31.5M seconds/year × 10 years × 500 bytes ≈ 16 PB. Practical mitigation: most queries only touch recent events, so cold events can be archived to object storage (e.g., S3 Glacier)
- Hot event store (last 90 days): 100,000 × 90 × 86,400 × 500 bytes ≈ 389 TB
- Aggregates: 10M entities × 1,000 events each on average = 10B event records
- Snapshotting every 100 events bounds replay I/O: reconstruct from the latest snapshot (~1 read) plus at most 99 subsequent events
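A quick sanity check of these figures; the constants are simply the assumptions listed above:

```python
# Back-of-the-envelope check of the scale estimates above.
EVENTS_PER_SEC = 100_000
AVG_EVENT_BYTES = 500
SECONDS_PER_YEAR = 31_500_000   # matches the ~31.5M figure used above

write_throughput = EVENTS_PER_SEC * AVG_EVENT_BYTES                      # bytes/sec
ten_year_bytes = EVENTS_PER_SEC * SECONDS_PER_YEAR * 10 * AVG_EVENT_BYTES
hot_store_bytes = EVENTS_PER_SEC * 90 * 86_400 * AVG_EVENT_BYTES          # last 90 days

print(f"write throughput: {write_throughput / 1e6:.0f} MB/sec")   # 50 MB/sec
print(f"10-year retention: {ten_year_bytes / 1e15:.0f} PB")       # ~16 PB
print(f"hot store (90 days): {hot_store_bytes / 1e12:.0f} TB")    # ~389 TB
```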
High-Level Architecture
The event store is an append-only log partitioned by aggregate ID. All writes for a given aggregate are routed to the same partition — guaranteeing ordering. Each event has a global sequence number (monotonic) and an aggregate-scoped sequence number (version). Optimistic concurrency control: a write specifying expected_version=5 fails if the aggregate has already reached version 6, preventing lost-update bugs in concurrent systems.
Projections (read models) are built by subscribing to the event stream. A projection is a stateful consumer that processes events in order and materializes query-optimized views (e.g., a SQL table of current account balances built from AccountDebited/AccountCredited events). Projections run asynchronously — they may lag behind the event stream by milliseconds to seconds. This is the fundamental CQRS split: writes go to the event store, reads come from pre-materialized projections.
The event store is separate from Kafka (though Kafka can serve as an event store for simpler use cases). A dedicated event store (EventStoreDB, Axon Server) provides features Kafka lacks: per-aggregate stream queries, optimistic concurrency control on writes, and built-in snapshotting. Kafka handles event fan-out to projection consumers.
Core Components
Aggregate Stream
Each aggregate (e.g., a bank account, an order) has its own stream: an ordered list of events keyed by aggregate_id. Streams are stored as partitioned append-only logs. A write to stream account-123 with expected_version=5 is a conditional append: if the stream's current head version != 5, the write fails (optimistic locking). This prevents two concurrent commands from both appending version 6 to the same stream. The expected_version check and the append are atomic — implemented as a SQL transaction (INSERT WHERE version = expected_version) or a compare-and-append in the event store's native API.
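A minimal sketch of this conditional append, assuming the PostgreSQL schema described under Database Design and the psycopg 3 driver; the function and exception names are illustrative, not a particular event store's API:

```python
import psycopg                          # assumption: psycopg 3 driver
from psycopg.errors import UniqueViolation
from psycopg.types.json import Jsonb

class ConcurrencyError(Exception):
    """The stream moved past expected_version before this append landed."""

def append_event(conn, aggregate_id, expected_version, event_type, payload):
    """Conditional append: write version expected_version + 1 only if the stream
    head is currently expected_version. The WHERE clause performs the version
    check; the UNIQUE (aggregate_id, version) index catches the race where two
    writers pass the check at the same time."""
    try:
        with conn.transaction():
            cur = conn.execute(
                """
                INSERT INTO events (aggregate_id, version, event_type, payload)
                SELECT %(id)s, %(expected)s + 1, %(type)s, %(payload)s
                WHERE COALESCE((SELECT MAX(version) FROM events
                                WHERE aggregate_id = %(id)s), 0) = %(expected)s
                """,
                {"id": aggregate_id, "expected": expected_version,
                 "type": event_type, "payload": Jsonb(payload)},
            )
            if cur.rowcount == 0:
                raise ConcurrencyError(
                    f"{aggregate_id} is not at version {expected_version}")
    except UniqueViolation as exc:
        raise ConcurrencyError(str(exc)) from exc
```

The command handler loads the aggregate, decides on new events, and calls this append with the version it loaded; on ConcurrencyError it reloads and retries.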
Snapshot Engine
Rebuilding state by replaying all events from the beginning becomes slow for long-lived aggregates. The snapshot engine periodically takes a state snapshot after every N events (configurable, typically 100-500). A snapshot is a serialized representation of the aggregate's current state stored alongside the event stream. On state reconstruction: load the latest snapshot (1 read), then replay only the events after the snapshot's version. Snapshots are stored in a separate snapshot table or stream, never in the event stream itself — snapshots are a performance optimization, not part of the event history.
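A reconstruction sketch under the same PostgreSQL assumptions; the snapshots table and the apply_event fold function are illustrative names:

```python
def load_aggregate(conn, aggregate_id, apply_event, initial_state):
    """Rebuild current state: load the latest snapshot (if any), then fold the
    events recorded after it. apply_event(state, event_type, payload) -> state
    is the aggregate's own transition function."""
    # Hypothetical snapshots table: (aggregate_id UUID, version INT, state JSONB).
    row = conn.execute(
        "SELECT version, state FROM snapshots "
        "WHERE aggregate_id = %s ORDER BY version DESC LIMIT 1",
        (aggregate_id,),
    ).fetchone()
    version, state = row if row else (0, initial_state)

    rows = conn.execute(
        "SELECT version, event_type, payload FROM events "
        "WHERE aggregate_id = %s AND version > %s ORDER BY version",
        (aggregate_id, version),
    ).fetchall()
    for version, event_type, payload in rows:
        state = apply_event(state, event_type, payload)
    return state, version
```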
Projection Builder
Projections subscribe to event streams via a push or pull mechanism. Each projection consumer tracks its position (the last processed event sequence number) in a checkpoint store and, on restart, resumes from that checkpoint. Projection logic: for each incoming event, apply a handler function that updates the projection's state (typically a relational table or a document store). Handlers must be idempotent: if the same event is processed twice because of a crash between processing and checkpointing, the projection must still converge to the correct state. A common technique, sketched below, is to write with upserts (INSERT ... ON CONFLICT DO UPDATE) and to record the applied event's sequence number on the projection row, so a replayed event is recognized and skipped.
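A pull-based consumer sketch under the same assumptions; the projection_checkpoints and account_balances tables are hypothetical, and the event types echo the AccountDebited/AccountCredited example above:

```python
def run_projection_once(conn, projection_name, handle_event, batch_size=500):
    """One catch-up pass: read events past the checkpoint, apply the handler,
    then advance the checkpoint. Assumes a seeded row per projection in a
    hypothetical projection_checkpoints(name, last_seq) table. A crash before
    the final commit simply replays the batch, which the handler tolerates."""
    last_seq = conn.execute(
        "SELECT last_seq FROM projection_checkpoints WHERE name = %s",
        (projection_name,),
    ).fetchone()[0]
    events = conn.execute(
        "SELECT id, event_type, payload FROM events "
        "WHERE id > %s ORDER BY id LIMIT %s",
        (last_seq, batch_size),
    ).fetchall()
    for seq, event_type, payload in events:
        handle_event(conn, seq, event_type, payload)
    if events:
        conn.execute(
            "UPDATE projection_checkpoints SET last_seq = %s WHERE name = %s",
            (events[-1][0], projection_name),
        )
    conn.commit()

def handle_balance_event(conn, seq, event_type, payload):
    """Idempotent handler for a hypothetical account_balances(account_id PK,
    balance, last_seq) table: the last_seq guard turns a replayed event into a
    no-op, so at-least-once delivery is safe."""
    if event_type not in ("AccountCredited", "AccountDebited"):
        return
    delta = payload["amount"] if event_type == "AccountCredited" else -payload["amount"]
    conn.execute(
        "INSERT INTO account_balances (account_id, balance, last_seq) "
        "VALUES (%s, %s, %s) "
        "ON CONFLICT (account_id) DO UPDATE "
        "SET balance = account_balances.balance + EXCLUDED.balance, "
        "    last_seq = EXCLUDED.last_seq "
        "WHERE account_balances.last_seq < EXCLUDED.last_seq",
        (payload["account_id"], delta, seq),
    )
```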
Database Design
The event store schema: events (id BIGSERIAL, aggregate_id UUID, aggregate_type VARCHAR, version INT, event_type VARCHAR, payload JSONB, metadata JSONB, occurred_at TIMESTAMP). Primary index: (aggregate_id, version) UNIQUE — enforces ordering and optimistic locking. Secondary index: (occurred_at) — for time-range queries and catching up projections after downtime. The JSONB payload stores the event data; metadata stores system fields (causation_id, correlation_id, user_id, tenant_id).
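The same schema as executable DDL, issued from Python for consistency with the other sketches; the NOT NULL constraints and index names are assumptions layered on the column list above:

```python
def create_event_store_schema(conn):
    """Create the events table plus the two indexes described above."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id             BIGSERIAL PRIMARY KEY,   -- global sequence number
            aggregate_id   UUID      NOT NULL,
            aggregate_type VARCHAR,
            version        INT       NOT NULL,      -- aggregate-scoped sequence
            event_type     VARCHAR   NOT NULL,
            payload        JSONB     NOT NULL,
            metadata       JSONB,                   -- causation/correlation/user/tenant ids
            occurred_at    TIMESTAMP NOT NULL DEFAULT now()
        )""")
    # Enforces per-aggregate ordering and backs the optimistic-locking check.
    conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS events_aggregate_id_version "
                 "ON events (aggregate_id, version)")
    # Time-range queries and projection catch-up after downtime.
    conn.execute("CREATE INDEX IF NOT EXISTS events_occurred_at "
                 "ON events (occurred_at)")
    conn.commit()
```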
The UNIQUE (aggregate_id, version) index is the final arbiter under concurrent writes to the same aggregate; PostgreSQL advisory locks or serializable transactions can additionally serialize writers up front and avoid wasted retries. At very high write rates (tens of thousands of events/sec), PostgreSQL's lock and index contention becomes a bottleneck; dedicated event store systems (EventStoreDB) use log-structured, append-optimized storage built for this workload.
API Design
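The operations the store needs to expose follow from the functional requirements. A minimal client-interface sketch; the names, signatures, and return shapes are assumptions, not a specific product's API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Iterable, Protocol

@dataclass
class Event:
    type: str
    data: dict[str, Any]
    version: int | None = None           # set by the store on append
    occurred_at: datetime | None = None

class EventStoreClient(Protocol):
    """Operations implied by the functional requirements; names are illustrative."""

    def append(self, aggregate_id: str, expected_version: int,
               events: list[Event]) -> int:
        """Conditional append; returns the new head version, raises on a version conflict."""

    def read_stream(self, aggregate_id: str, from_version: int = 0) -> list[Event]:
        """Ordered events for one aggregate, used for state reconstruction."""

    def read_state_at(self, aggregate_id: str, at: datetime) -> dict[str, Any]:
        """Temporal query: state rebuilt from events with occurred_at <= at."""

    def subscribe(self, after_sequence: int) -> Iterable[Event]:
        """Stream new events to a projection consumer, in global sequence order."""
```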
Scaling & Bottlenecks
Hot aggregates (a single entity receiving thousands of events/sec — e.g., a global counter) bottleneck on the single-stream serialization point. Mitigation: partition high-frequency aggregates into sub-streams (shard the aggregate by user cohort or time bucket) and merge sub-streams in the projection. This trades aggregate-level ordering guarantees for throughput.
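One way to realize the sub-stream split, as a sketch; the routing key and shard count are illustrative choices:

```python
import hashlib
from datetime import datetime, timezone

def substream_id(aggregate_id: str, routing_key: str, shard_count: int = 16) -> str:
    """Shard one hot aggregate into sub-streams by a per-event routing key
    (e.g. the acting user's id). Each sub-stream keeps its own ordering and
    version; the projection merges sub-streams back into a single view."""
    shard = int(hashlib.sha256(routing_key.encode()).hexdigest(), 16) % shard_count
    return f"{aggregate_id}:{shard}"

def time_bucket_substream_id(aggregate_id: str, occurred_at: datetime) -> str:
    """Alternative: bucket by hour, trading cross-bucket ordering for throughput."""
    return f"{aggregate_id}:{occurred_at.astimezone(timezone.utc):%Y%m%d%H}"
```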
Projection rebuild time for 100 billion historical events is prohibitive (days of replay). Mitigation: snapshot aggressively (every 10-50 events for frequently-rebuilt projections); partition the event stream by time and process time windows in parallel; pre-compute aggregate snapshots at regular intervals that serve as replay starting points for new projections.
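A sketch of window-parallel replay, assuming the projection's fold is associative so per-window partial results can be merged afterwards; build_partial and merge stand in for projection-specific logic:

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime, timedelta

def rebuild_projection(start: datetime, end: datetime, build_partial, merge,
                       window=timedelta(days=30), workers=8):
    """Replay history in parallel, one worker per time window, then merge the
    partial results. Only valid for projections whose fold is associative
    (sums, counts); order-sensitive projections must replay sequentially."""
    windows, t = [], start
    while t < end:
        windows.append((t, min(t + window, end)))   # (window_start, window_end)
        t += window
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(build_partial, windows))
    return merge(partials)
```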
Key Trade-offs
- Event granularity: Fine-grained events (one per field change) provide maximum audit detail but produce enormous event volumes and complex replays; coarse-grained events (one per command) are simpler but lose intermediate states
- Event store vs. Kafka as source of truth: Kafka is an excellent event bus but lacks per-aggregate optimistic concurrency and stream API semantics; a dedicated event store provides these at the cost of another system to operate
- Eventual vs. strong consistency for projections: Projections are eventually consistent by design — reads may return stale data while the projection catches up; applications must tolerate this or use read-your-own-writes techniques (read directly from the event store for immediate consistency)
- Schema evolution: Events are immutable and stored forever; changing event schemas requires either versioned event types (V1, V2 of the same event) or event upcasting (transform old events to the new schema on read), both of which add complexity (see the upcasting sketch below)
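To make the upcasting option concrete, a minimal sketch; the event name, schema versions, and field change are hypothetical:

```python
def upcast(event: dict) -> dict:
    """Upcast stored events to the newest schema on read. Hypothetical example:
    AccountDebited V1 stored a bare `amount`, V2 adds a `currency` field; old
    events are transformed in memory, never rewritten on disk."""
    if event["type"] == "AccountDebited" and event.get("schema_version", 1) == 1:
        data = dict(event["data"])
        data["currency"] = "USD"          # assumed default for legacy events
        return {**event, "schema_version": 2, "data": data}
    return event
```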