
System Design: Audit Logging System

Design a tamper-evident, high-throughput audit logging system that captures security-relevant events across all services for compliance, forensic investigation, and real-time threat detection. Covers immutable storage, structured log schema, SIEM integration, and retention policies.

13 min read · Updated Jan 15, 2025
system-design · audit-logging · compliance · siem · security

Requirements

Functional Requirements:

  • Capture security-relevant events: authentication, authorization decisions, data access, configuration changes, and admin actions
  • Guarantee tamper evidence: audit logs must be cryptographically verifiable as unmodified
  • Structured log schema: each event has actor, action, resource, outcome, timestamp, and request context
  • Support high-volume ingestion from hundreds of services with at-least-once delivery guarantees
  • Enable fast search: find all events for a specific user or resource within a 30-day window in under 5 seconds
  • Integrate with SIEM (Splunk, Elastic SIEM) and alerting systems for real-time threat detection

Non-Functional Requirements:

  • Ingestion throughput: 1 million audit events per second
  • Search latency: 95th percentile under 5 seconds for time-range queries over 90 days of data
  • Immutable storage: once written, audit logs must not be modifiable or deletable before their retention period
  • Retention: 7 years for financial and healthcare events (SOC 2, HIPAA); 2 years for general security events
  • 99.99% availability for the ingestion pipeline; temporary ingestion failures must not lose events

Scale Estimation

  • Ingestion: 1 million events/second × 500 bytes average = 500 MB/s, or ~43.2 TB/day.
  • Cold archive (7 years): 43.2 TB/day × 365 × 7 ≈ 110 PB raw; with 10x compression (structured JSON → compressed columnar), ~11 PB.
  • Hot storage (90 days, for fast search): 43.2 TB/day × 90 ≈ 3.9 PB raw, or ~390 TB compressed.
  • Tiering: Elasticsearch hot tier for fast search over the last 90 days; S3 with Glacier for cold archival beyond that.
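The estimates above can be verified with a quick back-of-envelope calculation (decimal units, i.e. 1 TB = 10^12 bytes):

```python
SECONDS_PER_DAY = 86_400
events_per_sec = 1_000_000
bytes_per_event = 500
compression = 10  # structured JSON -> compressed columnar

ingest_bps = events_per_sec * bytes_per_event      # 500 MB/s
per_day = ingest_bps * SECONDS_PER_DAY             # ~43.2 TB/day
raw_7y = per_day * 365 * 7                         # ~110 PB raw
cold_7y = raw_7y / compression                     # ~11 PB compressed
hot_90d_raw = per_day * 90                         # ~3.9 PB raw
hot_90d = hot_90d_raw / compression                # ~390 TB compressed

print(f"per day:  {per_day / 1e12:.1f} TB")
print(f"7y cold:  {raw_7y / 1e15:.0f} PB raw, {cold_7y / 1e15:.0f} PB compressed")
print(f"90d hot:  {hot_90d_raw / 1e15:.1f} PB raw, {hot_90d / 1e12:.0f} TB compressed")
```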

High-Level Architecture

The audit logging system has three layers: Ingestion, Storage, and Query. Ingestion collects events from services via a client SDK (in-process logging library) that buffers events locally and flushes to Kafka every second or when the buffer reaches 1,000 events. Kafka provides durable, ordered event storage with 30-day retention as the source of truth buffer. Two consumers run off Kafka: a hot-path consumer writing to Elasticsearch (for the last 90 days of searchable data) and a cold-path consumer writing to S3 (for long-term immutable archival).

Tamper evidence is achieved through two mechanisms: (1) S3 Object Lock (WORM — Write Once Read Many) on the archival bucket prevents any deletion or modification before the retention period expires. (2) A Merkle tree hash chain is computed over log batches: each batch includes the hash of the previous batch, creating a verifiable chain where any modification to historical logs is detectable. The Merkle roots are published to an external ledger (AWS QLDB or a public blockchain attestation service) for third-party verifiability.
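A minimal sketch of the batch-level hashing described above: a Merkle root computed over the serialized events in a batch, and a chain hash linking each batch to its predecessor. The odd-node-promotion convention is an assumption (some schemes instead duplicate the last node); a production system would publish the roots externally as described.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle root over a batch of serialized events.
    An unpaired node is promoted unchanged to the next level."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [sha256(level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:      # odd node promoted
            nxt.append(level[-1])
        level = nxt
    return level[0]

def chain_hash(prev_chain: bytes, batch_root: bytes) -> bytes:
    """Link this batch's root to the previous batch, forming the chain."""
    return sha256(prev_chain + batch_root)
```

Changing a single event changes that batch's root, which changes its chain hash and every chain hash after it, so tampering anywhere in history is detectable from the latest published root.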

Real-time threat detection runs as a Flink job consuming the Kafka audit stream. Threat detection rules (e.g., more than 10 failed logins for a user in 5 minutes, admin action at 3am, access from impossible travel geography) trigger alerts to the SIEM. The Flink job maintains stateful session windows per user/IP and evaluates complex event patterns using Flink CEP (Complex Event Processing).
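The failed-login rule above amounts to keyed, windowed state. The sketch below shows that logic in plain Python (the real job would express it in Flink CEP with event-time semantics); threshold and window values mirror the rule as stated, and the per-key deque stands in for Flink's managed keyed state.

```python
from collections import defaultdict, deque

class FailedLoginRule:
    """Flag an actor when failed logins exceed a threshold inside a
    sliding window. Plain-Python sketch of the keyed windowed state a
    Flink CEP job would maintain."""

    def __init__(self, threshold: int = 10, window_secs: float = 300.0):
        self.threshold = threshold
        self.window_secs = window_secs
        self.failures: dict[str, deque] = defaultdict(deque)

    def on_event(self, actor_id: str, outcome: str, ts: float) -> bool:
        """Return True if this event trips the alert for actor_id."""
        if outcome != "failure":
            return False
        q = self.failures[actor_id]
        q.append(ts)
        while q and ts - q[0] > self.window_secs:  # evict expired timestamps
            q.popleft()
        return len(q) > self.threshold
```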

Core Components

Audit Log SDK

The SDK is an in-process library integrated into every service. It provides a typed AuditEvent builder enforcing required fields (actor_id, action, resource_type, resource_id, outcome, timestamp). Events are buffered in memory (up to 1,000 events or 1 second, whichever comes first) and flushed asynchronously to Kafka using a producer with acks=all and idempotent producer enabled. On flush failure, events are written to a local disk buffer (up to 10,000 events) before retry; events older than 5 minutes in the disk buffer are forwarded to a dead letter queue with a critical alert.

Immutable S3 Archival with Hash Chain

The cold-path consumer writes events to S3 as Parquet files partitioned by (service, year, month, day, hour). Each Parquet file's SHA-256 hash is computed after write and stored in a manifest.json alongside the data file. The manifest also contains the hash of the previous batch's manifest, forming a hash chain. S3 Object Lock in COMPLIANCE mode (not GOVERNANCE mode) prevents even root account users from modifying or deleting objects before the retention period. Periodic verification jobs re-compute file hashes and compare against the manifest to detect any tampering.

SIEM Integration

A Kafka consumer forwards audit events to Splunk HEC (HTTP Event Collector) or Elastic Logstash at 100,000 events/second. A separate enrichment pipeline adds context to raw events: user_display_name (from user directory lookup), resource_display_name, and geographic location (from IP geolocation database). SIEM correlation rules detect: brute force attacks, privilege escalation sequences, data exfiltration patterns (large volume downloads), and lateral movement (single user accessing many different resources in a short window).
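The enrichment step can be sketched as a pure function over the raw event. `user_dir` and `geo_db` stand in for the user-directory and IP-geolocation lookups; field names follow the pipeline description, and the best-effort behavior (a lookup miss leaves the raw event intact rather than dropping it) is an assumption of this sketch.

```python
def enrich(event: dict, user_dir: dict, geo_db: dict) -> dict:
    """Add display names and geolocation before forwarding to the SIEM.
    Lookups are best-effort: a miss leaves the raw event unchanged."""
    out = dict(event)
    user = user_dir.get(event.get("actor_id"))
    if user:
        out["user_display_name"] = user["display_name"]
    geo = geo_db.get(event.get("ip_address"))
    if geo:
        out["geo_location"] = geo
    return out
```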

Database Design

  • Elasticsearch (hot, 90 days): daily indices named audit-logs-{YYYY-MM-DD}, with ILM (Index Lifecycle Management) rolling to the warm tier after 30 days and deleting at 90 days. Mapping fields: actor_id, actor_type, action, resource_type, resource_id, outcome, ip_address, user_agent, request_id, service, environment, timestamp.
  • S3 archival: partitioned Parquet files.
  • PostgreSQL metadata: ingestion_checkpoints (kafka_topic, partition, last_offset, last_updated) and integrity_manifests (batch_id, s3_path, file_hash, chain_hash, created_at).

API Design

  • GET /audit/events?actor_id={id}&from={ts}&to={ts}&action={action} — Search audit events with filtering; returns paginated results with a cursor.
  • GET /audit/events/{event_id} — Return a single audit event with full context.
  • GET /audit/integrity/verify?from={date}&to={date} — Trigger hash chain verification for a date range; returns verified/tampered status.
  • POST /audit/export — Export audit events for a time range as a signed, encrypted archive for regulatory submission.
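A client would follow the search endpoint's cursor until it is exhausted. In this sketch, `fetch_page(params) -> {"events": [...], "next_cursor": ...}` stands in for the HTTP call; the parameter names match the endpoint above, while the response field `next_cursor` is an assumed name.

```python
from typing import Callable, Iterator

def search_events(fetch_page: Callable[[dict], dict],
                  actor_id: str, ts_from: int, ts_to: int) -> Iterator[dict]:
    """Iterate all matching events, following the pagination cursor."""
    params = {"actor_id": actor_id, "from": ts_from, "to": ts_to}
    cursor = None
    while True:
        page = fetch_page({**params, **({"cursor": cursor} if cursor else {})})
        yield from page["events"]
        cursor = page.get("next_cursor")
        if not cursor:
            break
```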

Scaling & Bottlenecks

Elasticsearch indexing throughput bottlenecks at roughly 100,000 documents/second per hot shard. At 1 million events/second, a 10-node cluster needs on the order of 100 primary shards per daily index (10 per node) so that each shard receives about 10,000 documents/second — comfortably within capacity. Increasing the refresh interval to 30 seconds (vs. the default 1 second) during high-ingestion periods improves bulk indexing throughput by up to 5x, at the cost of 30-second search visibility latency.

Kafka consumer lag during traffic spikes can delay Elasticsearch indexing. A Kafka consumer group with 30 consumers (matching the number of Kafka partitions) provides maximum parallelism for indexing. Dead letter queue consumers handle events that fail Elasticsearch indexing (schema mismatches, index capacity issues) for reprocessing after the issue is resolved.

Key Trade-offs

  • Elasticsearch for search vs. raw S3 with Athena: Elasticsearch provides 5-second search over 90 days; S3 + Athena can query petabytes but at 30–60 second latency for full scans. A hybrid approach uses Elasticsearch for recent data and Athena for historical queries.
  • At-least-once vs. exactly-once delivery: Exactly-once audit log delivery requires Kafka transactions and idempotent consumers, adding complexity; at-least-once with deduplication by event_id achieves the same correctness for audit purposes since duplicate events are detectable and removable.
  • Client-side SDK vs. sidecar agent: In-process SDK has lower latency but requires library adoption by every service; a sidecar agent (eBPF-based syscall capture or service mesh interception) can capture audit events transparently but with higher overhead.
  • Real-time alerting vs. batch analysis: Real-time Flink-based detection catches threats within seconds but requires stateful stream processing infrastructure; batch analytics (daily anomaly detection jobs) are simpler but miss threats that unfold within a day.
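The at-least-once-plus-dedup trade-off above hinges on consumer-side deduplication by event_id. A minimal sketch using a bounded LRU of recently seen IDs keeps memory flat; the window size is an assumption and should exceed the maximum redelivery gap (e.g., the producer's retry horizon).

```python
from collections import OrderedDict

class Deduplicator:
    """Consumer-side dedup by event_id for at-least-once delivery.
    A bounded LRU of recently seen IDs bounds memory usage."""

    def __init__(self, max_ids: int = 1_000_000):
        self.max_ids = max_ids
        self.seen: OrderedDict[str, None] = OrderedDict()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self.seen:
            self.seen.move_to_end(event_id)  # refresh recency
            return True
        self.seen[event_id] = None
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)    # evict oldest
        return False
```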
