System Design: Multi-Tenant SaaS Architecture
Design a multi-tenant SaaS architecture that isolates tenant data, enforces resource quotas, and scales from startup to enterprise customers with configurable isolation models and zero cross-tenant data leakage.
Requirements
Functional Requirements:
- Isolate tenant data: tenant A cannot access tenant B's data under any circumstances
- Support per-tenant configuration: custom domains, feature flags, branding, and integrations
- Enforce resource quotas: CPU, storage, API rate limits, and seat counts per tenant
- Support multiple isolation tiers: shared infrastructure (free tier) to dedicated infrastructure (enterprise)
- Onboard new tenants programmatically (tenant provisioning API)
- Provide per-tenant usage metrics for billing
Non-Functional Requirements:
- Noisy neighbor isolation: one tenant's heavy usage must not degrade others beyond SLA
- Sub-10ms tenant context resolution on every request
- Support 100,000 tenants on shared infrastructure and 100 enterprise tenants with dedicated resources
- Tenant data deletion (GDPR right to erasure) within 24 hours
Scale Estimation
100,000 tenants with an average of 10 active users each = 1M active users. At an average of 10 requests/user/min, that is 10M requests/min ≈ 167,000 requests/sec. Data per tenant: average 10 GB × 100,000 tenants = 1 PB total. Tenant lookup (resolving the tenant from the request) happens on every request, so it must be cached. Tenant metadata: 100,000 tenants × 2 KB = 200 MB, which fits in Redis. Enterprise tenants: 100 tenants with dedicated DB clusters = 100 PostgreSQL clusters to manage. Provisioning: 100 new tenants/day ≈ one provisioning operation every ~15 minutes.
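The estimates above can be reproduced with a quick back-of-envelope script (all figures come from this section; decimal units assumed for GB/PB):

```python
# Back-of-envelope numbers from the Scale Estimation section.
tenants = 100_000
users = tenants * 10                    # 10 active users/tenant -> 1M users
req_per_sec = users * 10 / 60           # 10 req/user/min -> ~166,667 req/s
storage_pb = tenants * 10 / 1_000_000   # 10 GB/tenant -> 1 PB total
registry_mb = tenants * 2 / 1_000       # 2 KB metadata/tenant -> 200 MB
provision_interval_min = 24 * 60 / 100  # 100 tenants/day -> one op per ~14 min

print(f"{req_per_sec:,.0f} req/s, {storage_pb} PB, {registry_mb} MB registry, "
      f"one provisioning op every {provision_interval_min:.1f} min")
```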
High-Level Architecture
Three isolation models, applied based on tenant tier:
Silo (dedicated): Each tenant gets dedicated infrastructure (own database, own compute). Maximum isolation and customization. Used for enterprise/high-compliance tenants. High cost per tenant; impractical for free/low-tier.
Bridge (schema-per-tenant): Single shared database server, one schema (or database) per tenant. Good isolation at the data layer (tenant A's tables are in a separate schema from tenant B's). Shared compute infrastructure. Used for mid-tier tenants. Schema proliferation (100,000 schemas in one Postgres instance) hits connection pool and migration complexity limits.
Pool (shared schema): All tenants share the same tables. Every row has a tenant_id column. Access control enforced at the application layer via row-level security (RLS) policies in PostgreSQL — every query automatically filters by the current tenant's ID. Lowest cost per tenant, maximum density. Used for free/small tenants. Noisy neighbor risk; a missing RLS policy is a data leak.
Core Components
Tenant Resolver
Every incoming request must be resolved to a tenant context. Resolution methods: subdomain (acme.app.example.com → tenant=acme), custom domain (requires a DNS CNAME from the tenant's domain to the SaaS platform, resolved via a DNS-to-tenant mapping table), JWT claim (tenant_id in the authentication token), or API key lookup (key → tenant_id). The resolved tenant context (tenant_id, plan, feature flags, and, for silo tenants, the database connection string) is cached in Redis with a 60-second TTL. Hot path (cache hit): <1ms. Cold path (cache miss): a database lookup in the tenant registry (~5ms).
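A minimal sketch of the subdomain resolution path, assuming an in-memory dict in place of Redis and an illustrative `TENANT_REGISTRY` dict in place of the database-backed registry (names are not from the source):

```python
import time

# Stand-ins for Redis and the tenant registry table (illustrative only).
CACHE = {}                                          # slug -> (context, expiry)
TENANT_REGISTRY = {"acme": {"tenant_id": "t-123", "plan": "pool"}}
CACHE_TTL = 60.0                                    # seconds, per the design

def resolve_tenant(host: str) -> dict:
    """Resolve a request's Host header to a tenant context.
    Hot path: cache hit (<1 ms). Cold path: registry lookup, then cache fill."""
    slug = host.split(".", 1)[0]                    # acme.app.example.com -> acme
    now = time.monotonic()
    hit = CACHE.get(slug)
    if hit and hit[1] > now:                        # hot path: cache hit
        return hit[0]
    context = TENANT_REGISTRY.get(slug)             # cold path: ~5 ms DB lookup
    if context is None:
        raise LookupError(f"unknown tenant: {slug}")
    CACHE[slug] = (context, now + CACHE_TTL)
    return context
```

A production resolver would also try the custom-domain mapping, JWT claims, and API-key lookup before failing.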
Row-Level Security (Pool Model)
For pool-model tenants, PostgreSQL RLS policies enforce data isolation. Row-level security is enabled on every table (ALTER TABLE orders ENABLE ROW LEVEL SECURITY) with a policy such as: CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')::uuid). At the start of each transaction, the application runs SET LOCAL app.tenant_id = '<tenant_id>'; because SET LOCAL is transaction-scoped, the setting cannot leak across requests that share a pooled connection. RLS policies are then applied automatically to every subsequent query in that transaction, so developers cannot accidentally omit the tenant filter. The policies are exercised by adversarial test suites that attempt cross-tenant access.
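A sketch of the per-request wrapper, assuming a DB-API style connection (psycopg-like; all names are illustrative). `set_config('app.tenant_id', ..., true)` is the parameter-safe equivalent of SET LOCAL and is likewise transaction-scoped:

```python
from contextlib import contextmanager

@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run a transaction with app.tenant_id set so RLS policies apply.
    set_config(..., true) makes the setting transaction-local, so it cannot
    leak to other requests sharing this pooled connection."""
    with conn:  # DB-API connection as context manager: begin, commit on exit
        cur = conn.cursor()
        # Parameterized: never interpolate tenant_id into the SQL string.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        yield cur
```

Every query issued through the yielded cursor is then filtered by the tenant_isolation policy on the server side.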
Resource Quota Enforcement
Quotas are enforced at multiple layers. API rate limits: the API gateway enforces per-tenant rate limits using Redis counters (the same sliding-window algorithm as standalone rate limiting). Storage quotas: a background job periodically computes each tenant's storage usage (row counts and index sizes per tenant schema, or S3 prefix byte counts) and stores it in the tenant registry; when a tenant exceeds their storage quota, new writes are rejected with a 402 Payment Required error. CPU quotas (compute isolation): Kubernetes namespace resource quotas limit CPU/memory per tenant in the silo/bridge models; in the pool model, slow-query timeouts and connection pool limits prevent any single tenant from starving others.
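A minimal per-tenant sliding-window limiter, with an in-memory deque standing in for the Redis counters named above (an illustrative sketch, not the production algorithm):

```python
import time
from collections import defaultdict, deque

WINDOW_SECS = 60.0
_hits = defaultdict(deque)      # tenant_id -> timestamps of recent requests

def allow_request(tenant_id: str, limit: int, now=None) -> bool:
    """True if the tenant is under its per-minute limit (sliding window log).
    Production would use Redis sorted sets or bucketed counters instead of
    per-process memory, so all gateway instances share one view."""
    now = time.monotonic() if now is None else now
    window = _hits[tenant_id]
    while window and window[0] <= now - WINDOW_SECS:
        window.popleft()                       # evict hits outside the window
    if len(window) >= limit:
        return False                           # over quota: reject (e.g. 429)
    window.append(now)
    return True
```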
Database Design
The tenant registry (PostgreSQL): tenants (id UUID, name, plan, status, isolation_model, db_connection_string, custom_domain, provisioned_at, metadata JSONB), tenant_quotas (tenant_id, max_seats, max_storage_gb, api_rate_limit, current_storage_gb, current_seats), tenant_features (tenant_id, feature_flag, enabled). The registry is the source of truth for tenant configuration.
For pool-model tenants, all application tables include tenant_id UUID NOT NULL as the first column and a composite index (tenant_id, primary_key), which ensures tenant-scoped queries use the index efficiently. Without an index leading on tenant_id, a tenant-scoped query such as WHERE tenant_id = X (e.g., listing one tenant's orders) must scan rows belonging to every tenant, which becomes prohibitively slow on large shared tables.
API Design
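The source names a provisioning API and per-tenant usage metrics but does not enumerate endpoints. One plausible shape for the provisioning handler (POST /v1/tenants), with an illustrative plan-to-isolation mapping and field names that are assumptions, not from the source:

```python
import uuid

# Illustrative mapping from plan to isolation model (assumption).
PLAN_TO_ISOLATION = {"free": "pool", "pro": "bridge", "enterprise": "silo"}

def provision_tenant(registry: dict, name: str, plan: str) -> dict:
    """Handle POST /v1/tenants: write a registry record in 'provisioning'
    status; a background workflow would then create the schema or dedicated
    cluster and flip status to 'active'."""
    if plan not in PLAN_TO_ISOLATION:
        raise ValueError(f"unknown plan: {plan}")
    tenant = {
        "id": str(uuid.uuid4()),
        "name": name,
        "plan": plan,
        "isolation_model": PLAN_TO_ISOLATION[plan],
        "status": "provisioning",
    }
    registry[tenant["id"]] = tenant
    return tenant
```

Companion endpoints would plausibly include GET /v1/tenants/{id} for configuration and GET /v1/tenants/{id}/usage for the billing metrics mentioned in the requirements (again, hypothetical paths).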
Scaling & Bottlenecks
The pool model bottlenecks on large tenants generating disproportionate query load. A single enterprise-sized tenant in the pool model can saturate shared database connections and lock pages needed by smaller tenants. Mitigation: enforce a maximum tenant size for the pool model (migrate tenants above a threshold to the bridge or silo model); use connection pooling (PgBouncer) with per-tenant connection limits to prevent any single tenant from holding all connections.
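One way the PgBouncer mitigation can look in the bridge/silo models, where each tenant maps to its own database entry so pool_size caps that tenant's server connections (a sketch with illustrative host and tenant names):

```ini
; Each tenant's entry gets its own server-side pool; pool_size bounds how
; many PostgreSQL connections that tenant can hold at once.
[databases]
tenant_acme    = host=shared-db.internal dbname=tenant_acme pool_size=5
tenant_globex  = host=shared-db.internal dbname=tenant_globex pool_size=5
; a large enterprise tenant on dedicated hardware gets a bigger pool
tenant_initech = host=initech-db.internal dbname=app pool_size=50

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 20
```

In the pure pool model (one shared database), the equivalent cap has to live at the application layer, e.g. a per-tenant semaphore in front of the shared pool.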
Schema migrations are an operational challenge in multi-tenant architectures. In the pool model, one migration runs across all tenants' data simultaneously; a simple ALTER TABLE can take hours on a large shared table, and online schema-change tools (e.g., pg_repack for PostgreSQL; pt-online-schema-change or gh-ost for MySQL) minimize locking. In the bridge/silo models, migrations must run against each tenant's schema independently: a rollout to 100,000 schemas requires orchestration (run migrations in batches of 1,000, verify, then proceed).
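The batch-verify-proceed rollout described above can be sketched as follows, with the actual migrate and verify steps passed in as callables (illustrative names):

```python
def rollout_migration(schemas, migrate, verify, batch_size=1000):
    """Apply a migration to tenant schemas in batches; halt at the first
    batch whose verification fails so the blast radius stays bounded."""
    done = []
    for start in range(0, len(schemas), batch_size):
        batch = schemas[start:start + batch_size]
        for schema in batch:
            migrate(schema)          # e.g. run the ALTER TABLE in this schema
        bad = [s for s in batch if not verify(s)]
        if bad:
            raise RuntimeError(
                f"verification failed for {len(bad)} schemas; halting rollout")
        done.extend(batch)
    return done
```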
Key Trade-offs
- Pool vs. silo isolation model: Pool maximizes density and reduces cost but risks data leakage via bugs (missing tenant filter) and noisy neighbor; silo eliminates these risks but costs 10-100x more per tenant
- Application-layer vs. database-layer isolation: Application-layer filtering (WHERE tenant_id = X) is fast but one developer mistake leaks data; database-layer RLS adds slight overhead but is defense-in-depth
- Shared vs. dedicated connection pools: Shared pools maximize connection efficiency but allow one tenant to exhaust the pool; per-tenant pools provide isolation but scale poorly (100,000 tenants × 5 connections = 500,000 PostgreSQL connections)
- Eager vs. lazy tenant migration between tiers: Migrating a tenant from pool to silo proactively (before they need it) avoids emergency migrations; waiting until they outgrow the pool risks SLA violations during the migration window