System Design: Multi-Tenant SaaS Architecture
Design a multi-tenant SaaS architecture that isolates tenant data, enforces resource quotas, and scales from startup to enterprise customers with configurable isolation models and zero cross-tenant data leakage.
Requirements
Functional Requirements:
- Isolate tenant data: tenant A cannot access tenant B's data under any circumstances
- Support per-tenant configuration: custom domains, feature flags, branding, and integrations
- Enforce resource quotas: CPU, storage, API rate limits, and seat counts per tenant
- Support multiple isolation tiers: shared infrastructure (free tier) to dedicated infrastructure (enterprise)
- Onboard new tenants programmatically (tenant provisioning API)
- Provide per-tenant usage metrics for billing
Non-Functional Requirements:
- Noisy neighbor isolation: one tenant's heavy usage must not degrade others beyond SLA
- Sub-10ms tenant context resolution on every request
- Support 100,000 tenants on shared infrastructure and 100 enterprise tenants with dedicated resources
- Tenant data deletion (GDPR right to erasure) within 24 hours
Scale Estimation
100,000 tenants with an average of 10 active users each = 1M active users. At an average of 10 requests/user/min, that is 10M requests/min ≈ 167,000 requests/sec. Data per tenant: average 10 GB × 100,000 tenants = 1 PB total. Tenant lookup (resolving the tenant from the request) happens on every request, so it must be cached. Tenant metadata: 100,000 tenants × 2 KB = 200 MB, which fits in Redis. Enterprise tenants: 100 tenants with dedicated DB clusters = 100 PostgreSQL clusters to manage. Provisioning: 100 new tenants/day ≈ one provisioning operation every ~15 minutes.
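The estimates above can be reproduced with a quick back-of-envelope script (all figures come from this section; decimal units assumed for GB/PB):

```python
# Back-of-envelope numbers from the Scale Estimation section.
tenants = 100_000
users = tenants * 10                    # 10 active users/tenant -> 1M users
req_per_sec = users * 10 / 60           # 10 req/user/min -> ~166,667 req/s
storage_pb = tenants * 10 / 1_000_000   # 10 GB/tenant -> 1 PB total
registry_mb = tenants * 2 / 1_000       # 2 KB metadata/tenant -> 200 MB
provision_interval_min = 24 * 60 / 100  # 100 tenants/day -> one op per ~14 min

print(f"{req_per_sec:,.0f} req/s, {storage_pb} PB, {registry_mb} MB registry, "
      f"one provisioning op every {provision_interval_min:.1f} min")
```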
High-Level Architecture
Three isolation models, applied based on tenant tier:
Silo (dedicated): Each tenant gets dedicated infrastructure (own database, own compute). Maximum isolation and customization. Used for enterprise/high-compliance tenants. High cost per tenant; impractical for free/low-tier.
Bridge (schema-per-tenant): Single shared database server, one schema (or database) per tenant. Good isolation at the data layer (tenant A's tables are in a separate schema from tenant B's). Shared compute infrastructure. Used for mid-tier tenants. Schema proliferation (100,000 schemas in one Postgres instance) hits connection pool and migration complexity limits.
Pool (shared schema): All tenants share the same tables. Every row has a tenant_id column. Access control enforced at the application layer via row-level security (RLS) policies in PostgreSQL — every query automatically filters by the current tenant's ID. Lowest cost per tenant, maximum density. Used for free/small tenants. Noisy neighbor risk; a missing RLS policy is a data leak.
Core Components
Tenant Resolver
Every incoming request must be resolved to a tenant context. Resolution methods: subdomain (acme.app.example.com → tenant=acme), custom domain (requires a DNS CNAME from the tenant's domain to the SaaS platform, resolved via a DNS-to-tenant mapping table), JWT claim (tenant_id in the authentication token), or API key lookup (key → tenant_id). The resolved tenant context (tenant_id, plan, feature flags, and, for silo tenants, the database connection string) is cached in Redis with a 60-second TTL. Hot path (cache hit): <1ms. Cold path (cache miss): a database lookup in the tenant registry (~5ms).
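A minimal sketch of the subdomain resolution path, assuming an in-memory dict in place of Redis and an illustrative `TENANT_REGISTRY` dict in place of the database-backed registry (names are not from the source):

```python
import time

# Stand-ins for Redis and the tenant registry table (illustrative only).
CACHE = {}                                          # slug -> (context, expiry)
TENANT_REGISTRY = {"acme": {"tenant_id": "t-123", "plan": "pool"}}
CACHE_TTL = 60.0                                    # seconds, per the design

def resolve_tenant(host: str) -> dict:
    """Resolve a request's Host header to a tenant context.
    Hot path: cache hit (<1 ms). Cold path: registry lookup, then cache fill."""
    slug = host.split(".", 1)[0]                    # acme.app.example.com -> acme
    now = time.monotonic()
    hit = CACHE.get(slug)
    if hit and hit[1] > now:                        # hot path: cache hit
        return hit[0]
    context = TENANT_REGISTRY.get(slug)             # cold path: ~5 ms DB lookup
    if context is None:
        raise LookupError(f"unknown tenant: {slug}")
    CACHE[slug] = (context, now + CACHE_TTL)
    return context
```

A production resolver would also try the custom-domain mapping, JWT claims, and API-key lookup before failing.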
Row-Level Security (Pool Model)
For pool-model tenants, PostgreSQL RLS policies enforce data isolation. Row-level security is enabled on every table (ALTER TABLE orders ENABLE ROW LEVEL SECURITY) with a policy such as: CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')::uuid). At the start of each transaction, the application runs SET LOCAL app.tenant_id = '<tenant_id>'; because SET LOCAL is transaction-scoped, the setting cannot leak across requests that share a pooled connection. RLS policies are then applied automatically to every subsequent query in that transaction, so developers cannot accidentally omit the tenant filter. The policies are exercised by adversarial test suites that attempt cross-tenant access.
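A sketch of the per-request wrapper, assuming a DB-API style connection (psycopg-like; all names are illustrative). `set_config('app.tenant_id', ..., true)` is the parameter-safe equivalent of SET LOCAL and is likewise transaction-scoped:

```python
from contextlib import contextmanager

@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run a transaction with app.tenant_id set so RLS policies apply.
    set_config(..., true) makes the setting transaction-local, so it cannot
    leak to other requests sharing this pooled connection."""
    with conn:  # DB-API connection as context manager: begin, commit on exit
        cur = conn.cursor()
        # Parameterized: never interpolate tenant_id into the SQL string.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        yield cur
```

Every query issued through the yielded cursor is then filtered by the tenant_isolation policy on the server side.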
Resource Quota Enforcement
Quotas are enforced at multiple layers. API rate limits: the API gateway enforces per-tenant rate limits using Redis counters (the same sliding-window algorithm as standalone rate limiting). Storage quotas: a background job periodically computes each tenant's storage usage (row counts and index sizes per tenant schema, or S3 prefix byte counts) and stores it in the tenant registry; when a tenant exceeds their storage quota, new writes are rejected with a 402 Payment Required error. CPU quotas (compute isolation): Kubernetes namespace resource quotas limit CPU/memory per tenant in the silo/bridge models; in the pool model, slow-query timeouts and connection pool limits prevent any single tenant from starving others.
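A minimal per-tenant sliding-window limiter, with an in-memory deque standing in for the Redis counters named above (an illustrative sketch, not the production algorithm):

```python
import time
from collections import defaultdict, deque

WINDOW_SECS = 60.0
_hits = defaultdict(deque)      # tenant_id -> timestamps of recent requests

def allow_request(tenant_id: str, limit: int, now=None) -> bool:
    """True if the tenant is under its per-minute limit (sliding window log).
    Production would use Redis sorted sets or bucketed counters instead of
    per-process memory, so all gateway instances share one view."""
    now = time.monotonic() if now is None else now
    window = _hits[tenant_id]
    while window and window[0] <= now - WINDOW_SECS:
        window.popleft()                       # evict hits outside the window
    if len(window) >= limit:
        return False                           # over quota: reject (e.g. 429)
    window.append(now)
    return True
```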
Database Design
The tenant registry (PostgreSQL): tenants (id UUID, name, plan, status, isolation_model, db_connection_string, custom_domain, provisioned_at, metadata JSONB), tenant_quotas (tenant_id, max_seats, max_storage_gb, api_rate_limit, current_storage_gb, current_seats), tenant_features (tenant_id, feature_flag, enabled). The registry is the source of truth for tenant configuration.
For pool-model tenants, all application tables include tenant_id UUID NOT NULL as the first column and a composite index (tenant_id, primary_key), which ensures tenant-scoped queries use the index efficiently. Without an index leading on tenant_id, a tenant-scoped query such as WHERE tenant_id = X (e.g., listing one tenant's orders) must scan rows belonging to every tenant, which becomes prohibitively slow on large shared tables.
API Design
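The source names a provisioning API and per-tenant usage metrics but does not enumerate endpoints. One plausible shape for the provisioning handler (POST /v1/tenants), with an illustrative plan-to-isolation mapping and field names that are assumptions, not from the source:

```python
import uuid

# Illustrative mapping from plan to isolation model (assumption).
PLAN_TO_ISOLATION = {"free": "pool", "pro": "bridge", "enterprise": "silo"}

def provision_tenant(registry: dict, name: str, plan: str) -> dict:
    """Handle POST /v1/tenants: write a registry record in 'provisioning'
    status; a background workflow would then create the schema or dedicated
    cluster and flip status to 'active'."""
    if plan not in PLAN_TO_ISOLATION:
        raise ValueError(f"unknown plan: {plan}")
    tenant = {
        "id": str(uuid.uuid4()),
        "name": name,
        "plan": plan,
        "isolation_model": PLAN_TO_ISOLATION[plan],
        "status": "provisioning",
    }
    registry[tenant["id"]] = tenant
    return tenant
```

Companion endpoints would plausibly include GET /v1/tenants/{id} for configuration and GET /v1/tenants/{id}/usage for the billing metrics mentioned in the requirements (again, hypothetical paths).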
Scaling & Bottlenecks
The pool model bottlenecks on large tenants generating disproportionate query load. A single enterprise-sized tenant in the pool model can saturate shared database connections and lock pages needed by smaller tenants. Mitigation: enforce a maximum tenant size for the pool model (migrate tenants above a threshold to the bridge or silo model); use connection pooling (PgBouncer) with per-tenant connection limits to prevent any single tenant from holding all connections.
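One way the PgBouncer mitigation can look in the bridge/silo models, where each tenant maps to its own database entry so pool_size caps that tenant's server connections (a sketch with illustrative host and tenant names):

```ini
; Each tenant's entry gets its own server-side pool; pool_size bounds how
; many PostgreSQL connections that tenant can hold at once.
[databases]
tenant_acme    = host=shared-db.internal dbname=tenant_acme pool_size=5
tenant_globex  = host=shared-db.internal dbname=tenant_globex pool_size=5
; a large enterprise tenant on dedicated hardware gets a bigger pool
tenant_initech = host=initech-db.internal dbname=app pool_size=50

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 20
```

In the pure pool model (one shared database), the equivalent cap has to live at the application layer, e.g. a per-tenant semaphore in front of the shared pool.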
Schema migrations are an operational challenge in multi-tenant architectures. In the pool model, one migration runs across all tenants' data simultaneously; a simple ALTER TABLE can take hours on a large shared table, and online schema-change tools (e.g., pg_repack for PostgreSQL; pt-online-schema-change or gh-ost for MySQL) minimize locking. In the bridge/silo models, migrations must run against each tenant's schema independently: a rollout to 100,000 schemas requires orchestration (run migrations in batches of 1,000, verify, then proceed).
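The batch-verify-proceed rollout described above can be sketched as follows, with the actual migrate and verify steps passed in as callables (illustrative names):

```python
def rollout_migration(schemas, migrate, verify, batch_size=1000):
    """Apply a migration to tenant schemas in batches; halt at the first
    batch whose verification fails so the blast radius stays bounded."""
    done = []
    for start in range(0, len(schemas), batch_size):
        batch = schemas[start:start + batch_size]
        for schema in batch:
            migrate(schema)          # e.g. run the ALTER TABLE in this schema
        bad = [s for s in batch if not verify(s)]
        if bad:
            raise RuntimeError(
                f"verification failed for {len(bad)} schemas; halting rollout")
        done.extend(batch)
    return done
```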
Key Trade-offs
- Pool vs. silo isolation model: Pool maximizes density and reduces cost but risks data leakage via bugs (missing tenant filter) and noisy neighbor; silo eliminates these risks but costs 10-100x more per tenant
- Application-layer vs. database-layer isolation: Application-layer filtering (WHERE tenant_id = X) is fast but one developer mistake leaks data; database-layer RLS adds slight overhead but is defense-in-depth
- Shared vs. dedicated connection pools: Shared pools maximize connection efficiency but allow one tenant to exhaust the pool; per-tenant pools provide isolation but scale poorly (100,000 tenants × 5 connections = 500,000 PostgreSQL connections)
- Eager vs. lazy tenant migration between tiers: Migrating a tenant from pool to silo proactively (before they need it) avoids emergency migrations; waiting until they outgrow the pool risks SLA violations during the migration window