System Design: Financial Data Aggregator (Plaid-style)
Design a financial data aggregation platform connecting to thousands of financial institutions, normalizing account data, enabling secure credential management, and serving fintech applications with standardized banking APIs.
Requirements
Functional Requirements:
- Connect to 10,000+ financial institutions (banks, credit unions, brokerages) via APIs and screen scraping
- Authenticate users securely with bank credentials using OAuth where available, credential-based where not
- Normalize account data (balances, transactions, identity) into a standardized schema across all institutions
- Categorize and enrich transactions (merchant name cleaning, category assignment)
- Webhook notifications for real-time balance changes and new transactions
- Link token flow enabling end-user applications to connect bank accounts via a drop-in UI
Non-Functional Requirements:
- Support 200M linked bank accounts across all client applications
- Data freshness: transaction data updated within 4 hours for API-connected institutions, 24 hours for scraped
- 99.9% availability for the Link flow (account connection); 99.5% for data refresh (dependent on institution uptime)
- Credential storage with AES-256 encryption and HSM-managed keys, SOC 2 Type II certified
- Rate limiting per institution to avoid triggering fraud detection or account lockouts
Scale Estimation
- Refresh volume: with 200M linked accounts refreshed on average every 8 hours, 200M ÷ 8 hours = 25M refreshes/hour ≈ 6,944 refreshes/sec
- Outbound API calls: each refresh pulls 2-3 API pages of transactions (30-day rolling window) ≈ 15,000-20,000 API calls/sec to financial institutions
- Inbound client traffic: 50M API requests/day for balance and transaction queries ≈ 580 requests/sec
- Categorization: 200M accounts × 30 transactions/month = 6B transactions/month to categorize
- Credential vault: 200M encrypted credential sets, requiring HSM-backed decryption at ~7K decryptions/sec during peak refresh
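The estimates above can be sanity-checked with a short back-of-envelope script (the figures are the ones stated in this section, rounded the same way):

```python
# Back-of-envelope check of the scale estimates.
linked_accounts = 200_000_000
refresh_interval_h = 8

refreshes_per_sec = linked_accounts / refresh_interval_h / 3600
assert round(refreshes_per_sec) == 6_944  # 200M / 8h ≈ 6,944/sec

# 2-3 transaction pages per refresh -> outbound call rate to institutions
outbound_calls_per_sec = (refreshes_per_sec * 2, refreshes_per_sec * 3)
# ≈ (13.9K, 20.8K), i.e. roughly the 15K-20K range cited above

client_requests_per_sec = 50_000_000 / 86_400  # ≈ 579/sec inbound

txns_per_month = linked_accounts * 30
assert txns_per_month == 6_000_000_000  # 6B transactions/month to categorize
```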
High-Level Architecture
The aggregation platform is organized into three layers: the Connection Layer (interfaces with financial institutions), the Data Layer (stores and normalizes financial data), and the API Layer (serves fintech client applications).
The Connection Layer contains Institution Adapters — each financial institution has a dedicated adapter implementing one of several connection strategies: OAuth/Open Banking APIs (used by ~500 institutions that support standardized APIs like FDX or Open Banking UK), proprietary bank APIs (500+ institutions with undocumented APIs reverse-engineered from mobile apps), and screen scraping (remaining institutions where no API exists — a headless browser logs into the bank's website and extracts data from HTML). Each adapter normalizes the institution's data format into the platform's canonical schema.
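The adapter pattern described above can be sketched as a common interface that each connection strategy implements; the class and field names here are illustrative, not the platform's actual code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class NormalizedAccount:
    """One record in the platform's canonical schema."""
    account_id: str
    account_type: str  # CHECKING / SAVINGS / CREDIT / BROKERAGE
    current_balance: float
    currency: str

class InstitutionAdapter(ABC):
    """Interface implemented by every institution-specific adapter,
    whether it talks OAuth/FDX, a proprietary API, or a headless browser."""

    @abstractmethod
    def authenticate(self, credentials: dict) -> str:
        """Return an institution session handle (OAuth token, cookie jar, ...)."""

    @abstractmethod
    def fetch_accounts(self, session: str) -> list[NormalizedAccount]:
        """Pull raw institution data and map it into the canonical schema."""

class OAuthAdapter(InstitutionAdapter):
    # Strategy used by the ~500 institutions with FDX / Open Banking APIs.
    def authenticate(self, credentials: dict) -> str:
        # Real implementation would run the OAuth token exchange.
        return f"oauth-session-for-{credentials['client_id']}"

    def fetch_accounts(self, session: str) -> list[NormalizedAccount]:
        # Real implementation would call the bank's API and normalize the JSON.
        return [NormalizedAccount("acc-1", "CHECKING", 1250.0, "USD")]
```

A screen-scraping adapter would implement the same two methods, so the rest of the platform never sees which strategy was used.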
The Data Layer uses a two-tier storage model. Hot storage (PostgreSQL partitioned by account_id) holds the latest 90 days of transactions and current balances — this is what client APIs query. Cold storage (S3 + Parquet) holds historical transactions beyond 90 days, accessible via async API. A Transaction Enrichment Pipeline processes raw transactions through ML-based merchant name cleaning ("SQ COFFEE SHOP NYC" → "Starbucks, New York"), category assignment (Food & Drink → Coffee Shops), and recurring transaction detection.
The API Layer exposes a REST API consumed by fintech applications. The Link flow is a critical UX component: a JavaScript SDK renders an institution search modal → user selects their bank → enters credentials (or completes OAuth) → the platform authenticates with the institution → returns a persistent access_token to the fintech app for subsequent data requests.
Core Components
Institution Connection Manager
The Connection Manager orchestrates 200M account connections across 10,000 institutions with vastly different reliability profiles. Each institution has a health score computed from: authentication success rate (last 24h), data retrieval success rate, average latency, and error rate by type (rate limited, credential expired, site down). The Refresh Scheduler prioritizes connections based on health score and data freshness — healthy institutions are refreshed every 4 hours, degraded institutions are retried with exponential backoff. A rate limiter per institution ensures the platform doesn't trigger the bank's fraud detection (typically 2-5 requests/sec per institution). Screen-scraped connections use a pool of headless Chromium browsers (managed via Playwright) running on dedicated infrastructure to avoid IP blacklisting.
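The per-institution rate limiter can be sketched as a token bucket sized to the bank's observed tolerance (the 2-5 req/sec figure above); this is a minimal single-process sketch, not a distributed implementation:

```python
import time

class InstitutionRateLimiter:
    """Token bucket capping outbound request rate for one institution,
    keeping the platform under the bank's fraud-detection threshold."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # refill rate (observed bank limit)
        self.capacity = burst             # max burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller re-queues the refresh

limiter = InstitutionRateLimiter(rate_per_sec=3.0, burst=3)
allowed = sum(limiter.try_acquire() for _ in range(10))
# Only the initial burst passes immediately; the rest wait for refill.
```

In practice each institution's worker pool would share one bucket (e.g. in Redis) so the limit holds across all refresh workers.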
Credential Vault
User bank credentials are the most sensitive data in the system. The Credential Vault stores credentials encrypted with AES-256-GCM using envelope encryption: each credential set has a unique data encryption key (DEK) which is itself encrypted by a key encryption key (KEK) stored in an HSM (AWS CloudHSM or Thales Luna). Decryption requires: reading the encrypted credential from the database, calling the HSM to unwrap the DEK, then decrypting the credential in memory. Decrypted credentials are never written to disk or logged. The vault runs in an isolated VPC with no internet egress except to financial institutions. Access is restricted to the Connection Manager service via mutual TLS. Credential rotation monitoring detects when a user changes their bank password (failed authentication) and triggers a re-link flow notification to the end-user via the client application.
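The envelope-encryption flow can be sketched as follows. Note the loud caveat: the XOR "cipher" below is a placeholder so the example stays self-contained; production uses AES-256-GCM, and the HSM stand-in only illustrates that the KEK never leaves the hardware boundary:

```python
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Placeholder cipher for illustration ONLY -- production is AES-256-GCM.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class FakeHSM:
    """Stand-in for CloudHSM / Thales Luna: the KEK never leaves this object;
    callers can only ask it to wrap or unwrap DEKs."""
    def __init__(self):
        self._kek = os.urandom(32)
    def wrap(self, dek: bytes) -> bytes:
        return xor_bytes(dek, self._kek)
    def unwrap(self, wrapped: bytes) -> bytes:
        return xor_bytes(wrapped, self._kek)

hsm = FakeHSM()

# Encrypt: fresh DEK per credential set; only the wrapped DEK and the
# ciphertext are persisted (matching the credentials table columns).
dek = os.urandom(32)
stored = {
    "encrypted_data": xor_bytes(b"user:pass@bank", dek),
    "dek_encrypted": hsm.wrap(dek),
}

# Decrypt: read row, unwrap DEK via the HSM, decrypt in memory only.
plaintext = xor_bytes(stored["encrypted_data"], hsm.unwrap(stored["dek_encrypted"]))
assert plaintext == b"user:pass@bank"
```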
Transaction Enrichment Pipeline
Raw bank transactions contain messy merchant descriptions ("CHECKCARD 0423 SQ JOE S PIZZA NEW Y"). The Enrichment Pipeline cleans and categorizes these through a multi-stage process: (1) pattern matching against a curated merchant database (500K known patterns covering 80% of transactions), (2) ML classification for unrecognized merchants using a fine-tuned BERT model trained on 10B+ historical transactions, (3) geographic enrichment by extracting location from the description and matching to a merchant database (Foursquare/Google Places). The pipeline assigns each transaction a clean merchant name, logo URL, category (hierarchical: Food & Drink → Restaurants → Pizza), and a confidence score. Transactions with confidence below 0.8 are flagged for human review to expand the pattern database.
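The pattern-match-then-ML cascade with a confidence gate can be sketched like this; the two patterns and the stubbed classifier are illustrative stand-ins for the 500K-pattern database and the BERT model:

```python
import re

MERCHANT_PATTERNS = {  # tiny illustrative slice of the curated pattern database
    re.compile(r"SQ\s+.*PIZZA", re.I): ("Joe's Pizza", "Food & Drink > Restaurants > Pizza"),
    re.compile(r"STARBUCKS|SBUX", re.I): ("Starbucks", "Food & Drink > Coffee Shops"),
}
REVIEW_THRESHOLD = 0.8  # below this, route to human review

def ml_classify(description: str) -> tuple[str, str, float]:
    # Placeholder for the fine-tuned BERT model: (merchant, category, confidence).
    return (description.title(), "Uncategorized", 0.4)

def enrich(description: str) -> dict:
    # Stage 1: curated patterns (covers ~80% of transactions, high confidence).
    for pattern, (merchant, category) in MERCHANT_PATTERNS.items():
        if pattern.search(description):
            return {"merchant": merchant, "category": category,
                    "confidence": 0.99, "needs_review": False}
    # Stage 2: ML fallback for the long tail; low confidence gets flagged.
    merchant, category, conf = ml_classify(description)
    return {"merchant": merchant, "category": category,
            "confidence": conf, "needs_review": conf < REVIEW_THRESHOLD}
```

Flagged transactions feed the human-review loop, whose output grows the pattern database and shrinks the ML long tail over time.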
Database Design
The primary store is PostgreSQL (Citus) sharded by account_id. Core tables: items (item_id, institution_id, user_id, access_token_hash, status ACTIVE/NEEDS_REAUTH/REVOKED, last_refresh_at, created_at), accounts (account_id, item_id, account_type CHECKING/SAVINGS/CREDIT/BROKERAGE, name, mask_last4, current_balance, available_balance, currency, last_updated), transactions (transaction_id, account_id, date, amount, merchant_name, merchant_name_clean, category, category_id, pending BOOLEAN, description_raw, location JSONB, created_at).
The transactions table is partitioned by date (monthly partitions) within each shard. A unique index on (account_id, transaction_id_from_institution) prevents duplicate transaction insertion during refreshes. The credential vault uses a separate, isolated PostgreSQL instance: credentials (credential_id, item_id, encrypted_data BYTEA, dek_encrypted BYTEA, kek_id, created_at, rotated_at). Institution metadata lives in a separate database: institutions (institution_id, name, url, logo_url, connection_type API/SCRAPE, health_score, supported_products, country, last_health_check).
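The deduplicating unique index can be demonstrated with an in-memory SQLite table (SQLite here only to keep the sketch runnable; the production store is Citus/PostgreSQL, where the same `ON CONFLICT DO NOTHING` upsert applies):

```python
import sqlite3

# Idempotent refresh insertion: the unique key on
# (account_id, transaction_id_from_institution) makes re-pulled pages no-ops.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE transactions (
        account_id TEXT NOT NULL,
        transaction_id_from_institution TEXT NOT NULL,
        amount REAL NOT NULL,
        UNIQUE (account_id, transaction_id_from_institution)
    )
""")

page = [("acc-1", "txn-100", -4.50), ("acc-1", "txn-101", -12.00)]
for _ in range(2):  # the same page arrives in two overlapping refreshes
    db.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?) "
        "ON CONFLICT (account_id, transaction_id_from_institution) DO NOTHING",
        page,
    )

count = db.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
assert count == 2  # duplicates from the second refresh were silently dropped
```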
API Design
- POST /v1/link/token/create — Create a Link token for initializing the account connection UI; body contains client_id, user_id, products (transactions/auth/identity), country_codes; returns link_token (expires in 30 minutes)
- POST /v1/item/public_token/exchange — Exchange a public_token (from the Link success callback) for a persistent access_token; returns access_token, item_id
- POST /v1/transactions/get — Fetch transactions for linked accounts; body contains access_token, start_date, end_date, options (count, offset); returns an array of normalized transactions
- POST /v1/accounts/balance/get — Fetch real-time balances; body contains access_token; triggers a synchronous refresh for a live balance if cached data is more than 1 hour old
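Server-side issuance of the short-lived link_token can be sketched as follows; the in-memory token store and function names are illustrative (production would use a shared store like Redis with a TTL):

```python
import secrets
import time

LINK_TOKEN_TTL_SECONDS = 30 * 60            # link tokens expire in 30 minutes
_tokens: dict[str, float] = {}              # token -> expiry timestamp

def create_link_token(client_id: str, user_id: str, products: list[str]) -> str:
    """Backs POST /v1/link/token/create: mint an opaque, expiring token."""
    token = f"link-{secrets.token_urlsafe(16)}"
    _tokens[token] = time.time() + LINK_TOKEN_TTL_SECONDS
    return token

def validate_link_token(token: str) -> bool:
    """Called when the Link UI initializes; rejects unknown or expired tokens."""
    expiry = _tokens.get(token)
    return expiry is not None and time.time() < expiry

tok = create_link_token("client-abc", "user-123", ["transactions"])
assert validate_link_token(tok)
```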
Scaling & Bottlenecks
The outbound connection volume (7K refreshes/sec to financial institutions) is the hardest scaling challenge. Each institution has different rate limits, session handling, and failure modes. The Connection Manager uses per-institution worker pools with dynamic sizing based on the institution's observed rate limit threshold. Screen-scraping connections are 100x more resource-intensive than API connections (headless browser per session, 30-second average session duration). With 30% of institutions requiring scraping, the platform needs ~2,000 concurrent headless browsers during peak refresh hours, consuming significant CPU and RAM. Browser pools are managed by a custom orchestrator that pre-warms browsers and recycles them between sessions.
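The pre-warm-and-recycle behavior of the browser orchestrator can be sketched with a simple pool; dicts stand in for Playwright Chromium handles, and the retirement threshold is an assumed illustrative value:

```python
from collections import deque

class BrowserPool:
    """Pre-warmed pool of headless browsers, recycled between scrape sessions
    so each refresh avoids a multi-second cold browser launch."""

    def __init__(self, size: int, max_sessions: int = 50):
        self.max_sessions = max_sessions  # retire a browser after this many sessions
        self.idle = deque({"id": i, "sessions": 0} for i in range(size))

    def acquire(self) -> dict:
        # On pool exhaustion a real orchestrator launches a fresh Chromium;
        # here a sentinel dict stands in for that cold launch.
        return self.idle.popleft() if self.idle else {"id": -1, "sessions": 0}

    def release(self, browser: dict) -> None:
        browser["sessions"] += 1
        if browser["sessions"] < self.max_sessions:
            self.idle.append(browser)  # recycle for the next scrape session
        # else: retire the worn browser; the orchestrator pre-warms a replacement
```

Recycling keeps the ~2,000 concurrent browsers busy instead of churning through launches, while periodic retirement bounds memory leaks in long-lived browser processes.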
The Transaction Enrichment Pipeline processes 6B transactions/month. The ML categorization model runs on CPU (ONNX-optimized BERT producing category predictions in 5ms per transaction). Batch processing via Spark handles historical enrichment, while a Flink streaming job categorizes new transactions in real-time as they are ingested. The merchant pattern database (500K patterns) is loaded into each worker's memory for fast regex matching.
Key Trade-offs
- Screen scraping as fallback over API-only: Scraping enables connectivity to 10,000+ institutions (most lack APIs), but is fragile (website redesigns break scrapers), resource-intensive, and legally grey — the industry is gradually shifting to standardized APIs (FDX) which will reduce scraping dependence
- Envelope encryption with HSM over application-level encryption: HSM-managed keys provide stronger security guarantees and are required for SOC 2 certification, but HSM operations are a throughput bottleneck — mitigated by caching unwrapped DEKs in memory with short TTL
- Periodic polling over real-time webhooks from institutions: Most institutions don't offer webhooks, so polling is necessary — the 4-hour refresh interval balances data freshness against institutional rate limits and infrastructure costs
- ML-based transaction categorization over rule-only: ML handles the long tail of merchant names that rules miss (20% of transactions), but requires ongoing model retraining and can miscategorize — confidence thresholds route low-confidence predictions to human review
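The DEK-caching mitigation mentioned in the envelope-encryption trade-off can be sketched as a short-TTL in-memory cache; the TTL value and names are illustrative:

```python
import time

class DekCache:
    """Short-TTL in-memory cache for unwrapped DEKs: trades a small
    exposure window for far fewer HSM unwrap calls during peak refresh."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._cache: dict[str, tuple[bytes, float]] = {}

    def get(self, credential_id: str, unwrap) -> bytes:
        entry = self._cache.get(credential_id)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]              # cache hit: no HSM round-trip
        dek = unwrap(credential_id)      # cache miss: one HSM call
        self._cache[credential_id] = (dek, now + self.ttl)
        return dek

hsm_calls = 0
def unwrap(credential_id: str) -> bytes:
    global hsm_calls
    hsm_calls += 1                       # stands in for an HSM unwrap operation
    return b"\x00" * 32

cache = DekCache(ttl_seconds=60)
for _ in range(5):
    cache.get("cred-1", unwrap)
assert hsm_calls == 1                    # four of five lookups skipped the HSM
```

At the stated ~7K decryptions/sec peak, even a 60-second TTL collapses repeated unwraps for hot credentials into a single HSM call per window.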