System Design: KYC (Know Your Customer) System

Requirements

Functional Requirements:

Identity verification: validate customer identity using government-issued ID (passport, driver's license), selfie biometric matching, and SSN/TIN verification
Document verification: authenticate identity documents using ML-based forgery detection
Sanctions and PEP (Politically Exposed Persons) screening against global watchlists
Risk scoring: assign customer risk levels (low, medium, high) based on identity, geography, and activity
Enhanced Due Diligence (EDD) workflow for high-risk customers requiring additional documentation
Ongoing monitoring: continuous screening against updated watchlists and transaction monitoring

Non-Functional Requirements:

Process 100,000 KYC verifications/day with 90% automated pass-through for low-risk customers
Identity verification completion within 60 seconds for automated cases
False rejection rate below 2% (minimizing customer friction)
Compliance with BSA/AML (US), 4AMLD/5AMLD/6AMLD (EU), FATF recommendations
Full audit trail retained for 5 years after customer relationship ends

Scale Estimation

100K verifications/day = 1.2 verifications/sec average, peaking at 5/sec during business hours. Each verification involves: document image upload and analysis (2 images minimum: ID front + selfie), OCR extraction from ID, biometric face matching (ID photo vs. selfie), SSN verification against government databases, and sanctions screening. Document processing: 200K images/day at 5MB average = 1TB/day of identity document storage (highly sensitive, encrypted at rest). Biometric face matching: 100K comparisons/day using face embedding models. Sanctions screening: each name checked against 500K+ watchlist entries. Ongoing monitoring: 10M existing customers rescreened monthly against updated watchlists = 333K screenings/day.
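The headline figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the 30-day month and decimal terabytes are assumptions made here for the calculation, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

verifications_per_day = 100_000
images_per_verification = 2        # ID front + selfie (the stated minimum)
avg_image_mb = 5
existing_customers = 10_000_000
rescreen_period_days = 30          # assumption: "monthly" taken as 30 days

# Average request rate over a full day.
avg_rate = verifications_per_day / SECONDS_PER_DAY

# Image volume and raw storage growth (decimal units: 1 TB = 1,000,000 MB).
images_per_day = verifications_per_day * images_per_verification
storage_tb_per_day = images_per_day * avg_image_mb / 1_000_000

# Re-screening load if the monthly cycle is spread evenly across days.
rescreens_per_day = existing_customers / rescreen_period_days

print(f"average rate: {avg_rate:.2f} verifications/sec")
print(f"images/day: {images_per_day:,}")
print(f"storage: {storage_tb_per_day:.1f} TB/day")
print(f"re-screenings/day: {rescreens_per_day:,.0f}")
```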

High-Level Architecture

The KYC system is architected as a workflow-driven platform using Temporal for orchestrating the multi-step verification process. The system is divided into the Verification Pipeline (processes new customer applications), the Screening Engine (sanctions and PEP checks), the Risk Engine (assigns risk scores), and the Monitoring Service (ongoing compliance).

The verification flow: Customer uploads identity documents via the client application's KYC SDK → images are uploaded to an encrypted S3 bucket → the Verification Orchestrator (Temporal workflow) initiates parallel processing: (1) Document Verification Service analyzes the ID for authenticity, (2) OCR Service extracts text fields (name, DOB, address, ID number), (3) Biometric Service compares the selfie against the ID photo. As soon as OCR completes, (4) Data Verification Service validates the extracted data against authoritative sources (SSN verification via SSA, address verification via USPS). Once all verifications complete, the Risk Engine computes a composite risk score on a 0-100 scale. Low-risk customers (0-30) are auto-approved. Medium-risk customers (31-70) receive enhanced automated checks, and high-risk customers (71-100) are routed to a manual review queue for EDD.
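The fan-out step can be sketched with plain asyncio. This is not Temporal SDK code: the service functions below are hypothetical stubs returning fixed values, and the only point is the shape of the concurrency. Data verification needs the OCR output, so in this sketch it is chained after OCR while the other checks run concurrently.

```python
import asyncio

# Hypothetical stubs standing in for the real services.
async def verify_document(id_image: bytes) -> float:
    await asyncio.sleep(0)          # placeholder for a remote call
    return 0.97                     # document authenticity score in [0, 1]

async def run_ocr(id_image: bytes) -> dict:
    await asyncio.sleep(0)
    return {"name": "JANE DOE", "dob": "1990-01-01"}

async def match_face(id_image: bytes, selfie: bytes) -> float:
    await asyncio.sleep(0)
    return 0.91                     # biometric similarity score

async def verify_data(fields: dict) -> bool:
    await asyncio.sleep(0)
    return bool(fields.get("name")) and bool(fields.get("dob"))

async def ocr_then_verify(id_image: bytes):
    # Data verification depends on OCR output, so these two run in sequence.
    fields = await run_ocr(id_image)
    return fields, await verify_data(fields)

async def run_verification(id_image: bytes, selfie: bytes) -> dict:
    # The three branches run concurrently; gather preserves argument order.
    authenticity, biometric, (fields, data_ok) = await asyncio.gather(
        verify_document(id_image),
        match_face(id_image, selfie),
        ocr_then_verify(id_image),
    )
    return {
        "authenticity_score": authenticity,
        "biometric_score": biometric,
        "ocr_fields": fields,
        "data_verified": data_ok,
    }

result = asyncio.run(run_verification(b"id-image", b"selfie-image"))
print(result)
```

In a workflow engine the same structure would typically be expressed as activities with retries and timeouts; the dependency between OCR and data verification stays the same.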

The Screening Engine runs both at onboarding and continuously thereafter. At onboarding, it screens the customer's name and known aliases against OFAC SDN, EU consolidated lists, UN sanctions, Interpol notices, and country-specific PEP databases. Ongoing monitoring re-screens existing customers against updated watchlists (list updates typically arrive weekly, and the full customer base is re-screened on a monthly cycle) and also monitors transaction patterns for suspicious activity.

Core Components

Document Verification Service

The Document Verification Service authenticates identity documents using a multi-model ML pipeline:

Stage 1: Document Classification. A CNN model identifies the document type (passport, driver's license, national ID) and issuing country from the image.
Stage 2: Forgery Detection. A specialized model checks for signs of tampering, including inconsistent fonts, misaligned security features, and digital manipulation artifacts (assessed via noise analysis and JPEG compression artifacts). It also validates holographic/UV features when they are captured under appropriate lighting (the progressive web app guides users to tilt the ID).
Stage 3: Data Extraction. An OCR pipeline optimized for identity documents extracts full name, date of birth, document number, expiration date, address (where present), and the Machine Readable Zone (MRZ) for passports. MRZ check digits are validated algorithmically.

The pipeline achieves 98% accuracy on document authentication with a 0.5% false acceptance rate.
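The MRZ check-digit validation mentioned above follows the scheme defined in ICAO Doc 9303: each character is mapped to a number (digits as themselves, A-Z as 10-35, the filler `<` as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. A minimal sketch, with function names chosen here for illustration:

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")

def mrz_check_digit(field: str) -> int:
    """Compute the check digit for an MRZ field (weights 7, 3, 1 repeating)."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10

def mrz_field_is_valid(field: str, check_digit: str) -> bool:
    """Return True if the printed check digit matches the computed one."""
    if len(check_digit) != 1 or not "0" <= check_digit <= "9":
        return False
    return mrz_check_digit(field) == int(check_digit)

# Values from the ICAO 9303 specimen passport.
print(mrz_check_digit("L898902C3"))   # document number
print(mrz_check_digit("740812"))      # date of birth (YYMMDD)
print(mrz_check_digit("120415"))      # expiry date (YYMMDD)
```

A mismatch between the printed and computed digit usually means an OCR misread or a tampered field, so it is a cheap first filter before the ML stages weigh in.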

Biometric Matching Service

The Biometric Service performs facial comparison between the ID photo and the live selfie. The process: (1) face detection using MTCNN to locate faces in both images, (2) liveness detection to prevent spoofing (analyzing micro-movements in a short video capture, checking for 3D depth consistency, detecting screen reflections or printed photo edges), (3) face encoding using a FaceNet model that produces a 128-dimensional embedding vector for each face, (4) similarity comparison using cosine similarity between embeddings, where a score above the 0.85 threshold confirms a match. The service handles challenging cases: aging (ID photo may be 10 years old), glasses, facial hair changes, and different lighting conditions. For edge cases near the threshold (0.75-0.85), the system requests a second selfie with specific instructions (remove glasses, improve lighting).
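The threshold logic in step (4) can be sketched as follows. The 0.85 and 0.75 thresholds come from the text; treating scores below 0.75 as a non-match is an assumption of this sketch, as are the function and label names. Embeddings are plain sequences of floats here so that the example has no dependencies.

```python
import math
from typing import Sequence

MATCH_THRESHOLD = 0.85   # above this, the faces are treated as a match
RETRY_THRESHOLD = 0.75   # between the two thresholds, ask for a second selfie

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def match_decision(id_embedding: Sequence[float], selfie_embedding: Sequence[float]) -> str:
    score = cosine_similarity(id_embedding, selfie_embedding)
    if score > MATCH_THRESHOLD:
        return "MATCH"
    if score >= RETRY_THRESHOLD:
        return "RETRY_SELFIE"
    return "NO_MATCH"   # assumption: the source does not name this outcome

# Toy 2-dimensional vectors; real FaceNet embeddings have 128 dimensions.
print(match_decision([1.0, 0.0], [1.0, 0.0]))
print(match_decision([1.0, 0.0], [0.8, 0.6]))
print(match_decision([1.0, 0.0], [0.0, 1.0]))
```

Note that cosine distance is `1 - similarity`, so a rule phrased in terms of distance would use "below 0.15" for the same cut-off.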

Risk Scoring Engine

The Risk Engine assigns a composite risk score (0-100) based on multiple risk dimensions: (1) Identity Risk (verification confidence score, document authenticity score, biometric match score), (2) Geographic Risk (customer's country rated by FATF grey/black list status, Transparency International CPI score, and jurisdiction-specific risk ratings), (3) Product Risk (account type and expected transaction volume — high-value investment accounts score higher than basic checking), (4) Behavioral Risk (for existing customers: transaction patterns, unusual activity flags). Each dimension produces a sub-score weighted by configurable factors. The composite score maps to risk tiers: Low (0-30, auto-approve), Medium (31-70, enhanced automated checks), High (71-100, manual EDD required). Risk scores are recalculated monthly for existing customers based on updated transaction behavior.
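A weighted composite of this kind might look like the sketch below. The weights are hypothetical placeholders (the text only says they are configurable), each sub-score is assumed to already be on a 0-100 scale where higher means riskier, and all four dimensions are assumed to be present.

```python
from typing import Dict

# Hypothetical weights; in practice these would come from configuration.
WEIGHTS: Dict[str, float] = {
    "identity": 0.40,
    "geographic": 0.25,
    "product": 0.15,
    "behavioral": 0.20,
}

def composite_score(sub_scores: Dict[str, float]) -> float:
    """Weighted average of the sub-scores, rounded to two decimals."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)
    return round(weighted / total_weight, 2)

def risk_tier(score: float) -> str:
    """Map a composite score to the tiers described above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 30:
        return "LOW"       # auto-approve
    if score <= 70:
        return "MEDIUM"    # enhanced automated checks
    return "HIGH"          # manual EDD required

example = {"identity": 10, "geographic": 80, "product": 40, "behavioral": 20}
score = composite_score(example)
print(score, risk_tier(score))
```

Dividing by the sum of the weights keeps the result on the 0-100 scale even if the configured weights do not add up to exactly 1.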

Database Design

The KYC database is PostgreSQL with field-level encryption for PII columns. Core tables:

customers: customer_id, legal_name_encrypted, dob_encrypted, ssn_hash, nationality, risk_score, risk_tier (LOW/MEDIUM/HIGH), kyc_status (PENDING/APPROVED/REJECTED/UNDER_REVIEW), onboarded_at, last_reviewed_at
verifications: verification_id, customer_id, type (DOCUMENT/BIOMETRIC/SSN/ADDRESS), status (PASSED/FAILED/MANUAL_REVIEW), confidence_score, details (JSONB), verified_at
documents: document_id, customer_id, document_type (PASSPORT/DL/NATIONAL_ID), s3_key_encrypted, ocr_data_encrypted (JSONB), authenticity_score, uploaded_at, expires_at
screening_results: screening_id, customer_id, list_type (OFAC/EU/UN/PEP), match_status (NO_MATCH/POTENTIAL_MATCH/CONFIRMED_MATCH), matched_entity, match_score, reviewed_by, screened_at

Document images are stored in an encrypted S3 bucket with a separate KMS key from the main application. Access to the bucket requires both IAM authentication and a signed token from the KYC service. Images are retained for 5 years after customer relationship ends (regulatory requirement) and automatically deleted thereafter via S3 lifecycle policies. A separate audit_log table records every access to customer PII: who accessed what, when, and why (purpose_code).

API Design

POST /v1/verifications — Initiate KYC verification; body contains customer_id, document_images (front, back), selfie_image, consent_token; returns verification_id, status (PROCESSING)
GET /v1/verifications/{verification_id} — Check verification status and results; returns status, risk_score, risk_tier, individual check results (document, biometric, sanctions), required_actions if any
POST /v1/screening/batch — Submit batch screening request for ongoing monitoring; body contains customer_ids[]; returns job_id for async result polling
GET /v1/customers/{customer_id}/risk-profile — Comprehensive risk profile including current risk score, tier, all historical verifications, screening results, and EDD status
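From a client's point of view, the first two endpoints form a submit-then-poll pattern. The sketch below shows only that pattern: the HTTP transport is injected as a callable so that no particular client library is assumed, image values are treated as opaque references, and any status other than PROCESSING is assumed to be final. The 60-second default timeout mirrors the stated latency target; the helper names are hypothetical.

```python
import time
from typing import Callable, Dict

def build_verification_request(customer_id: str, id_front: str, id_back: str,
                               selfie: str, consent_token: str) -> Dict:
    """Body for POST /v1/verifications, using the field names listed above."""
    return {
        "customer_id": customer_id,
        "document_images": {"front": id_front, "back": id_back},
        "selfie_image": selfie,
        "consent_token": consent_token,
    }

def wait_for_result(verification_id: str,
                    fetch_status: Callable[[str], Dict],
                    timeout_s: float = 60.0,
                    interval_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Dict:
    """Poll GET /v1/verifications/{verification_id} until it leaves PROCESSING."""
    deadline = clock() + timeout_s
    while True:
        result = fetch_status(verification_id)
        if result["status"] != "PROCESSING":
            return result
        if clock() >= deadline:
            raise TimeoutError(f"verification {verification_id} still processing")
        sleep(interval_s)

# Usage with a fake transport that finishes on the third poll.
_fake_responses = iter([
    {"status": "PROCESSING"},
    {"status": "PROCESSING"},
    {"status": "PASSED", "risk_tier": "LOW"},
])
final = wait_for_result("ver-123", lambda _id: next(_fake_responses),
                        sleep=lambda _seconds: None)
print(final)
```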

Scaling & Bottlenecks

Document verification ML models are the primary compute bottleneck. The multi-model pipeline (classification + forgery detection + OCR) takes 8-12 seconds per document on GPU. With 200K images/day, the pipeline requires 20 T4 GPUs. During customer onboarding surges (fintech launch campaigns driving 10x normal volume), auto-scaling GPU nodes take 5-10 minutes to provision — a pre-warmed pool of 5 standby GPUs absorbs initial spikes. The biometric matching service is less GPU-intensive (FaceNet inference takes 100ms per pair) but liveness detection video processing adds 3-5 seconds.
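A back-of-the-envelope check of the GPU count, assuming one image (or document) is processed at a time per GPU with no batching and load spread evenly over 24 hours. Both readings of the workload are shown, since the text quotes latency per document but volume in images; the function name and the `utilization` parameter are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def gpus_needed(items_per_day: int, seconds_per_item: float,
                utilization: float = 1.0) -> int:
    """GPUs required to absorb the daily GPU-seconds at the given utilization."""
    gpu_seconds = items_per_day * seconds_per_item
    return math.ceil(gpu_seconds / (SECONDS_PER_DAY * utilization))

for label, items in (("200K images/day", 200_000), ("100K ID documents/day", 100_000)):
    low = gpus_needed(items, 8)
    high = gpus_needed(items, 12)
    print(f"{label}: {low}-{high} GPUs at 8-12 s each")
```

Under these assumptions, the 20-GPU figure sits between the per-document reading and the per-image reading. Business-hours peaks and realistic utilization below 100% push the requirement up, while request batching pushes it down.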

Ongoing monitoring (re-screening 10M customers monthly against updated watchlists) is a batch-intensive workload. Naive approach: 10M customers × 500K watchlist entries = 5 trillion comparisons. Optimized approach: pre-compute phonetic hashes (Double Metaphone) and trigram tokens for all watchlist entries → build an inverted index → for each customer, generate the same tokens and look up potential matches in the index, achieving sub-millisecond index lookups per customer. The monthly batch completes in 8 hours across 50 workers, which works out to roughly 144 ms of worker time per customer; that end-to-end budget is far larger than the lookup itself and also has to cover record loading, candidate scoring, and result writes.
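The inverted-index idea can be illustrated with trigram tokens alone. This sketch omits the Double Metaphone phonetic hashes (they would need a third-party library) and ranks candidates by Jaccard similarity over trigram sets, which is a choice made here for illustration, as are the class name, entry IDs, and the 0.5 cut-off.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def trigrams(name: str) -> Set[str]:
    """Lowercase, collapse whitespace, pad, and split into 3-character tokens."""
    normalized = " ".join(name.lower().split())
    padded = f"  {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

class WatchlistIndex:
    def __init__(self, entries: Dict[str, str]) -> None:
        # entries maps a watchlist entry ID to the listed name.
        self.tokens: Dict[str, Set[str]] = {eid: trigrams(n) for eid, n in entries.items()}
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        for eid, toks in self.tokens.items():
            for tok in toks:
                self.postings[tok].add(eid)

    def candidates(self, name: str, min_similarity: float = 0.5) -> List[Tuple[str, float]]:
        """Return (entry_id, similarity) pairs, best match first."""
        query = trigrams(name)
        shared: Dict[str, int] = defaultdict(int)
        # Only entries sharing at least one trigram are ever touched.
        for tok in query:
            for eid in self.postings.get(tok, ()):
                shared[eid] += 1
        results = []
        for eid, overlap in shared.items():
            similarity = overlap / len(query | self.tokens[eid])   # Jaccard
            if similarity >= min_similarity:
                results.append((eid, similarity))
        return sorted(results, key=lambda pair: pair[1], reverse=True)

index = WatchlistIndex({"E1": "Ivan Petrov", "E2": "Maria Gonzalez"})
print(index.candidates("Ivan Petrof"))   # near-miss spelling of E1
print(index.candidates("John Smith"))    # unrelated name
```

Because each lookup touches only posting lists for the query's own tokens, the cost per customer depends on name length and posting-list sizes rather than on the full 500K-entry list.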

Key Trade-offs

Multi-model ML pipeline over single end-to-end model: Separate models for classification, forgery detection, and OCR enable independent improvement and debugging of each stage, but increase total latency — parallel execution where possible (OCR and forgery detection run simultaneously) mitigates this
Liveness detection via video over single photo: Video-based liveness with micro-movement analysis provides stronger anti-spoofing than single-photo analysis, but increases user friction (users must record a 3-second video) — reducing the video requirement to a simple head turn balances security and UX
Pre-computed screening index over real-time fuzzy matching: Building an inverted index for watchlist screening enables sub-millisecond per-customer lookups, but requires index rebuilds on every watchlist update (weekly) — incremental index updates handle daily additions efficiently
Auto-approval for low-risk over all-manual review: Automating 90% of verifications reduces onboarding time from days to minutes, but risks approving sophisticated identity fraud — a random audit of 1% of auto-approved cases provides ongoing quality assurance
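One way to pick the 1% audit sample is sketched below. It swaps a true random draw for a deterministic hash of the verification ID, so the selection is reproducible and can be re-derived later for the audit trail; that substitution, the function name, and the basis-point constant are assumptions of this sketch rather than something the design specifies.

```python
import hashlib

AUDIT_RATE_BASIS_POINTS = 100   # 100 / 10,000 = 1%

def selected_for_audit(verification_id: str) -> bool:
    """Deterministically select about 1% of auto-approved verifications."""
    digest = hashlib.sha256(verification_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < AUDIT_RATE_BASIS_POINTS

sample = [f"ver-{i}" for i in range(100_000)]
audited = sum(selected_for_audit(v) for v in sample)
print(f"selected {audited} of {len(sample)} for audit")
```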