System Design: Credit Scoring System
Design a credit scoring system using alternative data, ML models, and traditional credit bureau data to assess creditworthiness with explainable decisions and fair lending compliance.
Requirements
Functional Requirements:
- Generate credit scores by ingesting traditional credit bureau data and alternative data sources (bank transactions, rent payments, utility bills)
- Provide score explanations with top contributing factors for adverse action notices
- Support batch scoring (portfolio re-scoring) and real-time single-applicant scoring
- Model management: versioning, A/B testing, champion-challenger deployment
- Fair lending monitoring and bias detection across protected classes
- Score calibration ensuring predicted probabilities match actual default rates
Non-Functional Requirements:
- Real-time scoring latency under 200ms for single-applicant requests
- Batch scoring: 10M accounts in under 4 hours
- Model explainability: top 4 reason codes per score (FCRA requirement)
- 99.9% availability for the scoring API
- Auditability: full reproducibility of any historical score given the same inputs
Scale Estimation
Real-time scoring: 1,000 requests/sec during peak origination hours. Each scoring request requires assembling a feature vector from 200+ raw attributes pulled from 3-5 data sources. Batch re-scoring: 10M accounts × 200 features each = 2B feature values to compute and 10M model inferences. Feature computation involves querying 90 days of transaction history (average 150 transactions/account) = 1.5B transaction rows to process during batch. Model size: gradient-boosted ensemble with 500 trees = ~50MB serialized. Feature store: 10M accounts × 200 features × 8 bytes = 16GB of pre-computed features.
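The back-of-envelope figures above can be reproduced with simple arithmetic. A quick sketch (the inputs are the estimates from this section, not measurements):

```python
# Back-of-envelope scale estimates for the credit scoring system.

accounts = 10_000_000          # batch re-scoring population
features_per_account = 200
bytes_per_feature = 8

# Batch feature volume: 10M accounts x 200 features = 2B feature values.
feature_values = accounts * features_per_account

# Transaction rows scanned: 150 transactions/account over 90 days.
transaction_rows = accounts * 150

# Online feature store footprint: raw feature bytes only, before Redis overhead.
feature_store_bytes = accounts * features_per_account * bytes_per_feature
print(feature_store_bytes / 1e9)  # -> 16.0 (GB)

# Sustained throughput needed to finish 10M inferences in the 4-hour window.
required_inferences_per_sec = accounts / (4 * 3600)
print(round(required_inferences_per_sec))  # -> 694
```

Note the batch window also implies a floor on aggregate inference throughput (~700/sec across the cluster), which the per-executor numbers in the Scaling section comfortably exceed.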
High-Level Architecture
The credit scoring system is organized into three layers: Data Ingestion, Feature Engineering, and Model Serving. Data Ingestion pulls raw data from credit bureaus (batch SFTP files and real-time API), banking transaction data (via Plaid or direct bank feeds), and alternative data providers (rent/utility payment services). Raw data is validated, cleansed, and stored in a Data Lake (Delta Lake on S3).
The Feature Engineering layer transforms raw data into model-ready features. A Feature Store (Feast) maintains both batch-computed features (aggregate statistics over 90-day windows: average balance, payment consistency ratio, utilization trend) and real-time features (days since last payment, current balance). Batch features are computed daily by Spark jobs writing to an offline store (Delta Lake) and loading into an online store (Redis) for real-time serving.
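To make the batch-feature idea concrete, here is a minimal pure-Python stand-in for one of the aggregates named above, the payment consistency ratio over a 90-day window. The function name and input shape are illustrative; in production this would be a Spark aggregation over the Delta Lake transaction tables.

```python
from datetime import date, timedelta

def payment_consistency_ratio(payments, as_of, window_days=90):
    """Fraction of scheduled payments in the trailing window that were made
    on time. `payments` is a list of (due_date, paid_on_time) tuples.
    A pure-Python stand-in for the Spark window aggregation."""
    cutoff = as_of - timedelta(days=window_days)
    in_window = [on_time for due, on_time in payments if cutoff <= due <= as_of]
    if not in_window:
        return None  # no scheduled payments: feature is undefined, not zero
    return sum(in_window) / len(in_window)

payments = [
    (date(2024, 3, 1), True),
    (date(2024, 4, 1), True),
    (date(2024, 5, 1), False),
    (date(2023, 12, 1), False),  # outside the 90-day window, ignored
]
print(payment_consistency_ratio(payments, as_of=date(2024, 5, 15)))  # -> 0.666...
```

Returning `None` rather than zero for accounts with no scheduled payments matters: "no data" and "missed every payment" must reach the model as different signals.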
The Model Serving layer hosts trained models behind a scoring API. The API receives a scoring request, assembles the feature vector from the Feature Store, runs model inference, generates reason codes using SHAP (SHapley Additive exPlanations) values, and returns the score with explanations. Models are deployed in a champion-challenger setup: the champion model scores all production traffic, while challenger models score in shadow mode (logged but not returned) for comparison. A Model Registry (MLflow) tracks model versions, training data lineage, and performance metrics.
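The champion-challenger dispatch can be sketched as follows. This is a toy illustration of the control flow only (the lambdas stand in for real XGBoost inference, and the function names are hypothetical); the key property is that challengers are scored and logged but can never affect the production response.

```python
import logging

log = logging.getLogger("shadow_scores")

def score_request(features, champion, challengers):
    """Champion-challenger dispatch: the champion's score is returned to the
    caller; each challenger scores the same feature vector in shadow mode,
    and its output is only logged for offline comparison."""
    result = champion(features)
    for name, model in challengers.items():
        try:
            shadow = model(features)
            log.info("shadow model=%s score=%s", name, shadow)
        except Exception:
            # A challenger failure must never affect the production response.
            log.exception("shadow model %s failed", name)
    return result

# Toy models standing in for real ensemble inference (output = probability of default).
champion = lambda f: 0.12
challengers = {"v2-candidate": lambda f: 0.10}
print(score_request({"utilization": 0.4}, champion, challengers))  # -> 0.12
```

In practice the shadow scores are logged with the same feature snapshot as the champion score, so offline comparison uses identical inputs for both models.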
Core Components
Feature Engineering Pipeline
The Feature Engineering Pipeline computes 200+ features organized into categories: credit bureau features (utilization ratio, payment history score, age of oldest account, inquiry count), banking transaction features (average monthly income, income stability coefficient, discretionary spending ratio, overdraft frequency), alternative data features (rent payment consistency, utility payment timeliness, phone bill regularity). Each feature has a defined computation logic, data source dependency, and freshness SLA. The pipeline uses Spark for batch computation and Flink for real-time features that update on every new transaction. Feature drift monitoring (using Population Stability Index) detects when feature distributions shift from training data, triggering model revalidation alerts.
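The Population Stability Index mentioned above is straightforward to compute: bin the feature identically at training time and in production, then sum `(a - e) * ln(a / e)` over bucket proportions. A minimal sketch (the 0.1 / 0.25 alert thresholds are common industry rules of thumb, not mandated by the design):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between a training-time (expected) and
    production (actual) distribution of one feature, pre-binned into the
    same buckets. PSI = sum((a - e) * ln(a / e)) over bucket proportions."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)   # floor proportions to avoid log(0)
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total

train = [100, 300, 400, 200]   # feature histogram at training time
prod  = [120, 280, 380, 220]   # same buckets observed in production
drift = psi(train, prod)
print(round(drift, 4))  # -> 0.008
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```

The `eps` floor handles buckets that are empty in one distribution but not the other, which is common for rare categorical values.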
Model Serving & Explainability
The scoring model is a gradient-boosted ensemble (XGBoost) trained on labeled data (loans that defaulted vs. loans that performed over 24 months). The model outputs a probability of default (PD) which is mapped to a credit score scale (300-850) via a calibration function fitted to match the empirical default rate at each score band. Explainability uses TreeSHAP to compute the contribution of each feature to the prediction, producing top-4 reason codes from a library of 100+ standardized reasons (e.g., "High credit utilization," "Limited credit history," "Recent payment delinquency"). SHAP values are computed in real-time (~50ms) using the optimized TreeSHAP algorithm that exploits the tree structure. Models are served via ONNX Runtime for fast CPU inference.
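The two post-inference steps, mapping PD onto the 300-850 scale and selecting top-4 reason codes, can be sketched as below. The calibration table and SHAP values here are made-up illustrative numbers; in production the table is fitted to empirical default rates per band and the contributions come from TreeSHAP on the trained ensemble.

```python
import bisect

# Hypothetical calibration table: (max PD for band, score) pairs, fitted so
# predicted PD matches the empirical default rate at each score band.
CALIBRATION = [(0.01, 820), (0.03, 760), (0.07, 700), (0.15, 640),
               (0.30, 580), (1.00, 450)]

def pd_to_score(pd):
    """Map a probability of default onto the 300-850 score scale."""
    thresholds = [t for t, _ in CALIBRATION]
    return CALIBRATION[bisect.bisect_left(thresholds, pd)][1]

def top_reason_codes(shap_values, reason_library, k=4):
    """Pick the k features pushing PD upward the most (positive SHAP
    contribution toward default) and map them to standardized reasons."""
    adverse = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
    return [reason_library[name] for name, v in adverse[:k] if v > 0]

shap_values = {
    "utilization": 0.9,        # toy contributions; in production these come
    "inquiries_6mo": 0.4,      # from TreeSHAP on the XGBoost ensemble
    "history_length": 0.2,
    "income_stability": -0.3,  # negative = reduces predicted default risk
    "rent_consistency": 0.1,
}
reasons = {"utilization": "High credit utilization",
           "inquiries_6mo": "Too many recent credit inquiries",
           "history_length": "Limited credit history",
           "income_stability": "Unstable income pattern",
           "rent_consistency": "Irregular rent payments"}

print(pd_to_score(0.05))                        # -> 700
print(top_reason_codes(shap_values, reasons))
```

Filtering on `v > 0` matters for compliance: reason codes on an adverse action notice must explain what hurt the score, so features that helped the applicant are excluded even if their absolute contribution is large.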
Fair Lending Monitor
The Fair Lending Monitor continuously evaluates scoring outcomes for disparate impact across protected classes (race, gender, age, national origin). Since protected class data is not used in scoring (prohibited by ECOA), the monitor uses BISG (Bayesian Improved Surname Geocoding) to probabilistically estimate demographics from name and address. Statistical tests include: approval rate ratio (must exceed 80% threshold per the four-fifths rule), marginal effect analysis (logistic regression measuring the independent effect of each protected class on approval), and standardized mean difference in scores between groups. Violations trigger alerts to the compliance team with drill-down showing which features drive the disparity. The monitor runs daily on production scoring data and on every candidate model before deployment.
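The four-fifths check described above reduces to a ratio test: each group's approval rate divided by the highest group's rate must be at least 0.80. A minimal sketch with toy counts (the group labels stand in for BISG-estimated demographics):

```python
def four_fifths_check(approvals, threshold=0.80):
    """Adverse-impact ratio per the four-fifths rule: each group's approval
    rate divided by the highest group's rate must be >= threshold.
    `approvals` maps group -> (approved, total); groups here are the
    probabilistic BISG-estimated demographics, not self-reported data."""
    rates = {g: a / t for g, (a, t) in approvals.items()}
    benchmark = max(rates.values())
    return {g: (r / benchmark, r / benchmark >= threshold)
            for g, r in rates.items()}

approvals = {"group_a": (620, 1000), "group_b": (470, 1000)}  # toy counts
for group, (ratio, ok) in four_fifths_check(approvals).items():
    print(group, round(ratio, 3), "PASS" if ok else "ALERT")
# group_a 1.0 PASS / group_b 0.758 ALERT
```

An ALERT here is a screening signal, not a finding: because the group assignments are probabilistic BISG estimates, flagged disparities feed the marginal-effect analysis and compliance drill-down rather than triggering automatic action.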
Database Design
The Feature Store online layer uses Redis Cluster keyed by applicant_id with a hash containing all 200+ pre-computed features. Each feature value includes a timestamp for staleness detection — features older than their freshness SLA trigger a re-computation request. The offline Feature Store uses Delta Lake tables partitioned by computation_date, enabling point-in-time feature retrieval for model training (preventing data leakage by only using features available at the time of the historical decision).
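The staleness check on the online store can be sketched as below. Field layout and SLA values are illustrative (the text specifies a 48-hour default with tighter limits for payment-related features); in production the hash would come from a Redis `HGETALL` on the applicant's key.

```python
import time

# Per-feature freshness SLAs in seconds (illustrative values; payment
# features get a tighter SLA than the 48-hour default).
FRESHNESS_SLA = {"days_since_last_payment": 24 * 3600, "default": 48 * 3600}

def stale_features(feature_hash, now=None):
    """Given the online-store hash for one applicant -- each field stored as
    (value, computed_at_epoch) -- return the features whose age exceeds
    their freshness SLA and therefore need re-computation."""
    now = now or time.time()
    stale = []
    for name, (value, computed_at) in feature_hash.items():
        sla = FRESHNESS_SLA.get(name, FRESHNESS_SLA["default"])
        if now - computed_at > sla:
            stale.append(name)
    return stale

now = 1_700_000_000
features = {
    "avg_balance_90d": (5400.0, now - 10 * 3600),      # fresh (48h SLA)
    "days_since_last_payment": (3, now - 30 * 3600),   # exceeds 24h SLA
}
print(stale_features(features, now=now))  # -> ['days_since_last_payment']
```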
Scoring history is stored in PostgreSQL: scoring_records (record_id, applicant_id, score, pd, model_version, feature_snapshot JSONB, reason_codes JSONB, request_source, scored_at). The feature_snapshot JSONB column stores the exact feature vector used for scoring, enabling full reproducibility of historical scores. This table is partitioned by scored_at month with a 7-year retention policy. Model Registry tables in PostgreSQL track model_versions (model_id, version, algorithm, training_data_snapshot, performance_metrics JSONB, deployed_at, status CHAMPION/CHALLENGER/RETIRED).
API Design
- POST /v1/scores — Score a single applicant; body contains applicant_id or raw applicant data (name, SSN, address for bureau pull); returns score (300-850), probability_of_default, reason_codes (top 4), model_version, confidence_interval
- POST /v1/scores/batch — Submit batch scoring job; body contains list of applicant_ids or a file reference; returns job_id for async status polling
- GET /v1/scores/{record_id}/explain — Detailed score explanation with full SHAP waterfall showing every feature's contribution
- GET /v1/models/{model_id}/performance — Model performance metrics: KS statistic, Gini coefficient, PSI, approval rate by score band
Scaling & Bottlenecks
Real-time scoring at 1,000 RPS is bottlenecked by Feature Store reads. Assembling a 200-feature vector requires a single Redis HGETALL call (~1ms) for pre-computed features, but if any feature is stale, a synchronous re-computation delays the response. This is mitigated by proactive batch refresh (nightly Spark job ensures all features are fresh for the next business day) and a staleness tolerance policy (features up to 48 hours old are acceptable for all but payment-related features).
Batch re-scoring of 10M accounts requires careful resource management. The Spark job reads raw transaction data (1.5B rows), computes features, and runs model inference — all within a 4-hour window. Feature computation is parallelized by account_id range across 200 Spark executors. Model inference uses batch prediction with XGBoost's multi-threaded predict mode (10K predictions/sec per executor). The limiting factor is reading transaction history from Delta Lake — partitioning transaction data by account_id_hash ensures even data distribution across executors.
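The hash-partitioning point deserves a concrete sketch: hashing account ids before bucketing spreads load evenly even when ids were allocated sequentially (range-partitioning sequential ids would put all recently opened accounts, which tend to have more activity, on the same executors). Function names are illustrative.

```python
import hashlib

def partition_for(account_id, num_partitions=200):
    """Assign an account to one of num_partitions executor partitions by
    hashing the id, so transaction history is spread evenly regardless of
    how account ids were allocated."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Roughly even spread across 200 partitions, even for sequential ids:
counts = [0] * 200
for i in range(100_000):
    counts[partition_for(f"acct-{i}")] += 1
print(min(counts), max(counts))  # each partition holds close to 500 accounts
```

The same property makes the assignment deterministic across job runs, which helps when re-running a failed partition during the 4-hour batch window.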
Key Trade-offs
- XGBoost ensemble over neural networks: Gradient-boosted trees provide inherent feature importance and work natively with TreeSHAP for explainability — neural networks achieve marginally better AUC but are black boxes requiring post-hoc approximation methods for explanations
- Point-in-time feature store over latest features for training: Prevents temporal data leakage and ensures model training reflects the information available at decision time, but adds significant complexity to the feature engineering pipeline
- BISG proxy for demographics over self-reported data: Fair lending monitoring requires demographic analysis, but lenders often lack self-reported race/ethnicity data — BISG provides probabilistic estimates but introduces noise, especially for mixed-race individuals
- Champion-challenger deployment over direct replacement: Shadow-scoring with challengers provides confidence in model performance before production exposure, but delays the deployment of improved models by weeks