System Design: ML Platform (Uber Michelangelo-style)
Design a full-stack machine learning platform that standardizes model development, training, deployment, and monitoring across an organization. Covers feature management, distributed training, model serving, and the infrastructure patterns used in Uber's Michelangelo.
Requirements
Functional Requirements:
- Provide a unified platform for feature engineering, model training, evaluation, and deployment
- Support multiple ML frameworks: TensorFlow, PyTorch, XGBoost, Scikit-learn
- Enable feature sharing across teams: define once, use everywhere, with online and offline access
- Automate model training pipelines triggered by new data, scheduled runs, or manual triggers
- Support A/B deployment: route a percentage of traffic to a new model version for gradual rollout
- Provide automated model monitoring: detect performance degradation and data drift
Non-Functional Requirements:
- Model serving latency under 50ms at the 99th percentile for online inference
- Support 100,000 inference requests per second across all deployed models
- Feature retrieval latency under 10ms for online feature lookup
- Model training jobs scale to 1,000 GPUs for large neural network training
- 99.99% availability for the model serving layer
Scale Estimation
A large tech company runs 1,000 ML models in production. Average inference throughput is 100 QPS per model, or 100,000 QPS in total. At the 50ms latency SLA, Little's law (concurrency = throughput × latency) puts the serving layer at roughly 5,000 concurrent in-flight requests. The feature store serves 1 million online feature lookups per second. The training cluster keeps about 500 GPU machines active during peak training hours and consumes 5 PB of training data per month.
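As a sanity check, the concurrency figure falls straight out of Little's law; a few lines of Python make the arithmetic explicit:

```python
# Back-of-envelope check of the serving concurrency estimate.
# Little's law: concurrent requests = arrival rate x time in system.
total_qps = 1_000 * 100        # 1,000 models x 100 QPS each
p99_latency_s = 0.050          # 50 ms SLA
in_flight = total_qps * p99_latency_s
print(f"concurrent in-flight requests: {in_flight:,.0f}")  # 5,000
```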
High-Level Architecture
The platform is organized into five subsystems: Feature Platform, Training Platform, Model Registry, Serving Platform, and Monitoring Platform. These communicate through shared storage (feature store, model registry artifacts) and Kafka event streams (training job events, prediction logs, monitoring alerts). A unified SDK wraps all platform APIs, giving data scientists a consistent interface across Python notebooks, training scripts, and serving code.
The Feature Platform maintains feature pipelines (batch and streaming) that compute and store feature values in both an offline store (data lake for training) and an online store (low-latency key-value for serving). Feature definitions are versioned in a catalog; the same feature definition generates both the batch Spark job (for historical training data) and the streaming Flink job (for real-time online values), ensuring training-serving skew is minimized.
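To make "define once, use everywhere" concrete, here is a minimal sketch of what a dual-materialization feature definition might look like. The FeatureGroup class and its field names are hypothetical, not Michelangelo's actual SDK; Feast exposes a similar concept under different APIs.

```python
from dataclasses import dataclass

@dataclass
class FeatureGroup:
    """Hypothetical SDK object: one definition drives both materializations."""
    name: str
    entity: str                # join key, e.g. "driver_id"
    features: list[str]        # declared feature columns
    source_query: str          # single transformation definition...
    online: bool = True        # ...materialized to the online KV store (Flink)
    offline: bool = True       # ...and to the offline lake (Spark)

driver_stats = FeatureGroup(
    name="driver_hourly_stats",
    entity="driver_id",
    features=["trips_last_1h", "avg_rating_7d", "cancel_rate_30d"],
    source_query="SELECT driver_id, ... FROM trips GROUP BY driver_id",
)
```

Because both the Spark backfill and the Flink stream derive from the same source_query, the transformation logic cannot drift between training and serving.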
The Serving Platform runs model inference as microservices on Kubernetes. Each deployed model runs in a TorchServe, TFServing, or custom gRPC container with health checks, auto-scaling, and circuit breakers. A model gateway handles routing: it loads the deployment policy (A/B weights, canary percentage, fallback model), fetches required online features from the feature store, assembles the feature vector, calls the model service, and returns the prediction with latency and model version metadata.
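A simplified sketch of the gateway's hot path follows; the registry, feature store, and model client interfaces are assumed, not part of any specific serving framework:

```python
import random
import time

def predict(model_name, entity_id, registry, feature_store, model_clients):
    """Gateway hot path: pick a version per the traffic split, assemble the
    feature vector from the online store, call the model, attach metadata."""
    split = registry.traffic_split(model_name)      # e.g. {"v42": 0.95, "v43": 0.05}
    version = random.choices(list(split), weights=list(split.values()))[0]
    start = time.monotonic()
    features = feature_store.get_online(model_name, entity_id)  # <10 ms budget
    score = model_clients[version].infer(features)  # gRPC call to the model container
    return {
        "prediction": score,
        "model_version": version,
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
    }
```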
Core Components
Feature Store
The feature store has two layers: the offline store (S3 + Parquet, organized as a time-series table per feature group with entity_id, event_timestamp, and feature columns) and the online store (Redis or DynamoDB, key = entity_id, value = latest feature vector as serialized Avro). Feature pipelines write to both stores simultaneously. Point-in-time correct joins for training data query the offline store using the event timestamp to reconstruct the feature vector as it existed at prediction time, preventing data leakage.
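The point-in-time join can be illustrated with pandas merge_asof, which for each label row selects the latest feature row at or before the label's timestamp; this is a small stand-in for the Spark job that performs the same join at scale:

```python
import pandas as pd

labels = pd.DataFrame({
    "entity_id": ["d1", "d1"],
    "event_timestamp": pd.to_datetime(["2024-01-02 10:00", "2024-01-05 10:00"]),
    "label": [1, 0],
}).sort_values("event_timestamp")

features = pd.DataFrame({
    "entity_id": ["d1", "d1", "d1"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-04"]),
    "avg_rating_7d": [4.6, 4.7, 4.9],
}).sort_values("event_timestamp")

# For each label, take the latest feature value at or before its timestamp,
# never a future one -- this is what prevents data leakage.
training_set = pd.merge_asof(
    labels, features, on="event_timestamp", by="entity_id", direction="backward"
)
# First label gets the 01-01 value (4.6); second gets 01-04 (4.9).
```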
Distributed Training Platform
Large model training uses distributed training across a GPU cluster. PyTorch Distributed Data Parallel (DDP) synchronizes gradients across workers via NCCL all-reduce. The training orchestrator (Kubeflow Pipelines or MLflow Projects) handles: data sharding across workers, checkpointing to S3 every epoch, hyperparameter tuning (Optuna or Ray Tune), and early stopping. Training jobs request GPU resources from the Kubernetes GPU scheduler; spot GPU instances handle 60% of training workloads at reduced cost.
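A minimal DDP training-loop skeleton, launched with `torchrun --nproc_per_node=8 train.py`; the model and data below are placeholders for the real pipeline:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(128, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(100):                # stand-in for the sharded data loader
        x = torch.randn(32, 128, device=f"cuda:{rank}")
        y = torch.randn(32, 1, device=f"cuda:{rank}")
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                    # NCCL all-reduce of gradients happens here
        opt.step()
        if dist.get_rank() == 0 and step % 50 == 0:
            torch.save(model.module.state_dict(), "checkpoint.pt")  # then upload to S3

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```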
Model Registry
All trained models are registered in the Model Registry (MLflow Model Registry or a custom service) with: model artifact path (S3), training metrics (AUC, RMSE, loss curves), training data snapshot reference, feature store version used, framework version, and approval status (STAGING, PRODUCTION, ARCHIVED). Promotion from STAGING to PRODUCTION requires automated evaluation gates (performance above threshold vs. baseline) and optionally human approval for high-stakes models. Model lineage tracks the training data and features that produced each model version.
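A sketch of the automated evaluation gate, written against a hypothetical registry client (MLflow's MlflowClient offers equivalent operations):

```python
def promote_if_better(registry, name, candidate_version, min_uplift=0.0):
    """STAGING -> PRODUCTION only if the candidate beats the current baseline.
    The registry interface here is an assumption for illustration."""
    cand = registry.get_version(name, candidate_version)
    prod = registry.get_production_version(name)      # current baseline, may be None
    baseline_auc = prod.metrics["auc"] if prod else 0.0
    if cand.metrics["auc"] >= baseline_auc + min_uplift:
        registry.set_status(name, candidate_version, "PRODUCTION")
        if prod:
            registry.set_status(name, prod.version, "ARCHIVED")
        return True
    return False  # stays in STAGING; high-stakes models also need human sign-off
```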
Database Design
The Model Registry uses PostgreSQL with three core tables:
- models (model_id, name, framework, artifact_s3_path, created_by, created_at)
- model_versions (version_id, model_id, version_number, metrics_json, feature_store_version, training_dataset_ref, status, promoted_at)
- deployments (deployment_id, model_id, version_id, traffic_split_json, serving_endpoint, created_at, is_active)
The Feature Store catalog adds two tables:
- feature_groups (group_id, name, entity_type, online_enabled, offline_enabled, ttl_days)
- features (feature_id, group_id, name, dtype, description, owner_team)
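The core of the registry schema can be sketched as executable DDL; SQLite is used here so the snippet runs standalone, whereas production would use PostgreSQL types such as JSONB and TIMESTAMPTZ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE models (
    model_id         TEXT PRIMARY KEY,
    name             TEXT NOT NULL UNIQUE,
    framework        TEXT NOT NULL,   -- tensorflow | pytorch | xgboost | sklearn
    artifact_s3_path TEXT NOT NULL,
    created_by       TEXT,
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE model_versions (
    version_id            TEXT PRIMARY KEY,
    model_id              TEXT REFERENCES models(model_id),
    version_number        INTEGER NOT NULL,
    metrics_json          TEXT,       -- AUC, RMSE, loss curves
    feature_store_version TEXT,
    training_dataset_ref  TEXT,
    status TEXT CHECK (status IN ('STAGING','PRODUCTION','ARCHIVED')),
    promoted_at           TEXT
);
""")
```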
API Design
POST /features/groups/{group_id}/get-online — Batch fetch online feature values for a list of entity IDs; returns feature vectors in <10ms.
POST /training-jobs — Submit a training job with dataset config, model config, hyperparameter ranges, and resource requests.
POST /models/{model_id}/versions/{version_id}/deploy — Deploy a model version with traffic split configuration.
POST /predict/{model_name} — Online inference endpoint; accepts entity_id and optional feature overrides; returns prediction and model version used.
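An illustrative call to the inference endpoint; the host, payload shape, and response fields are assumptions rather than a documented contract:

```python
import requests

resp = requests.post(
    "https://ml-platform.internal/predict/eta_model",
    json={"entity_id": "driver_123", "feature_overrides": {"surge_multiplier": 1.4}},
    timeout=0.2,  # fail fast; the gateway falls back per its deployment policy
)
resp.raise_for_status()
print(resp.json())  # e.g. {"prediction": 7.2, "model_version": "v42", "latency_ms": 18.0}
```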
Scaling & Bottlenecks
The online feature store is the critical path for serving latency. Redis Cluster handles 1 million reads/second with sub-millisecond latency; partitioning by entity_id hash distributes load evenly. Feature vector serialization overhead (Avro deserialization) adds 1–2ms; using Protocol Buffers with compiled parsers reduces this to under 0.5ms. A client-side cache in the model gateway (LRU, 10,000 entity capacity, 30-second TTL) eliminates redundant feature lookups for hot entities.
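A minimal sketch of that client-side cache, combining LRU eviction at 10,000 entities with a 30-second TTL checked on read:

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache with per-entry expiry, as used in the model gateway."""
    def __init__(self, capacity=10_000, ttl_s=30.0):
        self.capacity, self.ttl_s = capacity, ttl_s
        self._data = OrderedDict()          # entity_id -> (expires_at, features)

    def get(self, entity_id):
        item = self._data.get(entity_id)
        if item is None or item[0] < time.monotonic():
            self._data.pop(entity_id, None) # expired or missing
            return None
        self._data.move_to_end(entity_id)   # mark as recently used
        return item[1]

    def put(self, entity_id, features):
        self._data[entity_id] = (time.monotonic() + self.ttl_s, features)
        self._data.move_to_end(entity_id)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```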
GPU cluster utilization is typically 40–60% due to training job heterogeneity (small jobs waste GPU memory). A GPU sharing mechanism (NVIDIA MIG or time-slicing) allows multiple small training jobs to share a GPU. A gang scheduling algorithm ensures distributed training jobs receive all requested GPUs simultaneously rather than partially, preventing GPU starvation deadlocks.
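The all-or-nothing property of gang scheduling is easy to see in a sketch: GPUs are reserved only when the whole request fits, so two half-placed jobs can never deadlock waiting on each other. The function below is a simplified illustration, not the actual Kubernetes scheduler algorithm:

```python
def try_gang_schedule(gpus_needed, nodes):
    """nodes: {node_id: free_gpu_count}. Returns a placement plan or None."""
    plan, remaining = {}, gpus_needed
    for node_id, free in sorted(nodes.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take:
            plan[node_id] = take
            remaining -= take
        if remaining == 0:
            for nid, n in plan.items():  # commit only when fully satisfiable
                nodes[nid] -= n
            return plan
    return None                          # hold nothing; retry when GPUs free up
```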
Key Trade-offs
- Unified platform vs. best-of-breed tools: A unified platform reduces integration friction and enforces standards, but it constrains framework choice and evolves more slowly; best-of-breed tools offer cutting-edge capabilities at the cost of integration complexity and a higher risk of training-serving skew.
- Online vs. offline feature freshness: Streaming feature pipelines provide fresh online features (seconds stale) but are complex to operate; batch pipelines are simpler but features may be hours stale, degrading model quality for time-sensitive predictions.
- Model-as-service vs. embedded model: Serving models as dedicated microservices enables independent scaling and versioning but adds network latency; embedding the model in the application removes network hops but couples model updates to application deployments.
- Centralized feature store vs. decentralized: Centralized feature store maximizes reuse and prevents redundant computation but creates a single dependency for all production models; decentralized per-team stores give autonomy but result in feature duplication and inconsistency.