System Design: Image Classification Service
Design a scalable image classification microservice that processes images through deep learning models, handling high-throughput batch and real-time classification with GPU acceleration. Covers model optimization, auto-scaling, preprocessing pipelines, and multi-label classification.
Requirements
Functional Requirements:
- Classify images into one or more categories from a predefined taxonomy
- Support multiple model configurations: binary classification, multi-class (single label), multi-label
- Accept images via URL, base64-encoded payload, or S3 reference
- Return top-K predictions with confidence scores and optional bounding boxes for object detection
- Support model versioning: A/B test different model architectures simultaneously
- Provide a batch API for asynchronous processing of large image datasets
Non-Functional Requirements:
- Real-time classification latency under 200ms at the 99th percentile
- Batch throughput of 1 million images per hour per GPU worker
- Support images up to 50 MB; reject images exceeding the limit and return an error
- 99.9% availability for the classification API
- GPU utilization target of 80% during sustained load
Scale Estimation
Real-time traffic: 1,000 requests/second with roughly 5ms of GPU inference per image. Dynamic batching groups about 20 requests into one GPU forward pass, achieving roughly 10x the throughput of single-image inference. A single V100 sustains about 200 images/second (ResNet-50, FP16), so 5 GPUs cover 1,000 requests/second exactly; provisioning 6 leaves roughly 20% headroom. Batch processing: the 1 million images/hour target works out to about 278 images/second, so 2 dedicated GPU workers (about 400 images/second combined) handle it with margin.
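A quick back-of-the-envelope check of these figures (a sketch in Python; all throughput numbers are the assumptions above, not measurements):

```python
# Capacity math using the assumed figures above (not measured benchmarks).
import math

REALTIME_RPS = 1_000          # real-time requests per second
GPU_IMAGES_PER_SEC = 200      # V100, ResNet-50, FP16, with dynamic batching
HEADROOM = 0.20               # target spare capacity

realtime_gpus = math.ceil(REALTIME_RPS * (1 + HEADROOM) / GPU_IMAGES_PER_SEC)

BATCH_IMAGES_PER_HOUR = 1_000_000
batch_images_per_sec = BATCH_IMAGES_PER_HOUR / 3600   # ~278 images/second
batch_gpus = math.ceil(batch_images_per_sec / GPU_IMAGES_PER_SEC)

print(realtime_gpus, round(batch_images_per_sec), batch_gpus)  # 6, 278, 2
```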
High-Level Architecture
The service has three main subsystems: Image Ingestion & Preprocessing, Model Inference, and Result Delivery. Image Ingestion handles URL fetching (with timeout and size validation), base64 decoding, and JPEG/PNG/WebP format normalization. Preprocessing resizes images to the model's required input dimensions (224x224 for ResNet, 384x384 for EfficientNet-L), normalizes pixel values using ImageNet mean and standard deviation, and batches images for GPU throughput.
Model Inference runs on NVIDIA Triton Inference Server deployed on GPU nodes. Triton's dynamic batching collects individual inference requests within a 10ms window and runs them as a single batch. TensorRT-optimized models run in INT8 quantization for 2x throughput improvement with <1% accuracy degradation. The inference layer is stateless and horizontally scalable; Kubernetes HPA scales GPU pods based on GPU utilization and request queue depth.
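As a sketch of how a stateless API pod might call Triton over its HTTP endpoint using the tritonclient library; the service address, model name, and tensor names (resnet50_trt, input, probabilities) are illustrative assumptions:

```python
# Minimal real-time inference call against Triton's HTTP endpoint.
# Model and tensor names are assumptions; Triton's dynamic batcher combines
# concurrent single-image requests server-side, so the caller stays simple.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.inference.svc:8000")  # assumed address

def classify(image_tensor: np.ndarray, top_k: int = 5):
    # image_tensor: preprocessed CHW float array, shape (3, 224, 224)
    infer_input = httpclient.InferInput("input", [1, 3, 224, 224], "FP16")
    infer_input.set_data_from_numpy(image_tensor[np.newaxis].astype(np.float16))
    output = httpclient.InferRequestedOutput("probabilities")

    result = client.infer("resnet50_trt", inputs=[infer_input], outputs=[output])
    probs = result.as_numpy("probabilities")[0]
    top = np.argsort(probs)[::-1][:top_k]
    return [(int(i), float(probs[i])) for i in top]
```

Because each API pod sends single-image requests, batching happens entirely inside Triton's dynamic batcher and the inference tier remains stateless.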
Result Delivery returns predictions synchronously for real-time requests. For batch jobs, results are written to S3 as JSON Lines files, with a completion webhook notifying the caller. A result cache (Redis, keyed by image content hash, TTL 24 hours) stores recently seen image predictions, returning cached results for duplicate images without GPU inference.
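A minimal sketch of the content-hash cache, assuming a SHA-256 key and a JSON payload (both illustrative choices, not specified above):

```python
# Redis result cache: key = SHA-256 of the raw image bytes, TTL 24 hours.
# Key prefix and payload shape are assumptions for illustration.
import hashlib
import json
import redis

cache = redis.Redis(host="results-cache", port=6379)  # assumed host
TTL_SECONDS = 24 * 60 * 60

def cache_key(image_bytes: bytes) -> str:
    return "clf:" + hashlib.sha256(image_bytes).hexdigest()

def get_cached(image_bytes: bytes):
    value = cache.get(cache_key(image_bytes))
    return json.loads(value) if value else None

def put_cached(image_bytes: bytes, predictions: list, model_version: str):
    payload = {"model_version": model_version, "predictions": predictions}
    cache.setex(cache_key(image_bytes), TTL_SECONDS, json.dumps(payload))
```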
Core Components
Image Preprocessing Service
Preprocessing runs on CPU to keep the GPU busy with inference. OpenCV with libjpeg-turbo handles JPEG decoding roughly 3x faster than PIL. Preprocessing steps: decode → validate (dimensions, file size, format) → resize with bicubic interpolation → center crop → normalize → serialize to a float32 tensor → batch. For URL-sourced images, a download manager with a connection pool and a 5-second timeout fetches images concurrently, returning a placeholder "download failed" prediction rather than blocking the pipeline.
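A sketch of these CPU-side steps using OpenCV; the 50 MB limit and 224x224 input size come from this document, while the short-side-resize policy and BGR→RGB handling are assumptions:

```python
# CPU preprocessing: decode -> validate -> resize (bicubic) -> center crop -> normalize.
import cv2
import numpy as np

MAX_BYTES = 50 * 1024 * 1024                     # 50 MB limit from the requirements
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_bytes: bytes, size: int = 224) -> np.ndarray:
    if len(image_bytes) > MAX_BYTES:
        raise ValueError("image exceeds 50 MB limit")

    img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)
    if img is None:
        raise ValueError("unsupported or corrupt image")

    # Resize the short side to `size` with bicubic interpolation, then center crop.
    h, w = img.shape[:2]
    scale = size / min(h, w)
    img = cv2.resize(img, (round(w * scale), round(h * scale)), interpolation=cv2.INTER_CUBIC)
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    img = img[top:top + size, left:left + size]

    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    img = (img - IMAGENET_MEAN) / IMAGENET_STD
    return img.transpose(2, 0, 1)                 # HWC -> CHW float32 tensor
```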
Triton Inference Server Configuration
Each model is deployed as a Triton model repository entry whose config.pbtxt specifies the model name, backend (TensorRT, ONNX, PyTorch), input/output tensor shapes, and batching parameters. max_batch_size is set to 64, and preferred_batch_size to [8, 16, 32], which lets Triton choose an optimal batch size dynamically. A model ensemble configuration chains the preprocessing model (CPU) and inference model (GPU) into a single inference pipeline, eliminating intermediate result serialization overhead.
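A sketch of what such a config.pbtxt could look like for a TensorRT ResNet-50 entry, using the batching parameters above; the model and tensor names are illustrative:

```
name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP16
    dims: [ 1000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  # 10ms maximum queue delay: the batching window described above
  max_queue_delay_microseconds: 10000
}
```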
Model Versioning & A/B Serving
Multiple model versions are loaded simultaneously in Triton. A routing layer in front of Triton reads traffic allocation config from etcd (e.g., 90% to model_v3, 10% to model_v4) and routes each request accordingly, tagging the response with the model version used. This enables live comparison of model quality on real traffic. Offline evaluation metrics (precision, recall, top-5 accuracy on a held-out benchmark) are tracked per version in MLflow.
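A sketch of the weighted routing decision, assuming the allocation is a version-to-weight map read from etcd and that requests are hashed on a request ID so retries land on the same version (both assumptions):

```python
# Weighted model-version routing, e.g. {"model_v3": 0.9, "model_v4": 0.1} from etcd.
# Hashing the request ID (an assumption) keeps a given request's retries sticky.
import hashlib

def pick_model_version(request_id: str, allocation: dict[str, float]) -> str:
    # Map the request ID to a stable point in [0, 1).
    digest = hashlib.sha256(request_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for version, weight in sorted(allocation.items()):
        cumulative += weight
        if point < cumulative:
            return version
    return max(allocation, key=allocation.get)   # fallback if weights don't sum to 1

# Example: pick a version, then tag the response with the version actually used.
version = pick_model_version("req-8c1f", {"model_v3": 0.9, "model_v4": 0.1})
```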
Database Design
Classification results are logged to Cassandra for audit and model monitoring: (image_hash VARCHAR, model_version VARCHAR, predicted_class VARCHAR, confidence FLOAT, processing_time_ms INT, classified_at TIMESTAMP). Batch job metadata is stored in PostgreSQL: batch_jobs (job_id, input_s3_path, output_s3_path, model_version, total_images, processed_images, status, submitted_at, completed_at). Redis stores the result cache with LRU eviction and TTL-based expiry.
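As one possible shape for the Cassandra audit table, a CQL sketch; the partition and clustering key choice (partition by image_hash, newest classification first) is an assumption, not specified above:

```sql
-- Sketch of the Cassandra audit/monitoring table; key choice is an assumption.
CREATE TABLE classification_log (
    image_hash         varchar,
    model_version      varchar,
    predicted_class    varchar,
    confidence         float,
    processing_time_ms int,
    classified_at      timestamp,
    PRIMARY KEY ((image_hash), classified_at)
) WITH CLUSTERING ORDER BY (classified_at DESC);
```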
API Design
POST /classify — Real-time classification for a single image (URL, base64, or S3 ref); returns top-K labels with confidence scores.
POST /classify/batch — Submit a batch job with an S3 manifest of image URLs; returns job_id for async polling.
GET /classify/batch/{job_id} — Return batch job status and result S3 path upon completion.
GET /models — List deployed model versions with framework, input dimensions, taxonomy size, and benchmark metrics.
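A sketch of calling the real-time endpoint; the host and the request/response field names are illustrative, not a published contract:

```python
# Example call to POST /classify with an image URL; field names are assumptions.
import requests

resp = requests.post(
    "https://api.example.com/classify",
    json={
        "image_url": "https://example.com/cat.jpg",
        "top_k": 5,
        "model_version": "model_v3",   # optional; omit to use default routing
    },
    timeout=2,
)
resp.raise_for_status()
for pred in resp.json()["predictions"]:
    print(pred["label"], pred["confidence"])
```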
Scaling & Bottlenecks
GPU underutilization during low-traffic periods wastes cost. A cluster autoscaler (Karpenter on AWS) terminates idle GPU nodes within 5 minutes and launches new ones in 2 minutes. For spiky workloads, a pre-warming strategy keeps a minimum of 2 GPU nodes always ready. Spot GPU instances (g4dn.xlarge with T4 GPUs) handle 70% of the inference load at roughly a 70% cost reduction; on-demand instances handle the remainder for SLA guarantees.
Cold start (loading a model into GPU memory) takes 10–30 seconds for large models. Pre-loading all models at pod startup eliminates cold start during serving. Model pinning (keeping hot models always resident in GPU memory) prevents cache eviction for frequently used model versions, at the cost of reserving GPU memory for pinned models.
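If Triton runs with explicit model control (--model-control-mode=explicit), pre-loading at pod startup can use its model control API; a sketch with illustrative model names:

```python
# Pre-load serving models into GPU memory at pod startup so the first request
# never pays the 10-30s cold-start cost. Model names here are assumptions.
import tritonclient.http as httpclient

PINNED_MODELS = ["resnet50_trt", "efficientnet_l_trt"]

client = httpclient.InferenceServerClient(url="localhost:8000")
for model in PINNED_MODELS:
    client.load_model(model)
    assert client.is_model_ready(model), f"{model} failed to load"
```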
Key Trade-offs
- Dynamic batching vs. fixed batch size: Dynamic batching improves GPU utilization under variable load but introduces up to 10ms queuing latency; fixed batch size is predictable but wastes GPU cycles when requests arrive individually.
- TensorRT INT8 vs. FP32: INT8 quantization doubles throughput with <1% accuracy degradation for ResNet-class models; larger models or fine-grained classification may show >2% degradation, requiring per-model calibration to decide.
- Single large model vs. ensemble: Ensemble (ResNet + EfficientNet) improves accuracy by 2–3% but doubles inference cost; for high-value classification tasks the quality improvement justifies the cost.
- CPU preprocessing vs. GPU preprocessing (DALI): DALI moves image decoding and preprocessing to GPU, freeing CPU resources and improving end-to-end throughput by 30% but adding implementation complexity.