LLM Serving Infrastructure at Scale

How to serve LLMs in production with vLLM, TGI, and TensorRT-LLM — covering batching, KV cache, quantization, and GPU memory management.

Akhil Sharma

February 4, 2026

12 min read

Calling an API is the right choice for most teams. But when you need to serve models on your own hardware — for cost, latency, privacy, or customization — the infrastructure decisions matter enormously. A poorly configured serving stack can leave 60% of your GPU memory unused while users wait in a queue.

The Serving Stack

Three frameworks dominate self-hosted LLM serving in 2026:

vLLM — The default choice for most deployments. PagedAttention for efficient KV cache management, continuous batching, and broad model support. Written in Python with CUDA kernels.

TGI (Text Generation Inference) — Hugging Face's serving solution. Tight integration with the HF ecosystem. Good for teams already deep in the Hugging Face stack.

TensorRT-LLM — NVIDIA's optimized serving runtime. Best raw performance on NVIDIA hardware but harder to set up and less flexible with model support.

| Feature | vLLM | TGI | TensorRT-LLM |
|---|---|---|---|
| Setup complexity | Low | Low | High |
| Model support | Broad | Broad | Narrow (needs conversion) |
| Peak throughput | High | Medium | Highest |
| KV cache management | PagedAttention | Paged | Paged + optimized |
| Quantization | GPTQ, AWQ, GGUF, FP8 | GPTQ, AWQ, EETQ | FP8, INT8, INT4 |
| Speculative decoding | Yes | Yes | Yes |
| Tensor parallelism | Yes | Yes | Yes |
| Community | Large | Large | Growing |

Continuous Batching: Why It Matters

Naive batching waits for a batch to fill (or a timeout), processes all requests together, and returns all results when the slowest one finishes. This wastes GPU cycles because short completions finish early but their GPU slots sit idle until the batch completes.

Continuous batching (also called iteration-level batching) is different: the scheduler checks for new requests at every decoding step. When a request in the batch finishes generating, its slot is immediately given to a waiting request.

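This alone can improve throughput by 2-5x on workloads with mixed output lengths. vLLM and TGI both support continuous batching out of the box, and TensorRT-LLM offers the same technique under the name in-flight batching.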
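
To make the difference concrete, here is a minimal simulation that counts decoding steps under both schedulers. It is a toy model, not how vLLM or TGI schedule internally: the function names and the workload are hypothetical, every request is reduced to the number of tokens it will generate, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decoding steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decoding steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical workload: output lengths in tokens, two long and six short.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With this workload the static scheduler needs 200 steps and the continuous one 110, because the second long request starts as soon as a short one frees its slot instead of waiting for the whole first batch.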

KV Cache Management

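The KV (key-value) cache stores attention states for each token in each layer. For a 7B parameter model with 32 layers, assuming a hidden size of 4096, FP16 values, and no grouped-query attention, each token's KV cache consumes approximately 2 (K and V) × 32 layers × 4096 × 2 bytes = 512 KB.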

For a batch of 32 requests each generating 2048 tokens, that's 32 GB of KV cache — potentially more than the model weights themselves.
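
The arithmetic behind these figures can be checked with a few lines. This is a sketch under stated assumptions: the hidden size of 4096 and FP16 storage are assumed (typical of Llama-style 7B models without grouped-query attention), and the function name is illustrative.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed 7B shape: 32 layers, hidden size 4096, FP16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)
batch_total = per_token * 32 * 2048  # 32 requests, 2048 tokens each

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"batch total: {batch_total / 2**30:.0f} GiB")
```

Models that use grouped-query attention store fewer KV heads than attention heads, so their per-token figure is smaller than this formula suggests.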

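PagedAttention (introduced by vLLM) solves KV cache fragmentation. Instead of allocating contiguous memory for each request's full sequence length, it allocates fixed-size blocks (pages) on demand and tracks them in a per-request block table, much like virtual memory pages in an operating system.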
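
The on-demand allocation idea can be sketched in plain Python. This is an illustration only, not vLLM's implementation: the `BlockAllocator` class, its method names, and the block size of 16 tokens are all assumptions made for the example.

```python
class BlockAllocator:
    """Toy paged KV cache: hands out fixed-size blocks as sequences grow."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        # A new block is needed only when every allocated block is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache full: request must wait")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        # Return all blocks of a finished request to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("req-a")

blocks_used = len(allocator.block_tables["req-a"])
blocks_free = len(allocator.free_blocks)
print(blocks_used, blocks_free)

allocator.free("req-a")
blocks_free_after = len(allocator.free_blocks)
```

A 20-token sequence holds two blocks here, so at most one partly filled block is wasted per request, rather than the unused tail of a preallocated maximum-length buffer.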


Prefix caching takes this further. If multiple requests share the same system prompt (common in production), the KV cache for the shared prefix is computed once and reused. For the 7B model above, a 2K-token system prompt occupies about 1 GB of KV cache per request, so with 100 concurrent requests prefix caching avoids roughly 100 GB of duplicated cache and skips the repeated prefill compute.
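
One common way to implement this is to key each full KV block by a hash of its tokens chained with the hash of the preceding block, so a block is reused only when the entire prefix matches. The sketch below counts how many prompt tokens actually need prefill under that scheme. The function name, block size, and token ids are hypothetical, and a real engine would store KV tensors instead of counting tokens.

```python
def prefill_tokens_computed(prompts, block_size=16, use_prefix_cache=True):
    """Count prompt tokens that need a prefill pass across all requests."""
    cached_blocks = set()
    computed = 0
    for tokens in prompts:
        parent = None  # hash of the prefix before the current block
        for start in range(0, len(tokens), block_size):
            block = tuple(tokens[start:start + block_size])
            key = hash((parent, block))
            is_full = len(block) == block_size
            if use_prefix_cache and is_full and key in cached_blocks:
                pass  # KV for this block already exists, reuse it
            else:
                computed += len(block)
                if is_full:
                    cached_blocks.add(key)
            parent = key
    return computed


# Hypothetical workload: 100 requests sharing a 2048-token system prompt,
# each followed by 8 request-specific tokens.
system_prompt = list(range(2048))
prompts = [system_prompt + [10_000 + i] * 8 for i in range(100)]

with_cache = prefill_tokens_computed(prompts, use_prefix_cache=True)
without_cache = prefill_tokens_computed(prompts, use_prefix_cache=False)
print(with_cache, without_cache)
```

Only full blocks are shared in this sketch, which is why the short request-specific tail is recomputed for every request.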

Quantization Trade-offs

Quantization reduces model precision from FP16 (16-bit floating point) to lower bit widths. This reduces memory usage and increases throughput at some cost to quality.
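
To show what lowering precision means mechanically, here is a sketch of the simplest possible scheme: symmetric round-to-nearest quantization with one scale per group of weights. This is deliberately not GPTQ or AWQ, which pick scales and roundings far more carefully. The function names and sample weights are made up for the illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest quantization with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate weights from integers and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical group of eight weights.
weights = [0.42, -0.07, 0.91, -0.63, 0.15, -0.28, 0.05, 0.77]
quantized, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

Each weight now costs 4 bits plus a share of the scale, and the rounding error per weight is bounded by half the scale. That error is the quality cost the table below summarizes.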

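| Method | Bits | Memory Reduction | Quality Impact | Throughput Gain |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 1x | None | 1x |
| FP8 | 8 | ~2x | Minimal | ~1.5x |
| GPTQ | 4 | ~4x | Small | ~2x |
| AWQ | 4 | ~4x | Small (slightly better than GPTQ) | ~2x |
| GGUF Q4_K_M | 4 | ~4x | Small-Medium | ~2x (CPU-friendly) |
| GPTQ 3-bit | 3 | ~5x | Noticeable | ~2.5x |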

GPTQ quantizes using calibration data (typically 128 examples from C4). Quality depends on calibration data relevance to your use case.

AWQ (Activation-aware Weight Quantization) preserves weights that correspond to large activations, producing better quality than GPTQ at the same bit width for most tasks.

FP8 is the sweet spot for NVIDIA H100/H200 hardware. Minimal quality degradation with significant memory savings. Native hardware support means no software overhead.

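Serving a 70B model on a single A100 80GB depends on the quantization you choose: FP16 weights alone take about 140 GB and do not fit, while a 4-bit AWQ or GPTQ build takes about 35 GB and does.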

GPU Memory Planning

Before deploying, calculate whether your model fits:
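
A minimal sketch of that calculation follows. The model shape is an assumption (a Llama-2-70B-style configuration with 80 layers, 8 KV heads, and head dimension 128), as is the 90% usable-memory fraction that stands in for activations and runtime overhead. The function names are illustrative.

```python
import math

GB = 1e9  # decimal gigabytes, matching how GPU memory is marketed


def weight_bytes(num_params, bits_per_param):
    """Memory taken by the model weights alone."""
    return num_params * bits_per_param / 8


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   batch_size, context_len, bytes_per_value=2):
    """KV cache size when every request uses its full context length."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * batch_size * context_len


def gpus_needed(total_bytes, gpu_bytes=80 * GB, usable_fraction=0.9):
    """Smallest GPU count whose usable memory covers the total."""
    return math.ceil(total_bytes / (gpu_bytes * usable_fraction))


# Assumed 70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
kv = kv_cache_bytes(80, 8, 128, batch_size=32, context_len=4096)
fp16_total = weight_bytes(70e9, 16) + kv
awq_weights = weight_bytes(70e9, 4)
fp16_gpus = gpus_needed(fp16_total)

print(f"KV cache: {kv / GB:.1f} GB")
print(f"FP16 total: {fp16_total / GB:.1f} GB on {fp16_gpus} GPUs")
print(f"AWQ 4-bit weights: {awq_weights / GB:.1f} GB")
```

The estimate is a lower bound. Tensor-parallel deployments usually also need a GPU count that divides the model's attention heads evenly, which can push the real number higher.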

For a 70B model, the numbers work out as follows. FP16 needs at least 3× A100 80GB GPUs (240 GB total) for a batch size of 32 with 4K context: about 140 GB of weights, plus KV cache and runtime overhead. With AWQ 4-bit quantization, the weights fit in ~35 GB, leaving 45 GB on a single A100 for KV cache, which is enough for meaningful batch sizes.

Autoscaling Strategy

LLM workloads are bursty. You need autoscaling, but the metrics are different from web services.

Don't scale on CPU utilization — GPU workloads barely touch the CPU. Scale on:

  • GPU KV cache utilization — when the KV cache is >85% full, new requests queue
  • Time-to-first-token (TTFT) — measures queuing delay, directly impacts user experience
  • Request queue depth — leading indicator of saturation

Deciding when to scale up is the easy part. Scaling down is harder: a replacement GPU pod takes 2-5 minutes to load model weights, so capacity removed too early is slow to get back. Use a stabilization window to avoid thrashing, and consider keeping warm standby pods during business hours.
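
The scaling rules above can be expressed as a small decision function. This is a sketch, not a drop-in autoscaler. The 85% KV cache threshold comes from the list above; every other threshold, the 10-minute stabilization window, and all names are assumptions for illustration. In practice the same logic would usually live in your orchestrator's autoscaling configuration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    kv_cache_utilization: float  # fraction of KV cache blocks in use, 0.0-1.0
    ttft_p99_seconds: float      # 99th percentile time to first token
    queue_depth: int             # requests waiting for a slot


def desired_replicas(current, metrics, seconds_since_pressure, *,
                     min_replicas=1, max_replicas=8, scale_down_window=600):
    """Return the replica count to run next, one step at a time."""
    under_pressure = (
        metrics.kv_cache_utilization > 0.85
        or metrics.ttft_p99_seconds > 1.0   # assumed latency budget
        or metrics.queue_depth > 10         # assumed queue limit
    )
    if under_pressure:
        return min(current + 1, max_replicas)
    # Scale down only after a sustained quiet period, because a removed
    # pod takes minutes to bring back.
    quiet = metrics.kv_cache_utilization < 0.5  # assumed low-water mark
    if quiet and seconds_since_pressure >= scale_down_window:
        return max(current - 1, min_replicas)
    return current


busy = Metrics(kv_cache_utilization=0.9, ttft_p99_seconds=0.3, queue_depth=0)
calm = Metrics(kv_cache_utilization=0.3, ttft_p99_seconds=0.1, queue_depth=0)

scale_up = desired_replicas(2, busy, seconds_since_pressure=0)
hold = desired_replicas(2, calm, seconds_since_pressure=120)
scale_down = desired_replicas(2, calm, seconds_since_pressure=900)
print(scale_up, hold, scale_down)
```

The asymmetry is intentional: any single pressure signal adds capacity immediately, while removal requires both low utilization and a full quiet window.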

Speculative Decoding

Speculative decoding uses a smaller "draft" model to generate candidate tokens, which the larger "target" model verifies in parallel. Since verification is cheaper than generation (it's a single forward pass for multiple tokens), this can reduce latency by 2-3x without changing output quality.
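
The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant, not a production implementation: real engines verify all draft tokens in one batched forward pass and use rejection sampling to preserve the target's sampling distribution. Both "models" here are hypothetical functions that map a token list to the next token.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain autoregressive decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: keep draft tokens the target agrees with."""
    tokens = list(prompt)
    accepted = proposed = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        base = list(tokens)
        draft = []
        for _ in range(k):  # the cheap model proposes k tokens
            draft.append(draft_next(base + draft))
        proposed += k
        for i, token in enumerate(draft):
            # In a real engine these k checks are a single target pass.
            expected = target_next(base + draft[:i])
            if token != expected:
                tokens.append(expected)  # take the target's token and stop
                break
            tokens.append(token)
            accepted += 1
        else:
            tokens.append(target_next(tokens))  # bonus token when all match
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, accepted / proposed


def target_next(context):
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next(context):
    """Toy draft model: same as the target except for one mistake."""
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


prompt = [0]
output, acceptance_rate = speculative_decode(target_next, draft_next, prompt, 20)
reference = greedy_decode(target_next, prompt, 20)
print(output == reference, acceptance_rate)
```

The output matches plain greedy decoding with the target model even though the draft model is sometimes wrong. That is the sense in which quality is unchanged: a bad draft only costs speed.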

The draft model should come from the same family as the target model, sharing its tokenizer and approximating its output distribution. Acceptance rate depends on the task: highly predictable outputs (code completion, structured data) see 80%+ acceptance, while creative writing sees 40-60%.

What To Monitor

Beyond standard system metrics, track LLM-specific signals:

  • TTFT (Time to First Token): p50 < 200ms, p99 < 1s for interactive use cases
  • Token throughput: tokens/second per GPU — compare against your model's theoretical maximum
  • KV cache hit rate: if using prefix caching, track how often the cache is reused
  • Batch utilization: average number of requests being processed per iteration vs maximum batch size
  • Request rejection rate: how often requests are rejected due to capacity
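
As one example of turning these signals into a check, the sketch below computes TTFT percentiles from raw samples and compares them with the targets listed above. It uses a simple nearest-rank percentile. The function names and the sample data are hypothetical, and a production setup would normally read these values from a metrics system instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_within_targets(ttft_seconds, p50_target=0.2, p99_target=1.0):
    """True when both latency targets for interactive use are met."""
    return (percentile(ttft_seconds, 50) < p50_target
            and percentile(ttft_seconds, 99) < p99_target)


def batch_utilization(active_per_iteration, max_batch_size):
    """Average share of batch slots occupied across iterations."""
    average_active = sum(active_per_iteration) / len(active_per_iteration)
    return average_active / max_batch_size


# Hypothetical samples: mostly fast responses with two slow outliers.
samples = [0.1] * 98 + [0.5, 2.0]
p50 = percentile(samples, 50)
p99 = percentile(samples, 99)
healthy = ttft_within_targets(samples)
utilization = batch_utilization([16, 24, 32, 8], max_batch_size=32)
print(p50, p99, healthy, utilization)
```

Tracking the p99 alongside the p50 matters because queuing delay shows up in the tail long before it moves the median.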

Self-hosting LLMs is an infrastructure problem, not a model problem. The same model can perform vastly differently depending on quantization, batching configuration, and KV cache management. Get the infrastructure right, and the model does its job.
