LLM Serving Infrastructure at Scale

Calling an API is the right choice for most teams. But when you need to serve models on your own hardware — for cost, latency, privacy, or customization — the infrastructure decisions matter enormously. A poorly configured serving stack can leave 60% of your GPU memory unused while users wait in a queue.

The Serving Stack

Three frameworks dominate self-hosted LLM serving in 2026:

vLLM — The default choice for most deployments. PagedAttention for efficient KV cache management, continuous batching, and broad model support. Written in Python with CUDA kernels.

TGI (Text Generation Inference) — Hugging Face's serving solution. Tight integration with the HF ecosystem. Good for teams already deep in the Hugging Face stack.

TensorRT-LLM — NVIDIA's optimized serving runtime. Best raw performance on NVIDIA hardware but harder to set up and less flexible with model support.

Feature	vLLM	TGI	TensorRT-LLM
Setup complexity	Low	Low	High
Model support	Broad	Broad	Narrow (needs conversion)
Peak throughput	High	Medium	Highest
KV cache management	PagedAttention	Paged	Paged + optimized
Quantization	GPTQ, AWQ, GGUF, FP8	GPTQ, AWQ, EETQ	FP8, INT8, INT4
Speculative decoding	Yes	Yes	Yes
Tensor parallelism	Yes	Yes	Yes
Community	Large	Large	Growing

Continuous Batching: Why It Matters

Naive batching waits for a batch to fill (or a timeout), processes all requests together, and returns all results when the slowest one finishes. This wastes GPU cycles because short completions finish early but their GPU slots sit idle until the batch completes.

Continuous batching (also called iteration-level batching) is different: the scheduler checks for new requests at every decoding step. When a request in the batch finishes generating, its slot is immediately given to a waiting request.

This alone can improve throughput by 2-5x over naive batching. vLLM and TGI both support continuous batching out of the box, and TensorRT-LLM offers the same technique under the name in-flight batching.
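
To make the difference concrete, here is a toy simulation that counts decode steps under both policies. It is only a sketch: the function names and the request lengths are made up for illustration, and a real scheduler also has to account for prefill work and memory limits.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before this decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        running = [r - 1 for r in running if r > 1]
    return steps

# Output lengths in tokens: two long completions mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

With this mix, the static policy pays for the longest request in each batch, while the continuous policy overlaps the second long request with the tail of the first.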

KV Cache Management

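The KV (key-value) cache stores attention states for each token in each layer. For a 7B parameter model with 32 layers, each token's KV cache consumes approximately 0.5 MB at FP16: one key and one value vector per layer, each holding 4096 two-byte values (assuming a hidden size of 4096 and no grouped-query attention).

For a batch of 32 requests each generating 2048 tokens, that's 32 GB of KV cache, more than twice the ~14 GB that the FP16 weights themselves occupy.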
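
The arithmetic behind these figures can be sketched in a few lines. The hidden size of 4096 is an assumption (it is typical for 7B models with full multi-head attention and reproduces the 32 GB total); models with grouped-query attention would pass a smaller `kv_dim`.

```python
def kv_bytes_per_token(num_layers, kv_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: a K and a V vector in every layer."""
    return 2 * num_layers * kv_dim * bytes_per_value

# Assumed 7B shape: 32 layers, 4096-wide keys and values, FP16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=32, kv_dim=4096)
batch_total = per_token * 32 * 2048  # 32 requests x 2048 tokens each

print(f"{per_token / 2**20:.2f} MiB per token")   # 0.50 MiB per token
print(f"{batch_total / 2**30:.0f} GiB for the batch")  # 32 GiB for the batch
```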

PagedAttention (introduced by vLLM) solves KV cache fragmentation. Instead of allocating contiguous memory for each request's full sequence length, it allocates fixed-size blocks (pages) on demand and tracks them in a per-request block table, much like virtual memory pages in an operating system.
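
The idea can be illustrated with a toy block allocator. This is not vLLM's implementation: the class name, the method names, and the block size of 16 tokens are assumptions chosen for the example, and the sketch only tracks block ownership rather than storing any tensors.

```python
class PagedKVAllocator:
    """Toy allocator: maps each request to a list of fixed-size block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens written so far

    def append_token(self, request_id):
        """Reserve space for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")
blocks_used = len(alloc.block_tables["req-a"])  # 17 tokens span 2 blocks
```

Because a request only ever wastes the unused tail of its last block, memory that a contiguous allocator would reserve for the maximum sequence length stays available to other requests.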

Prefix caching takes this further. If multiple requests share the same system prompt (common in production), the KV cache for the shared prefix is computed once and reused. With a 2K-token system prompt and 100 concurrent requests, at roughly 0.5 MB of KV cache per token the prefix costs about 1 GB once instead of ~100 GB recomputed and stored across all requests.
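
A rough sketch of block-level prefix reuse follows. The `PrefixCache` class is hypothetical, and keying each block by the full token prefix it covers is a simplification; production systems typically use chained hashes and eviction, which are omitted here.

```python
class PrefixCache:
    """Toy prefix cache: KV blocks are keyed by the whole token prefix they cover."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = {}          # tuple of tokens up to the block's end -> block id
        self.computed_blocks = 0  # how many blocks had to be computed

    def lookup_or_compute(self, tokens):
        """Return block ids for every full block of `tokens`, computing only misses."""
        block_ids = []
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            # A block is reusable only if everything before it matches too.
            key = tuple(tokens[:end])
            if key not in self.blocks:
                self.blocks[key] = len(self.blocks)
                self.computed_blocks += 1  # stands in for a model forward pass
            block_ids.append(self.blocks[key])
        return block_ids

system_prompt = list(range(2048))  # stand-in for a 2K-token system prompt
cache = PrefixCache(block_size=16)
for user_id in range(100):
    request = system_prompt + [10_000 + user_id] * 16  # 16 request-specific tokens
    cache.lookup_or_compute(request)

print(cache.computed_blocks, "blocks computed instead of", 100 * 129)
```

The 128 prompt blocks are computed once, and each request adds only its own block.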

Quantization Trade-offs

Quantization reduces model precision from FP16 (16-bit floating point) to lower bit widths. This reduces memory usage and increases throughput at some cost to quality.
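
As a minimal illustration of what lowering the bit width does, the sketch below applies plain symmetric round-to-nearest quantization to a handful of made-up weights. This is deliberately simpler than GPTQ or AWQ, which add calibration and per-group scaling on top of the same basic idea.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 3 for 3-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.47]  # made-up values
for bits in (8, 4, 3):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: worst-case error {worst:.4f} (step size {scale:.4f})")
```

The step size grows as bits are removed, and the rounding error is bounded by half a step, which is why quality degrades faster below 4 bits.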

Method	Bits	Memory Reduction	Quality Impact	Throughput Gain
FP16 (baseline)	16	1x	None	1x
FP8	8	~2x	Minimal	~1.5x
GPTQ	4	~4x	Small	~2x
AWQ	4	~4x	Small (slightly better than GPTQ)	~2x
GGUF Q4_K_M	4	~4x	Small-Medium	~2x (CPU-friendly)
GPTQ 3-bit	3	~5x	Noticeable	~2.5x

GPTQ quantizes using calibration data (typically 128 examples from C4). Quality depends on calibration data relevance to your use case.

AWQ (Activation-aware Weight Quantization) preserves weights that correspond to large activations, producing better quality than GPTQ at the same bit width for most tasks.

FP8 is the sweet spot for NVIDIA H100/H200 hardware. Minimal quality degradation with significant memory savings. Native hardware support means no software overhead.

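Serving a 70B model on a single A100 80GB depends on the quantization: FP16 weights (~140 GB) do not fit at all, 8-bit weights (~70 GB) leave almost no room for KV cache, and 4-bit GPTQ or AWQ weights (~35 GB) leave about 45 GB for KV cache.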
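
One way to launch such a deployment is to build the `vllm serve` command line from Python. This is a sketch under assumptions: the helper name and the default values are mine, the model name is a placeholder rather than a real checkpoint, and nothing is executed here. The flags shown are standard vLLM engine arguments, but check them against the version you run.

```python
import shlex

def vllm_serve_command(model, quantization=None, max_model_len=4096,
                       gpu_memory_utilization=0.90, tensor_parallel_size=1):
    """Build a `vllm serve` argument list; nothing is executed here."""
    cmd = ["vllm", "serve", model,
           "--max-model-len", str(max_model_len),
           "--gpu-memory-utilization", str(gpu_memory_utilization),
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd

# "your-org/llama-70b-awq" is a placeholder, not a real checkpoint name.
awq_cmd = vllm_serve_command("your-org/llama-70b-awq", quantization="awq")
print(shlex.join(awq_cmd))
```

Passing the resulting list to `subprocess.run` would start the server; keeping command construction separate makes it easy to compare quantization settings side by side.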

GPU Memory Planning

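Before deploying, calculate whether your model fits: the weights, the KV cache for your target batch size and context length, and runtime overhead must together stay within total GPU memory.

For example, a 70B model at FP16 needs about 140 GB for weights alone, so a batch size of 32 with 4K context requires at least 3× A100 80GB GPUs (240 GB total). With AWQ 4-bit quantization, the weights fit in ~35 GB, leaving 45 GB on a single A100 for KV cache, which is enough for meaningful batch sizes.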
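
A back-of-the-envelope planner along these lines might look like the following. The model shape is an assumption (80 layers, 8 KV heads with grouped-query attention, head dimension 128, which is typical of 70B-class models), and the estimate ignores activations and framework overhead, so treat the result as a lower bound.

```python
import math

GB = 1e9

def plan_memory(params_billion, bits_per_weight, num_layers, kv_heads, head_dim,
                batch_size, context_len, gpu_gb=80, kv_bytes=2):
    """Estimate weight and KV cache memory, and the GPUs needed to hold both."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / GB
    kv_per_token = 2 * num_layers * kv_heads * head_dim * kv_bytes  # K and V
    kv_gb = kv_per_token * batch_size * context_len / GB
    total_gb = weights_gb + kv_gb
    return {"weights_gb": weights_gb, "kv_gb": kv_gb, "total_gb": total_gb,
            "gpus": math.ceil(total_gb / gpu_gb)}

# Assumed 70B shape: 80 layers, 8 KV heads (grouped-query attention), head dim 128.
shape = dict(num_layers=80, kv_heads=8, head_dim=128)
fp16 = plan_memory(70, 16, batch_size=32, context_len=4096, **shape)
awq4 = plan_memory(70, 4, batch_size=1, context_len=4096, **shape)

print(f"FP16: {fp16['total_gb']:.0f} GB total, {fp16['gpus']} GPUs")
print(f"AWQ 4-bit: {awq4['weights_gb']:.0f} GB weights, "
      f"{awq4['kv_gb']:.2f} GB KV cache per 4K-token sequence")
```

Under these assumptions the FP16 case lands at 3 GPUs, and the 4-bit case shows how much KV cache each additional 4K-token sequence costs against the remaining headroom.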

Autoscaling Strategy

LLM workloads are bursty. You need autoscaling, but the metrics are different from web services.

Don't scale on CPU utilization — GPU workloads barely touch the CPU. Scale on:

GPU KV cache utilization — when the KV cache is >85% full, new requests queue
Time-to-first-token (TTFT) — measures queuing delay, directly impacts user experience
Request queue depth — leading indicator of saturation

Scale-up is the easy part. Scale-down is harder: each GPU pod takes 2-5 minutes to load model weights, so a pod removed too early cannot be brought back quickly when traffic returns. Use a stabilization window to avoid thrashing, and consider keeping warm standby pods during business hours.
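
The decision logic can be sketched independently of any particular orchestrator. In this example the 85% KV cache threshold and the 1 s TTFT limit come from the text; the queue depth threshold, the idle threshold, the window length, and all names are assumptions chosen for illustration.

```python
from collections import namedtuple

Metrics = namedtuple("Metrics",
                     ["kv_cache_utilization", "ttft_p99_seconds", "queue_depth"])

class Autoscaler:
    """Scale up immediately on pressure; scale down only after sustained calm."""

    def __init__(self, min_replicas=1, max_replicas=8, scale_down_window=10):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.scale_down_window = scale_down_window  # consecutive calm samples needed
        self.calm_samples = 0

    def desired_replicas(self, current, m):
        # 0.85 and 1.0 s follow the text; the queue threshold is an assumption.
        overloaded = (m.kv_cache_utilization > 0.85
                      or m.ttft_p99_seconds > 1.0
                      or m.queue_depth > 10)
        # The idle threshold of 40% is also an assumption.
        idle = m.kv_cache_utilization < 0.40 and m.queue_depth == 0
        if overloaded:
            self.calm_samples = 0
            return min(current + 1, self.max_replicas)
        self.calm_samples = self.calm_samples + 1 if idle else 0
        if self.calm_samples >= self.scale_down_window:
            self.calm_samples = 0
            return max(current - 1, self.min_replicas)
        return current

scaler = Autoscaler(scale_down_window=3)
replicas = scaler.desired_replicas(2, Metrics(0.92, 0.4, 3))  # KV cache above 85%
```

The asymmetry is the point: one overloaded sample adds a replica, but removing one requires the whole window to stay calm.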

Speculative Decoding

Speculative decoding uses a smaller "draft" model to generate candidate tokens, which the larger "target" model verifies in parallel. Since verification is cheaper than generation (it's a single forward pass for multiple tokens), this can reduce latency by 2-3x without changing output quality.

The draft model should be from the same family as the target model (same tokenizer, similar output distribution). Acceptance rate depends on the task: highly predictable outputs (code completion, structured data) see 80%+ acceptance, while creative writing sees 40-60%.
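
The propose-and-verify loop can be shown with toy stand-ins for the two models. This sketch covers only the greedy case, where a draft token is accepted if it equals the target's own choice; sampling-based variants use a probabilistic acceptance rule instead. The two "models" are made-up arithmetic functions, and the verification loop is sequential here, whereas a real system scores all draft positions in one batched forward pass.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (generated tokens, acceptance rate)."""
    tokens = list(prompt)
    proposed = accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        proposed += k
        # 2. The target model checks each proposed position in order.
        for i, guess in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if guess != expected:
                # Keep the accepted prefix plus the target's own token.
                tokens.extend(draft[:i] + [expected])
                break
            accepted += 1
        else:
            tokens.extend(draft)  # every draft token was accepted
    generated = tokens[len(prompt):len(prompt) + max_new_tokens]
    return generated, accepted / proposed

def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]

def target_next(context):
    return (context[-1] * 3 + 1) % 7  # toy deterministic "target model"

def draft_next(context):
    # Toy draft model: agrees with the target except after token 5.
    return 0 if context[-1] == 5 else target_next(context)

spec_tokens, acceptance = speculative_decode(target_next, draft_next, [1], 12)
reference = greedy_decode(target_next, [1], 12)
```

The output matches ordinary greedy decoding with the target model exactly, which is the sense in which speculative decoding does not change output quality; only the number of target passes changes.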

What To Monitor

Beyond standard system metrics, track LLM-specific signals:

TTFT (Time to First Token): p50 < 200ms, p99 < 1s for interactive use cases
Token throughput: tokens/second per GPU — compare against your model's theoretical maximum
KV cache hit rate: if using prefix caching, track how often the cache is reused
Batch utilization: average number of requests being processed per iteration vs maximum batch size
Request rejection rate: how often requests are rejected due to capacity
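
Checking collected samples against these targets takes only a few lines. The sketch below uses a nearest-rank percentile and made-up sample data; the function names are illustrative, and in production these numbers would normally come from your metrics system rather than from a list in memory.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def check_ttft(ttft_ms, p50_target_ms=200, p99_target_ms=1000):
    """Compare observed TTFT percentiles with the interactive targets."""
    p50 = percentile(ttft_ms, 50)
    p99 = percentile(ttft_ms, 99)
    return {"p50_ms": p50, "p99_ms": p99,
            "p50_ok": p50 < p50_target_ms, "p99_ok": p99 < p99_target_ms}

def batch_utilization(running_per_iteration, max_batch_size):
    """Average number of running requests per iteration over the maximum."""
    return sum(running_per_iteration) / (len(running_per_iteration) * max_batch_size)

ttft_samples_ms = [120, 150, 95, 180, 210, 130, 160, 1400, 140, 110]  # made-up data
report = check_ttft(ttft_samples_ms)
utilization = batch_utilization([8, 16, 24, 32], max_batch_size=32)
```

In this sample the median is healthy but a single 1400 ms outlier breaks the p99 target, which is the kind of tail behavior that averages hide.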

Self-hosting LLMs is an infrastructure problem, not a model problem. The same model can perform vastly differently depending on quantization, batching configuration, and KV cache management. Get the infrastructure right, and the model does its job.
