Transformer Architecture Explained: The Engine Behind Modern AI
Understand the transformer architecture — self-attention, positional encoding, encoder-decoder structure, and why transformers revolutionized NLP and beyond.
Transformer Architecture
The transformer is a neural network architecture based on self-attention mechanisms that processes input sequences in parallel, enabling the training of models with billions of parameters on massive datasets.
What It Really Means
Before transformers, the dominant architectures for sequence processing were RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory). These process tokens one at a time, left to right. This sequential processing created two problems: training was slow (no parallelism) and long-range dependencies were hard to learn (information decayed over many time steps).
The transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., replaced recurrence with self-attention. Instead of processing tokens sequentially, the transformer looks at all tokens simultaneously and learns which tokens should attend to which other tokens. This enables massive parallelism during training and captures long-range dependencies effectively.
Every major LLM — GPT-4, Claude, Llama, Gemini — is built on the transformer architecture. Understanding how transformers work is fundamental to understanding LLM serving, embedding models, prompt engineering, and practically everything in modern AI.
How It Works in Practice
High-Level Architecture
The original transformer has two halves:
- Encoder: Processes the input sequence and builds a contextual representation
- Decoder: Generates the output sequence token by token, attending to the encoder's output
Modern LLMs typically use decoder-only architectures (GPT, Llama, Claude). Encoder-only models (BERT) are used for classification and embeddings. Encoder-decoder models (T5, BART) are used for translation and summarization.
Core Components
1. Token Embedding + Positional Encoding
Input text is tokenized and each token is mapped to a dense vector (embedding). Since self-attention has no inherent notion of position, positional encodings are added to tell the model where each token appears in the sequence.
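For intuition, here is a minimal NumPy sketch of the fixed sinusoidal encoding from the original paper; many modern models use learned or rotary position embeddings instead, and the dimensions below are purely illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings (d_model assumed even)."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2) -> the "2i" indices
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```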
2. Multi-Head Self-Attention
The key innovation. For each token, the model computes:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V
Multiple attention "heads" run in parallel, each learning different relationship patterns (syntactic, semantic, positional). See attention mechanism for a deep dive.
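As a concrete illustration of the formula above, here is a minimal NumPy sketch of scaled dot-product attention for a single head; in multi-head attention each head runs this on its own learned Q/K/V projections, and the head outputs are concatenated and projected back to the model dimension. The sizes are made up for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q, K, V: (seq_len, d_k). Returns (seq_len, d_k) context vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len): every query scored against every key
    weights = softmax(scores)          # each row sums to 1: how much each token attends to the others
    return weights @ V                 # weighted sum of value vectors

# Toy example: 5 tokens, one 8-dimensional head
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                                   # token representations
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned projections (random here)
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)   # (5, 8)
```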
3. Feed-Forward Network
After attention, each token passes through a position-wise feed-forward network (two linear layers with a ReLU/GELU activation). This adds non-linearity and additional processing capacity.
4. Layer Normalization + Residual Connections
Each sub-layer (attention, feed-forward) is wrapped with a residual connection and layer normalization. This stabilizes training and enables very deep networks (GPT-3, for example, has 96 layers).
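To show how components 3 and 4 fit together, here is a minimal NumPy sketch of a feed-forward sub-layer wrapped with its residual connection and layer normalization, using the original paper's post-norm ordering (many modern models normalize before the sub-layer instead). The inner dimension d_ff is typically about 4x d_model; shapes and the ReLU choice are illustrative.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand (W1: d_model x d_ff), apply non-linearity, project back (W2: d_ff x d_model)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU here; most modern models use GELU

def ffn_sublayer(x, W1, b1, W2, b2):
    """Post-norm wiring from the original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```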
5. Output Head
The final layer projects the hidden state to a vocabulary-sized vector. A softmax converts this to a probability distribution over the next token.
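A minimal sketch of that projection and softmax, with illustrative shapes; in many models the output matrix shares weights with the token embedding table.

```python
import numpy as np

def next_token_probs(last_hidden: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """last_hidden: (d_model,) final-layer state of the last position.
    W_out: (d_model, vocab_size) output projection. Returns one probability per vocabulary token."""
    logits = last_hidden @ W_out                      # one raw score per vocabulary entry
    logits = logits - logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()     # softmax
    return probs                                      # the next token is sampled (or argmaxed) from this
```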
Forward Pass Example
Input: "The cat sat on the"
- Tokenize: ["The", "cat", "sat", "on", "the"]
- Embed: Each token becomes a vector (e.g., 4096-dimensional)
- Add positional encoding: Position information injected
- Self-attention + feed-forward (repeated for N layers): Each token builds a contextual representation by attending to all other tokens
- After the final layer (e.g., N = 96 for GPT-3): The representation for position 5 ("the") encodes that this is an article following a preposition, in a sentence about a cat sitting
- Output head: Probability distribution over vocabulary, highest probability: "mat"
Implementation
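As a rough, self-contained sketch (not the code of any production model), the PyTorch snippet below wires the components above into a tiny decoder-only language model: token plus learned positional embeddings, multi-head causal self-attention, feed-forward sub-layers with residual connections and layer normalization (pre-norm, as in most modern LLMs), and a vocabulary output head. All dimensions are deliberately small.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (decoder-only style)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # output projection after concatenating heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, n_heads, T, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)        # (B, heads, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))                 # block attention to future tokens
        out = F.softmax(scores, dim=-1) @ v                              # weighted sum of value vectors
        out = out.transpose(1, 2).contiguous().view(B, T, C)             # concatenate heads
        return self.proj(out)

class Block(nn.Module):
    """One transformer layer: attention + feed-forward, each with residual connection and layer norm (pre-norm)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around feed-forward
        return x

class TinyTransformerLM(nn.Module):
    """Decoder-only language model: embeddings -> N blocks -> vocabulary logits."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # learned positions, for simplicity
        self.blocks = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)        # token + position information
        x = self.blocks(x)
        return self.head(self.ln_f(x))                   # (B, T, vocab_size) next-token logits

# Usage: logits for a batch of token ids
model = TinyTransformerLM()
tokens = torch.randint(0, 1000, (1, 6))
logits = model(tokens)
print(logits.shape)   # torch.Size([1, 6, 1000])
```

Training would minimize cross-entropy between these logits and the next token at each position; generation repeatedly samples from the last position's distribution and feeds the chosen token back in.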
Trade-offs
Advantages
- Parallel processing during training — orders of magnitude faster than RNNs
- Captures long-range dependencies through direct attention connections
- Scales to billions of parameters with predictable performance improvements
- Versatile — works for text, images (ViT), audio (Whisper), and multimodal tasks
Disadvantages
- Quadratic memory and compute cost with sequence length (O(n^2) for attention; see the sketch after this list)
- No inherent sequential bias — must learn position from positional encodings
- Very large models require massive compute for training and serving
- Inference is autoregressive (one token at a time) — inherently sequential
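To make the first point concrete, the sketch below counts how many attention-score entries a naive implementation materializes as the context grows; the layer and head counts are illustrative, and techniques like Flash Attention avoid storing the full matrix.

```python
def attention_score_entries(seq_len: int, n_layers: int = 32, n_heads: int = 32) -> int:
    """Entries in the attention score matrices materialized naively: layers * heads * seq_len^2."""
    return n_layers * n_heads * seq_len ** 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_entries(n):.3e} score entries")
# Each 10x increase in sequence length costs 100x more: the O(n^2) term in practice.
```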
Scaling Properties
- Parameters: 100M → 1T (10,000x range in production models)
- Training compute scales linearly with parameters and data (see the estimate after this list)
- Quality improves predictably with scale (scaling laws by Kaplan et al.)
- Inference cost scales linearly with model size but is memory-bandwidth bound
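A back-of-the-envelope version of the linearity claim above, using the common approximation from the scaling-law literature that training compute is roughly 6 × parameters × tokens; the model and dataset sizes are illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: roughly 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Illustrative only: a 7B-parameter model trained on 1T tokens
print(f"{training_flops(7e9, 1e12):.1e} FLOPs")   # ~4.2e+22
```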
Common Misconceptions
- "Transformers understand language" — Transformers learn statistical patterns in text. Whether this constitutes "understanding" is debated, but the mechanism is pattern matching and next-token prediction, not comprehension.
- "More layers always improve performance" — Deeper models can have diminishing returns and are harder to train. The relationship between depth, width, and performance is nuanced. Scaling laws help predict optimal configurations.
- "Transformers process text sequentially like humans" — During training, transformers process the entire sequence in parallel. During inference, they generate one token at a time, but each generation step attends to all previous tokens simultaneously.
- "The attention mechanism is interpretable" — While you can visualize attention weights, interpreting them as explanations for model behavior is unreliable. High attention weight does not always mean a token was "important" for the output.
- "Transformers are only for NLP" — Vision Transformers (ViT) for images, Audio Spectrogram Transformers for audio, and Decision Transformers for reinforcement learning all use the same core architecture.
How This Appears in Interviews
Transformer architecture questions are common in ML and AI engineering interviews:
- "Explain how self-attention works" — walk through Q, K, V projections, scaled dot-product, and multi-head attention. See attention mechanism.
- "Why do transformers use positional encoding?" — self-attention is permutation-invariant; without positional encoding, the model cannot distinguish word order.
- "What is the computational complexity of self-attention and how do you reduce it?" — O(n^2) in sequence length. Discuss Flash Attention, sparse attention, and linear attention approximations. See our interview questions.
Related Concepts
- Attention Mechanism — The core innovation in transformers
- Embedding Models — Encoder-based transformers for embeddings
- LLM Serving — Deploying transformer models in production
- Token Budgeting — Context window limits derive from transformer architecture
- Vector Embeddings — Transformer outputs used as representations
GO DEEPER
Learn from senior engineers in our 12-week cohort
Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.