From the failure modes of RNNs, through the mathematical derivation of attention, to multi-head attention, positional encoding, and the complete encoder-decoder architecture. Every equation, every design choice.
To understand why transformers are revolutionary, you must first understand what they replaced, and why RNNs struggle with long-range dependencies no matter how large you make them.
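An RNN processes a sequence one token at a time, updating a hidden state \(\mathbf{h}_t\) from the previous state and the current input \(\mathbf{x}_t\). In the standard formulation, with activation \(f\), weight matrices \(\mathbf{W}_h\) and \(\mathbf{W}_x\), and bias \(\mathbf{b}\):
\[\mathbf{h}_t = f\!\left(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}\right)\]
Because \(\mathbf{h}_t\) depends on \(\mathbf{h}_{t-1}\), the steps cannot be computed in parallel.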
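To learn long-range dependencies, the gradient must flow backward through hundreds of timesteps. Each step multiplies by \(\mathbf{W}_h\) and \(f'(\cdot)\), so the Jacobian between two distant states is a long product:
\[\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t} = \prod_{k=t+1}^{T} \operatorname{diag}\!\big(f'(\mathbf{a}_k)\big)\,\mathbf{W}_h\]
where \(\mathbf{a}_k\) is the pre-activation at step \(k\). When the factors have norm below 1, the product shrinks exponentially in \(T - t\) (vanishing gradients); when the norm is above 1, it can grow exponentially (exploding gradients).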
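The shrinking product can be illustrated numerically. The sketch below assumes a scalar RNN \(h_t = \tanh(w\,h_{t-1})\) with no input term; the weight 0.9, the initial state 0.5, and the function name are illustrative choices, not values from the text.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return |dh_T/dh_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)        # forward step
        grad *= w * (1.0 - h * h)   # chain rule: d tanh(a)/da = 1 - tanh(a)^2
    return abs(grad)

# Hypothetical weight |w| < 1: every factor is below 0.9, so the product decays.
for steps in (1, 10, 50, 100):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

With these settings each per-step factor is at most 0.9, so after 100 steps the gradient is bounded by \(0.9^{100}\), which is why the earliest tokens contribute almost nothing to the update.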
For seq2seq tasks (translation, summarization), an encoder RNN compresses the entire source sentence into a single fixed-size vector \(\mathbf{h}_T\). A 1000-word document must be squeezed into, say, 512 numbers. The decoder then generates the output from this bottleneck. Information is inevitably lost — especially from the beginning of the sequence.
Three compounding problems: (1) No parallelism across timesteps, so training cannot exploit parallel hardware within a sequence. (2) Vanishing gradients: early tokens effectively stop contributing to the gradient. (3) Fixed-size bottleneck: all source information must fit in one vector. The transformer addresses all three at once.
| Model | Parallelizable? | Max Path Length | Long-Range Dependency | Memory per Token |
|---|---|---|---|---|
| RNN | No (strictly sequential) | O(n) | Poor (gradient decay) | O(1) |
| LSTM/GRU | No (sequential) | O(n) | Better but still limited | O(1) |
| 1D CNN | Yes | O(n/k) with contiguous kernels of width k; O(log_k n) with dilation | Limited by kernel size and depth | O(1) |
| Transformer | Yes (fully parallel in training) | O(1) | Strong (direct attention between any pair) | O(n) |
The attention mechanism solves the bottleneck problem by allowing every output token to look directly at every input token — without going through the sequential bottleneck. It's an information retrieval system embedded inside a neural network.
Think of a search engine. You have a query (what you're looking for), a set of keys (metadata describing database entries), and values (the actual content). You compute similarity between query and each key, then retrieve a weighted blend of values:
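The search-engine analogy can be written as a soft dictionary lookup. This is a minimal sketch in plain Python; the function names and the toy keys and values are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Blend values by softmax-normalised query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical database: three entries with 2-D keys and 1-D values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [40.0]]
weights, blended = soft_lookup([4.0, 0.0], keys, values)
print(weights, blended)
```

Unlike a hard lookup, every entry contributes: the first and third keys match the query equally well and dominate the blend, while the second key still receives a small nonzero weight.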
The Query-Key-Value framework is not arbitrary — it is the result of asking: "how do we make the attention weights learnable and differentiable?" We derive it from first principles.
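Start with \(n\) token embeddings: \(\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}\). Each row \(\mathbf{x}_i\) is one token. For each token, we want to compute three different linear projections, each encoding a different aspect of the token for a different role:
\[\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^V\]
with learned matrices \(\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\) and \(\mathbf{W}^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\).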
The query for token \(i\) encodes what information token \(i\) needs to compute its output. When translating "cat" to French, the query might encode "I need a noun that refers to a feline animal."
Different tokens have different queries: "cat" searches for its referent; "sat" searches for its subject and object.
The key for token \(j\) encodes what that token contains that might be relevant to others. "mat" has a key that advertises "I am a physical object, a location."
The query-key dot product measures compatibility: does what token \(i\) needs match what token \(j\) provides?
The value for token \(j\) is the actual content that gets aggregated. Once token \(i\) decides to attend to token \(j\) (via high query-key similarity), it receives \(\mathbf{v}_j\), scaled by the corresponding attention weight.
The value can encode different information than the key. A token can advertise one thing (key) but share another (value).
Separating Q, K, V gives three degrees of freedom: (1) What to search for, (2) what to make findable, (3) what to contribute when found. Using a single vector for all three would conflate these distinct roles.
Training learns all three matrices jointly via backpropagation through the attention weights.
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} \in \mathbb{R}^{n \times d_v}\]
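The formula translates almost line by line into code. The sketch below uses plain Python lists so it runs without dependencies; the helper names and the toy matrices for \(\mathbf{X}\), \(\mathbf{W}^Q\), \(\mathbf{W}^K\), \(\mathbf{W}^V\) are hypothetical values chosen only to show the shapes.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights

# Toy sizes (assumed): n = 3 tokens, d_model = 4, d_k = d_v = 2.
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

output, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
print(len(output), len(output[0]))  # 3 2, i.e. n x d_v
```

Each row of `weights` is a probability distribution over the \(n\) tokens, and each output row is the corresponding weighted average of value vectors.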
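A single attention head produces only one attention distribution per token, so it tends to capture one type of relationship at a time. Multi-head attention runs \(h\) attention mechanisms in parallel, each in a different learned subspace, then concatenates their outputs and applies a final linear projection.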
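A minimal sketch of the multi-head computation follows. It assumes \(d_k = d_v = d_{\text{model}}/h\) and an output projection called `w_o`; the random weights, seed, and helper names are illustrative stand-ins for learned parameters.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention on lists of rows."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v_row[i] for e, v_row in zip(exps, v))
                    for i in range(len(v[0]))])
    return out

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs token by token, then mix them with w_o.
    concat = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)                 # fixed seed, illustration only
n, d_model, num_heads = 3, 4, 2
d_k = d_model // num_heads
x = random_matrix(rng, n, d_model)
heads = [tuple(random_matrix(rng, d_model, d_k) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(rng, num_heads * d_k, d_model)

output = multi_head_attention(x, heads, w_o)
print(len(output), len(output[0]))  # 3 4, i.e. n x d_model
```

Because each head works in a \(d_{\text{model}}/h\)-dimensional subspace, the total cost is similar to one full-width head, while each head is free to learn its own attention pattern.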
Self-attention is permutation-equivariant: shuffle the input tokens and the outputs shuffle identically. The model has no built-in notion of order. Without positional encoding, "The cat sat on the mat" and "mat the on sat cat The" would yield the same representation for each word, just in a different order.
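The equivariance property is easy to check numerically. For brevity, the sketch below assumes identity projections (Q = K = V = X), which is a simplification; the token vectors and the permutation are arbitrary illustrative values.

```python
import math

def self_attention(x):
    """Self-attention with identity projections (Q = K = V = x), for brevity."""
    d_k = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d_k)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
perm = [2, 0, 3, 1]                      # arbitrary shuffle of token positions
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
expected = [out[i] for i in perm]        # original outputs, reordered the same way

is_equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out_shuffled, expected)
    for a, b in zip(row_a, row_b)
)
print(is_equivariant)
```

The comparison uses a small tolerance because shuffling changes the order of floating-point summation, not the mathematical result.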
Learned absolute embeddings: learn a separate embedding for each position index, trained jointly with the model. Used by BERT and GPT-2. The maximum sequence length is fixed, and the scheme does not generalize beyond the training length.
Rotary position embeddings (RoPE): multiply Q and K by rotation matrices that depend on position, so that relative position emerges naturally in the dot product. Used in LLaMA, GPT-NeoX, and PaLM. Extrapolation to lengths well beyond training is limited in practice and usually relies on extensions such as position interpolation.
ALiBi (Attention with Linear Biases): add a linear penalty, proportional to the distance between positions, to the attention scores. No learned parameters (the per-head slopes are fixed). Strong generalization to longer sequences at inference. Used in BLOOM and MPT.
No explicit positional encoding: some recent work (e.g., RWKV) avoids explicit PE entirely, relying on architectural inductive biases. The bag-of-words model is the extreme case: no position information at all.
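Two of the schemes above, position-dependent rotations of Q and K and a linear distance penalty, can be sketched in a few lines. The code assumes a single 2-D frequency pair for the rotation (real models use many pairs with different frequencies) and one illustrative slope for the penalty; `theta`, `slope`, and all function names are hypothetical.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta (a single rotary frequency pair)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [x * c - y * s, x * s + y * c]

def rotary_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

def linear_distance_bias(n, slope=0.5):
    """Causal bias added to attention scores: -slope * (i - j) for j <= i."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

q, k = [1.0, 2.0], [0.5, -1.0]
# Same offset (2), different absolute positions: the scores should agree.
near = rotary_score(q, k, q_pos=3, k_pos=1)
far = rotary_score(q, k, q_pos=10, k_pos=8)
print(near, far)

bias = linear_distance_bias(3)
print(bias)
```

The rotary score depends only on the offset between the two positions, which is the sense in which relative position emerges in the dot product. The bias matrix needs no training: more distant keys simply receive a larger penalty before the softmax.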
The transformer block is the repeated unit that forms both the encoder and decoder. It wraps multi-head attention and a feed-forward network with residual connections and layer normalization.
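The wrapping pattern can be sketched independently of the sublayers themselves. The code below assumes the post-norm ordering, LayerNorm(x + Sublayer(x)), and omits the learnable gain and bias of layer normalization; `uniform_attention` and `toy_ffn` are placeholder sublayers invented for the example, not real attention or a trained feed-forward network.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalisation, token by token."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def transformer_block(x, attention_fn, ffn_fn):
    x = add_and_norm(x, attention_fn(x))               # attention sublayer
    x = add_and_norm(x, [ffn_fn(row) for row in x])    # position-wise FFN sublayer
    return x

def uniform_attention(x):
    """Placeholder: every token attends equally to all tokens."""
    n = len(x)
    mean_row = [sum(col) / n for col in zip(*x)]
    return [list(mean_row) for _ in x]

def toy_ffn(row):
    """Placeholder feed-forward step: an affine map followed by ReLU."""
    return [max(0.0, 2.0 * v - 0.5) for v in row]

x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.0], [3.0, 3.0, 1.0, 0.0]]
y = transformer_block(x, uniform_attention, toy_ffn)
print(len(y), len(y[0]))  # 3 4: the block preserves the input shape
```

Because the block maps an \(n \times d_{\text{model}}\) input to an output of the same shape, blocks can be stacked to any depth, and the residual paths give gradients a direct route through the stack.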
Every source token attends to every other source token. Bidirectional — can look forward and backward. Builds rich contextual representations: "bank" in "river bank" vs "bank account" gets different representations based on surrounding context.
Generated tokens attend only to previous tokens. Prevents peeking at future tokens during training. Causal/autoregressive: each position can only see what has already been generated. Essential for language modeling and generation.
Decoder queries attend to encoder keys/values. This is where the translation happens: each target word looks up relevant source words. Replaces the RNN bottleneck with direct, flexible attention across the full source sequence.
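The causal restriction in the decoder's self-attention is implemented by masking scores before the softmax. A minimal sketch, assuming a precomputed \(n \times n\) score matrix (all zeros here, purely for illustration):

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)                           # finite: position i is never masked
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]   # hypothetical raw scores
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Setting masked scores to negative infinity gives them exactly zero weight, so token \(i\) distributes its attention only over positions \(0\) through \(i\). Cross-attention uses the same softmax without the mask, with queries from the decoder and keys and values from the encoder.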
| Method | Time Complexity | Memory | Exact? | Notes |
|---|---|---|---|---|
| Full Self-Attention | O(n²d) | O(n²) | Yes | Baseline — quadratic bottleneck |
| Flash Attention v2 | O(n²d) | O(n) | Yes | Tiled SRAM computation; same math, better memory access pattern |
| Sparse Attention (Longformer) | O(nd·k) | O(nk) | Approx | Attend to k local + global tokens only |
| Linear Attention | O(nd²) | O(d²) | Approx | Kernel approximation; approximate softmax |
| State Space Models (Mamba) | O(nd) | O(d) | Different model | Not attention-based; RNN-style recurrence in latent space |
Flash Attention (Dao et al., 2022) computes exactly the same result as standard attention but tiles the computation to avoid materializing the full \(n\times n\) attention matrix in HBM (GPU DRAM). Instead, it processes blocks of Q, K, and V in on-chip SRAM and combines the partial results with a running softmax. Memory drops from \(O(n^2)\) to \(O(n)\) while compute remains \(O(n^2 d)\); wall-clock time still improves because far fewer reads and writes hit HBM, and SRAM bandwidth is roughly an order of magnitude higher than HBM's.
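The core trick that makes tiling exact is a running softmax that rescales earlier partial sums whenever a new block raises the maximum score. The sketch below shows that idea for a single query row in plain Python; it is not the fused GPU kernel, and the function names, block sizes, and toy data are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_full(q, keys, values):
    """Reference: softmax over all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

def attention_row_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])          # unnormalised weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # 0.0 on the first block
        exps = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(exps)
        acc = [a * correction + sum(e * v[i] for e, v in zip(exps, v_blk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.3, -1.2, 0.8]
keys = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0], [3.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]

full = attention_row_full(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block_size=2)
print(full, tiled)
```

Only one block of scores exists at any time, so the full row of \(n\) scores is never stored. The real algorithm also tiles over queries and keeps the running statistics in SRAM.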
Self-attention is equivalent to a fully-connected Graph Neural Network where every token is a node and attention weights are edge weights. The attention mechanism is a differentiable message-passing algorithm. Masked attention restricts the graph. This perspective shows how general transformers are: without positional information, a transformer computes a permutation-equivariant function of its input tokens, and masks or positional encodings break that symmetry where order matters.
A transformer with sufficiently many layers, heads, and sufficient dimension can approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary accuracy. Intuitively, this combines the FFN's universal approximation (applied to each token independently) with the attention mechanism's ability to aggregate global context. In practice both depth and width matter; the heuristic that \(L \propto N^{0.5}\) layers is optimal for a model with \(N\) parameters should be read as a rule of thumb, since empirical scaling-law studies report that loss depends only weakly on the exact depth-to-width ratio.
Architecture (BERT-style): Encoder stack only. Bidirectional: each token attends to all others (no causal mask).
Pre-training: Masked Language Modeling (MLM), which masks about 15% of tokens and predicts them, plus Next Sentence Prediction (NSP).
Best for: Understanding tasks — classification, NER, question answering, retrieval. Not for generation.
Representations: Each token gets a contextualized embedding conditioned on the full bidirectional context. Richer representations per token.
Architecture (GPT-style): Decoder stack only with causal masking. Each token attends only to previous tokens.
Pre-training: Autoregressive Language Modeling: predict the next token given all previous tokens. Simple, scalable, and trainable on any raw text without labels.
Best for: Generation tasks such as text generation, translation, code generation, and few-shot learning.
Representations: Each token is conditioned on its prefix only. Sequences are generated autoregressively, one token at a time, up to the model's context length.
| Model Family | Architecture | Pre-training | Key Models |
|---|---|---|---|
| BERT, RoBERTa | Encoder-only | MLM (bidirectional) | BERT, RoBERTa, ALBERT, DeBERTa; ELECTRA (replaced-token detection instead of MLM) |
| GPT, LLaMA | Decoder-only | Causal LM (left-to-right) | GPT-4, LLaMA 3, Mistral, PaLM |
| T5, BART | Encoder-Decoder | Text-to-text denoising | T5, BART, mT5, Flan-T5 |
| XLNet | Autoregressive (Transformer-XL based) | Permutation LM | XLNet (designed to address BERT's limitations) |
Dosovitskiy et al. (2020) showed that a pure transformer — with minimal modifications — applied directly to image patches achieves state-of-the-art on image classification when pre-trained at scale.
My chest X-ray classification used ViT-B/16 pretrained on ImageNet. With transfer learning, the 196 patch tokens attend to each other across the full 224×224 image, allowing the model to capture global patterns (like bilateral infiltrates) that local CNN filters can miss. Attention Rollout on the attention maps, which plays the role that Grad-CAM plays for CNNs, gives XAI explanations: high attention on radiologically relevant regions is evidence, though not proof, that the model relies on sensible features. ViT's global attention is a likely reason it outperforms VGG16 on detecting spatially distributed pathologies.
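The figure of 196 tokens follows directly from the patch arithmetic: \((224/16)^2 = 196\) patches, each flattened to \(16 \cdot 16 \cdot 3 = 768\) values for an RGB image. A minimal sketch, with a hypothetical `patchify` helper operating on a nested-list image:

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [channel
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for channel in pixel]
            patches.append(flat)
    return patches

# Blank 224 x 224 RGB image standing in for a real input.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))  # 196 768
```

In a ViT each flattened patch is then linearly projected to \(d_{\text{model}}\) and given a positional embedding, after which the sequence is processed like any other token sequence.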
The unified story in one thread:
The transformer's combination of (1) constant-path-length attention enabling true long-range dependencies, (2) full GPU parallelism enabling massive scale, (3) content-based dynamic routing enabling flexible context-dependent representations, and (4) empirical scaling laws enabling predictable improvement with compute has made it the dominant foundation of modern AI across language, vision, protein structure, code, and scientific discovery. Your GPT-based interactions, like every large language model built on this architecture, rely on the same four equations: \(\mathbf{Q}=\mathbf{X}\mathbf{W}^Q\), \(\mathbf{K}=\mathbf{X}\mathbf{W}^K\), \(\mathbf{V}=\mathbf{X}\mathbf{W}^V\), \(\text{Attention}=\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}\).