Deep Learning · Sequence Modeling

Attention Is All You Need: Transformers from First Principles

From the failure modes of RNNs, through the mathematical derivation of attention, to multi-head attention, positional encoding, and the complete encoder-decoder architecture. Every equation, every design choice.

Why RNNs Fail Q · K · V Derivation Scaled Dot-Product Multi-Head Attention Positional Encoding Encoder-Decoder Self-Attention O(n²) Layer Norm · FFN BERT vs GPT Vision Transformers
01
The Problem

Why RNNs Struggle

// sequential bottleneck · vanishing gradients · long-range dependencies · the information bottleneck

To understand why transformers displaced RNNs, you must first understand what they replaced, and why RNNs struggle with long-range dependencies no matter how large you make them.

The Sequential Bottleneck

An RNN processes a sequence one token at a time, updating a hidden state \(\mathbf{h}_t\):

RNN Hidden State Update · Recurrent
\[\mathbf{h}_t = f(\mathbf{W}_h\mathbf{h}_{t-1} + \mathbf{W}_x\mathbf{x}_t + \mathbf{b})\]
To compute h_t, you need h_{t-1}. To compute h_{t-1}, you need h_{t-2}. This creates an irreducible sequential dependency: you cannot compute h_t for all t simultaneously. A GPU can still parallelize across the batch and across hidden units, but the T time steps of one sequence must run one after another, so most of the hardware's parallel throughput goes unused on long sequences.
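The dependency chain is easy to see in code. The following is a minimal sketch of the update rule for a scalar hidden state; the function name `rnn_forward`, the choice f = tanh, and the weight values are illustrative assumptions rather than any library's API.

```python
import math

def rnn_forward(xs, w_h, w_x, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)."""
    h = h0
    states = []
    for x in xs:
        # Step t reads h from step t-1, so the iterations cannot run in parallel.
        h = math.tanh(w_h * h + w_x * x + b)
        states.append(h)
    return states

# Hypothetical toy inputs and weights.
states = rnn_forward([1.0, 0.5, -0.3, 0.8], w_h=0.9, w_x=0.5, b=0.0)
print(len(states))  # one hidden state per input token
```

The loop body is cheap; the cost comes from the fact that its iterations have to happen in order.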
Vanishing Gradients Through Time

To learn long-range dependencies, the gradient must flow backward through hundreds of timesteps. Each step multiplies by \(\mathbf{W}_h\) and \(f'(\cdot)\):

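Gradient Flow Through Time (BPTT) · Vanishing
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T}\prod_{t=2}^{T}\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}, \qquad \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(f'(\mathbf{a}_t))\,\mathbf{W}_h, \quad \mathbf{a}_t = \mathbf{W}_h\mathbf{h}_{t-1} + \mathbf{W}_x\mathbf{x}_t + \mathbf{b}\]
The factors in the product are ordered from t = T down to t = 2. If the largest singular value of W_h, multiplied by the maximum of |f'|, is below 1, the product of T−1 Jacobians shrinks to 0 exponentially fast (vanishing). If the spectral radius ρ(W_h) is well above 1 and the units are not saturated, the product can instead grow exponentially (exploding). Neither regime lets the gradient carry usable signal across T=1000 steps. LSTMs and GRUs mitigate this through gating but do not eliminate it, and they still process tokens sequentially.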
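A quick numeric check shows how fast the product collapses or blows up. This sketch assumes a scalar RNN in which every step contributes the same factor |w_h · f'(a_t)|, which is a simplification; the helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(per_step_factor, steps):
    """Magnitude of d h_T / d h_1 when each step multiplies the gradient by the same factor."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale

for factor in (0.9, 1.0, 1.1):
    # 100 steps are already enough to separate the three regimes.
    print(factor, gradient_scale(factor, 100))
```

A factor only 10% below 1 leaves a gradient on the order of 1e-5 after 100 steps; a factor 10% above 1 gives one on the order of 1e4.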
The Information Bottleneck

For seq2seq tasks (translation, summarization), an encoder RNN compresses the entire source sentence into a single fixed-size vector \(\mathbf{h}_T\). A 1000-word document must be squeezed into, say, 512 numbers. The decoder then generates the output from this bottleneck. Information is inevitably lost — especially from the beginning of the sequence.

The Fundamental Limitation

Three compounding problems: (1) No parallelism — training is slow regardless of hardware. (2) Vanishing gradients — early tokens effectively stop contributing to the gradient. (3) Fixed-size bottleneck — all source information must fit in one vector. The transformer solves all three simultaneously.

Model | Parallelizable? | Max Path Length | Long-Range Dependency | Memory per Token
RNN | No (strictly sequential) | O(n) | Poor (gradient decay) | O(1)
LSTM/GRU | No (sequential) | O(n) | Better but still limited | O(1)
1D CNN | Yes | O(n/k) stacked layers | Limited by kernel size | O(1)
Transformer | Yes (fully parallel) | O(1) | Strong (direct attention) | O(n)
02
Intuition

Attention Intuition

// from human attention · information retrieval · dynamic weighted average · relevance scoring

The attention mechanism solves the bottleneck problem by allowing every output token to look directly at every input token — without going through the sequential bottleneck. It's an information retrieval system embedded inside a neural network.

The Retrieval Metaphor

Think of a search engine. You have a query (what you're looking for), a set of keys (metadata describing database entries), and values (the actual content). You compute similarity between query and each key, then retrieve a weighted blend of values:

Information Retrieval Analogy · Intuition
\[\text{Query } q \text{ matches Key } k_i \to \text{similarity score } s_i = q \cdot k_i\] \[\text{Weights: } \alpha_i = \text{softmax}(\mathbf{s})_i = \frac{e^{s_i}}{\sum_j e^{s_j}}\] \[\text{Output: } o = \sum_i \alpha_i v_i \quad \text{(weighted blend of Values)}\]
In a hard database lookup, you retrieve exactly one record (the best match). Attention is a "soft" lookup — you retrieve a weighted sum of all values, where the weights reflect relevance. The softmax ensures weights sum to 1 and are non-negative. Every token can directly attend to every other token — path length is O(1).
[Figure 1: input tokens "The", "cat", "sat", "on", "mat"; the query for "cat" assigns them attention weights 0.05, 0.20, 0.08, 0.05, 0.62; the value vectors v_The through v_mat are blended into context(cat) ≈ 0.62·v_mat + 0.20·v_cat + ...]
Fig 1. Attention from "cat" to all other tokens. "cat" attends most strongly to "mat" (its likely reference in context), retrieving a context vector dominated by mat's value representation.
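The soft lookup described above fits in a few lines. This is a minimal sketch in plain Python (a real model would use a tensor library); the names `softmax` and `soft_lookup` and the toy keys and values are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Score the query against every key, then blend the values by those weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical toy database with three entries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = soft_lookup([2.0, 0.0], keys, values)
print(weights)
print(output)
```

The query matches the first and third keys equally well, so they share most of the weight, and the second entry still contributes a little. That residual contribution is what makes the lookup "soft".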
03
Derivation

Query–Key–Value Derivation

// where Q, K, V come from · learned projections · why three matrices · mathematical motivation

The Query-Key-Value framework is not arbitrary — it is the result of asking: "how do we make the attention weights learnable and differentiable?" We derive it from first principles.

From Token Embeddings to Q, K, V

Start with \(n\) token embeddings: \(\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}\). Each row \(\mathbf{x}_i\) is one token. For each token, we want to compute three different linear projections — each encoding a different aspect of the token for a different role:

Q, K, V Projections · Core Definition
\[\mathbf{Q} = \mathbf{X}\mathbf{W}^Q \in \mathbb{R}^{n \times d_k} \quad \text{(Queries: what am I looking for?)}\] \[\mathbf{K} = \mathbf{X}\mathbf{W}^K \in \mathbb{R}^{n \times d_k} \quad \text{(Keys: what do I contain?)}\] \[\mathbf{V} = \mathbf{X}\mathbf{W}^V \in \mathbb{R}^{n \times d_v} \quad \text{(Values: what do I want to share?)}\] \[\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d_{\text{model}}\times d_k}, \quad \mathbf{W}^V \in \mathbb{R}^{d_{\text{model}}\times d_v}\]
The three projection matrices W^Q, W^K, W^V are learned. Using separate matrices for each role gives the model the freedom to represent "what to search for" independently from "what to advertise" independently from "what to contribute." If W^Q = W^K = W^V = I (identity), you get simple dot-product similarity — but the model can't learn to specialize.
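To make the shapes concrete, here is a small sketch of the three projections. The sizes (n = 3, d_model = 8, d_k = 4, d_v = 2) and the random weights are hypothetical; in a trained model the matrices are learned.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_k, d_v = 3, 8, 4, 2  # hypothetical toy sizes

X = random_matrix(n, d_model, rng)      # one row per token
W_q = random_matrix(d_model, d_k, rng)  # query projection
W_k = random_matrix(d_model, d_k, rng)  # key projection
W_v = random_matrix(d_model, d_v, rng)  # value projection

Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
print(len(Q), len(Q[0]))  # n rows, d_k columns
print(len(V), len(V[0]))  # n rows, d_v columns
```

Q and K must share the dimension d_k because they are compared by dot product; V can use a different dimension d_v because it is only ever averaged.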
Why Separate Queries and Keys?

Query = "What am I looking for?"

The query for token \(i\) encodes what information token \(i\) needs to compute its output. When translating "cat" to French, the query might encode "I need a noun that refers to a feline animal."

Different tokens have different queries: "cat" searches for its referent; "sat" searches for its subject and object.

Key = "What can I provide?"

The key for token \(j\) encodes what that token contains that might be relevant to others. "mat" has a key that advertises "I am a physical object, a location."

The query-key dot product measures compatibility: does what token \(i\) needs match what token \(j\) provides?

Value = "What information do I share?"

The value for token \(j\) is the actual content that gets aggregated. Once token \(i\) decides to attend to token \(j\) (via high query-key similarity), it receives \(v_j\).

The value can encode different information than the key. A token can advertise one thing (key) but share another (value).

The Key Insight

Separating Q, K, V gives three degrees of freedom: (1) What to search for, (2) what to make findable, (3) what to contribute when found. Using a single vector for all three would conflate these distinct roles.

Training learns all three matrices jointly via backpropagation through the attention weights.

04
Core Algorithm

Scaled Dot-Product Attention

// the full formula · why scale · softmax as competition · complete derivation
Scaled Dot-Product Attention (Vaswani et al. 2017)

\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} \in \mathbb{R}^{n \times d_v}\]

Step-by-Step Breakdown
// Complete derivation and interpretation of each operation
Step 1
Compute raw attention scores: \(\mathbf{S} = \mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}\)
S_{ij} = q_i · k_j = dot product of query for position i with key for position j. Measures "how much does token i want to attend to token j?" High values: strong match. Low/negative: weak or no match. Each of the n² entries is computed in parallel by the matrix multiply.
Step 2
Scale by \(1/\sqrt{d_k}\): \(\mathbf{S}_{\text{scaled}} = \mathbf{S}/\sqrt{d_k}\)
Why? For q, k ~ N(0,1) components: dot product q·k has variance d_k (sum of d_k products of unit-variance numbers). So std(S_{ij}) ≈ √d_k. Without scaling, for large d_k these scores have huge variance → softmax saturates (all weight on max, gradients vanish). Dividing by √d_k brings variance back to O(1).
Step 3
Apply softmax row-wise: \(\mathbf{A} = \text{softmax}(\mathbf{S}_{\text{scaled}}) \in \mathbb{R}^{n\times n}\)
Each row of A sums to 1 and is non-negative — it's a probability distribution over positions. A_{ij} is the attention weight token i assigns to token j. Softmax creates competition: attending strongly to one position reduces attention to others.
Step 4
Weighted sum of values: \(\text{Output} = \mathbf{A}\mathbf{V} \in \mathbb{R}^{n \times d_v}\)
Output_i = Σ_j A_{ij} · v_j = weighted average of all value vectors, weighted by how much token i attends to each position j. This is the "retrieved information" — a blended representation containing contributions from all tokens the current token cares about.
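The four steps map one-to-one onto code. The sketch below uses plain Python lists so every operation is visible; the function names and the toy matrices are illustrative assumptions, and production code would call a tensor library instead.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    d_k = len(q[0])
    scores = matmul(q, transpose(k))                                # Step 1: S = Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # Step 2: divide by sqrt(d_k)
    weights = [softmax(row) for row in scaled]                      # Step 3: row-wise softmax
    return matmul(weights, v), weights                              # Step 4: A V

# Hypothetical toy inputs: 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output row is a convex combination of the rows of V, so it always lies within the range spanned by the value vectors.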
Why Scaling by √d_k is Critical
Scaling Analysis: Variance Computation · Mathematics
\[\mathbf{q}, \mathbf{k} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{d_k}) \Rightarrow \mathbf{q}^T\mathbf{k} = \sum_{j=1}^{d_k}q_j k_j\] \[\mathbb{E}[q_j k_j] = 0, \quad \text{Var}(q_j k_j) = 1 \Rightarrow \text{Var}(\mathbf{q}^T\mathbf{k}) = d_k\] \[\text{Without scaling: } \text{softmax}(\mathbf{s}) \approx \mathbf{e}_{\text{argmax}} \text{ (one-hot) as } d_k \to \infty\] \[\text{Gradient of one-hot softmax} \approx \mathbf{0} \text{ everywhere except argmax}\]
Practical example: d_k = 512. Without scaling, scores are approximately N(0, 512), so std ≈ 22.6. A maximum score about 3 std above the mean is ≈ 68, while typical scores sit near 0, so the softmax puts weight ≈ 1 on the maximum and ≈ 0 on everything else, and the gradient is nearly zero. With 1/√512 scaling, scores are approximately N(0, 1): the softmax stays smooth and gradients flow.
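The variance claim can be checked empirically. This sketch draws random unit-variance queries and keys and measures the variance of their dot products; the sample count, the seed, and the function name `dot_product_variance` are arbitrary choices made for illustration.

```python
import random
import statistics

def dot_product_variance(d_k, samples=4000, seed=0):
    """Empirical variance of q·k for q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

d_k = 64
raw_variance = dot_product_variance(d_k)
scaled_variance = raw_variance / d_k  # Var(x / sqrt(d_k)) = Var(x) / d_k
print(raw_variance)     # close to d_k
print(scaled_variance)  # close to 1
```

Dividing the scores by √d_k divides their variance by d_k, which is why this particular scaling factor restores unit variance.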
Masked Attention — For Decoder
Masked Scaled Dot-Product Attention · Causal Mask
\[\text{MaskedAttention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T + \mathbf{M}}{\sqrt{d_k}}\right)\mathbf{V}\] \[\mathbf{M}_{ij} = \begin{cases}0 & j \leq i \\ -\infty & j > i\end{cases}\]
The causal mask M prevents position i from attending to future positions j > i. Adding −∞ before softmax makes those weights exactly 0 (softmax(−∞) = 0). This ensures the decoder cannot "cheat" by looking at tokens it hasn't generated yet. Essential for autoregressive language models.
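A short sketch shows the mask doing its job. It builds M for n = 3 and applies a row-wise softmax to hypothetical scores; the helper names are assumptions, and the code relies on `math.exp(float("-inf"))` evaluating to exactly 0.0.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 for j <= i (visible), -inf for j > i (future)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax_rows(scores, mask):
    rows = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        top = max(masked)  # finite: the diagonal entry is never masked
        exps = [math.exp(x - top) for x in masked]  # masked entries become 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical scaled scores for 3 positions.
scores = [[0.5, 1.0, -0.2],
          [0.1, 0.3, 2.0],
          [1.5, 0.0, 0.7]]
weights = masked_softmax_rows(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

Position 0 can only attend to itself, so its row is one-hot; the last position sees the whole prefix.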
05
Multi-Head

Multi-Head Attention

// parallel attention heads · different representation subspaces · concatenation · projection

A single attention head can only capture one type of relationship. Multi-head attention runs \(h\) attention mechanisms in parallel — each in a different learned subspace — and concatenates their outputs.

Multi-Head Attention: Full Definition · Architecture
\[\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)\] \[\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O\] \[\mathbf{W}_i^Q, \mathbf{W}_i^K \in \mathbb{R}^{d_{\text{model}}\times d_k},\quad \mathbf{W}_i^V \in \mathbb{R}^{d_{\text{model}}\times d_v},\quad \mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}\]
Here Q, K, V denote the inputs to the layer; for self-attention all three are the same matrix X. In "Attention Is All You Need": h=8 heads, d_k = d_v = d_model/h = 64 (for d_model=512). Total weights per MHA, ignoring biases: h(d_model·d_k + d_model·d_k + d_model·d_v) + h·d_v·d_model = d_model·(2d_k·h + d_v·h) + d_model² = 4d_model² when h·d_k = h·d_v = d_model. Computational cost is similar to a single head with d_k = d_model, because each of the h parallel heads works in a dimension h times smaller.
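The parameter count is simple to verify. This sketch assumes d_k = d_v = d_model / h and ignores bias terms; the function name `mha_parameter_count` is hypothetical.

```python
def mha_parameter_count(d_model, num_heads):
    """Weights in one multi-head attention layer, without biases."""
    d_k = d_v = d_model // num_heads
    per_head = d_model * d_k + d_model * d_k + d_model * d_v  # W_i^Q, W_i^K, W_i^V
    output_projection = num_heads * d_v * d_model              # W^O
    return num_heads * per_head + output_projection

count = mha_parameter_count(d_model=512, num_heads=8)
print(count)  # → 1048576, i.e. 4 * 512**2
```

The total does not depend on the number of heads as long as h·d_k = d_model: adding heads splits the same budget into more, smaller subspaces.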
Why Multiple Heads?
  • Different relationship types simultaneously: Head 1 might learn syntactic dependencies; head 2 coreference resolution; head 3 positional proximity. Each head specializes in a different linguistic pattern without explicit supervision.
  • Multiple representation subspaces: Each head projects the embeddings to a different \(d_k\)-dimensional subspace. Token relationships that aren't visible in the full \(d_{\text{model}}\)-dimensional space may become clear in a specialized subspace.
  • Several gradient paths: the loss reaches the input through \(h\) separate sets of Q, K, V projections. This is often argued to make optimization more stable than with a single high-dimensional head, though the effect is hard to isolate.
  • Empirical evidence: Removing attention heads during inference shows different heads learn qualitatively different behaviors — some are removable without impact; others are critical and task-specific.
Computation Cost
Multi-Head Attention: Complexity · Cost
\[\text{Cost per head: } O(n^2 d_k + n d_k d_{\text{model}})\] \[\text{Total (h heads): } O(n^2 d_k h + n d_{\text{model}}^2) = O(n^2 d_{\text{model}} + n d_{\text{model}}^2)\]
The n² term (the attention matrix) is the bottleneck for long sequences. For n < d_model the d_model² projection term dominates; for n > d_model the n² term dominates. This is the "quadratic bottleneck". Longformer and BigBird reduce it with sparse attention patterns; Flash Attention keeps the O(n²) compute but avoids storing the n × n matrix.
06
Position

Positional Encoding

// attention is permutation equivariant · injecting position · sinusoidal · learned · RoPE

Self-attention is permutation-equivariant: shuffle the input tokens and the output shuffles identically. The model has no built-in notion of order. Without positional encoding, "The cat sat on the mat" and "mat the on sat cat The" would give each token exactly the same output vector, just in a different row.

Sinusoidal Positional Encoding (Original)
Sinusoidal Positional Encoding · Vaswani 2017
\[\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)\] \[\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)\] \[\mathbf{x}'_{\text{pos}} = \mathbf{x}_{\text{pos}} + \text{PE}(\text{pos}, :) \quad \text{(added to input embeddings)}\]
pos is the position in the sequence (0, 1, 2, ...). i is the dimension index (0, 1, ..., d_model/2 − 1). Each dimension oscillates at a different frequency — from very fast (2π period) to very slow (2π·10000 period). The model can learn to attend by relative position because PE(pos+k, :) can be expressed as a linear function of PE(pos, :) via rotation matrices.
Why Sinusoidal Encodings Work
// Relative position encoding via rotation matrices
1
Consider dimension pair \((2i, 2i+1)\) as a 2D vector: \(\text{PE}_{\text{pos}} = [\sin(\omega_i\text{pos}), \cos(\omega_i\text{pos})]\) where \(\omega_i = 10000^{-2i/d}\).
2
Shift by \(k\) positions: \(\text{PE}_{\text{pos}+k} = [\sin(\omega_i(\text{pos}+k)), \cos(\omega_i(\text{pos}+k))]\)
3
Using angle addition formulas: \(\text{PE}_{\text{pos}+k} = \mathbf{R}(\omega_i k)\,\text{PE}_{\text{pos}}\) where, for the \([\sin, \cos]\) ordering above, \(\mathbf{R}(\theta) = \begin{pmatrix}\cos\theta & \sin\theta \\ -\sin\theta & \cos\theta\end{pmatrix}\)
The rotation matrix R(ω_i k) depends only on the offset k, not on the absolute position. For each frequency pair, the dot product PE_pos^T PE_{pos+k} = PE_pos^T R(ω_i k) PE_pos = cos(ω_i k), a function of the relative offset only. The model can learn to detect "k tokens apart" patterns without depending on absolute positions. ∎
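The shift property can be checked numerically. The sketch below builds the sinusoidal encoding with interleaved (sin, cos) pairs and applies, to each pair, the matrix [[cos θ, sin θ], [−sin θ, cos θ]] with θ = ω_i·k, which is the form that follows from the angle addition formulas for this ordering. The function names are hypothetical.

```python
import math

def positional_encoding(pos, d_model):
    """Interleaved (sin, cos) pairs, one pair per frequency."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift_encoding(pe, k):
    """Apply the per-pair linear map that moves PE(pos) to PE(pos + k)."""
    d_model = len(pe)
    shifted = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        shifted.append(math.cos(theta) * s + math.sin(theta) * c)   # sin(a + theta)
        shifted.append(-math.sin(theta) * s + math.cos(theta) * c)  # cos(a + theta)
    return shifted

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 8
shifted = shift_encoding(positional_encoding(5, d_model), 3)
target = positional_encoding(8, d_model)
print(max(abs(a - b) for a, b in zip(shifted, target)))  # rounding error only

# The dot product depends on the offset (3 here), not on the absolute positions.
near = dot(positional_encoding(2, d_model), positional_encoding(5, d_model))
far = dot(positional_encoding(40, d_model), positional_encoding(43, d_model))
print(near, far)
```

`shift_encoding` never looks at `pos`, only at the offset k, which is the property the derivation establishes.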
Modern Alternatives
Learned Positional Embeddings
PE ∈ ℝ^{max_len × d_model} (trainable)

Learn a separate embedding for each position index, trained jointly with the model. Used by BERT, GPT-2. Fixed maximum sequence length; doesn't generalize beyond training length.

Rotary Position Embedding (RoPE)
q̃ᵢ = R(θ·pos)·qᵢ

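Multiply Q and K by rotation matrices that depend on position. Relative position naturally emerges in the dot product. Used in LLaMA, GPT-NeoX, PaLM. Extrapolates to unseen lengths better than learned absolute embeddings, although very long contexts usually need additional frequency-scaling adjustments.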

ALiBi (Attention with Linear Biases)
Sᵢⱼ = qᵢᵀkⱼ − m·|i−j|

Add a linear penalty proportional to distance between positions. No learned parameters. Strong generalization to longer sequences at inference. Used in BLOOM, MPT.

No Positional Encoding
Input order not encoded explicitly

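Some recent work drops explicit PE entirely. A causally masked decoder can infer position implicitly from how many tokens each position is allowed to see, and recurrent architectures such as RWKV get order from the recurrence itself. The bag-of-words model is the extreme case: no position information at all.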
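The RoPE claim that relative position emerges in the dot product can be verified directly. This is a simplified sketch that rotates consecutive pairs of dimensions with the same frequency schedule as the sinusoidal encoding; real implementations differ in how they pair dimensions, and the names `rotate_pairs` and `rope_score` are hypothetical.

```python
import math

def rotate_pairs(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    rotated = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def rope_score(q, k, q_pos, k_pos):
    """Attention score between a query at q_pos and a key at k_pos."""
    rq, rk = rotate_pairs(q, q_pos), rotate_pairs(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -1.0, 0.7, 0.4]

same_offset_a = rope_score(q, k, q_pos=3, k_pos=1)
same_offset_b = rope_score(q, k, q_pos=10, k_pos=8)
other_offset = rope_score(q, k, q_pos=3, k_pos=2)
print(same_offset_a, same_offset_b, other_offset)
```

The first two scores agree because both pairs of positions are two apart; changing the offset changes the score.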

07
Building Block

The Transformer Block

// layer normalization · residual connections · feed-forward network · pre-norm vs post-norm

The transformer block is the repeated unit that forms both the encoder and decoder. It wraps multi-head attention and a feed-forward network with residual connections and layer normalization.

Transformer Block (Post-LayerNorm, original) · Block Equations
\[\mathbf{Z} = \text{LayerNorm}\!\left(\mathbf{X} + \text{MultiHead}(\mathbf{X},\mathbf{X},\mathbf{X})\right)\] \[\mathbf{Y} = \text{LayerNorm}\!\left(\mathbf{Z} + \text{FFN}(\mathbf{Z})\right)\] \[\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\] \[\mathbf{W}_1 \in \mathbb{R}^{d_{\text{model}}\times d_{\text{ff}}}, \quad d_{\text{ff}} = 4d_{\text{model}} \text{ (typically)}\]
Pre-LayerNorm (modern standard): LayerNorm applied before attention/FFN. More stable training; used in GPT-3, LLaMA. Post-LayerNorm: original paper; applied after residual add. Layer normalization normalizes each token's representation independently across the d_model dimension: LN(x) = (x − μ)/σ · γ + β.
Why Each Component Is Essential
  • Residual connections (+X): Allow gradients to flow directly from output to every layer — circumventing vanishing gradients exactly as in ResNets. Without them, training deep transformers is unstable. The residual path also allows earlier representations to pass through unchanged if no transformation is needed.
  • Layer Normalization: Normalizes the activation within each token's embedding (across the d_model dimension, not across the batch). Stabilizes training, makes the model less sensitive to learning rate, and helps maintain signal through deep networks. Different from BatchNorm: no dependency on batch size.
  • Feed-Forward Network (FFN): Applied independently and identically to each token position. It acts as a learned "memory" — research suggests FFN layers store factual associations (e.g., "Paris is in France"). The 4× expansion (d_ff = 4·d_model) then projection gives the model capacity to transform each token's representation nonlinearly.
Layer Normalization · Normalization
\[\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \boldsymbol{\mu}}{\boldsymbol{\sigma}}\odot\boldsymbol{\gamma} + \boldsymbol{\beta}, \quad \boldsymbol{\mu} = \frac{1}{d}\sum_{j=1}^d x_j, \quad \boldsymbol{\sigma} = \sqrt{\frac{1}{d}\sum_{j=1}^d(x_j-\mu)^2 + \epsilon}\]
γ, β ∈ ℝ^d are learned per-dimension scale and shift (trainable). Applied per token (per row of X) — unlike BatchNorm which normalizes per feature across the batch. ε ≈ 1e-5 for numerical stability. RMSNorm (used in LLaMA): remove the mean subtraction, only divide by RMS — cheaper and often performs as well.
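The normalization formulas and the two residual orderings can be sketched for a single token vector. The sublayer here is a stand-in (it just halves its input) so that the ordering is the only thing being compared; all function names are hypothetical.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector across its d features, then scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean subtraction, divide by the root mean square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def post_ln(x, sublayer, norm):
    """Original ordering: norm(x + sublayer(x))."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer, norm):
    """Pre-LN ordering: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
norm = lambda v: layer_norm(v, ones, zeros)
toy_sublayer = lambda v: [0.5 * t for t in v]  # placeholder for attention or FFN

normalized = layer_norm(x, ones, zeros)
rms_normalized = rms_norm(x, ones)
post = post_ln(x, toy_sublayer, norm)
pre = pre_ln(x, toy_sublayer, norm)
print(sum(post) / 4, sum(pre) / 4)
```

In the pre-LN ordering the residual stream itself is never normalized, so the input x passes through unchanged and its mean survives; in the post-LN ordering every block output is re-normalized.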
08
Architecture

Encoder–Decoder Structure

// encoder self-attention · decoder masked self-attention · cross-attention · the information flow
[Figure 2 diagram. Encoder (×N layers): Input Embeddings + PE → Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward Network → Add & LayerNorm → encoder output (all token representations). Decoder (×N layers): Output Embeddings + PE → Masked Multi-Head Self-Attention → Add & LayerNorm → Multi-Head Cross-Attention (Q = decoder; K, V = encoder) → Add & LayerNorm → Feed-Forward Network → Add & Norm → Linear → Softmax → output probabilities.]
Fig 2. Full encoder-decoder Transformer architecture. Left: Encoder stack — bidirectional self-attention reads the source. Right: Decoder stack — masked self-attention for target, cross-attention reads encoder output (K, V from encoder, Q from decoder).
Three Types of Attention in the Transformer
Encoder Self-Attention
Q=K=V=X_encoder

Every source token attends to every other source token. Bidirectional — can look forward and backward. Builds rich contextual representations: "bank" in "river bank" vs "bank account" gets different representations based on surrounding context.

Decoder Masked Self-Attention
Q=K=V=X_decoder + causal mask

Generated tokens attend only to previous tokens. Prevents peeking at future tokens during training. Causal/autoregressive: each position can only see what has already been generated. Essential for language modeling and generation.

Decoder Cross-Attention
Q=X_decoder, K=V=X_encoder

Decoder queries attend to encoder keys/values. This is where the translation happens: each target word looks up relevant source words. Replaces the RNN bottleneck with direct, flexible attention across the full source sequence.
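Cross-attention differs from self-attention only in where the inputs come from, which shows up in the shapes. In this sketch the learned projections are omitted (treated as identity) to keep the focus on shapes, which is a simplifying assumption; the function name `attend` and the toy states are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention; queries and keys may differ in count."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]              # m = 2 target tokens

# Queries come from the decoder; keys and values come from the encoder.
context, weights = attend(decoder_states, encoder_states, encoder_states)
print(len(weights), len(weights[0]))  # m rows, n columns
```

The attention matrix is m × n rather than square: one row per target position, one column per source position.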

09
Complexity

Self-Attention Complexity

// O(n²) bottleneck · sparse attention · Flash Attention · linear transformers
Self-Attention Computational Cost · Complexity
\[\text{Time: } O(n^2 d) \quad \text{Memory: } O(n^2 + nd)\] \[\text{n = sequence length, d = model dimension}\]
The n² term is the attention matrix QK^T ∈ ℝ^{n×n}. For n=512: 262,144 entries. For n=32,000 (roughly the 32K context of GPT-4): 1,024,000,000 entries, about 4 GB in fp32 for a single head in a single layer if the matrix is materialized. This is the quadratic bottleneck that limits context length.
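The arithmetic behind these figures, assuming 4 bytes per entry (fp32) and one materialized matrix for a single head in a single layer; the function name is hypothetical.

```python
def attention_matrix_bytes(n, bytes_per_entry=4):
    """Memory needed to store one n x n attention matrix."""
    return n * n * bytes_per_entry

entries_512 = 512 * 512
bytes_32k = attention_matrix_bytes(32000)
print(entries_512)      # → 262144
print(bytes_32k / 1e9)  # → 4.096 (GB)
```

Doubling the sequence length quadruples this number, which is why long contexts are expensive well before the model weights become the limiting factor.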
Method | Time Complexity | Memory | Exact? | Notes
Full Self-Attention | O(n²d) | O(n²) | Yes | Baseline; quadratic bottleneck
Flash Attention v2 | O(n²d) | O(n) | Yes | Tiled SRAM computation; same math, better memory access pattern
Sparse Attention (Longformer) | O(nd·k) | O(nk) | Approx | Attend to k local + global tokens only
Linear Attention | O(nd²) | O(d²) | Approx | Kernel approximation of softmax
State Space Models (Mamba) | O(nd) | O(d) | Different model | Not attention-based; RNN-style recurrence in latent space
Flash Attention — The Hardware Perspective

Flash Attention (Dao et al., 2022) achieves the same mathematical result as standard attention but uses tiled computation to avoid materializing the full \(n\times n\) attention matrix in HBM (GPU DRAM). Instead, it computes attention in SRAM tiles. Memory goes from \(O(n^2)\) to \(O(n)\) while compute remains \(O(n^2 d)\), but wall-clock time improves because on-chip SRAM has roughly an order of magnitude more bandwidth than HBM:

Flash Attention Tiled Computation · IO Complexity
\[\text{HBM reads/writes: Standard Attn} = O(n^2 + nd)\] \[\text{HBM reads/writes: Flash Attn} = O(n^2d^2 / M) \quad M = \text{SRAM size}\]
For M ≈ 100KB (typical on-chip SRAM per streaming multiprocessor) and d = 64, d² is many times smaller than M, so Flash Attention cuts HBM traffic in the n² term by a factor of roughly M/d², several-fold in practice. This translates to roughly 2-4× wall-clock speedup and enables training on much longer sequences. Flash-style kernels are now available in the major deep learning frameworks; PyTorch, for example, can dispatch torch.nn.functional.scaled_dot_product_attention to one.
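The trick that makes tiling exact is an online softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new block raises the maximum. The sketch below shows this for a single query row in plain Python. It illustrates the numerical idea only, not the GPU kernel, and the function names and toy data are hypothetical.

```python
import math

def scores_for(q, keys):
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_row_full(q, keys, values):
    """Reference: materialize all scores, then softmax, then blend."""
    scores = scores_for(q, keys)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_row_tiled(q, keys, values, block_size):
    """Same result, visiting keys/values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_scores = scores_for(q, keys[start:start + block_size])
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.8, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, -0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = attention_row_full(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(full, tiled)))  # rounding error only
```

The tiled version holds only one block of scores at a time, which is why the full n × n matrix never needs to exist in memory.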
10
Understanding

Why Transformers Work

// inductive biases · universal approximation · empirical evidence · interpretability
Five Reasons Transformers Win
  • Constant-length dependency path: Every pair of tokens is connected in exactly one attention step — O(1) path length regardless of sequence length. RNNs have O(n) path length for distant tokens. This directly enables learning long-range dependencies.
  • Full parallelism: All attention operations across all positions can be computed simultaneously on a GPU. This makes transformers far more practical to train than RNNs on modern hardware: higher utilization lets the same budget cover more data and larger models.
  • Content-based routing: Attention weights are computed dynamically from the content of the tokens — unlike convolutions which use fixed local filters. The model can learn to attend to the token "bank" in "river bank" differently from "bank" in "bank account" based on context.
  • Hierarchical feature learning through depth: Each layer builds on the previous. Layer 1 attention may learn syntactic patterns; layer 12 may learn semantic relationships; layer 24 may encode factual associations. Depth enables composition.
  • Scalability with compute and data: Transformers follow the scaling laws (Kaplan et al. 2020): validation loss decreases as a power law with compute, data, and parameters, and in that study transformers scaled better than LSTMs. This predictability underpins GPT-4, PaLM, and LLaMA-scale models.
Transformers as Graph Neural Networks
Unifying Framework

Self-attention can be viewed as a fully-connected Graph Neural Network where every token is a node and attention weights are edge weights. The attention mechanism is a differentiable message-passing algorithm. Masked attention restricts the graph. This perspective shows that a transformer layer without positional encoding computes a permutation-equivariant function of the input tokens, which is why order information has to be injected separately.

Universal Approximation for Sequences

A transformer with sufficiently many layers, heads, and sufficient dimension can, given positional encodings, approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary accuracy. Intuitively, this follows from the FFN's universal approximation (each token independently), combined with the attention mechanism's ability to aggregate global context. In practice, the scaling-law experiments of Kaplan et al. (2020) found that loss depends mainly on total parameter count and only weakly on the depth-to-width ratio, so there is no single optimal depth rule.

11
Architectures

BERT vs GPT — Encoder vs Decoder

// encoder-only · decoder-only · seq2seq · pre-training objectives

BERT — Encoder-Only

Architecture: Encoder stack only. Bidirectional — each token attends to all others (no causal mask).

Pre-training: Masked Language Modeling (MLM): mask 15% of tokens, predict them. Next Sentence Prediction (NSP).

Best for: Understanding tasks — classification, NER, question answering, retrieval. Not for generation.

Representations: Each token gets a contextualized embedding conditioned on the full bidirectional context. Richer representations per token.

GPT — Decoder-Only

Architecture: Decoder stack only with causal masking. Each token attends only to previous tokens.

Pre-training: Autoregressive Language Modeling: predict next token given all previous tokens. Simple and scalable: any raw text serves as training data, with no labels required.

Best for: Generation tasks — text generation, translation, code generation, few-shot learning.

Representations: Each token conditioned on prefix only. Can generate arbitrarily long sequences autoregressively.

Model Family | Architecture | Pre-training | Key Models
BERT, RoBERTa | Encoder-only | MLM (bidirectional) | BERT, RoBERTa, ALBERT, DeBERTa, ELECTRA
GPT, LLaMA | Decoder-only | Causal LM (left-to-right) | GPT-4, LLaMA 3, Mistral, PaLM
T5, BART | Encoder-Decoder | Text-to-text denoising | T5, BART, mT5, Flan-T5
XLNet | Transformer-XL backbone (two-stream attention) | Permutation LM | XLNet (designed to overcome BERT's masking limitations)
12
Computer Vision

Vision Transformers (ViT)

// patch embeddings · class token · no convolutions · XAI application

Dosovitskiy et al. (2020) showed that a pure transformer — with minimal modifications — applied directly to image patches achieves state-of-the-art on image classification when pre-trained at scale.

Vision Transformer: Input Preparation · ViT
\[\text{Image } \mathbf{I} \in \mathbb{R}^{H \times W \times C} \xrightarrow{\text{split}} N = \frac{HW}{P^2} \text{ patches of size } P\times P\] \[\mathbf{x}_p \in \mathbb{R}^{P^2 C} \xrightarrow{\mathbf{E}} \mathbf{z}_p = \mathbf{x}_p\mathbf{E} + \mathbf{e}_p^{\text{pos}} \quad \mathbf{E} \in \mathbb{R}^{P^2C \times D}\] \[\text{Input sequence: } [\mathbf{z}_{\text{cls}}; \mathbf{z}_1; \mathbf{z}_2; \ldots; \mathbf{z}_N]\]
P=16: each patch is 16×16 pixels. For a 224×224 image: N = 224²/16² = 196 patches. A learnable [CLS] token is prepended, and its final representation is used for classification. Positional embeddings (learned or sinusoidal) allow the model to distinguish spatial locations. For ViT-B/16 (the chest X-ray model discussed below): D=768, 12 heads, 12 layers.
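The patch arithmetic and the splitting step can be sketched as follows. The helper names are hypothetical, and `patchify` works on a tiny single-channel image stored as nested lists purely for illustration.

```python
def vit_sequence_shape(height, width, channels, patch):
    """Return (number of patches, flattened patch size, sequence length with [CLS])."""
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim, num_patches + 1

def patchify(image, patch):
    """Split a 2D single-channel image into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

shape = vit_sequence_shape(224, 224, 3, 16)
print(shape)  # → (196, 768, 197)

tiny = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
tiny_patches = patchify(tiny, 2)
print(tiny_patches[0])  # → [0, 1, 4, 5]
```

For 16×16 RGB patches the flattened patch size is 16·16·3 = 768, which happens to equal D for ViT-B; the projection E is still a learned matrix.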
My ViT-B/16 Chest X-Ray Work

My chest X-ray classifier used ViT-B/16 pretrained on ImageNet. Transfer learning: the 196 patch tokens attend to each other across the full 224×224 image, allowing the model to capture global patterns (like bilateral infiltrates) that small local CNN filters tend to miss. For XAI, Attention Rollout aggregates the attention maps across layers (Grad-CAM is a separate, gradient-based alternative); high attention on radiologically relevant regions is evidence, though not proof, that the model relies on sensible features. ViT's global attention is a likely reason it outperformed VGG16 on detecting spatially distributed pathologies.

13
Mental Model

The Complete Mental Model

// everything unified · one diagram · one thread
[Figure 3 diagram. Attention(Q,K,V) = softmax(QKᵀ/√d_k)V, with Q = XW^Q (what am I looking for?), K = XW^K (what do I contain?), V = XW^V (what do I share?) → Multi-Head Attention (×h): Concat(head₁,...,headₕ)·W^O → Positional Encoding (sin/cos or RoPE) → Transformer Block (MHA + LN + FFN + residuals) → BERT (encoder, MLM, NLU) · GPT / LLaMA (decoder, CLM, generation) · T5 / BART (encoder-decoder, seq2seq) · ViT / DINO (vision, patches). RNN O(n) path → attention O(1) path · sequential → parallel · fixed bottleneck → direct token-to-token access.]
Fig 3. Complete transformer mental map — from Q/K/V through multi-head attention, positional encoding, transformer block, to encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5), and vision (ViT) architectures.

The unified story in one thread:

  1. RNNs fail at long-range dependencies because information must travel through \(O(n)\) sequential steps, gradients vanish exponentially, and all information must pass through a fixed-size bottleneck.
  2. Attention solves this by allowing direct token-to-token communication in one step: each token computes a weighted sum of all other tokens' values, weighted by relevance. Path length becomes \(O(1)\).
  3. Q, K, V come from learned linear projections of the same input: Queries encode "what to look for," Keys encode "what I contain," Values encode "what to contribute." Separating these three roles gives the model expressive flexibility.
  4. Scaling by \(1/\sqrt{d_k}\) is mathematically necessary: without it, dot products in high dimensions have large variance → softmax saturates → gradients vanish. Scaling restores \(O(1)\) variance.
  5. Multi-head attention runs \(h\) attention mechanisms in parallel, each specializing in a different type of relationship (syntactic, semantic, coreference, positional). Concatenate + project.
  6. Positional encoding injects order information into the permutation-equivariant attention. Sinusoidal encodings enable relative position detection. RoPE (modern) encodes relative positions directly into query-key dot products.
  7. The transformer block wraps MHA with residual connections (gradient highway) and LayerNorm (stabilization), followed by a per-position FFN (nonlinear transformation of each token's representation).
  8. Architectures diverge: BERT (encoder, bidirectional, MLM) for understanding; GPT/LLaMA (decoder, causal, CLM) for generation; T5/BART (encoder-decoder) for sequence-to-sequence; ViT (patches as tokens) for vision.
  9. The \(O(n^2)\) bottleneck limits context length. Flash Attention reduces memory to \(O(n)\) with the same compute. Sparse attention methods achieve sub-quadratic cost. State space models (Mamba) are linear-time alternatives.
Why Transformers Are the Foundation of Modern AI

The transformer's combination of (1) constant-path-length attention enabling true long-range dependencies, (2) full GPU parallelism enabling massive scale, (3) content-based dynamic routing enabling flexible context-dependent representations, and (4) empirical scaling laws enabling predictable improvement with compute has made it the dominant foundation of modern AI: language, vision, protein structure, code, scientific discovery. Every GPT-based interaction, and every large language model, relies on the same four equations: \(\mathbf{Q}=\mathbf{X}\mathbf{W}^Q\), \(\mathbf{K}=\mathbf{X}\mathbf{W}^K\), \(\mathbf{V}=\mathbf{X}\mathbf{W}^V\), \(\text{Attention}=\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}\).