
What Is Attention in Transformers in LLMs? A Step-by-Step Engineering Breakdown

3Blue1Brown's visual explainer of attention, annotated by a production AI engineer. Query, key, value vectors, softmax, masking, multi-head attention, and the GPT-3 parameter math behind self-attention.

James Bennett
31 minute read

3Blue1Brown's 26-minute deep dive on attention is the clearest visual explanation of the mechanism that powers every modern large language model. This annotated breakdown walks through the math, adds the Python code, GPU economics, and modern variants that 3Blue1Brown skipped, and connects each step to what production retrieval engineers actually deal with — quadratic context cost, multi-head attention, and why GPT-3 spends 58 billion parameters on this one operation.

Attention is a learned operation that lets every token in a sequence directly read from every other token in parallel, using three projections — query, key, and value — to decide which other tokens are relevant and what information to copy from them. The scaled dot-product form used in transformers comes from the 2017 paper Attention Is All You Need, which has since been cited over 173,000 times, and it's the reason GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, and DeepSeek all share the same underlying architecture nine years later.

Attention mechanism in transformers visualized as a circular network of glowing nodes where one central node receives weighted beams of light from surrounding tokens

Video Summary and Key Insights

3Blue1Brown — the channel created by Stanford-trained mathematician Grant Sanderson — spends 26 minutes walking through the attention mechanism with hand-built animations made in his Python library manim. The video has been viewed 4 million times since April 2024 and serves as the de facto visual companion to the Attention Is All You Need paper, which has crossed 173,000 academic citations and is the foundation of every model you actually use — GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, DeepSeek. Sanderson uses a single running example, "a fluffy blue creature roamed the verdant forest," to show how query, key, and value matrices let adjectives update the meaning of nouns. The single most important takeaway: attention is just a learned weighted sum where the weights come from softmax-normalized dot products between learned projections of each token. That's it. Everything else — multi-head, masking, the output matrix — is engineering scaffolding around that core idea. If you've previously bounced off the original paper, this is the visualization that makes the math click.

  • A transformer's whole job is to update embeddings until the last vector knows enough to predict the next token. Sanderson opens by reminding viewers that prediction is computed entirely from the final vector in the sequence. Every attention block exists to make sure that final vector has absorbed everything relevant from the rest of the context.

If the model's going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word.

Grant Sanderson, 3Blue1Brown
  • Attention solves contextual ambiguity that lookup-table embeddings cannot. The same word "mole" gets the same initial embedding in "American shrew mole," "one mole of carbon dioxide," and "take a biopsy of the mole." Attention is the layer that pushes that generic vector toward the right meaning based on the surrounding tokens.
  • Query, key, and value are three learned matrices, not magic. Each input embedding is multiplied by three matrices to produce a query vector ("what am I looking for?"), a key vector ("here's what I have"), and a value vector ("here's what I'd add to your meaning if you picked me"). Relevance is the dot product of one token's query against every other token's key.

Conceptually, you want to think of the keys as potentially answering the queries. You think of the keys as matching the queries whenever they closely align with each other.

Grant Sanderson, 3Blue1Brown
  • Softmax over each column converts dot products into a probability distribution. Raw dot products run from negative infinity to positive infinity. Softmax squeezes them into [0, 1] and forces each column to sum to 1, so the resulting attention pattern can be used as weights for a weighted sum.
  • Masking forces causal attention by setting future-token scores to negative infinity before softmax. During training, GPT-style models predict every next token in parallel. To stop later tokens from leaking into earlier predictions, the upper-triangular entries of the attention matrix are zeroed out — but in a way that keeps the columns normalized.
  • The attention pattern is O(n²) in the context size — that's the bottleneck the entire industry is fighting. Sanderson calls this out directly: doubling context length quadruples the memory cost. Every "long context" trick in 2026 — sliding window, FlashAttention, Mamba hybrids, MLA — exists because of this single quadratic.
  • GPT-3 spends about 58 billion of its 175 billion parameters on attention alone. Each block has 96 heads, each head has its own Q/K/V matrices, the model has 96 layers, and the math compounds. The majority of parameters actually live in the MLP blocks between attention layers — a fact that surprises most people who learn about transformers from the attention chapter first.
  • Multi-head attention runs the same operation in parallel with different learned matrices, then sums the results. Each head learns its own kind of relationship — one might track adjective-noun bindings, another coreference, another syntactic structure. Sanderson is explicit that we don't actually know what most heads do; the patterns emerge from gradient descent.

The overall idea is that by running many distinct heads in parallel, you're giving the model the capacity to learn many distinct ways that context changes meaning.

Grant Sanderson, 3Blue1Brown

I've spent the last few years at WebSearchAPI.ai building the retrieval layer that feeds live web data into transformer-based LLMs. Most of what we ship has nothing to do with attention math directly — but every single byte we send to a model passes through self-attention on the other side, and every extra token costs us latency, GPU-seconds, and cache pressure. When I talk to junior engineers about why we obsess over context window utilization, the conversation always lands back on this video.

Sanderson's animations are the cleanest visual treatment of self-attention I've found. He's a Stanford math grad who built a custom animation library specifically to teach math intuitively, and his attention chapter has been viewed over 4 million times since April 2024. But it's a teaching video — it skips the GPU economics, the memory wall, and the Anthropic transformer circuits work that explains why the value-up-and-down factoring exists. I'll quote Sanderson for the conceptual scaffolding, then fill in what production AI engineers actually run into. If you've never watched the video, watch it first. Then come back here for the engineering layer on top.

Recap diagram showing tokens being mapped to high-dimensional embedding vectors at the start of a transformer pipeline

Why Are Embeddings Alone Not Enough?

Embeddings alone are a context-free lookup table — the same word gets the same vector regardless of what surrounds it. Attention exists to fix that, and it's the only mechanism in a transformer that lets tokens share information with each other.

The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning.

Grant Sanderson, 3Blue1Brown

The "mole" example Sanderson opens with is the cleanest demonstration of the problem. After tokenization and embedding, "American true mole," "one mole of carbon dioxide," and "take a biopsy of the mole" all start with the same vector for the word "mole." That vector points to a generic spot in the 12,288-dimensional embedding space GPT-3 uses — somewhere between "small mammal," "chemistry unit," and "skin lesion." It cannot be all three at once. Without a layer that reads the surrounding context, the rest of the network is stuck.

This is the practical reason embeddings on their own — the kind of static embeddings produced by Word2Vec (2013) or GloVe (2014) — can't power modern language understanding. They were good enough for analogy benchmarks ("king - man + woman = queen") but they can't disambiguate. In production retrieval at WebSearchAPI.ai, the same problem shows up when we embed user queries with a static encoder: a query about "Apple revenue" and a query about "apple varieties" land near each other in vector space until you push them through a contextual encoder. Attention is what makes contextual encoding possible.

Where Did the Idea of Attention Actually Come From?

Attention as a learned operation predates transformers by three years. Bahdanau, Cho, and Bengio's 2014 paper on neural machine translation introduced the first additive attention mechanism for an encoder-decoder RNN, where a small feed-forward network computed alignment scores between the decoder's hidden state and each encoder hidden state. The 2017 transformer paper kept the core idea — learned weighting between positions — and replaced the feed-forward scoring with the dot-product formulation that runs faster on GPUs.

Year | Paper / Architecture | Key Idea | Score Function
2014 | Bahdanau et al. (1409.0473) | Additive attention for RNN translation | Feed-forward MLP over [hᵢ, sⱼ]
2015 | Luong et al. (1508.04025) | Multiplicative attention | Dot product, no scaling
2017 | Vaswani et al. (1706.03762) | Self-attention in transformers | Scaled dot product Q·Kᵀ/√d_k
2022 | Dao et al. (FlashAttention) | Tiled attention computation | Same math, different memory access
2024 | DeepSeek-V2 (Multi-Head Latent Attention) | Compressed KV cache | Latent-space dot product

The biological framing IBM uses in their explainer maps roughly to this lineage: humans pay selective attention to salient parts of the visual or auditory field while filtering noise, and machine attention does something analogous over token sequences. The metaphor is useful for intuition but the implementation is plain linear algebra — there's no biological circuit being simulated, just a learned weighted sum that happened to work on translation in 2014 and never stopped working.

The reason the 2017 paper exploded into 173,000+ citations isn't that attention itself was new. It's that Vaswani's team showed you could build an entire model out of attention, drop recurrence completely, and still match (then exceed) state-of-the-art on translation while training in a fraction of the wall-clock time. The architecture was a compute argument disguised as an accuracy paper. The follow-on architectures — encoder-only BERT (2018), decoder-only GPT (2018), and every frontier LLM since — all picked the same skeleton.

What Does Attention Actually Do to a Vector?

Attention takes a context-free embedding and pushes it in a specific direction in embedding space, where that direction encodes the influence of surrounding tokens. For "Eiffel tower" preceded by "miniature," the vector for "tower" gets nudged away from "tall," and toward "Paris," "souvenir," and "small."

Motivating examples showing how the embedding for tower can be refined into different meanings based on preceding context like Eiffel and miniature

Sanderson walks through how attention disambiguates 'mole' across three different sentences.

Sanderson uses two examples to set this up: "mole" with three different meanings, and "tower" preceded by either "Eiffel" or "miniature Eiffel." Both make the same point — there are distinct directions in embedding space for distinct meanings of the same token, and attention is what calculates the offset that moves the generic embedding toward the right one.

The mental model that helped me most when debugging RAG pipelines is this: think of every token as a point in a 12,000-dimensional cloud. Attention is a learned function that reads the rest of the cloud and outputs a delta vector for each point. The delta is small for tokens whose meaning is already clear (like determiners) and large for tokens whose meaning depends heavily on context (like pronouns and ambiguous nouns). This is also why long-range coreference — where "it" refers to something 200 tokens earlier — works at all in transformers but failed catastrophically in RNNs.

How Does the Attention Pattern Get Computed?

The attention pattern is a square grid of dot products between every query and every key, scaled by the square root of the key dimension and softmax-normalized column by column. The result is a matrix where column j tells you how much each previous token contributes to the meaning of token j.

Query key and value vectors in attention illustrated as a library search where a person holding a glowing question card pulls relevant books off shelves

Attention pattern diagram showing the example sentence 'a fluffy blue creature roamed the verdant forest' with the attention grid forming between query and key vectors

To measure how well each key matches each query, you compute a dot product between each possible key query pair. I like to visualize a grid full of a bunch of dots where the bigger dots correspond to the larger dot products, the places where the keys and queries align.

Grant Sanderson, 3Blue1Brown

Here's the formula from the original paper, decoded one piece at a time:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Term | What It Is | Shape (GPT-3)
Q | Stack of all query vectors for the sequence | n × 128
K | Stack of all key vectors for the sequence | n × 128
V | Stack of all value vectors for the sequence | n × 12288
QKᵀ | Raw attention scores (every query against every key) | n × n
√d_k | Square root of the key dimension, used for numerical stability | scalar = √128 ≈ 11.3
softmax(...) | Column-wise normalization to a probability distribution | n × n
· V | Weighted sum of value vectors using the softmax weights as coefficients | n × 12288

The √d_k division is the part most explanations skip. Without it, when key and query vectors have many dimensions, dot products grow large, and softmax collapses to a one-hot distribution where one token gets all the attention weight and the rest get rounding errors. Dividing by √d_k keeps the softmax in a regime where it can actually express partial weight on multiple tokens. The 2017 paper justifies it with a short variance argument (a dot product of d_k roughly unit-variance components has variance about d_k), but in practice it reads as a stabilization trick that ended up surviving everything since.
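A quick toy sketch (illustrative numbers of my own, not from the video) makes the saturation visible: with 128-dimensional queries and keys, the unscaled dot products are already large enough that softmax puts nearly all of its mass on one key, while the scaled scores spread weight across several.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 128
q = torch.randn(1, d_k)            # one query with roughly unit-variance entries
k = torch.randn(8, d_k)            # eight candidate keys

raw = q @ k.T                      # typical magnitudes around sqrt(d_k) ~ 11
scaled = raw / d_k ** 0.5          # back to roughly unit scale

print(F.softmax(raw, dim=-1))      # tends toward a near one-hot distribution
print(F.softmax(scaled, dim=-1))   # spreads weight across several keys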

The example sentence Sanderson uses, "a fluffy blue creature roamed the verdant forest," is doing a lot of work here. Adjectives ("fluffy," "blue," "verdant") query for nouns. Nouns advertise themselves as nouns through their keys. The dot products between matching pairs are large, and softmax turns those into the dominant weights for the column. After the weighted sum, the noun embedding has absorbed information from its modifiers — that's the whole behavior in one sentence.

What Does Scaled Dot-Product Attention Look Like in PyTorch?

The full math fits in eight lines of PyTorch. This is the canonical implementation that ships in roughly every educational notebook and matches what torch.nn.functional.scaled_dot_product_attention runs internally before kernel-level optimizations:

import math
import torch
import torch.nn.functional as F
 
def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: [batch, heads, seq_len, d_k]
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

A few things worth pointing out for production engineers reading this:

  • The torch.matmul(query, key.transpose(-2, -1)) line is the n × n attention matrix being materialized in HBM. For a 64K-token sequence with float16 weights, that's 64K × 64K × 2 bytes ≈ 8.6 GB per head per layer — which is exactly what FlashAttention sidesteps by tiling the computation in SRAM and never writing the full matrix to HBM.
  • The masked_fill step is what enforces causal attention in GPT-style training. A common bug in homemade attention implementations is forgetting that the mask must be applied before softmax, not after. Apply it after and your "zeroed" entries get a non-zero weight from the softmax denominator.
  • The F.softmax(scores, dim=-1) operates along the last dimension because PyTorch attention matrices are usually shaped [batch, heads, seq, seq] where the last seq dimension is the keys. Running softmax along the wrong axis silently breaks the model — the gradients still flow, training just doesn't converge.
  • The function returns the weighted values and the attention weights themselves. In production we usually drop the weights to save memory, but during debugging they're the single most useful thing to inspect — they tell you which tokens the model thinks are relevant to which other tokens, which is how you diagnose attention dilution in long-context retrieval.

For real production code, you almost never write this loop by hand. PyTorch 2.0+ ships torch.nn.functional.scaled_dot_product_attention, which dispatches to FlashAttention, memory-efficient attention, or a fallback math kernel depending on hardware. JAX has jax.nn.dot_product_attention. xFormers and FlashAttention are the two reference implementations everyone benchmarks against. The eight-line version above is for understanding, not deploying.
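For illustration, here is roughly what the production call looks like; the shapes are my own toy choices, and the FlashAttention path only kicks in on supported hardware and dtypes.

import torch
import torch.nn.functional as F

# Toy shapes: batch=2, heads=8, seq_len=1024, head_dim=128 (illustrative, not GPT-3's real config)
q = torch.randn(2, 8, 1024, 128)
k = torch.randn(2, 8, 1024, 128)
v = torch.randn(2, 8, 1024, 128)

# is_causal=True applies the upper-triangular mask internally; PyTorch dispatches to a
# FlashAttention, memory-efficient, or plain math kernel depending on hardware and dtype.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 128])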

Why Do Transformers Apply Masking?

Masking enforces causal attention so that earlier tokens can't see later tokens during training. It works by setting the upper-triangular entries of the attention scores to negative infinity before softmax, which makes them zero in the normalized matrix while keeping each column sum equal to 1.

Masking pattern showing the lower-triangular attention matrix with future-token cells blacked out

The simplest thing you might think to do is to set them equal to zero, but if you did that, the columns wouldn't add up to one anymore. They wouldn't be normalized. So instead, a common way to do this is that before applying softmax, you set all of those entries to be negative infinity.

Grant Sanderson, 3Blue1Brown

The reason this matters during training is subtle. When GPT trains, it doesn't just predict the last token — it predicts every next token simultaneously, for every position in the sequence. That means a single 1024-token training example produces 1024 supervised prediction targets for the price of one forward pass. Masking is what makes this efficient: each position must predict its own next token without cheating by looking at future tokens. Remove the mask and the model just memorizes the input.
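If you build the mask yourself for the hand-rolled function above, a lower-triangular boolean matrix is the usual construction. A minimal sketch:

import torch

seq_len = 6
# 1s on and below the diagonal: position i may attend to positions 0..i, never to the future.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Passed as `mask` to scaled_dot_product_attention above, the 0 (future) entries are filled
# with -inf before softmax, so every position still gets a normalized distribution over its past.
print(causal_mask.int())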

In production inference, the picture is slightly different. Once a model is deployed and generating text autoregressively, it generates one token at a time, so there are no future tokens to mask. But the mask is still applied for consistency with the training-time computation graph. KV-cache implementations exploit this by storing the keys and values for all previous tokens and only computing the new query against the cached keys — a 100x speedup on long-context generation that depends on the causal mask being in place.
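The KV-cache idea fits in a few lines. This is a toy single-head sketch of the decode loop, not any particular serving framework: keys and values for past tokens are appended to a cache, and each new token computes only one query row against it.

import torch

d_k = 64
k_cache = torch.empty(0, d_k)   # keys for all previously generated tokens
v_cache = torch.empty(0, d_k)   # values for all previously generated tokens

for step in range(5):
    # In a real model these come from projecting the newest token's embedding.
    new_q, new_k, new_v = (torch.randn(1, d_k) for _ in range(3))
    k_cache = torch.cat([k_cache, new_k])            # append; old keys are never recomputed
    v_cache = torch.cat([v_cache, new_v])
    scores = new_q @ k_cache.T / d_k ** 0.5          # [1, step + 1], not [n, n]
    weights = torch.softmax(scores, dim=-1)          # causal by construction: the cache only holds the past
    context = weights @ v_cache                      # [1, d_k] update for the newest token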

For non-causal use cases like translation or speech recognition, you typically don't want a mask. That's where cross-attention comes in (Sanderson covers this around the 18-minute mark), and why encoder-decoder models like the original Transformer and T5 use bidirectional attention in the encoder.

How Does Context Size Become a Bottleneck?

The attention pattern is an n × n matrix where n is the context size, so memory grows quadratically. Doubling your context window from 32K to 64K tokens quadruples the attention memory, not just doubles it.

Quadratic attention context window bottleneck shown as a small sparse grid expanding into an overwhelming dense lattice that bursts beyond its frame

Context size diagram with attention pattern shown as a square grid whose dimensions scale with token count

Another fact that's worth reflecting on about this attention pattern is how its size is equal to the square of the context size. So this is why context size can be a really huge bottleneck for large language models, and scaling it up is nontrivial.

Grant Sanderson, 3Blue1Brown

Sanderson notes this almost in passing, but it's the single most consequential engineering fact in modern LLM infrastructure. Every "long context" announcement you've seen since 2023 — Claude's 100K window, Gemini's 1M window, GPT-4 Turbo's 128K — sits on top of architectural workarounds for the n² wall. The attention math itself hasn't changed; the way we run it has. As an arXiv survey from 2025 put it bluntly: "the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling."
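The arithmetic is worth doing once by hand. A back-of-the-envelope sketch of the attention-matrix footprint for a single head in float16 (2 bytes per entry), ignoring everything FlashAttention-style kernels do to avoid materializing it:

# n x n scores in float16, one head, one layer
for n in (8_192, 32_768, 65_536, 131_072):
    gb = n * n * 2 / 1e9
    print(f"{n:>7} tokens -> {gb:6.1f} GB")
# 8,192 tokens -> 0.1 GB; 32,768 -> 2.1 GB; 65,536 -> 8.6 GB; 131,072 -> 34.4 GB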

The standard tricks, in roughly the order they appeared:

Technique | Year | Idea | Used In
FlashAttention | 2022 | Tile attention in SRAM, never materialize full n × n in HBM | All frontier LLMs since 2023
Sliding window attention | 2023 | Each token only attends to the last k tokens, not all n | Mistral 7B, Longformer
Multi-query attention (MQA) | 2019 | All heads share a single K and V projection | PaLM, early Falcon
Grouped-query attention (GQA) | 2023 | Heads share K/V in groups (compromise between MHA and MQA) | LLaMA 2/3, Mixtral
Multi-head latent attention (MLA) | 2024 | Compress KV cache into a low-rank latent space | DeepSeek-V2, V3
Multi-head low-rank attention (MLRA) | 2026 | Partitionable latent states for 4-way TP decoding | ICLR 2026
FlashAttention 4 | 2026 | Reaches 1,605 TFLOPs/s on NVIDIA Blackwell | Default for new training runs
State space models (SSMs) | 2023 | Replace attention with a linear recurrence that scales O(n) | Mamba, RWKV, hybrids

The pace of these optimizations is itself a signal. Songtao Liu's 2026 ICLR paper on Multi-Head Low-Rank Attention reports a 2.8× decoding speedup over MLA while matching its perplexity — meaning the gains from squeezing the attention KV cache haven't plateaued yet. As Sebastian Raschka summarized in his visual attention variants overview, the field has moved "from MHA and GQA to MLA, sparse attention, and hybrid architectures" in roughly 18 months. Pick your favorite frontier model and odds are its attention block isn't the same as the 2017 paper anymore.

In our retrieval engine at WebSearchAPI.ai, the practical version of this fight is choosing how much context to send. For a typical RAG query, we could send 500 tokens of carefully ranked context or 50,000 tokens of raw search results. The 50K version costs roughly 100x more in attention compute and is empirically worse on answer quality — a phenomenon researchers call "lost in the middle," where models ignore information buried deep in long contexts. The quadratic isn't just a billing problem; it's a quality problem too.

What Are Value Vectors and How Do They Update Embeddings?

Value vectors are the third learned projection of each token, and they hold the actual information that gets added to other tokens' embeddings. Once you've computed the attention pattern from queries and keys, you compute a weighted sum of value vectors using the attention weights as coefficients, and add that sum back to the original embedding.

Values diagram showing the value matrix multiplication producing value vectors that get summed with attention weights and added to the original embedding

This value vector lives in the same very high dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that something else?

Grant Sanderson, 3Blue1Brown

Here's the part of attention that throws people off: the query and key vectors are only used to compute weights. They never directly influence what gets added to the output embedding. The thing that actually flows through the network — the content that gets injected from one token into another — is the value vector.

This separation of concerns is more elegant than it first looks. Queries and keys decide who talks to whom. Values decide what gets said. Anthropic's transformer circuits work showed that this factorization is closer to "memory addressing + memory contents" than anyone realized — the QK circuit determines which heads route information between which positions, and the OV (output-value) circuit determines what transformation gets applied when that routing happens. Sanderson hints at this when he calls the value map "a low-rank transformation," and it's the conceptual unlock that made me start treating attention like a learned content-addressable memory rather than an opaque blob of matmuls.
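Here is Sanderson's single-head picture as a toy sketch: tiny dimensions stand in for GPT-3's 12,288-dim embeddings and 128-dim heads, and I use row-wise softmax rather than the video's column convention. Queries and keys only produce the weights, the value-down and value-up matrices produce the delta, and the delta is added back onto the original embedding.

import torch

n, d_model, d_head = 7, 64, 8
E = torch.randn(n, d_model)                    # context-free embeddings for n tokens

W_Q = torch.randn(d_model, d_head) * 0.1
W_K = torch.randn(d_model, d_head) * 0.1
W_Vdown = torch.randn(d_model, d_head) * 0.1   # "value-down": embedding space -> head space
W_Vup = torch.randn(d_head, d_model) * 0.1     # "value-up": head space -> embedding space

scores = (E @ W_Q) @ (E @ W_K).T / d_head ** 0.5
weights = torch.softmax(scores, dim=-1)        # who talks to whom: Q and K only set the weights

delta = weights @ (E @ W_Vdown) @ W_Vup        # what gets said: the content each token absorbs
E_updated = E + delta                          # residual add: the embedding is nudged, not replaced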

How Many Parameters Does a Single Attention Head Use?

A single GPT-3 attention head uses about 6.3 million parameters split across four matrices: query (W_Q), key (W_K), value-down (W_V↓), and value-up (W_V↑). The Q and K matrices each have ~1.5M parameters; the value map is factored into two smaller matrices to match.

Counting parameters diagram showing the four matrices Q, K, value-down, and value-up with their dimensions and parameter counts for GPT-3

The way the value map is factored is as a product of two smaller matrices. To throw in linear algebra jargon here, what we're basically doing is constraining the overall value map to be a low rank transformation.

Grant Sanderson, 3Blue1Brown

The value-up / value-down factoring is one of those design choices that looks arbitrary until you do the parameter math:

Matrix | Naive Shape | Naive Params | Factored Shape | Factored Params
W_Q | 12288 × 128 | 1,572,864 | same | same
W_K | 12288 × 128 | 1,572,864 | same | same
W_V | 12288 × 12288 | 150,994,944 | 12288 × 128 + 128 × 12288 | 3,145,728

The factored value map saves about 148 million parameters per head. With 96 heads × 96 layers in GPT-3, that saving multiplies into something on the order of 1.4 trillion parameters that could have been spent on a square value matrix but weren't. This is one of the reasons GPT-3 fits into 175B parameters instead of 1.5T — a single design choice in the attention block.
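A few lines of arithmetic reproduce those numbers (GPT-3's published dimensions; the "naive" square value matrix is the counterfactual, not anything GPT-3 actually ships):

d_model, d_head, heads, layers = 12_288, 128, 96, 96

naive_value = d_model * d_model                      # one full-rank value matrix per head
factored    = d_model * d_head + d_head * d_model    # value-down + value-up
saved       = naive_value - factored                 # ~148M parameters per head

print(f"per head: {naive_value:,} vs {factored:,} ({saved:,} saved)")
print(f"across 96 heads x 96 layers: {saved * heads * layers / 1e12:.2f} trillion parameters saved")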

The other consequence of low-rank value maps is interpretability. When the value transformation is forced to be rank-128 in a 12288-dimensional space, what each head can do is constrained. Anthropic's circuits work exploits this directly: a low-rank OV circuit is much easier to analyze than a full-rank one, and most of the mechanistic interpretability results published since 2021 depend on it.

What Is Cross-Attention and Where Does It Show Up?

Cross-attention is the same operation as self-attention except the keys and values come from a different sequence than the queries. It's used wherever a model needs to relate two distinct streams of data — translation, image captioning, speech transcription, retrieval-augmented generation.

Cross-attention diagram showing keys and queries coming from two different language sequences during translation

A cross attention head looks almost identical. The only difference is that the key and query maps act on different datasets. In a model doing translation, the keys might come from one language while the queries come from another.

Grant Sanderson, 3Blue1Brown

Cross-attention is what lets the decoder of a translation model look up the original English sentence while it's generating French. It's also what powers Whisper's speech-to-text — the decoder generates text tokens whose queries attend to keys derived from audio features. And it's the conceptual basis for retrieval-augmented generation: at WebSearchAPI.ai, when we hand a model search results plus a user question, the model uses something close to cross-attention internally to figure out which retrieved passages are relevant to which parts of the question.

A small caveat Sanderson mentions: cross-attention typically has no causal mask, because there's no temporal ordering between the two sequences. In a translation model, the entire source sentence is available before generation begins, so masking would just throw away signal. This is why decoder-only models (GPT, Claude, LLaMA) only use causal self-attention while encoder-decoder models (T5, BART, Whisper) use bidirectional cross-attention from decoder to encoder.
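The call looks the same as self-attention except the query tensor comes from a different sequence than the keys and values, and there is no causal flag. A toy sketch (shapes and names are illustrative; in a real model K and V would be separate learned projections of the encoder states):

import torch
import torch.nn.functional as F

enc = torch.randn(1, 4, 37, 64)   # encoder states: [batch, heads, src_len, d_head]
dec = torch.randn(1, 4, 12, 64)   # decoder states: [batch, heads, tgt_len, d_head]

# Queries from the decoder, keys/values from the encoder, no causal mask:
# every target position may read the entire source sequence.
out = F.scaled_dot_product_attention(query=dec, key=enc, value=enc)
print(out.shape)  # torch.Size([1, 4, 12, 64]) -- one context vector per target position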

Why Do Transformers Use Multiple Attention Heads in Parallel?

Multiple heads exist so the model can learn many different types of contextual relationships simultaneously — adjective-noun bindings, coreference, syntactic structure, semantic role, and patterns no human would label. Each head has its own Q, K, and V matrices, and the outputs of all heads are summed together at the end of the block.

Multi-head attention illustrated as a grid of nine parallel rooms each containing a different specialist instrument pointing toward the same central object

Multi-head attention diagram showing 96 parallel attention heads with their own Q, K, V matrices feeding into a single output

Multi-head attention runs many parallel attention operations, each with its own learned matrices.

GPT three, for example, uses 96 attention heads inside each block. Considering that each one is already a bit confusing, it's certainly a lot to hold in your head.

Grant Sanderson, 3Blue1Brown

The multi-head structure is the reason transformers learn so much from raw text. Different heads end up specializing — interpretability researchers at Anthropic and elsewhere have identified "induction heads" that copy patterns from earlier in the sequence, "previous token heads" that look at position n-1, and "name mover" heads that surface entity references. None of these specializations are programmed in. They emerge during training because the gradient finds them useful.

The parameter math for the full multi-head block in GPT-3 lands at about 600 million parameters per layer:

  • ~6.3M params per head (spread across its 4 matrices) × 96 heads ≈ 600M parameters per attention block
  • 600M × 96 layers ≈ 58 billion parameters in attention across the whole model
  • 175B total params - 58B attention ≈ 117B in MLPs and embeddings

The "MLP is bigger than attention" surprise is real. Most of GPT-3's parameters live in the feed-forward blocks between attention layers, not in attention itself. But attention is what lets information flow between positions, and the MLPs only get to do anything useful because attention has already mixed the right tokens together.

How Does the Output Matrix Tie Multi-Head Attention Together?

The output matrix is the per-head "value-up" projections of all heads stapled together into one large matrix, the thing research papers call W_O or the output projection. It's a notational convenience that lives at the multi-head level rather than the per-head level, and it can confuse readers expecting a separate W_O and W_V↑.

Output matrix diagram showing the value-up projections of all heads combined into a single output matrix at the multi-head level

All of these value up matrices for each head appear stapled together in one giant matrix that we call the output matrix associated with the entire multi headed attention block.

Grant Sanderson, 3Blue1Brown

This is the part of the attention chapter where I usually have to slow down with a junior engineer. If you read the original paper, you'll see one V matrix per head and one big W_O at the multi-head level. If you read a teaching explanation like this video, you'll see Q, K, V↓, and V↑ per head with no separate output matrix. They describe the same computation. The difference is whether you bundle the V↑ projections of all heads into one matrix or keep them per-head.

The practical implication: when you read a research paper that talks about "the output projection" or "W_O," it's the thing Sanderson called the value-up matrix. When you read a paper about "the value matrix" or "V," it's the value-down projection only. Mismatched naming has tripped up more grad students than I can count, and Sanderson is unusually upfront about it. If you're implementing attention from scratch, pick one convention and document it.

How Does Attention Compose Across Many Layers?

Attention is applied repeatedly through dozens of stacked transformer blocks, with each block adding more contextual richness on top of the embeddings produced by the previous one. By layer 96, the embeddings encode high-level features like sentiment, topic, and reasoning structure — not just individual word meanings.

Going deeper diagram showing attention blocks stacked many times in series with embeddings flowing through each layer

The further down the network you go, with each embedding taking in more and more meaning from all the other embeddings, which themselves are getting more and more nuanced, the hope is that there's the capacity to encode higher level and more abstract ideas.

Grant Sanderson, 3Blue1Brown

The layered structure is what makes transformers more than the sum of their parts. A single attention block can only mix information once. Stack 96 of them and you get 96 rounds of mixing, each operating on representations that already encode some context. Mechanistic interpretability papers have shown that early layers in GPT-style models tend to handle syntax and surface features, middle layers handle semantic and entity-level reasoning, and late layers handle output formatting and next-token prediction. Nobody designed this division of labor — it falls out of training.

In production retrieval, this layered behavior is why simple "ablate one layer" experiments rarely work. You can't just swap out the attention mechanism in layer 47 of a 96-layer model and expect the rest of the stack to keep functioning, because the representations in layer 48 were trained against whatever layer 47 produced. This tight coupling is one reason the industry hasn't fully migrated away from vanilla attention even though faster alternatives exist — every replacement has to be retrained from scratch, and the cost of training a frontier model in 2026 is somewhere between $100M and $1B depending on how you count.

How Does Attention Generalize Beyond Text?

The same scaled dot-product attention you read in the 2017 paper is what powers Vision Transformers, Whisper, AlphaFold 2, video models, and every multimodal LLM shipping today. The token sequence changes — image patches, audio frames, amino acid residues, video frames — but the operation is identical.

Modality"Tokens" AreReference Architecture
TextSubword units (BPE)GPT-4, Claude 4, Gemini 3
Image16×16 pixel patches flattenedVision Transformer (ViT), DINOv2
Audio25ms mel-spectrogram windowsWhisper, AudioLM
VideoSpatiotemporal patchesSora, VideoPoet
ProteinAmino acid residuesAlphaFold 2, ESM-3
MultimodalMixed text + image + audio tokensGPT-4o, Gemini 3, Claude 4 Vision

Vision Transformers were the first big proof that the architecture wasn't text-specific. The 2020 ViT paper from Google showed that splitting images into 16×16 patches and feeding them through a standard transformer matched or beat ConvNets on ImageNet — with no convolutional bias built in. The model learned spatial structure entirely from data. Anyone in computer vision in 2019 would have told you that wasn't supposed to work. By 2022, ViT-style backbones were the default for most vision tasks.

The reason this generalizes is the same reason attention won over RNNs. Attention treats its input as a set with positional metadata, not a sequence with a hardcoded order. If you can tokenize a modality — turn it into a finite list of vectors — you can feed it through self-attention and let the model learn whatever positional relationships matter. This is also why multimodal models like GPT-4o and Claude 4 work without separate vision and language stacks: the same attention block reads text tokens and image patches in the same context window, computes Q·Kᵀ across all of them, and lets the cross-modal weights emerge during training. The 2017 paper called itself "Attention Is All You Need," and nearly a decade in, that's mostly held up.

Why Is Parallelism the Real Reason Attention Won?

Attention won because it's parallelizable on GPUs in a way RNNs never were. The whole attention pattern computes as one matrix multiplication across all token pairs simultaneously, while RNNs literally cannot start step t until step t-1 finishes.

Ending screen with a summary of the attention mechanism and references to additional resources by Karpathy and Olah

A big part of the story for the success of the attention mechanism is not so much any specific kind of behavior that it enables, but the fact that it's extremely parallelizable.

Grant Sanderson, 3Blue1Brown

This is the closing point Sanderson lands on, and it's the one I'd put in bold for any engineer learning transformers in 2026. Attention isn't deeper or smarter than what came before. It's flatter — the data dependencies are wider but shallower, which maps cleanly onto thousands of GPU cores running in parallel. The 2017 paper essentially asked: "what if we let the chip do all its multiplications at once instead of one at a time?" Everything else followed from that.

The downstream consequence is what Rich Sutton called the bitter lesson — the methods that scale with compute end up winning, regardless of whether they look elegant. Attention scales with compute. RNNs don't. Even if a sufficiently clever RNN variant could in principle solve the same problems, the architecture that lets you point a billion-dollar GPU cluster at the problem and amortize linearly in compute is the one that ships.

The hardware-software co-evolution since 2017 has been remarkable to watch from inside the industry. NVIDIA's H100 and Blackwell GPUs ship with hardware-level optimizations specifically for attention math — tensor cores, transformer engines, and memory hierarchy decisions that would have been overkill for any pre-2017 architecture. FlashAttention 4 hitting 1,605 TFLOPs/s on Blackwell isn't just a software win; it's the result of GPU vendors and model architects converging on the same operation as the central computational unit of AI. As one Forbes column put it in early 2026, "too many people have a sort of distorted view of how attention mechanisms work in analyzing text" — the gap between how attention is described in tutorials and how it actually runs on production hardware has only widened.

Sanderson recommends Karpathy and Chris Olah's transformer circuits work at the end of the video for more depth, and I'd second both. Karpathy's Build a GPT from scratch is the natural next watch. If you want the broader transformer block (MLPs, residuals, encoder-decoder) instead of just attention, our sister post on the full architecture walks through ByteByteGo's 10-minute explainer with the same engineer-annotation approach. And if you're wondering how attention patterns end up misaligned during training, our piece on AI reward hacking covers what happens when gradient descent finds the wrong attention shortcuts.

Frequently Asked Questions

What is attention in transformers in simple terms?

Attention is a layer in a transformer that lets every token in a sequence read information from every other token at once, weighted by how relevant each pair is. Sanderson describes it as a learned mechanism that takes context-free embeddings and pushes them toward more contextually rich meanings — the mechanism that lets "mole" mean "small mammal" in one sentence and "skin lesion" in another.

What are query, key, and value vectors in attention?

Query, key, and value are three learned projections of every input token. The query represents what the token is "looking for" in other tokens, the key represents what the token "has to offer," and the value carries the actual content that gets added to other embeddings. The attention pattern comes from dot products between queries and keys; the output is a weighted sum of values using those dot products as weights.

Why does the attention formula divide by the square root of d_k?

The √d_k division keeps the dot products in a numerically stable range before applying softmax. Without it, when keys and queries have many dimensions (128 in GPT-3), the dot products can grow large and softmax collapses to a one-hot distribution, making gradients vanish. This is an empirical stabilization trick from the original 2017 paper that has survived essentially unchanged.

What is the difference between self-attention and cross-attention?

Self-attention computes queries, keys, and values from the same input sequence. Cross-attention pulls queries from one sequence and keys plus values from a different sequence — for example, a French decoder querying an English encoder during translation. Sanderson notes that GPT-style models use only self-attention with masking, while encoder-decoder models like T5 and Whisper use both self-attention and cross-attention.

How many parameters does GPT-3 spend on attention?

GPT-3 spends roughly 58 billion of its 175 billion total parameters on attention — about a third of the model. Each attention block contains 96 heads with about 6.3 million parameters each, and the model has 96 layers, so the math compounds quickly. The remaining ~117 billion parameters live in the MLP feed-forward blocks between attention layers.

Why is context size such a hard problem for LLMs?

The attention pattern is a square matrix whose dimensions equal the context length, so memory grows quadratically with context size. Doubling context from 32K to 64K tokens quadruples the attention memory cost, not just doubles it. This O(n²) wall is why every long-context technique — FlashAttention, sliding window attention, Mamba, multi-head latent attention — exists.

What is masking in attention and why is it used?

Masking sets the upper-triangular entries of the attention scores to negative infinity before softmax, which makes them zero in the normalized matrix while keeping each column summing to 1. This prevents earlier tokens from seeing later tokens during training, which is essential for autoregressive next-token prediction in models like GPT. Without masking, the model would trivially "cheat" by reading the answer.

Are transformers and attention the same thing?

No. Attention is one component inside a transformer. A full transformer block also contains a feed-forward MLP, residual connections, layer normalization, and (depending on the variant) positional encoding. The 2017 paper "Attention Is All You Need" was named for the fact that its architecture relied on attention instead of recurrence — but attention alone isn't a transformer.

What is the difference between additive attention and scaled dot-product attention?

Additive attention (Bahdanau et al., 2014) computes alignment scores using a small feed-forward network applied to concatenated query and key vectors. Scaled dot-product attention (Vaswani et al., 2017) computes the score as Q·Kᵀ divided by √d_k. Both produce a learned weighting between positions, but scaled dot-product runs as a single matrix multiplication on a GPU while additive attention requires a per-pair MLP forward pass — which is the entire reason transformers replaced RNN-style attention.

What is FlashAttention and why does it matter?

FlashAttention is a tiled implementation of scaled dot-product attention that never materializes the full n × n attention matrix in GPU high-bandwidth memory. By computing attention in SRAM-sized chunks, it cuts memory usage from O(n²) to O(n) and runs significantly faster on long sequences. FlashAttention 4 reaches 1,605 TFLOPs/s on NVIDIA Blackwell GPUs as of March 2026 — an order of magnitude over the 2022 baseline, and the reason 100K+ context windows are economically viable.

Key Takeaways

  • Attention is a learned operation that computes relevance scores between every pair of tokens, then uses those scores to mix their value vectors into context-aware embeddings.
  • The idea predates transformers — Bahdanau, Cho, and Bengio introduced additive attention for RNN translation in 2014. The 2017 paper kept the concept and replaced the score function with a GPU-friendly dot product.
  • "Attention Is All You Need" has been cited 173,000+ times and the architecture now runs every frontier model — GPT-4, Claude, Gemini, LLaMA 3, Mistral, DeepSeek.
  • Query, key, and value are three separate learned projections — Q and K decide attention weights, V carries the content that gets added to the output.
  • Scaled dot-product attention fits in eight lines of PyTorch but the production version (torch.nn.functional.scaled_dot_product_attention) dispatches to FlashAttention or xFormers depending on hardware.
  • The attention pattern is O(n²) in context length, which is the single most important engineering constraint in modern LLM infrastructure. FlashAttention 4 hits 1,605 TFLOPs/s on Blackwell; MLRA delivers 2.8× decoding speedup over MLA.
  • Masking enforces causal attention by zeroing out future-token scores before softmax, making efficient parallel training possible.
  • Multi-head attention runs many parallel attention operations with different learned matrices, and heads specialize in coreference, syntax, induction, and patterns no human would name. GPT-3 uses 96 heads × 96 layers and spends ~58B of 175B parameters on attention.
  • Modern variants (MQA, GQA, MLA, MLRA) compress the KV cache to ease the long-context memory wall while keeping the same Q·Kᵀ math.
  • The same operation generalizes beyond text — Vision Transformers, Whisper, AlphaFold 2, and multimodal LLMs all use scaled dot-product attention over different token types.
  • Attention won over RNNs primarily because it's parallelizable across GPU cores, not because it's mathematically deeper — and that parallelism is why transformer scaling laws work.

This post is based on Attention in transformers, step-by-step | Deep Learning Chapter 6 by 3Blue1Brown. The video is part of Grant Sanderson's Neural Networks series and was published April 7, 2024.