
What Is Attention in Transformers in LLMs? A Step-by-Step Engineering Breakdown

3Blue1Brown's visual explainer of attention, annotated by a production AI engineer. Query, key, value vectors, softmax, masking, multi-head attention, and the GPT-3 parameter math behind self-attention.

James Bennett
31 minute read

3Blue1Brown's 26-minute deep dive on attention is the clearest visual explanation of the mechanism that powers every modern large language model. This annotated breakdown walks through the math, adds the Python code, GPU economics, and modern variants that 3Blue1Brown skipped, and connects each step to what production retrieval engineers actually deal with — quadratic context cost, multi-head attention, and why GPT-3 spends 58 billion parameters on this one operation.

Attention is a learned operation that lets every token in a sequence directly read from every other token in parallel, using three projections — query, key, and value — to decide which other tokens are relevant and what information to copy from them. The scaled dot-product form used in transformers comes from the 2017 paper Attention Is All You Need, which has since been cited over 173,000 times, and it's the reason GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, and DeepSeek all share the same underlying architecture nine years later.

Attention mechanism in transformers visualized as a circular network of glowing nodes where one central node receives weighted beams of light from surrounding tokens

Video Summary and Key Insights

3Blue1Brown — the channel created by Stanford-trained mathematician Grant Sanderson — spends 26 minutes walking through the attention mechanism with hand-built animations made in his Python library manim. The video has been viewed 4 million times since April 2024 and serves as the de facto visual companion to the Attention Is All You Need paper, which has crossed 173,000 academic citations and is the foundation of every model you actually use — GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, DeepSeek. Sanderson uses a single running example, "a fluffy blue creature roamed the verdant forest," to show how query, key, and value matrices let adjectives update the meaning of nouns. The single most important takeaway: attention is just a learned weighted sum where the weights come from softmax-normalized dot products between learned projections of each token. That's it. Everything else — multi-head, masking, the output matrix — is engineering scaffolding around that core idea. If you've previously bounced off the original paper, this is the visualization that makes the math click.

  • A transformer's whole job is to update embeddings until the last vector knows enough to predict the next token. Sanderson opens by reminding viewers that prediction is computed entirely from the final vector in the sequence. Every attention block exists to make sure that final vector has absorbed everything relevant from the rest of the context.

If the model's going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word.

Grant Sanderson, 3Blue1Brown
  • Attention solves contextual ambiguity that lookup-table embeddings cannot. The same word "mole" gets the same initial embedding in "American shrew mole," "one mole of carbon dioxide," and "take a biopsy of the mole." Attention is the layer that pushes that generic vector toward the right meaning based on the surrounding tokens.
  • Query, key, and value are three learned matrices, not magic. Each input embedding is multiplied by three matrices to produce a query vector ("what am I looking for?"), a key vector ("here's what I have"), and a value vector ("here's what I'd add to your meaning if you picked me"). Relevance is the dot product of one token's query against every other token's key.

Conceptually, you want to think of the keys as potentially answering the queries. You think of the keys as matching the queries whenever they closely align with each other.

Grant Sanderson, 3Blue1Brown
  • Softmax over each column converts dot products into a probability distribution. Raw dot products run from negative infinity to positive infinity. Softmax squeezes them into [0, 1] and forces each column to sum to 1, so the resulting attention pattern can be used as weights for a weighted sum.
  • Masking forces causal attention by setting future-token scores to negative infinity before softmax. During training, GPT-style models predict every next token in parallel. To stop later tokens from leaking into earlier predictions, the upper-triangular entries of the attention matrix are zeroed out — but in a way that keeps the columns normalized.
  • The attention pattern is O(n²) in the context size — that's the bottleneck the entire industry is fighting. Sanderson calls this out directly: doubling context length quadruples the memory cost. Every "long context" trick in 2026 — sliding window, FlashAttention, Mamba hybrids, MLA — exists because of this single quadratic.
  • GPT-3 spends about 58 billion of its 175 billion parameters on attention alone. Each block has 96 heads, each head has its own Q/K/V matrices, the model has 96 layers, and the math compounds. The majority of parameters actually live in the MLP blocks between attention layers — a fact that surprises most people who learn about transformers from the attention chapter first.
  • Multi-head attention runs the same operation in parallel with different learned matrices, then sums the results. Each head learns its own kind of relationship — one might track adjective-noun bindings, another coreference, another syntactic structure. Sanderson is explicit that we don't actually know what most heads do; the patterns emerge from gradient descent.

The overall idea is that by running many distinct heads in parallel, you're giving the model the capacity to learn many distinct ways that context changes meaning.

Grant Sanderson, 3Blue1Brown

I've spent the last few years at WebSearchAPI.ai building the retrieval layer that feeds live web data into transformer-based LLMs. Most of what we ship has nothing to do with attention math directly — but every single byte we send to a model passes through self-attention on the other side, and every extra token costs us latency, GPU-seconds, and cache pressure. When I talk to junior engineers about why we obsess over context window utilization, the conversation always lands back on this video.

Sanderson's animations are the cleanest visual treatment of self-attention I've found. He's a Stanford math grad who built a custom animation library specifically to teach math intuitively, and his attention chapter has been viewed over 4 million times since April 2024. But it's a teaching video — it skips the GPU economics, the memory wall, and the Anthropic transformer circuits work that explains why the value-up-and-down factoring exists. I'll quote Sanderson for the conceptual scaffolding, then fill in what production AI engineers actually run into. If you've never watched the video, watch it first. Then come back here for the engineering layer on top.

Recap diagram showing tokens being mapped to high-dimensional embedding vectors at the start of a transformer pipeline

Why Are Embeddings Alone Not Enough?

Embeddings alone are a context-free lookup table — the same word gets the same vector regardless of what surrounds it. Attention exists to fix that, and it's the only mechanism in a transformer that lets tokens share information with each other.

The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning.

Grant Sanderson, 3Blue1Brown

The "mole" example Sanderson opens with is the cleanest demonstration of the problem. After tokenization and embedding, "American true mole," "one mole of carbon dioxide," and "take a biopsy of the mole" all start with the same vector for the word "mole." That vector points to a generic spot in the 12,288-dimensional embedding space GPT-3 uses — somewhere between "small mammal," "chemistry unit," and "skin lesion." It cannot be all three at once. Without a layer that reads the surrounding context, the rest of the network is stuck.

This is the practical reason embeddings on their own — the kind of static embeddings produced by Word2Vec (2013) or GloVe (2014) — can't power modern language understanding. They were good enough for analogy benchmarks ("king - man + woman = queen") but they can't disambiguate. In production retrieval at WebSearchAPI.ai, the same problem shows up when we embed user queries with a static encoder: a query about "Apple revenue" and a query about "apple varieties" land near each other in vector space until you push them through a contextual encoder. Attention is what makes contextual encoding possible.

Where Did the Idea of Attention Actually Come From?

Attention as a learned operation predates transformers by three years. Bahdanau, Cho, and Bengio's 2014 paper on neural machine translation introduced the first additive attention mechanism for an encoder-decoder RNN, where a small feed-forward network computed alignment scores between the decoder's hidden state and each encoder hidden state. The 2017 transformer paper kept the core idea — learned weighting between positions — and replaced the feed-forward scoring with the dot-product formulation that runs faster on GPUs.

Year | Paper / Architecture | Key Idea | Score Function
2014 | Bahdanau et al. (1409.0473) | Additive attention for RNN translation | Feed-forward MLP over [hᵢ, sⱼ]
2015 | Luong et al. (1508.04025) | Multiplicative attention | Dot product, no scaling
2017 | Vaswani et al. (1706.03762) | Self-attention in transformers | Scaled dot product Q·Kᵀ/√d_k
2022 | Dao et al. (FlashAttention) | Tiled attention computation | Same math, different memory access
2024 | DeepSeek-V2 (Multi-Head Latent Attention) | Compressed KV cache | Latent-space dot product

The biological framing IBM uses in their explainer maps roughly to this lineage: humans pay selective attention to salient parts of the visual or auditory field while filtering noise, and machine attention does something analogous over token sequences. The metaphor is useful for intuition but the implementation is plain linear algebra — there's no biological circuit being simulated, just a learned weighted sum that happened to work on translation in 2014 and never stopped working.

The reason the 2017 paper exploded into 173,000+ citations isn't that attention itself was new. It's that Vaswani's team showed you could build an entire model out of attention, drop recurrence completely, and still match (then exceed) state-of-the-art on translation while training in a fraction of the wall-clock time. The architecture was a compute argument disguised as an accuracy paper. The follow-on architectures — encoder-only BERT (2018), decoder-only GPT (2018), and every frontier LLM since — all picked the same skeleton.

What Does Attention Actually Do to a Vector?

Attention takes a context-free embedding and pushes it in a specific direction in embedding space, where that direction encodes the influence of surrounding tokens. For "Eiffel tower" preceded by "miniature," the vector for "tower" gets nudged away from "tall," and toward "Paris," "souvenir," and "small."

Motivating examples showing how the embedding for tower can be refined into different meanings based on preceding context like Eiffel and miniature

Sanderson walks through how attention disambiguates 'mole' across three different sentences.

Sanderson uses two examples to set this up: "mole" with three different meanings, and "tower" preceded by either "Eiffel" or "miniature Eiffel." Both make the same point — there are distinct directions in embedding space for distinct meanings of the same token, and attention is what calculates the offset that moves the generic embedding toward the right one.

The mental model that helped me most when debugging RAG pipelines is this: think of every token as a point in a 12,000-dimensional cloud. Attention is a learned function that reads the rest of the cloud and outputs a delta vector for each point. The delta is small for tokens whose meaning is already clear (like determiners) and large for tokens whose meaning depends heavily on context (like pronouns and ambiguous nouns). This is also why long-range coreference — where "it" refers to something 200 tokens earlier — works at all in transformers but failed catastrophically in RNNs.

How Does the Attention Pattern Get Computed?

The attention pattern is a square grid of dot products between every query and every key, scaled by the square root of the key dimension and softmax-normalized column by column. The result is a matrix where column j tells you how much each previous token contributes to the meaning of token j.

Query key and value vectors in attention illustrated as a library search where a person holding a glowing question card pulls relevant books off shelves

Attention pattern diagram showing the example sentence 'a fluffy blue creature roamed the verdant forest' with the attention grid forming between query and key vectors

To measure how well each key matches each query, you compute a dot product between each possible key query pair. I like to visualize a grid full of a bunch of dots where the bigger dots correspond to the larger dot products, the places where the keys and queries align.

Grant Sanderson, 3Blue1Brown

Here's the formula from the original paper, decoded one piece at a time:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Term | What It Is | Shape (GPT-3)
Q | Stack of all query vectors for the sequence | n × 128
K | Stack of all key vectors for the sequence | n × 128
V | Stack of all value vectors for the sequence | n × 12288
QKᵀ | Raw attention scores (every query against every key) | n × n
√d_k | Square root of the key dimension, used for numerical stability | scalar = √128 ≈ 11.3
softmax(...) | Column-wise normalization to a probability distribution | n × n
· V | Weighted sum of value vectors using the softmax weights as coefficients | n × 12288

The √d_k division is the part most explanations skip. Without it, when key and query vectors have many dimensions, dot products grow large, and softmax collapses to a one-hot distribution where one token gets all the attention weight and the rest get rounding errors. Dividing by √d_k keeps the softmax in a regime where it can actually express partial weight on multiple tokens. The 2017 paper justifies it with a short variance argument (a dot product of d_k roughly unit-variance components has variance about d_k), but in practice it reads as a stabilization trick that ended up surviving everything since.
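A quick toy sketch (illustrative numbers of my own, not from the video) makes the saturation visible: with 128-dimensional queries and keys, the unscaled dot products are already large enough that softmax puts nearly all of its mass on one key, while the scaled scores spread weight across several.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 128
q = torch.randn(1, d_k)            # one query with roughly unit-variance entries
k = torch.randn(8, d_k)            # eight candidate keys

raw = q @ k.T                      # typical magnitudes around sqrt(d_k) ~ 11
scaled = raw / d_k ** 0.5          # back to roughly unit scale

print(F.softmax(raw, dim=-1))      # tends toward a near one-hot distribution
print(F.softmax(scaled, dim=-1))   # spreads weight across several keys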

The example sentence Sanderson uses, "a fluffy blue creature roamed the verdant forest," is doing a lot of work here. Adjectives ("fluffy," "blue," "verdant") query for nouns. Nouns advertise themselves as nouns through their keys. The dot products between matching pairs are large, and softmax turns those into the dominant weights for the column. After the weighted sum, the noun embedding has absorbed information from its modifiers — that's the whole behavior in one sentence.

What Does Scaled Dot-Product Attention Look Like in PyTorch?

The full math fits in eight lines of PyTorch. This is the canonical implementation that ships in roughly every educational notebook and matches what torch.nn.functional.scaled_dot_product_attention runs internally before kernel-level optimizations:

import math
import torch
import torch.nn.functional as F
 
def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: [batch, heads, seq_len, d_k]
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

A few things worth pointing out for production engineers reading this:

  • The torch.matmul(query, key.transpose(-2, -1)) line is the n × n attention matrix being materialized in HBM. For a 64K-token sequence with float16 weights, that's 64K × 64K × 2 bytes ≈ 8.6 GB per head per layer — which is exactly what FlashAttention sidesteps by tiling the computation in SRAM and never writing the full matrix to HBM.
  • The masked_fill step is what enforces causal attention in GPT-style training. A common bug in homemade attention implementations is forgetting that the mask must be applied before softmax, not after. Apply it after and your "zeroed" entries get a non-zero weight from the softmax denominator.
  • The F.softmax(scores, dim=-1) operates along the last dimension because PyTorch attention matrices are usually shaped [batch, heads, seq, seq] where the last seq dimension is the keys. Running softmax along the wrong axis silently breaks the model — the gradients still flow, training just doesn't converge.
  • The function returns the weighted values and the attention weights themselves. In production we usually drop the weights to save memory, but during debugging they're the single most useful thing to inspect — they tell you which tokens the model thinks are relevant to which other tokens, which is how you diagnose attention dilution in long-context retrieval.

For real production code, you almost never write this loop by hand. PyTorch 2.0+ ships torch.nn.functional.scaled_dot_product_attention, which dispatches to FlashAttention, memory-efficient attention, or a fallback math kernel depending on hardware. JAX has jax.nn.dot_product_attention. xFormers and FlashAttention are the two reference implementations everyone benchmarks against. The eight-line version above is for understanding, not deploying.
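For illustration, here is roughly what the production call looks like; the shapes are my own toy choices, and the FlashAttention path only kicks in on supported hardware and dtypes.

import torch
import torch.nn.functional as F

# Toy shapes: batch=2, heads=8, seq_len=1024, head_dim=128 (illustrative, not GPT-3's real config)
q = torch.randn(2, 8, 1024, 128)
k = torch.randn(2, 8, 1024, 128)
v = torch.randn(2, 8, 1024, 128)

# is_causal=True applies the upper-triangular mask internally; PyTorch dispatches to a
# FlashAttention, memory-efficient, or plain math kernel depending on hardware and dtype.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 128])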

Why Do Transformers Apply Masking?

Masking enforces causal attention so that earlier tokens can't see later tokens during training. It works by setting the upper-triangular entries of the attention scores to negative infinity before softmax, which makes them zero in the normalized matrix while keeping each column sum equal to 1.

Masking pattern showing the lower-triangular attention matrix with future-token cells blacked out

The simplest thing you might think to do is to set them equal to zero, but if you did that, the columns wouldn't add up to one anymore. They wouldn't be normalized. So instead, a common way to do this is that before applying softmax, you set all of those entries to be negative infinity.

Grant Sanderson, 3Blue1Brown

The reason this matters during training is subtle. When GPT trains, it doesn't just predict the last token — it predicts every next token simultaneously, for every position in the sequence. That means a single 1024-token training example produces 1024 supervised prediction targets for the price of one forward pass. Masking is what makes this efficient: each position must predict its own next token without cheating by looking at future tokens. Remove the mask and the model just memorizes the input.
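If you build the mask yourself for the hand-rolled function above, a lower-triangular boolean matrix is the usual construction. A minimal sketch:

import torch

seq_len = 6
# 1s on and below the diagonal: position i may attend to positions 0..i, never to the future.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Passed as `mask` to scaled_dot_product_attention above, the 0 (future) entries are filled
# with -inf before softmax, so every position still gets a normalized distribution over its past.
print(causal_mask.int())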

In production inference, the picture is slightly different. Once a model is deployed and generating text autoregressively, it generates one token at a time, so there are no future tokens to mask. But the mask is still applied for consistency with the training-time computation graph. KV-cache implementations exploit this by storing the keys and values for all previous tokens and only computing the new query against the cached keys — a 100x speedup on long-context generation that depends on the causal mask being in place.
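The KV-cache idea fits in a few lines. This is a toy single-head sketch of the decode loop, not any particular serving framework: keys and values for past tokens are appended to a cache, and each new token computes only one query row against it.

import torch

d_k = 64
k_cache = torch.empty(0, d_k)   # keys for all previously generated tokens
v_cache = torch.empty(0, d_k)   # values for all previously generated tokens

for step in range(5):
    # In a real model these come from projecting the newest token's embedding.
    new_q, new_k, new_v = (torch.randn(1, d_k) for _ in range(3))
    k_cache = torch.cat([k_cache, new_k])            # append; old keys are never recomputed
    v_cache = torch.cat([v_cache, new_v])
    scores = new_q @ k_cache.T / d_k ** 0.5          # [1, step + 1], not [n, n]
    weights = torch.softmax(scores, dim=-1)          # causal by construction: the cache only holds the past
    context = weights @ v_cache                      # [1, d_k] update for the newest token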

For non-causal use cases like translation or speech recognition, you typically don't want a mask. That's where cross-attention comes in (Sanderson covers this around the 18-minute mark), and why encoder-decoder models like the original Transformer and T5 use bidirectional attention in the encoder.

How Does Context Size Become a Bottleneck?

The attention pattern is an n × n matrix where n is the context size, so memory grows quadratically. Doubling your context window from 32K to 64K tokens quadruples the attention memory, not just doubles it.

Quadratic attention context window bottleneck shown as a small sparse grid expanding into an overwhelming dense lattice that bursts beyond its frame

Context size diagram with attention pattern shown as a square grid whose dimensions scale with token count

Another fact that's worth reflecting on about this attention pattern is how its size is equal to the square of the context size. So this is why context size can be a really huge bottleneck for large language models, and scaling it up is nontrivial.

Grant Sanderson, 3Blue1Brown

Sanderson notes this almost in passing, but it's the single most consequential engineering fact in modern LLM infrastructure. Every "long context" announcement you've seen since 2023 — Claude's 100K window, Gemini's 1M window, GPT-4 Turbo's 128K — sits on top of architectural workarounds for the n² wall. The attention math itself hasn't changed; the way we run it has. As an arXiv survey from 2025 put it bluntly: "the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling."
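The arithmetic is worth doing once by hand. A back-of-the-envelope sketch of the attention-matrix footprint for a single head in float16 (2 bytes per entry), ignoring everything FlashAttention-style kernels do to avoid materializing it:

# n x n scores in float16, one head, one layer
for n in (8_192, 32_768, 65_536, 131_072):
    gb = n * n * 2 / 1e9
    print(f"{n:>7} tokens -> {gb:6.1f} GB")
# 8,192 tokens -> 0.1 GB; 32,768 -> 2.1 GB; 65,536 -> 8.6 GB; 131,072 -> 34.4 GB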

The standard tricks, in roughly the order they appeared:

Technique | Year | Idea | Used In
FlashAttention | 2022 | Tile attention in SRAM, never materialize full n × n in HBM | All frontier LLMs since 2023
Sliding window attention | 2023 | Each token only attends to the last k tokens, not all n | Mistral 7B, Longformer
Multi-query attention (MQA) | 2019 | All heads share a single K and V projection | PaLM, early Falcon
Grouped-query attention (GQA) | 2023 | Heads share K/V in groups (compromise between MHA and MQA) | LLaMA 2/3, Mixtral
Multi-head latent attention (MLA) | 2024 | Compress KV cache into a low-rank latent space | DeepSeek-V2, V3
Multi-head low-rank attention (MLRA) | 2026 | Partitionable latent states for 4-way TP decoding | ICLR 2026
FlashAttention 4 | 2026 | Reaches 1,605 TFLOPs/s on NVIDIA Blackwell | Default for new training runs
State space models (SSMs) | 2023 | Replace attention with a linear recurrence that scales O(n) | Mamba, RWKV, hybrids

The pace of these optimizations is itself a signal. Songtao Liu's 2026 ICLR paper on Multi-Head Low-Rank Attention reports a 2.8× decoding speedup over MLA while matching its perplexity — meaning the gains from squeezing the attention KV cache haven't plateaued yet. As Sebastian Raschka summarized in his visual attention variants overview, the field has moved "from MHA and GQA to MLA, sparse attention, and hybrid architectures" in roughly 18 months. Pick your favorite frontier model and odds are its attention block isn't the same as the 2017 paper anymore.

In our retrieval engine at WebSearchAPI.ai, the practical version of this fight is choosing how much context to send. For a typical RAG query, we could send 500 tokens of carefully ranked context or 50,000 tokens of raw search results. The 50K version costs roughly 100x more in attention compute and is empirically worse on answer quality — a phenomenon researchers call "lost in the middle," where models ignore information buried deep in long contexts. The quadratic isn't just a billing problem; it's a quality problem too.

What Are Value Vectors and How Do They Update Embeddings?

Value vectors are the third learned projection of each token, and they hold the actual information that gets added to other tokens' embeddings. Once you've computed the attention pattern from queries and keys, you compute a weighted sum of value vectors using the attention weights as coefficients, and add that sum back to the original embedding.

Values diagram showing the value matrix multiplication producing value vectors that get summed with attention weights and added to the original embedding

This value vector lives in the same very high dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that something else?

Grant Sanderson, 3Blue1Brown

Here's the part of attention that throws people off: the query and key vectors are only used to compute weights. They never directly influence what gets added to the output embedding. The thing that actually flows through the network — the content that gets injected from one token into another — is the value vector.

This separation of concerns is more elegant than it first looks. Queries and keys decide who talks to whom. Values decide what gets said. Anthropic's transformer circuits work showed that this factorization is closer to "memory addressing + memory contents" than anyone realized — the QK circuit determines which heads route information between which positions, and the OV (output-value) circuit determines what transformation gets applied when that routing happens. Sanderson hints at this when he calls the value map "a low-rank transformation," and it's the conceptual unlock that made me start treating attention like a learned content-addressable memory rather than an opaque blob of matmuls.
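Here is Sanderson's single-head picture as a toy sketch: tiny dimensions stand in for GPT-3's 12,288-dim embeddings and 128-dim heads, and I use row-wise softmax rather than the video's column convention. Queries and keys only produce the weights, the value-down and value-up matrices produce the delta, and the delta is added back onto the original embedding.

import torch

n, d_model, d_head = 7, 64, 8
E = torch.randn(n, d_model)                    # context-free embeddings for n tokens

W_Q = torch.randn(d_model, d_head) * 0.1
W_K = torch.randn(d_model, d_head) * 0.1
W_Vdown = torch.randn(d_model, d_head) * 0.1   # "value-down": embedding space -> head space
W_Vup = torch.randn(d_head, d_model) * 0.1     # "value-up": head space -> embedding space

scores = (E @ W_Q) @ (E @ W_K).T / d_head ** 0.5
weights = torch.softmax(scores, dim=-1)        # who talks to whom: Q and K only set the weights

delta = weights @ (E @ W_Vdown) @ W_Vup        # what gets said: the content each token absorbs
E_updated = E + delta                          # residual add: the embedding is nudged, not replaced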

How Many Parameters Does a Single Attention Head Use?

A single GPT-3 attention head uses about 6.3 million parameters split across four matrices: query (W_Q), key (W_K), value-down (W_V↓), and value-up (W_V↑). The Q and K matrices each have ~1.5M parameters; the value map is factored into two smaller matrices to match.

Counting parameters diagram showing the four matrices Q, K, value-down, and value-up with their dimensions and parameter counts for GPT-3

The way the value map is factored is as a product of two smaller matrices. To throw in linear algebra jargon here, what we're basically doing is constraining the overall value map to be a low rank transformation.

Grant Sanderson, 3Blue1Brown

The value-up / value-down factoring is one of those design choices that looks arbitrary until you do the parameter math:

Matrix | Naive Shape | Naive Params | Factored Shape | Factored Params
W_Q | 12288 × 128 | 1,572,864 | same | same
W_K | 12288 × 128 | 1,572,864 | same | same
W_V | 12288 × 12288 | 150,994,944 | 12288 × 128 + 128 × 12288 | 3,145,728

The factored value map saves about 148 million parameters per head. With 96 heads × 96 layers in GPT-3, that saving multiplies into something on the order of 1.4 trillion parameters that could have been spent on a square value matrix but weren't. This is one of the reasons GPT-3 fits into 175B parameters instead of 1.5T — a single design choice in the attention block.
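A few lines of arithmetic reproduce those numbers (GPT-3's published dimensions; the "naive" square value matrix is the counterfactual, not anything GPT-3 actually ships):

d_model, d_head, heads, layers = 12_288, 128, 96, 96

naive_value = d_model * d_model                      # one full-rank value matrix per head
factored    = d_model * d_head + d_head * d_model    # value-down + value-up
saved       = naive_value - factored                 # ~148M parameters per head

print(f"per head: {naive_value:,} vs {factored:,} ({saved:,} saved)")
print(f"across 96 heads x 96 layers: {saved * heads * layers / 1e12:.2f} trillion parameters saved")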

The other consequence of low-rank value maps is interpretability. When the value transformation is forced to be rank-128 in a 12288-dimensional space, what each head can do is constrained. Anthropic's circuits work exploits this directly: a low-rank OV circuit is much easier to analyze than a full-rank one, and most of the mechanistic interpretability results published since 2021 depend on it.

What Is Cross-Attention and Where Does It Show Up?

Cross-attention is the same operation as self-attention except the keys and values come from a different sequence than the queries. It's used wherever a model needs to relate two distinct streams of data — translation, image captioning, speech transcription, retrieval-augmented generation.

Cross-attention diagram showing keys and queries coming from two different language sequences during translation

A cross attention head looks almost identical. The only difference is that the key and query maps act on different datasets. In a model doing translation, the keys might come from one language while the queries come from another.

Grant Sanderson, 3Blue1Brown

Cross-attention is what lets the decoder of a translation model look up the original English sentence while it's generating French. It's also what powers Whisper's speech-to-text — the decoder generates text tokens whose queries attend to keys derived from audio features. And it's the conceptual basis for retrieval-augmented generation: at WebSearchAPI.ai, when we hand a model search results plus a user question, the model uses something close to cross-attention internally to figure out which retrieved passages are relevant to which parts of the question.

A small caveat Sanderson mentions: cross-attention typically has no causal mask, because there's no temporal ordering between the two sequences. In a translation model, the entire source sentence is available before generation begins, so masking would just throw away signal. This is why decoder-only models (GPT, Claude, LLaMA) only use causal self-attention while encoder-decoder models (T5, BART, Whisper) use bidirectional cross-attention from decoder to encoder.
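The call looks the same as self-attention except the query tensor comes from a different sequence than the keys and values, and there is no causal flag. A toy sketch (shapes and names are illustrative; in a real model K and V would be separate learned projections of the encoder states):

import torch
import torch.nn.functional as F

enc = torch.randn(1, 4, 37, 64)   # encoder states: [batch, heads, src_len, d_head]
dec = torch.randn(1, 4, 12, 64)   # decoder states: [batch, heads, tgt_len, d_head]

# Queries from the decoder, keys/values from the encoder, no causal mask:
# every target position may read the entire source sequence.
out = F.scaled_dot_product_attention(query=dec, key=enc, value=enc)
print(out.shape)  # torch.Size([1, 4, 12, 64]) -- one context vector per target position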

Why Do Transformers Use Multiple Attention Heads in Parallel?

Multiple heads exist so the model can learn many different types of contextual relationships simultaneously — adjective-noun bindings, coreference, syntactic structure, semantic role, and patterns no human would label. Each head has its own Q, K, and V matrices, and the outputs of all heads are summed together at the end of the block.

Multi-head attention illustrated as a grid of nine parallel rooms each containing a different specialist instrument pointing toward the same central object

Multi-head attention diagram showing 96 parallel attention heads with their own Q, K, V matrices feeding into a single output

Multi-head attention runs many parallel attention operations, each with its own learned matrices.

GPT three, for example, uses 96 attention heads inside each block. Considering that each one is already a bit confusing, it's certainly a lot to hold in your head.

Grant Sanderson, 3Blue1Brown

The multi-head structure is the reason transformers learn so much from raw text. Different heads end up specializing — interpretability researchers at Anthropic and elsewhere have identified "induction heads" that copy patterns from earlier in the sequence, "previous token heads" that look at position n-1, and "name mover" heads that surface entity references. None of these specializations are programmed in. They emerge during training because the gradient finds them useful.

The parameter math for the full multi-head block in GPT-3 lands at about 600 million parameters per layer:

  • ~6.3M params per head (spread across its 4 matrices) × 96 heads ≈ 600M parameters per attention block
  • 600M × 96 layers ≈ 58 billion parameters in attention across the whole model
  • 175B total params - 58B attention ≈ 117B in MLPs and embeddings

The "MLP is bigger than attention" surprise is real. Most of GPT-3's parameters live in the feed-forward blocks between attention layers, not in attention itself. But attention is what lets information flow between positions, and the MLPs only get to do anything useful because attention has already mixed the right tokens together.

How Does the Output Matrix Tie Multi-Head Attention Together?

The output matrix is the per-head "value-up" projections of all heads stapled together into one large matrix, the thing research papers call W_O or the output projection. It's a notational convenience that lives at the multi-head level rather than the per-head level, and it can confuse readers expecting a separate W_O and W_V↑.

Output matrix diagram showing the value-up projections of all heads combined into a single output matrix at the multi-head level

All of these value up matrices for each head appear stapled together in one giant matrix that we call the output matrix associated with the entire multi headed attention block.

Grant Sanderson, 3Blue1Brown

This is the part of the attention chapter where I usually have to slow down with a junior engineer. If you read the original paper, you'll see one V matrix per head and one big W_O at the multi-head level. If you read a teaching explanation like this video, you'll see Q, K, V↓, and V↑ per head with no separate output matrix. They describe the same computation. The difference is whether you bundle the V↑ projections of all heads into one matrix or keep them per-head.

The practical implication: when you read a research paper that talks about "the output projection" or "W_O," it's the thing Sanderson called the value-up matrix. When you read a paper about "the value matrix" or "V," it's the value-down projection only. Mismatched naming has tripped up more grad students than I can count, and Sanderson is unusually upfront about it. If you're implementing attention from scratch, pick one convention and document it.

How Does Attention Compose Across Many Layers?

Attention is applied repeatedly through dozens of stacked transformer blocks, with each block adding more contextual richness on top of the embeddings produced by the previous one. By layer 96, the embeddings encode high-level features like sentiment, topic, and reasoning structure — not just individual word meanings.

Going deeper diagram showing attention blocks stacked many times in series with embeddings flowing through each layer

The further down the network you go, with each embedding taking in more and more meaning from all the other embeddings, which themselves are getting more and more nuanced, the hope is that there's the capacity to encode higher level and more abstract ideas.

Grant Sanderson, 3Blue1Brown

The layered structure is what makes transformers more than the sum of their parts. A single attention block can only mix information once. Stack 96 of them and you get 96 rounds of mixing, each operating on representations that already encode some context. Mechanistic interpretability papers have shown that early layers in GPT-style models tend to handle syntax and surface features, middle layers handle semantic and entity-level reasoning, and late layers handle output formatting and next-token prediction. Nobody designed this division of labor — it falls out of training.

In production retrieval, this layered behavior is why simple "ablate one layer" experiments rarely work. You can't just swap out the attention mechanism in layer 47 of a 96-layer model and expect the rest of the stack to keep functioning, because the representations in layer 48 were trained against whatever layer 47 produced. This tight coupling is one reason the industry hasn't fully migrated away from vanilla attention even though faster alternatives exist — every replacement has to be retrained from scratch, and the cost of training a frontier model in 2026 is somewhere between $100M and $1B depending on how you count.

How Does Attention Generalize Beyond Text?

The same scaled dot-product attention you read in the 2017 paper is what powers Vision Transformers, Whisper, AlphaFold 2, video models, and every multimodal LLM shipping today. The token sequence changes — image patches, audio frames, amino acid residues, video frames — but the operation is identical.

Modality"Tokens" AreReference Architecture
TextSubword units (BPE)GPT-4, Claude 4, Gemini 3
Image16×16 pixel patches flattenedVision Transformer (ViT), DINOv2
Audio25ms mel-spectrogram windowsWhisper, AudioLM
VideoSpatiotemporal patchesSora, VideoPoet
ProteinAmino acid residuesAlphaFold 2, ESM-3
MultimodalMixed text + image + audio tokensGPT-4o, Gemini 3, Claude 4 Vision

Vision Transformers were the first big proof that the architecture wasn't text-specific. The 2020 ViT paper from Google showed that splitting images into 16×16 patches and feeding them through a standard transformer matched or beat ConvNets on ImageNet — with no convolutional bias built in. The model learned spatial structure entirely from data. Anyone in computer vision in 2019 would have told you that wasn't supposed to work. By 2022, ViT-style backbones were the default for most vision tasks.

The reason this generalizes is the same reason attention won over RNNs. Attention treats its input as a set with positional metadata, not a sequence with a hardcoded order. If you can tokenize a modality — turn it into a finite list of vectors — you can feed it through self-attention and let the model learn whatever positional relationships matter. This is also why multimodal models like GPT-4o and Claude 4 work without separate vision and language stacks: the same attention block reads text tokens and image patches in the same context window, computes Q·Kᵀ across all of them, and lets the cross-modal weights emerge during training. The 2017 paper called itself "Attention Is All You Need," and nearly a decade in, that's mostly held up.

Why Is Parallelism the Real Reason Attention Won?

Attention won because it's parallelizable on GPUs in a way RNNs never were. The whole attention pattern computes as one matrix multiplication across all token pairs simultaneously, while RNNs literally cannot start step t until step t-1 finishes.

Ending screen with a summary of the attention mechanism and references to additional resources by Karpathy and Olah

A big part of the story for the success of the attention mechanism is not so much any specific kind of behavior that it enables, but the fact that it's extremely parallelizable.

Grant Sanderson, 3Blue1Brown

This is the closing point Sanderson lands on, and it's the one I'd put in bold for any engineer learning transformers in 2026. Attention isn't deeper or smarter than what came before. It's flatter — the data dependencies are wider but shallower, which maps cleanly onto thousands of GPU cores running in parallel. The 2017 paper essentially asked: "what if we let the chip do all its multiplications at once instead of one at a time?" Everything else followed from that.

The downstream consequence is what Rich Sutton called the bitter lesson — the methods that scale with compute end up winning, regardless of whether they look elegant. Attention scales with compute. RNNs don't. Even if a sufficiently clever RNN variant could in principle solve the same problems, the architecture that lets you point a billion-dollar GPU cluster at the problem and amortize linearly in compute is the one that ships.

The hardware-software co-evolution since 2017 has been remarkable to watch from inside the industry. NVIDIA's H100 and Blackwell GPUs ship with hardware-level optimizations specifically for attention math — tensor cores, transformer engines, and memory hierarchy decisions that would have been overkill for any pre-2017 architecture. FlashAttention 4 hitting 1,605 TFLOPs/s on Blackwell isn't just a software win; it's the result of GPU vendors and model architects converging on the same operation as the central computational unit of AI. As one Forbes column put it in early 2026, "too many people have a sort of distorted view of how attention mechanisms work in analyzing text" — the gap between how attention is described in tutorials and how it actually runs on production hardware has only widened.

Sanderson recommends Karpathy and Chris Olah's transformer circuits work at the end of the video for more depth, and I'd second both. Karpathy's Build a GPT from scratch is the natural next watch. If you want the broader transformer block (MLPs, residuals, encoder-decoder) instead of just attention, our sister post on the full architecture walks through ByteByteGo's 10-minute explainer with the same engineer-annotation approach. And if you're wondering how attention patterns end up misaligned during training, our piece on AI reward hacking covers what happens when gradient descent finds the wrong attention shortcuts.

Frequently Asked Questions

What is attention in transformers in simple terms?

Attention is a layer in a transformer that lets every token in a sequence read information from every other token at once, weighted by how relevant each pair is. Sanderson describes it as a learned mechanism that takes context-free embeddings and pushes them toward more contextually rich meanings — the mechanism that lets "mole" mean "small mammal" in one sentence and "skin lesion" in another.

What are query, key, and value vectors in attention?

Query, key, and value are three learned projections of every input token. The query represents what the token is "looking for" in other tokens, the key represents what the token "has to offer," and the value carries the actual content that gets added to other embeddings. The attention pattern comes from dot products between queries and keys; the output is a weighted sum of values using those dot products as weights.

Why does the attention formula divide by the square root of d_k?

The √d_k division keeps the dot products in a numerically stable range before applying softmax. Without it, when keys and queries have many dimensions (128 in GPT-3), the dot products can grow large and softmax collapses to a one-hot distribution, making gradients vanish. This is an empirical stabilization trick from the original 2017 paper that has survived essentially unchanged.

What is the difference between self-attention and cross-attention?

Self-attention computes queries, keys, and values from the same input sequence. Cross-attention pulls queries from one sequence and keys plus values from a different sequence — for example, a French decoder querying an English encoder during translation. Sanderson notes that GPT-style models use only self-attention with masking, while encoder-decoder models like T5 and Whisper use both self-attention and cross-attention.

How many parameters does GPT-3 spend on attention?

GPT-3 spends roughly 58 billion of its 175 billion total parameters on attention — about a third of the model. Each attention block contains 96 heads with about 6.3 million parameters each, and the model has 96 layers, so the math compounds quickly. The remaining ~117 billion parameters live in the MLP feed-forward blocks between attention layers.

Why is context size such a hard problem for LLMs?

The attention pattern is a square matrix whose dimensions equal the context length, so memory grows quadratically with context size. Doubling context from 32K to 64K tokens quadruples the attention memory cost, not just doubles it. This O(n²) wall is why every long-context technique — FlashAttention, sliding window attention, Mamba, multi-head latent attention — exists.

What is masking in attention and why is it used?

Masking sets the upper-triangular entries of the attention scores to negative infinity before softmax, which makes them zero in the normalized matrix while keeping each column summing to 1. This prevents earlier tokens from seeing later tokens during training, which is essential for autoregressive next-token prediction in models like GPT. Without masking, the model would trivially "cheat" by reading the answer.

Are transformers and attention the same thing?

No. Attention is one component inside a transformer. A full transformer block also contains a feed-forward MLP, residual connections, layer normalization, and (depending on the variant) positional encoding. The 2017 paper "Attention Is All You Need" was named for the fact that its architecture relied on attention instead of recurrence — but attention alone isn't a transformer.

What is the difference between additive attention and scaled dot-product attention?

Additive attention (Bahdanau et al., 2014) computes alignment scores using a small feed-forward network applied to concatenated query and key vectors. Scaled dot-product attention (Vaswani et al., 2017) computes the score as Q·Kᵀ divided by √d_k. Both produce a learned weighting between positions, but scaled dot-product runs as a single matrix multiplication on a GPU while additive attention requires a per-pair MLP forward pass — which is the entire reason transformers replaced RNN-style attention.

What is FlashAttention and why does it matter?

FlashAttention is a tiled implementation of scaled dot-product attention that never materializes the full n × n attention matrix in GPU high-bandwidth memory. By computing attention in SRAM-sized chunks, it cuts memory usage from O(n²) to O(n) and runs significantly faster on long sequences. FlashAttention 4 reaches 1,605 TFLOPs/s on NVIDIA Blackwell GPUs as of March 2026 — an order of magnitude over the 2022 baseline, and the reason 100K+ context windows are economically viable.

Key Takeaways

  • Attention is a learned operation that computes relevance scores between every pair of tokens, then uses those scores to mix their value vectors into context-aware embeddings.
  • The idea predates transformers — Bahdanau, Cho, and Bengio introduced additive attention for RNN translation in 2014. The 2017 paper kept the concept and replaced the score function with a GPU-friendly dot product.
  • "Attention Is All You Need" has been cited 173,000+ times and the architecture now runs every frontier model — GPT-4, Claude, Gemini, LLaMA 3, Mistral, DeepSeek.
  • Query, key, and value are three separate learned projections — Q and K decide attention weights, V carries the content that gets added to the output.
  • Scaled dot-product attention fits in eight lines of PyTorch but the production version (torch.nn.functional.scaled_dot_product_attention) dispatches to FlashAttention or xFormers depending on hardware.
  • The attention pattern is O(n²) in context length, which is the single most important engineering constraint in modern LLM infrastructure. FlashAttention 4 hits 1,605 TFLOPs/s on Blackwell; MLRA delivers 2.8× decoding speedup over MLA.
  • Masking enforces causal attention by zeroing out future-token scores before softmax, making efficient parallel training possible.
  • Multi-head attention runs many parallel attention operations with different learned matrices, and heads specialize in coreference, syntax, induction, and patterns no human would name. GPT-3 uses 96 heads × 96 layers and spends ~58B of 175B parameters on attention.
  • Modern variants (MQA, GQA, MLA, MLRA) compress the KV cache to ease the long-context memory wall while keeping the same Q·Kᵀ math.
  • The same operation generalizes beyond text — Vision Transformers, Whisper, AlphaFold 2, and multimodal LLMs all use scaled dot-product attention over different token types.
  • Attention won over RNNs primarily because it's parallelizable across GPU cores, not because it's mathematically deeper — and that parallelism is why transformer scaling laws work.

This post is based on Attention in transformers, step-by-step | Deep Learning Chapter 6 by 3Blue1Brown. The video is part of Grant Sanderson's Neural Networks series and was published April 7, 2024.