3Blue1Brown's visual explainer of attention, annotated by a production AI engineer. Query, key, value vectors, softmax, masking, multi-head attention, and the GPT-3 parameter math behind self-attention.
3Blue1Brown's 26-minute deep dive on attention is the clearest visual explanation of the mechanism that powers every modern large language model. This annotated breakdown walks through the math, adds Python code, GPU economics, and modern variants 3Blue1Brown skipped, and connects each step to what production retrieval engineers actually deal with — quadratic context cost, multi-head attention, and why GPT-3 spends 58 billion parameters on this one operation.
Attention is a learned operation that lets every token in a sequence directly read from every other token in parallel, using three projections — query, key, and value — to decide which other tokens are relevant and what information to copy from them. It first appeared in the 2017 paper Attention Is All You Need, which has since been cited over 173,000 times, and is the reason GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, and DeepSeek all share the same underlying architecture nine years later.

3Blue1Brown — the channel created by Stanford-trained mathematician Grant Sanderson — spends 26 minutes walking through the attention mechanism with hand-built animations made in his Python library manim. The video has been viewed 4 million times since April 2024 and serves as the de facto visual companion to the Attention Is All You Need paper, which has crossed 173,000 academic citations and is the foundation of every model you actually use — GPT-4, Claude 4, Gemini 3, LLaMA 3, Mistral, DeepSeek. Sanderson uses a single running example, "a fluffy blue creature roamed the verdant forest," to show how query, key, and value matrices let adjectives update the meaning of nouns. The single most important takeaway: attention is just a learned weighted sum where the weights come from softmax-normalized dot products between learned projections of each token. That's it. Everything else — multi-head, masking, the output matrix — is engineering scaffolding around that core idea. If you've previously bounced off the original paper, this is the visualization that makes the math click.
If the model's going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word.

Conceptually, you want to think of the keys as potentially answering the queries. You think of the keys as matching the queries whenever they closely align with each other.

The overall idea is that by running many distinct heads in parallel, you're giving the model the capacity to learn many distinct ways that context changes meaning.

I've spent the last few years at WebSearchAPI.ai building the retrieval layer that feeds live web data into transformer-based LLMs. Most of what we ship has nothing to do with attention math directly — but every single byte we send to a model passes through self-attention on the other side, and every extra token costs us latency, GPU-seconds, and cache pressure. When I talk to junior engineers about why we obsess over context window utilization, the conversation always lands back on this video.
Sanderson's animations are the cleanest visual treatment of self-attention I've found. He's a Stanford math grad who built a custom animation library specifically to teach math intuitively, and his attention chapter has been viewed over 4 million times since April 2024. But it's a teaching video — it skips the GPU economics, the memory wall, and the Anthropic transformer circuits work that explains why the value-up-and-down factoring exists. I'll quote Sanderson for the conceptual scaffolding, then fill in what production AI engineers actually run into. If you've never watched the video, watch it first. Then come back here for the engineering layer on top.

Embeddings alone are a context-free lookup table — the same word gets the same vector regardless of what surrounds it. Attention exists to fix that, and it's the only mechanism in a transformer that lets tokens share information with each other.
The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning.

The "mole" example Sanderson opens with is the cleanest demonstration of the problem. After tokenization and embedding, "American shrew mole," "one mole of carbon dioxide," and "take a biopsy of the mole" all start with the same vector for the word "mole." That vector points to a generic spot in the 12,288-dimensional embedding space GPT-3 uses — somewhere between "small mammal," "chemistry unit," and "skin lesion." It cannot be all three at once. Without a layer that reads the surrounding context, the rest of the network is stuck.
This is the practical reason embeddings on their own — the kind of static embeddings produced by Word2Vec or GloVe in 2013 — can't power modern language understanding. They were good enough for analogy benchmarks ("king - man + woman = queen") but they can't disambiguate. In production retrieval at WebSearchAPI.ai, the same problem shows up when we embed user queries with a static encoder: a query about "Apple revenue" and a query about "apple varieties" land near each other in vector space until you push them through a contextual encoder. Attention is what makes contextual encoding possible.
Attention as a learned operation predates transformers by three years. Bahdanau, Cho, and Bengio's 2014 paper on neural machine translation introduced the first additive attention mechanism for an encoder-decoder RNN, where a small feed-forward network computed alignment scores between the decoder's hidden state and each encoder hidden state. The 2017 transformer paper kept the core idea — learned weighting between positions — and replaced the feed-forward scoring with the dot-product formulation that runs faster on GPUs.
| Year | Paper / Architecture | Key Idea | Score Function |
|---|---|---|---|
| 2014 | Bahdanau et al. (1409.0473) | Additive attention for RNN translation | Feed-forward MLP over [hᵢ, sⱼ] |
| 2015 | Luong et al. (1508.04025) | Multiplicative attention | Dot product, no scaling |
| 2017 | Vaswani et al. (1706.03762) | Self-attention in transformers | Scaled dot product Q·Kᵀ/√d_k |
| 2022 | Dao et al. — FlashAttention | Tiled attention computation | Same math, different memory access |
| 2024 | DeepSeek-V2 — Multi-Head Latent Attention | Compressed KV cache | Latent-space dot product |
The biological framing IBM uses in their explainer maps roughly to this lineage: humans pay selective attention to salient parts of the visual or auditory field while filtering noise, and machine attention does something analogous over token sequences. The metaphor is useful for intuition but the implementation is plain linear algebra — there's no biological circuit being simulated, just a learned weighted sum that happened to work on translation in 2014 and never stopped working.
The reason the 2017 paper exploded into 173,000+ citations isn't that attention itself was new. It's that Vaswani's team showed you could build an entire model out of attention, drop recurrence completely, and still match (then exceed) state-of-the-art on translation while training in a fraction of the wall-clock time. The architecture was a compute argument disguised as an accuracy paper. The follow-on architectures — encoder-only BERT (2018), decoder-only GPT (2018), and every frontier LLM since — all picked the same skeleton.
Attention takes a context-free embedding and pushes it in a specific direction in embedding space, where that direction encodes the influence of surrounding tokens. For "Eiffel tower" preceded by "miniature," the vector for "tower" gets nudged away from "tall," and toward "Paris," "souvenir," and "small."

Sanderson uses two examples to set this up: "mole" with three different meanings, and "tower" preceded by either "Eiffel" or "miniature Eiffel." Both make the same point — there are distinct directions in embedding space for distinct meanings of the same token, and attention is what calculates the offset that moves the generic embedding toward the right one.
The mental model that helped me most when debugging RAG pipelines is this: think of every token as a point in a 12,000-dimensional cloud. Attention is a learned function that reads the rest of the cloud and outputs a delta vector for each point. The delta is small for tokens whose meaning is already clear (like determiners) and large for tokens whose meaning depends heavily on context (like pronouns and ambiguous nouns). This is also why long-range coreference — where "it" refers to something 200 tokens earlier — works at all in transformers but failed catastrophically in RNNs.
The attention pattern is a square grid of dot products between every query and every key, scaled by the square root of the key dimension and softmax-normalized column by column. The result is a matrix where column j tells you how much each previous token contributes to the meaning of token j.


To measure how well each key matches each query, you compute a dot product between each possible key query pair. I like to visualize a grid full of a bunch of dots where the bigger dots correspond to the larger dot products, the places where the keys and queries align.

Here's the formula from the original paper, decoded one piece at a time:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
| Term | What It Is | Shape (GPT-3) |
|---|---|---|
| Q | Stack of all query vectors for the sequence | n × 128 |
| K | Stack of all key vectors for the sequence | n × 128 |
| V | Stack of all value vectors for the sequence | n × 12288 |
| QKᵀ | Raw attention scores (every query against every key) | n × n |
| √d_k | Square root of the key dimension, used for numerical stability | scalar = √128 ≈ 11.3 |
| softmax(...) | Column-wise normalization to a probability distribution | n × n |
| · V | Weighted sum of value vectors using the softmax weights as coefficients | n × 12288 |
The √d_k division is the part most explanations skip. Without it, when keys and queries get long, dot products get large, and softmax collapses to a one-hot distribution where one token gets all the attention weight and the rest get rounding errors. Dividing by √d_k keeps the softmax in a sane regime where it can actually express partial weight on multiple tokens. This isn't theoretically derived — it's an empirical stabilization trick from the 2017 paper that ended up surviving everything since.
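A toy demonstration of the saturation effect — the numbers here are made up for illustration, not taken from the video:

```python
import torch
import torch.nn.functional as F

# Random 128-dim queries and keys produce dot products with std ≈ √128 ≈ 11.3.
# Unscaled, that is enough to push softmax toward a one-hot distribution.
torch.manual_seed(0)
d_k = 128
q = torch.randn(d_k)
keys = torch.randn(8, d_k)

raw = keys @ q               # unscaled scores, magnitude on the order of 11
scaled = raw / d_k ** 0.5    # the paper's √d_k division

print(F.softmax(raw, dim=0).max())     # one key tends to dominate
print(F.softmax(scaled, dim=0).max())  # weight spread over several keys
```

Scaling down the scores is equivalent to raising the softmax temperature: the distribution flattens, and partial weight on multiple tokens becomes expressible again.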
The example sentence Sanderson uses, "a fluffy blue creature roamed the verdant forest," is doing a lot of work here. Adjectives ("fluffy," "blue," "verdant") query for nouns. Nouns advertise themselves as nouns through their keys. The dot products between matching pairs are large, and softmax turns those into the dominant weights for the column. After the weighted sum, the noun embedding has absorbed information from its modifiers — that's the whole behavior in one sentence.
The full math fits in eight lines of PyTorch. This is the canonical implementation that ships in roughly every educational notebook and matches what torch.nn.functional.scaled_dot_product_attention runs internally before kernel-level optimizations:
```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: [batch, heads, seq_len, d_k]
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights
```

A few things worth pointing out for production engineers reading this:

- The `torch.matmul(query, key.transpose(-2, -1))` line is the n × n attention matrix being materialized in HBM. For a 64K-token sequence with float16 weights, that's 64K × 64K × 2 bytes ≈ 8.6 GB per head per layer — which is exactly what FlashAttention sidesteps by tiling the computation in SRAM and never writing the full matrix to HBM.
- The `masked_fill` step is what enforces causal attention in GPT-style training. A common bug in homemade attention implementations is applying the mask after softmax instead of before. Apply it after and the surviving weights no longer sum to 1, because the masked entries already claimed mass in the softmax denominator.
- `F.softmax(scores, dim=-1)` operates along the last dimension because PyTorch attention matrices are usually shaped `[batch, heads, seq, seq]`, where the last `seq` axis indexes the keys. Running softmax along the wrong axis silently breaks the model — the gradients still flow, training just doesn't converge.
- For real production code, you almost never write this by hand. PyTorch 2.0+ ships `torch.nn.functional.scaled_dot_product_attention`, which dispatches to FlashAttention, memory-efficient attention, or a fallback math kernel depending on hardware. JAX has `jax.nn.dot_product_attention`. xFormers and FlashAttention are the two reference implementations everyone benchmarks against. The eight-line version above is for understanding, not deploying.
Masking enforces causal attention so that earlier tokens can't see later tokens during training. It works by setting the upper-triangular entries of the attention scores to negative infinity before softmax, which makes them zero in the normalized matrix while keeping each column sum equal to 1.

The simplest thing you might think to do is to set them equal to zero, but if you did that, the columns wouldn't add up to one anymore. They wouldn't be normalized. So instead, a common way to do this is that before applying softmax, you set all of those entries to be negative infinity.

The reason this matters during training is subtle. When GPT trains, it doesn't just predict the last token — it predicts every next token simultaneously, for every position in the sequence. That means a single 1024-token training example produces 1024 supervised prediction targets for the price of one forward pass. Masking is what makes this honest: each position must predict its own next token without peeking ahead. Remove the mask and every position can simply read its answer off the future tokens, so the loss collapses without the model learning anything.
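A minimal sketch of the trick, using the row convention of most PyTorch code (Sanderson's grid is the transpose, so his columns are rows here):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)   # raw query-key dot products

# Lower-triangular boolean mask: True where a query may attend (past + self).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Set future positions to -inf BEFORE softmax, exactly as the video describes.
masked = scores.masked_fill(~causal, float("-inf"))
weights = torch.softmax(masked, dim=-1)

# Every row still sums to 1, and every future position gets exactly 0 weight,
# because exp(-inf) = 0 simply drops out of the softmax denominator.
print(weights.sum(dim=-1))   # all ones
print(weights[0])            # [1, 0, 0, 0, 0] — token 0 can only see itself
```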
In production inference, the picture is slightly different. Once a model is deployed and generating text autoregressively, it generates one token at a time, so there are no future tokens to mask. But the mask is still applied for consistency with the training-time computation graph. KV-cache implementations exploit this by storing the keys and values for all previous tokens and only computing the new query against the cached keys — a 100x speedup on long-context generation that depends on the causal mask being in place.
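A stripped-down sketch of the KV-cache idea for a single head — the names and shapes here are illustrative, not any particular library's API:

```python
import torch
import torch.nn.functional as F

d_k = 16
k_cache, v_cache = [], []

def decode_step(q_new, k_new, v_new):
    """One autoregressive step: cache this token's key/value, then attend the
    new query over every cached key instead of recomputing the full n x n grid."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = torch.stack(k_cache)          # [t, d_k] — grows by one row per step
    V = torch.stack(v_cache)
    scores = K @ q_new / d_k ** 0.5   # [t] — one query row, not a whole matrix
    return F.softmax(scores, dim=0) @ V   # context vector for the new token

for _ in range(4):
    out = decode_step(torch.randn(d_k), torch.randn(d_k), torch.randn(d_k))

print(out.shape, len(k_cache))
```

The causal mask is what makes this legal: past keys and values never change once written, so caching them is lossless.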
For non-causal use cases like translation or speech recognition, you typically don't want a mask. That's where cross-attention comes in (Sanderson covers this around the 18-minute mark), and why encoder-decoder models like the original Transformer and T5 use bidirectional attention in the encoder.
The attention pattern is an n × n matrix where n is the context size, so memory grows quadratically. Doubling your context window from 32K to 64K tokens quadruples the attention memory, not just doubles it.


Another fact that's worth reflecting on about this attention pattern is how its size is equal to the square of the context size. So this is why context size can be a really huge bottleneck for large language models, and scaling it up is nontrivial.

Sanderson notes this almost in passing, but it's the single most consequential engineering fact in modern LLM infrastructure. Every "long context" announcement you've seen since 2023 — Claude's 100K window, Gemini's 1M window, GPT-4 Turbo's 128K — sits on top of architectural workarounds for the n² wall. The attention math itself hasn't changed; the way we run it has. As an arXiv survey from 2025 put it bluntly: "the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling."
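The arithmetic behind the n² wall is worth doing once by hand. A quick back-of-envelope in fp16, decimal gigabytes, per head per layer:

```python
def attn_matrix_gb(n_tokens, bytes_per_elem=2):
    # Size of the full n x n attention score matrix in float16, decimal GB.
    return n_tokens ** 2 * bytes_per_elem / 1e9

for n in (32_768, 65_536, 131_072):
    print(f"{n:>7} tokens -> {attn_matrix_gb(n):6.2f} GB")
# Doubling the context quadruples the matrix: ~2.15 GB -> ~8.59 GB -> ~34.36 GB
```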
The standard tricks, in roughly the order they appeared:
| Technique | Year | Idea | Used In |
|---|---|---|---|
| FlashAttention | 2022 | Tile attention in SRAM, never materialize full n × n in HBM | All frontier LLMs since 2023 |
| Sliding window attention | 2020 | Each token only attends to the last k tokens, not all n | Longformer, Mistral 7B |
| Multi-query attention (MQA) | 2019 | All heads share a single K and V projection | PaLM, early Falcon |
| Grouped-query attention (GQA) | 2023 | Heads share K/V in groups (compromise between MHA and MQA) | LLaMA 2/3, Mixtral |
| Multi-head latent attention (MLA) | 2024 | Compress KV cache into a low-rank latent space | DeepSeek-V2, V3 |
| Multi-head low-rank attention (MLRA) | 2026 | Partitionable latent states for 4-way TP decoding | ICLR 2026 |
| FlashAttention 4 | 2026 | Reaches 1,605 TFLOPs/s on NVIDIA Blackwell | Default for new training runs |
| State space models (SSMs) | 2023 | Replace attention with a linear recurrence that scales O(n) | Mamba, RWKV, hybrids |
The pace of these optimizations is itself a signal. Songtao Liu's 2026 ICLR paper on Multi-Head Low-Rank Attention reports a 2.8× decoding speedup over MLA while matching its perplexity — meaning the gains from squeezing the attention KV cache haven't plateaued yet. As Sebastian Raschka summarized in his visual attention variants overview, the field has moved "from MHA and GQA to MLA, sparse attention, and hybrid architectures" in roughly 18 months. Pick your favorite frontier model and odds are its attention block isn't the same as the 2017 paper anymore.
In our retrieval engine at WebSearchAPI.ai, the practical version of this fight is choosing how much context to send. For a typical RAG query, we could send 500 tokens of carefully ranked context or 50,000 tokens of raw search results. The 50K version costs roughly 100x more in attention compute and is empirically worse on answer quality — a phenomenon researchers call "lost in the middle," where models ignore information buried deep in long contexts. The quadratic isn't just a billing problem; it's a quality problem too.
Value vectors are the third learned projection of each token, and they hold the actual information that gets added to other tokens' embeddings. Once you've computed the attention pattern from queries and keys, you compute a weighted sum of value vectors using the attention weights as coefficients, and add that sum back to the original embedding.

This value vector lives in the same very high dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that something else?

Here's the part of attention that throws people off: the query and key vectors are only used to compute weights. They never directly influence what gets added to the output embedding. The thing that actually flows through the network — the content that gets injected from one token into another — is the value vector.
This separation of concerns is more elegant than it first looks. Queries and keys decide who talks to whom. Values decide what gets said. Anthropic's transformer circuits work showed that this factorization is closer to "memory addressing + memory contents" than anyone realized — the QK circuit determines which heads route information between which positions, and the OV (output-value) circuit determines what transformation gets applied when that routing happens. Sanderson hints at this when he calls the value map "a low-rank transformation," and it's the conceptual unlock that made me start treating attention like a learned content-addressable memory rather than an opaque blob of matmuls.
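A toy version of that separation — the shapes are tiny and arbitrary, but the structure is the one the video describes: queries and keys set the weights, values carry the payload, and the result is added back to the embedding:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
n, d = 4, 6                          # 4 tokens, toy 6-dim embeddings
E = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V
weights = F.softmax(Q @ K.T / d ** 0.5, dim=-1)   # who talks to whom

delta = weights @ V                   # what gets said
updated = E + delta                   # residual update to each embedding

# Zero out the values and the update vanishes, even though the attention
# pattern (computed from Q and K alone) is completely unchanged.
print(torch.allclose(weights @ torch.zeros_like(V), torch.zeros(n, d)))  # True
```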
A single GPT-3 attention head uses about 6.3 million parameters split across four matrices: query (W_Q), key (W_K), value-down (W_V↓), and value-up (W_V↑). The Q and K matrices each have ~1.5M parameters; the value map is factored into two smaller matrices to match.

The way the value map is factored is as a product of two smaller matrices. To throw in linear algebra jargon here, what we're basically doing is constraining the overall value map to be a low rank transformation.

The value-up / value-down factoring is one of those design choices that looks arbitrary until you do the parameter math:
| Matrix | Naive Shape | Naive Params | Factored Shape | Factored Params |
|---|---|---|---|---|
| W_Q | 12288 × 128 | 1,572,864 | same | same |
| W_K | 12288 × 128 | 1,572,864 | same | same |
| W_V (single matrix) | 12288 × 12288 | 150,994,944 | — | — |
| W_V↓ + W_V↑ (factored) | — | — | 12288 × 128 + 128 × 12288 | 3,145,728 |
The factored value map saves about 148 million parameters per head. With 96 heads × 96 layers in GPT-3, that saving multiplies into something on the order of 1.4 trillion parameters that could have been spent on a square value matrix but weren't. This is one of the reasons GPT-3 fits into 175B parameters instead of 1.5T — a single design choice in the attention block.
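The saving is easy to verify in a few lines, using the figures from the table above:

```python
d_model, d_k = 12_288, 128       # GPT-3 embedding dim and key/query dim

naive = d_model * d_model        # one square value matrix per head
factored = 2 * d_model * d_k     # value-down (12288x128) + value-up (128x12288)

saved_per_head = naive - factored
print(f"{saved_per_head:,}")              # 147,849,216 — the ~148M per head
print(f"{saved_per_head * 96 * 96:,}")    # ~1.36 trillion across 96 heads x 96 layers
```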
The other consequence of low-rank value maps is interpretability. When the value transformation is forced to be rank-128 in a 12288-dimensional space, what each head can do is constrained. Anthropic's circuits work exploits this directly: a low-rank OV circuit is much easier to analyze than a full-rank one, and most of the mechanistic interpretability results published since 2021 depend on it.
Cross-attention is the same operation as self-attention except the keys and values come from a different sequence than the queries. It's used wherever a model needs to relate two distinct streams of data — translation, image captioning, speech transcription, retrieval-augmented generation.

A cross attention head looks almost identical. The only difference is that the key and query maps act on different datasets. In a model doing translation, the keys might come from one language while the queries come from another.

Cross-attention is what lets the decoder of a translation model look up the original English sentence while it's generating French. It's also what powers Whisper's speech-to-text — the decoder generates text tokens whose queries attend to keys derived from audio features. And it's the conceptual basis for retrieval-augmented generation: at WebSearchAPI.ai, when we hand a model search results plus a user question, the model uses something close to cross-attention internally to figure out which retrieved passages are relevant to which parts of the question.
A small caveat Sanderson mentions: cross-attention typically has no causal mask, because there's no temporal ordering between the two sequences. In a translation model, the entire source sentence is available before generation begins, so masking would just throw away signal. This is why decoder-only models (GPT, Claude, LLaMA) only use causal self-attention while encoder-decoder models (T5, BART, Whisper) use bidirectional cross-attention from decoder to encoder.
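In code, cross-attention is a one-line change from self-attention: queries come from one sequence, keys and values from another, so the score matrix is rectangular. (Toy shapes; for simplicity this sketch reuses the encoder states directly as values rather than projecting them.)

```python
import math
import torch
import torch.nn.functional as F

d_k = 16
dec = torch.randn(3, d_k)    # e.g. 3 decoder (French) token states
enc = torch.randn(7, d_k)    # e.g. 7 encoder (English) token states

scores = dec @ enc.T / math.sqrt(d_k)   # [3, 7] — rectangular, not square
weights = F.softmax(scores, dim=-1)     # no causal mask: the source is fully visible
context = weights @ enc                 # [3, 16] — source info pulled into the decoder

print(scores.shape, context.shape)
```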
Multiple heads exist so the model can learn many different types of contextual relationships simultaneously — adjective-noun bindings, coreference, syntactic structure, semantic role, and patterns no human would label. Each head has its own Q, K, and V matrices, and the outputs of all heads are summed together at the end of the block.


GPT-3, for example, uses 96 attention heads inside each block. Considering that each one is already a bit confusing, it's certainly a lot to hold in your head.

The multi-head structure is the reason transformers learn so much from raw text. Different heads end up specializing — interpretability researchers at Anthropic and elsewhere have identified "induction heads" that copy patterns from earlier in the sequence, "previous token heads" that look at position n-1, and "name mover" heads that surface entity references. None of these specializations are programmed in. They emerge during training because the gradient finds them useful.
The parameter math for the full multi-head block in GPT-3 is simple multiplication: 96 heads × ~6.3 million parameters per head ≈ 604 million attention parameters per layer, and 96 layers of that gives the ~58 billion total that self-attention costs GPT-3.
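A quick sanity check on the head and layer arithmetic, using the per-head figures from the video:

```python
d_model, d_head, heads, layers = 12_288, 128, 96, 96   # GPT-3 attention config

per_head = 4 * d_model * d_head   # W_Q + W_K + W_V-down + W_V-up
per_layer = per_head * heads
total = per_layer * layers

print(f"{per_head:,}")    # 6,291,456 — the ~6.3M per head
print(f"{per_layer:,}")   # 603,979,776 — the ~600M per layer
print(f"{total:,}")       # 57,982,058,496 — the ~58B total attention spend
```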
The "MLP is bigger than attention" surprise is real. Most of GPT-3's parameters live in the feed-forward blocks between attention layers, not in attention itself. But attention is what lets information flow between positions, and the MLPs only get to do anything useful because attention has already mixed the right tokens together.
The output matrix is what most papers call the "value-up" projection across all heads, stapled into a single large matrix. It's a notational convenience that lives at the multi-head level rather than the per-head level, and it can confuse readers expecting a separate W_O and W_V↑.

All of these value up matrices for each head appear stapled together in one giant matrix that we call the output matrix associated with the entire multi headed attention block.

This is the part of the attention chapter where I usually have to slow down with a junior engineer. If you read the original paper, you'll see one V matrix per head and one big W_O at the multi-head level. If you read a teaching explanation like this video, you'll see Q, K, V↓, and V↑ per head with no separate output matrix. They describe the same computation. The difference is whether you bundle the V↑ projections of all heads into one matrix or keep them per-head.
The practical implication: when you read a research paper that talks about "the output projection" or "W_O," it's the thing Sanderson called the value-up matrix. When you read a paper about "the value matrix" or "V," it's the value-down projection only. Mismatched naming has tripped up more grad students than I can count, and Sanderson is unusually upfront about it. If you're implementing attention from scratch, pick one convention and document it.
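You can check that the two conventions are the same computation with toy tensors — per-head value-up matrices summed head by head, versus one stapled W_O applied to the concatenated head outputs (shapes here are small arbitrary values for illustration):

```python
import torch

torch.manual_seed(0)
heads, d_head, d_model, n = 4, 8, 32, 5

head_out = torch.randn(heads, n, d_head)    # per-head attention results
W_up = torch.randn(heads, d_head, d_model)  # one value-up matrix per head

# Convention A (the video): apply each head's value-up, sum the contributions.
summed = sum(head_out[h] @ W_up[h] for h in range(heads))

# Convention B (the paper): concatenate heads, multiply by the stapled W_O.
concat = head_out.permute(1, 0, 2).reshape(n, heads * d_head)
W_O = W_up.reshape(heads * d_head, d_model)
stapled = concat @ W_O

print(torch.allclose(summed, stapled, atol=1e-4))  # True — identical computation
```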
Attention is applied repeatedly through dozens of stacked transformer blocks, with each block adding more contextual richness on top of the embeddings produced by the previous one. By layer 96, the embeddings encode high-level features like sentiment, topic, and reasoning structure — not just individual word meanings.

The further down the network you go, with each embedding taking in more and more meaning from all the other embeddings, which themselves are getting more and more nuanced, the hope is that there's the capacity to encode higher level and more abstract ideas.

The layered structure is what makes transformers more than the sum of their parts. A single attention block can only mix information once. Stack 96 of them and you get 96 rounds of mixing, each operating on representations that already encode some context. Mechanistic interpretability papers have shown that early layers in GPT-style models tend to handle syntax and surface features, middle layers handle semantic and entity-level reasoning, and late layers handle output formatting and next-token prediction. Nobody designed this division of labor — it falls out of training.
In production retrieval, this layered behavior is why simple "ablate one layer" experiments rarely work. You can't just swap out the attention mechanism in layer 47 of a 96-layer model and expect the rest of the stack to keep functioning, because the representations in layer 48 were trained against whatever layer 47 produced. This tight coupling is one reason the industry hasn't fully migrated away from vanilla attention even though faster alternatives exist — every replacement has to be retrained from scratch, and the cost of training a frontier model in 2026 is somewhere between $100M and $1B depending on how you count.
The same scaled dot-product attention you read in the 2017 paper is what powers Vision Transformers, Whisper, AlphaFold 2, video models, and every multimodal LLM shipping today. The token sequence changes — image patches, audio frames, amino acid residues, video frames — but the operation is identical.
| Modality | "Tokens" Are | Reference Architecture |
|---|---|---|
| Text | Subword units (BPE) | GPT-4, Claude 4, Gemini 3 |
| Image | 16×16 pixel patches flattened | Vision Transformer (ViT), DINOv2 |
| Audio | 25ms mel-spectrogram windows | Whisper, AudioLM |
| Video | Spatiotemporal patches | Sora, VideoPoet |
| Protein | Amino acid residues | AlphaFold 2, ESM-3 |
| Multimodal | Mixed text + image + audio tokens | GPT-4o, Gemini 3, Claude 4 Vision |
Vision Transformers were the first big proof that the architecture wasn't text-specific. The 2020 ViT paper from Google showed that splitting images into 16×16 patches and feeding them through a standard transformer matched or beat ConvNets on ImageNet — with no convolutional bias built in. The model learned spatial structure entirely from data. Anyone in computer vision in 2019 would have told you that wasn't supposed to work. By 2022, ViT-style backbones were the default for most vision tasks.
The reason this generalizes is the same reason attention won over RNNs. Attention treats its input as a set with positional metadata, not a sequence with a hardcoded order. If you can tokenize a modality — turn it into a finite list of vectors — you can feed it through self-attention and let the model learn whatever positional relationships matter. This is also why multimodal models like GPT-4o and Claude 4 work without separate vision and language stacks: the same attention block reads text tokens and image patches in the same context window, computes Q·Kᵀ across all of them, and lets the cross-modal weights emerge during training. The 2017 paper called itself "Attention Is All You Need," and nearly a decade in, that's mostly held up.
Attention won because it's parallelizable on GPUs in a way RNNs never were. The whole attention pattern computes as one matrix multiplication across all token pairs simultaneously, while RNNs literally cannot start step t until step t-1 finishes.

A big part of the story for the success of the attention mechanism is not so much any specific kind of behavior that it enables, but the fact that it's extremely parallelizable.

This is the closing point Sanderson lands on, and it's the one I'd put in bold for any engineer learning transformers in 2026. Attention isn't deeper or smarter than what came before. It's flatter — the data dependencies are wider but shallower, which maps cleanly onto thousands of GPU cores running in parallel. The 2017 paper essentially asked: "what if we let the chip do all its multiplications at once instead of one at a time?" Everything else followed from that.
The downstream consequence is what Rich Sutton called the bitter lesson — the methods that scale with compute end up winning, regardless of whether they look elegant. Attention scales with compute. RNNs don't. Even if a sufficiently clever RNN variant could in principle solve the same problems, the architecture that lets you point a billion-dollar GPU cluster at the problem and amortize linearly in compute is the one that ships.
The hardware-software co-evolution since 2017 has been remarkable to watch from inside the industry. NVIDIA's H100 and Blackwell GPUs ship with hardware-level optimizations specifically for attention math — tensor cores, transformer engines, and memory hierarchy decisions that would have been overkill for any pre-2017 architecture. FlashAttention 4 hitting 1,605 TFLOPs/s on Blackwell isn't just a software win; it's the result of GPU vendors and model architects converging on the same operation as the central computational unit of AI. As one Forbes column put it in early 2026, "too many people have a sort of distorted view of how attention mechanisms work in analyzing text" — the gap between how attention is described in tutorials and how it actually runs on production hardware has only widened.
Sanderson recommends Karpathy and Chris Olah's transformer circuits work at the end of the video for more depth, and I'd second both. Karpathy's Build a GPT from scratch is the natural next watch. If you want the broader transformer block (MLPs, residuals, encoder-decoder) instead of just attention, our sister post on the full architecture walks through ByteByteGo's 10-minute explainer with the same engineer-annotation approach. And if you're wondering how attention patterns end up misaligned during training, our piece on AI reward hacking covers what happens when gradient descent finds the wrong attention shortcuts.
Attention is a layer in a transformer that lets every token in a sequence read information from every other token at once, weighted by how relevant each pair is. Sanderson describes it as a learned mechanism that takes context-free embeddings and pushes them toward more contextually rich meanings — the mechanism that lets "mole" mean "small mammal" in one sentence and "skin lesion" in another.
Query, key, and value are three learned projections of every input token. The query represents what the token is "looking for" in other tokens, the key represents what the token "has to offer," and the value carries the actual content that gets added to other embeddings. The attention pattern comes from dot products between queries and keys; the output is a weighted sum of values using those dot products as weights.
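Under those definitions, a single attention head fits in a few lines of NumPy. This is a didactic sketch with random, made-up projection matrices, not production code:

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one sequence x (n, d)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv            # learned projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) relevance pattern
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per query
    return weights @ V                          # weighted sum of values

rng = np.random.default_rng(1)
n, d, d_k = 5, 12, 4                            # toy sizes
x = rng.standard_normal((n, d))
out = attention(x,
                rng.standard_normal((d, d_k)),
                rng.standard_normal((d, d_k)),
                rng.standard_normal((d, d_k)))
print(out.shape)                                # (5, 4)
```

Everything Sanderson animates is in those ten lines: the dot products, the softmax normalization, and the weighted sum of values.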
The √d_k division keeps the dot products in a numerically stable range before applying softmax. Without it, when keys and queries have many dimensions (128 in GPT-3), the dot products can grow large and softmax collapses to a one-hot distribution, making gradients vanish. This is an empirical stabilization trick from the original 2017 paper that has survived essentially unchanged.
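A quick way to see why: dot products of d_k-dimensional random vectors have standard deviation around √d_k, and softmax over scores that large is nearly one-hot. A rough NumPy demonstration (the 128 matches GPT-3's key/query dimension from the video; everything else is arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 128                                 # GPT-3's key/query dimension
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=-1)                # unscaled dot products
print(raw.std())                          # ~ sqrt(128) ~ 11.3
print((raw / np.sqrt(d_k)).std())         # ~ 1.0 after scaling

# Unscaled scores push softmax toward one-hot; scaling flattens it.
row = raw[:8]
print(softmax(row).max() > softmax(row / np.sqrt(d_k)).max())  # True
```

Dividing by √d_k puts the scores back at unit variance regardless of dimension, which is why the trick generalizes across model sizes without retuning.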
Self-attention computes queries, keys, and values from the same input sequence. Cross-attention pulls queries from one sequence and keys plus values from a different sequence — for example, a French decoder querying an English encoder during translation. Sanderson notes that GPT-style models use only self-attention with masking, while encoder-decoder models like T5 and Whisper use both self-attention and cross-attention.
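Structurally, the only change from self-attention is where the projections read from. A minimal sketch (all shapes invented): queries come from one sequence, keys and values from another, so the output has one row per decoder token.

```python
import numpy as np

def cross_attention(x_dec, x_enc, Wq, Wk, Wv):
    """Queries from the decoder sequence; keys and values from the encoder."""
    Q = x_dec @ Wq                        # (n_dec, d_k)
    K, V = x_enc @ Wk, x_enc @ Wv         # (n_enc, d_k)
    s = Q @ K.T / np.sqrt(K.shape[-1])    # (n_dec, n_enc) pattern
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                          # (n_dec, d_k)

rng = np.random.default_rng(3)
d, d_k = 10, 4
out = cross_attention(rng.standard_normal((3, d)),   # 3 decoder tokens
                      rng.standard_normal((7, d)),   # 7 encoder tokens
                      rng.standard_normal((d, d_k)),
                      rng.standard_normal((d, d_k)),
                      rng.standard_normal((d, d_k)))
print(out.shape)                          # (3, 4)
```

Note the attention pattern is no longer square: it's decoder length by encoder length.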
GPT-3 spends roughly 58 billion of its 175 billion total parameters on attention — about a third of the model. Each attention block contains 96 heads with about 6.3 million parameters each, and the model has 96 layers, so the math compounds quickly. The remaining ~117 billion parameters live in the MLP feed-forward blocks between attention layers.
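The arithmetic behind those figures, using the numbers Sanderson quotes for GPT-3 (12,288-dimensional embeddings, 128-dimensional key/query spaces, 96 heads, 96 layers):

```python
d_embed, d_head = 12_288, 128        # embedding and key/query dimensions
heads, layers = 96, 96

per_head = (d_embed * d_head         # W_Q
            + d_embed * d_head       # W_K
            + d_embed * d_head       # value-down projection
            + d_head * d_embed)      # value-up (this head's slice of W_O)
print(per_head)                      # 6,291,456 -- the ~6.3M per head

attention_total = per_head * heads * layers
print(f"{attention_total / 1e9:.1f}B")   # 58.0B
```

Four 12,288 × 128 matrices per head, times 96 heads, times 96 layers, lands exactly on the ~58 billion the video cites.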
The attention pattern is a square matrix whose dimensions equal the context length, so memory grows quadratically with context size. Doubling context from 32K to 64K tokens quadruples the attention memory cost, not just doubles it. This O(n²) wall is why every long-context technique — FlashAttention, sliding window attention, Mamba, multi-head latent attention — exists.
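Back-of-envelope version, assuming one fp16 attention matrix per head is naively materialized (real kernels like FlashAttention avoid exactly this):

```python
bytes_per_entry = 2                   # fp16
for n_ctx in (32_768, 65_536):
    gib = n_ctx * n_ctx * bytes_per_entry / 2**30
    print(f"{n_ctx} tokens -> {gib:.0f} GiB per naive attention matrix")
```

Doubling the context takes one head's matrix from 2 GiB to 8 GiB; multiply by heads and layers and the naive approach is dead on arrival for long contexts.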
Masking sets the forbidden entries of the attention scores (the upper triangle in the usual row-per-query layout) to negative infinity before softmax, which makes them exactly zero in the normalized matrix while each softmax distribution still sums to 1. This prevents earlier tokens from seeing later tokens during training, which is essential for autoregressive next-token prediction in models like GPT. Without masking, the model would trivially "cheat" by reading the answer.
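A minimal NumPy illustration, using the row-per-query layout (Sanderson's animations transpose this, putting queries in columns):

```python
import numpy as np

n = 4
scores = np.ones((n, n))              # pretend raw query-key scores
# Entries above the diagonal would let token i read from a later token j.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax per row: exp(-inf) = 0, so masked entries vanish exactly.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```

Each row still sums to 1, but token i's weights are spread only over tokens 0 through i.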
No. Attention is one component inside a transformer. A full transformer block also contains a feed-forward MLP, residual connections, layer normalization, and (depending on the variant) positional encoding. The 2017 paper "Attention Is All You Need" was named for the fact that its architecture relied on attention instead of recurrence — but attention alone isn't a transformer.
Additive attention (Bahdanau et al., 2014) computes alignment scores using a small feed-forward network applied to concatenated query and key vectors. Scaled dot-product attention (Vaswani et al., 2017) computes the score as Q·Kᵀ divided by √d_k. Both produce a learned weighting between positions, but scaled dot-product runs as a single matrix multiplication on a GPU while additive attention requires a per-pair MLP forward pass — which is the entire reason transformers replaced RNN-style attention.
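The contrast in code, with made-up toy weights: the dot-product version is one matmul, while the additive version needs a small network evaluated for every query/key pair.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Scaled dot-product: one matmul produces all n*n scores.
dot_scores = Q @ K.T / np.sqrt(d)

# Additive (Bahdanau-style): v . tanh(W1 q + W2 k) per pair.
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
add_scores = np.array([[v @ np.tanh(W1 @ q + W2 @ k) for k in K]
                       for q in Q])

print(dot_scores.shape, add_scores.shape)   # (6, 6) (6, 6)
```

Both give an n × n score matrix, but only the first maps onto a single fused GPU kernel.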
FlashAttention is a tiled implementation of scaled dot-product attention that never materializes the full n × n attention matrix in GPU high-bandwidth memory. By computing attention in SRAM-sized chunks, it cuts memory usage from O(n²) to O(n) and runs significantly faster on long sequences. FlashAttention 4 reaches 1,605 TFLOPs/s on NVIDIA Blackwell GPUs as of March 2026 — an order of magnitude over the 2022 baseline, and the reason 100K+ context windows are economically viable.
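The core trick is reproducible in a few lines: process keys and values in blocks, carrying a running max and running softmax denominator so the full n × n matrix never exists. This is a didactic NumPy sketch of the online-softmax idea, nothing like the real fused CUDA kernel:

```python
import numpy as np

def flash_like_attention(Q, K, V, block=4):
    """Tiled attention: only (n, block) score tiles are ever materialized."""
    n, d_k = Q.shape
    out = np.zeros_like(V)
    m = np.full(n, -np.inf)               # running row max
    l = np.zeros(n)                       # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)       # one small tile of scores
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)         # rescale previous partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Check against the naive version that builds the full matrix.
rng = np.random.default_rng(4)
n, d_k = 8, 5
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
ref_s = Q @ K.T / np.sqrt(d_k)
ref_w = np.exp(ref_s - ref_s.max(axis=-1, keepdims=True))
ref = (ref_w / ref_w.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(flash_like_attention(Q, K, V), ref))   # True
```

The production kernel adds tiling over queries, recomputation in the backward pass, and careful SRAM management, but the math is exactly this.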
PyTorch's torch.nn.functional.scaled_dot_product_attention dispatches to FlashAttention or xFormers-style memory-efficient kernels depending on hardware.

This post is based on Attention in transformers, step-by-step | Deep Learning Chapter 6 by 3Blue1Brown. The video is part of Grant Sanderson's Neural Networks series and was published April 7, 2024.