Self-attention is the key innovation of the Transformer architecture that allows each token to directly interact with all other tokens in the sequence.
For each position in a sequence, different architectures give access to the rest of the context in different ways:
Method | Access to distant tokens |
---|---|
Recurrent (RNN) | Information must pass through all intermediate states, leading to vanishing gradients |
Convolutional (CNN) | Limited receptive field; needs many layers to capture long-range patterns |
Self-Attention | Direct connections between any two tokens, regardless of distance |
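To make the contrast in the last row concrete, here is a toy illustration (NumPy, with made-up uniform weights; the variable names are mine, not from the text) of path length: in a recurrent model, information between two tokens must traverse every state in between, while an attention weight matrix links every pair of positions in a single step.

```python
import numpy as np

seq_len = 11   # e.g. an 11-token sentence

# Recurrent model: information from the first token reaches the last one
# only after passing through every intermediate hidden state.
rnn_path_length = seq_len - 1          # grows linearly with distance

# Self-attention: the (seq_len x seq_len) weight matrix holds one entry per
# ordered pair of tokens, so any token reads from any other in one step.
attention_weights = np.full((seq_len, seq_len), 1.0 / seq_len)  # toy uniform weights
attention_path_length = 1              # constant, regardless of distance

print(rnn_path_length, attention_path_length)   # 10 vs. 1
print(attention_weights[-1, 0])                 # direct weight from last token to first
```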
Consider the phrase: "The animal didn't cross the street because it was too tired."
Attention from "it" | The | animal | didn't | cross | the | street | because | it | was | too | tired |
---|---|---|---|---|---|---|---|---|---|---|---|
Weight | 0.01 | 0.85 | 0.05 | 0.01 | 0.01 | 0.01 | 0.02 | - | 0.01 | 0.01 | 0.02 |
When processing "it", the model heavily attends to "animal", correctly resolving the reference.
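As a toy illustration of what the model does with these weights (random value vectors, not real model activations; the names below are illustrative), the updated representation of "it" is a weighted sum of the other tokens' value vectors, so roughly 85% of it comes from the vector for "animal":

```python
import numpy as np

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
# Attention weights for the query token "it" (the row from the table above;
# the "-" entry for "it" itself is treated as 0 here).
weights = np.array([0.01, 0.85, 0.05, 0.01, 0.01, 0.01,
                    0.02, 0.00, 0.01, 0.01, 0.02])

rng = np.random.default_rng(0)
values = rng.normal(size=(len(tokens), 4))   # toy value vectors, d_v = 4

# New representation of "it": a weighted sum of all value vectors,
# 85% of which comes from the value vector for "animal".
output_for_it = weights @ values
print(output_for_it)
```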
Different attention heads learn to focus on different linguistic patterns: some heads track syntactic structure, while others capture semantic relationships.
For each token, compute:

1. Queries, keys, and values: linear projections of the token embedding
2. Attention scores: how much each token should attend to the others, from the dot product of its query with every key
3. Attention weights: a softmax applied to the scores
4. Output: the weighted sum of the value vectors, using those weights
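Put together, these steps are scaled dot-product self-attention. Below is a minimal single-head NumPy sketch of them; the function name `self_attention` and the dimensions are illustrative choices, not taken from the text.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of token embeddings.

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    # 1. Linear projections: queries, keys, values
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # 2. Attention scores: how much each token should attend to the others,
    #    scaled by sqrt(d_k) to keep the softmax well-behaved
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # 3. Softmax over each row turns scores into weights that sum to 1
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # 4. Output: weighted sum of value vectors for every position
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 11, 32, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # (11, 16)
```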
Instead of a single attention mechanism, the model uses multiple "heads" in parallel:
Each attention head can focus on a different aspect of the input.
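A minimal sketch of this idea (NumPy; the function names, head count, and dimensions are illustrative assumptions): each head applies its own projections and attention, and the per-head outputs are concatenated. In a full Transformer layer, a final linear projection mixes the concatenated heads back to the model dimension.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads):
    """Run several attention heads in parallel and concatenate their outputs.

    x:     (seq_len, d_model) token embeddings
    heads: list of (w_q, w_k, w_v) projection triples, one per head
    """
    outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        outputs.append(weights @ v)               # each head: (seq_len, d_head)
    # Concatenate the per-head outputs; the final output projection
    # of a real Transformer layer is omitted here.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 11, 32, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

print(multi_head_attention(x, heads).shape)       # (11, 32)
```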