Introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al., 2017) by Google researchers, the Transformer architecture revolutionized natural language processing. Each transformer block combines the following components:
- Self-attention: Allows each token to "look at" all other tokens and gather relevant information from them based on learned relevance scores.
- Feed-forward network: Applies the same transformation to each token independently, typically two linear transformations with a non-linear activation in between.
- Residual connections: Allow deep networks to train stably by providing shortcuts for gradient flow during backpropagation.
- Layer normalization: Normalizes activations within each layer to stabilize training and speed up convergence.
- Positional encoding: Adds information about token positions, since self-attention itself is position-agnostic.
Figure: A single transformer block with residual connections (dashed lines).
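To make these pieces concrete, here is a minimal sketch of one such block in PyTorch. It uses the pre-norm arrangement common in modern LLMs (the 2017 paper applied layer normalization after each sub-layer and used ReLU rather than GELU), and the names and hyperparameters (d_model, n_heads, d_ff) are illustrative rather than taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: self-attention + feed-forward, each with a residual connection."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # layer normalization before attention (pre-norm)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)   # layer normalization before the feed-forward network
        self.ffn = nn.Sequential(            # position-wise feed-forward: two linear layers
            nn.Linear(d_model, d_ff),
            nn.GELU(),                       # non-linear activation (the 2017 paper used ReLU)
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer: each token attends to every other token,
        # and a residual (shortcut) connection adds the result back to the input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # Feed-forward sub-layer, applied to each token independently, again with a residual.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x
```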
Modern LLMs stack these blocks to create deep networks:
- Depth (number of layers): number of transformer blocks stacked together.
- Model dimension (d_model): dimension of embeddings and hidden layers.
- Attention heads: number of parallel attention mechanisms per layer.
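As a rough sketch of how these hyperparameters fit together, the following code (reusing the TransformerBlock sketch above) stacks n_layers blocks on top of token and learned positional embeddings. The default values and the 4 * d_model feed-forward width are illustrative placeholders, and the causal attention mask a real decoder-style LLM would need is omitted for brevity.

```python
import torch
import torch.nn as nn

class MiniTransformer(nn.Module):
    """Illustrative stack of transformer blocks producing next-token logits."""

    def __init__(self, vocab_size: int, n_layers: int = 12,
                 d_model: int = 768, n_heads: int = 12, max_len: int = 1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff=4 * d_model)
            for _ in range(n_layers)                        # depth: n_layers stacked blocks
        )
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Add position information, since self-attention alone is position-agnostic.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)                                    # pass through each stacked block
        return self.lm_head(self.norm(x))                   # logits over the vocabulary
```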