Complete Transformer Anatomy — Encoder, Decoder, and LLM Architecture
“Attention Is All You Need”: A Paper That Changed Everything
In 2017, eight researchers from Google Brain and Google Research presented a paper at NeurIPS (then called NIPS): “Attention Is All You Need”1. The title said it all. The paper proposed an architecture that processed sequences using only the Attention mechanism, without any recurrent or convolutional structures. That architecture was the Transformer.
It didn’t just surpass the existing state-of-the-art in machine translation benchmarks at the time. BERT, GPT, T5, LLaMA—virtually every modern AI language model that followed was built on this architecture. It’s no exaggeration to say that single 2017 paper completely flipped the landscape of natural language processing.
The World Before Transformer: Limitations of RNN and LSTM
Before Transformer emerged, the standard architectures for handling sequence data were RNN (Recurrent Neural Network) and its variant LSTM (Long Short-Term Memory)2. They read words one by one in order, accumulating information from previous words in a hidden state. To use an analogy, it was like having one person read a long novel out loud character by character from the beginning, summarizing what they’d read in their head as they went.
There were two major problems.
First, sequential processing. Since words had to be processed one by one in order, parallelization was impossible. It was like a factory where 100 people could work simultaneously, but the conveyor belt had only one line, so only one person could work at a time. RNNs couldn’t properly utilize GPUs’ parallel computation capabilities, making training slow.
Second, the long-range dependency problem. The longer the sentence, the harder it was for information from earlier words to reach the end. It was similar to the game of telephone: by the time the first person’s message reached the tenth person, the original content was significantly distorted. LSTM alleviated this problem but didn’t fundamentally solve it. Once you exceeded a few hundred tokens, early information would fade.
Transformer solved both limitations simultaneously. Instead of reading sequentially, it looked at the entire sequence at once and directly calculated relationships between each position. Instead of reading a novel character by character, it was like spreading out the entire page and grasping which words were related to which all at once. The core of this was Self-Attention.
Self-Attention: The Core Mechanism
The idea of Self-Attention was simple. Every position in the sequence references all other positions to decide “where to pay attention” for itself.
For example, take the sentence “I have a stomachache so I went to the hospital.” In the original Korean, the word 배 (bae) can mean stomach, pear, or boat, and only the following words “ache” and “hospital” resolve it to “stomach.” Self-Attention was a mechanism where the word 배 looked at all other words in the sentence and assigned high weights to “ache” and “hospital” to understand its own context. Looking at the word alone leaves three possible meanings, but looking at its relationships with the surrounding words narrows it down to one. This mathematically implemented the process humans unconsciously perform when reading sentences: connecting related words together.
The specific calculation process was as follows.
Self-Attention Calculation Steps
Input: X (each token's embedding in the sequence)
│
├─→ Q = X × W_Q (Query: "What am I looking for?")
├─→ K = X × W_K (Key: "What information do I have?")
└─→ V = X × W_V (Value: "The actual information I will convey")
│
│ Using a library analogy for Q, K, V:
│ Q = Topic you want to find (search term)
│ K = Each book's title/index (search target)
│ V = Actual content of the book (information to retrieve)
│
▼
Score = Q × K^T (Dot product of Query and Key → similarity score)
│
▼
Score / √d_k (Scaling → prevents values from becoming too large)
│
▼
Softmax(Score) (Convert to probability distribution → Attention Weight)
│
▼
Output = Attention Weight × V (Final output through weighted sum)
Expressed as a formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Here, d_k was the dimension of the Key vector. The reason for dividing by √d_k was intuitive—as vector dimensions get larger, dot product values increase, and if values get too large, Softmax creates extreme distributions close to either 0 or 1, making training unstable.
The time complexity of this calculation was O(n²·d)3. Where n is sequence length and d is embedding dimension. Since it calculates similarity for all token pairs, it has quadratic complexity with respect to sequence length. A 1,000-token sentence requires 1,000 × 1,000 = 1 million comparisons, and 10,000 tokens need 100 million comparisons. This was the bottleneck for Transformer when processing long sequences.
| Step | Operation | Meaning |
|---|---|---|
| 1. Linear transformation | Q, K, V = XW_Q, XW_K, XW_V | Separate input into three roles |
| 2. Similarity calculation | Score = QK^T | Measure relevance of each token pair |
| 3. Scaling | Score / √d_k | Stabilize Softmax |
| 4. Normalization | Softmax(Score) | Convert to probability distribution |
| 5. Weighted sum | Output = Weight × V | Generate representation focused on important information |
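The five steps above map almost line by line onto code. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention; the dimensions (four tokens, embedding size 8, d_k = 4) and the random weight matrices are purely illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (sketch)."""
    Q = X @ W_Q                      # 1. linear projections: what am I looking for?
    K = X @ W_K                      #    what information do I have?
    V = X @ W_V                      #    the information actually conveyed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # 2-3. similarity of every token pair, scaled by sqrt(d_k)
    weights = softmax(scores)        # 4. attention weights, one row per query token
    return weights @ V               # 5. weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 tokens, embedding dim 8 (illustrative)
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)                     # (4, 4): one contextualized vector per token
```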
Multi-Head Attention and Positional Encoding
Self-Attention alone wasn’t enough. A single attention only captured relationships from one perspective. Transformer extended this by introducing Multi-Head Attention.
Multi-Head Attention performed the same Self-Attention in parallel across multiple “heads” and then combined the results. The original paper used 8 heads. It was like having 8 analysts read the same sentence simultaneously, each from a different perspective. One analyst might focus on “who did what to whom” (grammatical relationships), while another focuses on “what emotion this word conveys” (semantic relationships). Combining these 8 analytical results enabled much richer understanding than a single perspective.
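The “split into heads, attend in parallel, concatenate” idea can be sketched directly. The 8-head count below matches the original paper, but the dimensions and all weight matrices are random stand-ins; the final output projection W_O back to the model dimension follows the paper’s layout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=8, d_model=64, seed=0):
    """Sketch: run n_heads independent attentions and concatenate the results."""
    rng = np.random.default_rng(seed)
    d_k = d_model // n_heads                      # 64 / 8 = 8 dims per head
    heads = []
    for _ in range(n_heads):                      # each head has its own W_Q, W_K, W_V
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                 # each "analyst" reads the sentence its own way
    W_O = rng.normal(size=(d_model, d_model))     # output projection back to d_model
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.default_rng(1).normal(size=(5, 64))  # 5 tokens, d_model = 64 (illustrative)
print(multi_head_attention(X).shape)               # (5, 64)
```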
Meanwhile, since Transformer eliminated the recurrent structure, it needed to inject positional information separately. RNN had order built in naturally by reading words sequentially (“first word,” “second word”), but Transformer looked at all words simultaneously and couldn’t distinguish between “I ate food” and “Food I ate.” The solution was Positional Encoding. The original paper used fixed positional encoding with sine and cosine functions. Later, BERT and the early GPT series adopted learnable positional embeddings, while LLaMA adopted RoPE (Rotary Position Embedding)4, among various other variants.
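The fixed sinusoidal encoding from the original paper is short enough to write out: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A sketch with illustrative dimensions:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos positional encoding from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) dimension pairs
    angle = pos / (10000 ** (2 * i / d_model))    # a different wavelength per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions: cosine
    return pe

# Added to token embeddings so "I ate food" and "Food I ate" become distinguishable inputs.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```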
Encoder: Structure for Understanding Input
Transformer’s Encoder was a structure that read the entire input sequence and generated contextual representations for each token. The original paper stacked 6 identical layers, with each layer consisting of two sub-components.
- Multi-Head Self-Attention — All tokens reference all other tokens (bidirectional)
- Feed-Forward Network (FFN) — 2-layer neural network applied independently to each position
Each sub-component included residual connections and layer normalization. Residual connections were a simple technique of adding each layer’s input to its output. Without this, stacking layers deeply would cause the training signal to vanish (gradient vanishing problem). By providing a “shortcut” for original information, it enabled stable training of dozens of layers.
The key was bidirectional attention. In Encoder’s Self-Attention, all tokens could freely reference all positions in the sequence. In the sentence “The cat sat on the mat,” “sat” could simultaneously see both the preceding “cat” and the following “mat.” This bidirectional contextual understanding was the Encoder’s strength.
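Putting the pieces together, one encoder layer is roughly “attention, add and normalize, FFN, add and normalize.” The sketch below follows the post-norm layout of the original paper; the identity-function stand-in for attention and the random FFN weights are illustrative assumptions only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, ffn):
    """Post-norm encoder layer (sketch): residual "shortcuts" around both sub-layers."""
    x = layer_norm(x + self_attention(x))  # sub-layer 1: Multi-Head Self-Attention + residual
    x = layer_norm(x + ffn(x))             # sub-layer 2: position-wise FFN + residual
    return x

d_model = 8
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model))
ffn = lambda x: np.maximum(0, x @ W1) @ W2   # ReLU(xW1)W2, applied independently per position
attn = lambda x: x                           # placeholder standing in for Multi-Head Attention
print(encoder_layer(rng.normal(size=(4, d_model)), attn, ffn).shape)  # (4, 8)
```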
Decoder: Structure for Generating Output
The Decoder was a structure that generated output sequences autoregressively, one token at a time. “Autoregressive” means using the output you just generated as the next input: deciding each next word by looking at the words written so far, as in “Today” → “the weather” → “is nice.” Like the Encoder, it stacked 6 layers, but each layer had three sub-components.
- Masked Multi-Head Self-Attention — Masking to prevent seeing future tokens
- Cross-Attention — References Encoder output (only in Encoder-Decoder structure)
- Feed-Forward Network
Masking was the key difference in the Decoder. When generating text, referencing future tokens that haven’t been generated yet would be like “copying the answer,” so attention to tokens after the current position was blocked by masking them to -∞. This was called causal masking. It was like putting up partitions during an exam so you can’t see the answers to the next questions. For the model to properly learn the “ability to predict the next word,” future information had to be blocked.
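The mask itself is just an upper-triangular matrix of -∞ added to the score matrix before Softmax, which drives the attention weights on future positions to zero. A minimal sketch with an illustrative 4-token sequence and random scores:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    """Upper-triangular -inf mask: position i may only attend to positions <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)  # -inf strictly above the diagonal

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))  # illustrative Q·K^T scores
weights = softmax(scores + causal_mask(n))
print(np.round(weights, 2))  # rows are queries; every entry above the diagonal is 0
```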
| Category | Encoder | Decoder |
|---|---|---|
| Attention direction | Bidirectional (full reference) | Unidirectional (past only) |
| Masking | None | Causal masking applied |
| Main purpose | Input understanding & representation | Output generation |
| Cross-Attention | None | References Encoder output (original structure) |
| Representative uses | Classification, similarity measurement | Translation, text generation |
Three Architecture Variants
After the original Transformer, researchers created three variants by selectively using Encoder and Decoder.
| Category | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Representative models | BERT, RoBERTa, ELECTRA | GPT series, LLaMA, Claude | T5, BART, mT5 |
| Attention direction | Bidirectional | Unidirectional (causal) | Encoder bidirectional + Decoder unidirectional |
| Main uses | Text classification, NER, similarity | Text generation, dialogue, code | Translation, summarization, Q&A |
| Advantages | Deep contextual understanding | General-purpose generation, easy scaling | Clear input-output structure |
| Disadvantages | Limited generation capability | Relatively weaker input understanding | Complex structure, 2x parameters |
| Training method | MLM (Masked Language Model) | Next Token Prediction (NTP) | Span corruption, etc. |
Encoder-only models were represented by BERT5, which Google released in 2018. They trained by masking parts of the input text and predicting the masked tokens (MLM). It was like solving large-scale fill-in-the-blank problems: given “Today [MASK] is good,” guess what word goes in the blank. Because they could fully use bidirectional context, they showed excellent performance on understanding tasks like text classification and named entity recognition.
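As a rough illustration of the fill-in-the-blank objective, the sketch below randomly hides about 15% of tokens behind a [MASK] token and records which positions the loss is computed on. BERT’s actual recipe (WordPiece tokens, an 80/10/10 split between [MASK], random, and unchanged replacements) is more involved, so treat this as a simplified assumption.

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Simplified MLM corruption: hide ~15% of tokens, remember the originals."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets[i] = tok          # the model is trained to recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "today the weather is good so we went for a walk".split()
corrupted, targets = mask_for_mlm(tokens)
print(corrupted)   # a few tokens replaced by [MASK]
print(targets)     # {position: original token} pairs the loss is computed on
```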
Decoder-only models were represented by OpenAI’s GPT series6. They trained with the simple objective of predicting the next token from given tokens. Reading “Today weather is” and predicting the next word. If BERT was fill-in-the-blank, GPT was more like sentence continuation. This simplicity became a strength—we’ll dive deeper into this later.
Encoder-Decoder models were represented by Google’s T57. It unified all NLP tasks into the form of “receive text input, output text.” This was natural for tasks where both input and output are sequences, like translation, summarization, and Q&A.
Model Lineage
| Model | Structure | Parameters | Release Year | Features |
|---|---|---|---|---|
| Transformer | Encoder-Decoder | 65M | 2017 | Original paper, machine translation |
| GPT-1 | Decoder-only | 117M | 2018 | Beginning of generative pre-training |
| BERT | Encoder-only | 340M (Large) | 2018 | Bidirectional pre-training, MLM |
| GPT-2 | Decoder-only | 1.5B | 2019 | “Too dangerous to release” |
| T5 | Encoder-Decoder | 11B (XXL) | 2019 | All tasks unified as text |
| GPT-3 | Decoder-only | 175B | 2020 | Few-shot learning, scaling |
| LLaMA 1 | Decoder-only | 7B–65B | 2023 | Beginning of open-source LLM |
| GPT-4 | Decoder-only (MoE, estimated) | ~1.76T (estimated) | 2023 | Multimodal, estimated MoE structure |
| LLaMA 2 | Decoder-only | 7B–70B | 2023 | RLHF applied, commercial license |
| Mixtral 8x7B | Decoder-only (MoE) | 46.7B (12.9B active) | 2023 | Open-source MoE |
| LLaMA 3 | Decoder-only | 8B–70B | 2024 | 15T token training |
| LLaMA 3.1 | Decoder-only | 8B–405B | 2024 | 128K context, first open-source frontier |
Why Decoder-only Won
As of 2024–2025, major LLMs almost all adopted Decoder-only architecture. GPT-4, Claude, LLaMA, Gemini, Mistral—all Decoder-only. The reasons were manifold.
First, simplicity of the training objective. Decoder-only models had only one objective: “Next Token Prediction.” This simple objective was surprisingly effective at learning all aspects of language: grammar, meaning, reasoning, world knowledge. (A small code sketch of this objective follows the fourth reason below.)
Second, scaling efficiency. In Encoder-Decoder structure, the same parameter count had to be split between Encoder and Decoder. With the same total parameters, Decoder-only could concentrate more capacity on generation ability.
Third, versatility. Whether understanding or generation tasks, with proper prompt design, a single Decoder-only model could handle both. BERT specialized in text classification, but GPT series handled classification, translation, summarization, and code generation all as “generating the next text.”
Fourth, the discovery of In-context Learning. This capability discovered in GPT-38 was the decisive strength of Decoder-only architecture. It could perform new tasks just by putting examples in the prompt, without separate fine-tuning. This naturally connected with the “next token prediction” training objective.
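Returning to the first reason, “next token prediction” is just cross-entropy between the model’s predicted distribution at position t and the actual token at position t+1. A minimal sketch of that shifted-target setup, with a made-up vocabulary size and random logits standing in for a real model’s output:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from positions 0..t (sketch)."""
    # logits: (seq_len, vocab_size) model outputs; token_ids: (seq_len,) input ids
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    preds = log_probs[:-1]           # predictions made at positions 0 .. n-2
    targets = token_ids[1:]          # each position is scored against the NEXT token
    return -preds[np.arange(len(targets)), targets].mean()

vocab_size, seq_len = 100, 6
rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab_size, size=seq_len)   # e.g. "Today the weather is ..."
logits = rng.normal(size=(seq_len, vocab_size))         # random stand-in for model output
print(next_token_loss(logits, token_ids))               # high for a random model; training drives it down
```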
Scaling Laws and the Birth of LLM
Having the Transformer architecture didn’t automatically create powerful AI. The crucial discovery was Scaling Laws.
In 2020, OpenAI’s Kaplan et al. showed that language model performance improved predictably, following power laws in model size, data amount, and computation9. A power law means that each time you multiply the input by a fixed factor (say 10x), the loss drops by a roughly constant proportion, which appears as a straight line on a log-log plot. The key was “predictable”: you could estimate beforehand what performance level you’d get by making the model a certain size. It was empirically confirmed that bigger means better.
In 2022, DeepMind’s Hoffmann et al. corrected this in the “Chinchilla” paper10. They showed existing models had insufficient training data relative to parameter count, proposing that 70B parameter models need about 1.4T tokens of data. This discovery greatly influenced efficient LLM design including LLaMA and later models.
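The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter (70B × 20 ≈ 1.4T). The sketch below reproduces that back-of-the-envelope calculation, together with the common C ≈ 6·N·D approximation for training FLOPs; both the 20:1 ratio and the factor of 6 are standard approximations rather than exact laws.

```python
def chinchilla_estimate(n_params, tokens_per_param=20):
    """Rough compute-optimal token count and training FLOPs for a given model size."""
    n_tokens = n_params * tokens_per_param   # ~20 tokens per parameter (rule of thumb)
    flops = 6 * n_params * n_tokens          # common approximation: C ~ 6 * N * D
    return n_tokens, flops

for n_params in (7e9, 70e9, 405e9):
    n_tokens, flops = chinchilla_estimate(n_params)
    print(f"{n_params/1e9:.0f}B params -> {n_tokens/1e12:.2f}T tokens, ~{flops:.2e} FLOPs")
```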
The core message of scaling laws was clear. Rather than complex architecture design, scaling up simple architectures was more effective. And the easiest structure to “scale up simply” was the Decoder-only Transformer.
Present and Future: MoE, SSM, Long Context
While Transformer established itself as the dominant architecture, attempts to overcome its limitations were also active.
Mixture of Experts (MoE) was a technique that improved computational efficiency by activating only part of the model’s total parameters. Like a general hospital with 100 specialists where only the 2 or 3 relevant specialists examine each patient: the total knowledge is vast, but you don’t mobilize all of it every time. GPT-4 was estimated to use an MoE structure, and Mistral’s Mixtral 8x7B11 activated only 12.9B of its 46.7B total parameters during inference, achieving much higher performance at roughly the inference cost of a 13B dense model.
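The routing idea is small enough to sketch: a gating layer scores all experts for each token, only the top-k (2 in Mixtral’s case) actually run, and their outputs are mixed by the renormalized gate weights. The expert networks below are random stand-ins and the dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, experts, W_gate, top_k=2):
    """Top-k Mixture-of-Experts routing for a single token vector (sketch)."""
    gate_scores = x @ W_gate                          # one score per expert
    top = np.argsort(gate_scores)[-top_k:]            # only the best top_k experts run
    weights = softmax(gate_scores[top])               # renormalize over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d_model, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [
    (lambda W1, W2: (lambda x: np.maximum(0, x @ W1) @ W2))(
        rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model)))
    for _ in range(n_experts)                          # 8 small FFN "specialists"
]
W_gate = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(d_model,))                        # one token's hidden state
print(moe_layer(x, experts, W_gate).shape)             # (16,): only 2 of 8 experts were computed
```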
State Space Models (SSM) were attempts to fundamentally bypass Transformer’s O(n²) attention complexity. If Transformer was an “exhaustive search” method comparing all word pairs, SSM was closer to a “running summary,” viewing the sequence as a flowing signal and efficiently updating a state. In particular, Mamba12 proposed Selective State Space Models, processing with linear complexity O(n) in sequence length while achieving performance close to Transformer. Hybrid architectures combining MoE and SSM, such as MoE-Mamba, also emerged.
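For contrast with the O(n²) attention sketches above, the basic idea behind SSMs can be shown as a linear recurrence: a fixed-size state is updated once per token, so cost grows linearly with sequence length. This toy version uses fixed A, B, C matrices; Mamba’s key addition, making them input-dependent (“selective”), is omitted here.

```python
import numpy as np

def ssm_scan(xs, A, B, C):
    """Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:               # one fixed-size state update per token -> O(n) overall
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.stack(ys)

d_state, d_in, d_out, seq_len = 16, 8, 8, 1000
rng = np.random.default_rng(0)
A = 0.9 * np.eye(d_state)                       # decaying memory of earlier tokens
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
xs = rng.normal(size=(seq_len, d_in))
print(ssm_scan(xs, A, B, C).shape)              # (1000, 8), with no n x n score matrix needed
```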
Long context processing was also a major research direction. LLaMA 3.1 supported a 128K-token context window, Claude 200K tokens, and Gemini over 1 million tokens. Techniques like Ring Attention and Sparse Attention were applied to keep O(n²) attention practical at these lengths.
There were predictions that Transformer would soon be replaced, but reality was different. Hybrid architectures mixing SSM and Transformer seemed more promising, and research continued on improving Transformer’s efficiency itself. The core principle of attention mechanism was likely to survive in some form.
Personal Thoughts
What impressed me most while studying Transformer was the “simplicity” of the architecture itself. Self-Attention, Feed-Forward Network, residual connections, layer normalization—the individual components already existed. What Vaswani et al. did was combine them without RNN. No one expected that combination to be this powerful.
Another interesting point was how the definition of “good architecture” changed over time. In 2018, BERT understanding bidirectional context was revolutionary. After 2020, GPT-3 with unidirectional but massive scaling flipped the landscape. Scale won over architectural sophistication.
I don’t know what form Transformer will evolve into. But one thing was certain—the intuition proposed in that 2017 paper that “attention is all you need” was still valid over 7 years later.
Footnotes
1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762
2. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
3. Since dot products are calculated for all token pairs in a sequence of length n, the cost is O(n²·d). See Table 1 (Section 4) of the original paper.
4. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
5. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
6. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
7. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67. https://arxiv.org/abs/1910.10683
8. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165
9. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
10. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., et al. (2022). Training Compute-Optimal Large Language Models. https://arxiv.org/abs/2203.15556
11. Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
12. Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752