Decoder-Only Large Language Models: Input → Output Flow

The diagram below illustrates how a decoder-only large language model transforms a sequence of input tokens into a probability distribution over the next token in a single forward pass.

What the Diagram Shows

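1. Input Tokens (left). The model begins with a sequence of discrete token IDs. Each ID is looked up in an embedding table to produce a dense vector of dimension d_model. Because self-attention has no built-in notion of order, positional information is added to these vectors so the model can tell token positions apart; some models instead inject position inside the attention layers.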
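
As a rough illustration of this step, the sketch below looks up token IDs in an embedding table and adds fixed sinusoidal positional encodings. The sinusoidal scheme is only one option and is an assumption here (the text does not say which encoding the diagram uses); the sizes, the random table, and the names `embed` and `sinusoidal_positions` are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model).

    Assumes d_model is even.
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

def embed(token_ids, embedding_table):
    """Look up each token ID, then add position information."""
    x = embedding_table[token_ids]                 # (seq_len, d_model)
    return x + sinusoidal_positions(len(token_ids), embedding_table.shape[1])

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16                      # hypothetical sizes
embedding_table = rng.normal(size=(vocab_size, d_model))  # random stand-in for learned weights
token_ids = np.array([5, 42, 7])
x = embed(token_ids, embedding_table)
print(x.shape)
```

In a trained model the embedding table is learned; the random values here only make the shapes concrete.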

2. Transformer Blocks (center). The core of the model is a stack of N transformer blocks. The blocks are identical in structure but each has its own weights, and each contains two sub-layers:

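  • Masked Multi-Head Self-Attention: the input is projected into queries, keys, and values by learned matrices (W_Q, W_K, W_V). Each token attends to itself and to all preceding tokens, never to future ones. The causal mask enforces this left-to-right constraint, which is characteristic of decoder-only models. The concatenated head outputs then pass through a final learned projection (W_O).
  • Feed-Forward Network: a two-layer MLP with a GELU activation, applied to each position independently. In typical configurations, where the hidden layer is about 4× wider than d_model, this sub-layer holds roughly two-thirds of each block's parameters.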
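
The two sub-layers can be sketched in NumPy as follows. This is a minimal, untrained illustration rather than any particular model's implementation: the weights are random, the sizes are hypothetical, biases are omitted from the attention projections, and GELU uses the common tanh approximation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    """Masked multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq_len, seq_len)

    # Causal mask: position i may not look at positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    heads = weights @ v                                     # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ W_O, weights

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer MLP applied to every position independently."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_ff = 4, 8, 2, 32               # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
attn_out, weights = causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads)

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
ffn_out = feed_forward(attn_out, W1, b1, W2, b2)
print(attn_out.shape, ffn_out.shape)
```

Every entry of `weights` above the diagonal is zero, so changing a later token cannot affect the output at an earlier position.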

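Both sub-layers are wrapped with a residual connection and layer normalization (Add & Norm). The original Transformer normalizes after the residual addition (post-norm); many GPT-style models instead normalize the sub-layer's input (pre-norm).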
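
A small sketch of the wrapping, under stated assumptions: `sublayer` stands for either attention or the feed-forward network and is replaced by a hypothetical placeholder function here, and `gamma` and `beta` are the learnable scale and shift of layer normalization, set to their usual initial values. Both the post-norm ordering named by "Add & Norm" and the pre-norm ordering used by GPT-2 and many later models are shown for comparison.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def post_norm(x, sublayer, gamma, beta):
    """Add & Norm: add the residual first, then normalize."""
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm(x, sublayer, gamma, beta):
    """Pre-norm variant: normalize the sub-layer input, then add the residual."""
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                          # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

def sublayer(h):
    # Placeholder for attention or the feed-forward network
    return 0.5 * h

out_post = post_norm(x, sublayer, gamma, beta)
out_pre = pre_norm(x, sublayer, gamma, beta)
print(out_post.shape, out_pre.shape)
```

In the pre-norm form the residual path carries `x` through unchanged, which is one reason it is often preferred for deep stacks.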

3. Output: Next Token Distribution (right). After the final transformer block, an output projection matrix maps the hidden state at the last position to a vector of logits, one per vocabulary entry. A softmax converts these logits into a probability distribution. The model samples from this distribution (or greedily picks the most likely token) to produce the next token. Appending that token to the input and repeating the process generates text autoregressively.
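
The output step and the autoregressive loop might look like the sketch below. To keep it self-contained, the transformer stack is replaced by a hypothetical stand-in, `toy_hidden`, which simply averages the context embeddings. `E`, `W_out`, and the sizes are random placeholders rather than trained weights.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_probs(h_last, W_out):
    """Project the last position's hidden state to vocabulary logits, then softmax."""
    logits = h_last @ W_out                # (vocab_size,)
    return softmax(logits)

def generate(token_ids, hidden_fn, W_out, n_new, rng=None):
    """Autoregressive loop: greedy if rng is None, otherwise sample from the distribution."""
    tokens = list(token_ids)
    for _ in range(n_new):
        h_last = hidden_fn(tokens)         # hidden state at the final position
        probs = next_token_probs(h_last, W_out)
        if rng is None:
            nxt = int(np.argmax(probs))                    # greedy selection
        else:
            nxt = int(rng.choice(len(probs), p=probs))     # sampling
        tokens.append(nxt)                 # feed the new token back in
    return tokens

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                # hypothetical sizes
E = rng.normal(size=(vocab_size, d_model))
W_out = rng.normal(size=(d_model, vocab_size))

def toy_hidden(tokens):
    # Stand-in for the transformer stack: mean of the context embeddings
    return E[tokens].mean(axis=0)

greedy = generate([3, 14, 15], toy_hidden, W_out, n_new=5)
sampled = generate([3, 14, 15], toy_hidden, W_out, n_new=5, rng=rng)
print(greedy, sampled)
```

Greedy decoding is deterministic for a given prompt, while sampling can produce a different continuation on each run unless the random generator is seeded.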

Key Architectural Facts

Component                      Role
-----------------------------  -----------------------------------------------
Token Embedding                Maps token IDs → dense vectors
Causal Mask                    Prevents attending to future positions
Multi-Head Attention           Captures contextual relationships
Feed-Forward Network           Applies non-linear transformations per position
Layer Norm + Residual          Stabilizes training, preserves gradient flow
Output Projection + Softmax    Produces next-token probabilities

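The "decoder-only" label distinguishes this design from encoder-decoder models (like the original Transformer), which add a separate encoder and cross-attention between the two stacks. GPT-style models, including the GPT series, LLaMA, Mistral, and Claude, follow this overall architecture, though details such as the activation function, normalization, and positional scheme vary between them. Its simplicity makes it highly scalable: stack more blocks, widen d_model, and train on more tokens.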
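
To make the scaling knobs concrete, the sketch below gives a back-of-the-envelope parameter count. It rests on several assumptions not stated in the text: a feed-forward hidden width of 4 × d_model, no biases or layer-norm parameters, and an output projection that shares weights with the token embedding. The function name and the example sizes are illustrative only.

```python
def approx_param_count(n_blocks, d_model, vocab_size, d_ff=None):
    """Rough parameter estimate for a decoder-only transformer.

    Ignores biases, layer-norm parameters, and positional embeddings.
    """
    if d_ff is None:
        d_ff = 4 * d_model                      # assumed 4x hidden expansion
    attention = n_blocks * 4 * d_model * d_model    # W_Q, W_K, W_V, W_O
    ffn = n_blocks * 2 * d_model * d_ff             # two MLP weight matrices
    embedding = vocab_size * d_model                # assumed shared with the output projection
    return {
        "attention": attention,
        "ffn": ffn,
        "embedding": embedding,
        "total": attention + ffn + embedding,
    }

# Illustrative sizes, roughly in the range of a small GPT-2-scale model
counts = approx_param_count(n_blocks=12, d_model=768, vocab_size=50257)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under these assumptions the per-block count grows with the square of d_model and linearly with the number of blocks, and the feed-forward sub-layers hold twice as many parameters as the attention sub-layers.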