Transformer Architecture: Residual Streams, Attention Blocks, and Autoregressive Generation

A Transformer updates token representations along a residual stream: attention communicates across positions, while FFNs transform each token's channels independently.

Mechanism Lab

Animation: how one Transformer block updates the residual stream

The animation shows token representations receiving positional information, passing through LayerNorm, communicating through multi-head attention, and then flowing through the residual paths, the FFN, and the output logits.

Step 1 of 5 (Embed): token embeddings and positions are added into the first residual stream, H_0 = E + P.

01 / Intuition

Core Intuition

A Transformer is not a single attention equation; it is a stackable representation update rule involving embeddings, positional information, LayerNorm, multi-head attention, residual paths, FFNs, and output heads.

Residual connections let each layer learn an incremental update rather than overwrite the whole representation; LayerNorm stabilizes token scale for deep stacks.

Self-attention performs global communication inside the sequence, while the FFN applies a token-wise nonlinear channel transformation.

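Encoders, decoders, and decoder-only LLMs differ mainly in their attention masks and training objectives: encoders use fully visible attention for understanding tasks, decoder-only models use causal attention for generation, and encoder-decoder models add cross-attention for conditional generation.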
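The mask difference described above can be made concrete with a minimal sketch. The scores here are hypothetical (all zeros), chosen so that the resulting attention weights reflect only the mask and not any learned content.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
# Encoder-style mask: every token may attend to every token.
full_mask = np.ones((n, n), dtype=bool)
# Decoder-style (causal) mask: token i may attend only to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Hypothetical uniform scores, so the weights show only the effect of the mask.
scores = np.zeros((n, n))
full_attn = softmax(np.where(full_mask, scores, -np.inf))
causal_attn = softmax(np.where(causal_mask, scores, -np.inf))

print(causal_mask.astype(int))
print(causal_attn.round(2))
```

Under the causal mask, row i spreads its weight evenly over positions 0..i and puts zero weight on later positions; under the full mask every row is uniform over all positions.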

02 / Math

Deriving a full layer update from one Transformer block

01 / Input representation

Token ids are embedded and combined with positional information. Without position, unmasked self-attention is permutation equivariant: reordering the input tokens only reorders the outputs, so the model cannot distinguish word order.

H_0 = E[token] + P[position]
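The role of position can be checked numerically. The sketch below uses a single unmasked attention head with random weights; `self_attention`, `P`, and `perm` are illustrative names, not part of the original demo.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head, unmasked attention; no positional information is added here.
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
    return softmax(scores) @ (x @ Wv)

rng = np.random.default_rng(0)
n, d = 5, 8
E = rng.normal(size=(n, d))   # token embeddings only
P = rng.normal(size=(n, d))   # hypothetical positional vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])  # a fixed reordering of the tokens

# Without position: permuting the input rows only permutes the output rows.
out = self_attention(E, Wq, Wk, Wv)
out_perm = self_attention(E[perm], Wq, Wk, Wv)
equivariant_without_pos = np.allclose(out[perm], out_perm)

# With position: shuffled tokens meet different position vectors, so outputs change.
out_pos = self_attention(E + P, Wq, Wk, Wv)
out_pos_perm = self_attention(E[perm] + P, Wq, Wk, Wv)
equivariant_with_pos = np.allclose(out_pos[perm], out_pos_perm)

print(equivariant_without_pos, equivariant_with_pos)
```

The first flag should be True and the second False: only after adding P does the layer see a difference between two orderings of the same tokens.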

02 / Pre-norm attention

Modern deep Transformers often normalize before each sublayer (pre-norm). The residual path then stays an identity connection from input to output, which tends to keep gradients stable in deep stacks.

U_l = H_l + MHA(LN(H_l))
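A small sketch can contrast this ordering with the post-norm ordering of the original Transformer, LN(H + Sublayer(H)). The helper names and the zero-output sublayer are illustrative assumptions; the zero sublayer isolates what happens to the residual stream itself.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(h, sublayer):
    # Normalize only the sublayer input; the residual path is untouched.
    return h + sublayer(layer_norm(h))

def post_norm_step(h, sublayer):
    # Normalize after the addition; the residual path itself is rescaled.
    return layer_norm(h + sublayer(h))

rng = np.random.default_rng(1)
h = 3.0 * rng.normal(size=(4, 8)) + 2.0     # stream with non-unit scale and offset
zero_sublayer = lambda x: np.zeros_like(x)  # stand-in sublayer that contributes nothing

pre_out = pre_norm_step(h, zero_sublayer)
post_out = post_norm_step(h, zero_sublayer)

print(np.allclose(pre_out, h))   # pre-norm leaves the stream unchanged
print(np.allclose(post_out, h))  # post-norm rescales it even with a zero update
```

With pre-norm, a sublayer that outputs zero leaves the stream exactly as it was, which is the sense in which each layer only adds an increment.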

03 / Multi-head communication

Each head learns its own Q/K/V space; parallel heads are concatenated and mixed by an output projection.

MHA(X) = Concat(head_1, ..., head_h) W_O
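One way to read the output projection: concatenating the heads and multiplying by W_O is the same as letting each head write into the residual stream through its own row slice of W_O and summing the results. The sketch below checks this with hypothetical random head outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads

# Hypothetical per-head outputs (what attn @ V would produce for each head).
heads = [rng.normal(size=(n, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# Route 1: concatenate heads along the channel axis, then apply W_O.
concat_out = np.concatenate(heads, axis=-1) @ W_O

# Route 2: each head multiplies its own d_head-row slice of W_O; sum the results.
sum_out = sum(
    head @ W_O[i * d_head:(i + 1) * d_head]
    for i, head in enumerate(heads)
)

print(np.allclose(concat_out, sum_out))
```

Both routes give the same (n, d_model) update, so the heads can be viewed as independent writers that add into one shared stream.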

04 / FFN channel transform

The FFN applies the same two-layer MLP independently to every token, usually expanding to d_ff and projecting back to d_model.

FFN(x)=W_2 phi(W_1 x + b_1)+b_2
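The token-wise property can be verified directly. This sketch uses ReLU for phi and the row-vector convention `x @ W1` (the transpose of the W_1 x notation above); the sizes are arbitrary illustrative choices.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, apply ReLU, project back to d_model.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
n, d_model, d_ff = 5, 8, 32
W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
b2 = np.zeros(d_model)
X = rng.normal(size=(n, d_model))

# Applying the FFN to the whole sequence equals applying it token by token.
batched = ffn(X, W1, b1, W2, b2)
per_token = np.stack([ffn(x, W1, b1, W2, b2) for x in X])
tokenwise = np.allclose(batched, per_token)

# Changing token 0 leaves the FFN outputs of all other tokens unchanged.
X_changed = X.copy()
X_changed[0] += 1.0
unaffected = np.allclose(ffn(X_changed, W1, b1, W2, b2)[1:], batched[1:])

n_params = W1.size + b1.size + W2.size + b2.size
print(tokenwise, unaffected, n_params)
```

No information moves between positions inside the FFN; any cross-token mixing has to come from attention.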

05 / Second residual

The FFN output is added back to the residual stream. A block output is the representation after communication and token-wise nonlinear processing.

H_{l+1}=U_l + FFN(LN(U_l))
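Substituting the attention update into the FFN update and unrolling over layers shows why the residual stream is an accumulation of increments. Here L denotes the number of stacked blocks, a symbol introduced only for this derivation.

```latex
\begin{aligned}
H_{l+1} &= H_l + \mathrm{MHA}(\mathrm{LN}(H_l)) + \mathrm{FFN}(\mathrm{LN}(U_l)), \\
H_L &= H_0 + \sum_{l=0}^{L-1} \Big[ \mathrm{MHA}(\mathrm{LN}(H_l)) + \mathrm{FFN}(\mathrm{LN}(U_l)) \Big].
\end{aligned}
```

The final representation is the input embedding plus the sum of every sublayer's contribution, so no layer overwrites what earlier layers wrote.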

06 / Output distribution

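A language model projects the final-layer hidden vector h_t at position t onto the vocabulary and applies softmax to obtain next-token probabilities. During generation, only the last position of the current prefix is needed.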

p(x_{t+1}|x_{<=t})=softmax(h_t W_vocab)
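Autoregressive generation repeats this step: predict a distribution, pick a token, append it, and run again on the longer prefix. The loop below is a minimal sketch; `toy_logits` is a hypothetical stand-in for a full Transformer forward pass, and `generate` and `temperature` are illustrative names rather than any library's API.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

def toy_logits(prefix, embedding, W_vocab):
    # Hypothetical stand-in for a Transformer: the "hidden vector" h_t is just
    # the mean embedding of the prefix, projected to the vocabulary.
    h_t = embedding[prefix].mean(axis=0)
    return h_t @ W_vocab

def generate(prefix, steps, embedding, W_vocab, rng=None, temperature=1.0):
    tokens = list(prefix)
    for _ in range(steps):
        logits = toy_logits(np.array(tokens), embedding, W_vocab)
        if rng is None:
            next_id = int(np.argmax(logits))                # greedy decoding
        else:
            probs = softmax(logits / temperature)
            next_id = int(rng.choice(len(probs), p=probs))  # sampling
        tokens.append(next_id)  # the new token becomes part of the next prefix
    return tokens

rng = np.random.default_rng(4)
vocab, d_model = 32, 8
embedding = rng.normal(size=(vocab, d_model))
W_vocab = rng.normal(size=(d_model, vocab))

prefix = [3, 17, 5]
greedy = generate(prefix, steps=4, embedding=embedding, W_vocab=W_vocab)
sampled = generate(prefix, steps=4, embedding=embedding, W_vocab=W_vocab, rng=rng)
print(greedy)
print(sampled)
```

Greedy decoding is deterministic for a fixed prefix, while sampling draws from the softmax distribution and varies with the random state.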

03 / Code

NumPy demo: one pre-norm Transformer block

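This is not a large-model training script; it traces the data flow explicitly through positional input, MHA, residual paths, FFN, and logits. For brevity, LayerNorm has no learnable scale or shift, and no final LayerNorm is applied before the output projection.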

import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(a, axis=-1):
    a = a - np.max(a, axis=axis, keepdims=True)
    exp = np.exp(a)
    return exp / exp.sum(axis=axis, keepdims=True)

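def mha(x, weights, n_heads=2, mask=None):
    n, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    # Project the whole sequence once; heads are channel slices of these projections.
    Q = x @ weights["Wq"]
    K = x @ weights["Wk"]
    V = x @ weights["Wv"]

    # Split channels into heads: (n, d_model) -> (n_heads, n, d_head).
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product scores per head: (n_heads, n, n).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    if mask is not None:
        # mask[i, j] is True where query i may attend to key j.
        scores = np.where(mask[None, :, :], scores, -1e9)
    attn = softmax(scores, axis=-1)
    context = attn @ V
    # Merge heads back to (n, d_model), then mix them with the output projection.
    context = context.transpose(1, 0, 2).reshape(n, d_model)
    return context @ weights["Wo"], attn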

def ffn(x, weights):
    hidden = np.maximum(0, x @ weights["W1"] + weights["b1"])
    return hidden @ weights["W2"] + weights["b2"]

rng = np.random.default_rng(11)
n, d_model, d_ff, vocab = 5, 8, 24, 32
tokens = rng.integers(0, vocab, size=n)

embedding = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)
position = rng.normal(size=(n, d_model)) / np.sqrt(d_model)
H = embedding[tokens] + position

weights = {
    "Wq": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wk": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wv": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wo": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "W1": rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff),
    "b2": np.zeros(d_model),
    "W_vocab": rng.normal(size=(d_model, vocab)) / np.sqrt(d_model),
}

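causal_mask = np.tril(np.ones((n, n), dtype=bool))  # position i sees positions <= i
# U_l = H_l + MHA(LN(H_l))
A, attention_weights = mha(layer_norm(H), weights, n_heads=2, mask=causal_mask)
U = H + A
# H_{l+1} = U_l + FFN(LN(U_l))
H_next = U + ffn(layer_norm(U), weights)
# Next-token distribution from the last position only.
logits = H_next[-1] @ weights["W_vocab"]
probs = softmax(logits)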

print("hidden shape:", H_next.shape)
print("attention shape:", attention_weights.shape)
print("top next-token ids:", probs.argsort()[-5:][::-1])
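The demo applies a causal mask, and it is worth checking that such a mask really blocks future information. One simple test, sketched here with a single head and illustrative names, perturbs only the last token and confirms that the outputs at earlier positions do not move.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, use_mask=True):
    n = x.shape[0]
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
    if use_mask:
        # Causal mask: query i may attend only to keys at positions <= i.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ (x @ Wv)

rng = np.random.default_rng(5)
n, d = 6, 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(n, d))
x_future = x.copy()
x_future[-1] += 5.0  # change only the last (future) token

# With the mask, earlier positions cannot see the perturbed token.
no_leak = np.allclose(
    attention(x, Wq, Wk, Wv)[:-1],
    attention(x_future, Wq, Wk, Wv)[:-1],
)
# Without the mask, the perturbation reaches every earlier position.
leak_without_mask = not np.allclose(
    attention(x, Wq, Wk, Wv, use_mask=False)[:-1],
    attention(x_future, Wq, Wk, Wv, use_mask=False)[:-1],
)
print(no_leak, leak_without_mask)
```

Both flags should be True. The same perturbation test can be applied to a full model as a regression check after any change to masking code.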

04 / Case

Case: modeling research text and regression tables with Transformers

  • In StatsPAI-style work, a research question, data dictionary, regression table, footnotes, and reviewer comments are tokenized. Each token enters the residual stream through embedding plus position.
  • Attention lets “treatment effect” attend to table headers, estimands, standard errors, sample restrictions, and identification-design paragraphs; FFN layers re-encode each local token representation nonlinearly.
  • After many stacked layers, the model can generate explanations that combine definitions, table structure, and method context, but this is representational power rather than proof of causal validity.
  • In an agent system, Transformers are good at reading and writing research material; reliable workflows still require tool calls, rerunnable code, data provenance, and human review checkpoints.

05 / Risks

Common Pitfalls

  • Reducing “Transformer” to “attention” and ignoring residual streams, LayerNorm, FFNs, positional information, and training objectives.
  • Forgetting the causal mask in decoder-only generation, which leaks future tokens during training or evaluation.
  • Treating fluent generated explanations as identification proof. Transformers model representations and conditional distributions; they do not automatically validate causality.
  • Ignoring long-context cost: standard self-attention scales as O(n^2) in sequence length n for both compute and attention-matrix memory, so long papers, logs, and high-frequency data may require chunking, retrieval, or sparse variants.
  • Overfitting large models on small data, causing memorization, format drift, or unstable tool calls.
  • Overinterpreting a single head or layer visualization. It can be a diagnostic clue, not a standalone conclusion.
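To make the quadratic cost concrete, the following back-of-the-envelope sketch counts the bytes needed to hold the attention score matrix for one layer and one sequence. The assumptions are illustrative: 16 heads, float32 values, and scores fully materialized in memory; `attention_score_bytes` is a hypothetical helper.

```python
def attention_score_bytes(n_tokens, n_heads, bytes_per_value=4):
    # One (n_tokens x n_tokens) score matrix per head, stored in full.
    return n_heads * n_tokens * n_tokens * bytes_per_value

for n_tokens in (1_024, 8_192, 65_536):
    gib = attention_score_bytes(n_tokens, n_heads=16) / 2**30
    print(f"{n_tokens:>6} tokens: {gib:8.2f} GiB")
```

Under these assumptions, 8,192 tokens already need 4 GiB for a single layer's scores, and each doubling of the sequence length multiplies that figure by four.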

References