Transformer Architecture: Residual Streams, Attention Blocks, and Autoregressive Generation

A Transformer updates token representations along a residual stream: attention communicates across positions, while FFNs transform each token independently along the channel dimension.

Mechanism Lab

Animation: how one Transformer block updates the residual stream

The animation shows token representations receiving positional information and LayerNorm, communicating through multi-head attention, and then passing through the residual paths and FFN to the output logits.

Step 1 of 5 (Embed): token embeddings and positions are added to form the first residual stream, H_0 = E + P.

01 / Intuition

Core Intuition

A Transformer is not a single attention equation; it is a stackable representation update rule involving embeddings, positional information, LayerNorm, multi-head attention, residual paths, FFNs, and output heads.

Residual connections let each layer learn an incremental update rather than overwrite the whole representation; LayerNorm stabilizes token scale for deep stacks.

Self-attention performs global communication inside the sequence, while the FFN applies a token-wise nonlinear channel transformation.

Encoders, decoders, and decoder-only LLMs mainly differ by masks and training objectives: fully visible understanding, causal generation, or encoder-decoder conditional generation.
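The mask differences described above can be sketched as boolean arrays, where True means "this query position may attend to this key position". The sequence lengths and variable names below are illustrative assumptions, not part of any particular model.

```python
import numpy as np

n_src, n_tgt = 4, 3  # hypothetical source and target lengths

# Encoder self-attention: every position may attend to every other position.
encoder_mask = np.ones((n_src, n_src), dtype=bool)

# Decoder-only (causal) self-attention: position i sees positions 0..i only.
causal_mask = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))

# Encoder-decoder cross-attention: each target position sees all source positions.
cross_mask = np.ones((n_tgt, n_src), dtype=bool)

print(causal_mask.astype(int))
```

The same attention code can serve all three settings; only the mask (and the training objective) changes.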

02 / Math

Deriving a full layer update from one Transformer block

01 / Input representation

Token ids are embedded and combined with positional information. Without position, unmasked self-attention is permutation equivariant: reordering the input tokens simply reorders the outputs in the same way, so the model cannot distinguish word order.

H_0 = E[token] + P[position]
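The role of position can be checked numerically. The sketch below is a minimal, single-head, unmasked attention with random weights (all names and sizes are assumptions for illustration): without positions, reversing the input just reverses the output; with positions added per slot, that symmetry breaks.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head, unmasked scaled dot-product attention.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
n, d = 5, 8
E = rng.normal(size=(n, d))   # token embeddings only, no positions
P = rng.normal(size=(n, d))   # one positional vector per slot
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

perm = np.array([4, 3, 2, 1, 0])  # reverse the token order

# Without positions: the reordered input gives the same outputs, reordered.
out = self_attention(E, Wq, Wk, Wv)
out_perm = self_attention(E[perm], Wq, Wk, Wv)
no_position_equivariant = np.allclose(out_perm, out[perm])

# With positions: a token moved to a new slot receives a different
# positional vector, so the outputs are no longer a simple reordering.
out_pos = self_attention(E + P, Wq, Wk, Wv)
out_pos_perm = self_attention(E[perm] + P, Wq, Wk, Wv)
with_position_equivariant = np.allclose(out_pos_perm, out_pos[perm])

print(no_position_equivariant, with_position_equivariant)
```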

02 / Pre-norm attention

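Modern deep Transformers often normalize before each sublayer (pre-norm) rather than after the residual addition. The residual path then remains an unnormalized identity connection, which helps keep gradients stable in deep stacks.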

U_l = H_l + MHA(LN(H_l))
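To make the pre-norm update concrete, the sketch below contrasts it with the post-norm ordering used in the original Transformer, LN(H + Sublayer(H)). The `silent` sublayer is a hypothetical stand-in that outputs zeros, chosen only to expose what happens to the residual path itself.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(h, sublayer):
    # Pre-norm: only the sublayer input is normalized; h itself passes through.
    return h + sublayer(layer_norm(h))

def post_norm_step(h, sublayer):
    # Post-norm: the sum is normalized, so the residual path is rescaled too.
    return layer_norm(h + sublayer(h))

rng = np.random.default_rng(2)
h = 3.0 * rng.normal(size=(4, 8)) + 1.0

def silent(x):
    # Hypothetical sublayer that contributes nothing.
    return np.zeros_like(x)

pre_is_identity = np.allclose(pre_norm_step(h, silent), h)
post_is_identity = np.allclose(post_norm_step(h, silent), h)
print(pre_is_identity, post_is_identity)
```

With pre-norm, a sublayer that contributes nothing leaves the stream untouched, so stacking many blocks keeps a direct path from input to output.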

03 / Multi-head communication

Each head learns its own Q/K/V space; parallel heads are concatenated and mixed by an output projection.

MHA(X)=Concat(head_h)W_O
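Written out per head, the formula above expands as follows. This is a sketch in standard scaled dot-product notation; the per-head projection names W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} and the additive mask M (0 where attention is allowed, negative infinity where it is blocked) are notation assumed here, not symbols defined in the text.

```latex
\begin{aligned}
Q_h &= X W_Q^{(h)}, \quad K_h = X W_K^{(h)}, \quad V_h = X W_V^{(h)}, \\
\mathrm{head}_h &= \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_{\mathrm{head}}}} + M\right) V_h, \\
\mathrm{MHA}(X) &= \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_{n_{\mathrm{heads}}}\right) W_O,
\qquad d_{\mathrm{head}} = d_{\mathrm{model}} / n_{\mathrm{heads}}.
\end{aligned}
```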

04 / FFN channel transform

The FFN applies the same two-layer MLP independently to every token, usually expanding to d_ff and projecting back to d_model.

FFN(x)=phi(x W_1 + b_1) W_2 + b_2
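The "independently to every token" property can be verified directly. The sketch below assumes ReLU for phi, the row-vector convention `x @ W1`, and arbitrary random weights; it checks that running the FFN on the whole sequence matches running it token by token, and that editing one token leaves the other outputs unchanged.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Two-layer MLP with ReLU; works on one token (d_model,) or many (n, d_model).
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
n, d_model, d_ff = 4, 8, 32
X = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
b2 = rng.normal(size=d_model)

batched = ffn(X, W1, b1, W2, b2)
per_token = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(n)])
same_as_per_token = np.allclose(batched, per_token)

# Perturb token 0 only; the outputs for tokens 1..n-1 must not move.
X_edit = X.copy()
X_edit[0] += 1.0
others_unchanged = np.allclose(ffn(X_edit, W1, b1, W2, b2)[1:], batched[1:])

print(same_as_per_token, others_unchanged)
```

All cross-token mixing therefore has to come from attention; the FFN only reshapes each token's channels.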

05 / Second residual

The FFN output is added back to the residual stream. A block output is the representation after communication and token-wise nonlinear processing.

H_{l+1}=U_l + FFN(LN(U_l))

06 / Output distribution

A language model projects the final-layer hidden vector h_t at the last position onto the vocabulary and applies softmax to obtain next-token probabilities.

p(x_{t+1}|x_{<=t})=softmax(h_t W_vocab)
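Autoregressive generation repeats this step: predict a distribution, pick a token, append it, and run again on the longer prefix. The sketch below uses greedy decoding and a deliberately trivial `last_hidden` function (the mean embedding of the prefix) as a hypothetical stand-in for the full Transformer stack; only the loop structure is the point.

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(3)
vocab, d_model = 16, 8
embedding = rng.normal(size=(vocab, d_model))
W_vocab = rng.normal(size=(d_model, vocab))

def last_hidden(tokens):
    # Hypothetical stand-in for the Transformer stack: mean prefix embedding.
    # A real model would run all blocks with a causal mask here.
    return embedding[tokens].mean(axis=0)

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        h_t = last_hidden(np.array(tokens))
        probs = softmax(h_t @ W_vocab)      # p(x_{t+1} | x_{<=t})
        tokens.append(int(probs.argmax()))  # greedy decoding: take the mode
    return tokens

out = generate([1, 5], steps=4)
print(out)
```

Sampling from `probs` instead of taking the argmax would give stochastic decoding with the same loop.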

03 / Code

NumPy demo: one pre-norm Transformer block

This is not a large-model training script; it explicitly runs the data flow through positional input, MHA, residual paths, FFN, and logits.

import numpy as np

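def layer_norm(x, eps=1e-5):
    # Normalize each token over its channels; the learnable gain and bias
    # of a full LayerNorm are omitted to keep the demo short.
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)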

def softmax(a, axis=-1):
    a = a - np.max(a, axis=axis, keepdims=True)
    exp = np.exp(a)
    return exp / exp.sum(axis=axis, keepdims=True)

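def mha(x, weights, n_heads=2, mask=None):
    # x: (n, d_model); mask: optional (n, n) boolean array, True = may attend.
    n, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads
    Q = x @ weights["Wq"]
    K = x @ weights["Wk"]
    V = x @ weights["Wv"]

    # Split channels into heads: (n, d_model) -> (n_heads, n, d_head).
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product scores per head: (n_heads, n, n).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    if mask is not None:
        # Blocked positions get a large negative score, so softmax assigns them ~0 weight.
        scores = np.where(mask[None, :, :], scores, -1e9)
    attn = softmax(scores, axis=-1)
    context = attn @ V
    # Concatenate heads back to (n, d_model), then mix them with the output projection.
    context = context.transpose(1, 0, 2).reshape(n, d_model)
    return context @ weights["Wo"], attn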

def ffn(x, weights):
    hidden = np.maximum(0, x @ weights["W1"] + weights["b1"])
    return hidden @ weights["W2"] + weights["b2"]

rng = np.random.default_rng(11)
n, d_model, d_ff, vocab = 5, 8, 24, 32
tokens = rng.integers(0, vocab, size=n)

embedding = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)
position = rng.normal(size=(n, d_model)) / np.sqrt(d_model)
H = embedding[tokens] + position

weights = {
    "Wq": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wk": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wv": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wo": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "W1": rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff),
    "b2": np.zeros(d_model),
    "W_vocab": rng.normal(size=(d_model, vocab)) / np.sqrt(d_model),
}

causal_mask = np.tril(np.ones((n, n), dtype=bool))
A, attention_weights = mha(layer_norm(H), weights, n_heads=2, mask=causal_mask)
U = H + A
H_next = U + ffn(layer_norm(U), weights)
logits = H_next[-1] @ weights["W_vocab"]
probs = softmax(logits)

print("hidden shape:", H_next.shape)
print("attention shape:", attention_weights.shape)
print("top next-token ids:", probs.argsort()[-5:][::-1])
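A useful follow-up check on the causal mask used above is that editing a future token must not change earlier positions. The sketch below is self-contained and uses a simplified `toy_attention` (an assumption for brevity: one head, Q = K = V = x, no learned weights) rather than the `mha` function from the demo.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - np.max(a, axis=axis, keepdims=True)
    exp = np.exp(a)
    return exp / exp.sum(axis=axis, keepdims=True)

def toy_attention(x, mask=None):
    # Simplification (assumption): Q = K = V = x, one head, no learned weights.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ x

rng = np.random.default_rng(5)
n, d = 5, 4
x = rng.normal(size=(n, d))
x_edit = x.copy()
x_edit[-1] = rng.normal(size=d)  # change only the final (future) token

causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Compare outputs at positions 0..n-2 before and after the edit.
masked_same = np.allclose(toy_attention(x, causal_mask)[:-1],
                          toy_attention(x_edit, causal_mask)[:-1])
unmasked_same = np.allclose(toy_attention(x)[:-1],
                            toy_attention(x_edit)[:-1])

print("masked outputs unchanged:", masked_same)
print("unmasked outputs unchanged:", unmasked_same)
```

If the unmasked variant were used for next-token training, earlier positions would see the token they are supposed to predict.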

04 / Case

Case: modeling research text and regression tables with Transformers

In StatsPAI-style work, a research question, data dictionary, regression table, footnotes, and reviewer comments are tokenized. Each token enters the residual stream through embedding plus position.
Attention lets “treatment effect” attend to table headers, estimands, standard errors, sample restrictions, and identification-design paragraphs; FFN layers re-encode each local token representation nonlinearly.
After many stacked layers, the model can generate explanations that combine definitions, table structure, and method context, but this is representational power rather than proof of causal validity.
In an agent system, Transformers are good at reading and writing research material; reliable workflows still require tool calls, rerunnable code, data provenance, and human review checkpoints.

05 / Risks

Common Pitfalls

Reducing “Transformer” to “attention” and ignoring residual streams, LayerNorm, FFNs, positional information, and training objectives.

Forgetting a causal mask in decoder-only generation, which leaks future tokens during training or evaluation.

Treating fluent generated explanations as identification proof. Transformers model representations and conditional distributions; they do not automatically validate causality.

Ignoring long-context cost: standard self-attention costs O(n^2) time and memory in the sequence length n, so long papers, logs, and high-frequency data may require chunking, retrieval, or sparse variants.

Overfitting large models on small data, causing memorization, format drift, or unstable tool calls.

Overinterpreting a single head or layer visualization. It can be a diagnostic clue, not a standalone conclusion.
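To put the quadratic long-context cost noted above into numbers, the back-of-the-envelope sketch below counts the bytes needed to hold the full attention score matrices for one layer and one sequence. The head count (16) and 2-byte values are illustrative assumptions, and the helper name `attention_score_bytes` is hypothetical. Implementations that avoid materializing the full matrix change the memory picture, but not the quadratic number of score computations.

```python
def attention_score_bytes(n_tokens, n_heads, bytes_per_value=2):
    # One (n_tokens x n_tokens) score matrix per head, for one layer and one sequence.
    return n_heads * n_tokens * n_tokens * bytes_per_value

for n in (1_024, 8_192, 65_536):
    gib = attention_score_bytes(n, n_heads=16) / 2**30
    print(f"n={n:>6}: {gib:8.3f} GiB of attention scores per layer")
```

Under these assumptions, going from 8,192 to 65,536 tokens (8x longer) raises the score storage from 2 GiB to 128 GiB (64x larger).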