Transformer Architecture: Residual Streams, Attention Blocks, and Autoregressive Generation
A Transformer updates token representations along a residual stream: attention communicates across positions, while FFNs transform each token's channels independently.
Mechanism Lab
Animation: how one Transformer block updates the residual stream
The animation shows token representations receiving positional information, passing through LayerNorm and multi-head attention, and then flowing through the residual paths and the FFN to the output logits.
Step 1 of 5, Embed: token embeddings and positions are added to form the first residual stream, H_0 = E + P.
01 / Intuition
Core Intuition
A Transformer is not a single attention equation; it is a stackable representation update rule involving embeddings, positional information, LayerNorm, multi-head attention, residual paths, FFNs, and output heads.
Residual connections let each layer learn an incremental update rather than overwrite the whole representation; LayerNorm stabilizes token scale for deep stacks.
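The incremental-update idea can be sketched in a few lines. This is a minimal illustration, assuming a simplified pre-norm block with a small ReLU MLP as the sublayer; the names `residual_block`, `w_in`, and `w_out` are hypothetical and not part of the demo later on this page. When the sublayer's output projection is zero, the block returns its input unchanged, so the layer only has to learn a delta.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and (near) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(h, w_in, w_out):
    # Pre-norm residual update: h + f(LN(h)), where f is a small ReLU MLP.
    return h + np.maximum(0, layer_norm(h) @ w_in) @ w_out

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8)) * 5.0          # deliberately large token scale
w_in = rng.normal(size=(8, 16))
w_out_zero = np.zeros((16, 8))              # sublayer contributes nothing
w_out_small = rng.normal(size=(16, 8)) * 0.01

same = residual_block(h, w_in, w_out_zero)
nudged = residual_block(h, w_in, w_out_small)
print("identity when update is zero:", np.allclose(same, h))
print("largest change with a small update:", np.abs(nudged - h).max())

# LayerNorm removes the large input scale before the sublayer sees it.
normed = layer_norm(h)
print("per-token mean:", normed.mean(axis=-1))
print("per-token std:", normed.std(axis=-1))
```

The residual stream `h` keeps its original scale, while the sublayer only ever sees the normalized copy.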
Self-attention performs global communication inside the sequence, while the FFN applies a token-wise nonlinear channel transformation.
Encoder-only models, decoder-only LLMs, and encoder-decoder models differ mainly in their attention masks and training objectives: fully visible attention for understanding, causal attention for generation, and an encoder-conditioned causal decoder for conditional generation.
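The difference between a fully visible mask and a causal mask can be sketched as follows. This is an illustrative example with random stand-in scores, not a trained model; the variable names are assumptions, and the cross-attention used by encoder-decoder models is not shown.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # stand-in attention scores, one row per query

# Encoder-style: every position may attend to every other position.
full_mask = np.ones((n, n), dtype=bool)
# Decoder-style: position t may attend only to positions <= t.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

full_attn = softmax(np.where(full_mask, scores, -1e9))
causal_attn = softmax(np.where(causal_mask, scores, -1e9))

print(np.round(full_attn, 2))
print(np.round(causal_attn, 2))  # the upper triangle is zero
```

Both matrices have rows that sum to one; the causal version simply redistributes each row's weight over the visible prefix.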
02 / Math
Deriving a full layer update from one Transformer block
01 / Input representation
Token ids are embedded and combined with positional information. Without positional information, unmasked self-attention is permutation equivariant: reordering the input tokens only reorders the outputs, so the model cannot reliably distinguish word order.
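The word-order point can be checked numerically. The sketch below assumes a single unmasked attention head with random weights (the helper `self_attention` is hypothetical): shuffling the inputs shuffles the outputs in the same way, and adding a position table breaks that symmetry.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head, unmasked scaled dot-product self-attention.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))                      # token embeddings only
wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])                 # a fixed reordering of the tokens

# Without positions: permuting the inputs permutes the outputs the same way.
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
equivariant_without_pos = np.allclose(out_perm, out[perm])

# With positions: the same tokens in a new order meet different position vectors.
pos = rng.normal(size=(n, d))
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_perm = self_attention(x[perm] + pos, wq, wk, wv)
equivariant_with_pos = np.allclose(out_pos_perm, out_pos[perm])

print("equivariant without positions:", equivariant_without_pos)
print("equivariant with positions:", equivariant_with_pos)
```

With a causal mask the symmetry is already broken in part, which is why the statement is restricted to unmasked attention.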
H_0 = E[token] + P[position]
02 / Pre-norm attention
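Modern deep Transformers often normalize before each sublayer (pre-norm). Because the residual path itself is left un-normalized, gradients can flow through it directly, which tends to stabilize the training of deep stacks.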
U_l = H_l + MHA(LN(H_l))
03 / Multi-head communication
Each head learns its own Q/K/V space; parallel heads are concatenated and mixed by an output projection.
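The head split and the concatenation are just reshapes of the channel axis. The sketch below, with hypothetical helpers `split_heads` and `merge_heads`, shows the shapes involved and that merging undoes splitting; the learned projections are omitted.

```python
import numpy as np

def split_heads(t, n_heads):
    # (n, d_model) -> (n_heads, n, d_head): each head gets a contiguous channel slice.
    n, d_model = t.shape
    return t.reshape(n, n_heads, d_model // n_heads).transpose(1, 0, 2)

def merge_heads(t):
    # (n_heads, n, d_head) -> (n, d_model): concatenate the heads along channels.
    n_heads, n, d_head = t.shape
    return t.transpose(1, 0, 2).reshape(n, n_heads * d_head)

n, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))

heads = split_heads(x, n_heads)
print("per-head shape:", heads.shape)
print("head 0 is the first d_head channels:", np.array_equal(heads[0], x[:, :d_head]))
print("merge undoes split:", np.array_equal(merge_heads(heads), x))
```

In a full layer the split is applied to the projected Q, K, and V, and the merged result is multiplied by W_O so the heads can be mixed.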
MHA(X) = Concat(head_1, ..., head_h) W_O
04 / FFN channel transform
The FFN applies the same two-layer MLP independently to every token, usually expanding to d_ff and projecting back to d_model.
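Token-wise independence can be verified directly. In this sketch (random weights, sizes chosen for illustration), applying the FFN to the whole sequence matches applying it one token at a time, and perturbing one token changes only that token's output.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU: expand to d_ff, then project back to d_model.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 24
x = rng.normal(size=(n, d_model))
w1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
b2 = np.zeros(d_model)

# Whole sequence at once versus one token at a time.
batched = ffn(x, w1, b1, w2, b2)
per_token = np.stack([ffn(x[t], w1, b1, w2, b2) for t in range(n)])
print("same result:", np.allclose(batched, per_token))

# Perturb token 2 only and see which output rows move.
x_changed = x.copy()
x_changed[2] += 1.0
changed = ffn(x_changed, w1, b1, w2, b2)
rows_changed = np.any(~np.isclose(changed, batched), axis=-1)
print("rows that changed:", rows_changed)
```

This is the sense in which the FFN does no communication across positions; all mixing between tokens happens in attention.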
FFN(x) = W_2 phi(W_1 x + b_1) + b_2
05 / Second residual
The FFN output is added back to the residual stream. A block output is the representation after communication and token-wise nonlinear processing.
H_{l+1} = U_l + FFN(LN(U_l))
06 / Output distribution
A language model projects the last hidden vector to the vocabulary and applies softmax to obtain next-token probabilities.
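Repeating this projection step gives autoregressive generation. The loop below is a toy sketch: `last_hidden` is a hypothetical stand-in for a full Transformer stack (it only averages prefix embeddings), and all weights are random, so the generated ids are meaningless. Only the control flow is the point.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab, d_model = 16, 8
embedding = rng.normal(size=(vocab, d_model))
w_vocab = rng.normal(size=(d_model, vocab))

def last_hidden(prefix):
    # Hypothetical stand-in for the model: average of the prefix embeddings.
    return embedding[prefix].mean(axis=0)

def generate(prefix, steps, greedy=True):
    tokens = list(prefix)
    for _ in range(steps):
        # Next-token distribution from the last hidden vector.
        probs = softmax(last_hidden(tokens) @ w_vocab)
        if greedy:
            next_id = int(np.argmax(probs))             # most likely token
        else:
            next_id = int(rng.choice(vocab, p=probs))   # sample from the distribution
        tokens.append(next_id)                          # feed the choice back in
    return tokens

out = generate([3, 7], steps=5)
sampled = generate([3, 7], steps=5, greedy=False)
print("greedy:", out)
print("sampled:", sampled)
```

Each new token is appended to the prefix and becomes part of the input for the next step, which is what makes the process autoregressive.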
p(x_{t+1} | x_{<=t}) = softmax(h_t W_vocab)
03 / Code
NumPy demo: one pre-norm Transformer block
This is not a large-model training script; it explicitly runs the data flow through positional input, MHA, residual paths, FFN, and logits.
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned gain or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(a, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    a = a - np.max(a, axis=axis, keepdims=True)
    exp = np.exp(a)
    return exp / exp.sum(axis=axis, keepdims=True)


def mha(x, weights, n_heads=2, mask=None):
    n, d_model = x.shape
    d_head = d_model // n_heads
    Q = x @ weights["Wq"]
    K = x @ weights["Wk"]
    V = x @ weights["Wv"]
    # Split channels into heads: (n, d_model) -> (n_heads, n, d_head).
    Q = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    if mask is not None:
        # Masked-out positions get a large negative score, so softmax gives them ~0 weight.
        scores = np.where(mask[None, :, :], scores, -1e9)
    attn = softmax(scores, axis=-1)
    context = attn @ V
    # Concatenate heads back to (n, d_model) and mix them with the output projection.
    context = context.transpose(1, 0, 2).reshape(n, d_model)
    return context @ weights["Wo"], attn


def ffn(x, weights):
    # Token-wise two-layer MLP with ReLU.
    hidden = np.maximum(0, x @ weights["W1"] + weights["b1"])
    return hidden @ weights["W2"] + weights["b2"]


# Toy sizes: 5 tokens, model width 8, FFN width 24, vocabulary of 32 ids.
rng = np.random.default_rng(11)
n, d_model, d_ff, vocab = 5, 8, 24, 32
tokens = rng.integers(0, vocab, size=n)
embedding = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)
position = rng.normal(size=(n, d_model)) / np.sqrt(d_model)
# H_0 = E[token] + P[position]
H = embedding[tokens] + position
weights = {
    "Wq": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wk": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wv": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "Wo": rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
    "W1": rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff),
    "b2": np.zeros(d_model),
    "W_vocab": rng.normal(size=(d_model, vocab)) / np.sqrt(d_model),
}
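# Causal mask: position t may attend only to positions <= t.
causal_mask = np.tril(np.ones((n, n), dtype=bool))
# Pre-norm attention sublayer plus residual: U = H + MHA(LN(H)).
A, attention_weights = mha(layer_norm(H), weights, n_heads=2, mask=causal_mask)
U = H + A
# Pre-norm FFN sublayer plus residual: H_next = U + FFN(LN(U)).
H_next = U + ffn(layer_norm(U), weights)
# Project the last position's hidden vector to vocabulary logits.
logits = H_next[-1] @ weights["W_vocab"]
probs = softmax(logits)
print("hidden shape:", H_next.shape)
print("attention shape:", attention_weights.shape)
print("top next-token ids:", probs.argsort()[-5:][::-1])
04 / Case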
Case: modeling research text and regression tables with Transformers
- In StatsPAI-style work, a research question, data dictionary, regression table, footnotes, and reviewer comments are tokenized. Each token enters the residual stream through embedding plus position.
- Attention lets “treatment effect” attend to table headers, estimands, standard errors, sample restrictions, and identification-design paragraphs; FFN layers re-encode each local token representation nonlinearly.
- After many stacked layers, the model can generate explanations that combine definitions, table structure, and method context, but this is representational power rather than proof of causal validity.
- In an agent system, Transformers are good at reading and writing research material; reliable workflows still require tool calls, rerunnable code, data provenance, and human review checkpoints.
05 / Risks
Common Pitfalls
References
- Vaswani et al. (2017), Attention Is All You Need. https://arxiv.org/abs/1706.03762
- NeurIPS Proceedings: Attention Is All You Need. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
- Radford et al. (2019), Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf