Neural Networks
Neural Network Training: Backpropagation and Gradient Descent
Decompose training into forward pass, loss, chain rule, and parameter updates.
Mechanism Lab
Animation: error signals flowing backward
The animation shows forward activations first, then error flowing from loss back to each layer.
01 / Intuition
Core Intuition
A network is a composite function.
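Backpropagation is an efficient application of the chain rule: it reuses intermediate gradients so that each layer's derivative is computed only once.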
Gradient descent moves parameters a small step along the negative gradient, the local direction of fastest loss decrease.
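The three ideas above can be traced by hand on a tiny example. The sketch below is illustrative only: the two-weight scalar network, the tanh and sigmoid activations, the squared-error loss, and all numeric values are assumptions chosen to keep the arithmetic short, not part of the lesson's model. It computes gradients with the chain rule, checks them against a finite-difference estimate, and takes one gradient descent step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grads(w1, w2, x, y):
    """Scalar two-layer network: h = tanh(w1 * x), p = sigmoid(w2 * h).

    Returns the squared-error loss and its gradients w.r.t. w1 and w2.
    """
    # Forward pass: the network is a composition of simple functions.
    h = math.tanh(w1 * x)
    p = sigmoid(w2 * h)
    loss = (p - y) ** 2

    # Backward pass: apply the chain rule from the loss toward the input.
    dL_dp = 2.0 * (p - y)
    dL_dz2 = dL_dp * p * (1.0 - p)    # sigmoid'(z2) = p * (1 - p)
    dL_dw2 = dL_dz2 * h
    dL_dh = dL_dz2 * w2               # reuses dL_dz2 instead of recomputing it
    dL_dz1 = dL_dh * (1.0 - h * h)    # tanh'(z1) = 1 - h^2
    dL_dw1 = dL_dz1 * x
    return loss, dL_dw1, dL_dw2


# Hypothetical starting point, single training example, and learning rate.
w1, w2, x, y, eta = 0.5, -0.3, 1.2, 1.0, 0.1

loss_before, g1, g2 = loss_and_grads(w1, w2, x, y)

# Central finite differences as an independent check on the chain rule.
eps = 1e-6
num_g1 = (loss_and_grads(w1 + eps, w2, x, y)[0]
          - loss_and_grads(w1 - eps, w2, x, y)[0]) / (2 * eps)
num_g2 = (loss_and_grads(w1, w2 + eps, x, y)[0]
          - loss_and_grads(w1, w2 - eps, x, y)[0]) / (2 * eps)

# One gradient descent step: move against the gradient.
w1_new, w2_new = w1 - eta * g1, w2 - eta * g2
loss_after = loss_and_grads(w1_new, w2_new, x, y)[0]

print("chain rule:", g1, g2)
print("finite diff:", num_g1, num_g2)
print("loss before/after:", loss_before, loss_after)
```

The reuse of `dL_dz2` when computing `dL_dh` is the efficiency the intuition refers to: gradients for earlier layers are built from quantities already computed for later ones.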
02 / Math
Backpropagation via chain rule
01 / Forward
Each layer applies an affine map followed by a nonlinearity.
h_l = sigma(W_l h_{l-1} + b_l)
02 / Loss
Training minimizes loss between predictions and labels.
L(theta) = (1/n) sum_i ell(y_i, f_theta(x_i))
03 / Backward
The chain rule propagates the output error to each parameter. Here z_l = W_l h_{l-1} + b_l is the pre-activation of layer l, so h_l = sigma(z_l).
dL/dW_l = (dL/dh_l)(dh_l/dz_l)(dz_l/dW_l)
04 / Update
Gradient descent updates parameters with learning rate eta.
theta <- theta - eta grad_theta L
03 / Code
PyTorch training loop
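The four steps are forward, loss, backward, and step. The optimizer here is AdamW, which rescales each update adaptively instead of applying the plain gradient descent rule theta <- theta - eta grad_theta L.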
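import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder random data so the loop runs; substitute a real dataset.
X, y = torch.randn(256, 10), torch.randn(256)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for xb, yb in loader:
    pred = model(xb)                      # forward: shape (batch, 1)
    # squeeze(-1) drops only the last dimension, so a batch of size 1 keeps its shape
    loss = loss_fn(pred.squeeze(-1), yb)  # loss
    optimizer.zero_grad()                 # clear gradients left from the previous batch
    loss.backward()                       # backward: fill .grad on every parameter
    optimizer.step()                      # step: update the parameters
04 / Case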
Case: nonlinear student dropout risk
- Absence, grades, and advisor feedback may interact nonlinearly.
- An MLP can learn the joint risk of high absence and declining grades.
- Validation and calibration matter more than training loss alone.
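One way to make the calibration point concrete is to compare predicted risk with the observed dropout rate on held-out students. The sketch below uses expected calibration error (ECE), a common calibration metric that the lesson itself does not name. The function name, bin count, and the two sets of validation predictions are all hypothetical and serve only to show the computation.

```python
def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted average gap between mean predicted risk and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - observed)
    return ece


# Hypothetical validation labels (1 = dropped out) and two hypothetical models.
val_labels = [1, 0, 1, 0]
overconfident_probs = [0.9, 0.9, 0.9, 0.9]   # predicts 90% risk for everyone
hedged_probs = [0.5, 0.5, 0.5, 0.5]          # predicts 50% risk for everyone

ece_overconfident = expected_calibration_error(overconfident_probs, val_labels)
ece_hedged = expected_calibration_error(hedged_probs, val_labels)
print("ECE overconfident:", ece_overconfident)
print("ECE hedged:", ece_hedged)
```

A low ECE alone does not make a model useful (the hedged model above cannot rank students), so it is best read alongside a validation loss or ranking metric rather than in place of one.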
05 / Risks