Neural Network Training: Backpropagation and Gradient Descent

Decompose training into forward pass, loss, chain rule, and parameter updates.

Mechanism Lab

Animation: error signals flowing backward

The animation shows forward activations first, then the error signal flowing from the loss back through each layer.

01 / Intuition

Core Intuition

A network is a composite function.

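Backpropagation is an efficient application of the chain rule: it reuses intermediate derivatives instead of recomputing them for each parameter.

Gradient descent moves parameters along the negative gradient, the local direction of steepest loss decrease.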

02 / Math

Backpropagation via chain rule

01 / Forward

Each layer applies an affine map followed by a nonlinearity sigma, starting from the input h_0 = x.

h_l = sigma(W_l h_{l-1} + b_l)
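
The layer formula above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the sizes (4 inputs, 3 hidden units, batch of 2) are hypothetical, and ReLU stands in for sigma. Because each row of the batch is one example, the code computes h_prev @ W.T + b, which is the batched form of W_l h_{l-1} + b_l.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 4 input features, 3 hidden units, batch of 2 examples.
W = torch.randn(3, 4)        # W_l
b = torch.randn(3)           # b_l
h_prev = torch.randn(2, 4)   # h_{l-1}, one row per example

# Affine map followed by a nonlinearity (ReLU is an assumed choice of sigma).
z = h_prev @ W.T + b
h = torch.relu(z)

# The same layer expressed with nn.Linear, reusing W and b.
layer = torch.nn.Linear(4, 3)
with torch.no_grad():
    layer.weight.copy_(W)
    layer.bias.copy_(b)
h_module = torch.relu(layer(h_prev))
```

Both h and h_module hold the same activations, which shows that nn.Linear is the affine map in the formula with W_l and b_l stored as parameters.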

02 / Loss

Training minimizes the average per-example loss ell between predictions f_theta(x_i) and labels y_i over the n training examples.

L(theta) = (1/n) sum_i ell(y_i, f_theta(x_i))
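
As a small worked instance of the formula above, the sketch below takes squared error as ell (an assumed choice) and averages it over n = 4 made-up predictions and labels. It then compares the result with nn.MSELoss, whose default reduction is the mean.

```python
import torch
from torch import nn

# Made-up predictions f_theta(x_i) and labels y_i for n = 4 examples.
pred = torch.tensor([2.0, 0.0, 1.0, 3.0])
y = torch.tensor([1.0, 0.0, 2.0, 1.0])

# Per-example squared error ell(y_i, f_theta(x_i)).
per_example = (pred - y) ** 2

# Empirical loss: the mean over the n examples.
loss_manual = per_example.sum() / len(y)

# nn.MSELoss with the default reduction="mean" computes the same average.
loss_builtin = nn.MSELoss()(pred, y)
```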

03 / Backward

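The chain rule propagates output error to each parameter through the pre-activation z_l = W_l h_{l-1} + b_l, with h_l = sigma(z_l). The factor dL/dh_l is itself obtained from layer l+1, so gradients are computed from the output backward and reused.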

dL/dW_l = (dL/dh_l)(dh_l/dz_l)(dz_l/dW_l)
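
One way to check the three factors is to compute them by hand for a single layer and compare against autograd. The sketch below assumes a sigmoid nonlinearity and a squared-error loss L = 0.5 * ||h - y||^2 on one example; the sizes and the name delta are illustrative choices, not part of the lesson.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4)                          # h_{l-1}
y = torch.randn(3)                          # target
W = torch.randn(3, 4, requires_grad=True)   # W_l
b = torch.randn(3, requires_grad=True)      # b_l

# Forward pass and loss; autograd records the graph.
z = W @ x + b
h = torch.sigmoid(z)
loss = 0.5 * ((h - y) ** 2).sum()
loss.backward()

# The same gradient assembled factor by factor.
with torch.no_grad():
    dL_dh = h - y                  # dL/dh_l for the squared-error loss
    dh_dz = h * (1 - h)            # dh_l/dz_l for the sigmoid
    delta = dL_dh * dh_dz          # error signal at the pre-activation
    grad_W_manual = torch.outer(delta, x)   # dz_l/dW_l contributes h_{l-1}
    grad_b_manual = delta                   # dz_l/db_l is the identity
```

W.grad and b.grad from autograd match grad_W_manual and grad_b_manual up to floating-point error. In a deeper network, delta would also be passed to the previous layer, which is the reuse that makes backpropagation efficient.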

04 / Update

Gradient descent updates parameters with learning rate eta.

theta <- theta - eta grad_theta L
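
The update rule can be applied by hand and compared with torch.optim.SGD, which implements the same step when momentum and weight decay are left at their defaults. The loss L(theta) = sum(theta^2) and the value eta = 0.1 are assumptions chosen so the gradient, 2 * theta, is easy to verify.

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)
eta = 0.1

# Toy loss L(theta) = sum(theta^2); its gradient is 2 * theta.
loss = (theta ** 2).sum()
loss.backward()

# Manual update: theta - eta * grad, computed without tracking gradients.
with torch.no_grad():
    theta_manual = theta - eta * theta.grad

# The same update performed in place by plain SGD.
optimizer = torch.optim.SGD([theta], lr=eta)
optimizer.step()
```

After the step, theta equals theta_manual. Adaptive optimizers such as AdamW rescale the gradient before applying it, so they follow the same pattern but not this exact formula.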

03 / Code

PyTorch training loop

The four steps are forward, loss, backward, and step.

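import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data so the loop runs: 256 random examples with 10 features.
# Replace with a real dataset.
X = torch.randn(256, 10)
y = torch.randn(256)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for xb, yb in loader:
    pred = model(xb)                       # forward
    loss = loss_fn(pred.squeeze(-1), yb)   # loss; squeeze(-1) keeps the batch dimension
    optimizer.zero_grad()                  # clear gradients from the previous batch
    loss.backward()                        # backward
    optimizer.step()                       # update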

04 / Case

Case: nonlinear student dropout risk

Absence, grades, and advisor feedback may interact nonlinearly.
An MLP can learn the joint risk of high absence and declining grades.
Validation and calibration matter more than training loss alone.
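
A calibration check can be sketched without any model code: group validation predictions into probability bins and compare the mean predicted risk with the observed dropout rate in each bin. The function name calibration_table, the bin count, and the ten validation values below are all made up for illustration; they are not results from a real model or dataset.

```python
def calibration_table(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted risk with the observed positive rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Made-up validation predictions and outcomes (1 = dropped out).
val_probs = [0.05, 0.10, 0.15, 0.45, 0.55, 0.50, 0.90, 0.85, 0.95, 0.82]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

table = calibration_table(val_probs, val_labels)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

In a well-calibrated model the predicted and observed columns stay close in every bin. A low training loss does not guarantee this, which is why the check is run on held-out data.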

05 / Risks

Common Pitfalls

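A learning rate that is too large can make the loss diverge instead of decrease.

Forgetting optimizer.zero_grad() lets gradients accumulate across batches, so each step uses a sum of stale gradients.

Watching only the training loss, without validation metrics and calibration checks, hides overfitting and miscalibrated predictions.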
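
The first two pitfalls are easy to reproduce on toy problems. The sketch below is illustrative only: run_gd is a hypothetical helper, and the loss L(theta) = theta^2 and the learning rates 0.1 and 1.1 are chosen so the behavior can be verified by hand.

```python
import torch


# Pitfall 1: learning rate too large.
# Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta.
def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta


small = run_gd(eta=0.1)   # each step multiplies theta by 0.8, so it shrinks
large = run_gd(eta=1.1)   # each step multiplies theta by -1.2, so it grows

# Pitfall 2: forgetting to zero gradients.
w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()         # gradient of 3 * w with respect to w

(3 * w).backward()            # no zeroing in between
accumulated = w.grad.item()   # the second gradient was added to the first

w.grad.zero_()                # the role optimizer.zero_grad() plays in a loop
(3 * w).backward()
after_reset = w.grad.item()   # back to the gradient of a single pass
```

With eta = 0.1 the parameter moves toward the minimum at zero, while with eta = 1.1 every step overshoots by more than it corrects. In the second part, the accumulated gradient is twice the true value until it is reset.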