Neural Network Training: Backpropagation and Gradient Descent

Decompose training into forward pass, loss, chain rule, and parameter updates.

Mechanism Lab

Animation: error signals flowing backward

The animation shows forward activations first, then error flowing from loss back to each layer.

Step 1 of 5, Input: features x enter the first layer.

01 / Intuition

Core Intuition

A network is a composite function.

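Backpropagation is an efficient application of the chain rule: it reuses each layer's intermediate derivatives instead of recomputing them for every parameter.

Gradient descent moves the parameters in the local direction of fastest loss decrease, the negative gradient.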

02 / Math

Backpropagation via chain rule

01 / Forward

Each layer applies an affine map followed by a nonlinearity.

h_l = sigma(W_l h_{l-1} + b_l)
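
As a minimal sketch of the forward step, the snippet below applies one layer h = sigma(W h_prev + b) in plain Python. It assumes sigma is ReLU (the nonlinearity used in the code section) and uses made-up weights; the helper names affine, relu, and layer_forward are hypothetical.

```python
def affine(W, h_prev, b):
    """Compute z = W h_prev + b, with W given as a list of rows."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h_prev)) + b_i
            for row, b_i in zip(W, b)]

def relu(z):
    """Elementwise nonlinearity sigma; ReLU is an assumption here."""
    return [max(0.0, z_i) for z_i in z]

def layer_forward(W, h_prev, b):
    """One layer: affine map followed by the nonlinearity."""
    return relu(affine(W, h_prev, b))

# Illustrative values only (2 inputs -> 2 units).
W = [[1.0, -2.0],
     [0.5, 0.5]]
b = [0.0, -3.0]
x = [3.0, 1.0]

h1 = layer_forward(W, x, b)  # second unit has z < 0, so ReLU clips it to 0
```

Stacking layers means feeding h1 into the next call of layer_forward, which is what makes the network a composite function.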

02 / Loss

Training minimizes the average per-example loss ell between predictions f_theta(x_i) and labels y_i over n examples.

L(theta) = (1/n) sum_i ell(y_i, f_theta(x_i))
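
The averaged loss can be sketched in a few lines. The example below assumes squared error as the per-example loss ell (matching the MSE loss in the code section) and uses illustrative labels and predictions; squared_error and mean_loss are hypothetical names.

```python
def squared_error(y, y_hat):
    """Per-example loss ell(y, y_hat); squared error is an assumed choice."""
    return (y - y_hat) ** 2

def mean_loss(ys, preds, ell=squared_error):
    """L(theta) = (1/n) * sum_i ell(y_i, f_theta(x_i)), given the predictions."""
    n = len(ys)
    return sum(ell(y, p) for y, p in zip(ys, preds)) / n

ys = [1.0, 0.0, 2.0]       # labels y_i (illustrative)
preds = [0.5, 0.0, 3.0]    # model outputs f_theta(x_i) (illustrative)
loss = mean_loss(ys, preds)
```

Passing a different ell (for example a cross-entropy function) changes the objective without changing the averaging.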

03 / Backward

The chain rule propagates the output error to each parameter. With the pre-activation z_l = W_l h_{l-1} + b_l, so that h_l = sigma(z_l):

dL/dW_l = (dL/dh_l)(dh_l/dz_l)(dz_l/dW_l)
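
To make the three factors concrete, here is a small sketch for a single ReLU unit with a squared-error loss, checked against finite differences. The unit, the loss, and all values are assumptions chosen for illustration; z denotes the pre-activation w . x + b.

```python
def forward(w, b, x):
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b   # pre-activation
    h = max(0.0, z)                                     # ReLU activation
    return z, h

def loss_fn(h, y):
    return (h - y) ** 2

def backward(w, b, x, y):
    """Chain rule: dL/dw_j = (dL/dh) * (dh/dz) * (dz/dw_j)."""
    z, h = forward(w, b, x)
    dL_dh = 2.0 * (h - y)
    dh_dz = 1.0 if z > 0 else 0.0
    return [dL_dh * dh_dz * x_j for x_j in x]           # dz/dw_j = x_j

def numeric_grad(w, b, x, y, eps=1e-6):
    """Central finite differences, used only to sanity-check backward()."""
    grads = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + eps] + w[j + 1:]
        w_minus = w[:j] + [w[j] - eps] + w[j + 1:]
        l_plus = loss_fn(forward(w_plus, b, x)[1], y)
        l_minus = loss_fn(forward(w_minus, b, x)[1], y)
        grads.append((l_plus - l_minus) / (2 * eps))
    return grads

w, b = [0.5, -1.0], 0.25
x, y = [2.0, 0.5], 1.5
analytic = backward(w, b, x, y)
numeric = numeric_grad(w, b, x, y)
```

The analytic and numeric gradients should agree to within rounding error; a gradient check of this kind is a common way to debug a hand-written backward pass.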

04 / Update

Gradient descent updates parameters with learning rate eta.

theta <- theta - eta grad_theta L
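
The update rule can be tried on a toy problem. The sketch below assumes the one-parameter loss L(theta) = theta**2, whose gradient is 2 * theta, and compares two learning rates; the function names and values are illustrative.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2."""
    return 2.0 * theta

def gradient_descent(theta, eta, steps):
    """Repeat theta <- theta - eta * grad_theta L."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

small = gradient_descent(theta=1.0, eta=0.1, steps=50)   # shrinks toward 0
large = gradient_descent(theta=1.0, eta=1.1, steps=50)   # grows in magnitude
```

With eta = 0.1 each step multiplies theta by 0.8, so it converges to the minimum at 0. With eta = 1.1 each step multiplies theta by -1.2, so the iterates overshoot and diverge.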

03 / Code

PyTorch training loop

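Each iteration runs four steps: forward pass, loss, backward pass, and optimizer step. Gradients are cleared with zero_grad before the backward pass. The loop uses AdamW, an adaptive optimizer, in place of the plain gradient descent update.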

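import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumption: synthetic stand-in data so the loop runs; swap in a real dataset.
X, y = torch.randn(256, 10), torch.randn(256)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for xb, yb in loader:
    pred = model(xb)                       # forward: shape (batch, 1)
    loss = loss_fn(pred.squeeze(-1), yb)   # loss: squeeze(-1) keeps the batch dimension
    optimizer.zero_grad()                  # clear gradients left from the previous batch
    loss.backward()                        # backward: compute gradients
    optimizer.step()                       # step: update parameters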

04 / Case

Case: nonlinear student dropout risk

  • Absence, grades, and advisor feedback may interact nonlinearly.
  • An MLP can learn the joint risk of high absence and declining grades.
  • Validation and calibration matter more than training loss alone.
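
One possible way to look at calibration on a validation set is the Brier score together with a binned reliability table. The sketch below is an illustration under stated assumptions, not the method of this case: the predictions and outcomes are hypothetical, and brier_score and reliability_table are made-up helper names.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted risk and the 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def reliability_table(probs, labels, n_bins=2):
    """Per non-empty bin: (mean predicted risk, observed dropout rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Hypothetical validation-set predictions and outcomes (1 = dropped out).
val_probs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
val_labels = [0, 0, 1, 1, 1, 0]

score = brier_score(val_probs, val_labels)
table = reliability_table(val_probs, val_labels)
```

In a well-calibrated model the mean predicted risk and the observed rate in each bin are close. A model can reach a low training loss and still show a large gap here.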

05 / Risks

Common Pitfalls

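A learning rate that is too large can make the loss diverge.
Forgetting optimizer.zero_grad() lets gradients accumulate across batches.
Watching training loss alone, without validation metrics and calibration, can hide overfitting and miscalibrated predictions.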
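
The gradient accumulation pitfall is easy to reproduce in isolation. The sketch below uses a single scalar parameter instead of a full model, which is an assumption made for brevity; it relies on the fact that PyTorch adds each backward pass into the existing .grad tensor.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3.
(3.0 * w).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added to the old one.
(3.0 * w).backward()
accumulated = w.grad.item()

# Clearing the gradient first (optimizer.zero_grad() does this for all of its
# parameters) restores the expected value.
w.grad.zero_()
(3.0 * w).backward()
after_reset = w.grad.item()
```

Here accumulated is twice first, so an optimizer step taken at that point would use a gradient that mixes two batches.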
