CNNs: Local Receptive Fields, Weight Sharing, and Feature Hierarchies
Slide a shared kernel over local windows to turn raw grid signals into edges, textures, shapes, and task features.
Mechanism Lab
Animation: how a kernel turns local windows into feature maps
The animation shows a shared convolution kernel sliding over an image grid, multiplying local windows by weights, then passing through ReLU, pooling, and a task readout.
Step 1 / 5: Patch. A convolution reads a local window R_{i,j} rather than connecting to the whole image.
01 / Intuition
Core Intuition
The core inductive biases of CNNs are locality and translation equivariance: nearby pixels, table neighborhoods, or time windows often define local patterns, and a pattern means the same thing wherever it appears on the grid.
The same kernel is shared across spatial positions, so the model does not learn a different detector for every location.
Early convolutions often detect edges and local textures; deeper convolutions compose them into more abstract shapes, structures, or local economic signals.
Pooling or strided convolution lowers resolution, so each later unit covers a larger region of the input, but fine spatial detail is lost; modern networks balance downsampling, residual paths, and normalization.
02 / Math
Deriving a CNN layer from discrete convolution
01 / Local window
For 2D input X, each output location looks only at a k_h by k_w local window rather than the whole image.
R_{i,j} = X[i:i+k_h, j:j+k_w]
02 / Weight sharing
The same weights W are reused at every spatial position. With C_in input channels and C_out output channels, parameter count is independent of image size.
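As a rough check of the size-independence claim, the sketch below evaluates the count k_h k_w C_in C_out + C_out for one convolution layer and compares it with a fully connected layer that maps the same input to an output map of the same resolution. The function names `conv_params` and `dense_params`, the dense baseline, and the example sizes are illustrative assumptions, not part of the original demo.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights plus one bias per output channel; image size never appears."""
    return k_h * k_w * c_in * c_out + c_out


def dense_params(height, width, c_in, c_out):
    """Hypothetical baseline: every input value connects to every output value
    of a same-resolution output map, plus one bias per output value."""
    n_in = height * width * c_in
    n_out = height * width * c_out
    return n_in * n_out + n_out


# The convolution count stays fixed while the dense count grows with image size.
for side in (32, 224):
    print(side, conv_params(3, 3, 3, 16), dense_params(side, side, 3, 16))
```

With a 3x3 kernel, 3 input channels, and 16 output channels, the convolution layer has 448 parameters at either image size, while the dense baseline grows with the fourth power of the side length.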
#params = k_h k_w C_in C_out + C_out
03 / Convolution output
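Output channel k at location (i, j) multiplies the local window elementwise by kernel k, sums over the window offsets (u, v) and the input channels c, and adds a bias b_k. Strictly, this is cross-correlation (the kernel is not flipped), which is the operation deep-learning frameworks call convolution.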
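One compact way to see the sum over window offsets and input channels is to materialize every local window and contract it against the kernel in a single step. This is a minimal sketch, assuming NumPy 1.20 or later for `sliding_window_view`; the name `conv2d_einsum` and the random test shapes are illustrative choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_einsum(x, kernels, bias):
    # x: [H, W, C_in]; kernels: [k_h, k_w, C_in, C_out]; bias: [C_out]
    k_h, k_w = kernels.shape[:2]
    # windows: [H - k_h + 1, W - k_w + 1, C_in, k_h, k_w],
    # with windows[i, j, c, u, v] == x[i + u, j + v, c]
    windows = sliding_window_view(x, (k_h, k_w), axis=(0, 1))
    # Sum over offsets (u, v) and input channel c for every (i, j, k).
    return np.einsum("ijcuv,uvck->ijk", windows, kernels) + bias


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 7, 2))
kernels = rng.normal(size=(3, 3, 2, 4))
bias = rng.normal(size=4)
y = conv2d_einsum(x, kernels, bias)

# Spot-check one output entry against the explicit triple sum.
i, j, k = 2, 3, 1
manual = sum(
    kernels[u, v, c, k] * x[i + u, j + v, c]
    for u in range(3) for v in range(3) for c in range(2)
) + bias[k]
print(y.shape, np.isclose(y[i, j, k], manual))
```

The output has shape (4, 5, 4): a valid convolution shrinks each spatial side by the kernel size minus one, and the last axis indexes output channels.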
Y[i,j,k] = sum_{u,v,c} W[u,v,c,k] X[i+u, j+v, c] + b_k
04 / Translation equivariance
Ignoring boundary effects, translating the input and then convolving is equivalent to convolving first and translating the output.
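The property can be checked numerically. The sketch below assumes periodic (wrap-around) boundaries so that "ignoring boundary effects" holds exactly, and uses a single channel for brevity; `circular_corr2d` and the shift `delta` are hypothetical names chosen for this illustration.

```python
import numpy as np


def circular_corr2d(x, w):
    # Y[i, j] = sum_{u, v} w[u, v] * x[(i + u) % H, (j + v) % W]
    # np.roll with a negative shift brings x[i + u, j + v] to position (i, j).
    out = np.zeros_like(x, dtype=float)
    for u in range(w.shape[0]):
        for v in range(w.shape[1]):
            out += w[u, v] * np.roll(x, shift=(-u, -v), axis=(0, 1))
    return out


rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8))
w = rng.normal(size=(3, 3))
delta = (2, 3)  # translation T_delta, applied as a circular shift

shift_then_conv = circular_corr2d(np.roll(x, delta, axis=(0, 1)), w)
conv_then_shift = np.roll(circular_corr2d(x, w), delta, axis=(0, 1))
print(np.allclose(shift_then_conv, conv_then_shift))
```

With zero padding or a valid convolution instead of wrap-around, the two results would agree only away from the image border, which is the boundary effect the statement sets aside.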
Conv(T_delta X) = T_delta Conv(X)
05 / Nonlinearity and pooling
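A nonlinearity phi such as ReLU keeps stacked convolutions from collapsing into a single linear filter, so later layers can combine local detectors; pooling or stride then compresses local activations into a coarser representation.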
Z = pool(phi(Y))
06 / Effective receptive field
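Stacking L layers of 3x3 stride-1 convolutions grows the theoretical receptive field by 2 pixels per layer, starting from a single pixel. The effective receptive field, meaning the region that actually dominates a unit's response, is typically smaller than this theoretical bound.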
RF_L = 1 + 2L
03 / Code
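Before the full demo, a short sketch can confirm the receptive-field formula RF_L = 1 + 2L and show how a downsampling step changes it. It uses the standard recurrence for the theoretical receptive field and ignores padding and dilation; the function name `receptive_field` and the (kernel_size, stride) list format are assumptions made for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered input to output."""
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# L stacked 3x3 stride-1 convolutions should match RF_L = 1 + 2L.
for L in range(1, 6):
    assert receptive_field([(3, 1)] * L) == 1 + 2 * L

# A 2x2 stride-2 pooling step between two 3x3 convolutions widens the view.
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

After the stride-2 step, each extra kernel tap spans two input pixels, which is why downsampling makes the receptive field grow faster in later layers.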
NumPy demo: 2D multi-channel convolution from scratch
This framework-free example explicitly shows local windows, shared weights, ReLU, and max pooling.
```python
import numpy as np


def conv2d_valid(x, kernels, bias):
    """Valid (no padding), stride-1 convolution written with explicit loops."""
    # x: [height, width, in_channels]
    # kernels: [kernel_h, kernel_w, in_channels, out_channels]
    h, w, c = x.shape
    kh, kw, kc, out_channels = kernels.shape
    assert c == kc, "input channels must match kernel channels"
    out = np.zeros((h - kh + 1, w - kw + 1, out_channels))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Local window: the only part of x this output location sees.
            patch = x[i:i + kh, j:j + kw, :]
            for k in range(out_channels):
                # Shared weights: kernel k is reused at every (i, j).
                out[i, j, k] = np.sum(patch * kernels[..., k]) + bias[k]
    return out


def max_pool2d(x, size=2, stride=2):
    """Channel-wise max pooling over size x size windows."""
    h, w, channels = x.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1, channels))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = patch.max(axis=(0, 1))
    return out


rng = np.random.default_rng(5)
image = rng.normal(size=(8, 8, 1))

# A vertical-edge detector and a horizontal-edge detector.
kernels = np.zeros((3, 3, 1, 2))
kernels[:, :, 0, 0] = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
kernels[:, :, 0, 1] = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
bias = np.zeros(2)

features = conv2d_valid(image, kernels, bias)        # convolution
activated = np.maximum(features, 0.0)                # ReLU
pooled = max_pool2d(activated, size=2, stride=2)     # 2x2 max pooling

print("feature map:", features.shape)
print("pooled map:", pooled.shape)
print("strongest vertical edge:", activated[..., 0].max())
```
04 / Case
Case: turning paper figures, satellite images, and gridded economic data into local features
- In research workflows, CNNs are not only for cat and dog photos. They can process paper-figure screenshots, night-light satellite imagery, land-use grids, microscopy images, traffic heat maps, or local time-frequency images.
- For night-light prediction of economic activity, early convolutions can detect brightness boundaries, road textures, and urban patches; deeper features combine local patterns into regional activity intensity.
- For paper-figure understanding, convolution layers can identify axes, points, error bars, and table borders before later models connect those visual signals to OCR, table parsing, or Transformer representations.
- But CNN-discovered visual patterns are not causal explanations. Policy evaluation still needs treatment/control definition, timing, identification assumptions, and robustness checks.
05 / Risks
Common Pitfalls