RNNs, LSTMs, and GRUs: Hidden States, Gated Memory, and Sequence Modeling
These models read a sequence step by step into a hidden state: vanilla RNNs apply one plain recurrence, while LSTMs and GRUs use gates to control what is written, forgotten, and output.
Mechanism Lab
Animation: hidden state flows through time and gates control memory
The animation starts with a vanilla recurrent chain, shows long-chain gradient risk, opens LSTM forget/input/output gates and GRU update/reset gates, then connects to a prediction head.
Step 1 of 5, Recurrence: each step reads x_t and h_{t-1}, then writes the next hidden state, h_t = f(x_t, h_{t-1}).
01 / Intuition
Core Intuition
The core idea of an RNN is to compress history into hidden state: current input x_t and previous state h_{t-1} jointly determine h_t.
Vanilla RNNs share parameters across time and handle variable-length sequences, but long dependencies suffer from vanishing or exploding gradients.
LSTMs add a cell state plus forget, input, and output gates so the model can retain, write, or expose long-term memory.
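GRUs merge the LSTM's cell state and hidden state into a single state and use only an update gate and a reset gate. At the same hidden size this leaves three weight blocks instead of four, reducing parameters while remaining stable on many medium-scale sequence tasks.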
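To make the parameter reduction concrete, the sketch below counts weights and biases per cell. It assumes the single-bias layout used in this page's equations (one input matrix, one recurrent matrix, and one bias vector per gate or candidate block); the helper name `cell_param_count` is illustrative, and libraries that keep two bias vectors per block will report slightly larger totals.

```python
def cell_param_count(d_in, d_h, n_blocks):
    """Count parameters for a recurrent cell with n_blocks gate/candidate blocks.

    Assumed layout per block: input matrix (d_in x d_h), recurrent matrix
    (d_h x d_h), and a single bias vector (d_h).
    """
    return n_blocks * (d_in * d_h + d_h * d_h + d_h)


d_in, d_h = 3, 4
counts = {
    "rnn": cell_param_count(d_in, d_h, 1),   # one tanh block
    "gru": cell_param_count(d_in, d_h, 3),   # update, reset, candidate
    "lstm": cell_param_count(d_in, d_h, 4),  # input, forget, output, candidate
}
print(counts)
```

Under these assumptions a GRU carries three quarters of the parameters of an LSTM with the same input and hidden sizes.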
02 / Math
From recurrent state to gated memory
01 / Vanilla RNN state
A sequence enters the same recurrent cell step by step. W_x, W_h, and b are shared across time, so the model can process variable-length inputs.
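As a small illustration of parameter sharing, the sketch below runs one set of weights over two sequences of different length. The helper `run_rnn` and the random data are assumptions for the example, and the code uses row vectors (`x_t @ Wx`), which is the transpose of the column-vector form in the formula.

```python
import numpy as np


def run_rnn(X, Wx, Wh, b):
    """Apply the same tanh cell to every row of X and return the final state."""
    h = np.zeros(Wh.shape[0])
    for x_t in X:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
    return h


rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wx = rng.normal(size=(d_in, d_h)) / np.sqrt(d_in)
Wh = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)
b = np.zeros(d_h)

short_seq = rng.normal(size=(2, d_in))  # 2 time steps
long_seq = rng.normal(size=(9, d_in))   # 9 time steps

# The same Wx, Wh, b handle both lengths; only the loop count changes.
h_short = run_rnn(short_seq, Wx, Wh, b)
h_long = run_rnn(long_seq, Wx, Wh, b)
print(h_short.shape, h_long.shape)
```

Both calls return a state of size d_h, so downstream layers do not need to know the sequence length.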
h_t = phi(W_x x_t + W_h h_{t-1} + b)
02 / Output layer
Depending on the task, the model can emit y_t at every step or use the final state for a sequence-level prediction.
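The two readout modes can be sketched as follows. The hidden states `H` are random stand-ins rather than outputs of a trained model, and `Wy` is a hypothetical output matrix, so only the shapes and normalization are meaningful here.

```python
import numpy as np


def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(1)
T, d_h, n_classes = 5, 4, 3
H = np.tanh(rng.normal(size=(T, d_h)))  # stand-in for hidden states h_1..h_T
Wy = rng.normal(size=(d_h, n_classes))  # hypothetical output weights

per_step_probs = softmax(H @ Wy)      # one distribution per time step (tagging)
sequence_probs = softmax(H[-1] @ Wy)  # one distribution from the final state
print(per_step_probs.shape, sequence_probs.shape)
```

Per-step outputs suit tagging or next-token prediction, while the final-state output suits a single label for the whole sequence.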
p(y_t | x_{<=t}) = softmax(W_y h_t)
03 / BPTT gradient chain
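Backpropagation through time (BPTT) crosses many time steps, so the gradient contains a product of per-step Jacobians. When those factors consistently shrink vectors (roughly, a spectral radius well below 1) the gradient vanishes; when they consistently stretch vectors (a spectral radius well above 1) it explodes.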
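The effect of the Jacobian product can be checked numerically. The sketch below is a simplified setup, not a training loop: it assumes W_h is a scaled orthogonal matrix so that every singular value equals `scale`, feeds inputs directly into the pre-activation, and starts from a unit-norm stand-in for the final-step gradient. The function name `backprop_norm` is illustrative.

```python
import numpy as np


def backprop_norm(scale, T=30, d_h=8, use_tanh=True, seed=0):
    """Norm of a gradient vector after it is pushed back through T steps."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))  # orthogonal matrix
    Wh = scale * Q                                    # all singular values = scale
    X = rng.normal(size=(T, d_h))

    # Forward pass: keep the pre-activations needed for the backward pass.
    h = np.zeros(d_h)
    pre = []
    for x_t in X:
        a = x_t + h @ Wh
        pre.append(a)
        h = np.tanh(a) if use_tanh else a

    # Backward pass: multiply by diag(phi'(a_s)) and W_h^T at every step.
    g = np.ones(d_h) / np.sqrt(d_h)  # unit-norm stand-in for dL/dh_T
    for a in reversed(pre):
        local = 1 - np.tanh(a) ** 2 if use_tanh else 1.0
        g = (g * local) @ Wh.T
    return np.linalg.norm(g)


print("linear, scale 0.5:", backprop_norm(0.5, use_tanh=False))
print("linear, scale 1.5:", backprop_norm(1.5, use_tanh=False))
print("tanh,   scale 0.5:", backprop_norm(0.5, use_tanh=True))
```

In the linear case the norm is exactly scale^T, so it collapses for 0.5 and blows up for 1.5. With tanh, each derivative factor is at most 1, so the norm can only be smaller than the linear value.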
dL/dh_t includes prod_{s>t} W_h^T diag(phi'(a_s)), with a_s = W_x x_s + W_h h_{s-1} + b
04 / LSTM gates
The forget gate keeps old memory, the input gate writes candidate memory, and the output gate exposes hidden state.
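To see how the forget and input gates shape the cell state, the sketch below iterates the cell update with fixed gate values. Fixing the gates is an assumption made for illustration; in an actual LSTM they are recomputed from x_t and h_{t-1} at every step. The helper name `carry_cell` is hypothetical.

```python
import numpy as np


def carry_cell(c0, f, i, g, steps):
    """Iterate c_t = f * c_{t-1} + i * g with constant gate values."""
    c = np.array(c0, dtype=float)
    for _ in range(steps):
        c = f * c + i * g
    return c


c0 = np.array([1.0, -0.5])
kept = carry_cell(c0, f=1.0, i=0.0, g=0.3, steps=100)     # keep everything, write nothing
leaky = carry_cell(c0, f=0.9, i=0.0, g=0.3, steps=100)    # forget 10% per step
rewritten = carry_cell(c0, f=0.0, i=1.0, g=0.3, steps=1)  # drop old memory, write candidate
print(kept, leaky, rewritten)
```

With f = 1 and i = 0 the cell state survives 100 steps unchanged, which is the additive path that lets gradients flow further than in a vanilla RNN; with f = 0.9 the same memory decays by a factor of 0.9^100.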
c_t = f_t * c_{t-1} + i_t * g_t, h_t = o_t * tanh(c_t)
05 / GRU update
The update gate interpolates between old state and candidate state; the reset gate controls whether the candidate reads old memory.
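The two gate roles can be isolated in a few lines. This sketch follows the page's convention, in which z_t = 1 selects the candidate (some libraries swap the roles of z_t and 1 - z_t). The gate values are set by hand rather than computed, and the helper names and random weights are assumptions for the example.

```python
import numpy as np


def gru_blend(h_prev, h_tilde, z):
    """Page convention: z = 0 keeps the old state, z = 1 takes the candidate."""
    return (1 - z) * h_prev + z * h_tilde


def gru_candidate(x_t, h_prev, r, Wxh, Whh, bh):
    """Candidate state; r scales how much of h_prev the candidate can read."""
    return np.tanh(x_t @ Wxh + (r * h_prev) @ Whh + bh)


h_prev = np.array([0.8, -0.2])
h_tilde = np.array([-0.4, 0.6])
kept = gru_blend(h_prev, h_tilde, z=0.0)
replaced = gru_blend(h_prev, h_tilde, z=1.0)
halfway = gru_blend(h_prev, h_tilde, z=0.5)

rng = np.random.default_rng(3)
x_t = rng.normal(size=3)
Wxh = rng.normal(size=(3, 2))
Whh = rng.normal(size=(2, 2))
bh = np.zeros(2)

# With r = 0 the candidate depends on x_t only, whatever the old state was.
cand_a = gru_candidate(x_t, h_prev, 0.0, Wxh, Whh, bh)
cand_b = gru_candidate(x_t, -h_prev, 0.0, Wxh, Whh, bh)
print(kept, replaced, halfway, np.allclose(cand_a, cand_b))
```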
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
06 / Sequence representation
Classification, forecasting, or event detection can use the last state, pooled states, or attention-weighted state summaries.
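The summary options listed above can be compared side by side. In this sketch `H` is a random stand-in for the hidden states and `q` is a hypothetical learned query vector used to score each step; neither comes from a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_h = 6, 4
H = np.tanh(rng.normal(size=(T, d_h)))  # stand-in hidden states h_1..h_T
q = rng.normal(size=d_h)                # hypothetical attention query

last = H[-1]                 # final state only
mean_pooled = H.mean(axis=0) # average over time
max_pooled = H.max(axis=0)   # per-dimension maximum over time

# Attention pooling: softmax scores over time, then a weighted sum of states.
scores = H @ q
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
attended = alpha @ H

print(last.shape, mean_pooled.shape, max_pooled.shape, attended.shape)
```

All four summaries have size d_h, so the choice can be swapped without changing the classifier that consumes them.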
r = pool(h_1, ..., h_T) or r = sum_t alpha_t h_t
03 / Code
NumPy demo: one-step updates for RNN, LSTM, and GRU
The snippet writes the three sequence cells as explicit matrix operations so the difference between plain recurrence and gated recurrence is visible.
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def rnn_step(x_t, h_prev, p):
    # Plain recurrence: one tanh layer over the input and the previous state.
    return np.tanh(x_t @ p["Wx"] + h_prev @ p["Wh"] + p["b"])


def lstm_step(x_t, h_prev, c_prev, p):
    # One fused projection, split into input, forget, output, and candidate blocks.
    joined = x_t @ p["Wx"] + h_prev @ p["Wh"] + p["b"]
    i, f, o, g = np.split(joined, 4, axis=-1)
    i = sigmoid(i)
    f = sigmoid(f)
    o = sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g  # keep part of the old memory, write the candidate
    h_t = o * np.tanh(c_t)    # expose part of the cell state
    return h_t, c_t


def gru_step(x_t, h_prev, p):
    z = sigmoid(x_t @ p["Wxz"] + h_prev @ p["Whz"] + p["bz"])  # update gate
    r = sigmoid(x_t @ p["Wxr"] + h_prev @ p["Whr"] + p["br"])  # reset gate
    candidate = np.tanh(x_t @ p["Wxh"] + (r * h_prev) @ p["Whh"] + p["bh"])
    return (1 - z) * h_prev + z * candidate


rng = np.random.default_rng(9)
T, d_in, d_h = 6, 3, 4
X = rng.normal(size=(T, d_in))

# Each cell carries its own state so the three recurrences do not mix.
h_rnn = np.zeros(d_h)
h_lstm = np.zeros(d_h)
c_lstm = np.zeros(d_h)
h_gru = np.zeros(d_h)

rnn_params = {
    "Wx": rng.normal(size=(d_in, d_h)) / np.sqrt(d_in),
    "Wh": rng.normal(size=(d_h, d_h)) / np.sqrt(d_h),
    "b": np.zeros(d_h),
}
lstm_params = {
    "Wx": rng.normal(size=(d_in, 4 * d_h)) / np.sqrt(d_in),
    "Wh": rng.normal(size=(d_h, 4 * d_h)) / np.sqrt(d_h),
    "b": np.zeros(4 * d_h),
}
gru_params = {
    name: rng.normal(size=shape) / np.sqrt(shape[0])
    for name, shape in {
        "Wxz": (d_in, d_h), "Whz": (d_h, d_h),
        "Wxr": (d_in, d_h), "Whr": (d_h, d_h),
        "Wxh": (d_in, d_h), "Whh": (d_h, d_h),
    }.items()
}
gru_params.update({"bz": np.zeros(d_h), "br": np.zeros(d_h), "bh": np.zeros(d_h)})

for x_t in X:
    h_rnn = rnn_step(x_t, h_rnn, rnn_params)
    h_lstm, c_lstm = lstm_step(x_t, h_lstm, c_lstm, lstm_params)
    h_gru = gru_step(x_t, h_gru, gru_params)

print("RNN state:", h_rnn.round(3))
print("LSTM state:", h_lstm.round(3))
print("GRU state:", h_gru.round(3))
04 / Case
Case: research logs, macro time series, and text sequences
- In empirical workflows, a sequence is not only a sentence. It may be daily policy news, quarterly macro indicators, user-behavior logs, paper paragraphs, or an agent tool-call trace.
- A vanilla RNN is useful for teaching how state recurs, such as predicting next-period risk from recent indicators, but long dependencies can make training unstable.
- An LSTM is useful when long memory matters, such as remembering an early data limitation in a long reviewer report and using it later in a revision plan.
- A GRU is useful as a lighter sequence feature extractor for cleaning logs, command streams, or short event text before classification or anomaly detection.
05 / Risks