RNNs, LSTMs, and GRUs: Hidden States, Gated Memory, and Sequence Modeling

Recurrent models read a sequence step by step into a hidden state: vanilla RNNs apply the same update at every step, while LSTMs and GRUs add gates that control what is written, forgotten, and exposed.

Mechanism Lab

Animation: hidden state flows through time and gates control memory

The animation starts with a vanilla recurrent chain, shows long-chain gradient risk, opens LSTM forget/input/output gates and GRU update/reset gates, then connects to a prediction head.

Recurrence

Each step reads x_t and h_{t-1}, then writes the next hidden state.

h_t=f(x_t,h_{t-1})

01 / Intuition

Core Intuition

The core idea of an RNN is to compress the history of a sequence into a hidden state: the current input x_t and the previous state h_{t-1} jointly determine h_t.

Vanilla RNNs share parameters across time and handle variable-length sequences, but long-range dependencies are hard to learn because gradients backpropagated through many steps tend to vanish or explode.

LSTMs add a cell state plus forget, input, and output gates so the model can retain, write, or expose long-term memory.

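GRUs fold the LSTM design into a single hidden state with update and reset gates. With three weight blocks instead of four, they use fewer parameters than an LSTM of the same width while remaining stable to train on many medium-scale sequence tasks.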
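To make the parameter saving concrete, the sketch below counts weights for the three cells under the same layout as the NumPy demo in the Code section: one input matrix, one recurrent matrix, and one bias vector per block. The helper name cell_param_count is hypothetical, and libraries that store two bias vectors per block will report slightly larger numbers.

```python
def cell_param_count(d_in, d_h, n_blocks):
    """Count weights in a recurrent cell with n_blocks gate/candidate blocks.

    Each block is assumed to hold an input matrix (d_in x d_h),
    a recurrent matrix (d_h x d_h), and one bias vector (d_h).
    """
    return n_blocks * (d_in * d_h + d_h * d_h + d_h)

d_in, d_h = 3, 4
counts = {
    "rnn": cell_param_count(d_in, d_h, 1),   # one tanh block
    "lstm": cell_param_count(d_in, d_h, 4),  # input, forget, output gates + candidate
    "gru": cell_param_count(d_in, d_h, 3),   # update, reset gates + candidate
}
print(counts)  # {'rnn': 32, 'lstm': 128, 'gru': 96}
```

At equal width, the GRU in this layout carries three quarters of the LSTM's parameters.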

02 / Math

From recurrent state to gated memory

01 / Vanilla RNN state

A sequence enters the same recurrent cell step by step. W_x, W_h, and b are shared across time, so the model can process variable-length inputs. phi is an elementwise nonlinearity, typically tanh.

h_t = phi(W_x x_t + W_h h_{t-1} + b)

02 / Output layer

Depending on the task, the model can emit y_t at every step or use the final state for a sequence-level prediction.

p(y_t|x_{<=t}) = softmax(W_y h_t)
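The two output modes can be sketched as follows. This is a minimal illustration, not a full model: H is a random stand-in for the hidden states h_1..h_T, W_y is a hypothetical output matrix, and the code uses the row-vector convention (h @ W) of the NumPy demo rather than the column-vector form of the formula.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_h, n_classes = 5, 4, 3
H = rng.normal(size=(T, d_h))            # stand-in for hidden states h_1..h_T
W_y = rng.normal(size=(d_h, n_classes))  # hypothetical output weights

# Many-to-many: one distribution per time step, shape (T, n_classes).
per_step = softmax(H @ W_y)

# Many-to-one: a single distribution from the final state h_T, shape (n_classes,).
final_only = softmax(H[-1] @ W_y)

print(per_step.shape, final_only.shape)
```

In the many-to-many case the targets must be aligned with the steps, one label per h_t; in the many-to-one case only one label per sequence is needed.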

03 / BPTT gradient chain

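Backpropagation through time (BPTT) crosses many time steps. The gradient contains a product of per-step Jacobians: if their norms stay below 1 the product shrinks toward zero (vanishing gradients), and if they exceed 1 it can grow without bound (exploding gradients).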

dL/dh_t includes prod_{s=t+1}^{T} W_h^T diag(phi'(a_s)), where a_s = W_x x_s + W_h h_{s-1} + b
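A small numerical sketch of this product, under simplifying assumptions: W_h is a scaled orthogonal matrix so every singular value equals the scale factor, and the activation derivative is fixed at 1 (unsaturated units). The function name grad_norms and all sizes are illustrative choices.

```python
import numpy as np

def grad_norms(W_h, dphi, g_T):
    """Norms of the backpropagated gradient, from the last step back to the first.

    dphi[s] holds the activation derivative at step s. Each backward step
    applies W_h^T diag(dphi[s]) in the column-vector convention.
    """
    g = g_T
    norms = [np.linalg.norm(g)]
    for d in dphi[::-1]:
        g = W_h.T @ (d * g)
        norms.append(np.linalg.norm(g))
    return np.array(norms)

rng = np.random.default_rng(0)
d_h, T = 8, 50
Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))  # orthogonal: all singular values are 1
dphi = np.ones((T, d_h))                          # assumption: derivative fixed at 1
g_T = np.ones(d_h) / np.sqrt(d_h)                 # unit-norm gradient at the last step

shrinking = grad_norms(0.9 * Q, dphi, g_T)
growing = grad_norms(1.1 * Q, dphi, g_T)
print(shrinking[-1], growing[-1])  # roughly 0.9**50 and 1.1**50
```

With a real tanh cell the derivative is at most 1, so saturation only shrinks the product further; that is why vanishing is the more common failure in practice and exploding is usually handled by clipping.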

04 / LSTM gates

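The forget gate f_t decides how much of the old cell memory to keep, the input gate i_t decides how much of the candidate g_t to write, and the output gate o_t decides how much of the cell to expose as hidden state. The gates are sigmoid outputs in (0, 1), g_t is a tanh candidate, and * is elementwise multiplication.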

c_t=f_t*c_{t-1}+i_t*g_t, h_t=o_t*tanh(c_t)
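The additive cell update is what lets an LSTM hold information over many steps. The sketch below isolates that path by holding the gates constant, which is an assumption made purely for illustration since real gates change at every step; carry_cell is a hypothetical helper name.

```python
import numpy as np

def carry_cell(c0, f, i, g, steps):
    """Apply c_t = f * c_{t-1} + i * g repeatedly with constant gate values."""
    c = c0
    for _ in range(steps):
        c = f * c + i * g
    return c

c0 = np.array([1.0, -2.0])

# Forget gate nearly open, input gate closed: memory decays slowly.
kept = carry_cell(c0, f=0.99, i=0.0, g=0.0, steps=100)

# Forget gate half closed: memory is gone long before step 100.
dropped = carry_cell(c0, f=0.5, i=0.0, g=0.0, steps=100)

print(kept / c0)     # fraction retained, equal to 0.99**100
print(dropped / c0)  # fraction retained, equal to 0.5**100
```

Along this path the sensitivity of c_T to c_0 is the product of the forget gates, so a gate that stays close to 1 keeps both the memory and its gradient alive.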

05 / GRU update

The update gate z_t interpolates between the old state and the candidate state h_tilde; the reset gate r_t controls how much of the old state the candidate reads.

h_t=(1-z_t)*h_{t-1}+z_t*h_tilde
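A worked example of the interpolation with hand-picked values (all numbers are illustrative): each unit has its own update gate, so one unit can copy its old value while another is overwritten.

```python
import numpy as np

def gru_blend(h_prev, h_tilde, z):
    """Elementwise interpolation h_t = (1 - z) * h_prev + z * h_tilde."""
    return (1 - z) * h_prev + z * h_tilde

h_prev = np.array([0.5, -0.5, 0.2])   # old state
h_tilde = np.array([-1.0, 1.0, 0.0])  # candidate state
z = np.array([0.0, 1.0, 0.5])         # keep, overwrite, average

h_t = gru_blend(h_prev, h_tilde, z)
print(h_t)  # [0.5 1.  0.1]
```

Some references and libraries swap the roles of z and 1 - z, so it is worth checking the convention before comparing gate values across implementations.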

06 / Sequence representation

Classification, forecasting, or event detection can use the last state, pooled states, or attention-weighted state summaries.

r = h_T, r = pool(h_1,...,h_T), or r = sum_t alpha_t h_t
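The three summaries can be sketched side by side. H is a random stand-in for the hidden states, and the attention scores use a single query vector v, which is an assumed scoring rule chosen for brevity rather than the only option.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_h = 6, 4
H = rng.normal(size=(T, d_h))  # stand-in for hidden states h_1..h_T
v = rng.normal(size=d_h)       # hypothetical attention query vector

last = H[-1]                   # last state h_T
mean = H.mean(axis=0)          # mean pooling over time

scores = H @ v                          # one score per time step
alpha = np.exp(scores - scores.max())   # softmax over time, shifted for stability
alpha /= alpha.sum()
attended = alpha @ H                    # sum_t alpha_t h_t

print(last.shape, mean.shape, attended.shape)
```

All three produce a vector of size d_h, so the downstream classifier or regressor does not need to change when the summary does.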

03 / Code

NumPy demo: one-step updates for RNN, LSTM, and GRU

The snippet writes the three sequence cells as explicit matrix operations so the difference between plain recurrence and gated recurrence is visible.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def rnn_step(x_t, h_prev, p):
    return np.tanh(x_t @ p["Wx"] + h_prev @ p["Wh"] + p["b"])

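def lstm_step(x_t, h_prev, c_prev, p):
    # One affine map produces all four blocks, laid out as [i | f | o | g].
    joined = x_t @ p["Wx"] + h_prev @ p["Wh"] + p["b"]
    # Split along the feature axis so batched inputs (batch, 4 * d_h) also work.
    i, f, o, g = np.split(joined, 4, axis=-1)
    i = sigmoid(i)   # input gate: how much candidate to write
    f = sigmoid(f)   # forget gate: how much old cell state to keep
    o = sigmoid(o)   # output gate: how much cell state to expose
    g = np.tanh(g)   # candidate memory
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t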

def gru_step(x_t, h_prev, p):
    z = sigmoid(x_t @ p["Wxz"] + h_prev @ p["Whz"] + p["bz"])
    r = sigmoid(x_t @ p["Wxr"] + h_prev @ p["Whr"] + p["br"])
    candidate = np.tanh(x_t @ p["Wxh"] + (r * h_prev) @ p["Whh"] + p["bh"])
    return (1 - z) * h_prev + z * candidate

rng = np.random.default_rng(9)
T, d_in, d_h = 6, 3, 4
X = rng.normal(size=(T, d_in))
h = np.zeros(d_h)
c = np.zeros(d_h)

rnn_params = {
    "Wx": rng.normal(size=(d_in, d_h)) / np.sqrt(d_in),
    "Wh": rng.normal(size=(d_h, d_h)) / np.sqrt(d_h),
    "b": np.zeros(d_h),
}
lstm_params = {
    "Wx": rng.normal(size=(d_in, 4 * d_h)) / np.sqrt(d_in),
    "Wh": rng.normal(size=(d_h, 4 * d_h)) / np.sqrt(d_h),
    "b": np.zeros(4 * d_h),
}
gru_params = {
    name: rng.normal(size=shape) / np.sqrt(shape[0])
    for name, shape in {
        "Wxz": (d_in, d_h), "Whz": (d_h, d_h),
        "Wxr": (d_in, d_h), "Whr": (d_h, d_h),
        "Wxh": (d_in, d_h), "Whh": (d_h, d_h),
    }.items()
}
gru_params.update({"bz": np.zeros(d_h), "br": np.zeros(d_h), "bh": np.zeros(d_h)})

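# Each cell carries its own hidden state across the sequence.
h_rnn = h_lstm = h_gru = h  # all start from the same zero vector
for x_t in X:
    h_rnn = rnn_step(x_t, h_rnn, rnn_params)
    h_lstm, c = lstm_step(x_t, h_lstm, c, lstm_params)
    h_gru = gru_step(x_t, h_gru, gru_params)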

print("RNN state:", h_rnn.round(3))
print("LSTM state:", h_lstm.round(3))
print("GRU state:", h_gru.round(3))

04 / Case

Case: research logs, macro time series, and text sequences

  • In empirical workflows, a sequence is not only a sentence. It may be daily policy news, quarterly macro indicators, user-behavior logs, paper paragraphs, or an agent tool-call trace.
  • A vanilla RNN is useful for teaching how state recurs, such as predicting next-period risk from recent indicators, but long dependencies can make training unstable.
  • An LSTM is useful when long memory matters, such as remembering an early data limitation in a long reviewer report and using it later in a revision plan.
  • A GRU is useful as a lighter sequence feature extractor for cleaning logs, command streams, or short event text before classification or anomaly detection.

05 / Risks

Common Pitfalls

  • Assuming the final hidden state is always enough; long sequences often need pooling, attention, or chunking.
  • Forgetting truncated BPTT or gradient clipping, causing slow training, high memory use, or exploding gradients.
  • Randomly shuffling temporal data in forecasting and leaking future information.
  • Assuming LSTMs and GRUs solve all long-range dependency problems; very long text or retrieval-heavy tasks often need attention or external memory.
  • Mixing up many-to-one, many-to-many, and sequence-to-sequence target alignment.
  • Treating recurrent hidden state as a causal mechanism. It is a predictive representation, not an identification assumption.
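Two of the pitfalls above have short mechanical remedies, sketched here under simple assumptions. clip_by_global_norm rescales a list of gradient arrays so their joint L2 norm stays below a threshold, and chronological_split keeps every training index before every test index. Both function names and the 80/20 default are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their joint L2 norm is at most max_norm.

    Returns the rescaled gradients and the norm measured before clipping.
    """
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # leave small gradients untouched
    return [g * scale for g in grads], total

def chronological_split(n_samples, train_frac=0.8):
    """Index arrays for a time-ordered split: training strictly precedes test."""
    cut = int(n_samples * train_frac)
    return np.arange(cut), np.arange(cut, n_samples)

# Toy gradients with joint norm sqrt(4 * 9 + 2 * 16) = sqrt(68).
grads = [np.full((2, 2), 3.0), np.full(2, 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

train_idx, test_idx = chronological_split(10)
print(norm_before, norm_after)
print(train_idx, test_idx)
```

Clipping the joint norm preserves the direction of the update, and splitting by position rather than by random shuffle keeps future observations out of the training set.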
