Causal ML / Orthogonalization

DML and Causal Forests

DML uses machine learning for high-dimensional nuisance functions, then protects the target parameter with Neyman-orthogonal moments; causal forests localize the same idea for heterogeneous effects.

Mechanism Lab

Animation: how DML turns prediction signal into orthogonal causal signal

The animation starts with high-dimensional X entering two nuisance models, then shows residualized Y and D, the orthogonal score, cross-fit folds, and local tau(x) from a causal forest.

Step 1 / 5

Nuisance ML

Use machine learning to estimate E[Y|X] and E[D|X].

ell_hat(X), m_hat(X)

01 / Intuition

Core Intuition

Standard machine learning is good at predicting Y or D, but causal work needs variation in treatment D that is not explained by the confounders X, not just accurate prediction.

The key DML move is residualization: remove the parts of Y and D predictable from X, then estimate theta from the remaining orthogonal signal.

Cross-fitting keeps nuisance predictions out of sample, so overfitting errors do not directly contaminate the target estimate; causal forests then replace the single theta with a covariate-specific effect tau(x).
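
To make the residualization idea concrete, here is a minimal sketch on simulated data. It assumes a single linear confounder and uses ordinary least squares as a stand-in for the machine-learning nuisance models, so it illustrates only the partialling-out logic. The data-generating process and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))
theta_true = 1.0  # assumed true effect of D on Y

# X[:, 0] drives both treatment and outcome, so it confounds the naive slope.
D = X[:, 0] + rng.normal(size=n)
Y = theta_true * D + 2.0 * X[:, 0] + rng.normal(size=n)

# Naive slope of Y on D ignores X and absorbs the confounding.
naive = np.cov(Y, D)[0, 1] / np.var(D, ddof=1)

# Residualize Y and D on X, then regress residual on residual.
y_res = Y - LinearRegression().fit(X, Y).predict(X)
d_res = D - LinearRegression().fit(X, D).predict(X)
theta_res = np.sum(d_res * y_res) / np.sum(d_res ** 2)

print(f"naive slope        = {naive:.3f}")
print(f"residualized slope = {theta_res:.3f} (true value {theta_true})")
```

In this design the naive slope converges to 2 rather than 1, because the confounder contributes 2 * Cov(X0, D) / Var(D) = 1 on top of the true effect. The residualized slope removes that term.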

02 / Math

From a partially linear model to orthogonal moments and local forests

01 / Target model

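With high-dimensional controls X, the partially linear model separates the treatment effect, a nonparametric control function, and residual noise. theta_0 is the effect of a one-unit change in D on Y, which this model assumes is the same for every X; if effects actually vary, the estimator below recovers a weighted average of them.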

Y = theta_0 D + g_0(X) + U
D = m_0(X) + V
E[U|X,D]=0, E[V|X]=0

02 / Residualization

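Let ell_0(X)=E[Y|X]. Taking conditional expectations of the outcome equation gives ell_0(X) = theta_0 m_0(X) + g_0(X). Project Y and D on X, then take residuals. At the truth, the treatment residual is V and the Y residual equals theta_0 times the treatment residual plus the noise U.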

tilde Y = Y - ell_0(X)
tilde D = D - m_0(X)
tilde Y = theta_0 tilde D + U
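
The residual identity can be checked numerically. The sketch below assumes oracle knowledge of m_0 and g_0 (never available in practice) and hypothetical functional forms for both; it only verifies that tilde Y - theta_0 tilde D reproduces U exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=(n, 3))
theta0 = 0.7  # assumed true effect


def m0(X):
    """Hypothetical treatment regression E[D | X]."""
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]


def g0(X):
    """Hypothetical control function in the outcome equation."""
    return X[:, 0] ** 2 + X[:, 1] * X[:, 2]


U = rng.normal(size=n)
V = rng.normal(size=n)
D = m0(X) + V
Y = theta0 * D + g0(X) + U

# E[Y | X] = theta_0 m_0(X) + g_0(X), because E[V | X] = E[U | X] = 0.
ell0 = theta0 * m0(X) + g0(X)

Y_tilde = Y - ell0
D_tilde = D - m0(X)

# The outcome residual equals theta_0 times the treatment residual plus U.
identity_holds = np.allclose(Y_tilde, theta0 * D_tilde + U)
theta_oracle = np.sum(D_tilde * Y_tilde) / np.sum(D_tilde ** 2)
print(f"identity holds: {identity_holds}, oracle theta = {theta_oracle:.3f}")
```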

03 / Orthogonal moment

DML does not treat machine-learning predictions as causal estimates. It inserts them into a moment condition that is insensitive to first-order nuisance error.

psi(W;theta,eta) = (Y - ell(X) - theta(D-m(X)))(D-m(X))
E[psi(W;theta_0,eta_0)] = 0

04 / Neyman orthogonality

Take Gateaux derivatives with respect to perturbations h_l for ell and h_m for m. At the truth, D-m_0(X)=V and Y-ell_0(X)-theta_0 V=U, with E[V|X]=0 and E[U|X]=0 (the latter follows from E[U|X,D]=0), so the first-order terms vanish.

d/dt E[psi(theta_0, ell_0+t h_l, m_0)]|0 = -E[h_l(X)V] = 0
d/dt E[psi(theta_0, ell_0, m_0+t h_m)]|0 = E[h_m(X)(theta_0 V-U)] = 0
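
A numerical sketch of what orthogonality buys. It shifts the true nuisances by t along hypothetical directions h_l and h_m and evaluates the sample moment at theta_0. For contrast it also evaluates a non-orthogonal score, (Y - theta D - g(X)) D, which is not part of the text above and is included only as a comparison. The data-generating process is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=(n, 2))
theta0 = 0.5

m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]
g0 = np.cos(X[:, 0])
U = rng.normal(size=n)
V = rng.normal(size=n)
D = m0 + V
Y = theta0 * D + g0 + U
ell0 = theta0 * m0 + g0

h_l = X[:, 0]  # hypothetical perturbation direction for ell (and for g)
h_m = X[:, 1]  # hypothetical perturbation direction for m


def orthogonal_moment(t):
    """Sample mean of psi with both nuisances shifted by t."""
    ell, m = ell0 + t * h_l, m0 + t * h_m
    return np.mean((Y - ell - theta0 * (D - m)) * (D - m))


def naive_moment(t):
    """Sample mean of the non-orthogonal score with g shifted by t."""
    g = g0 + t * h_l
    return np.mean((Y - theta0 * D - g) * D)


for t in (0.0, 0.05, 0.1, 0.2):
    print(f"t={t:4.2f}  orthogonal={orthogonal_moment(t):+.4f}  "
          f"naive={naive_moment(t):+.4f}")
```

Up to sampling noise, the orthogonal moment moves with t squared while the non-orthogonal one moves linearly in t, which is why moderate nuisance error is tolerable in the first case and not in the second.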

05 / Cross-fitted estimator

Split the sample into K folds. For each fold, fit the nuisance models on the other K-1 folds and predict on the held-out fold, then regress the pooled residualized Y on the pooled residualized D.

theta_hat = sum_i tilde D_i tilde Y_i / sum_i tilde D_i^2
where tilde Y_i, tilde D_i are cross-fitted residuals
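
The cross-fitted estimator can be written compactly with scikit-learn's cross_val_predict, which returns for each row the prediction of a model trained without that row's fold. The helper name dml_plr, its signature, and the simulated data are assumptions for illustration; this is a sketch, not a replacement for a dedicated DML library.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def dml_plr(Y, D, X, learner_y, learner_d, n_splits=5, seed=0):
    """Cross-fitted residual-on-residual estimate of theta with a plug-in SE."""
    # A fixed integer seed makes both calls below use identical folds.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_res = Y - cross_val_predict(learner_y, X, Y, cv=cv)
    d_res = D - cross_val_predict(learner_d, X, D, cv=cv)
    theta_hat = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    score = (y_res - theta_hat * d_res) * d_res
    se = np.sqrt(np.mean(score ** 2) / (np.mean(d_res ** 2) ** 2 * len(Y)))
    return theta_hat, se


# Illustrative data with a true effect of 1.0.
rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))
D = 0.5 * X[:, 0] + rng.normal(size=n)
Y = 1.0 * D + np.sin(X[:, 0]) + rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
theta_hat, se = dml_plr(Y, D, X, rf, rf)  # cross_val_predict clones the learner
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```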

06 / Causal forest localization

If the treatment effect varies with X, a forest assigns neighborhood weights alpha_i(x) around each target point x and solves a local orthogonal moment.

tau_hat(x) = [sum_i alpha_i(x) tilde D_i tilde Y_i] / [sum_i alpha_i(x) tilde D_i^2]
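
The weighted formula can be sketched with leaf co-membership weights from an ordinary scikit-learn random forest. This is a stand-in, not a causal forest: it uses oracle residuals, a non-honest forest, and a regression target (the product of residuals) chosen only so that leaves track tau(x) in this particular simulation. All of these are assumptions made to isolate the weighting step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 4000
X = rng.normal(size=(n, 4))
tau = 1.0 + (X[:, 0] > 0)  # true effect: 1 where X0 <= 0, 2 where X0 > 0
m0 = 0.5 * X[:, 1]
g0 = np.sin(X[:, 2])
V = rng.normal(size=n)
U = rng.normal(size=n)
D = m0 + V
Y = tau * D + g0 + U

# Oracle residuals (assumption: the true nuisance functions are known).
D_tilde = D - m0
Y_tilde = Y - (tau * m0 + g0)

# Stand-in forest whose leaves define neighborhoods. Its target has
# conditional mean tau(x) here because Var(V | X) = 1.
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=50, random_state=0
).fit(X, D_tilde * Y_tilde)
train_leaves = forest.apply(X)  # leaf index per row and tree, shape (n, n_trees)


def local_tau(x):
    """Solve the alpha-weighted residual-on-residual moment at point x."""
    same_leaf = train_leaves == forest.apply(x.reshape(1, -1))
    # Within each tree, spread weight evenly over leaf mates; then average trees.
    alpha = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
    tau_x = np.sum(alpha * D_tilde * Y_tilde) / np.sum(alpha * D_tilde ** 2)
    return tau_x, alpha


tau_low, alpha_low = local_tau(np.array([-1.0, 0.0, 0.0, 0.0]))
tau_high, alpha_high = local_tau(np.array([1.0, 0.0, 0.0, 0.0]))
print(f"tau_hat(x0=-1) = {tau_low:.2f}, tau_hat(x0=+1) = {tau_high:.2f}")
```

Dedicated causal forest implementations differ in the parts that matter for inference: they choose splits to maximize effect heterogeneity and use separate subsamples for splitting and estimation.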

03 / Code

Python code: cross-fitted DML plus a simple heterogeneity display

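The example simulates high-dimensional confounding, estimates nuisance functions with random forests, cross-fits residuals, and then estimates the average effect theta plus a display model for tau(x). Because D is continuous in this simulation, theta is an average partial effect rather than a binary-treatment ATE.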

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n = 2500
p = 12
X = rng.normal(size=(n, p))

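# Heterogeneous treatment effect for demonstration.
# Its population mean is 0.8 + 0.6 * 0.5 = 1.1, the benchmark for theta_hat here.
tau = 0.8 + 0.6 * (X[:, 0] > 0) - 0.35 * X[:, 1]
# Treatment: nonlinear in X plus unit-variance noise that is independent of X.
m = 0.4 * X[:, 0] - 0.25 * X[:, 2] + 0.2 * X[:, 3] ** 2
D = m + rng.normal(scale=1.0, size=n)
# Outcome: confounded through g(X), which shares X[:, 0] and X[:, 2] with m(X).
g = 1.2 * np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
Y = tau * D + g + rng.normal(scale=1.0, size=n)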

base_y = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=20,
    random_state=11,
    n_jobs=-1,
)
base_d = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=20,
    random_state=17,
    n_jobs=-1,
)

# Cross-fitting: each observation is predicted by models trained on the other folds.
y_hat = np.zeros(n)
d_hat = np.zeros(n)
folds = KFold(n_splits=5, shuffle=True, random_state=23)

for train_idx, test_idx in folds.split(X):
    model_y = clone(base_y).fit(X[train_idx], Y[train_idx])
    model_d = clone(base_d).fit(X[train_idx], D[train_idx])
    y_hat[test_idx] = model_y.predict(X[test_idx])
    d_hat[test_idx] = model_d.predict(X[test_idx])

# Residual-on-residual regression without an intercept.
y_resid = Y - y_hat
d_resid = D - d_hat
theta_hat = np.sum(d_resid * y_resid) / np.sum(d_resid ** 2)

# The orthogonal score has sample mean zero at theta_hat by construction;
# its second moment gives the plug-in standard error.
score = (y_resid - theta_hat * d_resid) * d_resid
se = np.sqrt(np.mean(score ** 2) / (np.mean(d_resid ** 2) ** 2 * n))

# A simple heterogeneity display, not a causal forest: regress the pseudo-outcome
# y_resid / d_resid on X. Weighting by d_resid ** 2 gives an R-learner-style loss
# and downweights the noisiest ratios; near-zero residuals are dropped outright.
# Production causal forests use honest splitting and forest weights.
mask = np.abs(d_resid) > 0.2
pseudo_tau = y_resid[mask] / d_resid[mask]
tau_model = RandomForestRegressor(
    n_estimators=300,
    min_samples_leaf=40,
    random_state=31,
    n_jobs=-1,
).fit(X[mask], pseudo_tau, sample_weight=d_resid[mask] ** 2)

grid = pd.DataFrame({
    "x0_group": ["low X0", "high X0"],
    "tau_hat": [
        tau_model.predict(X[X[:, 0] <= 0]).mean(),
        tau_model.predict(X[X[:, 0] > 0]).mean(),
    ],
})

print(f"DML theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}")
print(f"orthogonal score mean = {score.mean():.4f}")
print(grid)

04 / Case

Case: selection bias and heterogeneity in a job-training program

Question: does a job-training program raise later earnings? Participants differ in education, industry, region, prior earnings, and job-search history.
Strong predictors of earnings are not causal evidence. DML first uses X to explain earnings and take-up, then estimates the average effect from residual treatment variation.
A causal forest asks who benefits more: for example, whether low-baseline earners, industry switchers, or younger job seekers have higher tau(x).
A credible report states the identification assumption, overlap checks, out-of-sample nuisance performance, cross-fitting design, ATE interval, heterogeneity calibration, and pre-specified subgroup interpretation.
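
An overlap check of the kind listed above can be sketched as follows. The data are simulated with a hypothetical take-up rule, the propensity model is a plain logistic regression, and the 0.05 / 0.95 thresholds are illustrative choices rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(n, 4))

# Hypothetical take-up rule: participation depends strongly on X[:, 0].
p_true = 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))
D = rng.binomial(1, p_true)

# Out-of-fold propensity estimates e_hat(X) = P(D = 1 | X).
e_hat = cross_val_predict(
    LogisticRegression(max_iter=1000), X, D, cv=5, method="predict_proba"
)[:, 1]

lower, upper = 0.05, 0.95  # illustrative trimming thresholds
outside = (e_hat < lower) | (e_hat > upper)
print(f"share with e_hat outside [{lower}, {upper}]: {outside.mean():.1%}")
print("e_hat quantiles (1%, 50%, 99%):",
      np.quantile(e_hat, [0.01, 0.5, 0.99]).round(3))
```

A large share of observations near 0 or 1 signals regions of X where treated and untreated units are not comparable, so effect estimates there rest on extrapolation.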

05 / Risks

Common Pitfalls

Interpreting random-forest or boosting feature importance as a causal mechanism.

Training nuisances and estimating theta on the same observations, letting overfitting error enter the target parameter.

Ignoring overlap: residualization cannot create comparability in X regions with almost no treated or control units.

Showing an attractive tau(x) heat map without honest sample splitting, calibration, confidence intervals, or a pre-specified heterogeneity question.
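
The same-sample pitfall can be seen directly in the treatment residuals. The sketch below assumes a simulated treatment with residual variance 1 and a deliberately flexible random forest; it compares in-sample residuals with out-of-fold residuals from the same learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 5))
D = 0.5 * X[:, 0] + rng.normal(size=n)  # true residual variance Var(V) = 1

# Deliberately flexible learner: fully grown trees can nearly memorize D.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=1, random_state=0)

# Same-sample residuals: the model has already seen every D_i it predicts.
in_sample_res = D - rf.fit(X, D).predict(X)

# Cross-fitted residuals: each D_i is predicted by a model trained without it.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
out_of_fold_res = D - cross_val_predict(rf, X, D, cv=cv)

print(f"in-sample residual variance   = {in_sample_res.var():.3f}")
print(f"out-of-fold residual variance = {out_of_fold_res.var():.3f}")
```

The in-sample residuals understate the treatment variation that identifies theta, because the forest has absorbed part of the noise V. The out-of-fold residuals keep that variation intact.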