Text-Based Causal Inference: Turning Text into Identifiable Causal Variables
When treatment, outcome, or confounders hide in policy documents, filings, news, rulings, and social text, how to measure them with AI without breaking causal identification.
[Schematic: the principle at a glance]
Start Here
What you should be able to do
- Decide the causal role of text first: treatment D, outcome Y, or confounder X.
- Turn text into numeric variables with topics, embeddings, or LLM labels.
- Know that measurement error in a text-measured regressor causes attenuation bias, so the measurement needs double coding or reliability checks.
- Understand that outcome-supervised text representations cause overfitting bias and need sample splitting.
- Avoid post-treatment text leakage: build pre-treatment variables only from pre-treatment text.
Learning Path
Learning path: from text to identifiable causal variables
Follow this path: fix the role, measure, split to prevent leakage, plug into a design, then estimate and report measurement risk honestly.
Step 1
Role
Decide whether text is treatment D, outcome Y, or confounder X.
role(text)
Step 2
Measure
Encode text into numbers with embeddings/LLMs and record measurement error.
V_hat=f_theta(text)
Step 3
Split
Learn the representation and estimate the effect on different folds to avoid Y leakage.
fold A / fold B
Step 4
Design
Plug text variables into DiD / IV / DML.
D, Y, X -> design
Step 5
Report
Report reliability, attenuation handling, overlap, and human checks.
audit
01 / Intuition
Core Intuition
The first step in text-based causal inference is not modeling but drawing the causal graph: is this text D, Y, or X? The method depends entirely on the role.
There are three ways to convert text to numbers: interpretable frequency/topic features, pretrained embeddings, and LLM labeling or extraction. The more flexible the method, the more powerful it is, but also the more prone to leaking outcome information.
The core risk is learning the representation and estimating the effect on the same documents: if the representation is supervised by Y, the learned features absorb outcome noise from the estimation sample and the estimate is biased. The fix is sample splitting / cross-fitting, the same idea as in DML.
02 / Math
From text representation to an identifiable treatment effect
01 / Causal role of text
Fix where text enters the causal graph: as treatment D=g(text), outcome Y=h(text), or confounder X=e(text). Identification follows from the role, not the model.
role(text) in {D, Y, X}
02 / Text measurement
Encode text into numbers with a map f_theta. Interpretable features, embeddings, or LLM labels all work, but all introduce measurement error epsilon.
V_hat = f_theta(text) = V_star + epsilon
03 / Measurement error and attenuation
Regressing Y on a noisy D_hat pulls the coefficient toward zero (attenuation). Use repeated measurement, reliability correction, or a second independent measure as an instrument.
plim beta_hat = beta · Var(D_star) / (Var(D_star) + Var(epsilon))
04 / Adjusting for text as a confounder
When the same text drives both D and Y, use the text representation e(text) as a control: text matching, or as part of the high-dimensional X residualized in DML.
tau = E[ E[Y|D=1, e(text)] − E[Y|D=0, e(text)] ]
05 / Sample splitting against overfitting
If the text representation is learned with supervision, learn it on a fold separate from effect estimation, or outcome information leaks into the representation and biases the estimate.
learn f_theta on fold A ; estimate tau on fold B
03 / Code
Code cases: from text measurement to split-based causal adjustment
Use a small corpus to turn text into numeric features, use sample splitting to avoid leakage, and estimate a treatment effect adjusting for the text representation as a confounder.
Case 1: decide whether text is D, Y, or X
The same filing text can be a treatment (disclosure tone), an outcome (sentiment), or a confounder (industry conditions). The wrong role ruins everything downstream.
roles = {
    "mentions layoffs": "D treatment",
    "report sentiment": "Y outcome",
    "industry conditions": "X confounder",
}
for feature, role in roles.items():
    print(f"{feature:>20} -> {role}")
Expected output
    mentions layoffs -> D treatment
    report sentiment -> Y outcome
 industry conditions -> X confounder
How to read this code
- One document yields different variables in different causal roles.
- Draw the causal graph first, then choose the identification strategy.
- Mistaking a confounding text for the treatment attributes the confounder's influence to the treatment and biases the effect estimate.
Case 2: measurement error causes attenuation
A text-measured treatment carries noise, so a naive regression yields an estimate that is pulled toward zero.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
D_star = rng.normal(size=n)  # true text construct
Y = 2.0 * D_star + rng.normal(size=n)  # true effect is 2.0
for sd in [0.0, 0.5, 1.0]:
    D_hat = D_star + rng.normal(scale=sd, size=n)  # add measurement error
    beta = np.polyfit(D_hat, Y, 1)[0]  # OLS slope of Y on D_hat
    print(f"noise sd={sd}: estimated beta = {beta:.2f}")
Expected output (approximate)
noise sd=0.0: estimated beta = 2.00
noise sd=0.5: estimated beta = 1.60
noise sd=1.0: estimated beta = 1.00
How to read this code
- More measurement noise pulls the estimate further toward zero.
- This is why text measurement needs reliability checks or repeated measures.
- A second independent measure can serve as an instrument to correct attenuation.
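The instrument idea in the last bullet can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a prescribed procedure: `D1` and `D2` are hypothetical names for two measurements of the same construct whose errors are independent of each other and of the outcome noise, and the true effect is fixed at 2.0 as in Case 2. Under those assumptions the OLS slope on `D1` should land near 1.0 and the IV estimate near 2.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
beta_true = 2.0

D_star = rng.normal(size=n)                  # latent construct, unobserved in practice
Y = beta_true * D_star + rng.normal(size=n)

# Two measurements of the same construct with independent errors
# (hypothetical setup: e.g. two coders, or a human coder and a model).
D1 = D_star + rng.normal(scale=1.0, size=n)
D2 = D_star + rng.normal(scale=1.0, size=n)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))


beta_ols = cov(D1, Y) / cov(D1, D1)   # attenuated toward beta * 1 / (1 + 1)
beta_iv = cov(D2, Y) / cov(D2, D1)    # D2 used as an instrument for D1

print(f"OLS on D1:  {beta_ols:.2f}")
print(f"IV with D2: {beta_iv:.2f}")
```

The correction relies on the two errors being independent. If both measures come from the same model or the same coder, their errors are likely correlated and the IV estimate is no longer consistent.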
Case 3: sample splitting prevents leaking Y into the representation
Supervising a text representation with Y and then estimating the effect on the same data overfits a spurious relationship.
# Right: learn representation/nuisance on fold A, estimate effect on fold B
# Wrong: tune representation and estimate effect on the same data -> optimism
print("learn f_theta on fold A")
print("estimate tau on fold B")
print("never reuse outcome-supervised text features in-sample")
Expected output
learn f_theta on fold A
estimate tau on fold B
never reuse outcome-supervised text features in-sample
How to read this code
- Outcome-supervised text representations memorize outcome information.
- Sample splitting / cross-fitting keeps representation and estimation uncontaminated.
- This is the same principle as DML cross-fitting.
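To make the leakage concrete, here is a hedged numerical sketch. It assumes a deliberately extreme setup: the "text features" `X` are pure noise unrelated to `Y`, so any association between a text score and the outcome is spurious by construction. The helper names (`fit_score`, `score`, `slope`) are illustrative, and the score is learned by plain least squares rather than any particular encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 600, 100
X = rng.normal(size=(n, p))   # stand-in for text features; unrelated to Y by construction
Y = rng.normal(size=n)        # so the true slope of Y on any text score is zero


def fit_score(X_train, Y_train):
    """Learn an outcome-supervised text score by least squares; return the weights."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(A, Y_train, rcond=None)
    return w


def score(X_new, w):
    """Apply learned weights to new feature rows."""
    return np.column_stack([np.ones(len(X_new)), X_new]) @ w


def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]


# Wrong: learn the score and estimate the relationship on the same data.
w_all = fit_score(X, Y)
slope_same = slope(score(X, w_all), Y)

# Right: learn the score on fold A, estimate on fold B.
idx = rng.permutation(n)
fold_a, fold_b = idx[: n // 2], idx[n // 2:]
w_a = fit_score(X[fold_a], Y[fold_a])
slope_split = slope(score(X[fold_b], w_a), Y[fold_b])

print(f"same-sample slope: {slope_same:.2f}")
print(f"split-sample slope: {slope_split:.2f}")
```

In this setup the same-sample slope equals 1 by construction (the score is the in-sample fitted value of `Y`), while the split-sample slope should be close to the true value of zero, up to sampling noise.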
Case 4: cross-fitting removes confounding via the text representation
When text drives both treatment and outcome, cross-fitting removes the text-predictable parts of Y and D before estimating the effect. The naive regression is confounded; the adjusted estimate is close to the truth.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold
import numpy as np

# Assumes docs (list of n strings), Y (float array), and D (0/1 array) are already loaded.
n = len(docs)
X = TfidfVectorizer(max_features=30).fit_transform(docs).toarray()
y_hat, d_hat = np.zeros(n), np.zeros(n)
for tr, te in KFold(5, shuffle=True, random_state=1).split(X):
    # Fit nuisance models on the training folds, predict on the held-out fold
    y_hat[te] = LinearRegression().fit(X[tr], Y[tr]).predict(X[te])
    d_hat[te] = LogisticRegression(max_iter=500).fit(X[tr], D[tr]).predict_proba(X[te])[:, 1]
yr, dr = Y - y_hat, D - d_hat  # out-of-fold residuals
theta = np.sum(dr * yr) / np.sum(dr ** 2)  # residual-on-residual slope
print(f"naive: {np.polyfit(D, Y, 1)[0]:.2f}")
print(f"text-adjusted: {theta:.2f}")
Expected output (true effect = 1.0; exact values depend on the corpus)
naive: 1.70
text-adjusted: 1.05
How to read this code
- The naive regression is biased by text confounding (1.70 vs the true 1.0).
- After cross-fit residualization with the text representation, the estimate is close to the truth.
- The key is predicting nuisances out of fold to avoid overfitting the target parameter.
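Case 4 does not show how `docs`, `Y`, and `D` are built. The sketch below fills that gap with a synthetic corpus so the pipeline can be run end to end. Everything about the data is an assumption made for illustration: a binary latent confounder `c` (think "industry conditions") picks the vocabulary of each document, raises the probability of treatment, and shifts the outcome by 1.2; the true effect `tau_true` is 1.0. The numbers it prints are not meant to reproduce the output above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, tau_true = 2000, 1.0

# Latent confounder that shapes the wording of each document (hypothetical).
c = rng.integers(0, 2, size=n)
good = ["growth", "expansion", "hiring"]
bad = ["downturn", "recession", "layoffs"]
filler = ["report", "quarter", "company", "board"]
docs = [
    " ".join(list(rng.choice(good if ci else bad, size=3)) + list(rng.choice(filler, size=3)))
    for ci in c
]

D = rng.binomial(1, 0.2 + 0.6 * c)                  # treatment is more likely when c = 1
Y = tau_true * D + 1.2 * c + rng.normal(size=n)     # c also moves the outcome

# Unsupervised text features, then cross-fitted nuisance predictions.
X = TfidfVectorizer(max_features=30).fit_transform(docs).toarray()
y_hat, d_hat = np.zeros(n), np.zeros(n)
for tr, te in KFold(5, shuffle=True, random_state=1).split(X):
    y_hat[te] = LinearRegression().fit(X[tr], Y[tr]).predict(X[te])
    d_hat[te] = LogisticRegression(max_iter=500).fit(X[tr], D[tr]).predict_proba(X[te])[:, 1]

yr, dr = Y - y_hat, D - d_hat
theta = np.sum(dr * yr) / np.sum(dr ** 2)           # residual-on-residual slope
naive = np.polyfit(D, Y, 1)[0]                      # unadjusted difference in means

print(f"naive: {naive:.2f}")
print(f"text-adjusted: {theta:.2f}  (true effect = {tau_true})")
```

Because the vocabulary reveals `c` almost perfectly here, adjustment works well. With real text the representation captures the confounder only partially, so residual confounding is possible and overlap should be checked.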
04 / Case
Case: evaluating a minimum-wage reform from policy-text intensity
- Question: the dynamic effect of local minimum-wage statute stringency on employment. Stringency hides in the text with no ready-made number.
- Use an encoder or LLM to map each statute into a strictness score D=g(text) as the treatment intensity in a continuous DiD. Text only measures D; identification comes from the panel and timing.
- If statute wording also reflects local economic fundamentals (confounding), include the text representation e(text) as high-dimensional controls residualized in DML, with overlap checks.
- A credible report states measurement scheme and reliability, pre/post text separation, the sample-splitting design, attenuation handling, and human spot-checks of text labels.
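The continuous-DiD step of this case can be sketched on simulated data. This is an illustration under strong assumptions, not the study's actual specification: a balanced panel, a hypothetical per-unit strictness score `g` that switches on at a random adoption period, a constant effect `tau_true`, and a static two-way fixed-effects estimator rather than a full event study. Unit effects are made to correlate with strictness so that pooled OLS is visibly biased.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, tau_true = 1000, 10, -0.5

g = rng.uniform(0.2, 1.0, size=N)          # hypothetical text-derived strictness per unit
adopt = rng.integers(3, 9, size=N)         # adoption period, between 3 and 8
post = (np.arange(T)[None, :] >= adopt[:, None]).astype(float)
D = g[:, None] * post                      # treatment intensity D_it = g(text_it)

a = 2.0 * g + rng.normal(size=N)           # unit effects correlated with strictness
b = 0.3 * np.arange(T)                     # common time trend
Y = a[:, None] + b[None, :] + tau_true * D + rng.normal(scale=0.5, size=(N, T))


def demean_two_way(M):
    """Remove unit and period means; exact for a balanced panel."""
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()


Dd, Yd = demean_two_way(D), demean_two_way(Y)
tau_twfe = np.sum(Dd * Yd) / np.sum(Dd ** 2)            # two-way fixed-effects slope
tau_pooled = np.polyfit(D.ravel(), Y.ravel(), 1)[0]     # ignores unit and time effects

print(f"pooled OLS: {tau_pooled:.2f}")
print(f"two-way FE: {tau_twfe:.2f}  (true effect = {tau_true})")
```

The text score only supplies `D`. What removes the bias is the panel structure, which is why the pooled slope is far from the truth while the fixed-effects slope is close to it.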
05 / Causal
Which design to plug into: a text-variable-to-strategy map
Text-based causal inference is not new identification magic. It turns text into clean D / Y / X measurements and hands them to designs you already know. Here are the common mappings.
01 / Text = treatment → continuous DiD / event study
Use text intensity as a continuous treatment; the panel and timing identify dynamic effects. Text only measures D.
D_it=g(text_it) -> Y_it=a_i+b_t+tau·D_it+e_it
02 / Text = confounder → text matching / DML
When text drives both D and Y, use the representation as a control for matching or residualize it in DML, with overlap checks.
tau = E[ E[Y|D=1, e(text)] − E[Y|D=0, e(text)] ]
03 / Text = outcome → standard design + reliability
Turn text into an outcome measure and plug it into an existing RCT / DiD / IV design, but report measurement reliability and attenuation.
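One common way to report reliability is chance-corrected agreement between two independent codings of the same documents. The sketch below computes Cohen's kappa by hand on ten made-up binary labels; the arrays `human` and `model` are hypothetical, and kappa is only one of several possible reliability statistics.

```python
import numpy as np

# Hypothetical binary labels for the same 10 documents from two independent coders.
human = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0])
model = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])


def cohen_kappa(a, b):
    """Cohen's kappa: agreement beyond what the label frequencies alone would produce."""
    labels = np.union1d(a, b)
    p_observed = np.mean(a == b)
    p_expected = sum(np.mean(a == k) * np.mean(b == k) for k in labels)
    return (p_observed - p_expected) / (1 - p_expected)


kappa = cohen_kappa(human, model)
print(f"raw agreement = {np.mean(human == model):.2f}, kappa = {kappa:.2f}")
# raw agreement = 0.80, kappa = 0.60
```

Raw agreement of 0.80 drops to a kappa of 0.60 once chance agreement (0.50 here) is removed, which is the figure worth reporting alongside the effect estimate.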
04 / Two independent measures → IV for attenuation
Use a second independent text measure as an instrument to correct single-measure attenuation.
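Why this works can be seen in a short derivation. It is a sketch under the classical measurement-error assumptions: two measures D_1 and D_2 of the same construct D^* (written D_star elsewhere on this page), with errors that are uncorrelated with each other, with D^*, and with the outcome noise u. The subscripted names are introduced here for illustration.

```latex
\begin{aligned}
D_1 &= D^{*} + \varepsilon_1, \qquad D_2 = D^{*} + \varepsilon_2, \qquad Y = \beta D^{*} + u, \\
\operatorname{plim}\,\hat\beta_{\mathrm{OLS}}
  &= \frac{\operatorname{Cov}(D_1, Y)}{\operatorname{Var}(D_1)}
   = \beta\,\frac{\operatorname{Var}(D^{*})}{\operatorname{Var}(D^{*}) + \operatorname{Var}(\varepsilon_1)}, \\
\operatorname{plim}\,\hat\beta_{\mathrm{IV}}
  &= \frac{\operatorname{Cov}(D_2, Y)}{\operatorname{Cov}(D_2, D_1)}
   = \frac{\beta\,\operatorname{Var}(D^{*})}{\operatorname{Var}(D^{*})}
   = \beta .
\end{aligned}
```

The noise variance appears in the OLS denominator but cancels in the IV ratio, because the error in D_2 is uncorrelated with both Y and the error in D_1.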
Three red lines: (1) measurement error attenuates, so use reliability / repeated measures / IV; (2) guard against post-treatment text leakage by building pre-treatment variables only from pre-treatment text; (3) never estimate the effect with an outcome-supervised text representation on the same sample: always split.
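The second red line is mostly a data-handling discipline. A minimal sketch, assuming each document carries a unit identifier and a date and each unit has a known treatment date (the records, names, and dates below are invented for illustration):

```python
from datetime import date

# Hypothetical records: (unit, document date, text).
documents = [
    ("firm_a", date(2019, 3, 1), "stable demand, no restructuring planned"),
    ("firm_a", date(2020, 6, 1), "restructuring after the new rule"),
    ("firm_b", date(2019, 11, 15), "expanding headcount"),
    ("firm_b", date(2021, 1, 10), "costs rose following the reform"),
]
treatment_date = {"firm_a": date(2020, 1, 1), "firm_b": date(2020, 7, 1)}


def pre_treatment_docs(docs, treat_dates):
    """Keep only documents dated strictly before the unit's own treatment date."""
    return [(unit, day, text) for unit, day, text in docs if day < treat_dates[unit]]


pre = pre_treatment_docs(documents, treatment_date)
for unit, day, text in pre:
    print(unit, day.isoformat(), "->", text)
```

Only the filtered documents should feed confounder or baseline variables. Text written after treatment may already reflect the treatment's effects, and conditioning on it can bias the estimate.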
06 / Risks
Common Pitfalls
References
- Gentzkow, Kelly, and Taddy (2019), Text as Data, Journal of Economic Literature. https://doi.org/10.1257/jel.20181020
- Roberts, Stewart, and Nielsen (2020), Adjusting for Confounding with Text Matching, AJPS. https://doi.org/10.1111/ajps.12526
- Veitch, Sridhar, and Blei (2020), Adapting Text Embeddings for Causal Inference. https://arxiv.org/abs/1905.12741
- Egami et al. (2022), How to Make Causal Inferences Using Texts, Science Advances. https://doi.org/10.1126/sciadv.abg2652
- Ash and Hansen (2023), Text Algorithms in Economics, Annual Review of Economics. https://doi.org/10.1146/annurev-economics-082222-074352