LLMs as Measurement: Bias and Correction When Using Predicted Variables Downstream
Labeling huge samples with an LLM is tempting, but plugging predicted variables into a regression as if they were truth introduces systematic bias. This page covers the bias and three corrections: gold-standard validation, PPI, and DSL.
[Schematic: the principle at a glance]
Start Here
What you should be able to do
Know that LLM measurement error is usually not classical (mean-zero, independent); it can be systematic and correlated with X.
Understand that using predicted variables directly is both biased and over-confident.
Use a small randomly labeled gold-standard sample for bias correction.
Understand Prediction-Powered Inference (PPI): correct the prediction bias term with the gold standard.
Understand design-based supervised learning (DSL): reweight by known sampling probabilities.
Learning Path
Learning path: from cheap predictions to credible inference
Follow this path: recognize that LLM error is systematic, see the naive regression bias, then correct with a gold standard plus PPI/DSL.
Step 1
Predict
The LLM cheaply labels the full sample as Y_hat.
Y_hat=f(text)
Step 2
Bias
Check whether the error correlates with X (systematic).
e=Y_hat−Y
Step 3
Gold
Randomly sample a human-labeled gold-standard subset.
L: (Y, Y_hat)
Step 4
Correct
Remove the bias term with PPI / DSL.
theta_PPI
Step 5
Report
Report agreement, sampling design, and valid CIs.
valid CI
01 / Intuition
Core Intuition
Treat an LLM as a cheap but biased measuring instrument: it measures a lot, but the readings drift systematically.
Regressing on the LLM label Y_hat treats measurement error as real signal, so bias enters the coefficient; ignoring label uncertainty also makes standard errors too small.
The shared fix: pay to label a small random gold-standard subsample and use it to estimate and remove the LLM's systematic bias — predictions give scale, the gold standard gives accuracy.
02 / Math
From predicted labels to unbiased downstream estimates
01 / Predicted variable
The LLM maps inputs to a predicted label Y_hat=f(text). Its gap from the truth Y is measurement error e=Y_hat−Y, generally non-zero-mean and possibly correlated with X.
Y_hat = f(text), e = Y_hat − Y
02 / Bias of the naive regression
Regressing Y_hat on X estimates the relation of X to Y_hat, not to the true Y; the bias is proportional to the relation of X to the measurement error e. When error correlates with X, this is not simple attenuation but directionally ambiguous bias.
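The decomposition can be checked numerically. The sketch below uses simulated data (the coefficients 1.0, 0.6, and -0.3 are illustrative assumptions, chosen to match the simulated cases later on this page) and shows that the OLS slope of Y_hat on X equals the slope of Y on X plus the slope of e on X. The identity holds exactly in-sample because the OLS slope is linear in the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                   # true outcome, beta = 1.0 (assumed)
e = 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)  # systematic error, correlated with X
Y_hat = Y + e                                      # simulated LLM label


def slope(x, y):
    """OLS slope of y on x: sample Cov(x, y) / sample Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


b_true = slope(X, Y)       # slope using the true outcome
b_err = slope(X, e)        # Cov(X, e) / Var(X), the bias term
b_naive = slope(X, Y_hat)  # slope using the LLM label

print("slope of Y on X     :", round(b_true, 3))
print("slope of e on X     :", round(b_err, 3))
print("slope of Y_hat on X :", round(b_naive, 3))
print("sum of first two    :", round(b_true + b_err, 3))
```

The sign of the bias follows the sign of Cov(X, e), which is why the direction cannot be guessed without validation data.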
plim(beta_naive) = beta + Cov(X, e)/Var(X)
03 / Gold-standard correction
On a randomly drawn, human-labeled subsample with both true Y and predicted Y_hat, estimate and remove the LLM bias structure. The correction is only as reliable as the gold standard is random and representative of the full sample.
labeled set L: (Y_i, Y_hat_i, X_i)
04 / Prediction-Powered Inference
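The PPI point estimate equals the full-sample estimate using Y_hat minus a correction term: the gold-standard estimate using Y_hat minus the gold-standard estimate using Y. When the gold standard is a random sample, the correction removes the prediction bias, and the result is typically tighter than using the small gold standard alone, provided the predictions are informative about Y.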
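For the simplest estimand, a mean, the PPI estimate and a normal-approximation interval can be sketched as follows. This is a minimal illustration on simulated data, not a reference implementation: the sample sizes, the data-generating process, and the choice to compute the prediction term on the unlabeled remainder (so the two terms are independent) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20000, 500                      # unlabeled and labeled sample sizes (assumed)
X = rng.normal(size=N + n)
Y = 2.0 + 1.0 * X + rng.normal(size=N + n)                     # true mean of Y is 2.0
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=N + n)  # biased LLM label

# Random gold-standard subset: true Y is treated as observed only where lab is True.
lab = np.zeros(N + n, dtype=bool)
lab[rng.choice(N + n, size=n, replace=False)] = True

# Rectifier: prediction error measured on the gold standard.
rectifier = Y_hat[lab] - Y[lab]

theta_naive = Y_hat.mean()                          # treats predictions as truth
theta_gold = Y[lab].mean()                          # gold standard alone
theta_ppi = Y_hat[~lab].mean() - rectifier.mean()   # prediction term minus estimated bias

se_gold = np.sqrt(Y[lab].var(ddof=1) / n)
se_ppi = np.sqrt(Y_hat[~lab].var(ddof=1) / N + rectifier.var(ddof=1) / n)

for name, est, se in [("gold only", theta_gold, se_gold), ("PPI", theta_ppi, se_ppi)]:
    print(f"{name:9s}: {est:.3f}  95% CI [{est - 1.96 * se:.3f}, {est + 1.96 * se:.3f}]")
print(f"naive    : {theta_naive:.3f}  (no valid interval without correction)")
```

The PPI interval is narrower here because the prediction error has lower variance than Y itself; if the LLM labels were uninformative, the gain would vanish.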
theta_PPI = theta(Y_hat; all) − [theta(Y_hat; L) − theta(Y; L)]
05 / DSL: design-based supervised learning
Reweight by the known (researcher-designed) labeling probability pi to build a moment condition robust to LLM error, yielding consistent estimates and valid standard errors.
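One way to picture the reweighting is through a design-adjusted pseudo-outcome that equals Y_hat for unlabeled units and is corrected by the inverse-probability-weighted error for labeled ones. The sketch below is a simplified illustration under assumed labeling probabilities (0.10 and 0.05) and a simulated data-generating process; it uses the LLM label directly as the prediction and a plain heteroskedasticity-robust (HC0) standard error, and it is not the full DSL procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                           # true slope is 1.0 (assumed)
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)  # biased LLM label

# Researcher-designed labeling probabilities (hypothetical design: oversample X > 0).
pi = np.where(X > 0, 0.10, 0.05)
R = rng.random(n) < pi          # R_i is True if unit i is sent to human coders

# Pseudo-outcome: Y_hat for unlabeled units, Y_hat - (Y_hat - Y) / pi for labeled ones.
# True Y is only touched where R is True. Its expectation over the design equals Y.
Y_tilde = Y_hat.copy()
Y_tilde[R] -= (Y_hat[R] - Y[R]) / pi[R]

b_naive = np.polyfit(X, Y_hat, 1)[0]
fit = np.polyfit(X, Y_tilde, 1)
b_dsl = fit[0]

# HC0 robust standard error for the slope of a simple regression with intercept.
Xc = X - X.mean()
resid = Y_tilde - np.polyval(fit, X)
se_dsl = np.sqrt(np.sum(Xc**2 * resid**2)) / np.sum(Xc**2)

print("naive slope:", round(b_naive, 3))
print("DSL-style slope:", round(b_dsl, 3), "+/-", round(1.96 * se_dsl, 3))
```

Because the weights 1/pi_i are fixed by the researcher's design, the correction does not depend on any assumption about how accurate the LLM is; lower labeling probabilities simply cost precision.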
weight 1/pi_i on labeled units in the moment condition
03 / Code
Code cases: from biased LLM labels to gold-standard correction
Simulate a systematic LLM error correlated with a covariate, show the naive regression bias, then correct with a gold-standard / PPI approach.
Case 1: LLM error is systematic, not classical noise
Classical measurement error is mean-zero and independent; LLM error often correlates with covariates and is directional.
import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=6)
Y = 1.0 * X
Y_hat = Y + 0.6 * X - 0.3 # error depends on X
print("error e = Y_hat - Y:", np.round(Y_hat - Y, 2))
print("corr(e, X) is high -> not mean-zero independent noise")
Expected output
error e = Y_hat - Y: [ 0.34 -0.64 0.27 -0.99 0.13 0.02]
corr(e, X) is high -> not mean-zero independent noise
How to read this code
- The error correlates with X, so it is not classical measurement error.
- Such error pushes bias directly into the coefficient, not necessarily as attenuation.
- Bigger samples alone cannot remove it; correction is required.
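In practice the error is only observable on a human-labeled subset, so the check has to be run there. A minimal diagnostic, assuming a simulated gold-standard sample of 300 units with the same illustrative error structure, regresses e on X: a non-zero intercept signals a level shift, a non-zero slope signals correlation with X.

```python
import numpy as np

rng = np.random.default_rng(2)
n_gold = 300                                   # size of the gold-standard sample (assumed)
X = rng.normal(size=n_gold)
Y = 1.0 * X + rng.normal(size=n_gold)          # human-coded truth
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n_gold)  # LLM label
e = Y_hat - Y                                  # measurement error, observable on gold only

# Under classical error both the intercept and the slope would be near zero.
slope, intercept = np.polyfit(X, e, 1)
corr = np.corrcoef(X, e)[0, 1]

print("mean error        :", round(e.mean(), 3))
print("intercept of e ~ X:", round(intercept, 3))
print("slope of e ~ X    :", round(slope, 3))
print("corr(e, X)        :", round(corr, 3))
```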
Case 2: naive regression is biased by systematic error
Regressing the LLM label on X moves the coefficient away from the truth.
import numpy as np
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)
print("true beta = 1.00")
print("naive beta =", round(np.polyfit(X, Y_hat, 1)[0], 3))
Expected output
true beta = 1.00
naive beta = 1.60
How to read this code
- Using predicted labels as truth moves the coefficient from 1.0 to about 1.6.
- The bias comes from the correlation of X with measurement error.
- Ignoring label uncertainty would also make the interval too narrow.
Case 3: gold-standard / PPI correction recovers the truth
Use a small random human-labeled sample to estimate and remove the bias term.
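# Continues from Case 2: reuses rng, n, X, Y, and Y_hat defined there.
L = rng.choice(n, size=300, replace=False)  # random gold-standard subsample
b_all = np.polyfit(X, Y_hat, 1)[0]  # slope using LLM labels, full sample
b_yhat_L = np.polyfit(X[L], Y_hat[L], 1)[0]  # slope using LLM labels, gold subsample
b_y_L = np.polyfit(X[L], Y[L], 1)[0]  # slope using human labels, gold subsample
b_ppi = b_all - (b_yhat_L - b_y_L)  # full-sample estimate minus estimated bias
print("PPI-corrected beta =", round(b_ppi, 3))
Expected output
PPI-corrected beta = 1.01
How to read this code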
- The correction term uses the gold-standard gap between the predicted and true estimates.
- After correction the coefficient returns to about 1.0.
- The gold standard supplies accuracy and the full-sample predictions supply scale.
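A single draw cannot show whether the correction is also more precise than the gold standard alone. The sketch below repeats the Case 3 recipe over 300 simulated datasets (the number of repetitions is an arbitrary choice of this sketch) and compares the spread of the gold-only slope with the spread of the PPI-corrected slope.

```python
import numpy as np


def one_draw(rng, n=4000, n_gold=300):
    """Simulate one dataset and return (gold-only slope, PPI-corrected slope)."""
    X = rng.normal(size=n)
    Y = 1.0 * X + rng.normal(size=n)
    Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)
    L = rng.choice(n, size=n_gold, replace=False)  # random gold-standard subsample
    b_all = np.polyfit(X, Y_hat, 1)[0]
    b_yhat_L = np.polyfit(X[L], Y_hat[L], 1)[0]
    b_y_L = np.polyfit(X[L], Y[L], 1)[0]
    return b_y_L, b_all - (b_yhat_L - b_y_L)


rng = np.random.default_rng(0)
draws = np.array([one_draw(rng) for _ in range(300)])
gold_only, ppi = draws[:, 0], draws[:, 1]

print("gold-only: mean", round(gold_only.mean(), 3), "sd", round(gold_only.std(ddof=1), 3))
print("PPI      : mean", round(ppi.mean(), 3), "sd", round(ppi.std(ddof=1), 3))
```

Both estimators center on the true slope of 1.0, but the PPI version varies less across draws because the error Y_hat − Y is less noisy than Y in this simulation.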
04 / Case
Case: LLM-coded "populist tone" of speeches as an outcome
- Question: did an event raise the populist tone of politicians' speeches? Hand-coding tens of thousands of speeches is too costly.
- Use an LLM to score all speeches for populist tone as outcome Y_hat — cheap but possibly biased high or low for certain speaker types.
- Randomly sample a few hundred for human gold-standard coding, and use PPI or DSL to correct the LLM bias for an unbiased treatment effect with valid intervals.
- A credible report states the labeling prompt and version, gold-standard sampling probabilities, LLM-human agreement, the correction method, and robustness to prompt drift.
05 / Causal
Plugging into causal designs: measured D / Y / X all need correction
When LLM measurement enters causal research, whether it plays treatment, outcome, or confounder, prediction error propagates into the treatment-effect estimate. PPI / DSL apply not only to means or regression coefficients but also to treatment-effect estimands.
01 / LLM-measured Y → corrected effect
Plug LLM outcome scores into an RCT / DiD and use the gold standard to remove the systematic bias from the effect estimate.
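For a randomized experiment with a difference-in-means estimand, the same correction can be sketched as below. The setup is simulated and the numbers are assumptions: a true effect of 0.5, and an LLM that over-scores treated units by 0.3, which is exactly the kind of arm-specific error that contaminates the naive effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_gold = 10000, 500
D = rng.integers(0, 2, size=n)                        # randomized binary treatment
Y = 0.5 * D + rng.normal(size=n)                      # true effect tau = 0.5 (assumed)
Y_hat = Y + 0.3 * D + rng.normal(scale=0.5, size=n)   # LLM over-scores treated units


def diff_in_means(y, d):
    """Treated-minus-control difference in mean outcomes."""
    return y[d == 1].mean() - y[d == 0].mean()


L = rng.choice(n, size=n_gold, replace=False)         # random gold-standard subsample

tau_naive = diff_in_means(Y_hat, D)                   # uses LLM scores as if true
tau_gold = diff_in_means(Y[L], D[L])                  # human-coded subsample alone
tau_ppi = tau_naive - (diff_in_means(Y_hat[L], D[L]) - tau_gold)

print("naive tau    :", round(tau_naive, 3))
print("gold-only tau:", round(tau_gold, 3))
print("PPI tau      :", round(tau_ppi, 3))
```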
tau_PPI = tau(Y_hat; all) − [tau(Y_hat; L) − tau(Y; L)]
02 / LLM-measured D → noisy treatment
LLM-assigned treatment status carries error that attenuates or distorts the effect; use reliability assessment or a second measure as an instrument.
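The attenuation part is easy to see in a simulation. The sketch below assumes the simplest possible error: the LLM flips the true binary treatment status 10% of the time, independently of everything else. It illustrates the problem only, not the correction, and real LLM misclassification may well depend on covariates, in which case the distortion need not be pure attenuation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
D = rng.integers(0, 2, size=n)              # true treatment status, balanced arms
Y = 1.0 * D + rng.normal(size=n)            # true effect is 1.0 (assumed)

flip = rng.random(n) < 0.10                 # 10% misclassification rate (assumed)
D_hat = np.where(flip, 1 - D, D)            # LLM-assigned treatment status

tau_true_D = Y[D == 1].mean() - Y[D == 0].mean()
tau_noisy_D = Y[D_hat == 1].mean() - Y[D_hat == 0].mean()

print("effect using true D     :", round(tau_true_D, 3))
print("effect using LLM-coded D:", round(tau_noisy_D, 3))
```

With balanced arms and a symmetric flip rate of 0.10, the estimate shrinks by a factor of about 1 − 2 × 0.10 = 0.8, since each LLM-coded group is a 90/10 mixture of the true groups.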
03 / LLM-measured X → under-adjustment risk
Using an LLM-measured confounder as a control leaves residual confounding if the measure is inaccurate (under-adjustment); validate against a gold standard.
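Residual confounding from a noisy control can be illustrated with a small simulation. All quantities here are assumptions of the sketch: a single confounder C, a true effect of 1.0, and an LLM measure C_hat that adds independent unit-variance noise to C.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
C = rng.normal(size=n)                       # true confounder
D = C + rng.normal(size=n)                   # treatment depends on the confounder
Y = 1.0 * D + 2.0 * C + rng.normal(size=n)   # true effect of D is 1.0 (assumed)
C_hat = C + rng.normal(size=n)               # noisy LLM measure of the confounder


def coef_on_D(control):
    """OLS coefficient on D when regressing Y on an intercept, D, and one control."""
    design = np.column_stack([np.ones(n), D, control])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return beta[1]


b_unadjusted = np.polyfit(D, Y, 1)[0]        # no control at all
b_true_control = coef_on_D(C)                # controlling for the true confounder
b_llm_control = coef_on_D(C_hat)             # controlling for the LLM measure

print("no control        :", round(b_unadjusted, 3))
print("control for C     :", round(b_true_control, 3))
print("control for C_hat :", round(b_llm_control, 3))
```

Under these assumptions the noisy control removes only part of the confounding: the coefficient lands between the unadjusted estimate and the truth rather than at the truth.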
04 / Design before scale
Fix the gold-standard sampling design and correction method before scaling LLM labeling — otherwise scale just amplifies systematic bias.
Three red lines: (1) LLM error is usually systematic and X-correlated, so more data does not remove it; (2) keep a random gold standard for correction and uncertainty quantification; (3) prompt/model version drift changes the measurement scheme, so pin and record versions.
06 / Risks
Common Pitfalls
References
- Angelopoulos et al. (2023), Prediction-Powered Inference, Science. https://doi.org/10.1126/science.adi6000
- Egami, Hinck, Stewart, and Wei (2023), Using Imperfect Surrogates for Downstream Inference (DSL), NeurIPS. https://arxiv.org/abs/2306.04746
- Grimmer, Roberts, and Stewart (2022), Text as Data, Princeton University Press. https://press.princeton.edu/books/hardcover/9780691207544/text-as-data
- Gentzkow, Kelly, and Taddy (2019), Text as Data, Journal of Economic Literature. https://doi.org/10.1257/jel.20181020