
LLMs as Measurement: Bias and Correction When Using Predicted Variables Downstream

Labeling huge samples with an LLM is tempting, but plugging predicted variables into a regression as if they were truth introduces systematic bias. This page covers the bias and three corrections: gold-standard validation, PPI, and DSL.

LLMs cheaply annotate text and images into variables, but their errors are often systematic and correlated with covariates. Using predicted labels as truth in a regression biases coefficients and makes confidence intervals too narrow. The fix is a small, randomly sampled, human-labeled gold-standard sample used for correction.

Schematic

The principle at a glance

[Figure: LLM measurement + gold-standard correction. The full unlabeled sample goes through the LLM to give Y_hat (cheap, biased); a random human-labeled gold standard feeds a PPI / DSL correction that removes the bias, yielding an unbiased θ̂ with a valid CI. Caution: LLM error is often systematic and X-correlated, so using it as truth is biased and over-confident.]
An LLM cheaply labels the full sample as Y_hat (large scale but systematically biased); a small random human-labeled gold standard (accurate) removes the bias via PPI / DSL, giving an unbiased estimate with valid intervals. Predictions give scale, the gold standard gives accuracy.

Start Here

What you should be able to do

01

Know that LLM measurement error is usually not classical (mean-zero, independent); it can be systematic and correlated with X.

02

Understand that using predicted variables directly is both biased and over-confident.

03

Use a small randomly labeled gold-standard sample for bias correction.

04

Understand Prediction-Powered Inference (PPI): correct the prediction bias term with the gold standard.

05

Understand design-based supervised learning (DSL): reweight by known sampling probabilities.

Learning Path

Learning path: from cheap predictions to credible inference

Follow this path: recognize that LLM error is systematic, see the naive regression bias, then correct with a gold standard plus PPI/DSL.

  1. Step 1

    Predict

    The LLM cheaply labels the full sample as Y_hat.

    Y_hat=f(text)

  2. Step 2

    Bias

    Check whether the error correlates with X (systematic).

    e=Y_hat−Y

  3. Step 3

    Gold

    Randomly sample a human-labeled gold-standard subset.

    L: (Y, Y_hat)

  4. Step 4

    Correct

    Remove the bias term with PPI / DSL.

    theta_PPI

  5. Step 5

    Report

    Report agreement, sampling design, and valid CIs.

    valid CI

01 / Intuition

Core Intuition

Treat an LLM as a cheap but biased measuring instrument: it measures a lot, but the readings drift systematically.

Using the LLM label Y_hat in a regression as if it were Y treats measurement error as real signal, so bias enters the coefficient; ignoring label uncertainty also makes standard errors too small.

The shared fix: pay to label a small random gold-standard subsample and use it to estimate and remove the LLM's systematic bias. Predictions give scale; the gold standard gives accuracy.

02 / Math

From predicted labels to unbiased downstream estimates

01 / Predicted variable

The LLM maps inputs to a predicted label Y_hat=f(text). Its gap from the truth Y is measurement error e=Y_hat−Y, generally non-zero-mean and possibly correlated with X.

Y_hat = f(text),  e = Y_hat − Y

02 / Bias of the naive regression

Regressing Y_hat on X estimates the relation of X to Y_hat, not to the true Y. Because Y_hat = Y + e, the naive slope is the slope of Y on X plus the slope of the measurement error e on X, so the bias equals Cov(X, e)/Var(X). When the error correlates with X, this is not simple attenuation: the bias can push the coefficient in either direction.

plim(beta_naive) = beta + Cov(X, e)/Var(X)
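
The decomposition above can be checked numerically. The following is a minimal sketch on simulated data; the error structure (slope 0.6 on X, shift of −0.3) mirrors the code cases on this page, and the helper name slope is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                     # true beta = 1.0
e = 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)    # error depends on X
Y_hat = Y + e                                        # simulated LLM label

def slope(x, y):
    """OLS slope of y on x (regression with an intercept)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

beta_fit = slope(X, Y)          # slope using the true outcome
bias_term = slope(X, e)         # sample analogue of Cov(X, e) / Var(X)
beta_naive = slope(X, Y_hat)    # slope using the LLM label

print("slope of Y on X     :", round(beta_fit, 3))
print("Cov(X, e) / Var(X)  :", round(bias_term, 3))
print("slope of Y_hat on X :", round(beta_naive, 3))
```

Because covariance is linear in its arguments, the naive slope equals the sum of the other two quantities in any sample, not only in the limit.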

03 / Gold-standard correction

On a randomly drawn, human-labeled subsample with both true Y and predicted Y_hat, estimate and remove the LLM bias structure. The correction is only as reliable as the gold standard is random and representative of the full sample.

labeled set L: (Y_i, Y_hat_i, X_i)

04 / Prediction-Powered Inference

The PPI point estimate equals the full-sample estimate using Y_hat minus a correction (the gold-standard estimate using Y_hat minus the one using Y). For a mean it is unbiased as long as the gold standard is a random sample, and when Y_hat is informative about Y it is tighter than using the small gold standard alone.

theta_PPI = theta(Y_hat; all) − [theta(Y_hat; L) − theta(Y; L)]
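
For a mean, the formula above and a normal-approximation interval can be sketched as follows. This is an illustration on simulated data, not a reference implementation: the sample sizes, the helper draw, and the choice to use only the unlabeled units in the first term (so that the two terms are independent and their variances simply add) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20_000, 400          # unlabeled and gold-standard sample sizes (illustrative)

def draw(size):
    """Simulate the truth Y and a systematically biased LLM score Y_hat."""
    X = rng.normal(size=size)
    Y = 2.0 + X + rng.normal(size=size)                          # E[Y] = 2.0
    Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=size)
    return Y, Y_hat

_, Yhat_unlab = draw(N)      # truth is never observed for these units
Y_lab, Yhat_lab = draw(n)    # gold standard: both Y and Y_hat observed

naive = Yhat_unlab.mean()                  # theta(Y_hat; unlabeled)
rectifier = (Yhat_lab - Y_lab).mean()      # theta(Y_hat; L) - theta(Y; L)
theta_ppi = naive - rectifier

# Variance of each term, added because the two samples are independent
se_ppi = np.sqrt(Yhat_unlab.var(ddof=1) / N + (Yhat_lab - Y_lab).var(ddof=1) / n)
se_gold = Y_lab.std(ddof=1) / np.sqrt(n)   # gold standard used on its own

print("naive mean of Y_hat:", round(naive, 3))
print("PPI mean:", round(theta_ppi, 3),
      "95% CI:", (round(theta_ppi - 1.96 * se_ppi, 3), round(theta_ppi + 1.96 * se_ppi, 3)))
print("SE, PPI vs gold-only:", round(se_ppi, 3), round(se_gold, 3))
```

In this simulation the PPI standard error is smaller than the gold-only one because the error Y_hat − Y varies less than Y itself; with a poor predictor that ordering can reverse.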

05 / DSL: design-based supervised learning

Reweight by the known (researcher-designed) labeling probability pi to build a moment condition robust to LLM error, yielding consistent estimates and valid standard errors.

weight 1/pi_i on labeled units in the moment
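
One way to make the reweighting concrete is a design-adjusted pseudo-outcome, Y_tilde = Y_hat − (R/pi)(Y_hat − Y), where R marks the human-labeled units. The sketch below is a simplified illustration in the spirit of DSL, not the published estimator or any package's API: it uses the raw LLM label in place of a fitted prediction step, reports no standard errors, and the two-level labeling design and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                              # true beta = 1.0
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)     # biased LLM label

# Known labeling design chosen by the researcher: label 15% of units with
# X > 0 and 5% of the rest (an arbitrary illustrative design).
pi = np.where(X > 0, 0.15, 0.05)
R = rng.random(n) < pi                  # True where the unit is human-labeled

# Y is only observed where R is True; elsewhere the placeholder 0.0 is
# multiplied by R / pi = 0 and never enters the pseudo-outcome.
Y_obs = np.where(R, Y, 0.0)
Y_tilde = Y_hat - (R / pi) * (Y_hat - Y_obs)

beta_naive = np.polyfit(X, Y_hat, 1)[0]
beta_dsl = np.polyfit(X, Y_tilde, 1)[0]
print("naive beta:", round(beta_naive, 3))
print("design-adjusted beta:", round(beta_dsl, 3))
```

Since E[R/pi] = 1 under the known design, the pseudo-outcome has the same conditional mean as the true Y whatever the LLM error looks like, which is why the labeling probabilities must be recorded.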

03 / Code

Code cases: from biased LLM labels to gold-standard correction

Simulate a systematic LLM error correlated with a covariate, show the naive regression bias, then correct with a gold-standard / PPI approach.

Case 1: LLM error is systematic, not classical noise

Classical measurement error is mean-zero and independent; LLM error often correlates with covariates and is directional.

import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=6)
Y = 1.0 * X
Y_hat = Y + 0.6 * X - 0.3            # error depends on X
print("error e = Y_hat - Y:", np.round(Y_hat - Y, 2))
print("corr(e, X) is high -> not mean-zero independent noise")

Expected output

error e = Y_hat - Y: [ 0.34 -0.64  0.27 -0.99  0.13  0.02]
corr(e, X) is high -> not mean-zero independent noise

How to read this code

  • Here e = 0.6·X − 0.3 is an exact linear function of X, so corr(e, X) = 1 and the error is not classical measurement error.
  • Such error pushes bias directly into the coefficient, not necessarily as attenuation.
  • Bigger samples alone cannot remove it; correction is required.
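
The last point can be illustrated by growing the sample. This is a minimal sketch under the same simulated error structure as Case 1; the helper name naive_slopes and the classical-noise comparison are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_slopes(n):
    """OLS slopes using a classical-noise label and a systematic-error label."""
    X = rng.normal(size=n)
    Y = 1.0 * X + rng.normal(size=n)                   # true beta = 1.0
    Y_classical = Y + rng.normal(scale=0.5, size=n)    # mean-zero, independent of X
    Y_systematic = Y + 0.6 * X - 0.3                   # error is a function of X
    return (np.polyfit(X, Y_classical, 1)[0],
            np.polyfit(X, Y_systematic, 1)[0])

results = {n: naive_slopes(n) for n in (1_000, 100_000, 1_000_000)}
for n, (b_classical, b_systematic) in results.items():
    print(f"n={n:>9,}  classical: {b_classical:.3f}  systematic: {b_systematic:.3f}")
```

With classical noise in the outcome the slope settles near 1.0 as n grows; with X-correlated error it settles near 1.6, so extra data only makes the wrong answer more precise.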

Case 2: naive regression is biased by systematic error

Regressing the LLM label Y_hat on X moves the coefficient away from the truth.

import numpy as np
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                            # true beta = 1.0
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)   # LLM label: error loads on X
print("true beta = 1.00")
# np.polyfit(..., 1) returns [slope, intercept]; take the slope of Y_hat on X
print("naive beta =", round(np.polyfit(X, Y_hat, 1)[0], 3))

Expected output

true beta = 1.00
naive beta = 1.60

How to read this code

  • Using predicted labels as truth moves the coefficient from 1.0 to about 1.6.
  • The bias of about 0.6 is Cov(X, e)/Var(X): the simulated error loads on X with slope 0.6.
  • Ignoring label uncertainty would also make the interval too narrow.

Case 3: gold-standard / PPI correction recovers the truth

Use a small random human-labeled sample to estimate and remove the bias term.

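# Continues from Case 2: reuses rng, n, X, Y, and Y_hat defined there.
L = rng.choice(n, size=300, replace=False)      # random gold-standard indices
b_all = np.polyfit(X, Y_hat, 1)[0]              # full-sample slope using Y_hat
b_yhat_L = np.polyfit(X[L], Y_hat[L], 1)[0]     # gold-sample slope using Y_hat
b_y_L = np.polyfit(X[L], Y[L], 1)[0]            # gold-sample slope using true Y
b_ppi = b_all - (b_yhat_L - b_y_L)              # subtract the estimated bias
print("PPI-corrected beta =", round(b_ppi, 3))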

Expected output

PPI-corrected beta = 1.01

How to read this code

  • The correction term uses the gold-standard gap between the predicted and true estimates.
  • After correction the coefficient returns to about 1.0.
  • The gold standard supplies accuracy and the full-sample predictions supply scale.
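
Case 3 reports only a point estimate. One simple way to attach an interval is a bootstrap that resamples the gold-standard and remaining units separately. This is an assumed approach for illustration, not the analytic PPI interval, and the names gold, rest, and ppi_slope are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_gold, B = 4000, 300, 500
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)
Y_hat = Y + 0.6 * X - 0.3 + rng.normal(scale=0.5, size=n)

gold = rng.choice(n, size=n_gold, replace=False)    # human-labeled units
rest = np.setdiff1d(np.arange(n), gold)             # LLM label only

def slope(idx, y):
    """OLS slope of y on X over the units in idx."""
    return np.polyfit(X[idx], y[idx], 1)[0]

def ppi_slope(gold_idx, rest_idx):
    """Full-sample Y_hat slope minus the gold-standard bias estimate."""
    all_idx = np.concatenate([gold_idx, rest_idx])
    # The true Y is only ever read at gold-standard indices.
    return slope(all_idx, Y_hat) - (slope(gold_idx, Y_hat) - slope(gold_idx, Y))

b_ppi = ppi_slope(gold, rest)

# Resample each stratum with replacement and recompute the estimate
boot = np.array([
    ppi_slope(rng.choice(gold, size=gold.size), rng.choice(rest, size=rest.size))
    for _ in range(B)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print("PPI-corrected beta:", round(b_ppi, 3))
print("95% bootstrap CI:", (round(ci_low, 3), round(ci_high, 3)))
```

Most of the interval's width comes from the 300 gold-standard units, which is the price of estimating the bias rather than assuming it away.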

04 / Case

Case: LLM-coded "populist tone" of speeches as an outcome

  • Question: did an event raise the populist tone of politicians' speeches? Hand-coding tens of thousands of speeches is too costly.
  • Use an LLM to score all speeches for populist tone as outcome Y_hat — cheap but possibly biased high or low for certain speaker types.
  • Randomly sample a few hundred for human gold-standard coding, and use PPI or DSL to correct the LLM bias for an unbiased treatment effect with valid intervals.
  • A credible report states the labeling prompt and version, gold-standard sampling probabilities, LLM-human agreement, the correction method, and robustness to prompt drift.
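
The agreement figures in such a report can be computed on the gold-standard sample. The sketch below uses simulated binary codes, with raw agreement and Cohen's kappa as two common summaries; the speaker-type split and the flip rates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gold = 500
group = rng.integers(0, 2, size=n_gold)       # hypothetical speaker type (0 or 1)
human = rng.random(n_gold) < 0.4              # human code: speech is populist

# Hypothetical LLM that disagrees with 5% of human codes for group 0
# and 25% for group 1, i.e. its error depends on a covariate.
flip = rng.random(n_gold) < np.where(group == 1, 0.25, 0.05)
llm = np.where(flip, ~human, human)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two binary codings."""
    p_obs = np.mean(a == b)
    p_chance = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))
    return (p_obs - p_chance) / (1 - p_chance)

agree_all = np.mean(llm == human)
agree_g0 = np.mean(llm[group == 0] == human[group == 0])
agree_g1 = np.mean(llm[group == 1] == human[group == 1])
kappa = cohen_kappa(llm, human)

print("overall agreement:", round(agree_all, 3), " kappa:", round(kappa, 3))
print("agreement by speaker type:", round(agree_g0, 3), round(agree_g1, 3))
```

Reporting agreement by subgroup matters because a high overall figure can hide exactly the covariate-dependent error that biases downstream estimates.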

05 / Causal

Plugging into causal designs: measured D / Y / X all need correction

When LLM measurement enters causal research, whether it plays treatment, outcome, or confounder, prediction error propagates into the treatment-effect estimate. PPI / DSL apply not only to means or regression coefficients but also to treatment-effect estimands.

01 / LLM-measured Y → corrected effect

Plug LLM outcome scores into an RCT / DiD and use the gold standard to remove the systematic bias from the effect estimate.

tau_PPI = tau(Y_hat; all) − [tau(Y_hat; L) − tau(Y; L)]
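
The formula above can be sketched for a randomized experiment with a difference-in-means estimator. The simulation assumes, for illustration only, that the LLM over-scores treated units by 0.3; diff_in_means is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_gold = 10_000, 500
D = rng.integers(0, 2, size=n)                  # randomized treatment
Y = 0.5 * D + rng.normal(size=n)                # true effect tau = 0.5
# Hypothetical LLM score whose error differs between treated and control units
Y_hat = Y + 0.3 * D - 0.1 + rng.normal(scale=0.5, size=n)
gold = rng.choice(n, size=n_gold, replace=False)    # random gold standard

def diff_in_means(y, d):
    """Treated-minus-control difference in mean outcomes."""
    return y[d == 1].mean() - y[d == 0].mean()

tau_naive = diff_in_means(Y_hat, D)                          # tau(Y_hat; all)
correction = (diff_in_means(Y_hat[gold], D[gold])            # tau(Y_hat; L)
              - diff_in_means(Y[gold], D[gold]))             # tau(Y; L)
tau_ppi = tau_naive - correction

print("naive tau:", round(tau_naive, 3))
print("PPI-corrected tau:", round(tau_ppi, 3))
```

A constant LLM shift would cancel in the difference; what the correction removes is the part of the error that differs by treatment arm.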

02 / LLM-measured D → noisy treatment

LLM-measured treatment status carries classification error that attenuates or otherwise distorts the effect estimate; assess reliability against a gold standard, or use a second independent measure as an instrument.
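
A small simulation shows the attenuation case. It assumes the simplest setting: a binary treatment, half the units treated, and an LLM that misreads 20% of them independently of everything else. Error that depends on the outcome or on covariates can distort the estimate in other directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
D = rng.integers(0, 2, size=n)              # true treatment status
Y = 1.0 * D + rng.normal(size=n)            # true effect = 1.0

flip = rng.random(n) < 0.2                  # LLM misreads 20% of units at random
D_hat = np.where(flip, 1 - D, D)            # LLM-measured treatment

def diff_in_means(y, d):
    """Treated-minus-control difference in mean outcomes."""
    return y[d == 1].mean() - y[d == 0].mean()

tau_true_D = diff_in_means(Y, D)
tau_llm_D = diff_in_means(Y, D_hat)
print("effect using true D:", round(tau_true_D, 3))
print("effect using LLM-measured D:", round(tau_llm_D, 3))
```

With these settings 80% of the units measured as treated are truly treated and 20% of those measured as control are, so the estimate shrinks toward 1.0 × (0.8 − 0.2) = 0.6.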

03 / LLM-measured X → under-adjustment risk

Using an LLM-measured confounder as a control leaves residual confounding if the measure is inaccurate (under-adjustment); validate against a gold standard.
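
Residual confounding from a noisy control can be seen in a short simulation. The confounder C, its noisy LLM measure C_hat, and all coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
C = rng.normal(size=n)                          # true confounder
D = 0.8 * C + rng.normal(size=n)                # treatment depends on C
Y = 1.0 * D + 1.0 * C + rng.normal(size=n)      # true effect of D = 1.0
C_hat = C + rng.normal(scale=1.0, size=n)       # noisy LLM measure of C

def coef_on_D(*controls):
    """OLS coefficient on D, adjusting for the given control columns."""
    Z = np.column_stack([np.ones(n), D, *controls])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

b_none = coef_on_D()            # no adjustment
b_llm = coef_on_D(C_hat)        # adjust for the LLM-measured confounder
b_true = coef_on_D(C)           # adjust for the true confounder

print("no control:", round(b_none, 3))
print("LLM-measured control:", round(b_llm, 3))
print("true control:", round(b_true, 3))
```

Adjusting for the noisy measure removes only part of the confounding, so the estimate lands between the unadjusted and the correctly adjusted coefficient.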

04 / Design before scale

Fix the gold-standard sampling design and correction method before scaling LLM labeling; otherwise scale just amplifies systematic bias.

Three red lines: (1) LLM error is usually systematic and X-correlated, so more data does not remove it; (2) keep a random gold standard for correction and uncertainty quantification; (3) prompt/model version drift changes the measurement scheme, so pin and record versions.

06 / Risks

Common Pitfalls

Treating LLM labels as if they were a gold standard in regressions and ignoring systematic bias.
Assuming LLM error is classical noise and trying to average it out with a bigger sample; systematic error does not average out.
Keeping no gold-standard validation sample, leaving bias unestimable and uncorrectable.
Non-random gold-standard sampling (only labeling easy cases), which biases the correction itself.
Ignoring label uncertainty in standard errors, producing intervals that are too narrow and over-significant.
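
The non-random sampling pitfall can be made concrete with a simulation in which the LLM errs more on hard texts and the gold standard covers only easy ones. The hard/easy split and the error sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_gold = 20_000, 500
hard = rng.random(n) < 0.3                       # hypothetical hard-to-code texts
Y = rng.normal(loc=1.0, size=n)                  # true mean = 1.0
# LLM over-scores hard texts by 0.8 and easy texts by 0.1
Y_hat = Y + np.where(hard, 0.8, 0.1) + rng.normal(scale=0.3, size=n)

def ppi_mean(gold_idx):
    """Full-sample mean of Y_hat minus the bias estimated on the gold sample."""
    return Y_hat.mean() - (Y_hat[gold_idx].mean() - Y[gold_idx].mean())

random_gold = rng.choice(n, size=n_gold, replace=False)
easy_gold = rng.choice(np.flatnonzero(~hard), size=n_gold, replace=False)

ppi_random = ppi_mean(random_gold)
ppi_easy = ppi_mean(easy_gold)
print("naive mean of Y_hat:", round(Y_hat.mean(), 3))
print("PPI with random gold standard:", round(ppi_random, 3))
print("PPI with easy-only gold standard:", round(ppi_easy, 3))
```

The easy-only gold standard sees only the small error on easy texts, so it removes about 0.1 of a bias that averages about 0.31 and leaves the rest in the estimate.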
