Tags: StatsPAI · AI Agent · Python · Causal Inference · Econometrics

StatsPAI: The Statistics Toolkit Built for the AI Agent Era

2026-04-04 · Bryce Wang
```bash
pip install statspai
```

```python
import statspai as sp
```

A Python econometrics and causal inference toolkit designed for the AI Agent era. 150+ functions, unified API, MIT license.

StatsPAI's core positioning: a statistics package built for Agent-driven development. As AI coding tools — Claude Code, Codex, Cursor — increasingly participate in empirical research, the ecosystem needs a unified, stable, and API-standardized Python statistics package to handle computation. StatsPAI is built for exactly this scenario: designed for researchers, but optimized for agents.

Why Python

Stata and R have deep roots in empirical research, but they are both domain-specific languages. Python is a general-purpose language, and this gives it one decisive advantage: new algorithms ship faster in Python than in Stata or R.

In Stata, new methods wait for StataCorp's official releases or community-written .ado files. In R, new packages go through the CRAN review pipeline. In Python's ecosystem, when a new causal inference paper drops, AI coding tools can implement, test, and publish the algorithm to PyPI within days. The iteration speed of scikit-learn, PyTorch, and transformers has already proven this.

StatsPAI's goal is to bring this speed advantage to empirical research: leverage AI-assisted coding to rapidly implement frontier algorithms, so researchers and agents can use them immediately.

And Python naturally connects to data engineering, machine learning, and cloud computing. A researcher can clean data with pandas, run double machine learning with sp.dml(), train a custom model with PyTorch — all in the same environment, with no data export or language switching. That's an experience Stata and R simply cannot offer.

What StatsPAI Does

Anyone in empirical research knows Stata and R. Stata has built a highly unified econometric command system over 40 years — from reg to xtreg to ivregress, consistent syntax, smooth workflow. R's CRAN hosts dozens of causal inference packages — fixest, did, rdrobust, grf — cutting-edge methods with broad coverage, where new paper implementations often land first.

On the Python side, data science infrastructure has long been industry standard — pandas, scikit-learn, TensorFlow, PyTorch form the bedrock of AI. But for empirical research, there's never been a unified, convenient tool. statsmodels handles regression, linearmodels handles panels, EconML handles heterogeneous treatment effects, differences handles DID — each with its own API, its own output format. Running one paper requires installing a pile of packages and playing glue engineer.

StatsPAI takes the best of both worlds — Stata's command system and R's causal inference coverage — and unifies them in a single Python package:

```python
import statspai as sp

# Classical econometrics
sp.regress("wage ~ edu + exp", data=df, robust='hc1')
sp.ivreg("wage ~ (edu ~ parent_edu) + exp", data=df)
sp.panel(df, "wage ~ edu + exp", entity='id', time='year', model='fe')

# Causal inference
sp.did(df, y='wage', treat='policy', time='year', id='worker')
sp.rdrobust(df, y='score', x='running_var', c=0)
sp.synth(df, treat_unit=1, outcome='y', time='year', unit='id')

# ML causal
sp.dml(df, y='wage', treat='training', covariates=['age', 'edu'])
sp.causal_forest("y ~ treatment | x1 + x2 + x3", data=df)

# All results, same interface
result.summary()
result.plot()
result.to_docx()       # Word
result.to_latex()      # LaTeX
result.cite()          # BibTeX for the method

# Publication tables in one line
sp.modelsummary(r1, r2, r3, output='results.docx')
```

All functions share a unified API design. All results return a standardized CausalResult object. Publication-ready tables export directly to Word, Excel, LaTeX, and HTML. Stata users will find sp.regress(), sp.margins(), sp.test() familiar. R users will recognize sp.callaway_santanna() and sp.rdrobust().
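To make the "standardized result object" idea concrete, here is a minimal sketch in plain Python of what such a type could look like. This is an illustration only; the fields and methods of statspai's actual CausalResult class will differ.

```python
from dataclasses import dataclass

@dataclass
class CausalResult:
    # Hypothetical sketch of a standardized result object,
    # NOT statspai's actual class definition.
    method: str
    coef: dict   # parameter name -> point estimate
    se: dict     # parameter name -> standard error

    def summary(self) -> str:
        # Because every estimator returns the same shape, downstream
        # code (tables, plots, exports) only needs to be written once.
        rows = [f"{name:>10}  {self.coef[name]: .4f}  ({self.se[name]:.4f})"
                for name in self.coef]
        return f"[{self.method}]\n" + "\n".join(rows)

res = CausalResult(method="OLS",
                   coef={"edu": 0.08, "exp": 0.03},
                   se={"edu": 0.010, "exp": 0.005})
print(res.summary())
```

The payoff of this design is that `.to_docx()`, `.plot()`, and `sp.modelsummary()` can each be implemented once against the shared shape rather than once per estimator.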

Method Coverage

150+ public functions. Here's an overview by category:

Classical Econometrics: regress, ivreg, panel, heckman, qreg, tobit, xtabond

DID Family: did, callaway_santanna, sun_abraham, bacon_decomposition, honest_did

Regression Discontinuity: rdrobust, rdplot, rddensity

Matching & Reweighting: match (PSM / Mahalanobis / CEM), ebalance

Synthetic Control: synth, sdid

ML Causal: dml, causal_forest, metalearner (S/T/X/R/DR), tmle, aipw, deepiv

Neural Causal: tarnet, cfrnet, dragonnet

Causal Discovery: notears, pc_algorithm

Policy Learning: policy_tree, policy_value

More: dose_response, multi_treatment, lee_bounds, manski_bounds, spillover, g_estimation, bunching, mc_panel, causal_impact, mediate, bartik, conformal_cate, bcf

Post-estimation: margins, test, lincom, oster_bounds, sensemakr, evalue, hausman_test, het_test, reset_test, vif
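For readers new to these methods, the computation behind an entry like did is conceptually simple in the basic case. The textbook 2×2 difference-in-differences can be recovered by hand with numpy: regress the outcome on group, period, and their interaction, and the interaction coefficient is the treatment effect. This is a conceptual illustration on simulated data, not statspai code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
treated = rng.integers(0, 2, n)   # group indicator (1 = treated group)
post = rng.integers(0, 2, n)      # period indicator (1 = post-treatment)

# Simulated outcome with group and time effects; the true ATT is 2.0
y = 1.0 + 0.5 * treated + 1.5 * post + 2.0 * treated * post \
    + rng.normal(0, 0.1, n)

# OLS on [1, treated, post, treated*post]:
# the interaction coefficient beta[3] is the DID estimate
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(beta[3], 2))  # close to the simulated ATT of 2.0
```

Functions like callaway_santanna and sun_abraham exist because this simple interaction logic breaks down under staggered adoption, which is exactly where a dedicated package earns its keep.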

Automated Robustness (one-click solutions that neither Stata nor R offer)

  • spec_curve() — Specification curve / multiverse analysis
  • robustness_report() — Automatically varies standard errors, winsorization, control variables, subsamples
  • subgroup_analysis() — Heterogeneity analysis + forest plots + Wald tests
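The idea behind spec_curve() can be shown in miniature: re-estimate the same focal coefficient under every combination of optional controls and inspect how stable it is across specifications. The following is a conceptual numpy sketch on simulated data, not the statspai implementation:

```python
from itertools import chain, combinations
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)                            # focal regressor
c1, c2 = rng.normal(size=n), rng.normal(size=n)   # optional controls
y = 2.0 * x + 0.5 * c1 + rng.normal(size=n)       # true effect of x is 2.0

# Every subset of the optional controls: (), (c1,), (c2,), (c1, c2)
controls = {"c1": c1, "c2": c2}
subsets = chain.from_iterable(
    combinations(controls, r) for r in range(len(controls) + 1))

coefs = []
for subset in subsets:
    X = np.column_stack([np.ones(n), x] + [controls[name] for name in subset])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(beta[1])   # coefficient on x in this specification

print(len(coefs), [round(b, 2) for b in coefs])
```

A real specification curve also varies standard errors, samples, and outcome definitions, and plots the sorted estimates; the core loop over specifications is the same.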

Publication Tables: modelsummary, outreg2, sumstats, balance_table, tab, coefplot, binscatter

Designed for Agents

When an AI Agent conducts empirical research, the workflow has two layers: Skills orchestrate "what to do," and the statistics package executes "how to do it."

A Skill might specify: "Run Callaway-Sant'Anna staggered DID → test parallel trends → export Word table." Under the hood, the Agent needs to call concrete Python functions. If the underlying layer consists of seven or eight scattered packages, the Agent must juggle different APIs and adapt between formats. With StatsPAI, it's three lines of sp.xxx().

StatsPAI is designed with Agent invocation in mind:

  • Unified function signatures — Agents don't need to memorize each package's unique parameter style
  • Standardized result objects — All methods return the same type of object, enabling standardized downstream processing
  • Built-in publication output — .to_docx() and .to_latex() work out of the box, no extra assembly required
  • Method-level citations — .cite() returns BibTeX, making automated reference generation easy

As agents play an increasingly central role in academic research, the uniformity and reliability of the underlying statistics package will become ever more critical. This is StatsPAI's most fundamental long-term value.

Timeline

  • 2025.07 — StatsPAI v0.1.0 published on PyPI. Open source first, starting with classical econometrics and publication tables
  • 2025.08 — StatsPAI Inc. incorporated. A sustainable entity to support the open-source project
  • 2025.12 — [CoPaper.AI](https://copaper.ai) launches. An AI-assisted empirical research co-authoring platform built on StatsPAI
  • 2026.04 — v0.3.1. Three major version iterations, 150+ functions, covering classical econometrics through frontier ML causal methods

Package first, then company, then product. StatsPAI is and always will be MIT-licensed and open source.

What's Next

StatsPAI is iterating rapidly. The current 150+ common functions are just the starting point, with significant new features shipping weekly.

Near-term directions:

  • Performance — Computational efficiency for large-scale panel data, benchmarking against Stata's compiled backend and R's fixest
  • Method expansion — Survey design, spatial econometrics, structural estimation — closing the remaining gaps where Stata and R still lead
  • Ecosystem integration — Deeper fusion with pandas, scikit-learn, PyTorch, accelerating the convergence of AI and causal inference
  • AI adaptation — Structured function documentation and standardized interfaces for Agents and Skills, lowering the barrier for AI invocation
  • Rapid frontier implementation — Leveraging Python + AI coding speed advantages to implement new paper methods as soon as they're published

StatsPAI aims to become foundational infrastructure for Python-based empirical research — analogous to what scikit-learn is to machine learning and pandas is to data processing. This takes time, and it takes community. The ecosystem is in a period of rapid growth, with each version adding new methods, optimizing existing implementations, and refining API consistency. We'll maintain this pace.

Get Involved

StatsPAI is an open ecosystem, and all forms of participation are welcome:

  • Use it — pip install statspai, try it in your next empirical analysis
  • Feedback — Tell us what's broken or what could be better on [GitHub Issues](https://github.com/brycewang-stanford/statspai/issues)
  • Suggest — What methods do you want most? What features? Let us know
  • Contribute — Fork & PR, build with us

GitHub: [github.com/brycewang-stanford/statspai](https://github.com/brycewang-stanford/statspai)

CoPaper.AI: [copaper.ai](https://copaper.ai)

StatsPAI Inc. · Stanford REAP Program