Workflow / Tooling

Python vs Stata / R / Excel / SQL / SAS: trade-offs and task-based selection

Choosing a tool is not about which language is best; it is about matching each tool to the research step it is best at: data engineering, mature econometrics, statistical modeling, database queries, machine learning, and automation. This page gives an actionable task-to-tool decision framework grounded in the social-science causal workflow.

Many students agonize over Python vs Stata or R. The honest answer is usually all of them, but each for one part: Stata or R for mature causal econometrics, Python for collection, cleaning, ML and agent automation, SQL to query large databases, and Excel only for small manual checks. This page builds the judgment for that hybrid workflow.

Schematic

The principle at a glance

Task → tool: give each step to the right languagequerySQLcollectPythonclean / reshapepandas / RidentifyStata / Rfrontier + autoPython · agentreproducibility > language wars: rerun from raw data + scripts
Match each step to its best tool: database queries to SQL, web/API/text collection to Python, merging and reshaping to pandas/tidyverse, mature causal econometrics (DiD/IV/RD/SCM) to Stata/R, and frontier ML causal plus agent automation to Python. Tooling serves the identification strategy; it does not replace it.

Start Here

What you should be able to do

01

Compare Python / Stata / R / Excel / SQL / SAS along four axes: data scale, method maturity, reproducibility, and automation ecosystem.

02

Know that mature causal designs (DiD / IV / RD / SCM) are often one command with well-tested robust SE in Stata / R, while frontier ML causal methods (DML / causal forests / deep causal) are richest in Python.

03

Understand why Excel fits small checks but not reproducible analysis, and why large queries belong in SQL.

04

Draw a task-to-tool division of labor for a concrete project and justify each choice.

Learning Path

Animation: the task-to-tool pipeline

Split an empirical project into five steps and see who each step should go to.

  1. Step 1

    Query

    Database / huge tables → SQL filters and aggregates in place.

    SQL

  2. Step 2

    Collect

    Web / API / text → Python (requests / BeautifulSoup).

    Python

  3. Step 3

    Clean

    Merge / reshape / missing / regex → pandas or tidyverse.

    pandas / R

  4. Step 4

    Identify

    Mature causal econometrics (DiD / IV / RD / SCM) → Stata / R.

    Stata / R

  5. Step 5

    Frontier + automation

    DML / causal forest / deep causal and agent automation → Python.

    Python

01 / Intuition

Core Intuition

A tool is good or bad only relative to a task: the same regression runs in statsmodels, Stata reg, or R lm; but for turnkey clustered SE, fixed effects, panels, and IV with sensible defaults, the Stata / R econometrics ecosystem is less work.

Python shines before the data arrives and after the econometrics: scraping / APIs, text and unstructured data, machine learning, reproducible scripts, and letting agents chain the whole pipeline.

Reproducibility matters more than the language: an R or Python pipeline that reruns from raw data plus scripts beats pretty-but-untraceable numbers clicked out in Excel.

02 / Math

A task-to-tool selection framework

01 / Data scale and source

Up to a few hundred thousand local rows, all three languages are fine; for hundreds of millions of rows or a database, filter and aggregate inside SQL first, then pull a small sample into Python / R. Excel struggles past a few hundred thousand rows and is hard to reproduce.

data in DB / >1e7 rows → SQL aggregates first → Python / R analyze

02 / Data engineering (collect / clean / text)

Scraping, APIs, regex, unstructured text, and merging or reshaping wide tables: Python (requests / BeautifulSoup / pandas) is the most complete; R tidyverse is strong; Stata is weaker.

scraping / API / text / reshape → Python ≳ R ≫ Stata

03 / Mature causal econometrics

DiD (including heterogeneity-robust estimators), IV, RD, and synthetic control are mature in Stata (reghdfe / did / rdrobust / synth) and R (fixest / did / rdrobust), with robust SE by default and literature alignment.

DiD / IV / RD / SCM (turnkey + robust SE) → Stata ≈ R

04 / ML and frontier / deep causal

DML, causal forests, and DeepIV / Dragonnet / TARNet are richest in Python (scikit-learn / econml / pytorch); R has highlights such as grf.

DML / causal forest / deep causal → Python ≳ R

05 / Automation and AI / agents

MCP, Skills, StatsPAI, and Paper-WorkFlow automate the whole research chain with Python as the glue layer; this is why everyone ends up needing some Python.

agent / MCP / automated research → Python (glue layer)

03 / Code

Code: the same regression across three tools

Run a minimal interaction regression in Python, with the equivalent Stata and R commands in comments, to show that can-run is not the same as least-effort.

Case 1: aggregate huge data in SQL first, then pull into Python

When raw data lives in a database and rows are huge, do not pull the whole table into memory. Filter and aggregate to the unit of analysis in SQL, then analyze with pandas.

import pandas as pd
import sqlite3

# in-memory demo: build a tiny "firm panel", then aggregate IN the database
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "province": ["AH", "AH", "AH", "ZJ", "ZJ"],
    "year":     [2010, 2010, 2011, 2010, 2011],
    "employment": [50, 66, 71, 90, 88],
}).to_sql("firm_panel", con, index=False)

query = """
SELECT province, year,
       AVG(employment) AS mean_emp,
       COUNT(*)        AS n_firms
FROM firm_panel
WHERE year BETWEEN 2010 AND 2020
GROUP BY province, year
"""
panel = pd.read_sql(query, con)
print(panel)

Expected output

  province  year  mean_emp  n_firms
0       AH  2010      58.0        2
1       AH  2011      71.0        1
2       ZJ  2010      90.0        1
3       ZJ  2011      88.0        1

How to read this code

  • SQL filters and aggregates in the database, pulling only the province-year panel into memory.
  • The same task is nearly impossible in Excel and memory-hungry in pure pandas if you read the full table first.
  • Layer the work: SQL for retrieval, Python for computation and modeling.

Case 2: mature econometrics is often one command in Stata / R

High-dimensional fixed effects plus clustered robust SE for DiD is the home turf of the Stata / R ecosystem; Python can do it, but you assemble the fixed effects and variance type more explicitly.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "firm":  rng.integers(0, 30, n),
    "year":  rng.integers(2010, 2020, n),
    "treat": rng.integers(0, 2, n),
    "post":  rng.integers(0, 2, n),
})
df["y"] = 1 + 1.2 * (df.treat * df.post) + 0.3 * (df.firm % 5) + rng.normal(size=n)

# Stata (one line):    reghdfe y i.treat##i.post, absorb(firm year) cluster(firm)
# R/fixest (one line): feols(y ~ treat*post | firm + year, cluster = ~firm, df)
# Python: spell out fixed effects as factors + cluster-robust SE
m = smf.ols("y ~ treat*post + C(firm) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm"]})
print(round(float(m.params["treat:post"]), 3))

Expected output

1.214

How to read this code

  • Stata reghdfe and R fixest make high-dimensional FE plus clustered robust SE a single, correct-by-default line.
  • Python reaches the same point estimate, but you must write out the fixed effects (C(firm)+C(year)) and specify the cluster variance.
  • Bottom line: hand mature, literature-aligned econometrics to Stata / R rather than reimplementing it by hand in Python.

04 / Case

Case: division of labor in a hybrid-workflow project

  • Scenario: evaluate an industrial-subsidy policy on firm employment, with raw data from a firm database, policy-announcement web pages, and statistical yearbooks.
  • Query and collect: use SQL to pull a province-year-industry panel from the firm database; use Python (requests / BeautifulSoup) to scrape announcements and yearbook text.
  • Clean and construct: use pandas to merge sources and handle missing and duplicate rows, and regex to extract policy timing and intensity from text.
  • Identify and estimate: use Stata / R for staggered DiD (modern robust estimators) plus an event study and robustness; use Python econml for DML / causal-forest heterogeneity.
  • Automate and deliver: use an agent (MCP / Skills / StatsPAI) to chain query → clean → estimate → robustness → tables into a one-command rerun, leaving Excel for small manual checks.

05 / Causal

Wiring tool selection back into the causal workflow

Tool selection does not replace causal methods; it serves the same identifying assumptions. Choosing the right tool at each step makes the measure-identify-estimate-robustness chain more reliable and reproducible.

01 / Measurement and data engineering (Python / SQL)

Use Python to scrape and use regex / LLMs to construct treatment and covariates from text, and SQL to build the unit of analysis in the database, all before identification.

raw → SQL/Python → treatment D, outcome Y, covariates X

02 / Identification and estimation (Stata / R)

Feed constructed variables into mature DiD / IV / RD / SCM commands and obtain literature-aligned, robust-by-default standard errors.

D, Y, X → reghdfe / fixest / rdrobust → tau-hat

03 / Nuisance and heterogeneity (Python)

DML and causal forests hand high-dimensional nuisance prediction to ML, but identification still comes from design; Python econml / grf are the richest.

theta identified by an orthogonal moment; ML only for nuisance

Three red lines: (1) tool convenience must not dictate the identification strategy; pick a credible design first, then the tool that implements it; (2) whatever the language, results must rerun from raw data plus scripts; (3) when moving data across tools, watch variable definitions, missingness, and units so that switching languages does not silently switch the sample.

06 / Risks

Common Pitfalls

Letting "I only know language X" drive the method: tool habits should never dictate the identification strategy.
Doing reproducible analysis in Excel: manual clicking is untraceable and collapses the moment the sample changes.
Reimplementing mature Stata / R estimators by hand in Python: slow and easy to get the standard errors wrong.
Pulling a full table into memory and filtering in pandas instead of aggregating in SQL first, blowing up memory or runtime.
Chasing new tools while ignoring team and journal conventions: being rerunnable by collaborators and reviewers matters more than using the trendiest language.

References