Workflow / Reproducibility
Python and Reproducible Research Workflow
Treat a research project as a rerunnable DAG, not as a folder of manually edited files.
Mechanism Lab
Animation: outputs flowing through a DAG
The signal starts at raw data and moves through clean, feature, model, and table, leaving a verifiable file at every step.
Step 1 / 5: Raw. Raw data (D0) is stored read-only.
01 / Intuition
Core Intuition
Reproducible research is less about knowing Python syntax and more about making dependencies inspectable.
A result table is the output of F(raw data, code, config, seed), not an isolated artifact.
Clear dependency graphs let humans, reviewers, and agents rerun, debug, replace data, and compare outputs.
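A minimal sketch of the idea that a result is a function of its inputs. The function `run_analysis`, the `n_draws` config key, and the toy data are hypothetical and chosen only for illustration; the point is that fixing data, code, config, and seed fixes the output.

```python
import random
import statistics


def run_analysis(data: list[float], config: dict, seed: int) -> float:
    """Bootstrap the mean of `data`; the result depends only on the arguments."""
    rng = random.Random(seed)  # local generator, no hidden global state
    means = []
    for _ in range(config["n_draws"]):
        # Resample the data with replacement and record the sample mean.
        sample = [rng.choice(data) for _ in data]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)


data = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative values, not real data
config = {"n_draws": 200}          # hypothetical config key

first = run_analysis(data, config, seed=42)
second = run_analysis(data, config, seed=42)
print(first == second)  # True
```

Passing the seed as an argument, rather than seeding a global generator somewhere else in the project, keeps every dependency of the result visible in the call itself.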
02 / Math
Model the workflow as a dependency graph
01 / State
Let D0 be the raw data, C the code, theta the config, and s the random seed.
Y = F(D0, C, theta, s)
02 / DAG rule
If node v_j depends on v_i, v_i must be generated first.
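The ordering rule can be sketched with `graphlib.TopologicalSorter` from the Python standard library (3.9+). The step names follow the raw, clean, feature, model, table chain described earlier; the dictionary layout is an assumption made for this example, not part of any particular workflow tool.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
dependencies = {
    "raw": set(),
    "clean": {"raw"},
    "feature": {"clean"},
    "model": {"feature"},
    "table": {"model"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['raw', 'clean', 'feature', 'model', 'table']

# Check the rule directly: each dependency comes before the step that needs it.
position = {step: i for i, step in enumerate(order)}
for step, deps in dependencies.items():
    for dep in deps:
        assert position[dep] < position[step]
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a useful early signal that the workflow cannot be rerun from scratch.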
v_i -> v_j implies order(v_i) < order(v_j)
03 / Audit
Output changes should decompose into data, code, or config changes.
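One way to read the audit rule is to change a single input at a time and rerun. The sketch below holds data and config fixed and swaps only the code version; `code_v1`, `code_v2`, `run`, and the toy values are hypothetical, and the seed is omitted because this toy pipeline has no randomness.

```python
import statistics
from typing import Callable, Optional

Rows = list[Optional[float]]


def code_v1(rows: Rows) -> list[float]:
    """Hypothetical first code version: drop missing values only."""
    return [r for r in rows if r is not None]


def code_v2(rows: Rows) -> list[float]:
    """Hypothetical second code version: also drop negative values."""
    return [r for r in rows if r is not None and r >= 0]


def run(data: Rows, code: Callable[[Rows], list[float]], config: dict) -> float:
    """Apply one code version to the data and return the rounded mean."""
    cleaned = code(data)
    return round(statistics.fmean(cleaned), config["digits"])


data = [2.0, None, -1.0, 5.0]  # held fixed (illustrative values)
config = {"digits": 3}         # held fixed (hypothetical config key)

y1 = run(data, code_v1, config)
y2 = run(data, code_v2, config)
delta = y2 - y1  # attributable to the code change alone
print(y1, y2, delta)  # 2.0 3.5 1.5
```

The same pattern applies to a data or config change: vary that one argument, keep the others fixed, and compare the two outputs.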
Holding data, config, and seed fixed, the effect of a code change from C1 to C2 is:
Delta Y = F(D0, C2, theta, s) - F(D0, C1, theta, s)
03 / Code
Minimal pipeline code
The example separates cleaning and analysis and records inputs and outputs explicitly.
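from pathlib import Path

import pandas as pd

ROOT = Path("project")

def clean(raw_path: Path, out_path: Path) -> pd.DataFrame:
    """Deduplicate on id, add the post-2020 indicator, and write Parquet."""
    df = pd.read_csv(raw_path)
    df = df.drop_duplicates(subset="id")
    df["post"] = (df["year"] >= 2020).astype(int)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path, index=False)  # needs pyarrow or fastparquet installed
    return df

def summarize(clean_path: Path, table_path: Path) -> None:
    """Write count, mean, and std of outcome by post period to CSV."""
    df = pd.read_parquet(clean_path)
    table = df.groupby("post")["outcome"].agg(["count", "mean", "std"])
    table_path.parent.mkdir(parents=True, exist_ok=True)
    table.to_csv(table_path)

# Each step reads a named input file and writes a named output file.
clean(ROOT / "data/raw.csv", ROOT / "data/processed.parquet")
summarize(ROOT / "data/processed.parquet", ROOT / "tables/descriptive.csv")
04 / Case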
Case: converting an Excel workflow into an auditable pipeline
- Turn a manual Excel-style research workflow into raw data, cleaning scripts, analysis scripts, and outputs.
- Every output table traces back to input data, code version, and random seed.
- Agents become reliable only when the project structure is explicit enough to inspect.
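The traceability requirement in the list above can be sketched as a small provenance manifest stored next to each output. This is one possible layout, not a standard: `build_manifest`, `changed_fields`, the field names, and the `"abc123"` code version (standing in for something like a commit hash) are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash of a file, used to detect changed inputs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(inputs: dict[str, Path], code_version: str, seed: int) -> dict:
    """Record what an output was built from."""
    return {
        "inputs": {name: sha256_of(p) for name, p in inputs.items()},
        "code_version": code_version,
        "seed": seed,
    }


def changed_fields(old: dict, new: dict) -> list[str]:
    """Names of top-level manifest fields that differ between two runs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"

    raw.write_text("id,year,outcome\n1,2019,0.5\n")  # illustrative rows
    before = build_manifest({"raw": raw}, code_version="abc123", seed=0)

    raw.write_text("id,year,outcome\n1,2019,0.7\n")  # the data changes
    after = build_manifest({"raw": raw}, code_version="abc123", seed=0)

print(changed_fields(before, after))  # ['inputs']
```

Comparing two manifests narrows an unexpected change in a table to data, code version, or seed before anyone opens the table itself.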
05 / Risks