Python and Reproducible Research Workflow

Treat a research project as a rerunnable directed acyclic graph (DAG), not as a folder of manually edited files.

Mechanism Lab

Animation: outputs flowing through a DAG

Data starts as raw input and moves through the clean, feature, model, and table stages, leaving a verifiable file at every step.

Raw data is stored read-only.

01 / Intuition

Core Intuition

Reproducible research is less about knowing Python syntax and more about making dependencies inspectable.

A result table is the output of F(raw data, code, config, seed), not an isolated artifact.

A clear dependency graph lets authors, reviewers, and agents rerun the pipeline, debug it, swap in new data, and compare outputs.

02 / Math

Model the workflow as a dependency graph

01 / State

Let D0 be the raw data, C the code, theta the config, and s the random seed. The output Y is a function of these four inputs.

Y = F(D0, C, theta, s)
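
A minimal sketch of this idea, assuming a toy pipeline: the function body plays the role of C, and the bootstrap mean, the `n_boot` config key, and the sample values are all hypothetical choices made for illustration. The point is only that the output depends on nothing except the declared arguments.

```python
import random
from statistics import mean


def F(d0: list[float], theta: dict, s: int) -> float:
    """Toy pipeline: bootstrap mean of d0, driven only by its arguments."""
    rng = random.Random(s)  # local generator, so no hidden global state
    draws = [
        mean(rng.choices(d0, k=len(d0)))  # resample with replacement
        for _ in range(theta["n_boot"])
    ]
    return mean(draws)


d0 = [1.0, 2.0, 4.0, 8.0]   # hypothetical raw data
theta = {"n_boot": 200}     # hypothetical config

y1 = F(d0, theta, s=42)
y2 = F(d0, theta, s=42)     # same inputs, so the same output
y3 = F(d0, theta, s=7)      # a different seed may give a different output
print(y1 == y2)
```

Because the random generator is created inside the function from s, rerunning with the same data, config, and seed reproduces Y exactly.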

02 / DAG rule

If node v_j depends on v_i, v_i must be generated first.

v_i -> v_j  implies  order(v_i) < order(v_j)
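
The ordering rule can be checked mechanically. The sketch below assumes the five stages are named raw, clean, feature, model, and table and form a simple chain; it uses the standard-library `graphlib` module (Python 3.9+) to derive a valid run order.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (its predecessors).
deps = {
    "raw": set(),
    "clean": {"raw"},
    "feature": {"clean"},
    "model": {"feature"},
    "table": {"model"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
position = {node: i for i, node in enumerate(order)}

# Verify the rule: v_i -> v_j implies order(v_i) < order(v_j).
rule_holds = all(
    position[pred] < position[node]
    for node, preds in deps.items()
    for pred in preds
)
print(order)       # ['raw', 'clean', 'feature', 'model', 'table']
print(rule_holds)  # True
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a useful early signal that the workflow is not a DAG.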

03 / Audit

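Any change in output should be attributable to a change in data, code, config, or seed. For example, holding data, config, and seed fixed isolates the effect of a code change from C1 to C2: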

Delta Y = F(D0, C2, theta, s) - F(D0, C1, theta, s)
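
One way to support this kind of audit is to fingerprint each argument of F and compare the fingerprints between two runs. This is a sketch under stated assumptions: the helper names `fingerprint`, `manifest`, and `changed` are hypothetical, and the inputs are small JSON-serializable stand-ins rather than real files.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Stable short hash of any JSON-serializable component."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def manifest(data, code: str, config: dict, seed: int) -> dict:
    """Fingerprint each argument of F separately."""
    return {
        "data": fingerprint(data),
        "code": fingerprint(code),
        "config": fingerprint(config),
        "seed": fingerprint(seed),
    }


def changed(before: dict, after: dict) -> list[str]:
    """Names of the components whose fingerprints differ."""
    return [key for key in before if before[key] != after[key]]


# Two hypothetical runs that differ only in the code (C1 versus C2).
run1 = manifest([1, 2, 3], "v1: mean(x)", {"trim": 0.0}, 42)
run2 = manifest([1, 2, 3], "v2: median(x)", {"trim": 0.0}, 42)
print(changed(run1, run2))  # ['code']
```

When exactly one component differs, the change in Y can be attributed to it; when several differ, the runs need to be repeated one change at a time.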

03 / Code

Minimal pipeline code

The example separates cleaning from analysis and passes every input and output path explicitly.

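from pathlib import Path

import pandas as pd

ROOT = Path("project")


def clean(raw_path: Path, out_path: Path) -> pd.DataFrame:
    """Deduplicate the raw data, add the post-2020 indicator, and save it."""
    df = pd.read_csv(raw_path)
    df = df.drop_duplicates(subset="id")
    df["post"] = (df["year"] >= 2020).astype(int)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    df.to_parquet(out_path, index=False)  # requires pyarrow or fastparquet
    return df


def summarize(clean_path: Path, table_path: Path) -> None:
    """Write count, mean, and std of the outcome for each period."""
    df = pd.read_parquet(clean_path)
    table = df.groupby("post")["outcome"].agg(["count", "mean", "std"])
    table_path.parent.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    table.to_csv(table_path)


if __name__ == "__main__":
    # Run the steps in dependency order: clean before summarize.
    clean(ROOT / "data/raw.csv", ROOT / "data/processed.parquet")
    summarize(ROOT / "data/processed.parquet", ROOT / "tables/descriptive.csv")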

04 / Case

Case: converting an Excel workflow into an auditable pipeline

  • Turn a manual Excel-style research workflow into raw data, cleaning scripts, analysis scripts, and outputs.
  • Every output table traces back to input data, code version, and random seed.
  • Agents become reliable only when the project structure is explicit enough to inspect.
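
The traceability requirement in the list above could be met by writing a small provenance file next to each output table. The sketch below is illustrative only: `write_with_provenance`, the `.provenance.json` suffix, the file contents, and the commit id are all hypothetical, and it uses only the standard library so it runs without the project's data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file's bytes so later edits to it are detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_with_provenance(table_text: str, table_path: Path, *,
                          input_path: Path, code_version: str, seed: int) -> Path:
    """Write the table, then a JSON sidecar recording where it came from."""
    table_path.write_text(table_text, encoding="utf-8")
    record = {
        "output": table_path.name,
        "output_sha256": sha256_file(table_path),
        "input": input_path.name,
        "input_sha256": sha256_file(input_path),
        "code_version": code_version,
        "seed": seed,
    }
    sidecar = table_path.with_name(table_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("id,outcome\n1,0.5\n2,1.5\n", encoding="utf-8")  # made-up data
    sidecar = write_with_provenance(
        "post,mean\n0,1.0\n",               # made-up table contents
        Path(tmp) / "descriptive.csv",
        input_path=raw,
        code_version="abc1234",             # hypothetical commit id
        seed=42,
    )
    record = json.loads(sidecar.read_text(encoding="utf-8"))

print(record["input"], record["code_version"], record["seed"])
```

A manually edited table would no longer match its recorded `output_sha256`, which makes silent edits visible during review.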

05 / Risks

Common Pitfalls

Keeping only notebooks without run order.
Manually editing tables without provenance.
Leaving paths, seeds, and dependency versions implicit.
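
To avoid the last pitfall, a run can record its paths, seed, and dependency versions explicitly. A minimal sketch, assuming the helper names `package_version` and `run_context` (both hypothetical) and a package list chosen for illustration:

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def run_context(seed: int, paths: dict[str, str], packages: list[str]) -> dict:
    """Collect everything that would otherwise stay implicit in a run."""
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "seed": seed,
        "paths": paths,
        "packages": {name: package_version(name) for name in packages},
    }


context = run_context(
    seed=42,
    paths={
        "raw": "project/data/raw.csv",
        "table": "project/tables/descriptive.csv",
    },
    packages=["pandas", "pyarrow"],  # assumed dependencies of the pipeline
)
print(json.dumps(context, indent=2))
```

Saving this dictionary alongside the outputs gives a later reader the information needed to rebuild a comparable environment.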
