Agentic Empirical Research Workflow: From Research Question to Auditable Evidence Bundle

Treat an agent as a research pipeline with state, tools, evidence gates, and human checkpoints, not as a one-shot chat answer.

Mechanism Lab

Animation: how a research agent follows a DAG to produce evidence

The animation shows a research question entering a plan node, data and code executing in parallel, diagnostics gating the model, and only verified artifacts entering the replication package.

01 / Intuition

Core Intuition

The unit of work for an empirical-research agent is not a single prompt but a long-lived state s: research question, data paths, code, models, logs, outputs, and human feedback.

Every step should be an auditable tool action: search literature, read data, run scripts, render tables, check balance, or draft a manuscript section.

The workflow needs gates: cleaning must be rerunnable, diagnostics must pass, tables must be generated by code, cited sources must actually exist, and claims must stay within what the identification assumptions support.

Humans are not only final reviewers. They control permissions at key nodes: choice of data source, identification strategy, anything sent outside the project, destructive actions, and final interpretation.
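
The permission idea above can be sketched as a simple allow-list check. This is a minimal illustration, not a real framework: the action names in `HIGH_RISK` and the `authorize` helper are hypothetical, chosen to mirror the key nodes listed above.

```python
from typing import Optional

# Hypothetical action names mirroring the key nodes named in the text.
HIGH_RISK = {
    "choose_data_source",
    "set_identification_strategy",
    "send_external",
    "destructive_action",
    "final_interpretation",
}

def authorize(action: str, approved_by: Optional[str] = None) -> bool:
    """Low-risk actions run freely; high-risk ones need a named human approver."""
    if action not in HIGH_RISK:
        return True
    return approved_by is not None

print(authorize("render_table"))                          # True
print(authorize("send_external"))                         # False
print(authorize("send_external", approved_by="analyst"))  # True
```

Recording who approved an action, rather than a bare yes/no, keeps the approval itself auditable.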

02 / Math

Model the research workflow as a state machine and evidence DAG

01 / State

State includes task, file inventory, data version, code version, model objects, checks, and human notes.

s_t={task,files,data,code,models,checks,notes}

02 / Policy

The agent policy selects a tool action from state. The action is a tool name, arguments, and expected output, not free text.

a_t ~ pi(a | s_t), a_t={tool,args,expected}
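
One way to keep actions structured rather than free text is a small schema that is checked before execution. The sketch below is illustrative only: `KNOWN_TOOLS`, the `problems` helper, and the file paths are assumptions, not part of any specific agent library.

```python
from dataclasses import dataclass, field

# Hypothetical tool registry; the names echo the tool actions mentioned earlier.
KNOWN_TOOLS = {"search_literature", "read_data", "run_script", "render_table"}

@dataclass(frozen=True)
class Action:
    tool: str                                  # which tool to call
    args: dict = field(default_factory=dict)   # arguments passed to the tool
    expected: str = ""                         # artifact the step should produce

def problems(action: Action) -> list:
    """Return a list of reasons the action is not well formed (empty if fine)."""
    found = []
    if action.tool not in KNOWN_TOOLS:
        found.append(f"unknown tool: {action.tool}")
    if not action.expected:
        found.append("no expected output declared")
    return found

ok = Action("run_script", {"path": "code/clean.py"}, expected="data/clean.parquet")
bad = Action("write_conclusion")
print(problems(ok))   # []
print(problems(bad))  # ['unknown tool: write_conclusion', 'no expected output declared']
```

Declaring the expected output up front gives the verifier something concrete to compare the observation against.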

03 / Execution

The environment executes the tool and returns an observation: logs, errors, tables, documents, or diagnostics.

o_t=E(a_t,s_t)
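
As a rough sketch of the execution step, a tool call can be run as a subprocess and its result captured as an observation dictionary. The `execute` helper and the observation keys are assumptions made for illustration; the commands here are trivial stand-ins for real analysis scripts.

```python
import subprocess
import sys

def execute(cmd: list, timeout: float = 30.0) -> dict:
    """Run a command and return logs and exit status as an observation."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"cmd": cmd, "error": "timeout"}
    obs = {
        "cmd": cmd,
        "returncode": done.returncode,
        "stdout": done.stdout,
        "stderr": done.stderr,
    }
    # A non-zero exit code is surfaced as an explicit error for the gate to see.
    if done.returncode != 0:
        obs["error"] = f"exit code {done.returncode}"
    return obs

good = execute([sys.executable, "-c", "print('rows=1280')"])
bad = execute([sys.executable, "-c", "raise SystemExit(3)"])
print(good["stdout"].strip())  # rows=1280
print(bad.get("error"))        # exit code 3
```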

04 / Evidence gate

A verifier maps observations into pass, fail, or review. Failures must return to planning or repair.

g_t=V(o_t,s_t) in {pass,fail,review}
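
A verifier with exactly three outcomes can be sketched with an enum. The `Gate` enum, the `verify_cleaning` function, and the 5% missingness threshold are illustrative assumptions, not a recommended standard.

```python
from enum import Enum

class Gate(Enum):
    PASS = "pass"
    FAIL = "fail"
    REVIEW = "review"

def verify_cleaning(obs: dict, max_missing: float = 0.05) -> Gate:
    """Map a cleaning observation to a gate status."""
    if "error" in obs:
        return Gate.FAIL
    if "missing_rate" not in obs:
        # Nothing to check automatically, so defer to a human.
        return Gate.REVIEW
    return Gate.PASS if obs["missing_rate"] < max_missing else Gate.FAIL

print(verify_cleaning({"missing_rate": 0.018}))       # Gate.PASS
print(verify_cleaning({"missing_rate": 0.20}))        # Gate.FAIL
print(verify_cleaning({"rows": 1280}))                # Gate.REVIEW
print(verify_cleaning({"error": "file not found"}))   # Gate.FAIL
```

Using an enum instead of free strings means a misspelled status fails loudly instead of silently falling through.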

05 / State update

Only verified or human-confirmed observations become trusted state; failures still remain in the trace.

s_{t+1}=U(s_t,a_t,o_t,g_t)
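
The update rule can be sketched as a pure function that always appends to the trace but only promotes an observation to trusted state when the gate passed or a human confirmed it. The dictionary layout and the `human_confirmed` flag are assumptions for illustration.

```python
def update(state: dict, action: str, obs: dict, gate: str,
           human_confirmed: bool = False) -> dict:
    """Return a new state; the input state is left unchanged."""
    trusted = dict(state["trusted"])
    # Every step is logged, including failures and pending reviews.
    trace = state["trace"] + [{"action": action, "observation": obs, "gate": gate}]
    if gate == "pass" or (gate == "review" and human_confirmed):
        trusted[action] = obs
    return {"trusted": trusted, "trace": trace}

s0 = {"trusted": {}, "trace": []}
s1 = update(s0, "clean_data", {"missing_rate": 0.018}, "pass")
s2 = update(s1, "estimate_model", {"error": "did not converge"}, "fail")
s3 = update(s2, "estimate_model", {"coef": 0.12, "se": 0.04}, "review")
s4 = update(s3, "estimate_model", {"coef": 0.12, "se": 0.04}, "review",
            human_confirmed=True)

print(sorted(s3["trusted"]), len(s3["trace"]))  # ['clean_data'] 3
print(sorted(s4["trusted"]), len(s4["trace"]))  # ['clean_data', 'estimate_model'] 4
```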

06 / Evidence bundle

Completion means a bundle of reproducible inputs (data D and code C), the commands that were run, outputs, logs, stated assumptions, and review notes.

B={D,C,cmd,outputs,logs,assumptions,review}
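
One possible way to make such a bundle checkable is a manifest that fingerprints every file with SHA-256 and lists the commands and assumptions alongside. This is a sketch under assumed names: `build_manifest`, the manifest keys, the file contents, and the command string are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content hash, so a reviewer can detect any changed file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_manifest(root: Path, commands: list, assumptions: list) -> dict:
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "files": {p.relative_to(root).as_posix(): sha256_of(p) for p in files},
        "commands": commands,
        "assumptions": assumptions,
    }

# Build a throwaway bundle directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "outputs").mkdir()
    (root / "outputs" / "table1.csv").write_text("term,coef\ntreat_post,0.12\n")
    manifest = build_manifest(
        root,
        commands=["python code/estimate.py"],   # hypothetical command
        assumptions=["parallel trends"],
    )

print(json.dumps(manifest, indent=2))
```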

03 / Code

Python demo: a gated research-agent state machine

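This minimal scheduler records the action, observation, and gate status at every step. A failed gate leaves the corresponding readiness flag unset, so a failure is never narrated as success; this demo simply retries the step within a fixed budget, where a fuller version would route to an explicit repair path.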

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    task: str
    data_ready: bool = False
    model_ready: bool = False
    report_ready: bool = False
    trace: list[dict] = field(default_factory=list)

def clean_data(state: State) -> dict:
    return {"rows": 1280, "missing_rate": 0.018, "artifact": "data/clean.parquet"}

def estimate_model(state: State) -> dict:
    if not state.data_ready:
        return {"error": "clean data missing"}
    return {"coef": 0.12, "se": 0.04, "artifact": "models/did.json"}

def render_report(state: State) -> dict:
    if not state.model_ready:
        return {"error": "model not verified"}
    return {"artifact": "outputs/report.md", "claim": "effect is positive"}

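def gate(tool_name: str, obs: dict) -> str:
    """Map an observation to "pass", "fail", or "review"."""
    # Any tool-reported error fails the gate outright.
    if "error" in obs:
        return "fail"
    # Cleaning passes only when missingness is under the 5% demo threshold.
    if tool_name == "clean_data" and obs["missing_rate"] < 0.05:
        return "pass"
    # A significant estimate (|t| > 1.96) is never auto-passed; it goes to review.
    # Demo rule only: a non-significant estimate falls through to "fail" below.
    if tool_name == "estimate_model" and abs(obs["coef"] / obs["se"]) > 1.96:
        return "review"
    if tool_name == "render_report":
        return "pass"
    return "fail"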

TOOLS: dict[str, Callable[[State], dict]] = {
    "clean_data": clean_data,
    "estimate_model": estimate_model,
    "render_report": render_report,
}

def choose_action(state: State) -> str:
    if not state.data_ready:
        return "clean_data"
    if not state.model_ready:
        return "estimate_model"
    return "render_report"

state = State(task="Build an auditable DID research note")

for _ in range(5):  # step budget: a failing step is retried until the budget runs out
    action = choose_action(state)
    observation = TOOLS[action](state)
    status = gate(action, observation)
    # Log every step, failures included, so the trace matches the real run.
    state.trace.append({"action": action, "observation": observation, "gate": status})

    if action == "clean_data" and status == "pass":
        state.data_ready = True
    elif action == "estimate_model" and status in {"pass", "review"}:
        # Demo shortcut: "review" advances without a human so the run completes.
        # Human review is still required before the claim is used externally.
        state.model_ready = True
    elif action == "render_report" and status == "pass":
        state.report_ready = True
        break

print("report ready:", state.report_ready)
print("trace:")
for row in state.trace:
    print(row)

04 / Case

Case: from course project to reviewer-grade replication bundle

A student or researcher wants an agent to complete a policy-evaluation project: clean raw data, construct a panel, estimate a difference-in-differences (DID) model, generate tables, write the interpretation, and prepare a replication bundle.

The low-quality version asks a model to write conclusions directly. The robust version builds a DAG: raw data, clean data, features, model, diagnostics, tables, figures, manuscript, and review notes.

Every node needs a gate: sample counts, duplicate keys, convergence, a parallel-trends figure, table provenance, and whether the prose cites the right output.

The deliverable is not a single answer. It is an auditable set of artifacts: processing logs, commands, tables, screenshots, human confirmations, and the identification assumptions that remain open.
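
The DAG described in this case can be sketched with the standard-library `graphlib` module (Python 3.9+). The node names follow the list above, but the specific dependency edges are an assumption made for illustration; a real project would derive them from its own build steps.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (assumed edges, for illustration).
DAG = {
    "clean_data": {"raw_data"},
    "features": {"clean_data"},
    "model": {"features"},
    "diagnostics": {"model"},
    "tables": {"diagnostics"},
    "figures": {"diagnostics"},
    "manuscript": {"tables", "figures"},
    "review_notes": {"manuscript"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(DAG).static_order())
print(order)
```

Because tables and figures both sit downstream of diagnostics, a failed diagnostic gate blocks every artifact that could reach the manuscript. `TopologicalSorter` also raises `CycleError` if the declared dependencies are circular.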

05 / Risks

Common Pitfalls

Designing the research agent as a one-turn answer instead of a recoverable, auditable state machine.

Dropping failed observations from the trace, making later conclusions look cleaner than the real run.

Letting the agent decide high-risk actions such as deleting data, sending email, overwriting manuscripts, or submitting to external systems.

Checking only whether prose is fluent rather than whether tables, figures, code, and data versions match.

Failing to mark identification assumptions as human-review gates, so statistical output becomes overclaimed causal prose.
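
The pitfall of checking fluency instead of consistency can be made concrete with a small check that every decimal number quoted in the prose also appears in the code-generated table. This is a deliberately crude sketch: the `unsupported_numbers` helper, the regular expression, and the sample values are assumptions, and real checks would also need to handle rounding and units.

```python
import re

def numbers_in(text: str) -> set:
    """Extract decimal numbers such as 0.12 from a passage."""
    return {float(m) for m in re.findall(r"-?\d+\.\d+", text)}

def unsupported_numbers(prose: str, table_values: list, tol: float = 1e-9) -> list:
    """Numbers in the prose that match no value in the table."""
    return sorted(
        x for x in numbers_in(prose)
        if not any(abs(x - v) <= tol for v in table_values)
    )

table = [0.12, 0.04]  # e.g. coefficient and standard error from the model output
good_prose = "The estimated effect is 0.12 (SE 0.04)."
bad_prose = "The estimated effect is 0.21 (SE 0.04)."

print(unsupported_numbers(good_prose, table))  # []
print(unsupported_numbers(bad_prose, table))   # [0.21]
```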