Agentic Empirical Research Workflow: From Research Question to Auditable Evidence Bundle
Treat an agent as a research pipeline with state, tools, evidence gates, and human checkpoints, not as a one-shot chat answer.
Mechanism Lab
Animation: how a research agent follows a DAG to produce evidence
The animation shows a research question entering a plan node, data and code executing in parallel, diagnostics gating the model, and only verified artifacts entering the replication package.
State
The research question, files, data versions, code, and human notes form state.
s_t = {task, files, data, code, checks}
01 / Intuition
Core Intuition
The object of an empirical-research agent is not a single prompt. It is long-lived state s: research question, data paths, code, models, logs, outputs, and human feedback.
Every step should be an auditable tool action: search literature, read data, run scripts, render tables, check balance, or draft a manuscript section.
The workflow needs gates: cleaning must be rerunnable from raw data, diagnostics must pass, tables must be generated by code, cited sources must exist, and claims must stay within what the identification assumptions support.
Humans are not only final reviewers. They are permission controllers at key nodes: data source, identification strategy, external sends, destructive actions, and final interpretation.
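The permission-controller role can be sketched as a small authorization check. This is a minimal illustration, not a prescribed design: the category names and the `authorize` helper are hypothetical and simply mirror the checkpoints listed above.

```python
from dataclasses import dataclass

# Hypothetical action categories that must pause for a human decision.
# The names mirror the checkpoints in the text; they are not a standard taxonomy.
REQUIRES_HUMAN = {
    "choose_data_source",
    "choose_identification_strategy",
    "send_external",
    "destructive_action",
    "final_interpretation",
}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(category: str, approvals: set[str]) -> Decision:
    """Allow routine actions; block gated ones until a human has approved them."""
    if category not in REQUIRES_HUMAN:
        return Decision(True, "routine action, no checkpoint")
    if category in approvals:
        return Decision(True, "human approval on record")
    return Decision(False, "waiting for human approval")


approvals: set[str] = {"choose_data_source"}  # recorded human sign-offs
print(authorize("render_table", approvals))
print(authorize("choose_data_source", approvals))
print(authorize("send_external", approvals))
```

Keeping approvals as explicit data means the sign-offs themselves can be logged alongside the rest of the run.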
02 / Math
Model the research workflow as a state machine and evidence DAG
01 / State
State includes task, file inventory, data version, code version, model objects, checks, and human notes.
s_t = {task, files, data, code, models, checks, notes}
02 / Policy
The agent policy selects a tool action from state. The action is a tool name, arguments, and expected output, not free text.
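One way to make "not free text" concrete is to validate each proposed action against a tool registry before it runs. The sketch below is an assumption-laden illustration: `TOOL_SCHEMAS`, its argument names, and `validate` are hypothetical and not part of any particular agent framework.

```python
from dataclasses import dataclass, field

# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {
    "clean_data": {"input_path", "output_path"},
    "estimate_model": {"data_path", "spec"},
}


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict = field(default_factory=dict)
    expected: str = ""  # what a successful run should produce


def validate(action: Action) -> list[str]:
    """Return a list of problems; an empty list means the action is well-formed."""
    if action.tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {action.tool}"]
    problems = []
    missing = TOOL_SCHEMAS[action.tool] - set(action.args)
    if missing:
        problems.append(f"missing args: {sorted(missing)}")
    if not action.expected:
        problems.append("no expected output declared")
    return problems


good = Action(
    tool="clean_data",
    args={"input_path": "data/raw.csv", "output_path": "data/clean.parquet"},
    expected="data/clean.parquet exists",
)
bad = Action(tool="estimate_model", args={"data_path": "data/clean.parquet"})
print(validate(good))
print(validate(bad))
```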
a_t ~ pi(a | s_t), a_t = {tool, args, expected}
03 / Execution
The environment executes the tool and returns an observation: logs, errors, tables, documents, or diagnostics.
o_t = E(a_t, s_t)
04 / Evidence gate
A verifier maps observations into pass, fail, or review. Failures must return to planning or repair.
g_t = V(o_t, s_t) in {pass, fail, review}
05 / State update
Only verified or human-confirmed observations become trusted state; failures still remain in the trace.
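The update rule can be sketched as a pure function that always appends to the trace but promotes an observation to trusted state only on a pass or a human-confirmed review. The names `TrustedState`, `update`, and `human_confirmed` are illustrative assumptions, not part of the source.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class TrustedState:
    trusted: dict = field(default_factory=dict)  # observations that cleared a gate
    trace: tuple = ()                            # every step, including failures


def update(state: TrustedState, action: str, obs: dict, gate: str,
           human_confirmed: bool = False) -> TrustedState:
    """Return the next state: always log the step, trust it only if verified."""
    step = {"action": action, "observation": obs, "gate": gate,
            "human_confirmed": human_confirmed}
    trace = state.trace + (step,)
    if gate == "pass" or (gate == "review" and human_confirmed):
        return replace(state, trusted={**state.trusted, action: obs}, trace=trace)
    return replace(state, trace=trace)


s0 = TrustedState()
s1 = update(s0, "clean_data", {"rows": 1280}, "pass")
s2 = update(s1, "estimate_model", {"error": "did not converge"}, "fail")
s3 = update(s2, "estimate_model", {"coef": 0.12}, "review")
s4 = update(s3, "estimate_model", {"coef": 0.12}, "review", human_confirmed=True)
print(sorted(s3.trusted), len(s3.trace))
print(sorted(s4.trusted), len(s4.trace))
```

The failed and unconfirmed attempts never enter `trusted`, yet all four steps remain visible in `trace`.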
s_{t+1} = U(s_t, a_t, o_t, g_t)
06 / Evidence bundle
Completion means reproducible inputs, commands, outputs, logs, assumptions, and review notes.
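A bundle of this kind could be recorded as a manifest that hashes every file, so a reviewer can confirm nothing changed after the run. This is a minimal sketch: the `data/`, `code/`, `outputs/`, `logs/` layout, the `build_manifest` helper, and the placeholder files are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file so later readers can confirm it is unchanged."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, commands: list[str], assumptions: list[str],
                   review_notes: list[str]) -> dict:
    """Collect data, code, commands, outputs, logs, assumptions, and review notes."""
    def hashes(subdir: str) -> dict[str, str]:
        base = root / subdir
        if not base.is_dir():
            return {}
        return {p.relative_to(root).as_posix(): sha256_of(p)
                for p in sorted(base.rglob("*")) if p.is_file()}

    return {
        "data": hashes("data"),
        "code": hashes("code"),
        "cmd": commands,
        "outputs": hashes("outputs"),
        "logs": hashes("logs"),
        "assumptions": assumptions,
        "review": review_notes,
    }


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    placeholders = {
        "data/clean.csv": "id,y\n1,0.3\n",
        "code/estimate.py": "print('placeholder')\n",
        "outputs/table1.md": "| coef | se |\n",
        "logs/run.log": "ok\n",
    }
    for rel, text in placeholders.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)
    manifest = build_manifest(
        root,
        commands=["python code/estimate.py"],
        assumptions=["parallel trends (not yet reviewed)"],
        review_notes=[],
    )
    print(json.dumps(manifest, indent=2))
```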
B = {D, C, cmd, outputs, logs, assumptions, review}
03 / Code
Python demo: a gated research-agent state machine
This minimal scheduler records action, observation, and gate status at every step. Failures go to repair paths instead of being narrated as success.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    task: str
    data_ready: bool = False
    model_ready: bool = False
    report_ready: bool = False
    trace: list[dict] = field(default_factory=list)  # every step, pass or fail


# Each tool returns an observation dict; the values are hard-coded for the demo.
def clean_data(state: State) -> dict:
    return {"rows": 1280, "missing_rate": 0.018, "artifact": "data/clean.parquet"}


def estimate_model(state: State) -> dict:
    if not state.data_ready:
        return {"error": "clean data missing"}
    return {"coef": 0.12, "se": 0.04, "artifact": "models/did.json"}


def render_report(state: State) -> dict:
    if not state.model_ready:
        return {"error": "model not verified"}
    return {"artifact": "outputs/report.md", "claim": "effect is positive"}


def gate(tool_name: str, obs: dict) -> str:
    """Map an observation to pass, fail, or review."""
    if "error" in obs:
        return "fail"
    if tool_name == "clean_data" and obs["missing_rate"] < 0.05:
        return "pass"
    # A significant estimate is flagged for human review, not auto-passed.
    if tool_name == "estimate_model" and abs(obs["coef"] / obs["se"]) > 1.96:
        return "review"
    if tool_name == "render_report":
        return "pass"
    return "fail"


TOOLS: dict[str, Callable[[State], dict]] = {
    "clean_data": clean_data,
    "estimate_model": estimate_model,
    "render_report": render_report,
}


def choose_action(state: State) -> str:
    """Pick the first stage that has not yet been completed."""
    if not state.data_ready:
        return "clean_data"
    if not state.model_ready:
        return "estimate_model"
    return "render_report"


state = State(task="Build an auditable DID research note")
for _ in range(5):  # hard cap on steps so a failing gate cannot loop forever
    action = choose_action(state)
    observation = TOOLS[action](state)
    status = gate(action, observation)
    state.trace.append({"action": action, "observation": observation, "gate": status})
    if action == "clean_data" and status == "pass":
        state.data_ready = True
    elif action == "estimate_model" and status in {"pass", "review"}:
        # Human review is required before the claim is externally used.
        # The demo continues on "review" so the run can finish.
        state.model_ready = True
    elif action == "render_report" and status == "pass":
        state.report_ready = True
        break

print("report ready:", state.report_ready)
print("trace:")
for row in state.trace:
    print(row)
04 / Case
Case: from course project to reviewer-grade replication bundle
- A student or researcher wants an agent to complete a policy-evaluation project: clean raw data, construct a panel, estimate DID, generate tables, write interpretation, and prepare a replication bundle.
- The low-quality version asks a model to write conclusions directly. The robust version builds a DAG: raw data, clean data, features, model, diagnostics, tables, figures, manuscript, and review notes.
- Every node needs a gate: sample counts, duplicate keys, convergence, parallel-trends figure, table provenance, and whether prose cites the right output.
- The deliverable is not a single answer. It is an auditable set of artifacts: processing logs, commands, tables, screenshots, human confirmations, and still-open identification assumptions.
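The DAG in this case can be sketched with the standard-library `graphlib` module (Python 3.9+). The node names follow the list above; the gate results are hard-coded placeholders, and `run` is a hypothetical helper showing how a failed gate blocks everything downstream.

```python
from graphlib import TopologicalSorter

# Each node maps to the nodes it depends on.
DAG = {
    "raw_data": set(),
    "clean_data": {"raw_data"},
    "features": {"clean_data"},
    "model": {"features"},
    "diagnostics": {"model"},
    "tables": {"diagnostics"},
    "figures": {"diagnostics"},
    "manuscript": {"tables", "figures"},
    "review_notes": {"manuscript"},
}

# Placeholder gate results; in practice each would come from a real check
# such as duplicate-key counts or a parallel-trends figure review.
GATE_RESULTS = {node: "pass" for node in DAG}
GATE_RESULTS["diagnostics"] = "fail"


def run(dag: dict[str, set[str]], gates: dict[str, str]) -> dict[str, str]:
    """Visit nodes in dependency order; block anything downstream of a non-pass."""
    status: dict[str, str] = {}
    for node in TopologicalSorter(dag).static_order():
        if any(status[dep] != "pass" for dep in dag[node]):
            status[node] = "blocked"
        else:
            status[node] = gates[node]
    return status


for node, result in run(DAG, GATE_RESULTS).items():
    print(f"{node:13s} {result}")
```

With the diagnostics gate failing, the tables, figures, manuscript, and review notes are all marked blocked, so no prose can be written on top of an unverified model.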
05 / Risks
Common Pitfalls
References
- Yao et al. (2022), ReAct: Synergizing Reasoning and Acting in Language Models. https://arxiv.org/abs/2210.03629
- Project TIER Protocol. https://www.projecttier.org/tier-protocol/
- ACM Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-and-badging-current
- AEA Data and Code Availability Policy. https://www.aeaweb.org/journals/data/data-code-policy