# 文本因果推断作业模板 · Text-as-Data Causal Assignment

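> StatsPAI Summer Bootcamp · Companion to the topic page `/courses/summer-bootcamp/topics/text-causal-inference` and `text_causal_lab.ipynb`.
> Goal: measure a piece of text as an identifiable causal variable, and deal honestly with measurement error, leakage, and sample splitting.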

---

## 中文版

### 目标
选一个文本进入因果链条的研究问题，把文本测成 D / Y / X，接入一个标准设计，给出可信估计与边界讨论。

### 任务
1. **角色判定**：明确你的文本是处理 D、结果 Y，还是混淆 X，并画出因果图。
2. **测量**：用词频 / 主题 / 嵌入 / LLM 把文本编码成数值变量，记录测量方案与版本。
3. **识别**：根据角色接入设计——文本=处理→连续 DiD / 事件研究；文本=混淆→文本匹配或 DML 残差化；文本=结果→标准设计 + 信度。
4. **样本分裂**：若文本表示是有监督学出来的，必须在不同折上学表示与估效应。
5. **稳健性**：测量信度（双标注 / 多次测量）、衰减偏误处理、后处理文本泄漏检查、overlap。
6. **报告**：估计 + 置信区间 + 测量风险 + limitations。

### 交付物
- `data/`（或来源说明）、`code/`（可复跑 notebook）、`report.md`（≤2 页）。
- 因果图 + 主估计 + 一段"测量误差与泄漏"讨论。

### 评分要点
| 维度 | 权重 |
|---|---|
| 角色判定与因果图清晰 | 25% |
| 测量方案与信度 | 25% |
| 识别与样本分裂正确 | 30% |
| 衰减 / 泄漏 / overlap 讨论 | 20% |

---

## English Version

### Goal
Pick a research question in which text enters the causal chain, encode the text as a treatment D, outcome Y, or confounder X, plug it into a standard design, and deliver a credible estimate together with a discussion of its limitations.

### Tasks
1. **Role**: decide whether your text is treatment D, outcome Y, or confounder X, and draw the causal graph.
2. **Measurement**: encode the text as numeric variables using word frequencies, topic models, embeddings, or LLMs, and record the measurement scheme and its version.
3. **Identification**: choose the design by role: text as treatment → continuous DiD or event study; text as confounder → text matching or DML residualization; text as outcome → standard design plus a reliability check.
4. **Sample splitting**: if the text representation is learned with supervision, you must learn the representation and estimate the effect on different folds.
5. **Robustness**: measurement reliability (double coding / repeated measures), correction for attenuation bias, checks for leakage from post-treatment text, and overlap.
6. **Report**: estimate + confidence interval + measurement risk + limitations.
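
Tasks 3 and 4 can be illustrated for the text-as-confounder case. The sketch below is a minimal cross-fitted partialling-out estimator, not the course's reference implementation: the simulated data, the function names `ridge_fit` and `cross_fit_plr`, and the choice of ridge regression as the nuisance learner are all assumptions made for illustration. `X` stands in for fixed, pre-computed text features; a representation that is itself learned with supervision would also have to be fit inside the training folds.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta


def cross_fit_plr(y, d, X, n_folds=5, alpha=1.0, seed=0):
    """Residualize y and d on X out of fold, then regress residual on residual."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        for target, res in ((y, y_res), (d, d_res)):
            # Nuisance model is fit on the other folds only.
            beta, intercept = ridge_fit(X[train], target[train], alpha)
            res[test] = target[test] - (X[test] @ beta + intercept)
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Plug-in standard error from the score d_res * (y_res - theta * d_res).
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi**2) / n) / np.mean(d_res**2)
    return theta, se


# Simulated example (hypothetical data): X confounds both d and y.
rng = np.random.default_rng(42)
n, p, true_theta = 2000, 10, 0.5
X = rng.normal(size=(n, p))                      # stand-in for text features
d = X @ np.full(p, 0.5) + rng.normal(size=n)     # treatment depends on text
y = true_theta * d + X @ np.full(p, 0.5) + rng.normal(size=n)

theta_hat, se_hat = cross_fit_plr(y, d, X)
naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)   # slope ignoring X
ci = (theta_hat - 1.96 * se_hat, theta_hat + 1.96 * se_hat)
print(f"naive slope: {naive:.3f}")
print(f"cross-fitted estimate: {theta_hat:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

On this simulated data the naive slope absorbs the confounding carried by `X`, while the cross-fitted estimate should land close to the true effect of 0.5. With real text, the linear nuisance model would typically be swapped for a more flexible learner, with the fold structure kept unchanged.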

### Deliverables
- `data/` (or a source note), `code/` (rerunnable notebook), `report.md` (≤ 2 pages).
- A causal graph, the main estimate, and a "measurement error and leakage" paragraph.
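
For the measurement-error paragraph, it can help to show how double coding translates into an attenuation correction. The sketch below assumes classical measurement error (noise independent of the true value and of the outcome) in a text-measured treatment, and two independent codings of the same documents; the simulated data and the names `code_a`, `code_b`, and `slope` are hypothetical. Under those assumptions the correlation between the two codings estimates the reliability ratio, and dividing the naive slope by it undoes the attenuation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
d_true = rng.normal(size=n)                       # latent text-based treatment
code_a = d_true + rng.normal(scale=0.7, size=n)   # coder A's noisy measurement
code_b = d_true + rng.normal(scale=0.7, size=n)   # coder B's noisy measurement
y = 1.0 * d_true + rng.normal(size=n)             # true slope is 1.0


def slope(x, y):
    """OLS slope of y on a single regressor x (with intercept)."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)


naive = slope(code_a, y)                          # attenuated toward zero
reliability = np.corrcoef(code_a, code_b)[0, 1]   # estimates var(D) / var(D*)
corrected = naive / reliability                   # disattenuated slope

print(f"naive: {naive:.3f}, reliability: {reliability:.3f}, corrected: {corrected:.3f}")
```

In this simulation the population reliability is 1 / (1 + 0.49) ≈ 0.67, so the naive slope should sit at roughly two thirds of the true value. The correction is only as good as the classical-error assumption; if coding errors are correlated across coders or with the outcome, the ratio no longer recovers the true slope and that belongs in the limitations.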

### Rubric
| Dimension | Weight |
|---|---|
| Clear role and causal graph | 25% |
| Measurement scheme and reliability | 25% |
| Correct identification and sample splitting | 30% |
| Attenuation / leakage / overlap discussion | 20% |
