
Agent Skills Basics: Package Research Procedures into Reusable, Tested Skills

Turn 'rewrite a prompt every time' into 'call a documented, scripted, tested skill,' so an agent runs research procedures discoverably, composably, and reliably.

A Skill is a procedure's manual plus scripts plus tests. It tells the agent what the procedure is called, when to use it, the steps, the inputs, and the artifacts. Unlike a one-off prompt, a skill is durable, reusable, testable, and composable — the key to turning a course project from hand-assembled into a reproducible pipeline.

Start Here

What you should be able to do

  1. Package a reusable research step into a skill: name, when_to_use, inputs, steps, artifacts.

  2. Understand skill selection: the agent matches the task description to the best skill instead of relying on a hard-coded call.

  3. Compose skills into a gated pipeline, and know that, when steps pass independently, overall reliability is the product of per-step pass rates.

  4. Know that skills need tests: cases plus a pass rate, treating skills like code.

  5. Understand progressive disclosure: load a skill body only when it is triggered, to save context.

Learning Path

Learning path: define -> select -> compose -> test -> disclose

Read skills along this path: write the procedure as a contract, select by description, compose a gated pipeline, write tests for each step, then manage context with progressive disclosure.

  1. Define: write the procedure as name + when_to_use + steps + run.

  2. Select: pick the best skill by task-description similarity.

  3. Compose: compose skills by dependency into a pipeline.

  4. Test: write cases per skill and track the pass rate.

  5. Disclose: load only triggered skill bodies to save context.

01 / Intuition

Core Intuition

A prompt is a one-off verbal instruction; a skill is a written, rerunnable, shareable procedure manual — the latter is like a lab SOP.

The core of a skill is when_to_use: a clear trigger lets the agent pick the right skill in the right situation instead of stuffing everything into one big prompt.

Skills compose: clean -> estimate -> table -> robustness, where each step is an independent, testable skill, and the chain is the research pipeline.

The more skills you have, the more context matters: describe all skills with a light index and load the full body only when a skill is triggered.

02 / Math

Model skills as contract, selection, composition, and reliability

01 / Skill contract

A skill is an executable unit with metadata: name, trigger, inputs, steps, and artifacts.
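One way to make the contract concrete is a small record type. The sketch below is illustrative only: the `Skill` class, the `missing_inputs` helper, and the placeholder `run_event_study` body are assumptions introduced here, not an established skill format. It assumes Python 3.9+ for the built-in generic annotations.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    """Hypothetical contract record; field names mirror the metadata listed above."""
    name: str
    when_to_use: str               # trigger description used for selection
    inputs: tuple[str, ...]        # state keys the skill requires
    steps: tuple[str, ...]         # human-readable procedure
    artifacts: tuple[str, ...]     # state keys the skill promises to produce
    run: Callable[[dict], dict]    # the executable part

def run_event_study(state: dict) -> dict:
    # Placeholder body: a real skill would call an estimation script here.
    return {**state, "coefs": [0.01, 0.04, 0.12]}

event_study = Skill(
    name="event_study",
    when_to_use="panel staggered treatment event time relative dummies",
    inputs=("clean",),
    steps=("build relative-time dummies", "estimate", "plot coefficients"),
    artifacts=("coefs",),
    run=run_event_study,
)

def missing_inputs(skill: Skill, state: dict) -> list[str]:
    """Return the declared inputs that are absent from the state."""
    return [key for key in skill.inputs if key not in state]

print(missing_inputs(event_study, {"panel": "firm-year.csv"}))  # ['clean']
out = event_study.run({"clean": True})
print(all(key in out for key in event_study.artifacts))         # True
```

Declaring inputs and artifacts as data lets a runner check them without reading the skill body.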

02 / Skill selection

The agent selects the best skill by similarity between the task and each skill description — like tool discovery, but higher level.

03 / Compose a pipeline

Skills compose by dependency order; one skill's artifact is the next skill's input.
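Dependency order can be derived rather than written by hand. The following is a minimal sketch using the standard-library `graphlib` module (Python 3.9+); the `DEPENDS_ON` map and the `execution_order` helper are hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each skill lists the skills whose artifacts it consumes.
DEPENDS_ON = {
    "clean_panel": set(),
    "event_study": {"clean_panel"},
    "make_table": {"event_study"},
    "robustness_suite": {"event_study"},
}

def execution_order(depends_on):
    """Return one valid run order; graphlib.CycleError is raised on circular dependencies."""
    return list(TopologicalSorter(depends_on).static_order())

order = execution_order(DEPENDS_ON)
print(order)
```

Skills that do not depend on each other (here `make_table` and `robustness_suite`) may appear in either order, so only the relative order of dependent skills should be relied on.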

04 / Pre-gate

Each skill checks its preconditions before running; if they are unmet, it returns early or raises an error instead of running on invalid state.
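A pre-gate can be written once and attached to any skill. This is a sketch under assumptions: the `requires` decorator and `PreconditionError` are hypothetical helpers, and the gate only checks that the keys are present in the state, not that their values are valid.

```python
import functools

class PreconditionError(RuntimeError):
    """Raised when a skill is invoked before its required state exists."""

def requires(*keys):
    """Hypothetical decorator: refuse to run unless every key is in the state."""
    def decorate(skill):
        @functools.wraps(skill)
        def gated(state):
            missing = [k for k in keys if k not in state]
            if missing:
                raise PreconditionError(f"{skill.__name__} is missing inputs: {missing}")
            return skill(state)
        return gated
    return decorate

@requires("clean")
def event_study(state):
    # Placeholder estimates; a real skill would run the estimation here.
    return {**state, "coefs": [0.01, 0.04, 0.12]}

try:
    event_study({"panel": "firm-year.csv"})
except PreconditionError as err:
    print("blocked:", err)  # blocked: event_study is missing inputs: ['clean']

print(event_study({"clean": True})["coefs"])  # [0.01, 0.04, 0.12]
```

Raising a dedicated exception type lets a pipeline runner tell "not ready yet" apart from a genuine crash inside the skill.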

05 / Pipeline reliability

If skills pass independently, pipeline reliability is the product of per-step pass rates — which is why you test each one.
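Written out, with notation introduced here for illustration (p_i is the pass rate of step i in an n-step pipeline, and steps are assumed to pass independently):

```latex
R = \prod_{i=1}^{n} p_i , \qquad \text{failure rate} = 1 - R
% Example: five steps with p_i = 0.95 give R = 0.95^{5} \approx 0.774.
```

Because every p_i is at most 1, adding a step can never raise R, which is why each step's pass rate matters.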

06 / Context economy

Progressive disclosure: context cost = index cost + the body cost of only the triggered skills.
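In symbols, using notation assumed here (S is the set of all skills, T ⊆ S the triggered skills, and c_index, c_body the token costs of a skill's index entry and body):

```latex
C_{\text{progressive}} = \sum_{s \in S} c_{\text{index}}(s) + \sum_{s \in T} c_{\text{body}}(s),
\qquad
C_{\text{all}} = \sum_{s \in S} \bigl( c_{\text{index}}(s) + c_{\text{body}}(s) \bigr)
% Saving: C_all - C_progressive = \sum_{s \in S \setminus T} c_body(s)
```

The saving is exactly the body cost of the skills that were not triggered.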

03 / Code

Code cases: define, select, compose, and test skills

Use plain Python to simulate the basic logic of skills: write a procedure as a contract, select by description, compose a pipeline, compute reliability, apply progressive disclosure, and write tests.

Case 1: select a skill automatically from the task description

Do not hard-code which skill to call. Let the agent select by overlap between task and skill descriptions.

# Registry: skill name -> short description used for matching.
REGISTRY = {
    "clean_panel": "balance a panel drop duplicates handle missing values",
    "event_study": "panel staggered treatment event time relative dummies",
    "make_table":  "render regression results into a publication table",
}

def select(task, registry):
    """Return the best-matching skill name and all overlap scores."""
    tok = set(task.lower().split())
    # Score = number of distinct words shared by the task and the description.
    scored = {name: len(tok & set(desc.lower().split())) for name, desc in registry.items()}
    # max() returns the first maximal key, so ties resolve by registry order.
    return max(scored, key=scored.get), scored

task = "estimate an event study for a staggered treatment in panel data"
best, scores = select(task, REGISTRY)
print("scores:", scores)
print("selected skill:", best)

Expected output

scores: {'clean_panel': 2, 'event_study': 4, 'make_table': 1}
selected skill: event_study

How to read this code

  • The most overlapping skill is selected — an argmax over similarity.
  • Real systems use embedding similarity, but the idea is the same: good descriptions enable good selection.
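As a step between raw word overlap and embedding similarity, the sketch below scores descriptions with cosine similarity over word counts. It is only a stand-in: real embeddings come from a model, which is not shown, and the `cosine` and `select_by_cosine` helpers are names assumed for illustration.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def select_by_cosine(task, registry):
    """Return the highest-scoring skill name and all scores."""
    scores = {name: cosine(task, desc) for name, desc in registry.items()}
    return max(scores, key=scores.get), scores

registry = {
    "clean_panel": "balance a panel drop duplicates handle missing values",
    "event_study": "panel staggered treatment event time relative dummies",
    "make_table":  "render regression results into a publication table",
}
task = "estimate an event study for a staggered treatment in panel data"
best, scores = select_by_cosine(task, registry)
print("selected skill:", best)  # selected skill: event_study
```

Unlike a raw overlap count, cosine similarity normalizes by description length, so a long description does not win simply by containing more words.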

Case 2: compose skills into a gated pipeline

clean -> estimate -> table, each an independent skill, where a failed pre-gate raises instead of proceeding.

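def clean_panel(state):
    # Mark the panel as cleaned; a real skill would transform the data here.
    state["clean"] = True
    return state

def event_study(state):
    # Pre-gate: refuse to estimate on an uncleaned panel.
    assert state.get("clean"), "needs a clean panel first"
    state["coefs"] = [0.01, 0.04, 0.12]
    return state

def make_table(state):
    # Pre-gate: refuse to render a table without estimates.
    assert "coefs" in state, "needs estimates first"
    state["table"] = "outputs/event_study.tex"
    return state

# Note: assert statements are skipped under `python -O`; use explicit raises for gates that must always run.
pipeline = [clean_panel, event_study, make_table]
state = {"panel": "firm-year.csv"}
for skill in pipeline:
    state = skill(state)   # each skill's artifact feeds the next skill
print("artifacts:", {k: state[k] for k in ("clean", "coefs", "table")})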

Expected output

artifacts: {'clean': True, 'coefs': [0.01, 0.04, 0.12], 'table': 'outputs/event_study.tex'}

How to read this code

  • One skill's artifact is the next skill's input, forming a research pipeline.
  • The gate (assert) stops a skill from running on invalid state, e.g., no estimation before cleaning.

Case 3: pipeline reliability = product of per-skill pass rates

Three skills with seemingly high pass rates, chained together, have a noticeably lower overall reliability.

import math
pass_rates = {"clean_panel": 0.99, "event_study": 0.95, "make_table": 0.98}
R = math.prod(pass_rates.values())
print("per-skill pass rates:", pass_rates)
print(f"pipeline reliability = {R:.3f}")
print(f"failure rate = {1 - R:.3f}  -> test each skill, do not trust the chain")

Expected output

per-skill pass rates: {'clean_panel': 0.99, 'event_study': 0.95, 'make_table': 0.98}
pipeline reliability = 0.922
failure rate = 0.078  -> test each skill, do not trust the chain

How to read this code

  • 0.99 × 0.95 × 0.98 ≈ 0.92, a failure rate near 8%.
  • This is why you test each skill rather than trust the whole chain.
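The same product rule can be inverted to ask how reliable each step must be for the chain to hit a target. This is a sketch under the simplifying assumptions that steps fail independently and are equally reliable; `required_pass_rate` is a hypothetical helper.

```python
def required_pass_rate(target: float, n_steps: int) -> float:
    """Per-step pass rate needed for n equally reliable, independent steps to reach the target."""
    if not 0 < target <= 1 or n_steps < 1:
        raise ValueError("target must be in (0, 1] and n_steps >= 1")
    return target ** (1 / n_steps)

for n in (3, 5, 10):
    print(n, "steps ->", round(required_pass_rate(0.95, n), 4))
```

For a 95% pipeline target, five steps already need roughly 99% each, so the bar per skill rises quickly as the chain grows.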

Case 4: progressive disclosure — load a skill body only when triggered

Describe all skills with a light index and load the full body only for the triggered skill.

skills = {
    "clean_panel": {"index": 40, "body": 1200},
    "event_study": {"index": 45, "body": 1500},
    "make_table":  {"index": 38, "body": 900},
}
triggered = ["event_study"]                       # only this matches the task
index_only = sum(s["index"] for s in skills.values())
progressive = index_only + sum(skills[n]["body"] for n in triggered)
load_all = index_only + sum(s["body"] for s in skills.values())
print("index-only tokens:        ", index_only)
print("progressive-disclosure:   ", progressive)
print("load-everything tokens:   ", load_all)

Expected output

index-only tokens:         123
progressive-disclosure:    1623
load-everything tokens:    3723

How to read this code

  • Index is 123 tokens; load only the triggered skill -> 1623; load all -> 3723.
  • Progressive disclosure lets a skill library grow large without flooding the context.
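The token arithmetic above can be paired with a loading mechanism. The sketch below keeps every index line visible and fetches a body only the first time its skill is triggered. `SkillLibrary`, `load_body`, and the stand-in body strings are assumptions for illustration; in practice the loader might read a SKILL.md file from disk.

```python
class SkillLibrary:
    """Hypothetical in-memory library: the index is always visible, bodies load on demand."""

    def __init__(self, index, load_body):
        self.index = index            # name -> one-line description (always in context)
        self._load_body = load_body   # callable: name -> full body text
        self._cache = {}              # bodies loaded so far

    def context(self):
        """Text the agent would see: every index line plus the bodies triggered so far."""
        lines = [f"{name}: {desc}" for name, desc in self.index.items()]
        return "\n".join(lines + list(self._cache.values()))

    def trigger(self, name):
        """Load a body the first time its skill is triggered; reuse it afterwards."""
        if name not in self.index:
            raise KeyError(f"unknown skill: {name}")
        if name not in self._cache:
            self._cache[name] = self._load_body(name)
        return self._cache[name]

# Stand-in bodies used only for this demonstration.
BODIES = {"clean_panel": "CLEAN BODY " * 50, "event_study": "EVENT BODY " * 50}
loads = []

def load_body(name):
    loads.append(name)   # record each real load so caching is observable
    return BODIES[name]

lib = SkillLibrary(
    {"clean_panel": "balance a panel", "event_study": "staggered treatment"},
    load_body,
)
before = len(lib.context())
lib.trigger("event_study")
lib.trigger("event_study")   # second call is served from the cache
print(before, len(lib.context()), loads)
```

Only the triggered body enters the context, and it is loaded once however often the skill is used.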

Case 5: skills need tests too

Like code, write cases (including edges and failures) for a skill and report the pass rate.

def event_study_skill(n_periods):
    if n_periods < 2:
        raise ValueError("need >= 2 periods")
    return {"ok": True, "periods": n_periods}

test_cases = [5, 3, 1, 8, 0]   # includes the edge inputs 1 and 0
results = []
for tc in test_cases:
    try:
        event_study_skill(tc)
        results.append(True)
    except Exception:
        # Any exception counts as a failed case in this simple harness.
        results.append(False)
print("results:", results)
print(f"pass rate = {sum(results)/len(results):.2f}  (skills need tests, like code)")

Expected output

results: [True, True, False, True, False]
pass rate = 0.60  (skills need tests, like code)

How to read this code

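  • Here the two failing cases are the edge inputs 1 and 0, which the skill rejects with a ValueError; the pass rate makes a skill's behavior on edge inputs visible.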
  • An untested skill fails silently as data and dependencies change.
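Case 5 counts every raised exception as a failure. A common refinement, sketched here with hypothetical names (`CASES`, `run_case`), records the expected outcome for each case so that a correct rejection of bad input counts as a pass.

```python
def event_study_skill(n_periods):
    if n_periods < 2:
        raise ValueError("need >= 2 periods")
    return {"ok": True, "periods": n_periods}

# Each case pairs an input with its expected outcome: a result dict or an exception type.
CASES = [
    (5, {"ok": True, "periods": 5}),
    (2, {"ok": True, "periods": 2}),   # boundary: smallest valid input
    (1, ValueError),                   # edge: must be rejected
    (0, ValueError),
]

def run_case(skill, arg, expected):
    """Return True when the skill's behavior matches the expectation."""
    try:
        result = skill(arg)
    except Exception as err:
        # A raise passes only if an exception of this type was expected.
        return isinstance(expected, type) and isinstance(err, expected)
    return result == expected

results = [run_case(event_study_skill, arg, exp) for arg, exp in CASES]
print("results:", results)                               # results: [True, True, True, True]
print(f"pass rate = {sum(results) / len(results):.2f}")  # pass rate = 1.00
```

With expectations recorded, a drop in the pass rate signals a real regression rather than an intended rejection.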

04 / Case

Case: package an event study into a reusable skill

  • You run event studies across many projects: build relative-time dummies, estimate, plot coefficients. Rewriting each time invites errors.
  • Package it as an event_study skill: SKILL.md states when to use it and the steps, scripts/ holds rerunnable code, tests/ holds cases.
  • Given the task 'estimate an event study for a staggered treatment,' the agent selects the skill by description and checks the panel is cleaned before running (a gate).
  • Skills compose: clean_panel -> event_study -> make_table -> robustness_suite, each leaving a trace; overall reliability is the product of pass rates, so every skill needs tests.

05 / Causal

Bridge to causal inference: make identification and robustness into skills

Skills make the key steps of causal research reusable, testable, and auditable: identification checks, estimation, and robustness can each be independent skills composed into a reproducible causal pipeline.

01 / Identification checks as skills (assumption -> skill)

Write parallel-trends, overlap, and instrument relevance / exclusion as check skills, run before estimation.
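As one illustration of a check skill, the sketch below screens pre-period event-study coefficients before estimation proceeds. The function name, the 1.96 threshold, and the example numbers are all assumptions made for illustration. It is a crude screen, not a formal test, and passing it does not establish parallel trends.

```python
def check_pretrends(pre_coefs, pre_ses, z_crit=1.96):
    """Flag pre-period coefficients whose |t| exceeds z_crit (hypothetical check skill)."""
    if len(pre_coefs) != len(pre_ses):
        raise ValueError("coefficients and standard errors must align")
    if any(se <= 0 for se in pre_ses):
        raise ValueError("standard errors must be positive")
    t_stats = [b / se for b, se in zip(pre_coefs, pre_ses)]
    flagged = [i for i, t in enumerate(t_stats) if abs(t) > z_crit]
    return {"passed": not flagged, "flagged_periods": flagged, "t_stats": t_stats}

# Illustrative numbers only, not results from any dataset.
report = check_pretrends(pre_coefs=[0.01, -0.02, 0.15], pre_ses=[0.03, 0.03, 0.04])
print(report["passed"], report["flagged_periods"])  # False [2]
```

Returning a structured report rather than a bare boolean leaves a trace for the human who still has to defend the assumption.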

02 / Estimation as skills (design -> estimator skill)

Package an estimation skill per design (DiD / IV / RD / DML), callable by passing design parameters.
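A registry keyed by design is one way to make estimators callable by passing design parameters. In this sketch only a textbook 2x2 difference-in-differences on group means is filled in. The `ESTIMATORS` registry, the `estimate` dispatcher, and the sample numbers are hypothetical, and the other designs are deliberately left unregistered.

```python
from statistics import mean

def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Canonical 2x2 difference-in-differences computed from group means."""
    return (mean(treated_post) - mean(treated_pre)) - (mean(control_post) - mean(control_pre))

# Hypothetical registry: one estimator skill per design. Only DiD is registered here.
ESTIMATORS = {"did": did_2x2}

def estimate(design, **params):
    """Dispatch to the estimator skill registered for the design."""
    try:
        estimator = ESTIMATORS[design]
    except KeyError:
        raise ValueError(f"no estimator skill registered for design {design!r}") from None
    return estimator(**params)

# Illustrative outcome values only.
effect = estimate(
    "did",
    treated_pre=[1.0, 1.2], treated_post=[2.0, 2.2],
    control_pre=[1.0, 1.0], control_post=[1.5, 1.5],
)
print(round(effect, 3))  # 0.5
```

An unregistered design fails loudly instead of falling back to a default estimator, which keeps the choice of design explicit.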

03 / Robustness as a skill suite (estimate -> robustness suite)

Bundle placebo, sensitivity, and honest DiD into one reusable robustness skill suite.

04 / Reliability is quantifiable (chain -> tested chain)

If skills pass independently, the reliability of the whole causal pipeline is the product of the skill pass rates, which is the argument for testing every step.

Three red lines: (1) skills reuse procedures, but identification assumptions still need a human to defend; (2) skills must have tests, or they fail silently when data changes; (3) high-risk skills need a gate and human confirmation — automation must not overreach.

06 / Risks

Common Pitfalls

  • Writing a skill as a vague prompt with no clear when_to_use or inputs/outputs, so the agent mis-selects or misuses it.
  • Piling up skills without tests: skills break as data and dependencies change, and without cases you cannot detect it.
  • Ignoring composition reliability: five 95% skills chained give about 77% overall, so harden each one.
  • Stuffing all skill bodies into the context at once; load only when triggered.
  • Hard-coding paths or samples inside a skill, preventing reuse; parameterize the inputs.
  • Letting a skill take high-risk actions (delete data, overwrite results) without a gate and human confirmation.
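The last pitfall can be guarded against with a confirmation gate. This is a minimal sketch: `run_with_confirmation` and its `confirm` callback are hypothetical, and the "delete" action only appends to a list so the example is safe to run.

```python
def run_with_confirmation(action, description, high_risk, confirm):
    """Run `action` only if it is low risk or the `confirm` callback approves it.

    `confirm` is a hypothetical hook: interactively it could wrap input();
    in tests it can be a plain function returning True or False.
    """
    if high_risk and not confirm(description):
        return {"ran": False, "reason": "declined by human reviewer"}
    return {"ran": True, "result": action()}

deleted = []

def delete_raw_data():
    deleted.append("raw/firm-year.csv")   # stand-in for a destructive step
    return "deleted"

outcome = run_with_confirmation(
    delete_raw_data, "delete raw/firm-year.csv",
    high_risk=True, confirm=lambda msg: False,
)
print(outcome, deleted)  # {'ran': False, 'reason': 'declined by human reviewer'} []
```

Passing the confirmation step in as a callback keeps the gate testable without a human at the keyboard.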
