Capstone: build an eval harness

Ship the keystone artifact — and your calling card.

Time to build. Create a reusable, CI-integrated eval harness for one real AI feature. This is the single most valuable artifact you can own as a lead: it makes quality measurable and becomes the standard other teams adopt.

Key ideas

1. Done means: a golden dataset, automated scoring, a CI gate that fails on regression, and a score-over-time view, all reusable by other teams.
2. Ground it in error analysis on real outputs, not generic benchmarks.
3. Include cost and latency, not just quality.
4. Make it a template others can copy: clear README, swappable dataset and scorers.
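
One way to make the dataset and scorers swappable is to agree on a small row type and a scorer signature, then keep scorers in a plain registry. The sketch below is illustrative only: the `Case` fields, the JSONL layout, and the two example scorers are assumptions, not a prescribed schema.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One golden-dataset row. Field names are assumptions for this sketch."""
    id: str
    input: str
    expected: str


# A scorer maps (case, model output) to a score in [0, 1].
Scorer = Callable[[Case, str], float]


def load_cases(jsonl_text: str) -> List[Case]:
    """Parse one JSON object per non-empty line into Case records."""
    return [
        Case(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def exact_match(case: Case, output: str) -> float:
    """1.0 when the output equals the expected answer, ignoring outer whitespace."""
    return 1.0 if output.strip() == case.expected.strip() else 0.0


def contains_expected(case: Case, output: str) -> float:
    """1.0 when the expected answer appears anywhere in the output (case-insensitive)."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0


# Another team swaps behaviour by editing this registry, not the runner.
SCORERS: Dict[str, Scorer] = {
    "exact_match": exact_match,
    "contains_expected": contains_expected,
}


def score_output(
    case: Case, output: str, scorers: Dict[str, Scorer] = SCORERS
) -> Dict[str, float]:
    """Apply every registered scorer to a single output."""
    return {name: fn(case, output) for name, fn in scorers.items()}
```

Because the runner only depends on `Case` and the `Scorer` signature, a rubric or LLM-judge scorer could be registered the same way without touching the rest of the harness.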

Build steps

Pick one AI feature and collect 50–100 representative inputs (include edge cases).
Do error analysis; define 3–6 checks (assertions, rubric, or calibrated LLM-judge).
Implement a runner that scores the dataset and outputs a report (quality, cost, latency).
Add a CI job that fails the build when scores drop below a threshold.
Write a README so another team can point it at their feature in under an hour.
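
The runner and CI gate from the steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the feature is a hypothetical callable that returns its output together with a cost in USD, each dataset row is a dict with an `input` key, and quality is a single pass rate. A real harness would likely report one score per check.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical feature signature: input text -> (output text, cost in USD).
Feature = Callable[[str], Tuple[str, float]]
# Hypothetical check signature: (dataset row, output) -> pass/fail.
Check = Callable[[dict, str], bool]


def run_eval(feature: Feature, dataset: List[dict], check: Check) -> Dict[str, float]:
    """Run the feature over a non-empty dataset and summarise quality, cost, latency."""
    scores: List[float] = []
    latencies: List[float] = []
    total_cost = 0.0
    for row in dataset:
        start = time.perf_counter()
        output, cost = feature(row["input"])
        latencies.append(time.perf_counter() - start)
        total_cost += cost
        scores.append(1.0 if check(row, output) else 0.0)
    return {
        "n": len(dataset),
        "pass_rate": statistics.mean(scores),
        "total_cost_usd": total_cost,
        "median_latency_s": statistics.median(latencies),
    }


def gate(report: Dict[str, float], min_pass_rate: float) -> int:
    """Return a process exit code: 0 when the threshold is met, 1 on regression."""
    return 0 if report["pass_rate"] >= min_pass_rate else 1


# In a CI script, something like the following would fail the build on regression:
#   report = run_eval(my_feature, my_dataset, my_check)
#   print(report)
#   raise SystemExit(gate(report, min_pass_rate=0.9))
```

Most CI systems treat a non-zero exit code as a failed job, so returning the code from `gate` keeps the threshold logic testable on its own. The 0.9 threshold in the comment is a placeholder; a sensible value comes from the baseline scores on your own dataset.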

Stretch goals

Pairwise comparison vs a baseline; track win-rate.
Pull a sample of production traffic into the dataset automatically.
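
For the pairwise stretch goal, win-rate can be computed from per-example verdicts. The sketch below assumes a hypothetical `judge` callable that returns `"candidate"`, `"baseline"`, or `"tie"`, and it uses one common convention: wins divided by decided (non-tie) comparisons. Other definitions, such as counting ties as half a win, are equally reasonable.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (input, candidate output, baseline output) -> verdict.
Judge = Callable[[str, str, str], str]


def win_rate(pairs: List[dict], judge: Judge) -> Dict[str, float]:
    """Tally verdicts and compute candidate wins over decided comparisons."""
    counts = {"candidate": 0, "baseline": 0, "tie": 0}
    for pair in pairs:
        verdict = judge(pair["input"], pair["candidate"], pair["baseline"])
        if verdict not in counts:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    decided = counts["candidate"] + counts["baseline"]
    return {
        **counts,
        # With no decided comparisons, report 0.0 rather than dividing by zero.
        "win_rate": counts["candidate"] / decided if decided else 0.0,
    }


def shorter_wins(_input: str, candidate: str, baseline: str) -> str:
    """Toy judge for demonstration only: prefers the shorter answer."""
    if len(candidate) == len(baseline):
        return "tie"
    return "candidate" if len(candidate) < len(baseline) else "baseline"
```

If the judge is an LLM, it may be worth randomising which output is shown first, since judges can favour a position. Treat that as something to verify during calibration rather than a given.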

Watch

Why AI evals are the hottest new skill (Husain & Shankar)

Building an eval harness (promptfoo / Braintrust)

Test yourself

What makes this harness valuable beyond your own feature?
