Capstone: build an eval harness
Ship the keystone artifact, which doubles as your calling card.
Time to build. Create a reusable, CI-integrated eval harness for one real AI feature. As a lead, this is arguably the most valuable artifact you can own: it makes quality measurable, and it can become the standard that other teams adopt.
Key ideas
1. Done means: a golden dataset, automated scoring, a CI gate that fails on regression, and a score-over-time view, reusable by other teams.
2. Ground it in error analysis on real outputs, not generic benchmarks.
3. Include cost & latency, not just quality.
4. Make it a template others can copy: clear README, swappable dataset and scorers.
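One way to keep the dataset and scorers swappable is to treat both as plain data: cases load from a JSONL file, and scorers live in a registry keyed by name. The sketch below is only an illustration under assumed names (`Case`, `load_cases`, `SCORERS`, and both example scorers are hypothetical, as is the JSONL field layout); adapt the fields to whatever your feature actually needs.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One golden-dataset row (hypothetical schema)."""
    case_id: str
    input: str
    expected: str


# A scorer maps (case, model output) to a score between 0.0 and 1.0.
Scorer = Callable[[Case, str], float]


def load_cases(jsonl_text: str) -> List[Case]:
    """Parse a JSONL dataset: one JSON object per non-blank line."""
    return [
        Case(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def contains_expected(case: Case, output: str) -> float:
    """Assertion-style check: the expected text appears in the output."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def under_length_limit(case: Case, output: str, limit: int = 200) -> float:
    """Guardrail check: the output stays within a character budget."""
    return 1.0 if len(output) <= limit else 0.0


# Another team swaps scorers by editing this registry, not the runner.
SCORERS: Dict[str, Scorer] = {
    "contains_expected": contains_expected,
    "under_length_limit": under_length_limit,
}


def score_output(
    case: Case, output: str, scorers: Dict[str, Scorer] = SCORERS
) -> Dict[str, float]:
    """Apply every registered scorer to a single output."""
    return {name: scorer(case, output) for name, scorer in scorers.items()}
```

To point the harness at a different feature, a team would replace the JSONL file and the registry entries and leave everything else alone.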
Build steps
- Pick one AI feature and collect 50–100 representative inputs (include edge cases).
- Do error analysis on the outputs, then define 3–6 checks (assertions, a rubric, or a calibrated LLM judge).
- Implement a runner that scores the dataset and outputs a report (quality, cost, latency).
- Add a CI job that fails the build when scores drop below a threshold.
- Write a README so another team can point it at their feature in under an hour.
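The runner and CI gate from the steps above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: `fake_feature` stands in for a real model call, the per-call cost is a made-up number, and the 0.9 threshold is an arbitrary example. The only real mechanism it relies on is that a nonzero process exit code fails a CI job.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical feature signature: input text -> (output text, cost in USD).
Feature = Callable[[str], Tuple[str, float]]

# Tiny stand-in dataset of (input, expected output); a real one has 50-100 rows.
CASES: List[Tuple[str, str]] = [
    ("hello", "HELLO"),
    ("world", "WORLD"),
    ("edge case", "Edge Case"),  # deliberately fails against fake_feature
]


def fake_feature(text: str) -> Tuple[str, float]:
    """Stand-in for a real model call so the sketch runs offline."""
    return text.upper(), 0.001  # invented cost per call


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible check; swap in your own scorers."""
    return output == expected


def run_eval(
    feature: Feature,
    cases: List[Tuple[str, str]],
    check: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run every case and aggregate quality, cost, and latency."""
    scores, costs, latencies = [], [], []
    for text, expected in cases:
        start = time.perf_counter()
        output, cost = feature(text)
        latencies.append(time.perf_counter() - start)
        costs.append(cost)
        scores.append(1.0 if check(output, expected) else 0.0)
    return {
        "pass_rate": statistics.mean(scores),
        "total_cost_usd": sum(costs),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
    }


def gate(report: Dict[str, float], min_pass_rate: float) -> int:
    """Return an exit code: 0 passes the build, 1 fails it."""
    return 0 if report["pass_rate"] >= min_pass_rate else 1


def main() -> int:
    report = run_eval(fake_feature, CASES, exact_match)
    print(report)
    return gate(report, min_pass_rate=0.9)  # example threshold, an assumption

# In the CI entry point, end the script with: raise SystemExit(main())
```

Here two of three cases pass, so `main()` returns 1 and the build would fail. Writing each report to a file keyed by commit is one simple way to get the score-over-time view.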
Stretch goals
- Pairwise comparison vs a baseline; track win-rate.
- Pull a sample of production traffic into the dataset automatically.
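For the pairwise stretch goal, win-rate can be computed by asking a judge to pick between the candidate and the baseline output for each input. The sketch below makes two assumptions worth flagging: ties count as half a win (one common convention, not the only one), and `shorter_wins` is a toy judge used only so the example runs. With an LLM judge you would likely also want to randomize the order of the two outputs to limit position bias.

```python
from typing import Callable, Sequence

# A judge returns "candidate", "baseline", or "tie" for one pair of outputs.
Judge = Callable[[str, str], str]


def shorter_wins(candidate: str, baseline: str) -> str:
    """Toy judge for illustration only: prefers the shorter output."""
    if len(candidate) < len(baseline):
        return "candidate"
    if len(candidate) > len(baseline):
        return "baseline"
    return "tie"


def win_rate(
    candidate_outputs: Sequence[str],
    baseline_outputs: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of pairs the candidate wins, counting a tie as half a win."""
    if not candidate_outputs or len(candidate_outputs) != len(baseline_outputs):
        raise ValueError("need two non-empty output lists of equal length")
    wins = 0
    ties = 0
    for cand, base in zip(candidate_outputs, baseline_outputs):
        verdict = judge(cand, base)
        if verdict == "candidate":
            wins += 1
        elif verdict == "tie":
            ties += 1
    return (wins + 0.5 * ties) / len(candidate_outputs)
```

A win-rate near 0.5 means the candidate is indistinguishable from the baseline under this judge, so track it over time rather than reading too much into a single run.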
Test yourself
What makes this harness valuable beyond your own feature?