An eval you run manually before each release isn't an eval · it's hope. The RAG systems that actually hold up in production have evals-as-code: a fixed gold set, metric classes in CI, regression blocking, and diff reporting per pull request.
Five metric classes we measure
- Faithfulness · did the answer stay inside the retrieved context?
- Context precision · did retrieval pull the right chunks?
- Answer relevance · did the answer address the question?
- Bias · did neutral context produce a neutral answer?
- Injection resistance · did the system resist 80+ known attack patterns?
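As a sketch of what one of these metric classes can look like in code, here is a crude lexical faithfulness proxy · the fraction of answer tokens that appear in the retrieved context. This is an illustrative assumption, not the post's actual metric; production systems typically use an LLM-as-judge or a library metric instead of token overlap.

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Crude lexical proxy for faithfulness: the fraction of answer
    tokens that also appear in the retrieved context. 1.0 means every
    answer token is grounded; 0.0 means none are."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 1.0  # an empty answer makes no ungrounded claims
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A proxy like this is cheap enough to run on every PR; the point is having a number in CI at all, which you can later swap for a stronger judge.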
Gold set discipline
Version the gold set in the repo · not a spreadsheet, not a shared Notion page. Any change to it requires a PR with reviewer sign-off. The eval harness pins to a specific gold-set version so you can compare eval runs across time fairly.
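One minimal way to pin the harness to a gold-set version is to store a content hash next to the file and refuse to run if they disagree. A sketch, assuming a hypothetical repo layout where the gold set is JSONL with a sibling `.sha256` file:

```python
import hashlib
import json
from pathlib import Path


def load_gold_set(path: Path) -> list[dict]:
    """Load a JSONL gold set, failing loudly if its content hash
    no longer matches the pinned hash stored alongside it."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    pinned = path.with_suffix(".sha256").read_text().strip()
    if digest != pinned:
        raise RuntimeError(f"gold set drifted: {digest} != pinned {pinned}")
    return [json.loads(line) for line in data.splitlines() if line.strip()]
```

Pinning to a git tag or commit of the gold-set file achieves the same thing; the hash check just makes the pin explicit inside the harness itself.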
Regression blocking
CI fails the build when any metric drops more than 5 points (0.05 on a 0–1 scale) from the baseline. Baseline updates only merge when a human explicitly approves · not automatically on every improvement. This stops the drift where 'small' regressions compound.
    if current_score < baseline_score - 0.05:
        raise AssertionError(
            f'Regression · {metric}: {current_score:.3f} < {baseline_score:.3f} - 0.05'
        )
Weekly canary against live model
Beyond PR-blocking evals, we run the same suite weekly against live production traffic · this catches drift between what CI saw at merge time and what happens in the wild (query-distribution shift, prompt-template drift, data staleness).
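The weekly canary can reuse the same threshold logic as CI. A sketch of the comparison step, assuming scored live samples are already collected per metric (the function name and data shapes are illustrative, not the author's actual pipeline):

```python
import statistics


def canary_report(
    live_scores: dict[str, list[float]],
    ci_baseline: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Flag metrics whose mean over live-traffic samples has drifted
    more than `tolerance` below the baseline recorded at merge time."""
    alerts = []
    for metric, scores in live_scores.items():
        live_mean = statistics.mean(scores)
        if live_mean < ci_baseline.get(metric, 0.0) - tolerance:
            alerts.append(
                f"{metric}: live {live_mean:.3f} vs baseline {ci_baseline[metric]:.3f}"
            )
    return alerts
```

Because the thresholding is shared with CI, a weekly alert means the same thing a red PR does: the metric moved, not the measuring stick.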
The fastest evals-as-code win: point promptfoo at 20 of your real customer prompts + what you'd accept as the right answer. Ship to CI. Before that, any LLM quality conversation is vibes.
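A starter promptfoo config for that first win might look like the following. This is a hedged sketch: the prompt, provider, and test values are placeholders to replace with your own customer prompts and accepted answers, and the exact provider id depends on your setup.

```yaml
# promptfooconfig.yaml · illustrative placeholders, not real data
prompts:
  - "Answer using only the provided context:\n{{context}}\n\nQuestion: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: contains
        value: "30 days"
```

Twenty entries under `tests`, run on every PR, and the quality conversation has numbers in it.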

By
Dezső Mező
Founder, DField Solutions
I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.