LLM evals-as-code · the CI gate we run on every RAG deploy
An eval that's not in CI is not an eval. Here's the evals-as-code workflow we run on every RAG project.
An eval you run manually before each release isn't an eval · it's hope. The RAG systems that actually hold up in production have evals-as-code: a fixed gold set, metric classes in CI, regression blocking, and diff reporting per pull request.
Version the gold set in the repo · not a spreadsheet, not a shared Notion page. Any change to it requires a PR with reviewer sign-off. The eval harness pins to a specific gold-set version so you can compare eval runs across time fairly.
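One way to enforce that pin, as a minimal sketch: store a SHA-256 digest of the gold-set file alongside the harness and refuse to run when the file on disk doesn't match it. The JSONL layout and the helper names `gold_set_digest` and `load_pinned_gold_set` are assumptions made for illustration, not part of any particular harness.

```python
import hashlib
import json
from pathlib import Path


def gold_set_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the gold-set file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_pinned_gold_set(path: Path, expected_digest: str) -> list[dict]:
    """Load a JSONL gold set, refusing to run if it differs from the pin."""
    actual = gold_set_digest(path)
    if actual != expected_digest:
        raise ValueError(
            f"Gold set {path} does not match the pinned version: "
            f"expected {expected_digest[:12]}, got {actual[:12]}"
        )
    # One JSON object per non-empty line (hypothetical layout).
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because the digest changes whenever the file changes, updating the pin becomes part of the same reviewed PR that edits the gold set.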
CI fails the build when any metric drops more than 5 points (0.05 on a 0-1 scale) below the baseline. Baseline updates merge only when a human explicitly approves them · not automatically on every improvement. This stops the drift where 'small' regressions compound.
# Fail the build if the metric fell more than 0.05 below the baseline.
if current_score < baseline_score - 0.05:
    raise AssertionError(
        f'Regression · {metric}: {current_score:.3f} < {baseline_score:.3f} - 0.05'
    )
Beyond PR-blocking evals, we run the same suite weekly against live production traffic · this catches drift between what CI saw at merge time and what happens in the wild (query distribution shift, prompt-template drift, data staleness).
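A rough sketch of how the weekly comparison could be reported: take the per-metric scores recorded in CI at merge time and the scores from the production sample, then list every metric that trails by more than a tolerance. The function name `drift_report`, the dict-of-scores shape, and reusing the 0.05 threshold for the weekly run are all assumptions here.

```python
def drift_report(
    ci_scores: dict[str, float],
    prod_scores: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return one line per metric whose production score trails CI by more than tolerance."""
    findings = []
    for metric in sorted(ci_scores):
        if metric not in prod_scores:
            # A metric that silently disappears is itself worth flagging.
            findings.append(f"{metric}: missing from production run")
            continue
        delta = prod_scores[metric] - ci_scores[metric]
        if delta < -tolerance:
            findings.append(
                f"{metric}: {ci_scores[metric]:.3f} -> "
                f"{prod_scores[metric]:.3f} ({delta:+.3f})"
            )
    return findings
```

An empty list means production is within tolerance of what CI saw; anything else is a prompt to look at query distribution, prompt templates, or stale data before the next merge.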
The fastest evals-as-code win: point promptfoo at 20 of your real customer prompts + what you'd accept as the right answer. Ship to CI. Before that, any LLM quality conversation is vibes.

Founder, DField Solutions
I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.