An eval you run manually before each release isn't an eval · it's hope. The RAG systems that actually hold up in production have evals-as-code: a fixed gold set, metric classes in CI, regression blocking, and diff reporting per pull request.
Five metric classes we measure
- Faithfulness · did the answer stay inside the retrieved context?
- Context precision · did retrieval pull the right chunks?
- Answer relevance · did the answer address the question?
- Bias · did neutral context produce a neutral answer?
- Injection resistance · did the system resist 80+ known attack patterns?
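As a sketch of what one of these metric classes can look like in code, here is a crude lexical faithfulness proxy · the fraction of answer tokens that appear in the retrieved context. This is an illustrative assumption, not the post's actual metric; production systems typically use an LLM-as-judge or a library metric instead of token overlap.

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Crude lexical proxy for faithfulness: the fraction of answer
    tokens that also appear in the retrieved context. 1.0 means every
    answer token is grounded; 0.0 means none are."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 1.0  # an empty answer makes no ungrounded claims
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A proxy like this is cheap enough to run on every PR; the point is having a number in CI at all, which you can later swap for a stronger judge.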
Gold set discipline
Version the gold set in the repo · not a spreadsheet, not a shared Notion page. Any change to it requires a PR with reviewer sign-off. The eval harness pins to a specific gold-set version so you can compare eval runs across time fairly.
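One minimal way to pin the harness to a gold-set version is to store a content hash next to the file and refuse to run if they disagree. A sketch, assuming a hypothetical repo layout where the gold set is JSONL with a sibling `.sha256` file:

```python
import hashlib
import json
from pathlib import Path


def load_gold_set(path: Path) -> list[dict]:
    """Load a JSONL gold set, failing loudly if its content hash
    no longer matches the pinned hash stored alongside it."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    pinned = path.with_suffix(".sha256").read_text().strip()
    if digest != pinned:
        raise RuntimeError(f"gold set drifted: {digest} != pinned {pinned}")
    return [json.loads(line) for line in data.splitlines() if line.strip()]
```

Pinning to a git tag or commit of the gold-set file achieves the same thing; the hash check just makes the pin explicit inside the harness itself.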
Regression blocking
CI fails the build when any metric drops more than 5 points (0.05 on a 0–1 scale) from the baseline. Baseline updates only merge when a human explicitly approves · not automatically on every improvement. This stops the drift where 'small' regressions compound.
    if current_score < baseline_score - 0.05:
        raise AssertionError(
            f'Regression · {metric}: {current_score:.3f} < {baseline_score:.3f} - 0.05'
        )
Weekly canary against live model
Beyond PR-blocking evals, we run the same suite weekly against live production traffic · this catches drift between what CI saw at merge time and what happens in the wild (query-distribution shift, prompt-template drift, data staleness).
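The weekly canary can reuse the same threshold logic as CI. A sketch of the comparison step, assuming scored live samples are already collected per metric (the function name and data shapes are illustrative, not the author's actual pipeline):

```python
import statistics


def canary_report(
    live_scores: dict[str, list[float]],
    ci_baseline: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Flag metrics whose mean over live-traffic samples has drifted
    more than `tolerance` below the baseline recorded at merge time."""
    alerts = []
    for metric, scores in live_scores.items():
        live_mean = statistics.mean(scores)
        if live_mean < ci_baseline.get(metric, 0.0) - tolerance:
            alerts.append(
                f"{metric}: live {live_mean:.3f} vs baseline {ci_baseline[metric]:.3f}"
            )
    return alerts
```

Because the thresholding is shared with CI, a weekly alert means the same thing a red PR does: the metric moved, not the measuring stick.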
The fastest evals-as-code win: point promptfoo at 20 of your real customer prompts + what you'd accept as the right answer. Ship to CI. Before that, any LLM quality conversation is vibes.
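A starter promptfoo config for that first win might look like the following. This is a hedged sketch: the prompt, provider, and test values are placeholders to replace with your own customer prompts and accepted answers, and the exact provider id depends on your setup.

```yaml
# promptfooconfig.yaml · illustrative placeholders, not real data
prompts:
  - "Answer using only the provided context:\n{{context}}\n\nQuestion: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: contains
        value: "30 days"
```

Twenty entries under `tests`, run on every PR, and the quality conversation has numbers in it.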

By
Dezső Mező
Founder, DField Solutions
I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.