DField Solutions · Engineering studio · Budapest

Eval harness

Related service: AI solutions

DEFINITION

An eval harness is the runnable infrastructure that, on every model bump, every prompt change, and every release, automatically runs a fixed test set, computes metrics (accuracy, factuality, refusal rate, latency, cost), stores the results as a time series, and blocks the release if any metric crosses its threshold. Saying "we tested it" usually means a developer ran a few prompts through once by hand. An eval harness is the CI-wired, regression-catching, version-comparing form of that check. Without one, every model bump is flying blind. A serious LLM eval stack today has four parts: a dataset, a runner (Promptfoo, Inspect, or in-house), a scoring layer (LLM-as-judge plus deterministic asserts), and a dashboard where yesterday's run sits next to the new one.
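As a rough illustration of the loop described above (fixed test set, metric, stored history, release gate), here is a minimal Python sketch. Everything in it is an assumption made for the example: `fake_model` stands in for a real model call, the three cases stand in for a real dataset, and the substring check stands in for a real scoring layer. It is not how Promptfoo or Inspect work internally.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # substring the answer must contain (a deterministic assert)


def run_eval(model: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Run every case through the model and compute accuracy."""
    passed = sum(1 for c in cases if c.expected.lower() in model(c.prompt).lower())
    return {"accuracy": passed / len(cases)}


def gate(baseline: dict[str, float], current: dict[str, float],
         max_drop: float = 0.02) -> list[str]:
    """Return the metrics that fell more than max_drop below the baseline."""
    return [m for m, base in baseline.items()
            if base - current.get(m, 0.0) > max_drop]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    answers = {"Capital of Hungary?": "Budapest", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


# Hypothetical fixed test set; a real one would be versioned alongside the code.
CASES = [
    Case("Capital of Hungary?", "budapest"),
    Case("2 + 2?", "4"),
    Case("Capital of France?", "paris"),
]

history: list[dict[str, float]] = []  # a real harness would persist this
baseline = {"accuracy": 1.0}          # assumed result of the previous run

current = run_eval(fake_model, CASES)
history.append(current)
failures = gate(baseline, current)

# In CI, a non-empty `failures` list would translate to a non-zero exit code.
print("regressed metrics:", failures)
```

The gate compares against the previous run rather than an absolute number, which is one possible design; a fixed floor per metric works the same way with a different `baseline`.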

RELATED TERMS
  • Context Engineering

    The successor to prompt engineering: deliberately curating what enters the model's context window (system prompt, retrieved docs, tools, memory). The goal is maximum accuracy with the fewest tokens. A model only knows what you put in front of it.

  • AI Gateway

    A proxy layer between your app and LLM providers (OpenAI, Anthropic) that handles routing, retries, caching, rate limits, key management, cost tracking, and failover. One place to see your whole AI bill, and no lock-in to a single vendor.

  • Model Routing

    Send each request to the cheapest model that can handle it: a small model for easy queries, a frontier model for hard ones - often decided by a classifier. Cuts inference cost dramatically, frequently 5-10× on real traffic.

  • Graph RAG

    A RAG variant that retrieves over a knowledge graph (entities + relationships) instead of flat text chunks. Lets the model answer multi-hop questions ("how is X connected to Y?") that pure vector search misses.

  • Agent Memory

    How an AI agent persists state across turns and sessions: short-term (the context window), long-term (a vector store / DB of facts), and episodic. The difference between an agent that forgets and one that learns your business.

  • Synthetic Data

    Model-generated training and eval data for when real data is scarce, sensitive (GDPR), or imbalanced. Useful, but you must check quality and diversity - otherwise you bake the model's own blind spots into your system.
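The Model Routing entry above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names and costs are made up, and `classify_difficulty` is a toy keyword-and-length heuristic standing in for the trained classifier the entry mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # illustrative numbers, not real prices


# Hypothetical tiers; real deployments would name actual models here.
SMALL = ModelTier("small-model", 0.1)
FRONTIER = ModelTier("frontier-model", 1.0)

# Assumed signals that a query needs more reasoning.
HARD_HINTS = ("prove", "step by step", "analyze", "compare", "why")


def classify_difficulty(query: str) -> str:
    """Toy stand-in for a trained classifier: keyword and length heuristics."""
    text = query.lower()
    if len(text.split()) > 40 or any(hint in text for hint in HARD_HINTS):
        return "hard"
    return "easy"


def route(query: str) -> ModelTier:
    """Pick the cheapest tier the classifier considers sufficient."""
    return FRONTIER if classify_difficulty(query) == "hard" else SMALL


print(route("What time is it?").name)
print(route("Compare these two contracts and explain why they differ").name)
```

A router like this is itself something an eval harness should cover: misrouted hard queries show up as an accuracy drop, misrouted easy ones as a cost increase.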

MENTIONED IN THE BLOG