Eval harness

DEFINITION

An eval harness is the runnable infrastructure that, on every model bump, prompt change, and release, automatically runs a fixed test set, computes metrics (accuracy, factuality, refusal rate, latency, cost), stores the results as a time series, and blocks the release if any metric crosses its threshold. Saying "we tested it" usually means a developer ran a few prompts through once by hand. An eval harness is the CI-wired, regression-catching, version-comparing form of that check. Without one, every model bump is flying blind. A serious LLM stack today consists of a dataset, a runner (Promptfoo, Inspect, or in-house), a scoring layer (LLM-as-judge plus deterministic asserts), and a dashboard where yesterday's run sits next to the new one.
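The loop described above (fixed test set, metrics, comparison against the previous run, release gate) might be sketched as follows. This is a minimal illustration, not the API of Promptfoo or Inspect: the names `Case`, `run_eval`, `gate`, and `stub_model`, the substring-based scoring, the refusal heuristic, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # substring the answer must contain (a deterministic assert)


def run_eval(model: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Run every case through the model and compute aggregate metrics."""
    passed = 0
    refused = 0
    for case in cases:
        answer = model(case.prompt).lower()
        if case.expected.lower() in answer:
            passed += 1
        # Hypothetical refusal heuristic; a real harness would use a better detector.
        if answer.startswith("i can't"):
            refused += 1
    n = len(cases)
    return {"accuracy": passed / n, "refusal_rate": refused / n}


def gate(
    metrics: dict[str, float],
    baseline: dict[str, float],
    min_accuracy: float = 0.9,  # assumed absolute floor
    max_drop: float = 0.02,     # assumed tolerated regression vs. the last run
) -> list[str]:
    """Return the list of failed checks; an empty list means the release may proceed."""
    failures = []
    if metrics["accuracy"] < min_accuracy:
        failures.append("accuracy below absolute threshold")
    if baseline["accuracy"] - metrics["accuracy"] > max_drop:
        failures.append("accuracy regressed against baseline")
    return failures


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs without a provider."""
    answers = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I can't help with that.")


CASES = [Case("What is 2+2?", "4"), Case("Capital of France?", "Paris")]

if __name__ == "__main__":
    current = run_eval(stub_model, CASES)
    previous = {"accuracy": 1.0, "refusal_rate": 0.0}  # yesterday's stored run
    problems = gate(current, previous)
    # In CI, a non-zero exit code is what actually blocks the release.
    raise SystemExit(1 if problems else 0)
```

In a real setup the baseline would come from the stored time series rather than a literal, and the scoring layer would combine deterministic asserts like the one here with an LLM-as-judge.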

RELATED TERMS

Context Engineering
The successor to prompt engineering: deliberately curating what enters the model's context window - system prompt, retrieved docs, tools, memory. Goal is max accuracy on the fewest tokens. A model only knows what you put in front of it.
AI Gateway
A proxy layer between your app and LLM providers (OpenAI, Anthropic): routing, retries, caching, rate-limits, key management, cost tracking and failover. One place to see your whole AI bill - and no lock-in to a single vendor.
Model Routing
Send each request to the cheapest model that can handle it: a small model for easy queries, a frontier model for hard ones - often decided by a classifier. Cuts inference cost dramatically, frequently 5-10× on real traffic.
Graph RAG
A RAG variant that retrieves over a knowledge graph (entities + relationships) instead of flat text chunks. Lets the model answer multi-hop questions ("how is X connected to Y?") that pure vector search misses.
Agent Memory
How an AI agent persists state across turns and sessions: short-term (the context window), long-term (a vector store / DB of facts), and episodic. The difference between an agent that forgets and one that learns your business.
Synthetic Data
Model-generated training and eval data for when real data is scarce, sensitive (GDPR), or imbalanced. Useful, but you must check quality and diversity - otherwise you bake the model's own blind spots into your system.

MENTIONED IN THE BLOG