Shipping AI agents that actually work in production
AI · LLM · RAG · Production

From demo to live system: the retrieval, eval, guardrails and cost control we run on every AI project we ship.

By Dezső Mező · Founder, DField Solutions · 18 Apr 2026

Most 'AI agent' projects we see start as a promising ChatGPT demo, and three months later nobody knows why it hallucinates, why it's expensive, or why it falls apart in front of real users. The problem isn't the LLM. The problem is the missing systems thinking.

Here's how we deliver AI agents that behave like production systems: every release passes an eval suite, every token has a cost SLA, and we see, in real time, when behavior drifts from the baseline.

1. Retrieval: if you only do one thing, do this

Most hallucinations aren't solved by a bigger model; they're solved by retrieval. If the relevant context is already in the prompt, the model has nothing left to invent. Hybrid retrieval (BM25 + vector search + a reranker) plus careful chunking covers roughly 80% of the customer-facing errors we see.

  • Chunk size 300–800 tokens, with 15–20% overlap.
  • A reranker (bge-reranker, Cohere Rerank 3) is a dramatic quality jump.
  • Always return citations. No hits → refuse to answer.
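The fusion step of hybrid retrieval can be sketched in a few lines. This is a minimal illustration using Reciprocal Rank Fusion (RRF) to merge the BM25 and vector rankings before reranking; the types, function names, and score floor are illustrative, not our actual implementation:

```typescript
type Ranked = { id: string; rank: number }; // rank = 1-based position in one ranking

// RRF: score(d) = sum over rankings of 1 / (k + rank(d)); k = 60 is the common default.
function reciprocalRankFusion(
  rankings: Ranked[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    for (const { id, rank } of ranking) {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    }
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Nothing above the floor → refuse instead of letting the model improvise.
function selectContext(
  fused: { id: string; score: number }[],
  minScore: number,
): string[] | null {
  const hits = fused.filter((d) => d.score >= minScore);
  return hits.length > 0 ? hits.map((d) => d.id) : null;
}
```

The top candidates from `selectContext` would then go to the reranker; the `null` branch is what backs the "no hits → refuse" rule above.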

2. Evals: 'looks fine' is no longer fine

We build a golden set of 50–200 questions from the customer's data and run it in CI before every release, combining LLM-as-judge scoring with factual regression tests. If the quality trend breaks, we don't deploy.

// Eval CI step
import { runEvals } from "@dfield/eval";

const result = await runEvals({
  suite: "support-copilot",
  model: process.env.MODEL_VERSION,
  thresholds: { accuracy: 0.88, factual: 0.95, latencyP95Ms: 1800 },
});

if (!result.passed) {
  throw new Error(`Eval failed: ${result.failures.join(", ")}`);
}

3. Guardrails: PII, prompt injection, output schema

Input side: a PII scrubber and a prompt-injection detector (keyword rules + an LLM classifier). Output side: JSON schema validation and topic filters. This isn't cosmetic; it's what protects the brand.

Guardrails are cheap insurance: they add little latency, and they catch the vast majority of unsafe or off-brand output.

4. Cost control: LLM router + cache

Not every question needs a GPT-4o answer. Route by intent: simple FAQs go to a small model plus a cache; complex reasoning goes to a big model. A 3–5x cost reduction is realistic.
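The routing idea fits in a short sketch. The heuristic below is deliberately crude (real intent routing is usually a small classifier), and all names and model labels are placeholders:

```typescript
const cache = new Map<string, string>();

// Cheap heuristic stand-in for an intent classifier:
// short, FAQ-style questions go to the small model.
function pickModel(question: string): "small" | "large" {
  const faqLike =
    /^(what|when|where|how much|do you)\b/i.test(question) &&
    question.length < 120;
  return faqLike ? "small" : "large";
}

// `call` abstracts the actual LLM invocation so the routing
// and caching logic stays testable on its own.
function answer(
  question: string,
  call: (model: string, q: string) => string,
): string {
  const key = question.trim().toLowerCase();
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: zero tokens spent
  const reply = call(pickModel(question), question);
  cache.set(key, reply);
  return reply;
}
```

The cache is where most of the 3–5x shows up on FAQ-heavy workloads: repeated questions cost nothing after the first answer.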

5. Observability: every question measured

OpenTelemetry plus our own dashboard: tokens in/out, latency at P50/P95/P99, quality metrics (accuracy, refusal rate), and cost per user. When a metric breaks its threshold, we know instantly and a pager fires.
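The alerting side reduces to percentile math over a sample window. A minimal sketch, with illustrative function names, using the nearest-rank percentile method; the P95 budget mirrors the `latencyP95Ms` threshold in the eval gate:

```typescript
// Nearest-rank percentile over a window of latency samples.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1,
  );
  return sorted[Math.max(0, idx)];
}

// True → the window breached its budget and a page should fire.
function latencyAlert(samplesMs: number[], p95BudgetMs: number): boolean {
  return percentile(samplesMs, 95) > p95BudgetMs;
}
```

In practice the windowing, export, and paging all live in the OpenTelemetry pipeline and alerting stack; the point is only that the breach condition is a one-liner once percentiles are tracked.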

Closing

An AI system isn't fundamentally different from any other backend service: it needs the same engineering discipline. If you want to start this way, email us and we can show a running prototype on your data within a week.

Dezső Mező · Founder, DField Solutions

I've shipped production products from fintech to creator-tooling — for startups and enterprises, from Budapest to San Francisco.
