DField Solutions · Engineering studio · Budapest
AI · 9 min read

Why your AI agent leaks money · 6 prompt-cache wins worth doing this week

Six prompt-cache patterns. Real before/after numbers. Most agents leave 60-80% on the table. Fix it this week.

Dezso Mezo
Founder, DField Solutions

Three months. Six different agent codebases opened with the same brief: 'cut the LLM bill'. Six times, the same problem: the system prompt and the tool definitions get re-billed on every call. Caching is either off entirely or wired only to the system prompt.

These six patterns stack. Each one gives you 10-30%. Stacked, they land at a 60-80% cost cut on a typical agent (around 18k prompt tokens, 4k calls/day, Sonnet 4.5). One real shipped agent: monthly bill down from 980 EUR to 220 EUR. Two engineering days.

1. System prompt cache breakpoint

Cheapest win. In the Anthropic SDK, mark the last block of the system prompt with `cache_control: { type: 'ephemeral' }`. The cached prefix has a 5-minute TTL, and every subsequent call within that window reuses it. Typical save: 30-50% of total token cost when the system prompt is over 2k tokens.
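To see why the math works, here is a back-of-envelope cost model. It assumes Anthropic's published multipliers for the 5-minute ephemeral cache (a cache write bills ~1.25x the base input price, a cache read ~0.1x); `cachedCostRatio` is an illustrative helper, not from this post.

```typescript
// Rough cost model for prompt caching. Assumes Anthropic's published
// multipliers for the 5-minute ephemeral cache: a cache write bills
// ~1.25x the base input price, a cache read ~0.1x.
function cachedCostRatio(
  cachedTokens: number,  // tokens covered by the cache breakpoint
  dynamicTokens: number, // tokens after the breakpoint, billed at full price
  hitRate: number        // fraction of calls that hit a warm cache
): number {
  const uncached = cachedTokens + dynamicTokens;
  const cached =
    hitRate * (0.1 * cachedTokens + dynamicTokens) +
    (1 - hitRate) * (1.25 * cachedTokens + dynamicTokens);
  return cached / uncached; // 1.0 = no saving
}

// A 16k-token stable prefix, 2k dynamic tokens, 95% hit rate:
const ratio = cachedCostRatio(16_000, 2_000, 0.95); // ~0.25, i.e. ~75% off
```

Plug in your own prefix size and hit rate before trusting any percentage in this post.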

2. Tool schemas

A 12-tool agent typically carries 4-6k tokens of tool definitions, static across calls. Drop a cache breakpoint on the last tool in the block. 10-15% off the bill.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

// tools, STATIC_INSTRUCTIONS, COMPANY_KB and messages are defined elsewhere.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  // Breakpoint on the last tool caches the entire tools block.
  tools: tools.map((t, i, arr) => i === arr.length - 1
    ? { ...t, cache_control: { type: "ephemeral" } }
    : t),
  system: [
    { type: "text", text: STATIC_INSTRUCTIONS },
    // Breakpoint here caches everything up to and including this block.
    { type: "text", text: COMPANY_KB, cache_control: { type: "ephemeral" } },
  ],
  messages,
});

3. Files API · the 2026 entry

Anthropic's Files API lets you upload large static documents (a wiki, a product catalogue, a 60-page manual), get back a file_id, and reference the file by ID on every call. The platform caches the prefix automatically. On a 60-page PDF we measured daily context cost dropping from ~7 EUR to ~0.40 EUR.
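A sketch of the flow, with the caveat that the Files API is still in beta and the SDK surface may shift · the `files.upload` call and the `document`/`file` source block below are assumptions, and `fileMessage` is a hypothetical helper:

```typescript
// Hypothetical helper: build a user message that references an uploaded
// file by ID instead of inlining the document text on every call.
function fileMessage(fileId: string, question: string) {
  return {
    role: "user" as const,
    content: [
      { type: "document" as const, source: { type: "file" as const, file_id: fileId } },
      { type: "text" as const, text: question },
    ],
  };
}

// Upload once (beta endpoint, shape assumed), then reuse the file_id:
// const uploaded = await anthropic.beta.files.upload({ file: fs.createReadStream("manual.pdf") });
// const msg = fileMessage(uploaded.id, "What is the warranty period?");
```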

4. Conversation prefix cache for long-running agents

Long chats (developer copilots, multi-turn support agents) re-bill the entire history on each turn. If your conversation has a stable prefix (system + few-shot + retrieved context), put the cache breakpoint at the end of the stable block, not on the last user turn. 20-40% save on chat agents.
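One way to wire this · a small helper (illustrative, not from this post) that tags the last block of the stable prefix before appending the live turns:

```typescript
type Block = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };

// Tag the last block of the stable prefix (system + few-shot + retrieved
// context) as cacheable; live conversation turns stay after the breakpoint.
function withPrefixCache(stable: Block[], liveTurns: Block[]): Block[] {
  if (stable.length === 0) return liveTurns;
  const last = stable[stable.length - 1];
  return [
    ...stable.slice(0, -1),
    { ...last, cache_control: { type: "ephemeral" } },
    ...liveTurns,
  ];
}
```

The point of the helper: the breakpoint moves with the stable block, never with the latest user turn.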

5. Few-shot examples in their own cache layer

Classification or extraction agents carry 3-5k tokens of few-shot examples. Static. Put them in their own block and mark it cacheable. Another 10-20%. Watch the hash · ordering and whitespace count, so even one swapped example invalidates the cache.
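Deterministic serialization keeps that block byte-identical across deploys. A sketch · the sort key and separator here are illustrative choices:

```typescript
type Example = { input: string; output: string };

// Serialize few-shot examples into one stable, cacheable block: fixed
// ordering, fixed separator, no trailing whitespace, so the cached
// prefix never changes between deploys.
function fewShotBlock(examples: Example[]): string {
  return [...examples]
    .sort((a, b) => a.input.localeCompare(b.input)) // deterministic order
    .map((e) => `Input: ${e.input}\nOutput: ${e.output}`)
    .join("\n\n");
}
```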

6. Pre-warm cron for low-traffic agents

Cache TTL is ~5 min on Anthropic, ~10 min on OpenAI. If you serve 200 calls/day spread across 24 hours, your hit ratio dies. A 1-minute cron that sends a no-op user message against the stable prefix keeps the cache warm. Cost: under 0.10 EUR/day. Save: 30-50% of normal call token costs.
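Whether the cron pays off is simple arithmetic. A sketch with illustrative multipliers (1.25x write, 0.1x read) and the assumption that a ping inside the TTL bills as a cache read and refreshes the TTL · `prewarmPaysOff` is not from this post:

```typescript
// Back-of-envelope check: is a keep-warm cron cheaper than letting the
// cache go cold between calls? Assumes a ping inside the TTL bills as a
// cache read (0.1x) and refreshes the TTL; a cold call re-writes (1.25x).
function prewarmPaysOff(opts: {
  callsPerDay: number;
  cachedTokens: number;     // stable prefix size
  basePricePerMTok: number; // EUR per million input tokens
  cronPingsPerDay: number;  // e.g. 288 = one ping per 5 minutes
}): boolean {
  const perTok = opts.basePricePerMTok / 1_000_000;
  // Worst case without pre-warm: every call finds a cold cache.
  const cold = opts.callsPerDay * opts.cachedTokens * perTok * 1.25;
  // With pre-warm: real calls and cron pings all read at 0.1x.
  const warm =
    (opts.callsPerDay + opts.cronPingsPerDay) * opts.cachedTokens * perTok * 0.1;
  return warm < cold;
}
```

At a handful of calls per day even the cron loses; run your own numbers before scheduling it.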

Real before/after · one month

Production support agent for a Hungarian client. 3800 calls/day, ~14k token average prompt, Sonnet 4.5.

  • Before (no cache): 1 120 EUR/mo, p95 4.8s.
  • + pattern 1 (system): 720 EUR/mo, p95 3.9s.
  • + pattern 2 (tools): 540 EUR/mo, p95 3.2s.
  • + pattern 5 (few-shot): 380 EUR/mo, p95 2.7s.
  • All six patterns: 220 EUR/mo, p95 2.1s. 80% saving.

Two gotchas

  • Anthropic charges ~25% extra on first cache write (storage). At low traffic this hurts. Above 50 calls/hour it always pays back.
  • Cache breakpoint order matters. If anything in the prefix changes by even one character, every subsequent block re-bills. Put stable content at the start, volatile at the end.
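The first gotcha reduces to a break-even hit rate. Assuming the ~1.25x write and ~0.1x read multipliers, caching the prefix wins as soon as `hitRate * 0.1 + (1 - hitRate) * 1.25 < 1`:

```typescript
// Break-even cache hit rate, assuming a 1.25x write premium and a 0.1x
// read price on the cached portion of the prompt.
const breakEvenHitRate = (write = 1.25, read = 0.1): number =>
  (write - 1) / (write - read);

breakEvenHitRate(); // ~0.217: above a ~22% hit rate the write premium pays back
```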

Measure across 1000 production calls before and after. If you are not below 50% of the original bill, you skipped a pattern. The 60-80% number is real, not marketing.

By Dezso Mezo

Founder, DField Solutions

I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.

Let's talk

Would rather build together?

Let's talk about your project. 30 minutes, no strings.