DField Solutions · Engineering studio · Budapest
AI · 9 min read

Why your AI agent leaks money · 6 prompt-cache wins worth doing this week

Six prompt-cache patterns. Real before/after numbers. Most agents leave 60-80% on the table. Fix it this week.

Dezso Mezo
Founder, DField Solutions

Three months. Six different agent codebases opened with the same brief: 'cut the LLM bill'. Six times, the same problem: the system prompt and the tool definitions get re-billed on every call. Caching is either off entirely or wired only to the system prompt.

These six patterns stack. Each one gives you 10-30%. Stacked, they land at a 60-80% cost cut on a typical agent (around 18k prompt tokens, 4k calls/day, Sonnet 4.5). One real shipped agent: monthly bill down from 980 EUR to 220 EUR. Two engineering days.

1. System prompt cache breakpoint

Cheapest win. In the Anthropic SDK, mark the last block of the system prompt with `cache_control: { type: 'ephemeral' }`. The cached prefix has a 5-minute TTL, and every subsequent call within that window reuses it. Typical save: 30-50% of total token cost when the system prompt is over 2k tokens.
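To see why the math works, here is a back-of-envelope cost model. It assumes Anthropic's published multipliers for the 5-minute ephemeral cache (a cache write bills ~1.25x the base input price, a cache read ~0.1x); `cachedCostRatio` is an illustrative helper, not from this post.

```typescript
// Rough cost model for prompt caching. Assumes Anthropic's published
// multipliers for the 5-minute ephemeral cache: a cache write bills
// ~1.25x the base input price, a cache read ~0.1x.
function cachedCostRatio(
  cachedTokens: number,  // tokens covered by the cache breakpoint
  dynamicTokens: number, // tokens after the breakpoint, billed at full price
  hitRate: number        // fraction of calls that hit a warm cache
): number {
  const uncached = cachedTokens + dynamicTokens;
  const cached =
    hitRate * (0.1 * cachedTokens + dynamicTokens) +
    (1 - hitRate) * (1.25 * cachedTokens + dynamicTokens);
  return cached / uncached; // 1.0 = no saving
}

// A 16k-token stable prefix, 2k dynamic tokens, 95% hit rate:
const ratio = cachedCostRatio(16_000, 2_000, 0.95); // ~0.25, i.e. ~75% off
```

Plug in your own prefix size and hit rate before trusting any percentage in this post.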

2. Tool schemas

A 12-tool agent typically carries 4-6k tokens of tool definitions, static across calls. Drop a cache breakpoint on the last tool in the block. 10-15% off the bill.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

// tools, STATIC_INSTRUCTIONS, COMPANY_KB and messages are defined elsewhere.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  // Breakpoint on the last tool caches the entire tools block.
  tools: tools.map((t, i, arr) => i === arr.length - 1
    ? { ...t, cache_control: { type: "ephemeral" } }
    : t),
  system: [
    { type: "text", text: STATIC_INSTRUCTIONS },
    // Breakpoint here caches everything up to and including this block.
    { type: "text", text: COMPANY_KB, cache_control: { type: "ephemeral" } },
  ],
  messages,
});

3. Files API · the 2026 entry

Anthropic's Files API lets you upload large static documents (a wiki, a product catalogue, a 60-page manual), get back a file_id, and reference the file by ID on every call. The platform caches the prefix automatically. On a 60-page PDF we measured daily context cost dropping from ~7 EUR to ~0.40 EUR.
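A sketch of the flow, with the caveat that the Files API is still in beta and the SDK surface may shift · the `files.upload` call and the `document`/`file` source block below are assumptions, and `fileMessage` is a hypothetical helper:

```typescript
// Hypothetical helper: build a user message that references an uploaded
// file by ID instead of inlining the document text on every call.
function fileMessage(fileId: string, question: string) {
  return {
    role: "user" as const,
    content: [
      { type: "document" as const, source: { type: "file" as const, file_id: fileId } },
      { type: "text" as const, text: question },
    ],
  };
}

// Upload once (beta endpoint, shape assumed), then reuse the file_id:
// const uploaded = await anthropic.beta.files.upload({ file: fs.createReadStream("manual.pdf") });
// const msg = fileMessage(uploaded.id, "What is the warranty period?");
```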

4. Conversation prefix cache for long-running agents

Long chats (developer copilots, multi-turn support agents) re-bill the entire history on each turn. If your conversation has a stable prefix (system + few-shot + retrieved context), put the cache breakpoint at the end of the stable block, not on the last user turn. 20-40% save on chat agents.
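One way to wire this · a small helper (illustrative, not from this post) that tags the last block of the stable prefix before appending the live turns:

```typescript
type Block = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };

// Tag the last block of the stable prefix (system + few-shot + retrieved
// context) as cacheable; live conversation turns stay after the breakpoint.
function withPrefixCache(stable: Block[], liveTurns: Block[]): Block[] {
  if (stable.length === 0) return liveTurns;
  const last = stable[stable.length - 1];
  return [
    ...stable.slice(0, -1),
    { ...last, cache_control: { type: "ephemeral" } },
    ...liveTurns,
  ];
}
```

The point of the helper: the breakpoint moves with the stable block, never with the latest user turn.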

5. Few-shot examples in their own cache layer

Classification or extraction agents carry 3-5k tokens of few-shot examples. Static. Put them in their own block and mark it cacheable. Another 10-20%. Watch the hash · ordering and whitespace count, so even one swapped example invalidates the cache.
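Deterministic serialization keeps that block byte-identical across deploys. A sketch · the sort key and separator here are illustrative choices:

```typescript
type Example = { input: string; output: string };

// Serialize few-shot examples into one stable, cacheable block: fixed
// ordering, fixed separator, no trailing whitespace, so the cached
// prefix never changes between deploys.
function fewShotBlock(examples: Example[]): string {
  return [...examples]
    .sort((a, b) => a.input.localeCompare(b.input)) // deterministic order
    .map((e) => `Input: ${e.input}\nOutput: ${e.output}`)
    .join("\n\n");
}
```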

6. Pre-warm cron for low-traffic agents

Cache TTL is ~5 min on Anthropic, ~10 min on OpenAI. If you serve 200 calls/day spread across 24 hours, your hit ratio dies. A 1-minute cron that sends a no-op user message against the stable prefix keeps the cache warm. Cost: under 0.10 EUR/day. Save: 30-50% of normal call token costs.
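Whether the cron pays off is simple arithmetic. A sketch with illustrative multipliers (1.25x write, 0.1x read) and the assumption that a ping inside the TTL bills as a cache read and refreshes the TTL · `prewarmPaysOff` is not from this post:

```typescript
// Back-of-envelope check: is a keep-warm cron cheaper than letting the
// cache go cold between calls? Assumes a ping inside the TTL bills as a
// cache read (0.1x) and refreshes the TTL; a cold call re-writes (1.25x).
function prewarmPaysOff(opts: {
  callsPerDay: number;
  cachedTokens: number;     // stable prefix size
  basePricePerMTok: number; // EUR per million input tokens
  cronPingsPerDay: number;  // e.g. 288 = one ping per 5 minutes
}): boolean {
  const perTok = opts.basePricePerMTok / 1_000_000;
  // Worst case without pre-warm: every call finds a cold cache.
  const cold = opts.callsPerDay * opts.cachedTokens * perTok * 1.25;
  // With pre-warm: real calls and cron pings all read at 0.1x.
  const warm =
    (opts.callsPerDay + opts.cronPingsPerDay) * opts.cachedTokens * perTok * 0.1;
  return warm < cold;
}
```

At a handful of calls per day even the cron loses; run your own numbers before scheduling it.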

Real before/after · one month

Production support agent for a Hungarian client. 3800 calls/day, ~14k token average prompt, Sonnet 4.5.

  • Before (no cache): 1 120 EUR/mo, p95 4.8s.
  • + pattern 1 (system): 720 EUR/mo, p95 3.9s.
  • + pattern 2 (tools): 540 EUR/mo, p95 3.2s.
  • + pattern 5 (few-shot): 380 EUR/mo, p95 2.7s.
  • All six patterns: 220 EUR/mo, p95 2.1s. 80% saving.

Two gotchas

  • Anthropic charges ~25% extra on first cache write (storage). At low traffic this hurts. Above 50 calls/hour it always pays back.
  • Cache breakpoint order matters. If anything in the prefix changes by even one character, every subsequent block re-bills. Put stable content at the start, volatile at the end.
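The first gotcha reduces to a break-even hit rate. Assuming the ~1.25x write and ~0.1x read multipliers, caching the prefix wins as soon as `hitRate * 0.1 + (1 - hitRate) * 1.25 < 1`:

```typescript
// Break-even cache hit rate, assuming a 1.25x write premium and a 0.1x
// read price on the cached portion of the prompt.
const breakEvenHitRate = (write = 1.25, read = 0.1): number =>
  (write - 1) / (write - read);

breakEvenHitRate(); // ~0.217: above a ~22% hit rate the write premium pays back
```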

Measure across 1000 production calls before and after. If you are not below 50% of the original bill, you skipped a pattern. The 60-80% number is real, not marketing.

By Dezso Mezo

Founder, DField Solutions

I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.

Let's talk

Would rather build together?

Let's talk about your project. 30 minutes, no strings.