Prompt caching
Related service: AI solutions
DEFINITION
Major model providers (Anthropic, OpenAI, Google) let you mark the rarely-changing front of the prompt (system prompt, document context, tool definitions) as cacheable on their side. A follow-up call within roughly 5 minutes that reuses the same prefix can cut input-token cost by up to 90 percent and roughly halve time-to-first-token. It pays off when the prefix is at least a few thousand tokens and many calls share it, for example a support bot, a RAG pipeline, or a code-review agent. It does not pay off when the prompt is unique per call (user-level personalisation injected into the middle of the prefix) or when context is only a few hundred tokens. Architect the prompt so the stable bulk is at the front and the volatile user turn at the back.
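The "stable bulk at the front, volatile turn at the back" rule can be sketched as a request payload. This is a minimal, hedged sketch in the style of Anthropic's Messages API, where cacheable blocks are marked with `cache_control: {"type": "ephemeral"}`; the model id, prompt text, and tool definition here are placeholders, and no network call is made — we only build the body.

```python
# Sketch: an Anthropic-style request body with the stable prefix marked
# cacheable. Model id, prompt text, and tool schema are illustrative.

SYSTEM_PROMPT = "You are a support bot for ACME. <several thousand tokens of policy text>"
TOOL_DEFS = [
    {
        "name": "lookup_order",            # hypothetical tool
        "description": "Look up an order by id",
        "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    }
]

def build_request(user_turn: str) -> dict:
    return {
        "model": "claude-sonnet-4",        # placeholder model id
        "max_tokens": 1024,
        # Stable, large prefix at the front — marked cacheable so follow-up
        # calls within the cache TTL reuse it at the discounted rate.
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": TOOL_DEFS,
        # Volatile user turn at the back — never part of the cached prefix.
        "messages": [{"role": "user", "content": user_turn}],
    }
```

Any per-user personalisation injected above the `messages` array would change the prefix and defeat the cache, which is why it belongs in the user turn.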
- RAG (Retrieval-Augmented Generation) → An AI architecture where the model retrieves relevant documents from your own data before answering and reasons only over that context. Eliminates roughly 80% of hallucinations.
- LLM (Large Language Model) → A neural model with billions of parameters (GPT-4, Claude, Mistral) that generates text. In production we never use one bare · always wrapped in retrieval and guardrails.
- Embedding → A vector representation of text (e.g. 1536 floats). If two embeddings are close, the meanings are close. In RAG we use this to pick relevant chunks.
- Vector database → A database specialised for fast approximate-nearest-neighbour search over embedding vectors (pgvector, Qdrant, Weaviate). The engineering base of RAG retrieval.
- Eval (LLM evaluation) → An automated test suite that runs ~50–200 'golden' questions against the model before every release and checks that quality metrics (accuracy, factuality, latency) clear the threshold.
- Guardrail → An input- or output-side layer that filters the model's prompt or response (PII scrubbers, prompt-injection detectors, JSON-schema validation, topic blocks). Not before/after the model · around it.
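The embedding and vector-database entries above boil down to one operation: find the stored chunk whose vector is closest to the query vector. A minimal sketch, using toy 3-float vectors in place of real 1536-float embeddings and a plain dict in place of a vector database; the chunk names are invented for illustration:

```python
# Sketch: the nearest-neighbour step at the heart of RAG retrieval.
# Toy 3-float vectors stand in for real embedding vectors.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In a real pipeline these vectors come from an embedding model and live
# in pgvector/Qdrant/Weaviate; here a dict plays that role.
chunks = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}

query = [0.8, 0.2, 0.1]  # embedding of e.g. "can I get my money back?"
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
```

A production vector database replaces the `max(...)` scan with approximate-nearest-neighbour indexes so the lookup stays fast over millions of chunks.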
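An output-side guardrail of the JSON-schema-validation kind mentioned above can be sketched in a few lines: parse the model's reply and reject it before it reaches the caller if it is not JSON or is missing required fields. The field names and function name here are illustrative, not a real library API:

```python
# Sketch: an output guardrail that rejects malformed model replies.
import json

def guard_output(raw: str, required: set[str]) -> dict:
    """Return the parsed reply, or raise ValueError if it fails the check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}")
    missing = required - data.keys()
    if missing:
        raise ValueError(f"reply missing fields: {sorted(missing)}")
    return data
```

On failure, a typical pipeline retries the model call or falls back to a canned response rather than surfacing the bad output.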
- 30 Sept 2026 · Field Q3 2026 roundup · what shifted, what we shipped, what is broken
- 01 Jul 2026 · Field Q2 2026 roundup · what shifted, what we shipped, what is broken
- 26 Apr 2026 · RAG's three failure modes · and the diagnostic table we use on every audit
- 26 Apr 2026 · We built our own LLM eval harness in 200 lines of TS · here is the file
- 26 Apr 2026 · Why your AI agent leaks money · 6 prompt-cache wins worth doing this week
- 26 Apr 2026 · OWASP LLM Top 10 v2 · what changed and what to ship
- 26 Apr 2026 · Postgres BRIN vs. B-tree · when each wins
- 26 Apr 2026 · Server vs. Client Components in 2026 · the rule we apply