Prompt Caching / KV Cache
Reduce inference costs by up to 90% and time-to-first-token by up to 80% by reusing computed attention states across requests with identical prefixes.
Every time a transformer model processes a request, it computes Key and Value matrices for every token in the context. This prefill work dominates time-to-first-token, and it is repeated in full for every new request, even when 95% of the tokens are identical to a previous one. KV caching solves this by storing those computed matrices and reusing them whenever the same prefix appears again. The savings are dramatic: up to 90% cost reduction on cached tokens, and 70–85% lower time-to-first-token for large prefixes.
How KV Caching Works
The cache is prefix-based. When a request arrives, the provider compares the beginning of the context against stored KV states. If the tokens match exactly from position 0 through some point, those states are reused and only the remaining tokens require fresh computation. A single character difference invalidates the cache from that point onward: everything after the first mismatch must be recomputed.
WITHOUT CACHING                              WITH CACHING
───────────────                              ────────────

Request 1:                                   Request 1:
┌───────────────────────┐                    ┌───────────────────────┐
│ System: "You are..."  │ ◄─ Compute         │ System: "You are..."  │ ◄─ Compute
│ User: "Hello"         │    K,V             │ User: "Hello"         │    K,V
└───────────────────────┘                    └───────────────────────┘
                                                         │
                                                         ▼
                                             ┌───────────────────────┐
                                             │       KV CACHE        │
                                             │  Store K,V matrices   │
                                             └───────────────────────┘

Request 2:                                   Request 2:
┌───────────────────────┐                    ┌───────────────────────┐
│ System: "You are..."  │ ◄─ Compute         │ System: "You are..."  │ ◄─ Cache HIT!
│ User: "Hi there"      │    K,V AGAIN       │ User: "Hi there"      │    (reuse K,V)
└───────────────────────┘                    └───────────────────────┘
                                                         │
Cost: Full price                                         ▼
                                             Only compute K,V for "Hi there"
                                             Cost: 90% reduction on cached portion
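The matching rule is easy to picture in a few lines of Python. This is an illustrative simplification, not any provider's actual implementation: real caches compare tokenized prefixes (often at fixed block boundaries), not raw strings.

def cached_prefix_length(previous_tokens: list[str], new_tokens: list[str]) -> int:
    """Return how many leading tokens the new request shares with a cached one.

    Only tokens after this point need fresh K,V computation.
    """
    shared = 0
    for old, new in zip(previous_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared

# Same system prefix, different user turn: the prefix is reused, the tail is recomputed.
request_1 = ["<system>", "You", " are", " a", " helpful", " assistant", "<user>", "Hello"]
request_2 = ["<system>", "You", " are", " a", " helpful", " assistant", "<user>", "Hi", " there"]
print(cached_prefix_length(request_1, request_2))  # 7 -> only "Hi" and " there" are computed fresh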
Provider Implementations
Major providers implement caching differently. Anthropic uses explicit cache_control markers that you add to system message content blocks, giving you precise control over what gets cached and when. OpenAI caches automatically based on prefix matching — no code changes required, though you get less visibility and control. Google’s context caching API takes a more explicit approach similar to Anthropic’s. Self-hosted deployments using vLLM or TGI get the latency benefits of prefix caching without the monetary discount.
| Provider | Mechanism | Discount | TTL |
|---|---|---|---|
| Anthropic | Explicit cache_control markers | 90% off cached reads | 5 minutes (ephemeral) |
| OpenAI | Automatic prefix matching | 50% off cached tokens | 5–10 minutes |
| Google | Context caching API | Variable by model | Configurable |
| Self-hosted | vLLM/TGI prefix caching | N/A (latency benefit) | Memory-dependent |
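For comparison with the explicit Anthropic markers shown later, OpenAI's automatic caching needs no request changes; you can only verify it through the usage block. A minimal sketch assuming the official openai Python SDK, where the cached token count is reported under usage.prompt_tokens_details on models that support prompt caching:

from openai import OpenAI

client = OpenAI()

LARGE_SYSTEM_PROMPT = "You are an expert assistant. [... many thousands of tokens ...]"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LARGE_SYSTEM_PROMPT},
        {"role": "user", "content": "Hello"},
    ],
)

# No cache_control markers: sufficiently long, repeated prefixes are cached automatically.
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")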
Impact: Cost and Latency
The financial case for caching is clearest when you have a large, stable system prompt used across many requests. Consider a 50K-token system prompt serving 1,000 requests:
COST IMPACT (Example: 50K token system prompt)

Without Caching:
├── Request 1: 50,000 tokens × $0.003/1K = $0.15
├── Request 2: 50,000 tokens × $0.003/1K = $0.15
├── Request 3: 50,000 tokens × $0.003/1K = $0.15
└── Total: $0.45

With Caching (90% discount on cached):
├── Request 1: 50,000 tokens × $0.003/1K  = $0.15  (cache created)
├── Request 2: 50,000 tokens × $0.0003/1K = $0.015 (cache hit)
├── Request 3: 50,000 tokens × $0.0003/1K = $0.015 (cache hit)
└── Total: $0.18 (60% savings)

At scale (1000 requests):
├── Without: $150.00
├── With: $15.15
└── Savings: $134.85 (90%)

─────────────────────────────────────────────────────────

LATENCY IMPACT

Without Caching:
Time to First Token (TTFT): ~2-3 seconds (compute all K,V)

With Caching:
Time to First Token (TTFT): ~0.3-0.5 seconds (only new tokens)

Improvement: 70-85% reduction in TTFT
| Use Case | Cached Prefix Size | Cost Reduction | TTFT Reduction |
|---|---|---|---|
| RAG with fixed docs | 20–50K tokens | 70–85% | 60–80% |
| Agent with many tools | 10–30K tokens | 50–75% | 50–70% |
| Multi-turn chat | 5–15K tokens | 40–60% | 40–60% |
| Code assistant | 30–100K tokens | 80–90% | 70–85% |
Caching provides the most benefit when you have a large, stable system prompt, you make many requests with the same prefix, and requests happen within the cache TTL (typically 5 minutes).
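The arithmetic above generalizes into a quick estimator you can run before committing to a prompt structure. This sketch uses the same assumptions as the example: the first request creates the cache at full price, later requests read it at a 90% discount, and output tokens are ignored.

def estimate_prefix_cost(
    prefix_tokens: int,
    num_requests: int,
    price_per_1k: float = 0.003,    # assumed input price, matching the example above
    cached_discount: float = 0.90,  # 90% off cached reads
) -> dict:
    """Compare the cost of the shared prefix with and without caching."""
    per_request = prefix_tokens / 1000 * price_per_1k
    without_caching = num_requests * per_request
    with_caching = per_request + (num_requests - 1) * per_request * (1 - cached_discount)
    return {
        "without_caching": round(without_caching, 2),
        "with_caching": round(with_caching, 2),
        "savings_pct": round((1 - with_caching / without_caching) * 100, 1),
    }

print(estimate_prefix_cost(prefix_tokens=50_000, num_requests=1000))
# Roughly $150 without caching vs. $15 with caching: ~90% savings on the prefix.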
Basic Implementation
The following shows how to implement prompt caching with Anthropic’s explicit markers. Note that this topic is one of the exceptions to the standard LangChain/provider-agnostic approach — caching requires provider-specific APIs:
from anthropic import Anthropic
client = Anthropic()
SYSTEM_PROMPT = """You are an expert assistant for our
e-commerce platform. Here is our complete product catalog:
[... imagine 50,000 tokens of product data ...]
Use this catalog to answer customer questions accurately.
Always cite specific product IDs when making recommendations.
"""
def query_with_caching(user_question: str) -> str:
    """Query with prompt caching enabled."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_question}
        ]
    )

    # Check cache performance
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    # First call:  cache_creation_input_tokens = ~50,000
    #              cache_read_input_tokens = 0
    # Subsequent:  cache_creation_input_tokens = 0
    #              cache_read_input_tokens = ~50,000

    return response.content[0].text
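Calling the function twice inside the cache TTL makes the pattern visible. The questions are placeholders; the token counts in the comments restate the expectations above.

answer_1 = query_with_caching("Do you sell waterproof hiking boots?")
# First call: cache_creation_input_tokens ≈ 50,000, cache_read_input_tokens = 0

answer_2 = query_with_caching("What is your return policy on footwear?")
# Second call (within ~5 minutes): cache_creation_input_tokens = 0,
# cache_read_input_tokens ≈ 50,000, so the catalog is billed at the cached rate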
Multi-Turn Conversation Caching
In multi-turn conversations, the system prompt and tool definitions remain stable across every turn while the conversation history grows. The key is to structure requests so the stable prefix is always at the beginning and the dynamic conversation history follows it. This way, each turn gets a cache hit on the expensive system content.
Turn 1:
┌────────────────────────────────────────────────────┐
│ [CACHE CREATED]                                     │
│ ┌──────────────────────────────────────────────┐   │
│ │ System prompt (50K tokens)                   │   │
│ │ Tool definitions (5K tokens)                 │   │
│ └──────────────────────────────────────────────┘   │
│ ┌──────────────────────────────────────────────┐   │
│ │ User: "Hello"                                │   │
│ └──────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────┘

Turn 2:
┌────────────────────────────────────────────────────┐
│ [CACHE HIT]                                         │
│ ┌──────────────────────────────────────────────┐   │
│ │ System prompt (50K tokens)    ✓ CACHED       │   │
│ │ Tool definitions (5K tokens)  ✓ CACHED       │   │
│ └──────────────────────────────────────────────┘   │
│ ┌──────────────────────────────────────────────┐   │
│ │ User: "Hello"                                │   │
│ │ Assistant: "Hi! How can I help?"             │   │
│ │ User: "What's the weather?"                  │   │
│ └──────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────┘

Cost breakdown:
- Turn 1: 55K tokens at full price (cache creation)
- Turn 2: 55K tokens at 90% discount + ~50 new tokens full price
- Turn 3+: Same pattern, savings compound
from anthropic import Anthropic
from dataclasses import dataclass, field
from typing import List, Dict
@dataclass
class CachedConversation:
    """Multi-turn conversation optimized for prompt caching."""

    client: Anthropic
    system_prompt: str
    tool_definitions: str
    model: str = "claude-sonnet-4-20250514"
    messages: List[Dict] = field(default_factory=list)
    total_cache_reads: int = 0
    total_cache_creates: int = 0

    def chat(self, user_message: str) -> str:
        """Send message and get response with caching."""
        self.messages.append({
            "role": "user",
            "content": user_message
        })

        response = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            system=[
                # Large system prompt - cached
                {
                    "type": "text",
                    "text": self.system_prompt,
                    "cache_control": {"type": "ephemeral"}
                },
                # Tool definitions - cached
                {
                    "type": "text",
                    "text": self.tool_definitions,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            messages=self.messages
        )

        # Track cache performance
        usage = response.usage
        self.total_cache_reads += usage.cache_read_input_tokens
        self.total_cache_creates += usage.cache_creation_input_tokens

        assistant_message = response.content[0].text
        self.messages.append({
            "role": "assistant",
            "content": assistant_message
        })
        return assistant_message

    def get_cache_stats(self) -> dict:
        """Get cumulative cache statistics."""
        return {
            "total_cache_reads": self.total_cache_reads,
            "total_cache_creates": self.total_cache_creates,
            "estimated_savings": self._calculate_savings()
        }

    def _calculate_savings(self) -> str:
        """Estimate cost savings from caching."""
        # Anthropic pricing: cached reads are 90% cheaper
        if self.total_cache_reads == 0:
            return "0%"
        full_cost = self.total_cache_reads + self.total_cache_creates
        actual_cost = (
            self.total_cache_creates +
            self.total_cache_reads * 0.1  # 90% discount
        )
        savings = (1 - actual_cost / full_cost) * 100
        return f"{savings:.1f}%"
Optimization Strategies
Maximizing cache hit rates requires careful control over prompt structure. Content should be ordered from most stable to least stable, because the cache prefix ends at the first differing byte. Static instructions go first. Reference data and knowledge bases come next. Tool definitions follow — but only when sorted deterministically, since random ordering breaks the cache. Session-specific context and conversation history go last, outside the cached region.
MOST STABLE (highest cache benefit)
│
▼
┌───────────────────────────────────────┐
│ 1. Static Instructions │ ← Never changes
│ "You are a helpful assistant..." │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 2. Reference Data / Knowledge Base │ ← Changes rarely
│ Documentation, product catalog... │ (daily/weekly)
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 3. Tool Definitions │ ← Changes occasionally
│ Sorted alphabetically for │ (with deployments)
│ deterministic ordering │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 4. Session Context │ ← Changes per session
│ User preferences, auth info... │ (no caching)
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 5. Conversation History │ ← Changes per turn
│ Previous messages... │ (no caching)
└───────────────────────────────────────┘
│
▼
LEAST STABLE (no cache benefit)

from dataclasses import dataclass
from typing import List, Optional
import hashlib
import json
@dataclass
class CacheOptimizedPromptBuilder:
    """Build prompts optimized for cache hit rates."""

    static_instructions: str
    reference_data: str
    tool_definitions: List[dict]

    def build_system_content(
        self,
        session_context: Optional[str] = None
    ) -> List[dict]:
        """Build system content with optimal cache structure."""
        content = []

        # Layer 1: Static instructions (highest cache stability)
        content.append({
            "type": "text",
            "text": self._normalize(self.static_instructions),
            "cache_control": {"type": "ephemeral"}
        })

        # Layer 2: Reference data (changes rarely)
        content.append({
            "type": "text",
            "text": self._normalize(self.reference_data),
            "cache_control": {"type": "ephemeral"}
        })

        # Layer 3: Tool definitions (normalized for consistency)
        # Sort tools alphabetically for deterministic ordering
        sorted_tools = sorted(
            self.tool_definitions,
            key=lambda t: t.get("name", "")
        )
        tools_text = json.dumps(sorted_tools, sort_keys=True)
        content.append({
            "type": "text",
            "text": f"Available tools:\n{tools_text}",
            "cache_control": {"type": "ephemeral"}
        })

        # Layer 4: Session context (if any, no cache control)
        if session_context:
            content.append({
                "type": "text",
                "text": session_context
                # No cache_control = not cached
            })

        return content

    def _normalize(self, text: str) -> str:
        """Normalize text to maximize cache hits."""
        lines = [line.rstrip() for line in text.split('\n')]
        normalized = '\n'.join(lines)
        return normalized

    def get_cache_key(self) -> str:
        """Get a hash representing the cacheable prefix."""
        cacheable = (
            self._normalize(self.static_instructions) +
            self._normalize(self.reference_data) +
            json.dumps(sorted(
                self.tool_definitions,
                key=lambda t: t.get("name", "")
            ), sort_keys=True)
        )
        return hashlib.sha256(cacheable.encode()).hexdigest()[:16]
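A sketch of wiring the builder into a request. The tool entries, session context, and user question are illustrative; the client call mirrors the earlier Anthropic examples:

from anthropic import Anthropic

builder = CacheOptimizedPromptBuilder(
    static_instructions="You are a support assistant for our e-commerce platform.",
    reference_data="[... product catalog, refreshed daily ...]",
    tool_definitions=[
        {"name": "lookup_order", "description": "Fetch an order by ID"},
        {"name": "check_inventory", "description": "Check stock for a product ID"},
    ],
)

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=builder.build_system_content(session_context="User tier: premium"),
    messages=[{"role": "user", "content": "Is order 1042 shipped yet?"}],
)

# Logging the cache key next to usage makes prefix drift between deployments easy to spot.
print(builder.get_cache_key(), response.usage.cache_read_input_tokens)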
| Pattern | Problem | Fix |
|---|---|---|
| Timestamps in prompt | Changes every second/minute | Move to metadata or round to day |
| Request IDs in system | Unique per request | Move to message, not system prompt |
| Random example order | Different each time | Sort deterministically |
| User name in system | Breaks cache per user | Move to conversation, after cache boundary |
| Inconsistent whitespace | Trailing spaces vary | Normalize with .strip() |
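As a concrete illustration of the first two rows, dynamic values can usually be rounded or moved past the cache boundary. A sketch (the request ID is hypothetical):

from datetime import date, datetime, timezone

# Cache-breaking: a fresh timestamp makes the system prompt unique on every request,
# so no two requests ever share a prefix.
system_bad = f"You are a support assistant. Current time: {datetime.now(timezone.utc).isoformat()}"

# Cache-friendly: round to the day so the prefix stays byte-stable all day.
system_good = f"You are a support assistant. Today's date: {date.today().isoformat()}"

# Per-request values (IDs, user names) belong in the user turn, after the cache boundary.
user_message = "[request_id: req_12345] Where is my order?"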
Monitoring and Evaluation
Without tracking cache metrics, you will not know whether caching is working. Always log cache_read_input_tokens on every response. If cache_creation_input_tokens keeps appearing on requests that should be hitting the cache, you have a cache-breaking pattern in your prompt structure.
| Metric | What it Measures | Target |
|---|---|---|
| Cache hit rate | % of requests using cached prefix | >90% for steady workloads |
| Cached token ratio | Cached tokens / total input tokens | Higher is better (varies by use case) |
| TTFT improvement | Latency reduction from caching | 50–80% for large prefixes |
| Cost savings | Actual vs theoretical cost | Track against baseline |
| Cache creation rate | New caches created / total requests | <10% indicates good stability |
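A sketch of deriving these metrics from per-request usage records. The dict-based log format is an assumption, not a provider API; the field names match Anthropic's usage object, where input_tokens counts only the uncached portion of the input.

from typing import Iterable

def cache_metrics(usage_log: Iterable[dict]) -> dict:
    """Aggregate cache hit rate, creation rate, and cached-token ratio from usage logs."""
    requests = hits = creations = 0
    cached_tokens = total_input_tokens = 0
    for usage in usage_log:
        reads = usage.get("cache_read_input_tokens", 0)
        creates = usage.get("cache_creation_input_tokens", 0)
        requests += 1
        hits += 1 if reads > 0 else 0
        creations += 1 if creates > 0 else 0
        cached_tokens += reads
        total_input_tokens += usage.get("input_tokens", 0) + reads + creates
    if requests == 0:
        return {}
    return {
        "cache_hit_rate": hits / requests,
        "cache_creation_rate": creations / requests,
        "cached_token_ratio": cached_tokens / total_input_tokens if total_input_tokens else 0.0,
    }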
A simple debugging technique: compute a SHA-256 hash of your system prompt before each request and log it. If the hash changes between requests that should be identical, you have found your cache-breaking source. Byte-level matching means invisible differences — trailing spaces, different line endings, CRLF vs LF — will silently prevent cache hits.
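A minimal version of that check, assuming the system content is assembled as a list of blocks as in the earlier examples:

import hashlib
import json

def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the exact serialized bytes of the cacheable prefix for logging."""
    raw = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

# Log this next to each request ID. If two "identical" requests print different
# fingerprints, diff the serialized blocks to find the cache-breaking byte.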
Common Pitfalls
Any dynamic content (timestamps, IDs, user names) in the cached portion will break the cache on every request. Audit your prompts carefully.
Caches expire (typically 5 minutes). If your traffic is sporadic, you may not get cache benefits. Consider batching requests or accepting cold starts.
If tool definitions or examples appear in different orders, the cache breaks. Always sort deterministically — alphabetically, by ID, or by any stable key.
Without tracking cache metrics, you will not know if caching is working. Always log cache_read_input_tokens to verify hits.
Best Practices
Put all stable content (system prompt, docs, tools) at the beginning. Cache benefit ends at the first byte of difference.
Strip whitespace, sort collections, use consistent formatting. Byte-level matching means even invisible differences break cache.
If you have multiple similar requests, send them close together in time to maximize cache hits before TTL expires.
Log cache metrics on every request. Alert if hit rate drops significantly — it indicates prompt structure changed.