Cached token pricing explained
What cached input pricing actually means, where it helps, and how to estimate it honestly.
What cached input pricing is really discounting
Last updated 2026-04-20
Cached input pricing matters when a meaningful part of your prompt stays the same across requests. Think long system prompts, stable policy instructions, repeated document context, or other shared text that the provider can treat as reused input rather than brand-new tokens every time.
That is why cached pricing can change the math so much. If a large chunk of every request is repeated, the effective cost per request may fall well below what you would expect from standard input pricing alone. For some workloads, that difference is large enough to change which model looks economical.
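The blended rate behind that claim is simple to write down. The sketch below uses hypothetical prices ($3.00 standard, $0.30 cached, per 1M input tokens) and an assumed 70 percent cache share; your provider's actual rates and your workload's actual repetition will differ.

```python
def effective_input_rate(standard_rate, cached_rate, cached_share):
    """Blend standard and cached per-token rates by the share of
    input tokens the provider actually serves from cache."""
    return cached_share * cached_rate + (1 - cached_share) * standard_rate

# Hypothetical prices per 1M input tokens: $3.00 standard, $0.30 cached.
standard, cached = 3.00, 0.30

# Assume 70% of each request is repeated context served at the cached rate.
rate = effective_input_rate(standard, cached, cached_share=0.70)
print(f"${rate:.2f} per 1M input tokens")  # → $1.11
```

At these assumed prices, a 70 percent cache share cuts the effective input rate to roughly a third of the list price, which is exactly the kind of gap that can flip a model comparison.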
Where teams overestimate the savings
The common mistake is assuming most of the request is cacheable when only a small piece actually repeats. User messages, fresh retrieval results, and changing task instructions often make up more of the payload than teams expect. If that part changes on every call, it should not be counted as cached input.
A good estimate separates stable context from variable context. If you cannot clearly point to the part of the prompt that repeats, your cache assumption is probably too generous. Honest modeling beats optimistic modeling, especially when cost is tied to margin.
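One way to keep the estimate honest is to write out the token budget of a typical request and label each piece as stable or variable. The counts and section names below are invented for illustration; the point is that the cache share falls out of the labeling rather than being assumed up front.

```python
# Token breakdown for one typical request (hypothetical counts).
prompt = {
    "system_prompt": 1_800,   # identical every call  -> cacheable
    "policy_block": 900,      # identical every call  -> cacheable
    "retrieved_docs": 2_500,  # changes per query     -> not cacheable
    "user_message": 300,      # changes per call      -> not cacheable
}

stable = prompt["system_prompt"] + prompt["policy_block"]
total = sum(prompt.values())
print(f"honest cache share: {stable / total:.0%}")  # → honest cache share: 49%
```

Here the stable context is large in absolute terms, yet the honest cache share is still only about half the request, because retrieval results dominate the payload.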
How to estimate cache share in practice
Start by looking at a real request and marking which tokens are identical across calls. That gives you a concrete cached-input ratio instead of a guess. In many apps, the right number is not 80 percent or 90 percent. It is something more modest that still helps, but does not magically erase input cost.
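Marking identical tokens can be mechanized: compare the token sequences of two real consecutive requests and measure the shared prefix, since providers that discount cached input typically match on a repeated prefix. The token lists below are toy stand-ins; a real measurement would run your actual requests through the model's tokenizer.

```python
def cached_input_ratio(request_a, request_b):
    """Length of the shared token prefix across two real requests,
    as a fraction of the second request's input tokens."""
    prefix = 0
    for ta, tb in zip(request_a, request_b):
        if ta != tb:
            break
        prefix += 1
    return prefix / len(request_b)

# Toy token IDs standing in for two consecutive requests: a shared
# system prompt followed by a different variable tail each time.
system_prompt = list(range(500))                 # 500 stable tokens
req_a = system_prompt + list(range(10_000, 10_300))  # 300 variable tokens
req_b = system_prompt + list(range(20_000, 20_300))  # 300 variable tokens

print(cached_input_ratio(req_a, req_b))  # → 0.625
```

A measured 62.5 percent is a very different planning number than an assumed 90 percent, and it is the measured one that should go into the cost model.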
If you are still early, model a few scenarios. A low cache-share case tells you what happens when usage is messy or prompts change often. A higher cache-share case tells you what efficient prompt architecture could unlock. The spread between those numbers is often more informative than a single neat estimate.
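Those scenarios are cheap to compute. The sketch below assumes 4,000 input tokens per request and the same hypothetical $3.00 / $0.30 per-1M-token prices as before, then compares a messy low-cache case against a disciplined high-cache case.

```python
def cost_per_request(input_tokens, cached_share, standard_rate, cached_rate):
    """Input cost of one request in dollars, given per-1M-token rates."""
    cached = input_tokens * cached_share * cached_rate / 1_000_000
    fresh = input_tokens * (1 - cached_share) * standard_rate / 1_000_000
    return cached + fresh

# Hypothetical workload: 4,000 input tokens per request,
# $3.00 / $0.30 per 1M tokens standard / cached (assumed prices).
low = cost_per_request(4_000, 0.30, 3.00, 0.30)   # messy usage
high = cost_per_request(4_000, 0.85, 3.00, 0.30)  # tight prompt architecture
print(f"low cache share:  ${low:.5f}")
print(f"high cache share: ${high:.5f}")
print(f"spread: {low / high:.1f}x")
```

Under these assumptions the spread between the two scenarios is roughly 3x per request, which says more about the stakes of prompt architecture than either point estimate alone.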
When cached pricing changes the model decision
Cached pricing matters most when repeated context is large enough to dominate input cost. RAG assistants, policy-heavy support bots, and tools that resend the same instruction block can all behave very differently once cached input is taken seriously.
If two models look similar on the price table, but one gives you a better effective rate for repeated context, that can be the difference between a workable unit cost and a feature that is hard to scale. This is why cache-aware estimates are more useful than list-price comparisons.
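A cache-aware comparison can be sketched in a few lines. The two "models" and their prices below are entirely invented to show the crossover effect: the model that wins at list price can lose once repeated context dominates.

```python
def effective_rate(standard, cached, cached_share):
    """Blended per-1M-token input rate under a given cache share."""
    return cached_share * cached + (1 - cached_share) * standard

# Hypothetical price table: Model A is cheaper on list price, but
# Model B discounts cached input much more aggressively.
model_a = {"standard": 2.50, "cached": 1.25}  # 50% cache discount
model_b = {"standard": 3.00, "cached": 0.30}  # 90% cache discount

for share in (0.0, 0.5, 0.8):
    a = effective_rate(model_a["standard"], model_a["cached"], share)
    b = effective_rate(model_b["standard"], model_b["cached"], share)
    print(f"cache share {share:.0%}: A=${a:.2f}  B=${b:.2f}")
```

With zero cache share, Model A wins on price; at 80 percent cache share, Model B's effective rate is well under half of A's. The ranking depends on a workload property, not on the price table alone.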