Anthropic

Anthropic models support explicit prompt caching. On this platform, whether you call them through the OpenAI chat/completions protocol or the Anthropic v1/messages protocol, you mark the content to cache by adding "cache_control": {"type": "ephemeral"} to a content block, as in this request:
{
  "model": "claude-sonnet-4-5-20250929",
  "max_tokens": 4096,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": { "type": "ephemeral" }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
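The request body above can also be assembled in code. The following is a minimal sketch, not an official client: the helper name build_cached_request is hypothetical, and it only builds the JSON body without sending it.

```python
import json


def build_cached_request(model: str, cached_text: str, question: str,
                         max_tokens: int = 4096) -> dict:
    """Build a request body whose first content block is marked for caching."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": cached_text,
                        # Mark the large, reusable block as cacheable.
                        "cache_control": {"type": "ephemeral"},
                    },
                    # The question changes per request, so it is left unmarked.
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


body = build_cached_request(
    "claude-sonnet-4-5-20250929",
    "HUGE TEXT BODY",
    "Name all the characters in the above book",
)
print(json.dumps(body, indent=2))
```

Keeping the large block first and the changing question last means the part of the prompt that repeats across requests is the part that carries cache_control.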
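⚠️ cache_control is an extension field provided by this platform. It is not part of the official OpenAI SDK protocol, so you must add it to the request explicitly. You can confirm cache creation and cache hits from the token usage returned in the response. The first example below shows a cache write; the second, after the --- separator, shows a cache read.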
{
  "prompt_tokens": 7039,
  "completion_tokens": 650,
  "total_tokens": 7689,
  "prompt_tokens_details": {
    "cached_tokens": 7019,
    "cache_creation_input_tokens": 7019,  # 👈 cache created
    "cache_read_input_tokens": 0
  }
}
---
{
  "prompt_tokens": 7042,
  "completion_tokens": 572,
  "total_tokens": 7614,
  "prompt_tokens_details": {
    "audio_tokens": 0,
    "cached_tokens": 7019,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 7019 # 👈 cache read
  }
}
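The two usage objects above differ only in which cache counter is non-zero, so a small check can tell them apart. This is a sketch under the assumption that the usage object has the shape shown above; the function name cache_status is hypothetical.

```python
def cache_status(usage: dict) -> str:
    """Return 'read', 'created', or 'none' based on the cache counters."""
    details = usage.get("prompt_tokens_details") or {}
    created = details.get("cache_creation_input_tokens", 0)
    read = details.get("cache_read_input_tokens", 0)
    if read > 0:
        return "read"      # tokens were served from an existing cache entry
    if created > 0:
        return "created"   # this request wrote a new cache entry
    return "none"          # no caching activity was reported


first = {
    "prompt_tokens": 7039,
    "prompt_tokens_details": {
        "cached_tokens": 7019,
        "cache_creation_input_tokens": 7019,
        "cache_read_input_tokens": 0,
    },
}
second = {
    "prompt_tokens": 7042,
    "prompt_tokens_details": {
        "cached_tokens": 7019,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 7019,
    },
}
print(cache_status(first), cache_status(second))
```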
⚠️ For Anthropic models, prompt caching requires a minimum number of input tokens:
  • Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4.5, Claude Sonnet 4, and Claude Sonnet 3.7: 1024 tokens
  • Claude Haiku 4.5, Claude Haiku 3.5, and Claude Haiku 3: 2048 tokens
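A pre-flight check against these minimums might look like the sketch below. It assumes the model ID contains the family name ("opus", "sonnet", or "haiku"), which holds for the sample ID in this document but is not guaranteed for every model. The table covers only the versions listed above, and the names MIN_CACHE_TOKENS and meets_cache_minimum are hypothetical.

```python
# Minimums copied from the list above, keyed by model family.
# Only the listed versions are covered; other versions may differ.
MIN_CACHE_TOKENS = {"opus": 1024, "sonnet": 1024, "haiku": 2048}


def meets_cache_minimum(model: str, input_tokens: int) -> bool:
    """Check whether a prompt is long enough to be cached for this model."""
    name = model.lower()
    for family, minimum in MIN_CACHE_TOKENS.items():
        if family in name:
            return input_tokens >= minimum
    raise ValueError(f"unknown model family: {model}")


print(meets_cache_minimum("claude-sonnet-4-5-20250929", 7039))
# "claude-haiku-4-5" is an illustrative ID, not one taken from this page.
print(meets_cache_minimum("claude-haiku-4-5", 1500))
```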

OpenAI and OpenAI-compatible models

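These models may support implicit caching, which needs no extra request fields. When you repeatedly call the same model with the same prompt prefix, the request may hit the cache. In the example below, the second round repeats the first round's messages as its prefix: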
// Round 1
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "system",
      "content": "HUGE TEXT BODY: Complete API documentation, code style guide, best practices (5000+ lines)"
    },
    {
      "role": "user",
      "content": "How do I authenticate API requests?"
    }
  ]
}

// Round 2 - Documentation cached
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "system",
      "content": "HUGE TEXT BODY: Complete API documentation, code style guide, best practices (5000+ lines)"
    },
    {
      "role": "user",
      "content": "How do I authenticate API requests?"
    },
    {
      "role": "assistant",
      "content": "Use Bearer token in Authorization header..."
    },
    {
      "role": "user",
      "content": "What about rate limiting?"
    }
  ]
}
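Because implicit caching depends on a repeated prefix, it can help to confirm that a new request really extends the previous one. The sketch below compares the two rounds above message by message. This is a coarse proxy, since providers match on the tokenized prompt rather than on whole messages, and the function name shared_prefix_length is hypothetical.

```python
def shared_prefix_length(previous: list, current: list) -> int:
    """Count how many leading messages are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


docs = ("HUGE TEXT BODY: Complete API documentation, code style guide, "
        "best practices (5000+ lines)")
round_1 = [
    {"role": "system", "content": docs},
    {"role": "user", "content": "How do I authenticate API requests?"},
]
# Round 2 appends to round 1 without modifying the earlier messages.
round_2 = round_1 + [
    {"role": "assistant", "content": "Use Bearer token in Authorization header..."},
    {"role": "user", "content": "What about rate limiting?"},
]
print(shared_prefix_length(round_1, round_2))
```

If the count is smaller than the length of the previous request, something earlier in the conversation changed, and the cached prefix is shorter than expected.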
The following is a usage example for a cache hit:
{
  "prompt_tokens": 3003,
  "completion_tokens": 1564,
  "total_tokens": 4567,
  "prompt_tokens_details": {
    "cached_tokens": 2025 # 👈 cache hit
  }
}
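To gauge how much of a prompt was served from cache, divide cached_tokens by prompt_tokens. The following is a minimal sketch that assumes the usage shape shown above; the function name cached_fraction is hypothetical.

```python
def cached_fraction(usage: dict) -> float:
    """Return the share of prompt tokens that were read from cache."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens == 0:
        return 0.0  # avoid dividing by zero on empty or missing usage data
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0) / prompt_tokens


usage = {
    "prompt_tokens": 3003,
    "completion_tokens": 1564,
    "total_tokens": 4567,
    "prompt_tokens_details": {"cached_tokens": 2025},
}
print(f"{cached_fraction(usage):.1%} of the prompt came from cache")
```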

Gemini

Gemini currently supports only implicit caching, which requires no manual setup and no cache_control configuration. When you repeatedly call the same model with the same prompt prefix, the request may hit the cache. Notes:
  • The average TTL (cache lifetime) is 3–5 minutes, but it may vary (for example, it may be only a few seconds)
  • Gemini 2.5 Flash requires a minimum input of 1024 tokens, while Gemini 2.5 Pro requires a minimum of 4096 tokens
The following is a usage example for a cache hit:
{
  "prompt_tokens": 2004,
  "completion_tokens": 1564,
  "total_tokens": 3568,
  "prompt_tokens_details": {
    "cached_tokens": 1994 # 👈 cache hit
  }
}
For input examples, see the OpenAI and OpenAI-compatible models section above.