How to Handle Claude API 429 Rate Limits in Production

If your Claude API integration returns HTTP 429 Too Many Requests, or a batch job starts failing halfway through, you do not just need a bigger retry loop. You need a throughput plan.

This guide explains how Claude API rate limits work, why 429s happen in real systems, and how to build a safer client with exponential backoff, concurrency limits, model routing, prompt caching, and monitoring.

The examples use ClaudeAPI’s Anthropic-compatible endpoint:

https://gw.claudeapi.com

For OpenAI-compatible clients, use:

https://gw.claudeapi.com/v1

ClaudeAPI is an independent third-party technical service provider. Claude and Anthropic are trademarks of their respective owners. This article is based on public Anthropic documentation and practical engineering patterns; it does not imply any official affiliation or endorsement.

How Claude API rate limits work

Anthropic’s public rate-limit documentation describes multiple limits. For the Messages API, the important dimensions are:

Dimension	Meaning	Why it matters
RPM	Requests per minute	Limits how many requests you can start
ITPM	Input tokens per minute	Limits how much prompt/context you can send
OTPM	Output tokens per minute	Limits how much text the model can generate

Some APIs and account configurations may also have daily token, spend, acceleration, or workspace-level limits. The exact numbers depend on account tier, model, organization, and provider path, so always confirm your live limits in the relevant console.

The important operational point is simple:

Any exhausted dimension can produce a 429.

That means reducing request count alone is not always enough. A few long-context requests can hit token limits even when RPM still looks healthy.

Rate-limit response headers

When you hit a rate limit, the response may include headers that describe the current limit state and when you can retry.

Example shape:

HTTP/1.1 429 Too Many Requests
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-input-tokens-limit: 40000
anthropic-ratelimit-input-tokens-remaining: 0
anthropic-ratelimit-input-tokens-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-output-tokens-limit: 8000
anthropic-ratelimit-output-tokens-remaining: 0
anthropic-ratelimit-output-tokens-reset: 2026-06-30T12:01:00Z
retry-after: 30

In production, read the retry-after header when it is present. It tells your client how long to wait before retrying. Also log the remaining request and token headers so you can slow down before the system starts returning 429s.
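As a rough illustration of reading those headers, the sketch below works on a plain dict of header names and values shaped like the example above. The helper names `parse_rate_limit_headers` and `should_slow_down`, and the 10% threshold, are assumptions made for illustration; they are not part of any SDK.

```python
from datetime import datetime

# Header name segments taken from the example response above.
DIMENSIONS = ("requests", "input-tokens", "output-tokens")


def parse_rate_limit_headers(headers: dict[str, str]) -> dict:
    """Collect limit, remaining, and reset per dimension, plus retry-after."""
    lowered = {name.lower(): value for name, value in headers.items()}
    state: dict = {"retry_after": None, "dimensions": {}}

    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            state["retry_after"] = float(retry_after)
        except ValueError:
            pass  # Non-numeric value: leave unset and fall back to backoff.

    for dim in DIMENSIONS:
        prefix = f"anthropic-ratelimit-{dim}-"
        limit = lowered.get(prefix + "limit")
        remaining = lowered.get(prefix + "remaining")
        reset = lowered.get(prefix + "reset")
        if limit is None or remaining is None:
            continue  # This dimension was not reported.
        state["dimensions"][dim] = {
            "limit": int(limit),
            "remaining": int(remaining),
            # Replace the trailing "Z" so older Python versions can parse it.
            "reset": datetime.fromisoformat(reset.replace("Z", "+00:00")) if reset else None,
        }
    return state


def should_slow_down(state: dict, threshold: float = 0.10) -> bool:
    """True when any dimension has less than `threshold` of its limit left."""
    return any(
        d["limit"] > 0 and d["remaining"] / d["limit"] < threshold
        for d in state["dimensions"].values()
    )
```

Logging the parsed state on every response, not only on 429s, is what makes it possible to throttle before the limit is reached.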

Why 429s keep happening

Most repeated 429s come from one of these patterns.

1. RPM is exhausted by bursty concurrency

If your code starts 100 async tasks at once, you can exhaust request-per-minute capacity even if each request is small.

This often happens in queue workers, cron jobs, and “process all rows” scripts.

2. Input token limits are exhausted by long context

A few requests with large documents, long chat histories, or big retrieved context blocks can burn through input-token limits quickly.

The symptoms can be confusing: your request count looks low, but 429s still appear.

3. Output token limits are exhausted by verbose responses

Output is often the expensive and rate-sensitive side of generation. If prompts allow long answers, the model can consume output capacity faster than expected.

Set max_tokens, ask for concise output, and avoid letting batch jobs request unlimited prose.

4. Multiple services share one API key

In microservice architectures, several services may share the same key. Each service may believe it is under the limit, while the combined traffic exceeds it.

For production systems, either use separate keys per service or implement shared rate limiting at a gateway, queue, or Redis-backed limiter.

5. Daily or budget limits are consumed by offline jobs

Nightly summarization, report generation, embedding-style enrichment, or data cleanup can consume enough quota that daytime product traffic starts failing.

Separate offline workloads from user-facing traffic whenever possible.

Estimate safe concurrency before launch

Many 429 incidents are capacity-planning problems dressed up as bugs. Before launch, convert your quota into a safe operating level.

Metric	Formula	Use
Safe RPM	`account_rpm * 0.7`	Keeps 30% headroom for spikes
Average input tokens	measured from real requests	Helps estimate ITPM pressure
Average output tokens	measured from real requests	Helps estimate OTPM pressure
Safe input-limited tasks/min	`(account_itpm * 0.7) / avg_input_tokens`	Estimates prompt-side throughput
Safe output-limited tasks/min	`(account_otpm * 0.7) / avg_output_tokens`	Estimates generation-side throughput
Suggested concurrency	`min(safe_rpm, input_limited, output_limited) * p95_latency_seconds / 60`	Converts throughput into in-flight requests

Example:

avg input:  3,000 tokens
avg output:   500 tokens

Even if RPM looks high enough, input-token limits may be the true bottleneck. Increasing concurrency only makes 429s arrive faster if tokens are the limiting dimension.
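To make the arithmetic concrete, here is a small sketch that applies the formulas from the table. Only the token averages come from the example above; the quota values (50 RPM, 40,000 ITPM, 8,000 OTPM) and the 6-second p95 latency are hypothetical placeholders, and `plan_concurrency` is an illustrative name.

```python
def plan_concurrency(
    account_rpm: float,
    account_itpm: float,
    account_otpm: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    p95_latency_seconds: float,
    headroom: float = 0.7,
) -> dict:
    """Apply the table's formulas and report which dimension binds first."""
    tasks_per_minute_by_dimension = {
        "requests": account_rpm * headroom,
        "input_tokens": account_itpm * headroom / avg_input_tokens,
        "output_tokens": account_otpm * headroom / avg_output_tokens,
    }
    # The smallest sustainable rate is the real ceiling.
    bottleneck = min(tasks_per_minute_by_dimension, key=tasks_per_minute_by_dimension.get)
    tasks_per_minute = tasks_per_minute_by_dimension[bottleneck]
    return {
        "tasks_per_minute_by_dimension": tasks_per_minute_by_dimension,
        "bottleneck": bottleneck,
        "tasks_per_minute": tasks_per_minute,
        # Throughput times latency gives the number of requests in flight.
        "suggested_concurrency": tasks_per_minute * p95_latency_seconds / 60,
    }


# Hypothetical quota and latency; replace with your measured values.
plan = plan_concurrency(
    account_rpm=50,
    account_itpm=40_000,
    account_otpm=8_000,
    avg_input_tokens=3_000,
    avg_output_tokens=500,
    p95_latency_seconds=6,
)
print(plan["bottleneck"], round(plan["tasks_per_minute"], 1), round(plan["suggested_concurrency"], 2))
```

With these assumed numbers, input tokens bind first at roughly 9 tasks per minute, which at a 6-second p95 works out to about one request in flight, far below what the RPM figure alone would suggest.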

Production rollout order

Use this order when adding Claude API to a production workflow:

1. Measure 50 to 100 real requests before launch.
2. Record average input tokens, average output tokens, p95 latency, and failure rate.
3. Calculate a conservative starting concurrency limit.
4. Retry only temporary failures such as 429, 529, and selected 5xx responses.
5. Do not retry 400 validation errors or 401 authentication errors.
6. Add a global limiter if multiple services share one API key.
7. Start at 30% traffic, then move to 50% and 70% after observing stability.
8. Alert on sustained 429 rate, low remaining token capacity, and abnormal daily burn.

That sequence is less exciting than “just increase workers.” It also ruins fewer afternoons.

Exponential backoff with jitter

Fixed-interval retries are dangerous under load. If 100 workers all retry after exactly one second, they create a retry storm.

Use exponential backoff with jitter:

wait = min(base_delay * 2^attempt + random_jitter, max_delay)

If the server provides retry-after, prefer that value.
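The formula can be sketched as a small helper. The name `backoff_delay` and its default values are illustrative assumptions; the schedule at the end disables jitter only to make the exponential growth and the cap easy to see.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_jitter: float = 1.0,
    retry_after: Optional[float] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after  # The server's hint wins over the computed delay.
    jitter = random.uniform(0, max_jitter)
    return min(base_delay * (2 ** attempt) + jitter, max_delay)


# Print the schedule without jitter: the delay doubles until it reaches the cap.
for attempt in range(7):
    print(attempt, backoff_delay(attempt, max_jitter=0.0))
```

In real use, keep the jitter on. It is what stops a fleet of workers from retrying in lockstep.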

Python retry example

This example uses the official Anthropic Python SDK style with ClaudeAPI’s Anthropic-compatible base URL.

import os
import random
import time

import anthropic

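client = anthropic.Anthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
    # The SDK retries some errors on its own by default; disable that here so
    # call_with_retry is the only retry layer and attempts do not multiply.
    max_retries=0,
)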

def retry_after_seconds(error: Exception) -> float | None:
    response = getattr(error, "response", None)
    if response is None:  # explicit None check; do not rely on the object's truthiness
        return None

    value = response.headers.get("retry-after")
    if not value:
        return None

    try:
        return float(value)
    except ValueError:
        return None

def call_with_retry(
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.RateLimitError as error:
            if attempt == max_retries - 1:
                raise

            wait = retry_after_seconds(error)
            if wait is None:
                wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)

            print(f"Rate limited. Retry {attempt + 1}/{max_retries} in {wait:.1f}s")
            time.sleep(wait)

        except anthropic.APIStatusError as error:
            retryable = error.status_code in {408, 500, 502, 503, 504, 529}
            if not retryable or attempt == max_retries - 1:
                raise

            wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)
            print(f"Temporary API error {error.status_code}. Retry in {wait:.1f}s")
            time.sleep(wait)

    raise RuntimeError("Max retries exceeded")

response = call_with_retry(
    messages=[
        {
            "role": "user",
            "content": "Summarize the history of large language models in 100 words.",
        }
    ],
)

print(response.content[0].text)

Node.js retry example

import Anthropic from "@anthropic-ai/sdk";

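const client = new Anthropic({
  apiKey: process.env.CLAUDE_API_KEY,
  baseURL: "https://gw.claudeapi.com",
  // The SDK retries some errors on its own by default; disable that here so
  // callWithRetry is the only retry layer and attempts do not multiply.
  maxRetries: 0,
});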

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function retryAfterMs(error: unknown): number | null {
  const headers = (error as any)?.headers;
  const value = headers?.["retry-after"] ?? headers?.get?.("retry-after");
  if (!value) return null;

  const seconds = Number(value);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}

export async function callWithRetry(
  messages: Anthropic.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5
): Promise<Anthropic.Message> {
  const baseDelay = 1000;
  const maxDelay = 60000;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 1024,
        messages,
      });
    } catch (error: unknown) {
      const status = (error as any)?.status;
      const retryable =
        error instanceof Anthropic.RateLimitError ||
        status === 408 ||
        status === 500 ||
        status === 502 ||
        status === 503 ||
        status === 504 ||
        status === 529;

      if (!retryable || attempt === maxRetries - 1) {
        throw error;
      }

      const retryAfter = retryAfterMs(error);
      const jitter = Math.random() * 1000;
      const wait =
        retryAfter ?? Math.min(baseDelay * Math.pow(2, attempt) + jitter, maxDelay);

      console.log(
        `Temporary API error. Retry ${attempt + 1}/${maxRetries} in ${(wait / 1000).toFixed(1)}s`
      );

      await sleep(wait);
    }
  }

  throw new Error("Max retries exceeded");
}

Concurrency control with semaphores

Retries help after you hit a limit. Pre-throttling helps you avoid hitting the limit in the first place.

For Python async workloads, start with a semaphore:

import asyncio
import os

import anthropic

MAX_CONCURRENCY = 10

client = anthropic.AsyncAnthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
)

async def process_single(sem: asyncio.Semaphore, text: str) -> str:
    async with sem:
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": text}],
        )
        return response.content[0].text

async def batch_process(texts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [process_single(sem, text) for text in texts]
    return await asyncio.gather(*tasks)

results = asyncio.run(
    batch_process(["summarize item 1", "summarize item 2", "summarize item 3"])
)

A local semaphore is only local. If you deploy eight worker replicas and each has MAX_CONCURRENCY = 10, your global concurrency is 80.

For production, move rate-limit state into a shared layer:

- Redis token bucket
- queue worker concurrency controls
- API gateway throttling
- per-service API keys
- separate keys for offline and user-facing workloads
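To show what a token bucket does, here is a minimal in-process sketch. It is not a shared limiter: a production version would keep `tokens` and `updated_at` in Redis or a gateway so that every replica draws from the same budget. The class name, the injectable `clock` parameter, and the 35 requests per minute figure are assumptions for illustration.

```python
import time


class TokenBucket:
    """In-process token bucket; a shared limiter would store this state centrally."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, up to capacity.
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def seconds_until_available(self, cost: float = 1.0) -> float:
        """How long a caller should wait before `cost` tokens are available."""
        self._refill()
        missing = max(0.0, cost - self.tokens)
        return missing / self.refill_per_second


# Hypothetical budget: 35 requests per minute with bursts of up to 5.
request_bucket = TokenBucket(capacity=5, refill_per_second=35 / 60)
allowed = request_bucket.try_acquire()
print("start request" if allowed else f"wait {request_bucket.seconds_until_available():.1f}s")
```

The same shape works for token budgets: use a second bucket sized from ITPM and pass the estimated prompt size as `cost`.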

Monitoring metrics

Track these metrics from day one:

Metric	How to measure	Why it matters
`429_rate`	429 responses / total requests in a 1-minute window	Shows whether throttling is occasional or systemic
`retry_count`	Number of retry attempts per logical request	Often rises before total failures rise
`requests_remaining_min`	Lowest remaining request capacity per minute	Detects burst pressure
`input_tokens_remaining_min`	Lowest remaining input-token capacity per minute	Detects long-context pressure
`output_tokens_remaining_min`	Lowest remaining output-token capacity per minute	Detects verbose-output pressure
`p95_latency`	P95 latency for successful requests	Helps tune concurrency
`daily_token_burn`	Cumulative daily token use	Prevents offline jobs from consuming the day

If you use ClaudeAPI, also watch the ClaudeAPI console during rollout. Spikes in retry count or token burn are often visible before users report failures.
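As one possible way to compute `429_rate`, the sketch below keeps a sliding window of recent response status codes. The class name `RateLimitWindow` and the injectable `clock` are illustrative assumptions; in most systems the same number would come from an existing metrics pipeline instead.

```python
import time
from collections import deque


class RateLimitWindow:
    """Tracks the share of 429 responses over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, status_code) pairs, oldest first

    def record(self, status_code: int) -> None:
        """Call once per completed request, including failed ones."""
        self.events.append((self.clock(), status_code))
        self._evict()

    def _evict(self) -> None:
        # Drop events that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def rate_429(self) -> float:
        """429 responses divided by total requests in the window (0.0 if empty)."""
        self._evict()
        if not self.events:
            return 0.0
        throttled = sum(1 for _, status in self.events if status == 429)
        return throttled / len(self.events)
```

Alerting on a rate that stays high across several consecutive windows, instead of a single spike, separates occasional throttling from a systemic capacity problem.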

Model routing reduces pressure

Not every request needs the strongest model. Routing easy work to smaller models can increase throughput and reduce cost.

Model	Good fit
`claude-haiku-4-5-20251001`	Classification, short summaries, formatting, simple Q&A
`claude-sonnet-4-6`	Coding, multi-step reasoning, content generation, RAG
`claude-opus-4-8`	Complex analysis, long documents, high-precision tasks

A practical routing pattern:

- Use Haiku for intent classification or lightweight preprocessing.
- Send normal product work to Sonnet.
- Reserve Opus for tasks that truly need deeper reasoning.

This is not just a cost strategy. It is also a capacity strategy. Smaller models and shorter outputs reduce token pressure.
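A routing layer can be as simple as a lookup table. In the sketch below, the model IDs are the ones listed in the table above, while the task categories and the choice of Sonnet as the fallback are hypothetical and should be replaced with whatever your application actually distinguishes.

```python
# Model IDs come from the table above; the task categories are hypothetical.
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5-20251001",
    "short_summary": "claude-haiku-4-5-20251001",
    "formatting": "claude-haiku-4-5-20251001",
    "coding": "claude-sonnet-4-6",
    "rag": "claude-sonnet-4-6",
    "complex_analysis": "claude-opus-4-8",
}
DEFAULT_MODEL = "claude-sonnet-4-6"


def pick_model(task_type: str) -> str:
    """Return the model for a task type, falling back to the mid-tier default."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


print(pick_model("classification"))
print(pick_model("unknown_task"))
```

Keeping the mapping in one place also makes it easy to move a task category to a smaller model once you have measured that quality holds up.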

Prompt caching can reduce input-token pressure

Prompt caching is useful when requests share a large stable prefix: a system prompt, policy text, codebase context, document, or repeated examples.

Anthropic’s prompt caching documentation describes cache reads as much cheaper than normal input tokens, and cache writes as more expensive than normal input tokens. Caching pays off when the cached content is reused.

Example:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": (
                "You are a professional code review assistant. "
                "Here is the stable project context:\n\n"
                + large_codebase_context
            ),
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review src/auth.py for security issues.",
        }
    ],
)

Caching is strongest when:

- the same prefix is reused across multiple calls
- the repeated context is large enough to justify cache writes
- your application can preserve a stable prompt prefix
- you monitor cache write and cache read tokens separately
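To monitor cache writes and reads separately, one option is to aggregate the usage counters returned with each response. The sketch below assumes each usage record is a dict with `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields, as in Anthropic's Messages API usage object; confirm the field names against the responses your provider path actually returns. The function name and the sample numbers are made up for illustration.

```python
def cache_read_share(usages: list[dict]) -> float:
    """Fraction of prompt tokens served from cache across a set of responses."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Illustrative numbers: the first call writes the prefix, the next two read it.
sample_usages = [
    {"input_tokens": 50, "cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
]
print(f"{cache_read_share(sample_usages):.2f}")
```

A share that stays near zero means the prefix is being rewritten on every call, so the application is paying for cache writes without getting the reads.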

Error handling checklist

Error	Likely cause	Retry?
`429 RateLimitError`	Request or token limit reached	Yes, with `retry-after` or backoff
`529 OverloadedError`	Server overloaded	Yes, with backoff
`408 RequestTimeout`	Timeout or transient network problem	Limited retry
`500/502/503/504`	Temporary server or gateway issue	Limited retry
`401 AuthenticationError`	Invalid or expired API key	No
`400 InvalidRequestError`	Bad request, invalid model, invalid parameters	No

Do not retry errors that require code or configuration changes. Retrying a bad request just turns a bug into a bill.
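The checklist can be encoded as a small policy function so that every call site makes the same decision. The function name, the policy labels, and the attempt budgets (5 and 2) are assumptions for illustration; tune them to your own tolerance.

```python
# Status codes grouped as in the checklist above.
RETRY_WITH_BACKOFF = {429, 529}
LIMITED_RETRY = {408, 500, 502, 503, 504}

# Hypothetical attempt budgets per policy.
MAX_ATTEMPTS = {"retry": 5, "limited_retry": 2, "fail": 1}


def retry_policy(status_code: int) -> str:
    """Classify a status code as 'retry', 'limited_retry', or 'fail'."""
    if status_code in RETRY_WITH_BACKOFF:
        return "retry"
    if status_code in LIMITED_RETRY:
        return "limited_retry"
    # Everything else, including 400 and 401, needs a code or config fix.
    return "fail"


for code in (429, 503, 401):
    policy = retry_policy(code)
    print(code, policy, MAX_ATTEMPTS[policy])
```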

FAQ

How long should I wait after a 429?

Use the retry-after response header when it is present. If it is not present, use exponential backoff with jitter and a maximum wait such as 60 seconds.

Where do I find my actual RPM and token limits?

For Anthropic direct accounts, check the Anthropic Console and official rate-limit documentation. For ClaudeAPI usage, check the ClaudeAPI console for the limits and usage visible to your account.

Do multiple services sharing one key share the same limit?

Yes. Treat the API key as the quota boundary unless your provider documentation says otherwise. If several services use one key, they need shared rate limiting.

Does streaming reduce rate-limit usage?

Streaming does not reduce request count or token usage by itself. It can improve perceived latency because the user sees output earlier. Pair streaming with max_tokens and concise prompts to control output size.

Does prompt caching reduce token pressure?

It can reduce input-token pressure for repeated context. Cache hits are much cheaper than normal input processing, but cache writes have a cost. Measure cache hit rate before assuming savings.

Does ClaudeAPI support Message Batches API?

The source article states that ClaudeAPI does not currently support the Message Batches API. For large asynchronous workloads, use controlled queues, model routing, and prompt caching, and contact ClaudeAPI support for current alternatives.

Next steps

1. Create or review your API key in the ClaudeAPI console.
2. Confirm the right base URL:
   - Anthropic SDK: https://gw.claudeapi.com
   - OpenAI-compatible clients: https://gw.claudeapi.com/v1
3. Measure real input and output token usage before raising concurrency.
4. Add retries with exponential backoff and jitter.
5. Add a global concurrency limiter.
6. Route simple work to Haiku and reserve stronger models for harder tasks.
7. Use prompt caching when large stable context repeats.
