This site is an independent third-party technical service provider. Claude™ and Anthropic® are trademarks of Anthropic, PBC. This site has no affiliation, endorsement, or partnership with Anthropic.

How to Handle Claude API 429 Rate Limits in Production

A production-focused guide to Claude API rate limits, retry logic, concurrency control, prompt caching, and model routing.

Dev Guides · Tags: claude, api, ratelimit, tutorial · Est. read: 10 min
2026.06.30 published

If your Claude API integration returns HTTP 429 Too Many Requests, or a batch job starts failing halfway through, you do not just need a bigger retry loop. You need a throughput plan.

This guide explains how Claude API rate limits work, why 429s happen in real systems, and how to build a safer client with exponential backoff, concurrency limits, model routing, prompt caching, and monitoring.

The examples use ClaudeAPI’s Anthropic-compatible endpoint:

https://gw.claudeapi.com

For OpenAI-compatible clients, use:

https://gw.claudeapi.com/v1

ClaudeAPI is an independent third-party technical service provider. Claude and Anthropic are trademarks of their respective owners. This article is based on public Anthropic documentation and practical engineering patterns; it does not imply any official affiliation or endorsement.

How Claude API rate limits work

Anthropic’s public rate-limit documentation describes multiple limits. For the Messages API, the important dimensions are:

| Dimension | Meaning | Why it matters |
| --- | --- | --- |
| RPM | Requests per minute | Limits how many requests you can start |
| ITPM | Input tokens per minute | Limits how much prompt/context you can send |
| OTPM | Output tokens per minute | Limits how much text the model can generate |

Some APIs and account configurations may also have daily token, spend, acceleration, or workspace-level limits. The exact numbers depend on account tier, model, organization, and provider path, so always confirm your live limits in the relevant console.

The important operational point is simple:

Any exhausted dimension can produce a 429.

That means reducing request count alone is not always enough. A few long-context requests can hit token limits even when RPM still looks healthy.

Rate-limit response headers

When you hit a rate limit, the response may include headers that describe the current limit state and when you can retry.

Example shape:

HTTP/1.1 429 Too Many Requests
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-input-tokens-limit: 40000
anthropic-ratelimit-input-tokens-remaining: 0
anthropic-ratelimit-input-tokens-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-output-tokens-limit: 8000
anthropic-ratelimit-output-tokens-remaining: 0
anthropic-ratelimit-output-tokens-reset: 2026-06-30T12:01:00Z
retry-after: 30

In production, read the retry-after header when it is present. It tells your client how long to wait before retrying. Also log the remaining request and token headers so you can slow down before the system starts returning 429s.
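As a rough illustration of logging the remaining-capacity headers, the sketch below parses the header names from the example response into one record per dimension and reports the tightest one. The `LimitState` class, the helper names, and the 20% `LOW_WATERMARK` threshold are assumptions made for this example; a gateway may also omit some of these headers, so the parser skips anything missing or malformed.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Header names follow the example response above; some may be absent in practice.
DIMENSIONS = ("requests", "input-tokens", "output-tokens")


@dataclass
class LimitState:
    dimension: str
    limit: int
    remaining: int

    @property
    def fraction_remaining(self) -> float:
        return self.remaining / self.limit if self.limit else 0.0


def parse_limit_headers(headers: Mapping[str, str]) -> list[LimitState]:
    """Collect every dimension that reports both a limit and a remaining value."""
    lowered = {key.lower(): value for key, value in headers.items()}
    states = []
    for dim in DIMENSIONS:
        limit = lowered.get(f"anthropic-ratelimit-{dim}-limit")
        remaining = lowered.get(f"anthropic-ratelimit-{dim}-remaining")
        if limit is None or remaining is None:
            continue
        try:
            states.append(LimitState(dim, int(limit), int(remaining)))
        except ValueError:
            continue  # skip malformed values instead of crashing the caller
    return states


def tightest_dimension(states: list[LimitState]) -> Optional[LimitState]:
    """Return the dimension with the least headroom, or None if nothing was reported."""
    return min(states, key=lambda s: s.fraction_remaining, default=None)


LOW_WATERMARK = 0.2  # hypothetical threshold: slow down below 20% remaining

example = {
    "anthropic-ratelimit-requests-limit": "50",
    "anthropic-ratelimit-requests-remaining": "31",
    "anthropic-ratelimit-input-tokens-limit": "40000",
    "anthropic-ratelimit-input-tokens-remaining": "4000",
    "anthropic-ratelimit-output-tokens-limit": "8000",
    "anthropic-ratelimit-output-tokens-remaining": "6000",
}
tightest = tightest_dimension(parse_limit_headers(example))
if tightest and tightest.fraction_remaining < LOW_WATERMARK:
    print(f"Slow down: {tightest.dimension} at {tightest.fraction_remaining:.0%}")
```

In this made-up snapshot the request count still looks healthy, but input tokens are the dimension closest to exhaustion, which is the signal to throttle before a 429 arrives.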

Why 429s keep happening

Most repeated 429s come from one of these patterns.

1. RPM is exhausted by bursty concurrency

If your code starts 100 async tasks at once, you can exhaust request-per-minute capacity even if each request is small.

This often happens in queue workers, cron jobs, and “process all rows” scripts.

2. Input token limits are exhausted by long context

A few requests with large documents, long chat histories, or big retrieved context blocks can burn through input-token limits quickly.

The symptoms can be confusing: your request count looks low, but 429s still appear.

3. Output token limits are exhausted by verbose responses

Output is often the expensive and rate-sensitive side of generation. If prompts allow long answers, the model can consume output capacity faster than expected.

Set max_tokens, ask for concise output, and avoid letting batch jobs request unlimited prose.

4. Multiple services share one API key

In microservice architectures, several services may share the same key. Each service may believe it is under the limit, while the combined traffic exceeds it.

For production systems, either use separate keys per service or implement shared rate limiting at a gateway, queue, or Redis-backed limiter.

5. Daily or budget limits are consumed by offline jobs

Nightly summarization, report generation, embedding-style enrichment, or data cleanup can consume enough quota that daytime product traffic starts failing.

Separate offline workloads from user-facing traffic whenever possible.

Estimate safe concurrency before launch

Many 429 incidents are capacity-planning problems dressed up as bugs. Before launch, convert your quota into a safe operating level.

| Metric | Formula | Use |
| --- | --- | --- |
| Safe RPM | account_rpm * 0.7 | Keeps 30% headroom for spikes |
| Average input tokens | measured from real requests | Helps estimate ITPM pressure |
| Average output tokens | measured from real requests | Helps estimate OTPM pressure |
| Safe input-limited tasks/min | (account_itpm * 0.7) / avg_input_tokens | Estimates prompt-side throughput |
| Safe output-limited tasks/min | (account_otpm * 0.7) / avg_output_tokens | Estimates generation-side throughput |
| Suggested concurrency | min(safe_rpm, input_limited, output_limited) * p95_latency_seconds / 60 | Converts throughput into in-flight requests |

Example:

avg input:  3,000 tokens
avg output:   500 tokens

Even if RPM looks high enough, input-token limits may be the true bottleneck. Increasing concurrency only makes 429s arrive faster if tokens are the limiting dimension.
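To make the arithmetic concrete, here is a minimal sketch that applies the formulas from the table to the averages in the example. The account limits (50 RPM, 40,000 ITPM, 8,000 OTPM) are borrowed from the illustrative header values earlier in this guide, and the 12-second p95 latency is a made-up number; substitute your own measurements. The function name and return shape are assumptions for this example.

```python
import math


def suggested_concurrency(
    account_rpm: float,
    account_itpm: float,
    account_otpm: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    p95_latency_seconds: float,
    headroom: float = 0.7,  # keep 30% spare, as in the table
) -> dict:
    """Apply the table's formulas and report which dimension binds first."""
    per_minute = {
        "requests": account_rpm * headroom,
        "input tokens": account_itpm * headroom / avg_input_tokens,
        "output tokens": account_otpm * headroom / avg_output_tokens,
    }
    bottleneck = min(per_minute, key=per_minute.get)
    tasks_per_minute = per_minute[bottleneck]
    # Throughput times latency gives the number of requests in flight.
    in_flight = tasks_per_minute * p95_latency_seconds / 60
    return {
        "bottleneck": bottleneck,
        "tasks_per_minute": tasks_per_minute,
        "concurrency": max(1, math.floor(in_flight)),
    }


plan = suggested_concurrency(
    account_rpm=50,
    account_itpm=40_000,
    account_otpm=8_000,
    avg_input_tokens=3_000,
    avg_output_tokens=500,
    p95_latency_seconds=12,  # hypothetical measurement
)
print(plan)
```

With these inputs the input-token dimension binds at roughly 9 tasks per minute, well below the 35 requests per minute that RPM alone would allow, so adding workers would not add throughput.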

Production rollout order

Use this order when adding Claude API to a production workflow:

  1. Measure 50 to 100 real requests before launch.
  2. Record average input tokens, average output tokens, p95 latency, and failure rate.
  3. Calculate a conservative starting concurrency limit.
  4. Retry only temporary failures such as 429, 529, and selected 5xx responses.
  5. Do not retry 400 validation errors or 401 authentication errors.
  6. Add a global limiter if multiple services share one API key.
  7. Start at 30% traffic, then move to 50% and 70% after observing stability.
  8. Alert on sustained 429 rate, low remaining token capacity, and abnormal daily burn.

That sequence is less exciting than “just increase workers.” It also ruins fewer afternoons.
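One possible way to implement the staged traffic step is a deterministic percentage gate. The sketch below hashes a stable key (a user or job ID, for instance) into one of 100 buckets, so a request that was included at 30% stays included at 50% and 70%. The `in_rollout` function and the key format are hypothetical; what happens to requests outside the rollout (an existing code path, a queue, a fallback) is up to your system.

```python
import hashlib

ROLLOUT_STAGES = (30, 50, 70)  # percentages from the rollout list above


def in_rollout(request_key: str, percent: int) -> bool:
    """Return True if this key falls inside the current rollout percentage."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


current_stage = ROLLOUT_STAGES[0]
for key in ("user-1001", "user-1002", "user-1003"):
    path = "Claude API path" if in_rollout(key, current_stage) else "existing path"
    print(f"{key}: {path}")
```

Because the bucket depends only on the key, raising the percentage only adds traffic; it never reshuffles which callers are already on the new path.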

Exponential backoff with jitter

Fixed-interval retries are dangerous under load. If 100 workers all retry after exactly one second, they create a retry storm.

Use exponential backoff with jitter:

wait = min(base_delay * 2^attempt + random_jitter, max_delay)

If the server provides retry-after, prefer that value.
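The formula can be sketched as a small helper. The defaults below (1-second base, 60-second cap, up to 1 second of jitter) mirror the values used elsewhere in this guide; the function name and the `rng` parameter are assumptions added so the behavior can be tested deterministically.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter: float = 1.0,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before the next attempt (attempt counts from 0)."""
    if retry_after is not None:
        return retry_after  # the server's hint wins over the local schedule
    source = rng if rng is not None else random
    return min(base_delay * (2 ** attempt) + source.uniform(0, jitter), max_delay)


# Print the range each attempt can fall into with the default settings.
for attempt in range(7):
    low = min(1.0 * 2 ** attempt, 60.0)
    high = min(1.0 * 2 ** attempt + 1.0, 60.0)
    print(f"attempt {attempt}: {low:.0f}s to {high:.0f}s")
```

The jitter term is what keeps workers that failed at the same moment from retrying at the same moment.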

Python retry example

This example uses the official Anthropic Python SDK with ClaudeAPI’s Anthropic-compatible base URL.

import os
import random
import time

import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
    # Turn off the SDK's built-in retries so call_with_retry is the only retry layer.
    max_retries=0,
)

def retry_after_seconds(error: Exception) -> float | None:
    """Return the retry-after header as seconds, or None if absent or not numeric."""
    response = getattr(error, "response", None)
    if response is None:
        return None

    value = response.headers.get("retry-after")
    if value is None:
        return None

    try:
        return float(value)
    except ValueError:
        return None

def call_with_retry(
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.RateLimitError as error:
            if attempt == max_retries - 1:
                raise

            wait = retry_after_seconds(error)
            if wait is None:
                wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)

            print(f"Rate limited. Retry {attempt + 1}/{max_retries} in {wait:.1f}s")
            time.sleep(wait)

        except anthropic.APIStatusError as error:
            retryable = error.status_code in {408, 500, 502, 503, 504, 529}
            if not retryable or attempt == max_retries - 1:
                raise

            wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)
            print(f"Temporary API error {error.status_code}. Retry in {wait:.1f}s")
            time.sleep(wait)

    raise RuntimeError("Max retries exceeded")

response = call_with_retry(
    messages=[
        {
            "role": "user",
            "content": "Summarize the history of large language models in 100 words.",
        }
    ],
)

print(response.content[0].text)

Node.js (TypeScript) retry example

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.CLAUDE_API_KEY,
  baseURL: "https://gw.claudeapi.com",
  // Turn off the SDK's built-in retries so callWithRetry is the only retry layer.
  maxRetries: 0,
});

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function retryAfterMs(error: unknown): number | null {
  const headers = (error as any)?.headers;
  const value = headers?.["retry-after"] ?? headers?.get?.("retry-after");
  if (!value) return null;

  const seconds = Number(value);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}

export async function callWithRetry(
  messages: Anthropic.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5
): Promise<Anthropic.Message> {
  const baseDelay = 1000;
  const maxDelay = 60000;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 1024,
        messages,
      });
    } catch (error: unknown) {
      const status = (error as any)?.status;
      const retryable =
        error instanceof Anthropic.RateLimitError ||
        status === 408 ||
        status === 500 ||
        status === 502 ||
        status === 503 ||
        status === 504 ||
        status === 529;

      if (!retryable || attempt === maxRetries - 1) {
        throw error;
      }

      const retryAfter = retryAfterMs(error);
      const jitter = Math.random() * 1000;
      const wait =
        retryAfter ?? Math.min(baseDelay * Math.pow(2, attempt) + jitter, maxDelay);

      console.log(
        `Temporary API error. Retry ${attempt + 1}/${maxRetries} in ${(wait / 1000).toFixed(1)}s`
      );

      await sleep(wait);
    }
  }

  throw new Error("Max retries exceeded");
}

Concurrency control with semaphores

Retries help after you hit a limit. Pre-throttling helps you avoid hitting the limit in the first place.

For Python async workloads, start with a semaphore:

import asyncio
import os

import anthropic

MAX_CONCURRENCY = 10

client = anthropic.AsyncAnthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
)

async def process_single(sem: asyncio.Semaphore, text: str) -> str:
    async with sem:
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": text}],
        )
        return response.content[0].text

async def batch_process(texts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [process_single(sem, text) for text in texts]
    return await asyncio.gather(*tasks)

results = asyncio.run(
    batch_process(["summarize item 1", "summarize item 2", "summarize item 3"])
)

A local semaphore is only local. If you deploy eight worker replicas and each has MAX_CONCURRENCY = 10, your global concurrency is 80.

For production, move rate-limit state into a shared layer:

  • Redis token bucket
  • queue worker concurrency controls
  • API gateway throttling
  • per-service API keys
  • separate keys for offline and user-facing workloads
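For the token bucket option, the algorithm itself is small. The sketch below is an in-process version only, with a hypothetical `TokenBucket` class and an injectable clock for testing; it is not thread-safe and does not share state between replicas. A shared version would keep the token count and timestamp in Redis or another central store and update them atomically, which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at rate_per_minute, holding at most `capacity` tokens."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_minute / 60.0  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False without blocking otherwise."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def seconds_until(self, cost: float = 1.0) -> float:
        """How long to wait before `cost` tokens will be available."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)


# One bucket per dimension: cost 1 per request, or the estimated tokens per request.
request_bucket = TokenBucket(rate_per_minute=35, capacity=5)
input_token_bucket = TokenBucket(rate_per_minute=28_000, capacity=6_000)

if request_bucket.try_acquire() and input_token_bucket.try_acquire(cost=3_000):
    print("send request")
else:
    print(f"wait {input_token_bucket.seconds_until(3_000):.1f}s")
```

The rates here are the 70% safe levels from the illustrative limits used earlier; the capacities, which bound burst size, are arbitrary values chosen for the example.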

Monitoring metrics

Track these metrics from day one:

| Metric | How to measure | Why it matters |
| --- | --- | --- |
| 429_rate | 429 responses / total requests in a 1-minute window | Shows whether throttling is occasional or systemic |
| retry_count | Number of retry attempts per logical request | Often rises before total failures rise |
| requests_remaining_min | Lowest remaining request capacity per minute | Detects burst pressure |
| input_tokens_remaining_min | Lowest remaining input-token capacity per minute | Detects long-context pressure |
| output_tokens_remaining_min | Lowest remaining output-token capacity per minute | Detects verbose-output pressure |
| p95_latency | P95 latency for successful requests | Helps tune concurrency |
| daily_token_burn | Cumulative daily token use | Prevents offline jobs from consuming the day |

If you use ClaudeAPI, also watch the ClaudeAPI console during rollout. Spikes in retry count or token burn are often visible before users report failures.
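As one way to compute 429_rate, the sketch below keeps a sliding one-minute window of response status codes in memory. The `RateLimitMonitor` class is a hypothetical helper for illustration; in production you would more likely emit counters to your existing metrics system and compute the ratio there.

```python
import time
from collections import deque
from typing import Callable


class RateLimitMonitor:
    """Tracks the share of 429 responses over a sliding time window."""

    def __init__(
        self,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, status_code) pairs, oldest first

    def _evict(self) -> None:
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def record(self, status_code: int) -> None:
        self.events.append((self.clock(), status_code))
        self._evict()

    def rate_429(self) -> float:
        """429 responses divided by all responses in the window (0.0 if empty)."""
        self._evict()
        if not self.events:
            return 0.0
        throttled = sum(1 for _, status in self.events if status == 429)
        return throttled / len(self.events)


monitor = RateLimitMonitor()
for status in (200, 200, 429, 200):
    monitor.record(status)
print(f"429_rate over the last minute: {monitor.rate_429():.0%}")
```

An alert on this value staying above a small threshold for several consecutive windows separates systemic throttling from an occasional burst.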

Model routing reduces pressure

Not every request needs the strongest model. Routing easy work to smaller models can increase throughput and reduce cost.

| Model | Good fit |
| --- | --- |
| claude-haiku-4-5-20251001 | Classification, short summaries, formatting, simple Q&A |
| claude-sonnet-4-6 | Coding, multi-step reasoning, content generation, RAG |
| claude-opus-4-8 | Complex analysis, long documents, high-precision tasks |

A practical routing pattern:

  1. Use Haiku for intent classification or lightweight preprocessing.
  2. Send normal product work to Sonnet.
  3. Reserve Opus for tasks that truly need deeper reasoning.

This is not just a cost strategy. It is also a capacity strategy. Smaller models and shorter outputs reduce token pressure.
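A routing layer can start as a plain lookup. The sketch below maps a task label to one of the model IDs from the table; the task labels themselves are hypothetical and would come from your own classifier or from the calling code.

```python
HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-8"

# Hypothetical task labels, grouped by the "good fit" column above.
LIGHT_TASKS = {"classification", "short_summary", "formatting", "simple_qa"}
HEAVY_TASKS = {"complex_analysis", "long_document", "high_precision"}


def pick_model(task_type: str) -> str:
    """Route light work to Haiku, heavy work to Opus, and everything else to Sonnet."""
    if task_type in LIGHT_TASKS:
        return HAIKU
    if task_type in HEAVY_TASKS:
        return OPUS
    return SONNET


for task in ("classification", "code_review", "long_document"):
    print(f"{task} -> {pick_model(task)}")
```

Defaulting unknown task types to the middle tier keeps new features from silently landing on the most expensive model.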

Prompt caching can reduce input-token pressure

Prompt caching is useful when requests share a large stable prefix: a system prompt, policy text, codebase context, document, or repeated examples.

Anthropic’s prompt caching documentation describes cache reads as much cheaper than normal input tokens, and cache writes as more expensive than normal input tokens. Caching pays off when the cached content is reused.

Example:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": (
                "You are a professional code review assistant. "
                "Here is the stable project context:\n\n"
                + large_codebase_context
            ),
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review src/auth.py for security issues.",
        }
    ],
)

Caching is strongest when:

  • the same prefix is reused across multiple calls
  • the repeated context is large enough to justify cache writes
  • your application can preserve a stable prompt prefix
  • you monitor cache write and cache read tokens separately
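To monitor cache reads and writes separately, you can aggregate the usage block returned with each response. The sketch below assumes each usage record is a plain dict with the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields; how you obtain that dict depends on your SDK version (for example, converting `response.usage` to a dict), and a gateway may report these fields differently.

```python
def cache_hit_ratio(usages: list[dict]) -> float:
    """Share of all input tokens that were served from the cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Made-up usage records: the first call writes the prefix, later calls read it.
usages = [
    {"input_tokens": 50, "cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
    {"input_tokens": 50, "cache_creation_input_tokens": None, "cache_read_input_tokens": 10_000},
]
print(f"cache hit ratio: {cache_hit_ratio(usages):.1%}")
```

A ratio that stays near zero usually means the prefix is changing between calls or is not being reused before the cache entry expires.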

Error handling checklist

| Error | Likely cause | Retry? |
| --- | --- | --- |
| 429 RateLimitError | Request or token limit reached | Yes, with retry-after or backoff |
| 529 OverloadedError | Server overloaded | Yes, with backoff |
| 408 RequestTimeout | Timeout or transient network problem | Limited retry |
| 500/502/503/504 | Temporary server or gateway issue | Limited retry |
| 401 AuthenticationError | Invalid or expired API key | No |
| 400 InvalidRequestError | Bad request, invalid model, invalid parameters | No |

Do not retry errors that require code or configuration changes. Retrying a bad request just turns a bug into a bill.
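The checklist can be encoded as a small classifier so every caller makes the same decision. The `RetryPolicy` enum and the attempt budgets in `MAX_ATTEMPTS` are assumptions for this sketch; the status-code grouping follows the table above, and anything not listed is treated as non-retryable.

```python
from enum import Enum


class RetryPolicy(Enum):
    RETRY_AFTER_OR_BACKOFF = "retry-after header if present, else backoff"
    BACKOFF = "exponential backoff with jitter"
    LIMITED = "a small number of backoff retries"
    NEVER = "do not retry"


# Hypothetical attempt budgets; tune them to your latency and cost tolerance.
MAX_ATTEMPTS = {
    RetryPolicy.RETRY_AFTER_OR_BACKOFF: 5,
    RetryPolicy.BACKOFF: 5,
    RetryPolicy.LIMITED: 2,
    RetryPolicy.NEVER: 0,
}


def retry_policy(status_code: int) -> RetryPolicy:
    """Map an HTTP status code to the retry behavior from the checklist."""
    if status_code == 429:
        return RetryPolicy.RETRY_AFTER_OR_BACKOFF
    if status_code == 529:
        return RetryPolicy.BACKOFF
    if status_code in {408, 500, 502, 503, 504}:
        return RetryPolicy.LIMITED
    return RetryPolicy.NEVER  # 400, 401, and anything unrecognized


for code in (429, 529, 503, 401, 400):
    policy = retry_policy(code)
    print(f"{code}: {policy.value} (max {MAX_ATTEMPTS[policy]} retries)")
```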

FAQ

How long should I wait after a 429?

Use the retry-after response header when it is present. If it is not present, use exponential backoff with jitter and a maximum wait such as 60 seconds.

Where do I find my actual RPM and token limits?

For Anthropic direct accounts, check the Anthropic Console and official rate-limit documentation. For ClaudeAPI usage, check the ClaudeAPI console for the limits and usage visible to your account.

Do multiple services sharing one key share the same limit?

Yes. Treat the API key as the quota boundary unless your provider documentation says otherwise. If several services use one key, they need shared rate limiting.

Does streaming reduce rate-limit usage?

Streaming does not reduce request count or token usage by itself. It can improve perceived latency because the user sees output earlier. Pair streaming with max_tokens and concise prompts to control output size.

Does prompt caching reduce token pressure?

It can reduce input-token pressure for repeated context. Cache hits are much cheaper than normal input processing, but cache writes have a cost. Measure cache hit rate before assuming savings.

Does ClaudeAPI support Message Batches API?

ClaudeAPI does not currently support the Message Batches API. For large asynchronous workloads, use controlled queues, model routing, and prompt caching, and contact ClaudeAPI support for current alternatives.

Next steps

  1. Create or review your API key in the ClaudeAPI console.
  2. Confirm the right base URL:
    • Anthropic SDK: https://gw.claudeapi.com
    • OpenAI-compatible clients: https://gw.claudeapi.com/v1
  3. Measure real input and output token usage before raising concurrency.
  4. Add retries with exponential backoff and jitter.
  5. Add a global concurrency limiter.
  6. Route simple work to Haiku and reserve stronger models for harder tasks.
  7. Use prompt caching when large stable context repeats.
