If your Claude API integration returns HTTP 429 Too Many Requests, or a batch job starts failing halfway through, you do not just need a bigger retry loop. You need a throughput plan.
This guide explains how Claude API rate limits work, why 429s happen in real systems, and how to build a safer client with exponential backoff, concurrency limits, model routing, prompt caching, and monitoring.
The examples use ClaudeAPI’s Anthropic-compatible endpoint:
https://gw.claudeapi.com
For OpenAI-compatible clients, use:
https://gw.claudeapi.com/v1
ClaudeAPI is an independent third-party technical service provider. Claude and Anthropic are trademarks of their respective owners. This article is based on public Anthropic documentation and practical engineering patterns; it does not imply any official affiliation or endorsement.
How Claude API rate limits work
Anthropic’s public rate-limit documentation describes multiple limits. For the Messages API, the important dimensions are:
| Dimension | Meaning | Why it matters |
|---|---|---|
| RPM | Requests per minute | Limits how many requests you can start |
| ITPM | Input tokens per minute | Limits how much prompt/context you can send |
| OTPM | Output tokens per minute | Limits how much text the model can generate |
Some APIs and account configurations may also have daily token, spend, acceleration, or workspace-level limits. The exact numbers depend on account tier, model, organization, and provider path, so always confirm your live limits in the relevant console.
The important operational point is simple:
Any exhausted dimension can produce a 429.
That means reducing request count alone is not always enough. A few long-context requests can hit token limits even when RPM still looks healthy.
Rate-limit response headers
When you hit a rate limit, the response may include headers that describe the current limit state and when you can retry.
Example shape:
HTTP/1.1 429 Too Many Requests
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-input-tokens-limit: 40000
anthropic-ratelimit-input-tokens-remaining: 0
anthropic-ratelimit-input-tokens-reset: 2026-06-30T12:01:00Z
anthropic-ratelimit-output-tokens-limit: 8000
anthropic-ratelimit-output-tokens-remaining: 0
anthropic-ratelimit-output-tokens-reset: 2026-06-30T12:01:00Z
retry-after: 30
In production, read the retry-after header when it is present. It tells your client how long to wait before retrying. Also log the remaining request and token headers so you can slow down before the system starts returning 429s.
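As a rough sketch of that logging step, the helper below turns the headers from the example shape into a per-dimension state and flags dimensions that are exhausted or running low. The function names and the 20% threshold are illustrative assumptions, and the input is assumed to be a plain dict with lowercase header names.

```python
from datetime import datetime

# Dimension names as they appear in the example headers above.
DIMENSIONS = ("requests", "input-tokens", "output-tokens")


def parse_rate_limit_headers(headers: dict[str, str]) -> dict[str, dict]:
    """Collect limit, remaining, and reset for each dimension that is present."""
    state: dict[str, dict] = {}
    for dim in DIMENSIONS:
        prefix = f"anthropic-ratelimit-{dim}-"
        limit = headers.get(prefix + "limit")
        remaining = headers.get(prefix + "remaining")
        if limit is None or remaining is None:
            continue  # the header set may be partial; skip what is missing
        reset = headers.get(prefix + "reset")
        state[dim] = {
            "limit": int(limit),
            "remaining": int(remaining),
            # Replace the trailing "Z" so older Python versions can parse it.
            "reset": datetime.fromisoformat(reset.replace("Z", "+00:00")) if reset else None,
        }
    return state


def exhausted_dimensions(state: dict[str, dict]) -> list[str]:
    """Dimensions with nothing left in the current window."""
    return [dim for dim, s in state.items() if s["remaining"] == 0]


def low_headroom(state: dict[str, dict], threshold: float = 0.2) -> list[str]:
    """Dimensions whose remaining share is below the (assumed) threshold."""
    return [
        dim
        for dim, s in state.items()
        if s["limit"] > 0 and s["remaining"] / s["limit"] < threshold
    ]
```

Logging the output of `low_headroom` on successful responses gives you a signal to slow down before the first 429 arrives.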
Why 429s keep happening
Most repeated 429s come from one of these patterns.
1. RPM is exhausted by bursty concurrency
If your code starts 100 async tasks at once, you can exhaust request-per-minute capacity even if each request is small.
This often happens in queue workers, cron jobs, and “process all rows” scripts.
2. Input token limits are exhausted by long context
A few requests with large documents, long chat histories, or big retrieved context blocks can burn through input-token limits quickly.
The symptoms can be confusing: your request count looks low, but 429s still appear.
3. Output token limits are exhausted by verbose responses
Output is often the expensive and rate-sensitive side of generation. If prompts allow long answers, the model can consume output capacity faster than expected.
Set max_tokens, ask for concise output, and avoid letting batch jobs request unlimited prose.
4. Multiple services share one API key
In microservice architectures, several services may share the same key. Each service may believe it is under the limit, while the combined traffic exceeds it.
For production systems, either use separate keys per service or implement shared rate limiting at a gateway, queue, or Redis-backed limiter.
5. Daily or budget limits are consumed by offline jobs
Nightly summarization, report generation, embedding-style enrichment, or data cleanup can consume enough quota that daytime product traffic starts failing.
Separate offline workloads from user-facing traffic whenever possible.
Estimate safe concurrency before launch
Many 429 incidents are capacity-planning problems dressed up as bugs. Before launch, convert your quota into a safe operating level.
| Metric | Formula | Use |
|---|---|---|
| Safe RPM | account_rpm * 0.7 | Keeps 30% headroom for spikes |
| Average input tokens | measured from real requests | Helps estimate ITPM pressure |
| Average output tokens | measured from real requests | Helps estimate OTPM pressure |
| Safe input-limited tasks/min | (account_itpm * 0.7) / avg_input_tokens | Estimates prompt-side throughput |
| Safe output-limited tasks/min | (account_otpm * 0.7) / avg_output_tokens | Estimates generation-side throughput |
| Suggested concurrency | min(safe_rpm, input_limited, output_limited) * p95_latency_seconds / 60 | Converts throughput into in-flight requests |
Example:
avg input: 3,000 tokens
avg output: 500 tokens
Even if RPM looks high enough, input-token limits may be the true bottleneck. Increasing concurrency only makes 429s arrive faster if tokens are the limiting dimension.
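To make the arithmetic concrete, here is a small sketch of the formulas in the table. The quota figures (50 RPM, 40,000 ITPM, 8,000 OTPM) are borrowed from the illustrative header example earlier and the 8-second p95 latency is an assumption; substitute your own measured values.

```python
import math


def suggested_concurrency(
    account_rpm: float,
    account_itpm: float,
    account_otpm: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    p95_latency_seconds: float,
    headroom: float = 0.7,
) -> dict:
    """Apply the capacity-planning formulas and report the binding dimension."""
    per_dimension = {
        "rpm": account_rpm * headroom,
        "itpm": account_itpm * headroom / avg_input_tokens,
        "otpm": account_otpm * headroom / avg_output_tokens,
    }
    # The smallest tasks-per-minute figure is the real ceiling.
    bottleneck = min(per_dimension, key=per_dimension.get)
    tasks_per_minute = per_dimension[bottleneck]
    in_flight = tasks_per_minute * p95_latency_seconds / 60
    return {
        "bottleneck": bottleneck,
        "tasks_per_minute": tasks_per_minute,
        # Round down, but never below one in-flight request.
        "concurrency": max(1, math.floor(in_flight)),
    }


plan = suggested_concurrency(
    account_rpm=50,        # illustrative quota, not a real account limit
    account_itpm=40_000,   # illustrative quota
    account_otpm=8_000,    # illustrative quota
    avg_input_tokens=3_000,
    avg_output_tokens=500,
    p95_latency_seconds=8,  # assumed latency
)
print(plan)
```

With these assumed numbers the input-token limit is the binding dimension at roughly 9 tasks per minute, even though the request limit alone would allow 35.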
Production rollout order
Use this order when adding Claude API to a production workflow:
- Measure 50 to 100 real requests before launch.
- Record average input tokens, average output tokens, p95 latency, and failure rate.
- Calculate a conservative starting concurrency limit.
- Retry only temporary failures such as 429, 529, and selected 5xx responses.
- Do not retry 400 validation errors or 401 authentication errors.
- Add a global limiter if multiple services share one API key.
- Start at 30% traffic, then move to 50% and 70% after observing stability.
- Alert on sustained 429 rate, low remaining token capacity, and abnormal daily burn.
That sequence is less exciting than “just increase workers.” It also ruins fewer afternoons.
Exponential backoff with jitter
Fixed-interval retries are dangerous under load. If 100 workers all retry after exactly one second, they create a retry storm.
Use exponential backoff with jitter:
wait = min(base_delay * 2^attempt + random_jitter, max_delay)
If the server provides retry-after, prefer that value.
Python retry example
This example uses the official Anthropic Python SDK style with ClaudeAPI’s Anthropic-compatible base URL.
```python
import os
import random
import time

import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
    # Disable the SDK's built-in retries so the loop below is the only retry layer.
    max_retries=0,
)


def retry_after_seconds(error: Exception) -> float | None:
    """Return the retry-after header in seconds, or None if absent or non-numeric."""
    response = getattr(error, "response", None)
    if response is None:
        return None
    value = response.headers.get("retry-after")
    if not value:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def call_with_retry(
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.RateLimitError as error:
            if attempt == max_retries - 1:
                raise
            # Prefer the server's hint; fall back to exponential backoff with jitter.
            wait = retry_after_seconds(error)
            if wait is None:
                wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)
            print(f"Rate limited. Retry {attempt + 1}/{max_retries} in {wait:.1f}s")
            time.sleep(wait)
        except anthropic.APIStatusError as error:
            # Retry only temporary failures; 400 and 401 are re-raised immediately.
            retryable = error.status_code in {408, 500, 502, 503, 504, 529}
            if not retryable or attempt == max_retries - 1:
                raise
            wait = min(base_delay * (2**attempt) + random.uniform(0, 1), max_delay)
            print(f"Temporary API error {error.status_code}. Retry in {wait:.1f}s")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")


response = call_with_retry(
    messages=[
        {
            "role": "user",
            "content": "Summarize the history of large language models in 100 words.",
        }
    ],
)
print(response.content[0].text)
```
Node.js retry example
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.CLAUDE_API_KEY,
  baseURL: "https://gw.claudeapi.com",
});

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Read retry-after (in seconds) from either a plain object or a Headers instance.
function retryAfterMs(error: unknown): number | null {
  const headers = (error as any)?.headers;
  const value = headers?.["retry-after"] ?? headers?.get?.("retry-after");
  if (!value) return null;
  const seconds = Number(value);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}

export async function callWithRetry(
  messages: Anthropic.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5
): Promise<Anthropic.Message> {
  const baseDelay = 1000;
  const maxDelay = 60000;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 1024,
        messages,
      });
    } catch (error: unknown) {
      const status = (error as any)?.status;
      // Retry only temporary failures; everything else is rethrown.
      const retryable =
        error instanceof Anthropic.RateLimitError ||
        status === 408 ||
        status === 500 ||
        status === 502 ||
        status === 503 ||
        status === 504 ||
        status === 529;

      if (!retryable || attempt === maxRetries - 1) {
        throw error;
      }

      // Prefer the server's hint; fall back to exponential backoff with jitter.
      const retryAfter = retryAfterMs(error);
      const jitter = Math.random() * 1000;
      const wait =
        retryAfter ?? Math.min(baseDelay * Math.pow(2, attempt) + jitter, maxDelay);

      console.log(
        `Temporary API error. Retry ${attempt + 1}/${maxRetries} in ${(wait / 1000).toFixed(1)}s`
      );
      await sleep(wait);
    }
  }

  throw new Error("Max retries exceeded");
}
```
Concurrency control with semaphores
Retries help after you hit a limit. Pre-throttling helps you avoid hitting the limit in the first place.
For Python async workloads, start with a semaphore:
```python
import asyncio
import os

import anthropic

MAX_CONCURRENCY = 10

client = anthropic.AsyncAnthropic(
    api_key=os.environ["CLAUDE_API_KEY"],
    base_url="https://gw.claudeapi.com",
)


async def process_single(sem: asyncio.Semaphore, text: str) -> str:
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": text}],
        )
        return response.content[0].text


async def batch_process(texts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [process_single(sem, text) for text in texts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    batch_process(["summarize item 1", "summarize item 2", "summarize item 3"])
)
```
A local semaphore is only local. If you deploy eight worker replicas and each has MAX_CONCURRENCY = 10, your global concurrency is 80.
For production, move rate-limit state into a shared layer:
- Redis token bucket
- queue worker concurrency controls
- API gateway throttling
- per-service API keys
- separate keys for offline and user-facing workloads
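
To illustrate the first option, here is a minimal in-process sketch of the token-bucket algorithm. It is not a Redis implementation: a shared limiter would keep the same two values (token count and last-update time) in a shared store and update them atomically. The class name and the injectable `clock` parameter are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at a steady per-minute rate."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.clock = clock
        self.updated_at = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.updated_at = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` argument lets one bucket model request count (cost of 1) while a second bucket models input tokens (cost equal to the estimated prompt size), so both dimensions are checked before a request is sent.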
Monitoring metrics
Track these metrics from day one:
| Metric | How to measure | Why it matters |
|---|---|---|
| 429_rate | 429 responses / total requests in a 1-minute window | Shows whether throttling is occasional or systemic |
| retry_count | Number of retry attempts per logical request | Often rises before total failures rise |
| requests_remaining_min | Lowest remaining request capacity per minute | Detects burst pressure |
| input_tokens_remaining_min | Lowest remaining input-token capacity per minute | Detects long-context pressure |
| output_tokens_remaining_min | Lowest remaining output-token capacity per minute | Detects verbose-output pressure |
| p95_latency | P95 latency for successful requests | Helps tune concurrency |
| daily_token_burn | Cumulative daily token use | Prevents offline jobs from consuming the day |
If you use ClaudeAPI, also watch the ClaudeAPI console during rollout. Spikes in retry count or token burn are often visible before users report failures.
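As one possible way to compute the 429_rate metric, the sketch below keeps a sliding window of recent responses in memory. The class name is hypothetical, timestamps are passed in explicitly as seconds, and a real deployment would more likely emit counters to an existing metrics system.

```python
from collections import deque


class RateLimitMonitor:
    """Track the share of 429 responses over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp, was_429)

    def record(self, status_code: int, now: float) -> None:
        self.events.append((now, status_code == 429))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def rate_429(self, now: float) -> float:
        """429 responses divided by total requests inside the window."""
        self._evict(now)
        if not self.events:
            return 0.0
        hits = sum(1 for _, was_429 in self.events if was_429)
        return hits / len(self.events)
```

Alerting on a sustained value rather than a single spike keeps one unlucky burst from paging anyone.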
Model routing reduces pressure
Not every request needs the strongest model. Routing easy work to smaller models can increase throughput and reduce cost.
| Model | Good fit |
|---|---|
| claude-haiku-4-5-20251001 | Classification, short summaries, formatting, simple Q&A |
| claude-sonnet-4-6 | Coding, multi-step reasoning, content generation, RAG |
| claude-opus-4-8 | Complex analysis, long documents, high-precision tasks |
A practical routing pattern:
- Use Haiku for intent classification or lightweight preprocessing.
- Send normal product work to Sonnet.
- Reserve Opus for tasks that truly need deeper reasoning.
This is not just a cost strategy. It is also a capacity strategy. Smaller models and shorter outputs reduce token pressure.
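A routing layer can start as a lookup table. The sketch below uses the model IDs from the table above; the task-type labels are hypothetical and would come from your own classifier or request metadata.

```python
# Model IDs are taken from the table above; tier names are illustrative.
MODEL_BY_TIER = {
    "light": "claude-haiku-4-5-20251001",
    "standard": "claude-sonnet-4-6",
    "deep": "claude-opus-4-8",
}

# Hypothetical task labels, loosely following the "Good fit" column.
LIGHT_TASKS = {"classification", "short_summary", "formatting", "simple_qa"}
DEEP_TASKS = {"complex_analysis", "long_document", "high_precision"}


def pick_model(task_type: str) -> str:
    """Route easy work to the small model and reserve the large one for hard tasks."""
    if task_type in LIGHT_TASKS:
        return MODEL_BY_TIER["light"]
    if task_type in DEEP_TASKS:
        return MODEL_BY_TIER["deep"]
    # Unknown or ordinary product work defaults to the middle tier.
    return MODEL_BY_TIER["standard"]
```

Defaulting unknown task types to the middle tier is a design choice: it avoids silently sending hard work to the smallest model.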
Prompt caching can reduce input-token pressure
Prompt caching is useful when requests share a large stable prefix: a system prompt, policy text, codebase context, document, or repeated examples.
Anthropic’s prompt caching documentation describes cache reads as much cheaper than normal input tokens, and cache writes as more expensive than normal input tokens. Caching pays off when the cached content is reused.
Example:
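```python
# Assumes `client` is configured as in the retry example and that
# `large_codebase_context` holds the stable text you want cached.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": (
                "You are a professional code review assistant. "
                "Here is the stable project context:\n\n"
                + large_codebase_context
            ),
            # Marks the end of the prefix that should be cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Review src/auth.py for security issues.",
        }
    ],
)
```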
Caching is strongest when:
- the same prefix is reused across multiple calls
- the repeated context is large enough to justify cache writes
- your application can preserve a stable prompt prefix
- you monitor cache write and cache read tokens separately
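
For the last point, one way to track cache writes and reads separately is to aggregate the usage fields returned with each response. The sketch below assumes each usage record has already been converted to a plain dict with the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields; the function name is illustrative.

```python
def cache_summary(usages: list[dict]) -> dict:
    """Sum cache writes, cache reads, and uncached input across responses."""
    written = read = uncached = 0
    for usage in usages:
        # Fields may be missing or None when caching was not involved.
        written += usage.get("cache_creation_input_tokens") or 0
        read += usage.get("cache_read_input_tokens") or 0
        uncached += usage.get("input_tokens") or 0
    total = written + read + uncached
    return {
        "cache_write_tokens": written,
        "cache_read_tokens": read,
        "uncached_input_tokens": uncached,
        # Share of all input tokens that were served from the cache.
        "read_share": read / total if total else 0.0,
    }
```

A read share that stays near zero suggests the prefix is changing between calls or the cache is expiring before reuse, in which case you pay for writes without getting the benefit.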
Error handling checklist
| Error | Likely cause | Retry? |
|---|---|---|
| 429 RateLimitError | Request or token limit reached | Yes, with retry-after or backoff |
| 529 OverloadedError | Server overloaded | Yes, with backoff |
| 408 RequestTimeout | Timeout or transient network problem | Limited retry |
| 500/502/503/504 | Temporary server or gateway issue | Limited retry |
| 401 AuthenticationError | Invalid or expired API key | No |
| 400 InvalidRequestError | Bad request, invalid model, invalid parameters | No |
Do not retry errors that require code or configuration changes. Retrying a bad request just turns a bug into a bill.
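The checklist can be encoded as a per-status retry budget. In the sketch below, the specific budgets (5 for "yes", 2 for "limited retry") are assumptions chosen for illustration; the table does not prescribe numbers.

```python
# Maximum retries per status code; anything not listed is never retried.
# The numbers are illustrative assumptions, not recommendations from any SDK.
RETRY_BUDGET = {
    429: 5,  # rate limited: retry with retry-after or backoff
    529: 5,  # overloaded: retry with backoff
    408: 2,  # limited retry
    500: 2,
    502: 2,
    503: 2,
    504: 2,
}


def should_retry(status_code: int, retries_so_far: int) -> bool:
    """Return True if this status still has retry budget left."""
    return retries_so_far < RETRY_BUDGET.get(status_code, 0)
```

Because unlisted codes default to a budget of zero, a 400 or 401 fails on the first attempt and surfaces the bug instead of hiding it behind retries.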
FAQ
How long should I wait after a 429?
Use the retry-after response header when it is present. If it is not present, use exponential backoff with jitter and a maximum wait such as 60 seconds.
Where do I find my actual RPM and token limits?
For Anthropic direct accounts, check the Anthropic Console and official rate-limit documentation. For ClaudeAPI usage, check the ClaudeAPI console for the limits and usage visible to your account.
Do multiple services sharing one key share the same limit?
Yes. Treat the API key as the quota boundary unless your provider documentation says otherwise. If several services use one key, they need shared rate limiting.
Does streaming reduce rate-limit usage?
Streaming does not reduce request count or token usage by itself. It can improve perceived latency because the user sees output earlier. Pair streaming with max_tokens and concise prompts to control output size.
Does prompt caching reduce token pressure?
It can reduce input-token pressure for repeated context. Cache hits are much cheaper than normal input processing, but cache writes have a cost. Measure cache hit rate before assuming savings.
Does ClaudeAPI support Message Batches API?
ClaudeAPI does not currently support the Message Batches API. For large asynchronous workloads, combine controlled queues, model routing, and prompt caching, and contact ClaudeAPI support for current alternatives.
Next steps
- Create or review your API key in the ClaudeAPI console.
- Confirm the right base URL:
  - Anthropic SDK: https://gw.claudeapi.com
  - OpenAI-compatible clients: https://gw.claudeapi.com/v1
- Measure real input and output token usage before raising concurrency.
- Add retries with exponential backoff and jitter.
- Add a global concurrency limiter.
- Route simple work to Haiku and reserve stronger models for harder tasks.
- Use prompt caching when large stable context repeats.



