Designing Resilient APIs: Retry Strategies, Circuit Breakers, and Graceful Degradation
In a monolith, a function call either works or throws. In a distributed system, a request can succeed, fail, hang indefinitely, succeed on the server but fail on the network path back, or arrive twice. Resilience isn’t about preventing failure; it’s about designing systems that behave predictably when failure inevitably happens.
This article covers the patterns that make the difference between “the upstream service is slow” and “the upstream service is slow and now everything is down.”
Why APIs Fail
Before building defenses, understand the failure modes:
| Failure Mode | What Happens | Example |
|---|---|---|
| Transient | Temporary blip, succeeds on retry | Network timeout, 503 from a scaling event |
| Sustained | Service is down for an extended period | Database outage, bad deployment |
| Partial | Some requests fail, others succeed | One replica is unhealthy, regional DNS issue |
| Cascading | One failure triggers failures in dependent services | Slow upstream causes thread pool exhaustion downstream |
| Byzantine | Service responds but with incorrect data | Stale cache, split-brain, silent data corruption |
The first three are common. The fourth is what takes down production. The fifth is what keeps you up at night and is the hardest to defend against. The patterns in this article primarily address transient through cascading failures. Byzantine failures require different tools: checksums, contract testing, read-after-write verification, and careful cache invalidation. The stale-cache pattern in the degradation section touches on this; serving last-known-good data is one defense when you can’t trust the current response, but a full treatment of Byzantine failures is its own article.
The examples below assume a few common abstractions:
- an HTTP client that throws HttpError with a statusCode property,
- a ConnectionPool concurrency limiter (e.g. p-limit or HTTP agent maxSockets),
- a cache interface (e.g. Redis client with get/set),
- a metrics client (e.g. Datadog, Prometheus),
- a classifyError function that maps errors to metric labels, and
- a logger (e.g. pino, winston).
These are left unimplemented to keep focus on the resilience patterns themselves.
Pattern 1: Retry with Backoff
The simplest resilience pattern: if a request fails, try again. But naive retries cause more problems than they solve.
The Wrong Way
// Don't do this
async function fetchUser(id: string): Promise<User> {
for (let i = 0; i < 5; i++) {
try {
return await httpClient.get(`/users/${id}`);
} catch {
// Immediate retry with no delay
continue;
}
}
throw new Error("Failed after 5 retries");
}
If the service is overloaded, five clients each firing five immediate attempts means 25 near-simultaneous requests instead of 5: five times the load, exactly when the service can least handle it.
The Right Way: Exponential Backoff with Jitter
// Helper: promise-based delay
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
// Helper: classify whether an error is worth retrying.
// HttpError is your HTTP client's error class (e.g. AxiosError, got.HTTPError).
function isRetryable(error: unknown, retryableStatuses: Set<number>): boolean {
if (error instanceof HttpError) {
return retryableStatuses.has(error.statusCode);
}
// Node.js network errors (ECONNRESET, ETIMEDOUT, ECONNREFUSED) are
// generally retryable. In browser/edge runtimes, fetch throws a
// TypeError on network failure, but not all TypeErrors are network
// errors (e.g. a bug in your serialization code), so you would need
// to inspect the message to distinguish them (not handled here).
const code = (error as Record<string, unknown>)?.code;
if (typeof code === "string" && ["ECONNRESET", "ETIMEDOUT", "ECONNREFUSED"].includes(code)) {
return true;
}
return false;
}
interface RetryConfig {
maxRetries: number;
baseDelayMs: number;
maxDelayMs: number;
retryableStatuses: Set<number>;
}
const DEFAULT_RETRY_CONFIG: RetryConfig = {
maxRetries: 3,
baseDelayMs: 200,
maxDelayMs: 10_000,
retryableStatuses: new Set([408, 429, 500, 502, 503, 504]),
};
async function fetchWithRetry<T>(
fn: () => Promise<T>,
config: RetryConfig = DEFAULT_RETRY_CONFIG
): Promise<T> {
let lastError: Error | undefined;
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (!isRetryable(error, config.retryableStatuses)) {
throw error; // 400, 401, 404 - retrying won't help
}
if (attempt === config.maxRetries) break;
const delay = calculateDelay(attempt, config);
await sleep(delay);
}
}
throw lastError;
}
Three details matter here:
1. Only retry what’s retryable. A 400 Bad Request will fail identically on retry. A 503 Service Unavailable might not. The retryableStatuses set makes this explicit.
2. Exponential backoff. Each retry waits longer than the last, giving the upstream service time to recover.
3. Jitter. Without jitter, all clients that failed at the same time will retry at the same time, creating synchronized thundering herds.
function calculateDelay(attempt: number, config: RetryConfig): number {
// Exponential: 200ms, 400ms, 800ms, 1600ms...
const exponential = config.baseDelayMs * Math.pow(2, attempt);
// Cap at maxDelay to avoid absurd wait times
const capped = Math.min(exponential, config.maxDelayMs);
// Full jitter: random value between 0 and the capped delay
// This decorrelates retries across clients
return Math.random() * capped;
}
Full jitter is a robust default with strong theoretical properties; AWS’s architecture blog published the analysis. Decorrelated jitter performs comparably in some workloads, but full jitter is simpler to implement and reason about. The key insight is that spreading retries across the delay window reduces contention regardless of which jitter variant you choose.
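For comparison, a sketch of the decorrelated-jitter variant from that analysis (the `decorrelatedJitter` name is mine; the formula is sleep = min(cap, rand(base, 3 × previous delay))):

```typescript
// Decorrelated jitter: each delay is drawn between the base delay and
// three times the previous delay, then capped. Unlike full jitter it
// carries state (the previous delay) across attempts.
function decorrelatedJitter(
  prevDelayMs: number,
  baseDelayMs: number,
  maxDelayMs: number
): number {
  const next =
    baseDelayMs + Math.random() * (prevDelayMs * 3 - baseDelayMs);
  return Math.min(maxDelayMs, next);
}

// Usage: thread the previous delay through the retry loop
let delayMs = 200; // start at the base delay
delayMs = decorrelatedJitter(delayMs, 200, 10_000);
```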
Respect Retry-After Headers
Some APIs tell you exactly when to retry via the Retry-After header. The generic fetchWithRetry above doesn’t have access to the raw HTTP response, so Retry-After handling lives in your HTTP-client-specific retry wrapper. For example, as an Axios interceptor or a custom fetch wrapper:
// This is an HTTP-layer concern, not part of fetchWithRetry.
// Wire it into your HTTP client's retry interceptor.
function getRetryDelay(
response: Response,
attempt: number,
config: RetryConfig
): number {
const retryAfter = response.headers.get("Retry-After");
if (retryAfter) {
const seconds = parseInt(retryAfter, 10);
if (!isNaN(seconds)) return seconds * 1000;
// Retry-After can also be an HTTP date
const date = new Date(retryAfter);
if (!isNaN(date.getTime())) {
return Math.max(0, date.getTime() - Date.now());
}
}
// No header? Fall back to exponential backoff with jitter
return calculateDelay(attempt, config);
}
When a 429 comes with Retry-After: 30, respect it. Your backoff formula is a guess; the server’s header is informed.
Retries and Idempotency
This is the most dangerous footgun with retries: never retry a non-idempotent operation unless you’ve made it safe to repeat. If POST /orders times out, the server may have processed it, and retrying could create a duplicate order and double-charge a user.
The fix is an idempotency key, a client-generated ID that the server uses to deduplicate:
async function createOrder(order: Order): Promise<OrderResult> {
// Generate ONCE, reuse across all retry attempts
const idempotencyKey = crypto.randomUUID();
return fetchWithRetry(() =>
httpClient.post("/orders", order, {
headers: { "Idempotency-Key": idempotencyKey },
})
);
}
The server stores the idempotency key alongside the result. If the same key arrives again, it returns the stored result instead of processing a second time.
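A sketch of the server side, using a hypothetical in-memory store. In production this would be a Redis key or database row with a TTL, and the claim step must be atomic so concurrent duplicates can’t both process:

```typescript
// Hypothetical in-memory idempotency store. Single-process only;
// a real deployment needs an atomic setIfAbsent (e.g. Redis SET NX).
interface OrderResult { orderId: string }
type StoredResult =
  | { status: "pending" }
  | { status: "done"; body: OrderResult };

const idempotencyStore = new Map<string, StoredResult>();

async function handleCreateOrder(
  key: string,
  process: () => Promise<OrderResult>
): Promise<OrderResult> {
  const existing = idempotencyStore.get(key);
  // Replay of a completed request: return the stored result
  if (existing?.status === "done") return existing.body;
  // Duplicate while the original is still processing
  if (existing?.status === "pending") {
    throw new Error("409: request with this key is still in flight");
  }
  idempotencyStore.set(key, { status: "pending" }); // claim the key
  const body = await process();
  idempotencyStore.set(key, { status: "done", body }); // store for replays
  return body;
}
```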
Pattern 2: Circuit Breaker
Retries help with transient failures. But when a service is down for minutes or hours, retries just add load to an already struggling system. A circuit breaker detects sustained failure and stops sending requests entirely.
The State Machine
A circuit breaker has three states:
┌─────────┐ failure threshold ┌──────┐
│ CLOSED ├───────────────────────►│ OPEN │
│(normal) │ │(fail)│
└────┬────┘ └──┬───┘
▲ │
│ success in │ timeout
│ half-open │ expires
│ ▼
│ ┌──────────┐
└──────────────────────────┤HALF-OPEN │
│ (probe) │
└──────────┘
- Closed: Requests flow normally. Failures are counted.
- Open: Requests fail immediately without calling the upstream. No load on a broken service.
- Half-Open: After a timeout, one probe request is allowed through. If it succeeds, close the circuit. If it fails, reopen.
Implementation
type CircuitState = "closed" | "open" | "half-open";
interface CircuitBreakerConfig {
failureThreshold: number; // Failures before opening
resetTimeoutMs: number; // How long to stay open before probing
monitorWindowMs: number; // Window for counting failures
}
class CircuitBreaker {
private state: CircuitState = "closed";
private failureTimestamps: number[] = []; // Sliding window of failure times
private nextProbeTime = 0;
private halfOpenInFlight = false;
constructor(
private readonly name: string,
private readonly config: CircuitBreakerConfig
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === "open") {
if (Date.now() < this.nextProbeTime) {
throw new CircuitOpenError(
`Circuit "${this.name}" is open. ` +
`Next probe in ${this.nextProbeTime - Date.now()}ms`
);
}
// Timeout expired — transition to half-open
this.state = "half-open";
this.halfOpenInFlight = false;
}
if (this.state === "half-open") {
// Only allow one probe request through at a time.
// In a concurrent environment (e.g. multiple in-flight requests),
// without this guard multiple requests slip through simultaneously
// and a batch of failures could call trip() redundantly.
if (this.halfOpenInFlight) {
throw new CircuitOpenError(
`Circuit "${this.name}" is half-open, probe already in flight`
);
}
this.halfOpenInFlight = true;
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
if (this.state === "half-open") {
// Probe succeeded — service is back
this.state = "closed";
this.failureTimestamps = [];
this.halfOpenInFlight = false;
logger.info(`Circuit "${this.name}" closed — service recovered`);
}
// In closed state, a success doesn't reset the window;
// old failures expire naturally via monitorWindowMs
}
private onFailure(): void {
const now = Date.now();
if (this.state === "half-open") {
// Probe failed — reopen
this.halfOpenInFlight = false;
this.trip("probe failed, circuit reopened");
return;
}
// Record failure and evict timestamps outside the monitor window
this.failureTimestamps.push(now);
const windowStart = now - this.config.monitorWindowMs;
this.failureTimestamps = this.failureTimestamps.filter(
(t) => t >= windowStart
);
if (this.failureTimestamps.length >= this.config.failureThreshold) {
this.trip(
`${this.failureTimestamps.length} failures in ` +
`${this.config.monitorWindowMs}ms, circuit opened`
);
}
}
private trip(reason: string): void {
this.state = "open";
this.nextProbeTime = Date.now() + this.config.resetTimeoutMs;
logger.warn(
`Circuit "${this.name}": ${reason}. ` +
`Will probe in ${this.config.resetTimeoutMs}ms`
);
}
getState(): CircuitState {
return this.state;
}
}
class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = "CircuitOpenError";
}
}
Choosing Thresholds
The failure threshold and reset timeout aren’t arbitrary — they represent a trade-off:
| Setting | Too Low | Too High |
|---|---|---|
| Failure threshold | Opens on transient blips, causing false positives | Takes too long to detect a real outage |
| Reset timeout | Floods a recovering service with probes | Stays open long after the service has recovered |
A reasonable starting point for most internal services:
const config: CircuitBreakerConfig = {
failureThreshold: 5, // 5 failures within the monitor window
resetTimeoutMs: 30_000, // 30 seconds before probing
monitorWindowMs: 60_000, // Only count failures within the last minute
};
Tune from there based on your service’s traffic volume and typical recovery time. A service that recovers in seconds (autoscaling) needs a shorter reset timeout than one that requires manual intervention (database failover).
Using Circuit Breakers with Retries
Retries and circuit breakers work together, but order matters:
// Retry wraps the circuit breaker, not the other way around
const userService = new CircuitBreaker("user-service", circuitConfig);
async function getUser(id: string): Promise<User> {
return fetchWithRetry(
() => userService.execute(() => httpClient.get(`/users/${id}`)),
retryConfig
);
}
The retry logic handles transient failures (one bad request). The circuit breaker handles sustained failures (the service is down). If you put the circuit breaker outside the retry, you’d count each retry attempt as a separate failure and trip the circuit too early.
Pattern 3: Timeouts
Every outbound call needs a timeout. Without one, a hanging upstream ties up your threads, connections, and memory indefinitely.
async function fetchWithTimeout<T>(
fn: (signal: AbortSignal) => Promise<T>,
timeoutMs: number
): Promise<T> {
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeoutMs);
try {
// Pass the signal to fn so the underlying request is actually
// cancelled; without this, aborting only stops us waiting in JS
// while the network request continues consuming resources.
return await fn(controller.signal);
} finally {
clearTimeout(timer);
}
}
// Usage: the caller threads the signal into the HTTP client
const user = await fetchWithTimeout(
(signal) => httpClient.get(`/users/${id}`, { signal }),
500
);
Setting Timeout Values
A common mistake is setting timeouts based on how long the request should take. Instead, set them based on how long you can afford to wait.
// These values should reflect your SLA, not your optimism
const TIMEOUT_CONFIG = {
// Fast, cached lookups — fail fast
userProfileMs: 500,
// Heavier operations, more tolerance
searchResultsMs: 3_000,
// Background jobs, can wait longer
reportGenerationMs: 30_000,
};
The rule of thumb: Your timeout should be slightly above the p99 latency of the upstream service under normal conditions. If the p99 is 200ms, a 500ms timeout gives headroom for slow requests without letting truly hung connections tie up resources.
Timeout Budgets
In a chain of service calls, timeouts should cascade:
async function handleRequest(req: Request): Promise<Response> {
// Total budget: 2 seconds
const deadline = Date.now() + 2000;
const user = await fetchWithTimeout(
(signal) => getUser(req.userId, { signal }),
Math.max(0, deadline - Date.now()) // Remaining budget
);
const recommendations = await fetchWithTimeout(
(signal) => getRecommendations(user, { signal }),
Math.max(0, deadline - Date.now()) // Whatever's left
);
return buildResponse(user, recommendations);
}
If getUser takes 1.8 seconds of your 2-second budget, getRecommendations only gets 200ms. Without budget propagation, the second call would start with its own full timeout, consuming a connection, a thread, and upstream resources for work the original caller will never see (its own 2-second timeout fires first). Budget propagation ensures that downstream calls don’t waste resources on requests that are already doomed to time out at the caller.
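Budgets can also extend across service boundaries by forwarding the absolute deadline, so the callee computes its own remaining budget instead of starting a fresh one. A sketch; the `x-deadline-ms` header name is an invention here, not a standard:

```typescript
// Caller side: attach the absolute deadline (epoch ms) for downstream use.
function withDeadlineHeader(
  headers: Record<string, string>,
  deadlineEpochMs: number
): Record<string, string> {
  return { ...headers, "x-deadline-ms": String(deadlineEpochMs) };
}

// Callee side: recover the remaining budget, falling back to a local
// default when the caller didn't send a deadline.
function remainingBudgetMs(
  headers: Record<string, string>,
  defaultMs = 2_000
): number {
  const raw = headers["x-deadline-ms"];
  const deadline = raw ? Number(raw) : NaN;
  if (Number.isNaN(deadline)) return defaultMs;
  return Math.max(0, deadline - Date.now());
}
```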
Pattern 4: Bulkheads
Named after ship compartments that contain flooding, bulkheads isolate failures so one struggling dependency doesn’t take down everything else.
The Problem
// Shared thread/connection pool, one bad dependency poisons all
class ApiGateway {
private pool = new ConnectionPool({ maxConnections: 100 });
async getUser(id: string) {
return this.pool.execute(() => userService.get(id));
}
async getOrders(userId: string) {
return this.pool.execute(() => orderService.get(userId));
}
async getRecommendations(userId: string) {
return this.pool.execute(() => recoService.get(userId));
}
}
If recoService starts responding slowly, all 100 connections get consumed waiting for recommendations. Now getUser and getOrders can’t get a connection either, even though those services are perfectly healthy.
The Solution: Isolated Pools
class ApiGateway {
// Each dependency gets its own pool, failures stay contained
private userPool = new ConnectionPool({ maxConnections: 40 });
private orderPool = new ConnectionPool({ maxConnections: 40 });
private recoPool = new ConnectionPool({ maxConnections: 20 });
async getUser(id: string) {
return this.userPool.execute(() => userService.get(id));
}
async getOrders(userId: string) {
return this.orderPool.execute(() => orderService.get(userId));
}
async getRecommendations(userId: string) {
return this.recoPool.execute(() => recoService.get(userId));
}
}
Now if recoService exhausts its 20 connections, getUser and getOrders still have their 40 connections each. The blast radius is contained.
Sizing Bulkheads
Allocate pool sizes based on criticality, not equal distribution:
- Critical path (user auth, checkout): Larger pools, higher priority
- Enhancement (recommendations, analytics): Smaller pools, acceptable to degrade
- Background (logging, metrics): Smallest pools, best-effort
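The ConnectionPool used throughout is left abstract; a minimal semaphore-style sketch of what it might look like (a real pool would also cap queue length and time out waiters rather than buffer unboundedly):

```typescript
// Minimal concurrency limiter: at most maxConnections fns run at once;
// the rest wait (approximately FIFO). Illustrative only — production
// code should reject when the wait queue grows too long.
class ConnectionPool {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private readonly opts: { maxConnections: number }) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.opts.maxConnections) {
      // Wait until a running task frees a slot
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      this.waiters.shift()?.(); // wake the next waiter, if any
    }
  }
}
```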
This connects directly to the next pattern, graceful degradation.
Pattern 5: Graceful Degradation
When a dependency is unavailable, don’t return a 500. Return a degraded response that still serves the user.
Not all data is equally important. Define what’s critical vs. what’s optional:
| Tier | Behavior on Failure | Example |
|---|---|---|
| Critical | Return error to caller | User authentication, payment processing |
| Important | Return cached/stale data | Product pricing (serve last-known price) |
| Optional | Return empty/default | Recommendations, recently viewed, ads |
Critical calls don’t need a wrapper; just let them throw naturally. Important-tier data uses getWithStaleCache (shown below) to serve last-known-good values, and withFallback handles the Optional tier:
// withFallback only handles non-critical tiers. For critical calls,
// just let them throw naturally without a wrapper.
// I use { value: T } wrapper instead of T | (() => T) to avoid
// ambiguity when T itself is a function type.
// Lazy: () => [] as Order[]; Static: { value: myDefault }
type Fallback<T> = { value: T } | (() => T);
async function withFallback<T>(
primary: () => Promise<T>,
fallback: Fallback<T>,
options: { name: string }
): Promise<T> {
try {
return await primary();
} catch (error) {
logger.warn(`${options.name} degraded to fallback`, {
error: (error as Error).message,
});
return typeof fallback === "function"
? fallback()
: fallback.value;
}
}
// Usage: getUserProfile with tiered degradation
interface UserProfile {
user: User;
orders: Order[];
recommendations: Product[];
}
async function getUserProfile(userId: string): Promise<UserProfile> {
// Critical — no fallback, let it throw
const user = await getUser(userId);
// Optional, serve empty arrays rather than failing the whole page
const [orders, recommendations] = await Promise.all([
withFallback(
() => getOrders(userId),
() => [] as Order[],
{ name: "orders" }
),
withFallback(
() => getRecommendations(userId),
() => [] as Product[],
{ name: "recommendations" }
),
]);
return { user, orders, recommendations };
}
Cache as a Degradation Layer
For “important” tier data, stale cache is better than nothing:
async function getWithStaleCache<T>(
key: string,
fetcher: () => Promise<T>,
ttlMs: number
): Promise<T> {
try {
const fresh = await fetcher();
await cache.set(key, fresh, ttlMs);
return fresh;
} catch (error) {
// Primary failed? Try serving stale cache
const stale = await cache.get(key);
if (stale !== undefined) {
logger.warn(`Serving stale cache for ${key}`, {
error: (error as Error).message,
});
return stale;
}
// No cache either? Propagate the failure
throw error;
}
}
This is the “stale-while-error” pattern. The cache serves as a safety net, not a performance optimization. In production, this is the difference between “the pricing service is slow” and “users see no prices at all.”
Putting It All Together
These patterns compose. A production-grade service call typically layers them:
Request
→ Retry with backoff (handle transient failures)
→ Circuit breaker (stop hammering dead services)
→ Bulkhead (isolate from other dependencies)
→ Timeout (don't wait forever)
→ Upstream service call
→ Graceful degradation (serve what you can on failure)
// ConnectionPool: a concurrency limiter that caps in-flight requests
// per dependency (e.g. p-limit, custom semaphore, or HTTP agent maxSockets).
class ResilientServiceClient {
private circuitBreaker: CircuitBreaker;
private pool: ConnectionPool;
private timeoutMs: number;
private retryConfig: RetryConfig;
constructor(
name: string,
config: {
circuit: CircuitBreakerConfig;
pool: { maxConnections: number };
timeout: number;
retry: RetryConfig;
}
) {
this.circuitBreaker = new CircuitBreaker(name, config.circuit);
this.pool = new ConnectionPool(config.pool);
this.timeoutMs = config.timeout;
this.retryConfig = config.retry;
}
async call<T>(fn: (signal: AbortSignal) => Promise<T>): Promise<T> {
return fetchWithRetry(
() =>
this.circuitBreaker.execute(() =>
this.pool.execute(() =>
fetchWithTimeout(fn, this.timeoutMs)
)
),
this.retryConfig
);
}
}
// Usage
const userClient = new ResilientServiceClient("user-service", {
circuit: { failureThreshold: 5, resetTimeoutMs: 30_000, monitorWindowMs: 60_000 },
pool: { maxConnections: 40 },
timeout: 500,
retry: { maxRetries: 2, baseDelayMs: 100, maxDelayMs: 2_000, retryableStatuses: new Set([502, 503, 504]) },
});
const user = await userClient.call(
(signal) => httpClient.get(`/users/${id}`, { signal })
);
Observability: Knowing When Patterns Activate
Resilience patterns are invisible until they fire. Without observability, your circuit breaker could be open for an hour and nobody would know.
What to Measure
// Emit metrics at every decision point
async function instrumentedRetry<T>(
fn: () => Promise<T>,
config: RetryConfig
): Promise<T> {
let attempt = 0;
const start = Date.now();
try {
const result = await fetchWithRetry(async () => {
attempt++;
if (attempt > 1) {
// Only count actual retries (attempt 2+), not the initial request
metrics.increment("api.retry", { retryNumber: String(attempt - 1) });
}
return fn();
}, config);
// Total duration including all retry delays
metrics.histogram("api.latency", Date.now() - start, {
status: "success",
retries: String(attempt - 1),
});
return result;
} catch (error) {
// Failed requests need latency data too, without this,
// your dashboards silently exclude the requests you most
// need to understand.
metrics.histogram("api.latency", Date.now() - start, {
status: "failure",
retries: String(attempt - 1),
reason: classifyError(error),
});
throw error;
}
}
Key metrics to track:
- Retry rate: If retries exceed 10% of requests, something is degrading
- Circuit breaker state changes: Alert on open, log on close
- Timeout rate: Sustained timeouts mean your timeout value is wrong or the upstream is degraded
- Bulkhead saturation: If a pool is consistently above 80% utilization, it’s too small
- Degradation activations: How often are you serving fallback data?
Alert on Patterns, Not Symptoms
Don’t alert on every individual timeout or retry; that’s noise. Alert on rates and state changes that indicate a systemic problem:
- Circuit breaker opened (warning): something is failing hard enough to trip the breaker. Investigate the upstream.
- Retry rate exceeds 10% of total requests (warning): transient failures are becoming less transient. The upstream may be degrading.
- Degraded responses sustained for 10+ minutes (critical): you’re serving fallback data long enough that users are noticeably affected. Time for human intervention.
The key principle: a single retry is expected behavior. A sustained pattern of retries is a signal. Configure your alerting tool (Prometheus, Datadog, New Relic) to evaluate rates over time windows, not individual events.
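The rate-over-window idea, sketched in application code to make it concrete (the `WindowedRate` class is illustrative; in practice this evaluation lives in the metrics backend, not the service):

```typescript
// Tracks request outcomes in a sliding time window and reports the
// fraction that were retries. Illustrative only — use your metrics
// backend's rate functions in production.
class WindowedRate {
  private events: Array<{ t: number; isRetry: boolean }> = [];

  constructor(private readonly windowMs: number) {}

  record(isRetry: boolean, now = Date.now()): void {
    this.events.push({ t: now, isRetry });
    const cutoff = now - this.windowMs;
    this.events = this.events.filter((e) => e.t >= cutoff); // evict old
  }

  retryRate(now = Date.now()): number {
    const cutoff = now - this.windowMs;
    const live = this.events.filter((e) => e.t >= cutoff);
    if (live.length === 0) return 0;
    return live.filter((e) => e.isRetry).length / live.length;
  }
}
```

An alerting rule would then fire when `retryRate()` stays above 0.10 for a sustained period, rather than on any single retry.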
Common Mistakes
Setting the same timeout for every call. A cache lookup and a report generation have wildly different latency profiles. Use per-operation timeouts that reflect the actual work being done.
Circuit breaker per instance vs. per service. If one replica is unhealthy and you circuit-break the entire service, you’ve stopped traffic to healthy replicas too. Consider per-endpoint or per-instance circuit breakers when your upstream has multiple replicas behind a load balancer.
Not considering hedged requests for latency-sensitive paths. Retries only fire after a failure. For latency-critical operations, consider hedging: send a duplicate request after the p95 latency threshold and take whichever response arrives first. This trades extra load (typically around 5% more requests) for significantly tighter tail latency. It’s not always appropriate (only use it with idempotent reads, and only when the backend can absorb the extra traffic), but for paths like search or pricing lookups where p99 latency directly impacts revenue, hedging is a powerful complement to retry-on-failure.
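A sketch of hedging, assuming the operation is an idempotent read; `hedgeDelayMs` would be set near the upstream’s p95:

```typescript
// Fire the primary request; if it hasn't settled within hedgeDelayMs,
// fire a second identical request and take whichever settles first.
// Only safe for idempotent reads. Note the losing request is not
// cancelled here — a fuller version would thread an AbortSignal.
async function hedged<T>(
  fn: () => Promise<T>,
  hedgeDelayMs: number
): Promise<T> {
  const primary = fn();
  const hedge = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => fn().then(resolve, reject), hedgeDelayMs);
    // If the primary settles first, cancel the pending hedge.
    primary.finally(() => clearTimeout(timer)).catch(() => {});
  });
  return Promise.race([primary, hedge]);
}
```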
Catching too broadly. Not every error is a connectivity issue. A TypeError thrown by a bug in your serialization code shouldn’t trigger a retry; it will fail identically every time. Classify errors before deciding how to handle them.
Use Existing Libraries First
In production, most teams reach for battle-tested libraries rather than implementing these primitives themselves: cockatiel and opossum in Node.js, resilience4j on the JVM, or Polly in .NET. What matters is understanding the semantics (retry budgets, circuit breaker state transitions, timeout propagation) so you can configure them correctly and debug them when they fire.
Key Takeaways
- Only retry what’s retryable: 4xx errors (except 408, 429) won’t succeed on retry
- Exponential backoff with full jitter: Decorrelates retries and prevents thundering herds
- Circuit breakers stop cascading failure: Fail fast instead of waiting for a dead service
- Every outbound call needs a timeout: Set based on what you can afford to wait, not what you expect
- Bulkheads contain blast radius: One slow dependency shouldn’t starve the rest
- Degrade gracefully: Serve what you can, log what you can’t, return errors only for critical data
- Instrument everything: Resilience patterns are invisible without observability
These patterns aren’t theoretical; they’re the difference between “service X had a bad deploy and served slightly degraded results for 20 minutes” and “service X had a bad deploy and took down the entire platform.” In distributed systems, failure is a feature of the architecture. Design for it.