Rate limiting: algorithms and trade-offs

Rate limiting protects an API from overload, abusive clients, accidental loops, and expensive spikes. The hard part is not adding a limit. The hard part is choosing an algorithm and a response contract that fit the product, the traffic pattern, and the fairness model.

Decide what is being limited

Start with the unit of control. Limits can apply by IP address, authenticated user, organisation, API key, OAuth application, endpoint, tenant, region, or a combination. Public unauthenticated APIs often begin with IP-based limits, but an IP address is a weak identity: many legitimate users share one address behind NAT, mobile networks, and proxies, while attackers can spread requests across many addresses.

Authenticated APIs should usually limit by principal and by application. Multi-tenant APIs often need tenant-wide limits as well, because one customer can create many users or tokens. Expensive endpoints may need stricter per-operation limits than cheap reads.
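One way to picture a combination of units is a function that derives every limiter key a request should be checked against. The sketch below is illustrative only: RequestIdentity, limit_keys, and the key formats are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestIdentity:
    """Hypothetical view of who is making a request."""
    ip: str
    user_id: Optional[str] = None
    app_id: Optional[str] = None
    tenant_id: Optional[str] = None


def limit_keys(identity: RequestIdentity, endpoint: str) -> list[str]:
    """Return every limiter key this request should be counted against."""
    keys = []
    if identity.user_id is None:
        # Unauthenticated traffic: the IP is the only identity available.
        keys.append(f"ip:{identity.ip}")
    else:
        keys.append(f"user:{identity.user_id}")
    if identity.app_id is not None:
        keys.append(f"app:{identity.app_id}")
    if identity.tenant_id is not None:
        # Tenant-wide key, shared by all users and tokens of one customer.
        keys.append(f"tenant:{identity.tenant_id}")
    # Per-operation key, so expensive endpoints can carry stricter limits.
    principal = identity.user_id or identity.ip
    keys.append(f"endpoint:{endpoint}:{principal}")
    return keys
```

A request would be allowed only if the limiter behind each returned key allows it.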

Fixed window counters

A fixed window counter allows a set number of requests per time window, such as 100 requests per minute. It is simple, cheap, and easy to explain. The downside is the boundary burst: a client can send 100 requests at the end of one minute and 100 more at the start of the next, so roughly twice the nominal limit arrives within a few seconds.

Fixed windows are useful for coarse quotas and low-risk APIs. They are less suitable when sharp bursts can overload dependencies.
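A minimal in-memory sketch of a fixed window counter follows. The class name FixedWindowLimiter is hypothetical, the current time is passed in explicitly so the behaviour is easy to reason about, and old windows are never evicted, which a real deployment would have to handle (for example with key expiry in a shared store).

```python
class FixedWindowLimiter:
    """Illustrative fixed window counter, not production code."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> request count. Never cleaned up in this sketch.
        self._counts: dict[tuple[str, int], int] = {}

    def allow(self, key: str, now: float) -> bool:
        # All timestamps in the same window share one integer index.
        window = int(now // self.window_seconds)
        bucket = (key, window)
        count = self._counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self._counts[bucket] = count + 1
        return True
```

With a limit of 100 per 60 seconds, 100 calls at now=59.9 and 100 more at now=60.0 are all allowed, because the two timestamps fall into different windows. That is the boundary burst described above.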

Sliding window logs

A sliding window log stores request timestamps and counts only those within the current rolling period. It is accurate and avoids fixed boundary bursts. The cost is storage and cleanup, especially for high-cardinality identities and high request rates.

Use sliding logs when precision matters and request volume is moderate. For very high volume systems, the memory and write overhead can become the limiting factor.
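The sketch below shows the idea with one in-memory deque of timestamps per key. SlidingWindowLog is a hypothetical name, and treating the window as the half-open interval (now - window, now] is an assumption of this example.

```python
from collections import deque


class SlidingWindowLog:
    """Illustrative sliding window log: one timestamp stored per allowed request."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._log: dict[str, deque] = {}

    def allow(self, key: str, now: float) -> bool:
        log = self._log.setdefault(key, deque())
        cutoff = now - self.window_seconds
        # Cleanup: drop timestamps that have left the rolling window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

Memory grows with the limit times the number of active keys, which is the storage cost mentioned above.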

Sliding window counters

A sliding window counter approximates a rolling window by combining the current and previous fixed windows with weighting. It is cheaper than a full timestamp log and smoother than a basic fixed window.

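The trade-off is approximation: the weighting assumes that requests in the previous window were spread evenly, so the estimate can be slightly high or low when they were not. For most APIs, that approximation is good enough. It gives better burst control without storing every request timestamp.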
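A common form of the weighting is: estimated count = previous window count × (1 − fraction of the current window elapsed) + current window count. The sketch below applies that formula. SlidingWindowCounter is a hypothetical name, and rejecting once the estimate reaches the limit is one reasonable choice rather than the only one.

```python
class SlidingWindowCounter:
    """Illustrative sliding window counter built on two fixed windows."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> count. Old windows are not evicted in this sketch.
        self._counts: dict[tuple[str, int], int] = {}

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        current = self._counts.get((key, window), 0)
        previous = self._counts.get((key, window - 1), 0)
        # Weight the previous window by how much of it the rolling window still covers.
        estimated = previous * (1.0 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False
        self._counts[(key, window)] = current + 1
        return True
```

With a limit of 100 per 60 seconds, a client that spends its whole quota at second 59 is rejected at second 60, because the previous window still carries full weight. Halfway through the next window the previous count is weighted at 0.5, so about half the quota is available again.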

Token bucket

A token bucket refills at a steady rate up to a maximum capacity. Each request consumes one or more tokens. When the bucket is empty, the request is limited. The capacity controls burst size. The refill rate controls sustained throughput.

Token buckets are a strong default for APIs because they allow short bursts while enforcing a long-term rate. They also support weighted costs, where an expensive request consumes more tokens than a cheap request.
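The refill and consume steps can be sketched as below. TokenBucket and its parameter names are hypothetical, the bucket starts full, and tokens are refilled lazily whenever a request arrives instead of by a background timer.

```python
class TokenBucket:
    """Illustrative token bucket for a single key."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained rate
        self.tokens = capacity                      # assumption: start full
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if now > self.updated_at:
            # Lazy refill: add the tokens earned since the last call, up to capacity.
            elapsed = now - self.updated_at
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
            self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

With capacity 10 and a refill rate of 2 tokens per second, a client can burst 10 requests at once and then sustain 2 per second. Passing cost=5 for an expensive request makes it consume five tokens, which is the weighted-cost idea described above.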

Leaky bucket

A leaky bucket processes requests at a steady rate, often through a queue. Bursts are smoothed into a constant drain rate until the queue fills. Once full, new requests are rejected or delayed.

Leaky buckets are useful when a downstream system needs smooth traffic. They can add latency because requests wait in a queue. That makes them a poor fit for interactive APIs, where clients expect an immediate response and it is simpler to reject the request and let the client retry than to hold it on the server.
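A minimal sketch of the queue-based variant follows. LeakyBucketQueue, offer, and drain are hypothetical names, and the caller is assumed to invoke drain periodically with the current time and forward whatever it returns to the downstream system.

```python
from collections import deque


class LeakyBucketQueue:
    """Illustrative leaky bucket: a bounded queue drained at a fixed rate."""

    def __init__(self, capacity: int, drain_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.drain_interval = 1.0 / drain_per_second
        self._queue = deque()
        self._next_drain_at = now

    def offer(self, item, now: float) -> bool:
        """Queue an item, or return False when the bucket is full."""
        if len(self._queue) >= self.capacity:
            return False
        if not self._queue:
            # Idle time must not turn into burst credit.
            self._next_drain_at = max(self._next_drain_at, now)
        self._queue.append(item)
        return True

    def drain(self, now: float) -> list:
        """Release the items whose drain slot has arrived, in arrival order."""
        released = []
        while self._queue and self._next_drain_at <= now:
            released.append(self._queue.popleft())
            self._next_drain_at += self.drain_interval
        return released
```

However bursty the calls to offer are, items leave at most one per drain interval, and the time an item spends in the queue is the added latency.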

Concurrency limits

Rate limits count requests over time. Concurrency limits cap work in progress. They are useful for expensive operations, long polling, report generation, uploads, and endpoints that hold database connections or worker slots.

Concurrency limits should usually be combined with rate limits. A client can stay under a per-minute quota and still create too many simultaneous expensive requests.
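A concurrency limit can be sketched as a per-key counter of in-flight requests. ConcurrencyLimiter is a hypothetical name, and the sketch assumes a single process, so a multi-instance service would need shared state instead.

```python
import threading


class ConcurrencyLimiter:
    """Illustrative per-key cap on work in progress within one process."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight: dict[str, int] = {}
        self._lock = threading.Lock()

    def try_acquire(self, key: str) -> bool:
        """Claim a slot without waiting. Returns False when the key is at its cap."""
        with self._lock:
            current = self._in_flight.get(key, 0)
            if current >= self._max:
                return False
            self._in_flight[key] = current + 1
            return True

    def release(self, key: str) -> None:
        """Give the slot back. Call this when the request finishes, even on error."""
        with self._lock:
            current = self._in_flight.get(key, 0)
            if current <= 1:
                self._in_flight.pop(key, None)
            else:
                self._in_flight[key] = current - 1
```

A handler would call try_acquire before starting the expensive work and call release in a finally block, so a failed request does not leak its slot.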

Return useful limit responses

When a request is limited, return 429 Too Many Requests. Include a clear error body and a retry signal. The Retry-After header can tell the client when to try again. Limit headers can expose remaining quota and reset times, but they must be accurate enough for clients to rely on.

Do not return 500 for deliberate rate limiting. That makes clients treat a policy decision as a server fault.
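The response contract can be sketched framework-free as a status, headers, and body triple. The function name, the error body fields, and the X-RateLimit-* header names are assumptions for illustration. Retry-After is a standard HTTP header, while the X-RateLimit-* names are a widely used convention and not a standard.

```python
import json
import math


def rate_limited_response(limit: int, retry_after_seconds: float) -> tuple[int, dict[str, str], str]:
    """Build an illustrative 429 response for a limited request."""
    # Retry-After carries whole seconds here; round up so clients never retry early.
    retry_after = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
        # Conventional, non-standard header names (assumption).
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": "Too many requests. Retry after the indicated delay.",
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body
```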

Make clients part of the design

Good clients back off, add jitter, respect Retry-After, avoid polling when webhooks are available, cache where allowed, and stop retrying non-retryable errors. Document this behaviour. SDKs should implement it by default.

Do not encourage clients to race the limit. If every response exposes an exact reset second, clients may stampede at the boundary. Jitter and token-based smoothing reduce that effect.
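On the client side, the behaviour described above can be sketched as exponential backoff with full jitter that never retries sooner than Retry-After. The function names, the default base and cap, and the set of retryable status codes are assumptions for illustration, not a recommendation for any specific API.

```python
import random
from typing import Optional

# Assumption: statuses this hypothetical client treats as worth retrying.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    # Exponential growth, capped so delays stay bounded.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random delay up to the ceiling.
    jittered = source.uniform(0.0, ceiling)
    if retry_after is not None:
        # Wait at least as long as the server asked, plus jitter, so clients
        # that were limited together do not all return at the same instant.
        return retry_after + jittered
    return jittered
```

Adding jitter on top of Retry-After spreads retries out after the reset time, which reduces the stampede effect at the boundary.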

Protect fairness and cost

Rate limiting is a security, reliability, and cost control. OWASP classifies unrestricted resource consumption as an API security risk because API requests consume CPU, memory, network bandwidth, and storage, and sometimes paid third-party resources such as email and SMS.

Set stricter limits for high-cost operations, authentication attempts, exports, search, webhook retries, and endpoints that trigger email, SMS, payment, or AI workloads. A single global request count is rarely enough.
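Per-operation limits are often expressed as a policy table with a permissive default. In the sketch below every name, route, and number is invented for illustration. Real values depend on measured cost and capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitPolicy:
    """Hypothetical per-operation limits."""
    requests_per_minute: int
    burst: int
    max_concurrent: int


# Illustrative numbers only.
DEFAULT_POLICY = LimitPolicy(requests_per_minute=600, burst=100, max_concurrent=20)

POLICIES = {
    "POST /login": LimitPolicy(requests_per_minute=10, burst=5, max_concurrent=2),
    "POST /exports": LimitPolicy(requests_per_minute=6, burst=2, max_concurrent=1),
    "GET /search": LimitPolicy(requests_per_minute=120, burst=30, max_concurrent=5),
    "POST /messages/sms": LimitPolicy(requests_per_minute=30, burst=10, max_concurrent=3),
}


def policy_for(operation: str) -> LimitPolicy:
    """Return the stricter policy for a listed operation, else the default."""
    return POLICIES.get(operation, DEFAULT_POLICY)
```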

Conclusion

Fixed windows are simple, sliding logs are accurate, sliding counters are efficient, token buckets balance burst and sustained traffic, leaky buckets smooth downstream load, and concurrency limits protect scarce workers. A good API usually combines these controls, exposes clear 429 responses, and gives clients enough guidance to slow down safely.