Idempotency and retries in distributed systems
Retries are necessary in distributed systems because networks, processes, and dependencies fail in partial and ambiguous ways. The hard part is not deciding whether to retry. The hard part is making a retry safe when the caller does not know whether the previous attempt failed before execution, during execution, after execution, or only while returning the response. The answer is to design for repeated intent, with idempotency on the server and disciplined retries on the client.
Retries solve transient failure
A transient failure is a temporary condition such as a timeout, a dropped connection, a throttled dependency, or a service that is briefly unavailable. Retrying can turn that failure into success without user involvement.
Retries are harmful when they are unbounded, immediate, or applied to operations that are not safe to repeat. They can amplify load during an outage, keep a failing dependency overloaded, and create duplicate side effects. Retry policy is therefore part of the API contract and the capacity model, not just client convenience.
Timeouts create ambiguity
A timeout does not prove that the operation failed. It only proves that the caller did not receive a response in time. The service might still be processing the request, the response might have been lost, or the operation might have completed successfully.
This ambiguity is why idempotency belongs on the server side. A client cannot make a non-idempotent server operation safe by retrying carefully. The server has to recognise repeated intent and produce a stable outcome.
Idempotency makes repetition safe
An operation is idempotent when making the same request more than once has the same intended effect as making it once. HTTP already defines this for some methods. Under RFC 9110, GET, HEAD, OPTIONS, and TRACE are safe, which also makes them idempotent, and PUT and DELETE are idempotent even though they change state. POST is neither safe nor idempotent by default.
Many business operations use POST, and that is where duplicate side effects appear. Creating a payment, submitting an order, sending an invite, provisioning infrastructure, or starting a job can all repeat when retried. For these operations, the usual design is to require an idempotency key, sometimes called a client request token. The server records the key, the request identity, and the result of the first accepted operation. A later request with the same key returns the recorded result instead of executing the side effect again.
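The record-and-replay behaviour can be sketched in a few lines. This is a minimal, single-process illustration, not a production design: the PaymentService class, its fields, and the response shape are hypothetical, and it ignores concurrency, request matching, and expiry.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Toy in-memory service; all names here are hypothetical."""
    results: dict = field(default_factory=dict)  # idempotency key -> stored response
    charges: list = field(default_factory=list)  # side effects actually performed

    def create_payment(self, idempotency_key: str, amount: int) -> dict:
        # A repeated key returns the recorded result without a new side effect.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges.append(amount)  # the side effect runs once per key
        response = {"payment_id": len(self.charges), "amount": amount}
        self.results[idempotency_key] = response
        return response


service = PaymentService()
first = service.create_payment("key-1", 500)
retry = service.create_payment("key-1", 500)  # replayed, not re-executed
```

After both calls, only one charge has been recorded and the retry receives the same response as the first attempt.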
Require keys where duplicate execution would be harmful or expensive. Do not require them for simple reads, and do not use them to hide non-deterministic behaviour in endpoints that should have been modelled with PUT or a stable resource URI.
Define the key contract
The client generates a unique key for one logical operation and sends the same key on every retry of that operation. A version 4 UUID is a practical format because it is widely supported and needs no central coordination. A random key generated by the caller is often safer than deriving the key from request parameters, because identical parameters do not always mean the same business operation.
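On the client, the important detail is that the key is created once per logical operation, outside the retry loop. The sketch below assumes a hypothetical send callable that takes a payload and headers, and borrows the Idempotency-Key header name as a convention. Backoff is left out to keep the focus on key reuse.

```python
import uuid


def send_with_retries(send, payload, max_attempts=3):
    """Call `send` up to max_attempts times, reusing one idempotency key."""
    key = str(uuid.uuid4())  # generated once, before the first attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, headers={"Idempotency-Key": key})
        except TimeoutError as exc:
            last_error = exc  # outcome unknown, so retry with the same key
    raise last_error


# Hypothetical transport that times out once, then succeeds.
seen_keys = []


def flaky_send(payload, headers):
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) == 1:
        raise TimeoutError("no response in time")
    return {"status": "created"}


result = send_with_retries(flaky_send, {"amount": 500})
```

Both attempts carry the same key, so a server that honours the contract can treat the second attempt as a repeat of the first.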
The key needs a scope. Scoping it to the authenticated account, endpoint, and method prevents two clients from colliding by accident and prevents a key used for one operation being replayed against a different one. Store a hash of the request body alongside the key so the server can detect when the same key is reused with a different intent and reject it.
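A minimal sketch of scope and request matching, assuming the record is looked up by account, method, path, and key, with a SHA-256 digest of the canonicalised JSON body stored next to it for comparison. The function names, the KeyReuseError type, and the dictionary store are illustrative only.

```python
import hashlib
import json


class KeyReuseError(Exception):
    """The same key arrived with a different request body."""


def fingerprint(body: dict) -> str:
    # Canonical JSON so that field order does not change the digest.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_key(store: dict, account: str, method: str, path: str,
              key: str, body: dict) -> bool:
    """Return True on first use, False on a matching repeat."""
    scoped = (account, method, path, key)
    digest = fingerprint(body)
    if scoped not in store:
        store[scoped] = digest
        return True
    if store[scoped] != digest:
        raise KeyReuseError("idempotency key reused with a different request")
    return False
```

Two accounts can use the same key string without colliding, while a changed body under the same scoped key is rejected rather than silently replayed.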
Document the contract so clients can rely on it. State the maximum key length, the allowed characters, whether keys are case sensitive, the retention period, and what happens when the same key arrives with a different request body. As a real reference point, Stripe carries the key in the Idempotency-Key request header, accepts keys up to 255 characters, recommends UUIDs, and prunes keys after they are at least 24 hours old.
Store the result, not just the fact
A robust implementation records the state of each key, in flight or completed, together with the result once there is one. When the first request completes, later retries with the same key return the same outcome, or an equivalent representation of it.
Storing only that a key was seen is not enough. If the server marks a key as seen before completing the operation, every retry can be blocked while the side effect never happened. If it marks the key after the side effect but before saving the result, a retry finds the key but no result to replay, so the client still cannot tell what happened. The record should hold a canonical representation or digest of the relevant request fields, the operation state, the final response, and an expiry time.
Use a transactional boundary where possible: reserve the key, perform the side effect, and persist the result so it can be replayed safely. The key reservation must be atomic with the decision to proceed. Otherwise two concurrent attempts can both pass the check and duplicate the side effect.
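One common way to get an atomic reservation is to let a database uniqueness constraint make the decision. The sketch below uses SQLite from the Python standard library as a stand-in for a shared store. The table layout and function names are assumptions, and request digests and expiry are omitted for brevity.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency ("
        " idem_key TEXT PRIMARY KEY,"  # uniqueness is enforced by the database
        " state TEXT NOT NULL,"
        " response TEXT)"
    )
    return conn


def reserve(conn: sqlite3.Connection, key: str) -> bool:
    """Claim the key. Only the first insert for a given key succeeds."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency (idem_key, state) VALUES (?, 'in_progress')",
                (key,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone else already holds this key


def complete(conn: sqlite3.Connection, key: str, response: str) -> None:
    """Persist the final response so later retries can replay it."""
    with conn:
        conn.execute(
            "UPDATE idempotency SET state = 'completed', response = ? WHERE idem_key = ?",
            (response, key),
        )


def lookup(conn: sqlite3.Connection, key: str):
    return conn.execute(
        "SELECT state, response FROM idempotency WHERE idem_key = ?", (key,)
    ).fetchone()
```

The caller proceeds with the side effect only when reserve returns True. When the side effect is a write to the same database, doing it in the same transaction as complete keeps the result and the record consistent.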
Handle concurrent retries
Clients, load balancers, and SDKs can retry concurrently. Two requests with the same key may reach different server instances at the same time, so the idempotency store must enforce uniqueness atomically.
When one request is already in progress for a key, return a clear conflict or retryable response. Do not run the operation twice and hope downstream systems deduplicate it.
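The following sketch shows the intended behaviour under concurrency, using threads and an in-process lock as a stand-in for the atomic operation of a shared store. The Store class, the handle function, and the choice of 409 for an in-progress key are assumptions for illustration. A fuller version would replay the stored result once the first request has completed.

```python
import threading


class Store:
    """In-process stand-in for a store with an atomic reserve operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def try_reserve(self, key: str) -> bool:
        with self._lock:  # check and set happen as one step
            if key in self._state:
                return False
            self._state[key] = "in_progress"
            return True


executed = []  # side effects that actually ran


def handle(store: Store, key: str) -> int:
    if not store.try_reserve(key):
        return 409  # another attempt holds the key; the client may retry later
    executed.append(key)
    return 201


store = Store()
statuses = []


def worker():
    statuses.append(handle(store, "order-42"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Of the eight concurrent attempts, exactly one performs the side effect and the other seven receive the conflict response.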
For long running operations, return an operation resource or status handle. A retried create request can then return the same operation reference while work continues. That is clearer than pretending a long operation is always synchronous.
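As a rough sketch of the operation-handle idea, a retried start request can map the idempotency key to an existing operation instead of starting new work. The JobApi class, its method names, and the status values are hypothetical.

```python
import uuid


class JobApi:
    """Toy API that returns an operation handle for long-running work."""

    def __init__(self):
        self.operations = {}  # operation id -> status record
        self.by_key = {}      # idempotency key -> operation id

    def start_job(self, idempotency_key: str) -> dict:
        op_id = self.by_key.get(idempotency_key)
        if op_id is None:
            # First time this key is seen: create the operation.
            op_id = str(uuid.uuid4())
            self.by_key[idempotency_key] = op_id
            self.operations[op_id] = {"id": op_id, "status": "running"}
        # Retries fall through and get the existing operation back.
        return self.operations[op_id]

    def finish(self, op_id: str) -> None:
        self.operations[op_id]["status"] = "succeeded"


api = JobApi()
first = api.start_job("provision-7")
again = api.start_job("provision-7")  # same handle, no second job
```

The client can poll the returned operation for status, and a retry made after the work finishes sees the final state rather than triggering a new run.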
Choose retry rules carefully
A retry policy should define which errors are retryable, how many attempts are allowed, how long the client waits between attempts, and when the caller gives up. Common retryable cases include timeouts, connection resets, temporary unavailability signalled with 503, and explicit throttling signalled with 429. Validation errors, authentication failures, and permanent business rule failures should not be retried without a change.
Use exponential backoff with jitter for shared dependencies. Backoff reduces pressure. Jitter spreads attempts so large numbers of clients do not retry in lockstep, which is the pattern the Amazon Builders' Library recommends to avoid synchronised retry storms. Add a total deadline so the retry loop cannot outlive the user's request, the queue visibility timeout, or the business operation window. Clients should also respect Retry-After when the protocol provides it.
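A minimal sketch of bounded retries with exponential backoff, full jitter, and a total deadline. The function names, default values, and the choice of which exceptions count as transient are assumptions. The sleep and clock parameters exist only so the loop can be exercised without real waiting.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: a uniform delay between 0 and the capped exponential bound."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=5, deadline_seconds=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Retry transient errors with backoff, bounded by attempts and a deadline."""
    give_up_at = clock() + deadline_seconds
    for attempt in range(max_attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or clock() + delay > give_up_at:
                raise  # surface the last transient error to the caller
            sleep(delay)
```

Errors outside the transient set, such as a validation failure raised as ValueError, propagate immediately without a retry. The deadline check happens before sleeping, so the loop does not wait for an attempt it has no time to make.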
Servers should make retry decisions easier. Return 429 for rate limiting, 503 for temporary overload or maintenance, and a clear error body that says whether the request can be retried.
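When a server sends Retry-After with a 429 or 503, the client has to interpret it. HTTP allows the value to be either a number of seconds or an HTTP date. The helper below is a sketch under that assumption. Its name and its choice to return None for unparseable values are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value: str, now: datetime = None):
    """Convert a Retry-After header value to seconds, or None if unparseable."""
    text = value.strip()
    if text.isascii() and text.isdigit():
        return float(text)  # delay given directly in seconds
    try:
        when = parsedate_to_datetime(text)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A client would typically wait for the larger of this value and its own backoff delay, still subject to its total deadline.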
Avoid retry storms
When many callers share a dependency, retry traffic becomes part of that dependency's load. Retries need budgets, concurrency limits, and observability so operators can see when retry behaviour is making an incident worse. A policy that looks fine in unit tests can still fail under load if it multiplies traffic during an outage.
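One simple form of retry budget caps retries at a fraction of first attempts. The sketch below is deliberately naive: the RetryBudget class and its parameters are hypothetical, the counters never reset, and it is not thread-safe. A real budget would usually track a sliding time window.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of first attempts."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1  # count each first attempt

    def try_spend(self) -> bool:
        """Return True and consume budget if a retry is currently allowed."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries >= allowed:
            return False  # budget exhausted: fail fast instead of retrying
        self.retries += 1
        return True
```

During an outage most first attempts fail, so the budget runs out quickly and the client stops adding retry load to a dependency that is already struggling.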
Background jobs
Jobs should be designed as if they can run more than once. A worker can crash after performing a side effect but before acknowledging the message. A broker can redeliver. A deployment can stop a worker mid task. The right response is not to hope for exactly-once execution. It is to make the job idempotent and transactional where possible.
Use business keys and database constraints for externally visible effects. For example, record that a notification for a specific event and recipient has already been sent, or that a payment capture has already been submitted with a specific operation key.
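The notification case can be sketched with a composite primary key on event and recipient. SQLite stands in for the job's database here. The table name, the send_once function, and the outbox list that replaces a real send are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sent_notifications ("
    " event_id TEXT NOT NULL,"
    " recipient TEXT NOT NULL,"
    " PRIMARY KEY (event_id, recipient))"
)
outbox = []  # stand-in for the real delivery call


def send_once(conn: sqlite3.Connection, event_id: str, recipient: str) -> bool:
    """Send a notification unless this event and recipient were already recorded."""
    try:
        with conn:  # the insert is rolled back if the send raises
            conn.execute(
                "INSERT INTO sent_notifications (event_id, recipient) VALUES (?, ?)",
                (event_id, recipient),
            )
            outbox.append((event_id, recipient))
    except sqlite3.IntegrityError:
        return False  # duplicate delivery of the job: skip the side effect
    return True
```

The database record and an external send are not atomic with each other, so a crash between the send and the commit can still produce a duplicate. The constraint narrows that window rather than closing it.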
Idempotency is not exactly once
Idempotency keys reduce duplicate effects at the API boundary. They do not guarantee exactly-once execution across every downstream system. Message brokers, webhooks, email providers, and payment processors can still retry or duplicate work.
Downstream consumers should deduplicate using stable event IDs, job IDs, resource IDs, or provider IDs. Idempotency should be layered, not assumed to exist in one place. If the operation is high value, consider returning the created resource URI so clients can recover by lookup after a key expires.
What to measure
Measure retry counts, retry success rates, timeout rates, throttling responses, duplicate key reuse, mismatched key reuse, and dependency saturation. Log the idempotency key or a safe correlation identifier so you can trace repeated intent. Do not log secrets, payment data, or raw personal data.
Conclusion
Retries are only safe when repeated intent is part of the design. Use idempotency keys for side-effecting operations, typically POST requests, with a clear key contract, atomic storage, request matching, result replay, and a sensible retention period. Use bounded retries with backoff and jitter for transient faults, and observability that shows when retries help or harm. In distributed systems, duplicate attempts are normal. Design the operation so duplicates do not become duplicate business effects.
