Idempotency Is Not Just a Header

Idempotency sounds boring until the first duplicate order shows up.

For a long time, my mental model was simple: if an API consumer sends an Idempotency-Key header, we store it somewhere and make sure the same request does not create the same thing twice.

That is not completely wrong, but it is incomplete.

The header is the easy part. The hard part is deciding what the key is allowed to mean, what it is bound to, what happens when the payload is slightly different, and what the system should do when two identical requests race each other.

This came up while working on an order creation API. Think of something like:

POST /orders

Assume that the above endpoint creates a trading order in a brokerage system. If the client times out after sending the request, they might retry. If our system already created the order but the response got lost somewhere between us and the client, retrying should not create another order.

In payments, brokerage, billing and most money-adjacent systems, “try again” cannot secretly mean “do it again”.

The retry problem

A normal POST endpoint is not safe to retry. For our brokerage service, let’s assume the client sends:

POST /orders
Idempotency-Key: retry-abc

{
  "instrument": "US0378331005",
  "side": "buy",
  "amount": "100.00",
  "currency": "EUR"
}

The server creates an order and returns 201 Created.

Now imagine the response never reaches the client. Maybe a load balancer cuts the connection. Maybe the client times out after 2 seconds. Maybe mobile internet does mobile internet things.

The client does the reasonable thing and retries the same request with the same key. There are two possible worlds:

  • The first request never reached us, so we should create the order now.
  • The first request did reach us and created the order, so we should return that order again.

The client cannot know which world it is in.

That is the whole point of idempotency: move the ambiguity to the server, because the server is the only place that can resolve it correctly.
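From the client's side, the only rule is to generate the key once and reuse it on every retry. Here is a minimal sketch of that, with an in-memory stand-in for the server that loses its first response. `FlakyServer`, `post_order` and `create_order_with_retries` are hypothetical names made up for this illustration, not part of any real client library.

```python
import uuid


class FlakyServer:
    """Hypothetical in-memory stand-in for POST /orders that loses its first response."""

    def __init__(self):
        self.orders_by_key = {}
        self.calls = 0

    def post_order(self, idempotency_key, body):
        self.calls += 1
        # Create the order only if this key has not produced one yet.
        if idempotency_key not in self.orders_by_key:
            order_id = f"ord_{len(self.orders_by_key) + 1}"
            self.orders_by_key[idempotency_key] = {"id": order_id, **body}
        if self.calls == 1:
            # The order exists, but the client never sees the response.
            raise TimeoutError("response lost")
        return self.orders_by_key[idempotency_key]


def create_order_with_retries(server, body, max_attempts=3):
    # Generate the key once, outside the loop, so every retry reuses it.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return server.post_order(key, body)
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


server = FlakyServer()
order = create_order_with_retries(server, {"side": "buy", "amount": "100.00"})
```

The client called the server twice and cannot tell which call created the order. It does not need to: the server holds exactly one order for that key.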

The easy version

The first version people usually reach for is this:

CREATE TABLE idempotency_keys (
  id TEXT PRIMARY KEY,
  idempotency_key TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ NOT NULL
);

then in the endpoint:

if key already exists:
  return existing order

create order
store key
return order

This looks fine for about five minutes (or thirty, if you’re taking a smoking break!).

Then the questions start.

What if two different clients use the same key? What if the same client uses the same key on two different endpoints? What if the first payload says amount = 100.00, and the retry says amount = 50.00? What if two identical requests arrive at the same time and both pass the “does key exist?” check before either one writes the row?

The key alone is not enough information.

An idempotency key is not globally meaningful. It is meaningful inside a scope, for a method, for a path, and for one exact request shape.

The behavior I care about

For an order creation endpoint, I want these semantics:

  • Same client, same key, same endpoint, same payload: return the original order.
  • Same client, same key, different payload: reject it.
  • Same client, same key, different endpoint: reject it.
  • Different client, same key: unrelated.
  • Concurrent retries: exactly one request wins, and the others return the winner’s result.
  • Validation failure: do not burn the idempotency key.

The last one matters more than it first appears.

If a client sends an invalid request with an idempotency key, and we store the key before validation, they are stuck. They fix the payload and retry, but now we say “this key was already used with a different payload”. IMO, that is a bad API experience.

I only want to store the key once the domain operation actually succeeds.

Give the key a memory

A more useful table looks like this:

CREATE TABLE request_replays (
  id TEXT PRIMARY KEY,
  client_id TEXT NOT NULL,
  request_key TEXT NOT NULL,
  method TEXT NOT NULL,
  path TEXT NOT NULL,
  request_hash TEXT NOT NULL,
  response_body JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at TIMESTAMPTZ NOT NULL,

  CONSTRAINT uq_request_replays_client_key
    UNIQUE (client_id, request_key)
);

The names are not important; the shape is.

  • client_id scopes the key. Two clients can both send abc123 and we treat those as separate keys.
  • method and path prevent accidental reuse across endpoints. A retry key used for POST /orders should not later work for POST /withdrawals.
  • request_hash binds the key to the payload. Same key with a different body is not a retry. It is a different request trying to reuse the same retry token.
  • response_body stores the domain object or response we want to return on replay. For order creation, this can be the order we created the first time.
  • expires_at keeps the table finite. Idempotency is usually a retry window, not an eternal contract. A 24-hour window is common enough for APIs like this, though the exact value depends on the product.
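To see the scoping in action, here is a small sketch that runs the same table shape on an in-memory SQLite database. This is an adaptation, not the production schema: SQLite has no JSONB or TIMESTAMPTZ, so those columns are plain TEXT here, and `insert_replay` is a hypothetical helper with placeholder values for the columns that do not matter to the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE request_replays (
      id TEXT PRIMARY KEY,
      client_id TEXT NOT NULL,
      request_key TEXT NOT NULL,
      method TEXT NOT NULL,
      path TEXT NOT NULL,
      request_hash TEXT NOT NULL,
      response_body TEXT NOT NULL,
      created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
      expires_at TEXT NOT NULL,
      CONSTRAINT uq_request_replays_client_key UNIQUE (client_id, request_key)
    )
""")


def insert_replay(row_id, client_id, request_key):
    # Placeholder method/path/hash/body; the row expires 24 hours from now.
    conn.execute(
        "INSERT INTO request_replays "
        "(id, client_id, request_key, method, path, request_hash, response_body, expires_at) "
        "VALUES (?, ?, ?, 'POST', '/orders', 'hash', '{}', datetime('now', '+24 hours'))",
        (row_id, client_id, request_key),
    )


insert_replay("r1", "cli_123", "abc123")
insert_replay("r2", "cli_456", "abc123")  # different client, same key: allowed

try:
    insert_replay("r3", "cli_123", "abc123")  # same client, same key: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A periodic cleanup job could purge rows whose retry window has passed.
purged = conn.execute(
    "DELETE FROM request_replays WHERE expires_at <= datetime('now')"
).rowcount
```

Nothing is purged here because both rows were just created with a 24-hour window.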

Fingerprinting the body

The request hash should be boring and deterministic. The mistake is hashing raw JSON bytes. These two payloads are the same request:

{ "side": "buy", "amount": "100.00" }

{ "amount": "100.00", "side": "buy" }

Raw string hashing would treat them differently because the key order changed. So I prefer a canonical representation:

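import hashlib
import json

def fingerprint(body):
    normalized = remove_nulls(body)  # helper: recursively drops null-valued keys
    # Sorted keys and compact separators give one canonical string per payload.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()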

There are a few choices hidden in there. Sorting keys makes object order irrelevant. Removing nulls makes these two equivalent:

{ "amount": "100.00", "limit_price": null }
{ "amount": "100.00" }

That matches how many JSON APIs treat optional fields. If our API treats explicit null differently from missing, then we should not remove nulls. The point is not that null stripping is always right. The point is that the rule should be explicit.

List order still matters. These are not the same:

{ "legs": ["buy", "sell"] }
{ "legs": ["sell", "buy"] }

That is usually what we want.
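The three rules above (key order ignored, nulls stripped, list order kept) are easy to check with a self-contained version. This is a sketch under the null-stripping assumption discussed above, and `remove_nulls` is one possible implementation of that helper, not the only reasonable one.

```python
import hashlib
import json


def remove_nulls(value):
    """Recursively drop object keys whose value is None; list items are kept as-is."""
    if isinstance(value, dict):
        return {k: remove_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [remove_nulls(v) for v in value]
    return value


def fingerprint(body):
    # Sorted keys and compact separators give one canonical string per payload.
    canonical = json.dumps(remove_nulls(body), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


same_despite_key_order = (
    fingerprint({"side": "buy", "amount": "100.00"})
    == fingerprint({"amount": "100.00", "side": "buy"})
)
same_despite_null = (
    fingerprint({"amount": "100.00", "limit_price": None})
    == fingerprint({"amount": "100.00"})
)
differs_on_list_order = (
    fingerprint({"legs": ["buy", "sell"]}) != fingerprint({"legs": ["sell", "buy"]})
)
```

Note that this only normalizes structure. The strings "100.00" and "100.0" still hash differently, so if the API accepts both spellings of an amount, that normalization has to be its own explicit rule.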

First request

On the first valid request, the flow is roughly:

read idempotency key from header
compute request fingerprint

validate request
create order
store replay record with the order response

commit
return order

The storage of the replay record must happen in the same transaction as the order creation.

This is the part that is easy to get subtly wrong. If we create the order and then fail to store the replay record, the client can retry and create a second order because the retry key is not known. If we store the replay record and then fail to create the order, the client can retry and get a fake “already processed” response for something that never happened.

Both rows need to commit together or neither should.
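Here is a minimal sketch of that all-or-nothing behavior, using SQLite only because it runs anywhere. The table shapes are trimmed down, and `create_order` with its `fail_replay_insert` flag is a hypothetical function that exists purely to simulate a crash between the two writes.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE request_replays ("
    " client_id TEXT NOT NULL, request_key TEXT NOT NULL,"
    " request_hash TEXT NOT NULL, response_body TEXT NOT NULL,"
    " UNIQUE (client_id, request_key))"
)


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def create_order(client_id, request_key, request_hash, order, fail_replay_insert=False):
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the order and the replay row land together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, body) VALUES (?, ?)",
            (order["id"], json.dumps(order)),
        )
        if fail_replay_insert:
            raise RuntimeError("simulated crash before the replay row is written")
        conn.execute(
            "INSERT INTO request_replays VALUES (?, ?, ?, ?)",
            (client_id, request_key, request_hash, json.dumps(order)),
        )
    return order


try:
    create_order("cli_123", "retry-abc", "h1", {"id": "ord_1"}, fail_replay_insert=True)
except RuntimeError:
    pass
orders_after_failure = count("orders")  # the order insert was rolled back

create_order("cli_123", "retry-abc", "h1", {"id": "ord_1"})
```

After the simulated crash there is no orphaned order, so the retry with the same key is free to create it properly.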

Replay

When the same request comes back later:

find replay record by (client_id, request_key)

if not found:
  continue as a new request

if method or path differs:
  reject

if request_hash differs:
  reject

return stored response_body

This is why storing the response is useful. We do not have to reconstruct the response from current database state.

That matters if the order changed after creation. Maybe it moved from new to accepted. Maybe a workflow enriched it. Maybe the response format has fields that were only true at creation time.

A replay of the create request should answer what the first successful request created, not what the order looks like at the time of the retry.

There are cases where returning current state is acceptable, but it should be a conscious product decision. I tend to prefer returning the original creation response because it gives the cleanest retry semantics.
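The replay rules fit in one small pure function. This is a sketch: `ReplayRecord`, `IdempotencyConflict` and `resolve_replay` are hypothetical names, and the wording of the endpoint-mismatch message is my own assumption.

```python
from dataclasses import dataclass


class IdempotencyConflict(Exception):
    """Raised when a key is reused for something that is not a true retry."""


@dataclass(frozen=True)
class ReplayRecord:
    method: str
    path: str
    request_hash: str
    response_body: dict


def resolve_replay(record, method, path, request_hash):
    """Return the stored response for a true retry, or None for a new request."""
    if record is None:
        return None  # unknown key: continue as a new request
    if record.method != method or record.path != path:
        raise IdempotencyConflict("Idempotency key already used with a different endpoint")
    if record.request_hash != request_hash:
        raise IdempotencyConflict("Idempotency key already used with a different request payload")
    # Return what was stored at creation time, not the order's current state.
    return record.response_body


stored = ReplayRecord("POST", "/orders", "h1", {"id": "ord_1", "status": "new"})
replayed = resolve_replay(stored, "POST", "/orders", "h1")
```

Keeping this free of database calls makes it easy to reuse in both the normal replay path and the race-loser path.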

When two requests race

The annoying case is two identical requests arriving at the same time. Both have:

client_id = cli_123
request_key = retry-abc
request_hash = same hash

Request A starts creating the order. Request B starts creating the order too.

If the application only does “check then insert”, both can pass the check before either one writes the replay row.
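The interleaving can be reproduced without threads by running the two steps by hand. This is a toy sketch with a hypothetical `NaiveHandler` that keeps everything in memory and has no constraint backing its check.

```python
class NaiveHandler:
    """Check-then-insert with nothing enforcing uniqueness underneath."""

    def __init__(self):
        self.seen_keys = {}
        self.orders = []

    def key_exists(self, key):
        return key in self.seen_keys

    def create_and_store(self, key, body):
        order = {"id": f"ord_{len(self.orders) + 1}", **body}
        self.orders.append(order)
        self.seen_keys[key] = order
        return order


handler = NaiveHandler()
body = {"side": "buy", "amount": "100.00"}

# Both requests run the check before either one writes.
a_sees_key = handler.key_exists("retry-abc")
b_sees_key = handler.key_exists("retry-abc")

if not a_sees_key:
    handler.create_and_store("retry-abc", body)
if not b_sees_key:
    handler.create_and_store("retry-abc", body)

duplicate_orders = len(handler.orders)
```

Both checks came back negative, so the same key produced two orders.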

This is why the database has to be part of the design. The unique constraint on (client_id, request_key) is the lock. One request wins and inserts the replay row. The other request hits the unique constraint.

When that happens, the loser should not crash with a database error. It should read the winning row and apply the same replay rules:

try:
  create order
  insert replay row
  commit
  return order

except unique_violation on (client_id, request_key):
  roll back, so the loser's order is discarded
  winner = read replay row from primary database

  if winner.path != path or winner.method != method:
    reject

  if winner.request_hash != request_hash:
    reject

  return winner.response_body

Two small details matter here.

  • First, only catch the unique constraint you expect. If the order insert fails because of some unrelated database constraint, that is not an idempotency replay. Bubble it up.
  • Second, read the winning row from the primary database, not a read replica. During a race, the winning row may not be visible on a replica yet. This is exactly the wrong moment to ask an eventually consistent copy what happened. (And don’t ask me how I know this.)
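The loser's path can be sketched end to end with SQLite standing in for the primary database. Two things here are assumptions of the sketch: `is_replay_key_violation` matches on SQLite's error text, whereas other drivers expose the violated constraint differently, and calling `create_order` twice in sequence only imitates what the losing request sees in a real race.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE request_replays ("
    " client_id TEXT NOT NULL, request_key TEXT NOT NULL, method TEXT NOT NULL,"
    " path TEXT NOT NULL, request_hash TEXT NOT NULL, response_body TEXT NOT NULL,"
    " UNIQUE (client_id, request_key))"
)


class IdempotencyConflict(Exception):
    pass


def is_replay_key_violation(exc):
    # SQLite names the violated columns in the error message.
    return "request_replays.client_id, request_replays.request_key" in str(exc)


def create_order(client_id, key, method, path, request_hash, order):
    try:
        with conn:  # commits on success; on error rolls back, including the order
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], json.dumps(order)))
            conn.execute(
                "INSERT INTO request_replays VALUES (?, ?, ?, ?, ?, ?)",
                (client_id, key, method, path, request_hash, json.dumps(order)),
            )
        return order
    except sqlite3.IntegrityError as exc:
        if not is_replay_key_violation(exc):
            raise  # unrelated constraint: not an idempotency replay
    # We lost the race: read the winner's row and apply the replay rules.
    w_method, w_path, w_hash, w_body = conn.execute(
        "SELECT method, path, request_hash, response_body FROM request_replays"
        " WHERE client_id = ? AND request_key = ?",
        (client_id, key),
    ).fetchone()
    if (w_method, w_path) != (method, path) or w_hash != request_hash:
        raise IdempotencyConflict("Idempotency key already used with a different request")
    return json.loads(w_body)


first = create_order("cli_123", "retry-abc", "POST", "/orders", "h1", {"id": "ord_1"})
second = create_order("cli_123", "retry-abc", "POST", "/orders", "h1", {"id": "ord_2"})
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The second call tried to insert its own order, hit the constraint, rolled back, and returned the first call's order instead. Only one order exists afterwards.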

Failed requests

I do not store failed validation responses.

If the request fails because the account is closed, the market is closed, the instrument is not tradable, or the amount is invalid, the key remains unused. The client can fix the request and try again with the same key.

There is a different design where you store every response, including 4xx errors. It is a valid choice, especially if you want “same key always gives same response” at the HTTP layer.

For internal product APIs, I often prefer storing only successful domain creation. It keeps validation errors from poisoning a key. But the decision needs to be explicit, because clients will build retry logic around whatever behavior you expose.
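A toy version of the "validation does not burn the key" behavior looks like this. It is deliberately simplified: everything is in memory, the payload hash check is left out, and `handle_create_order` and `validate` are hypothetical names with a single made-up rule.

```python
from decimal import Decimal


class ValidationError(Exception):
    pass


replays = {}  # (client_id, request_key) -> stored response
orders = []


def validate(body):
    # Hypothetical rule for the sketch: the amount must be positive.
    if Decimal(body["amount"]) <= 0:
        raise ValidationError("amount must be positive")


def handle_create_order(client_id, request_key, body):
    scoped_key = (client_id, request_key)
    if scoped_key in replays:
        return replays[scoped_key]
    validate(body)  # raises before anything is stored
    order = {"id": f"ord_{len(orders) + 1}", **body}
    orders.append(order)
    replays[scoped_key] = order  # recorded only after the order exists
    return order


try:
    handle_create_order("cli_123", "retry-abc", {"side": "buy", "amount": "-5.00"})
except ValidationError:
    pass
key_burned = ("cli_123", "retry-abc") in replays

# The client fixes the payload and reuses the same key.
fixed = handle_create_order("cli_123", "retry-abc", {"side": "buy", "amount": "100.00"})
```

The failed attempt leaves no trace, so the corrected request with the same key goes through as if it were the first.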

Status codes

I do not care too much whether payload mismatch returns 409 Conflict or 422 Unprocessable Entity. Both can be defended.

  • 409 says the key conflicts with an existing operation.
  • 422 says the request cannot be processed because the key was already used with a different payload.

Pick one and make the error message boring:

Idempotency key already used with a different request payload

The message should tell the client exactly what they did wrong. This is not the place for clever API prose.

Easy mistakes

The design is small, but a few things are easy to miss.

  • Do not make the key global. Scope it by client or tenant.
  • Do not compare raw request bodies. Canonicalize first.
  • Do not store the replay row outside the domain transaction.
  • Do not catch every database integrity error and call it an idempotency conflict.
  • Do not read the winning row from a replica during a race.
  • Do not store the key before validation unless you intentionally want failed attempts to consume keys.
  • Do not keep replay rows forever unless there is a real reason. They are operational retry state, not usually business records.

None of this is complicated by itself. The bugs come from treating idempotency as middleware instead of part of the write model. I have learnt some of these the hard way.

The shape I like

I like to think that the useful question about idempotency is not “did the client send a header?” but rather “for this client, did this exact request already produce a successful result?”

If yes, return that result. If no, try to produce it once. If another request is producing it at the same time, let the database pick the winner and make everyone else read from that winner.

That is the whole pattern.

The header is just the handle. The real work is the contract around it.