Atomic Redis counters and the race condition that took me three tries

I had a rate limiter. It was supposed to cap each user at 100 emails per hour. It capped most users at 100. A few got to send 101. Some got to send 105.

The bug: INCR was atomic. My code wasn't.

Take 1: the naive version

async function tryConsume(userId: string) {
  const key = `quota:${userId}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 3600);
  return count <= 100;
}

Looks reasonable. Two workers running this for the same user at count = 99 will see INCR return 100 and 101 respectively. Only the first one passes the check. That part's fine.

The overshoot came from a different place: the rollback.

Take 2: INCR with rollback on failure

I added rollback logic: a DECR whenever a request didn't end in a sent email. Sometimes the send itself failed (SMTP rejection), and I didn't want a failed send to count against quota. Requests over the limit got the same treatment:

async function tryConsume(userId: string) {
  const key = `quota:${userId}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 3600);  // same hourly window as take 1
  if (count > 100) {
    await redis.decr(key);  // ← the bug lives here
    return false;
  }
  return true;
}

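Imagine a worker dies between the INCR and the DECR. The counter sticks at 101 even though no email was sent. Or two workers hit the limit together: INCR → 101, INCR → 102, DECR → 101, DECR → 100. The counter ends up correct, but while those rollbacks are in flight it reads higher than the number of emails actually sent.

One interleaving does produce an overshoot, and it involves the key's one-hour TTL. If the key expires between a worker's INCR and its DECR, the DECR lands on a missing key. Redis treats a missing key as 0, so the counter is recreated at -1. The next window then admits 101 emails. Five late rollbacks in that gap and it admits 105.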

The fundamental problem: check-and-increment is two operations. Anything you do between them is a race.
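
To make one such race concrete, here is a toy in-memory model of a rollback that lands after the key has expired. It is a sketch for reasoning, not the production code: `FakeRedis`, `expireNow`, and `simulateExpiryRace` are hypothetical names, and the model ignores TTLs on the recreated key. It relies on one documented Redis behavior, that INCR and DECR treat a missing key as 0.

```typescript
// Toy in-memory stand-in for the Redis commands involved (hypothetical, not a real client).
class FakeRedis {
  private data = new Map<string, number>();

  // Like Redis INCR: a missing key counts as 0 before the increment.
  incr(key: string): number {
    const next = (this.data.get(key) ?? 0) + 1;
    this.data.set(key, next);
    return next;
  }

  // Like Redis DECR: a missing key counts as 0, so the result is -1.
  decr(key: string): number {
    const next = (this.data.get(key) ?? 0) - 1;
    this.data.set(key, next);
    return next;
  }

  // Simulates the TTL firing: the key disappears.
  expireNow(key: string): void {
    this.data.delete(key);
  }
}

const LIMIT = 100;

// Returns how many sends the next window admits when `rejectedInFlight` workers
// have done their INCR but not yet their DECR at the moment the key expires.
function simulateExpiryRace(rejectedInFlight: number): number {
  const redis = new FakeRedis();
  const key = "quota:user-1";

  for (let i = 0; i < LIMIT; i++) redis.incr(key);            // window is full
  for (let i = 0; i < rejectedInFlight; i++) redis.incr(key); // over-limit INCRs
  redis.expireNow(key);                                       // TTL fires
  for (let i = 0; i < rejectedInFlight; i++) redis.decr(key); // rollbacks land late

  // Next window: count the requests that pass the `count <= 100` check.
  let admitted = 0;
  while (redis.incr(key) <= LIMIT) admitted++;
  return admitted;
}

console.log(simulateExpiryRace(0), simulateExpiryRace(1), simulateExpiryRace(5));
```

Run as written, this prints `100 101 105`: each late rollback pushes the new window's starting point one below zero, and that buys one extra send.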

Take 3: Lua

Redis ships with EVAL, which runs a Lua script atomically. The whole script executes as one operation. No interleaving.

const consumeScript = `
  local key = KEYS[1]
  local limit = tonumber(ARGV[1])
  local ttl = tonumber(ARGV[2])
  local current = tonumber(redis.call('GET', key) or '0')
  if current >= limit then
    return 0
  end
  redis.call('INCR', key)
  if current == 0 then
    redis.call('EXPIRE', key, ttl)
  end
  return 1
`;

async function tryConsume(userId: string) {
  const allowed = await redis.eval(
    consumeScript,
    1,                            // KEYS count
    `quota:${userId}`,            // KEYS[1]
    "100",                         // ARGV[1]
    "3600"                         // ARGV[2]
  );
  return allowed === 1;
}

Read, check, increment. All inside one atomic Lua call. Concurrent workers can't interleave because Redis runs the script start-to-finish before doing anything else for any client.
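
The script's branch logic can also be restated as a pure TypeScript function, which makes each case easy to check without a Redis server. This is only a model for reasoning (`Decision` and `decide` are hypothetical names); it has none of the atomicity that the Lua version gets from running inside Redis.

```typescript
// What the script does for a given stored value (hypothetical model, not a Redis API).
interface Decision {
  allowed: boolean; // the script's return value (1 or 0) as a boolean
  next: number;     // counter value after the script runs
  setTtl: boolean;  // whether the script issues EXPIRE
}

function decide(stored: string | null, limit: number): Decision {
  // A missing key falls back to '0', mirroring the `or '0'` in the script.
  const current = Number(stored ?? "0");

  // At or over the limit: return 0 and leave the counter untouched.
  if (current >= limit) {
    return { allowed: false, next: current, setTtl: false };
  }

  // Under the limit: INCR, and start the window's TTL on the first request.
  return { allowed: true, next: current + 1, setTtl: current === 0 };
}

console.log(decide(null, 100));  // first request of a window
console.log(decide("99", 100));  // last allowed request
console.log(decide("100", 100)); // rejected request
```

The rejected branch never changes the counter, so there is nothing to roll back and no DECR to race against.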

What I should have done first

I should have written the test case before writing the production code. Two workers, same user, both calling tryConsume from Promise.all. The race would have shown up the first time. Instead I shipped, monitored, got pinged, and rebuilt twice.
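
A minimal sketch of that kind of test, assuming `tryConsume` has the signature used in this post. `countAllowed`, `makeRacyLimiter`, and `makeAtomicLimiter` are hypothetical helpers, and the two limiters are in-memory stand-ins so the example runs without Redis. Against the real thing, you would pass the production `tryConsume` and a test Redis instance instead.

```typescript
type TryConsume = (userId: string) => Promise<boolean>;

// Fires `calls` concurrent requests for one user and counts how many were allowed.
async function countAllowed(
  tryConsume: TryConsume,
  userId: string,
  calls: number,
): Promise<number> {
  const results = await Promise.all(
    Array.from({ length: calls }, () => tryConsume(userId)),
  );
  return results.filter(Boolean).length;
}

// Stand-in with a deliberate race: read, yield, then check and write.
function makeRacyLimiter(limit: number): TryConsume {
  const counts = new Map<string, number>();
  return async (userId) => {
    const current = counts.get(userId) ?? 0;
    await Promise.resolve(); // stands in for a network round trip
    if (current >= limit) return false;
    counts.set(userId, (counts.get(userId) ?? 0) + 1);
    return true;
  };
}

// Stand-in with no yield between read and write, so calls cannot interleave.
function makeAtomicLimiter(limit: number): TryConsume {
  const counts = new Map<string, number>();
  return async (userId) => {
    const current = counts.get(userId) ?? 0;
    if (current >= limit) return false;
    counts.set(userId, current + 1);
    return true;
  };
}

countAllowed(makeRacyLimiter(100), "user-1", 150).then((n) =>
  console.log("racy limiter allowed", n),
);
countAllowed(makeAtomicLimiter(100), "user-1", 150).then((n) =>
  console.log("atomic limiter allowed", n),
);
```

With 150 concurrent calls, the racy stand-in lets all 150 through because every call reads the counter before any call writes it. The atomic one stops at 100. The assertion worth keeping in a real test suite is simply that the allowed count never exceeds the limit.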

There's a version of me that would have shipped take 1, hit the bug, switched to take 2, hit the second bug, and ended up writing this post anyway. I just hope you can skip a step.