Atomic Redis counters and the race condition that took me three tries
I had a rate limiter. It was supposed to cap each user at 100 emails per hour. It capped most users at 100. A few got to send 101. Some got to send 105.
The bug: INCR was atomic. My code wasn't.
Take 1: the naive version
async function tryConsume(userId: string) {
  const key = `quota:${userId}`;
  // INCR is atomic: every caller gets back a distinct value.
  const count = await redis.incr(key);
  // The first hit in a window starts the one-hour clock.
  if (count === 1) await redis.expire(key, 3600);
  return count <= 100;
}
Looks reasonable. Two workers running this for the same user at count = 99 will see INCR return 100 and 101 respectively. Only the first one passes the check. That part's fine.
The overshoot came from a different place: the rollback.
Take 2: INCR with rollback on failure
I added rollback logic because sometimes the send itself failed (SMTP rejection), and I didn't want a failed send to count against quota. The snippet shows the same rollback applied to over-limit rejections:
async function tryConsume(userId: string) {
  const key = `quota:${userId}`;
  const count = await redis.incr(key);
  if (count > 100) {
    await redis.decr(key); // ← the bug lives here
    return false;
  }
  return true;
}
Imagine a worker dies between the INCR and the DECR. The counter sticks at 101 even though no email was sent. Or two over-limit workers overlap: starting from 100, the counter goes INCR → 101, INCR → 102, DECR → 101, DECR → 100. It lands on the right number, but in between it counted two emails that were never sent. Worse, if the key expires between a worker's INCR and its DECR, the DECR recreates the key at -1 with no TTL, and the user gets one extra send before hitting the limit.
The fundamental problem: check-and-increment is two operations. Anything you do between them is a race.
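Here is one concrete way the rollback can turn into an overshoot, assuming the quota key carries a TTL as in take 1. Redis treats a missing key as 0 for both INCR and DECR, so a DECR that lands after the key has expired creates it at -1. The sketch below models that with a hypothetical in-memory `FakeCounterStore` (not a real client) and a hand-picked interleaving:

```typescript
// Hypothetical in-memory stand-in for the Redis commands involved.
// Like Redis, incr/decr treat a missing key as 0.
class FakeCounterStore {
  private values: Record<string, number> = {};

  incr(key: string): number {
    const next = (this.values[key] ?? 0) + 1;
    this.values[key] = next;
    return next;
  }

  decr(key: string): number {
    const next = (this.values[key] ?? 0) - 1;
    this.values[key] = next;
    return next;
  }

  // Simulates the TTL firing: the key simply disappears.
  expireNow(key: string): void {
    delete this.values[key];
  }
}

const LIMIT = 100;
const fakeStore = new FakeCounterStore();
const quotaKey = "quota:alice";

// Fill the window: 100 admitted sends.
for (let i = 0; i < LIMIT; i++) fakeStore.incr(quotaKey);

// An over-limit worker increments and is about to roll back...
const seen = fakeStore.incr(quotaKey); // 101, rejected
// ...but the hour ends and the key expires before its DECR lands.
fakeStore.expireNow(quotaKey);
const afterRollback = fakeStore.decr(quotaKey); // key is recreated at -1

// Next window: run the take 2 logic and count how many attempts get through.
let admitted = 0;
for (let i = 0; i < 150; i++) {
  const count = fakeStore.incr(quotaKey);
  if (count > LIMIT) {
    fakeStore.decr(quotaKey);
  } else {
    admitted++;
  }
}

console.log({ seen, afterRollback, admitted });
// → { seen: 101, afterRollback: -1, admitted: 101 }
```

Each late rollback pushes the next window's starting point one step further below zero, so several of them in a row add up to several extra sends.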
Take 3: Lua
Redis ships with EVAL, which runs a Lua script atomically. The whole script executes as one operation. No interleaving.
const consumeScript = `
  local key = KEYS[1]
  local limit = tonumber(ARGV[1])
  local ttl = tonumber(ARGV[2])
  -- GET returns false for a missing key, so fall back to '0'
  local current = tonumber(redis.call('GET', key) or '0')
  if current >= limit then
    return 0 -- over quota: leave the counter untouched
  end
  redis.call('INCR', key)
  if current == 0 then
    redis.call('EXPIRE', key, ttl) -- first hit starts the window
  end
  return 1
`;
async function tryConsume(userId: string) {
  const allowed = await redis.eval(
    consumeScript,
    1, // KEYS count
    `quota:${userId}`, // KEYS[1]
    "100", // ARGV[1]
    "3600" // ARGV[2]
  );
  return allowed === 1;
}
Read, check, increment. All inside one atomic Lua call. Concurrent workers can't interleave because Redis runs the script start-to-finish before doing anything else for any client.
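To see why no rollback is needed with this version, it can help to trace the script's logic outside Redis. The sketch below is a plain TypeScript mirror of the Lua script, run against a hypothetical in-memory record instead of a real server. `consumeModel`, `FakeKey`, and `fakeKeys` are illustrative names, not part of any Redis client:

```typescript
// One key's state in the hypothetical in-memory model: value plus TTL in seconds.
interface FakeKey {
  value: number;
  ttl: number | null;
}

const fakeKeys: Record<string, FakeKey | undefined> = {};

// Mirrors the Lua script step by step. This function is synchronous, so
// nothing can run between the read and the write, which is the property
// Redis provides for a script run via EVAL.
function consumeModel(key: string, limit: number, ttl: number): 0 | 1 {
  const current = fakeKeys[key]?.value ?? 0; // GET key, defaulting to 0
  if (current >= limit) {
    return 0; // rejected: the counter is never touched
  }
  const entry = fakeKeys[key] ?? { value: 0, ttl: null };
  entry.value += 1; // INCR key
  if (current === 0) {
    entry.ttl = ttl; // EXPIRE key ttl, only on the first hit of the window
  }
  fakeKeys[key] = entry;
  return 1;
}

let admittedCount = 0;
for (let i = 0; i < 150; i++) {
  admittedCount += consumeModel("quota:alice", 100, 3600);
}

console.log(admittedCount, fakeKeys["quota:alice"]);
// → 100 { value: 100, ttl: 3600 }
```

A rejected attempt returns before any write, so the counter never climbs past the limit and there is nothing to undo afterwards.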
What I should have done first
I should have written the test case before writing the production code. Two workers, same user, both calling tryConsume from Promise.all. The race would have shown up the first time. Instead I shipped, monitored, got pinged, and rebuilt twice.
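A minimal sketch of such a test, assuming an in-memory fake in place of Redis. `FakeRedis`, `consumeSplit`, `consumeAtomic`, and `countAdmitted` are hypothetical names for illustration. The fake awaits once on every call so that concurrent callers can interleave between calls, roughly the way separate workers would. `consumeSplit` is a deliberately broken check-then-increment, not one of the takes above, included only to show the test catching a race:

```typescript
// Hypothetical in-memory stand-in for a Redis client. Each call awaits once
// before touching state, so concurrent callers can interleave between calls.
class FakeRedis {
  private values: Record<string, number> = {};

  private tick(): Promise<void> {
    return Promise.resolve();
  }

  async get(key: string): Promise<number> {
    await this.tick();
    return this.values[key] ?? 0;
  }

  async incr(key: string): Promise<number> {
    await this.tick();
    const next = (this.values[key] ?? 0) + 1;
    this.values[key] = next;
    return next;
  }

  // Stands in for an EVAL'd script: read, check, and increment in one call.
  async consume(key: string, limit: number): Promise<boolean> {
    await this.tick();
    const current = this.values[key] ?? 0;
    if (current >= limit) return false;
    this.values[key] = current + 1;
    return true;
  }
}

type Consume = (redis: FakeRedis, userId: string) => Promise<boolean>;

// Deliberately racy: the check and the increment are separate round trips.
const consumeSplit: Consume = async (redis, userId) => {
  const key = `quota:${userId}`;
  if ((await redis.get(key)) >= 100) return false;
  await redis.incr(key);
  return true;
};

// Atomic: one call does the whole read-check-increment.
const consumeAtomic: Consume = (redis, userId) =>
  redis.consume(`quota:${userId}`, 100);

// Fires `attempts` concurrent calls for one user and counts the admitted ones.
async function countAdmitted(consume: Consume, attempts: number): Promise<number> {
  const redis = new FakeRedis();
  const results = await Promise.all(
    Array.from({ length: attempts }, () => consume(redis, "alice"))
  );
  return results.filter(Boolean).length;
}

async function main(): Promise<void> {
  console.log("split:", await countAdmitted(consumeSplit, 150));
  console.log("atomic:", await countAdmitted(consumeAtomic, 150));
}

main();
```

With 150 concurrent attempts, the split version admits more than 100 and the atomic one admits exactly 100. An in-memory fake only approximates real network timing, so a test like this is a cheap first check rather than proof that the production path is race-free.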
There's a version of me that would have shipped take 1, hit the bug, switched to take 2, hit the second bug, and ended up writing this post anyway. I just hope you can skip a step.