Token Bucket
Token Bucket#
Window-based algorithms control average rate but can't control instantaneous bursts. Token Bucket solves this with two separate knobs — one for burst size, one for sustained rate.
The Problem It Solves#
With Sliding Window Counter, the limit is say 60 requests per minute. A user can send all 60 in the first second of the window — all allowed. Then silence for 59 seconds. Then another 60 in the first second of the next window.
second 0 : 60 requests → all allowed (window just reset)
second 1-59: silence
second 60 : 60 requests → all allowed (new window)
The backend sees 120 requests in 2 seconds. The rate limiter said "60 per minute" but the backend got hammered with a spike it wasn't designed for.
This creates two problems:
Backend gets overwhelmed — a backend designed for 1 req/sec average suddenly absorbs 60x load for 1 second. Circuit breakers trip, latency spikes, cascading failures start.
Bad user experience — user sends 60 requests at second 0, all go through. Sends 1 more at second 30 — blocked. They've waited 30 full seconds but the window hasn't reset yet. From their perspective the rate limiter is broken.
Window-based algorithms control average rate but not instantaneous burst. Token Bucket controls both.
The Core Idea#
Imagine a bucket that holds tokens. The bucket has a maximum capacity — say 20 tokens. Tokens refill at a steady rate — say 5 tokens per second.
Every incoming request consumes 1 token. If the bucket has tokens — allow, decrement. If the bucket is empty — block.
Bucket capacity : 20 tokens (maximum burst a user can ever send)
Refill rate : 5 tokens/sec (sustained average rate over time)
User arrives, bucket full (20 tokens):
Request 1 → bucket = 19 → allow
Request 2 → bucket = 18 → allow
...
Request 20 → bucket = 0 → allow
Request 21 → bucket = 0 → BLOCK
1 second passes → 5 tokens refilled → bucket = 5
Request 22 → bucket = 4 → allow
...
Two separate knobs:
This is the fundamental advantage over window-based algorithms. You can say: "allow bursts of up to 20 requests, but sustain no more than 5 per second on average." Window-based algorithms can only express one number — requests per window.
Why Background Job Refill Fails#
The naive approach to refilling: run a background job every second that increments every active user's token count by the refill rate.
At 1M active users this becomes:
1M users × 1 INCR per second = 1,000,000 Redis writes per second
just for refilling — before any actual requests
Our entire request budget is 400K QPS. The refill job alone would need 1M ops/sec — 2.5× our total request capacity, just to keep tokens topped up. And this runs constantly whether users are making requests or not.
Beyond the Redis load, you'd need a distributed scheduler to coordinate which worker refills which users, handle failures, avoid double-refills. An entire infrastructure component just to increment counters.
This is the wrong approach entirely.
Lazy Refill — No Background Job#
Instead of refilling on a timer, calculate the refill lazily — only when a request actually arrives.
Store two things per user in Redis:
current_tokens : how many tokens the user currently has
last_refill_time : Unix timestamp of the last time tokens were calculated
When a request arrives at timestamp T:
time_passed = T - last_refill_time
tokens_to_add = time_passed × refill_rate
new_tokens = min(current_tokens + tokens_to_add, capacity)
min(..., capacity) is critical — the bucket can never overflow beyond capacity. Even if a user goes idle for 10 hours:
capacity = 20, refill = 5/sec, idle for 10 hours = 36,000 seconds
tokens_to_add = 36,000 × 5 = 180,000
new_tokens = min(0 + 180,000, 20) = 20 ← capped at capacity
No background job. No scheduler. No distributed coordination. Just arithmetic at request time. A user who hasn't made a request in an hour gets exactly the same treatment as one who just started — their bucket refills to capacity and stops there.
The Full Algorithm Step by Step#
User abc hitting /search. Capacity = 20, refill rate = 5 tokens/sec.
Redis state before request:
Request arrives at T = 1745000145:
time_passed = 1745000145 - 1745000100 = 45 seconds
tokens_to_add = 45 × 5 = 225
new_tokens = min(3 + 225, 20) = 20 ← bucket full, user was idle
new_tokens > 0:
decrement → 20 - 1 = 19
update Redis:
current_tokens = 19
last_refill_time = 1745000145
→ ALLOW
If new_tokens == 0:
→ BLOCK (return retry_after = 1/refill_rate = 0.2 seconds)
Two Knobs — Capacity vs Refill Rate#
This is what makes Token Bucket expressive for real production systems.
Capacity = maximum burst the system tolerates
Refill rate = sustained throughput you're willing to allow
Examples:
/search endpoint (generous):
capacity = 100 (allow a burst of 100 for autocomplete UX)
refill rate = 10/sec (sustain 10 req/sec average)
/login endpoint (strict):
capacity = 5 (never allow more than 5 attempts at once)
refill rate = 0.08/sec (5 per minute sustained = 0.083/sec)
/payment endpoint (very strict):
capacity = 3
refill rate = 0.016/sec (1 per minute)
Window-based algorithms can only say "N requests per window." Token Bucket can say "allow a burst of N, then throttle to M per second." Far more expressive for real API behaviour.
Why AWS and Stripe Use This#
Real API clients are naturally bursty. A mobile app on startup makes 10 API calls in 2 seconds — profile fetch, feed load, notification count, settings sync. Then it goes quiet for minutes.
Window-based algorithms see the startup burst as an attack pattern and throttle it. Token Bucket sees it as a legitimate burst within capacity and allows it, then smoothly limits the sustained rate.
Stripe's API allows bursts for webhook retries — a payment failure triggers multiple rapid retries in quick succession. Token Bucket handles this gracefully. A sliding window would block the retries mid-sequence.
AWS API Gateway uses Token Bucket at the account level — burst limit (capacity) and rate limit (refill rate) are two separately configurable parameters, exactly matching the two knobs.
Atomicity#
Token Bucket requires four Redis operations per request:
1. GET current_tokens
2. GET last_refill_time
3. SET new_tokens (after calculation)
4. SET last_refill_time = T
These four operations must be atomic. Two rate limiter nodes serving the same user simultaneously would both read the same token count, both calculate the same new count, both decrement — letting through one extra request.
The fix is a Lua script — all four operations wrapped in a single atomic unit. Covered in full in the Distributed Rate Limiting deep dive.
Summary#
Storage : 2 values per user (current_tokens + last_refill_time)
Redis ops : GET + GET + SET + SET = 4 ops (Lua script for atomicity)
Accuracy : exact — no approximation
Burst control : capacity knob controls maximum burst size
Rate control : refill rate knob controls sustained average
Refill model : lazy — calculated at request time, no background job
Best for : APIs with legitimate bursty clients (mobile apps, webhooks)
used by AWS API Gateway, Stripe in production
Not for : cases where you want strictly zero burst tolerance
(use Leaky Bucket for that)