Throughput#
How many requests can your system handle per second?
That's throughput. Unlike latency which measures one request, throughput measures the system's total capacity.
Latency vs Throughput — the distinction#
You already know latency — how long one request takes, end to end, from the moment the client sends the request to the moment the response arrives back.
Throughput is also measured end to end — a request is only counted as "served" when the full response is back at the client.
The difference is what they're counting: - Latency — how long did that round trip take for one request - Throughput — how many complete round trips happened per second
The units#
| Term | Used for |
|---|---|
| RPS (Requests Per Second) | General APIs and services |
| QPS (Queries Per Second) | Databases specifically |
| bps / Mbps / Gbps (bits per second) | Data transfer — video streaming, file uploads |
RPS and QPS mean the same thing
Just different terms depending on context. In interviews you'll hear both — they're interchangeable.
Start simple — one request at a time#
At its most basic, a server processes one request at a time.
Request comes in → server works on it → sends response back → only then picks up the next one.
If one request takes 100ms:
That's it. 10 RPS. Any extra requests arriving during that time just sit in a queue and wait.
Now introduce threads#
In reality, servers don't work like this. They use threads.
A thread is like a worker inside the server. Each thread handles one request independently. Multiple threads run in parallel — so multiple requests get processed at the same time.
Add 10 threads to that same server:
Same server. Same 100ms latency per request. Just 10 workers running in parallel instead of 1.
The formula
Step 1 — how many requests can one thread handle per second? 1000ms / latency per request = requests per second per thread
Step 2 — multiply by number of threads: Throughput = Threads × (1000ms / Latency per request)
| Threads | Latency | Requests/sec per thread | Throughput |
|---|---|---|---|
| 1 | 100ms | 1000/100 = 10 | 1 × 10 = 10 RPS |
| 10 | 100ms | 1000/100 = 10 | 10 × 10 = 100 RPS |
| 100 | 100ms | 1000/100 = 10 | 100 × 10 = 1,000 RPS |
What actually happens under real load#
Every web server runs with a thread pool — a fixed number of threads ready to handle requests simultaneously.
Here's what happens as traffic grows:
- Requests arrive → free threads pick them up → processed in parallel ✅
- More requests arrive than free threads → they sit in a queue and wait ⏳
- Queue fills up completely → server starts rejecting requests ❌
That rejection is the 503 Service Unavailable error. The server isn't broken — every thread is busy and the queue is full. It has no choice but to turn requests away.
This is exactly what happens when a big sale hits Amazon or tickets go on sale for a concert — servers get overwhelmed not because they're broken but because all threads are occupied.
Threads are not free
Each thread consumes RAM. You can't just set threads to 10,000 and call it done. There's a hard limit based on how much memory the server has. This is a real constraint engineers tune in production.
How to increase throughput#
Two levers:
- Reduce latency — if each request is faster, each thread handles more requests per second
- Add more threads / more servers — more parallel workers = more concurrent requests handled
Real world numbers#
| System | Throughput |
|---|---|
| A basic single-thread server | ~10–50 RPS |
| A typical production app server (50–200 threads) | ~500–5,000 RPS |
| Nginx (reverse proxy, event-driven) | ~50,000+ RPS on a single machine |
| Google Search | ~100,000+ RPS globally |
| WhatsApp at peak | ~500,000+ messages per second |
| Kafka (message queue) | Millions of messages per second per broker |
Why Nginx handles so much more than a typical app server
Nginx doesn't use one thread per request. It uses an event-driven model — a small number of threads handle thousands of connections by never blocking and waiting. This is an advanced topic we'll cover later.