# ISR
ISR is a dynamic list of replicas that are caught up with the leader within a configurable time threshold. With acks=all, the leader only waits for ISR members — not all replicas. This prevents one slow broker from tanking write latency for everyone.
## The problem with acks=all
acks=all says wait for ALL replicas to confirm before ACKing the producer. With replication factor = 3, that means waiting for 3 disk writes across 3 brokers on 3 different machines.
What if one of those brokers is slow — overloaded, garbage collecting, network hiccup? Every single write now waits for the slowest broker. Your p99 write latency is determined by your worst broker at any given moment.
At 100,000 events/sec, one slow broker makes the entire system feel sluggish.
## What ISR is
ISR (In-Sync Replicas) is a dynamic list maintained by the leader. It contains only the replicas that are currently caught up with the leader within a time threshold (`replica.lag.time.max.ms` — 10 seconds by default in older Kafka versions, 30 seconds since Kafka 2.5).
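The membership rule can be sketched as a simple time check. This is an illustrative model, not Kafka's actual implementation — `replica.lag.time.max.ms` is the real config name, but the data shapes and function are made up:

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # replica.lag.time.max.ms (recent Kafka default)

def current_isr(followers, now_ms):
    """Illustrative model: a follower stays in the ISR only if it has
    fully caught up with the leader within the lag window."""
    return [
        f["id"] for f in followers
        if now_ms - f["last_caught_up_ms"] <= REPLICA_LAG_TIME_MAX_MS
    ]

followers = [
    {"id": "broker-2", "last_caught_up_ms": 99_500},  # caught up 0.5s ago
    {"id": "broker-3", "last_caught_up_ms": 60_000},  # lagging 40s behind
]
print(current_isr(followers, now_ms=100_000))  # → ['broker-2']
```

Broker 3 drops out of the list as soon as it exceeds the lag window, and is re-added once it catches back up.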
```mermaid
graph TD
    L[Broker 1 - Leader P0]
    R1[Broker 2 - Follower]
    R2[Broker 3 - Follower - SLOW]
    L -->|replicating fine| R1
    L -->|lagging behind| R2
    ISR[ISR List]
    ISR --> L
    ISR --> R1
    Note[Broker 3 removed from ISR<br/>catches up in background]
```

Broker 2 is keeping up → in ISR ✓
Broker 3 is lagging → removed from ISR ✗ → catches up in background
With acks=all, the leader now waits only for ISR members — Broker 1 and Broker 2. Broker 3 is excluded until it catches up.
```
Before ISR removal:
  Producer write → wait for Broker 1, 2, AND slow Broker 3 → high latency

After ISR removal:
  Producer write → wait for Broker 1 and Broker 2 only → fast
  Broker 3 → catches up quietly in the background
```
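The latency effect is easy to see numerically. A hypothetical sketch (the per-broker latencies are made-up numbers): with acks=all, the producer's ack latency is the slowest member of the wait set, so shrinking the wait set to the ISR drops it to the slowest *healthy* replica.

```python
def ack_latency_ms(wait_set, replica_latency_ms):
    # acks=all completes only when the slowest replica in the wait set confirms
    return max(replica_latency_ms[b] for b in wait_set)

latency = {"broker-1": 2, "broker-2": 3, "broker-3": 900}  # broker-3 is GC-pausing

before = ack_latency_ms(["broker-1", "broker-2", "broker-3"], latency)
after = ack_latency_ms(["broker-1", "broker-2"], latency)
print(before, after)  # → 900 3
```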
## The safety trade-off
Removing Broker 3 from ISR means if Broker 1 and Broker 2 both die simultaneously, the data that Broker 3 hasn't replicated yet is lost. You've reduced your failure tolerance by 1.
This is acceptable because:

1. Two brokers dying at the exact same time is rare
2. Broker 3 is still replicating — it's just slightly behind, not completely gone
3. The alternative (waiting for slow replicas) causes sustained high latency for every write
## min.insync.replicas — the safety floor
What if Broker 2 also falls behind? ISR shrinks to just the leader — Broker 1 alone. Now acks=all is effectively acks=1 — only the leader confirms. If Broker 1 crashes, data is lost.
Kafka has a config to prevent this: min.insync.replicas.
```
min.insync.replicas = 2

ISR = [Broker 1, Broker 2, Broker 3] → writes accepted ✓
ISR = [Broker 1, Broker 2]           → writes accepted ✓ (meets minimum of 2)
ISR = [Broker 1 only]                → writes REJECTED ✗
                                       → producer gets NotEnoughReplicasException
                                       → Kafka prefers unavailability over data loss
```
When ISR shrinks below the minimum, Kafka stops accepting writes entirely. It chooses consistency over availability — a deliberate CAP theorem decision. Better to reject writes and alert the team than to silently accept writes that might be lost.
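The gate can be modeled in a few lines. An illustrative sketch — `NotEnoughReplicasException` is the name of Kafka's real error, but the helper function is hypothetical:

```python
class NotEnoughReplicasException(Exception):
    """Models Kafka's error when ISR size < min.insync.replicas under acks=all."""

def check_write_allowed(isr, min_insync_replicas):
    # With acks=all, the leader rejects the write up front if the ISR
    # has shrunk below the configured safety floor.
    if len(isr) < min_insync_replicas:
        raise NotEnoughReplicasException(
            f"ISR size {len(isr)} < min.insync.replicas {min_insync_replicas}"
        )
    return True

check_write_allowed(["broker-1", "broker-2"], 2)  # accepted
# check_write_allowed(["broker-1"], 2)            # raises NotEnoughReplicasException
```

The producer sees the exception and can retry; the write is never silently accepted with insufficient replication.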
## The production config
The standard production setup for critical data:
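In config form, the settings described below would look like this (shown as a broker default plus a producer setting; `default.replication.factor`, `min.insync.replicas`, and `acks` are the real config keys):

```
# Broker / topic configuration
default.replication.factor=3
min.insync.replicas=2

# Producer configuration
acks=all
```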
This means:

- 3 copies of every partition (1 leader + 2 followers)
- Producer waits for all ISR members (at least 2) to confirm
- If ISR drops to 1, writes are rejected
- Can survive 1 broker failure with zero data loss
- Can survive 2 broker failures without data loss — writes are rejected until the ISR recovers, trading availability for durability
```mermaid
graph LR
    P[Producer] -->|acks=all| L[Leader]
    L -->|replicate| F1[Follower 1 - ISR]
    L -->|replicate| F2[Follower 2 - ISR]
    L -->|ACK after 2 confirms| P
```

min.insync.replicas only has effect when acks=all. With acks=1, the leader ACKs after its own write regardless — min.insync.replicas is ignored.
A common misconfiguration: setting min.insync.replicas = 1 with acks=all. This means the leader alone is enough to satisfy acks=all — you get none of the durability benefits. Always set min.insync.replicas = 2 for production critical topics.
Interview framing: "I'd use replication factor 3, acks=all, and min.insync.replicas=2. ISR ensures that a slow replica doesn't block writes — Kafka dynamically excludes lagging replicas. The min.insync.replicas floor ensures we never accidentally drop below safe durability. If ISR shrinks to 1, Kafka rejects writes — we'd rather page an engineer than silently lose billing data."