# Crash Recovery
## The Problem
Stateful stream processing keeps state in local memory — fast, but volatile. If the node crashes, all state is lost.
You need a way to recover without reprocessing everything from the beginning of time.
## Checkpointing
The processor periodically snapshots its local state and its current Kafka offset to durable storage (S3, HDFS).
Every 30 seconds → Node 2 writes to S3:
```json
{
  "state": { "card-456": 3, "card-789": 1 },
  "offset": { "partition": 2, "offset": 1045 }
}
```
Both pieces are saved together atomically; you need both to recover correctly. If they were written separately, a crash between the two writes could pair fresh state with a stale offset (double-counting events on replay) or stale state with a fresh offset (silently dropping them).
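A minimal sketch of the write path, assuming Python with boto3; the bucket name, key layout, and the shape of the state dict are assumptions for illustration, not part of the original design:

```python
import json

import boto3

s3 = boto3.client("s3")

def write_checkpoint(state: dict, partition: int, offset: int) -> None:
    """Snapshot state + offset as a single S3 object.

    One put_object call is atomic at the object level: a reader sees
    either the whole new checkpoint or the whole previous one, never
    fresh state paired with a stale offset.
    """
    checkpoint = {
        "state": state,  # e.g. {"card-456": 3, "card-789": 1}
        "offset": {"partition": partition, "offset": offset},
    }
    s3.put_object(
        Bucket="my-checkpoints",  # hypothetical bucket name
        Key=f"fraud-job/partition-{partition}.json",
        Body=json.dumps(checkpoint).encode("utf-8"),
    )
```

Bundling both fields into one object is what makes the single PUT the atomic unit; keying checkpoints per partition keeps each node's recovery independent.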
## Crash Recovery: 3 Steps
Node 2 crashes
↓
New node spins up
↓
1. Load state snapshot from S3
→ card-456: 3, card-789: 1
2. Resume Kafka consumption from checkpointed offset
→ partition 2, offset 1045
3. Reprocess events from offset 1045 onwards
→ rebuilds any state changes that happened after the last checkpoint
The gap between the last checkpoint and the crash is replayed from Kafka. State is fully restored.
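Here are the same three steps as a sketch, assuming the confluent-kafka Python client and the checkpoint object written above; the broker address, topic name, and bucket are placeholders:

```python
import json

import boto3
from confluent_kafka import Consumer, TopicPartition

s3 = boto3.client("s3")

def recover(partition: int) -> tuple[dict, Consumer]:
    # Step 1: load the state snapshot + offset from S3.
    obj = s3.get_object(
        Bucket="my-checkpoints",  # hypothetical bucket name
        Key=f"fraud-job/partition-{partition}.json",
    )
    checkpoint = json.loads(obj["Body"].read())
    state = checkpoint["state"]
    offset = checkpoint["offset"]["offset"]

    # Step 2: attach directly at the checkpointed offset instead of
    # whatever offset the consumer group last committed.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "group.id": "fraud-job",
        "enable.auto.commit": False,  # offsets live in the checkpoint, not Kafka
    })
    consumer.assign([TopicPartition("transactions", partition, offset)])

    # Step 3 happens in the normal poll loop: events from `offset`
    # onward are reprocessed, rebuilding post-checkpoint state changes.
    return state, consumer
```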
## Visualized
```mermaid
sequenceDiagram
participant N as Node 2
participant S as S3 (checkpoint store)
participant K as Kafka Partition 2
N->>S: checkpoint (state + offset 1045)
Note over N: processes events 1045→1090
N-->>N: CRASH at offset 1090
Note over N: new node starts
N->>S: load checkpoint
S-->>N: state + offset 1045
N->>K: resume from offset 1045
K-->>N: replay events 1045→1090
Note over N: state fully restored
```

## Why Kafka Log Retention Matters
Crash recovery replays events from the checkpointed offset. This only works if those events still exist in Kafka.
If Kafka retains events for only 1 day but your latest checkpoint is 3 days old, recovery fails: the events needed to replay the gap are gone.
Rule: Kafka retention must be longer than the maximum time between a valid checkpoint and a crash. In practice, keep retention at least 7 days for any stateful job.
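One way to catch this before recovery fails, again assuming the confluent-kafka client: compare the checkpointed offset against the partition's low watermark, i.e. the oldest offset Kafka still retains. The topic name is a placeholder:

```python
from confluent_kafka import Consumer, TopicPartition

def checkpoint_is_replayable(consumer: Consumer, partition: int,
                             ckpt_offset: int) -> bool:
    # low = oldest offset still in Kafka; high = next offset to be written.
    low, _high = consumer.get_watermark_offsets(
        TopicPartition("transactions", partition)  # placeholder topic
    )
    # If retention already deleted everything before `low`, a checkpoint
    # pointing earlier than that can no longer replay its gap.
    return ckpt_offset >= low
```

If this returns False, the replacement node can at best restart from the low watermark and accept the loss, or rebuild state from a full history stored elsewhere.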
## Key Properties
| Property | Detail |
|---|---|
| Checkpoint stores | State snapshot + Kafka offset |
| Storage target | Durable store: S3 or HDFS (RocksDB on local disk holds the live state, not the crash-safe checkpoint) |
| Checkpoint interval | Configurable — tradeoff between recovery time and overhead |
| Recovery guarantee | At-least-once (events since last checkpoint are replayed) |
| Exactly-once | Possible with two-phase commit (Flink supports this) |
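Putting the pieces together, a sketch of the main loop with a configurable checkpoint interval, reusing the hypothetical `write_checkpoint` from earlier; the per-card counter stands in for whatever the job actually computes:

```python
import time

CHECKPOINT_INTERVAL_S = 30  # shorter = less to replay on recovery, more S3 traffic

def run(state: dict, consumer, partition: int) -> None:
    next_offset = None
    last_ckpt = time.monotonic()
    while True:
        msg = consumer.poll(1.0)
        if msg is not None and msg.error() is None:
            card = msg.key().decode("utf-8")      # hypothetical keying by card id
            state[card] = state.get(card, 0) + 1  # e.g. count of flagged events
            next_offset = msg.offset() + 1        # first offset NOT yet processed
        if next_offset is not None and \
                time.monotonic() - last_ckpt >= CHECKPOINT_INTERVAL_S:
            write_checkpoint(state, partition, next_offset)
            last_ckpt = time.monotonic()
```

Saving last processed + 1 means the checkpointed offset is the first event to replay, which matches the walkthrough above: checkpoint at 1045, resume consumption at 1045.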
## One-Line Summary
Checkpointing saves state + offset to durable storage so a replacement node can reload the last snapshot and replay only the gap from Kafka.