Hinted Handoff: The Neighbor Holding Your Package
Think of it like home delivery. You ordered a package, but you're not home. The delivery driver gives it to your neighbor with a note: "This is for Node D at apartment 4B. Give it to them when they're back."
That's exactly how hinted handoff works.
How it works
The coordinator knows Node D missed the write (no ack, or connection refused). Instead of giving up, the coordinator picks another healthy node — say Node E — and sends the write there with a hint attached.
sequenceDiagram
participant Coord as Coordinator
participant B as Node B
participant C as Node C
participant D as Node D
participant E as Node E
Coord->>B: write("user:123", value)
Coord->>C: write("user:123", value)
Coord->>D: write("user:123", value)
B->>Coord: ack ✓
C->>Coord: ack ✓
D--xCoord: no response (down)
Note over Coord: W=2 met, but Node D missed it
Coord->>E: write("user:123", value) + hint: "this belongs to Node D"
E->>Coord: ack ✓
Note over Coord: Hint stored. Node D's data is safe on Node E
The hint contains:
- The actual key-value data
- The target node (Node D): who this data really belongs to
- A timestamp: when the write happened
Node E stores this data in a separate hints directory, not in its regular data store. Node E doesn't own this key on the ring. It's just holding it temporarily.
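A minimal sketch of that separation, assuming an in-memory node. The `Hint` and `Node` classes and the `store_hint` method are hypothetical names chosen for illustration, not any particular database's API:

```python
from dataclasses import dataclass, field
import time


@dataclass(frozen=True)
class Hint:
    key: str
    value: str
    target: str        # node that actually owns this key, e.g. "D"
    timestamp: float   # when the original write happened


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)    # keys this node owns on the ring
    hints: list = field(default_factory=list)   # writes held for other nodes

    def write(self, key, value):
        # Regular write for a key this node owns.
        self.data[key] = value

    def store_hint(self, key, value, target, timestamp=None):
        # A hinted write goes only into the hints store, never into `data`,
        # so reads served by this node do not see keys it does not own.
        ts = time.time() if timestamp is None else timestamp
        self.hints.append(Hint(key, value, target, ts))


node_e = Node("E")
node_e.store_hint("user:123", "alice", target="D", timestamp=1000.0)
```

After this call, `node_e.hints` holds one entry addressed to Node D, while `node_e.data` is still empty.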
What happens when Node D comes back
Node E detects that Node D is alive again (through gossip protocol or heartbeat). Node E then sends all hinted data to Node D:
sequenceDiagram
participant E as Node E (holding hints)
participant D as Node D (recovered)
Note over D: Node D comes back online
E->>E: detect Node D is alive (gossip)
E->>D: "I have 47 writes that were meant for you"
D->>D: store all 47 writes
D->>E: ack — all received
E->>E: delete hints for Node D
Once Node D confirms it received everything, Node E deletes the hints. Node D is now caught up, and all three replicas hold the data again.
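The replay step can be sketched roughly as follows. This is an illustration under assumptions: hints are plain dicts, and `send` is a hypothetical transport callback that returns `True` once the target has acknowledged a write. Unlike the single batch ack in the diagram, this sketch drops each hint as soon as it is acknowledged, so a failure partway through leaves the undelivered hints in place.

```python
def replay_hints(hints, target, send):
    """Deliver hints addressed to `target`; return the hints still held."""
    remaining = []
    for hint in hints:
        if hint["target"] == target and send(hint):
            continue            # acknowledged: safe to drop this hint
        remaining.append(hint)  # other target, or not acknowledged: keep it
    return remaining


node_d_data = {}


def send_to_d(hint):
    # Stand-in for a network call to Node D.
    node_d_data[hint["key"]] = hint["value"]
    return True  # simulated ack


hints_on_e = [
    {"key": "user:123", "value": "alice", "target": "D", "timestamp": 1000.0},
    {"key": "user:456", "value": "bob", "target": "D", "timestamp": 1001.0},
    {"key": "cart:9", "value": "book", "target": "F", "timestamp": 1002.0},
]
hints_on_e = replay_hints(hints_on_e, "D", send_to_d)
```

Here the two hints for Node D are delivered and removed, and the hint for Node F stays on Node E until F is reachable.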
Why not just write to Node E as a permanent replica?
Because Node E doesn't own this key on the ring. If we made it a permanent replica, the ring mapping would be wrong — future reads for "user:123" would go to B, C, D (based on the ring), and they'd never check Node E. The data would be orphaned.
Hinted handoff is a temporary holding pattern, not a change to the ring. Node E holds the data until the rightful owner (Node D) can take it.
Choosing the hint target
Which node becomes the hint holder? The coordinator picks the next healthy node clockwise on the ring after the failed node that is not already a replica for the key. This is a simple, deterministic rule, so no extra coordination is needed.
Ring: ... → Node B → Node C → Node D (down) → Node E → ...
Node D is down → coordinator sends hint to Node E (next clockwise after D)
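One way to express this rule in code, as a sketch: `pick_hint_holder` is a hypothetical helper, the ring is modeled as a list of node names in clockwise order, and skipping nodes that already hold a replica of the key is an assumption made here so the hint never lands on B or C.

```python
def pick_hint_holder(ring, failed, healthy, replicas):
    """Return the first healthy non-replica node clockwise after `failed`."""
    start = ring.index(failed)
    for step in range(1, len(ring)):
        candidate = ring[(start + step) % len(ring)]  # wrap around the ring
        if candidate in healthy and candidate not in replicas:
            return candidate
    return None  # no eligible node, so no hint can be stored


ring = ["A", "B", "C", "D", "E", "F"]
holder = pick_hint_holder(
    ring,
    failed="D",
    healthy={"A", "B", "C", "E", "F"},
    replicas={"B", "C", "D"},
)
```

With this ring, `holder` is `"E"`. If Node E were also down, the same walk would continue to Node F.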
Limitations of hinted handoff
Hinted handoff isn't a perfect solution. It has limits:
The hint holder can also die. If Node E dies before handing off to Node D, the hints are lost. Now you're back to 2 copies with no way to automatically recover the third. This is where anti-entropy (Merkle tree comparison) kicks in as a background safety net — covered in a later deep dive.
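Hints can't be held forever. If Node D is down for days, Node E accumulates a massive backlog of hints. Systems typically set a hint TTL or hint window (e.g., Cassandra's hint window defaults to 3 hours: once a node has been down that long, no new hints are generated for it). If Node D isn't back in time, the hints expire or stop covering the writes it missed, and hinted handoff can no longer catch it up. Again, anti-entropy handles the long-term recovery.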
Hinted handoff doesn't count toward quorum. In this design, the hint on Node E doesn't satisfy W=2. The coordinator still needs 2 acks from actual replica nodes (B, C, D). The hint is extra insurance, not a quorum substitute. If only 1 of the 3 real replicas acks, the write fails even if a hint is stored. (Dynamo's "sloppy quorum" makes the opposite choice and counts the stand-in node's ack.)
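The last two limits can be sketched as two small checks. Both functions are hypothetical helpers for illustration, hints are modeled as dicts with a `timestamp` field, and the TTL constant reuses the 3-hour example from above:

```python
HINT_TTL_SECONDS = 3 * 60 * 60  # the 3-hour example used in the text


def write_succeeded(acks, replicas, w):
    # Only acks from the key's real replicas count toward W;
    # an ack from a hint holder is ignored.
    return len(set(acks) & set(replicas)) >= w


def purge_expired(hints, now, ttl=HINT_TTL_SECONDS):
    # Keep only hints younger than the TTL.
    return [h for h in hints if now - h["timestamp"] < ttl]


replicas = {"B", "C", "D"}
ok = write_succeeded({"B", "C", "E"}, replicas, w=2)   # B and C acked
failed = write_succeeded({"B", "E"}, replicas, w=2)    # only B acked

hints = [
    {"key": "user:123", "target": "D", "timestamp": 0.0},
    {"key": "user:456", "target": "D", "timestamp": 10_000.0},
]
fresh = purge_expired(hints, now=10_800.0)
```

In the second call, Node E's ack does not rescue the write because E is not one of the three real replicas. In the purge, the first hint is exactly 3 hours old and is dropped, while the second is kept.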
Hinted handoff summary:
✓ Provides temporary durability when a replica is down
✓ Automatic — no human intervention
✓ Fast recovery when the node comes back
✗ Hint holder can also fail → hints lost
✗ Hints expire after TTL (typically 3 hours)
✗ Doesn't count toward quorum — can't replace a real replica
✗ Only helps with short-term outages, not permanent node loss
Hinted handoff handles short-term failures
It's the first line of defense — fast and automatic. For long-term failures (node gone for days) or cases where the hint holder also dies, the system needs anti-entropy repair using Merkle trees. That's a separate deep dive.
Interview framing
"Writes go to all N=3 nodes simultaneously. Coordinator waits for W=2 acks — response latency is the second-fastest node, not the slowest. If the third node is down, we use hinted handoff: the coordinator sends the data to another healthy node with a tag saying 'this belongs to Node D.' When Node D recovers, the hint holder forwards the data and deletes the hint. It's like a neighbor holding your package. Limitations: hints expire after a TTL, the hint holder can also die, and hints don't count toward quorum. For long-term recovery, we rely on anti-entropy with Merkle trees."