Leases & TTL

The core idea

A distributed lock in etcd is just a key. The problem — if the server holding the lock crashes, the key is never deleted and the lock is held forever. etcd solves this with leases — every lock has a TTL and auto-expires unless the holder keeps renewing it. Crash = no more renewals = lock auto-released.

The stuck lock problem#

Server 3 acquires a lock to run a background job:

Server 3 → writes /locks/job = "server-3"
Server 3 crashes mid-job
→ /locks/job still exists
→ no other server can ever acquire it
→ job never runs again

Without a TTL, the lock is held forever. The only fix is a human manually deleting the key — not acceptable in production.

Leases — TTL on keys#

When Server 3 acquires the lock, it attaches a TTL (time-to-live) — say 10 seconds. etcd automatically deletes the key after 10 seconds unless Server 3 renews it.

Server 3 keeps renewing the lease every few seconds while it is alive and working:

T=0s  → Server 3 acquires /locks/job with TTL=10s
T=3s  → Server 3 renews → TTL resets to 10s
T=6s  → Server 3 renews → TTL resets to 10s
T=9s  → Server 3 crashes ← no more renewals
T=19s → TTL expires → etcd deletes /locks/job automatically
T=19s → Server 7 acquires lock → picks up the job ✓

No human intervention. No stuck locks. The system heals itself.

What about a slow server — not crashed, just slow?#

If Server 3 is alive but its network is congested, its renewal message may arrive late. If the TTL expires before the renewal gets through — etcd deletes the key and another server acquires the lock.

Now both Server 3 and Server 7 think they hold the lock. This is the false expiry problem — and it is handled by fencing tokens, covered in the next file.

TTL is a trade-off

Too short a TTL → false expiries on slow networks, two servers think they hold the lock Too long a TTL → when a server genuinely crashes, the lock stays stuck for too long before another server can pick it up Typical production TTL: 10–30 seconds depending on how quickly you need failover