Interview Cheatsheet

When to mention sharding#

Estimation says storage > few TB?          → shard
Write QPS > 100K sustained?                → shard
Single DB primary can't keep up?           → shard
Interviewer asks "how do you scale this?"  → shard

The move — replication first, sharding second#

Always say replication before sharding

"I'd first scale reads with read replicas. If the write bottleneck or storage limit becomes the constraint, I'd introduce sharding."

Sharding is operationally complex. Interviewers want to see that you reach for it only when necessary, not as a first instinct.


Decisions to state explicitly#

1. Shard key choice — always justify it

"I'd shard by user_id — it's high cardinality, immutable, evenly distributed, and present in every query."

Never just say "I'd shard by user_id." Say why it's a good shard key using the four rules.

2. Strategy

"I'd use consistent hashing so adding shards only remaps ~1/N of keys instead of causing a full remapping."
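To make the "~1/N of keys" claim concrete, here is a minimal consistent-hashing ring in Python. The shard names, vnode count, and MD5-based hash are illustrative choices, not any specific library's API:

```python
import bisect
import hashlib

def _point(label: str) -> int:
    # Stable 64-bit ring position (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal sketch of a ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted (position, shard) pairs
        for shard in shards:
            self.add(shard, vnodes)

    def add(self, shard, vnodes=100):
        # Each shard owns many positions (virtual nodes) for an even spread.
        self._ring.extend((_point(f"{shard}#{i}"), shard) for i in range(vnodes))
        self._ring.sort()

    def lookup(self, key):
        # Walk clockwise to the first shard position at or after the key's hash.
        positions = [p for p, _ in self._ring]
        i = bisect.bisect_left(positions, _point(key)) % len(self._ring)
        return self._ring[i][1]
```

Adding a fifth shard to a four-shard ring remaps roughly a fifth of keys, and every remapped key moves to the new shard, which is exactly the data you have to migrate.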

3. Cross-shard joins

"I'd co-locate a user's profile, posts, and follows under the same shard key so most queries stay on a single shard. Cross-shard access only happens for specific operations like viewing another user's profile, which I'd handle at the application layer."

4. Resharding plan

"I'd over-shard upfront — start with 256 virtual shards mapped to physical servers. Adding capacity means moving whole virtual shards, not migrating individual rows."


One-line definitions#

Sharding

Horizontally splitting rows across multiple servers so each holds a fraction of the data. Solves write bottleneck and storage limits that replication cannot.

Shard key

The column used to route every row to its shard. Must be high cardinality, immutable, evenly distributed, and present in every query.

Consistent hashing

Placing shards on a ring so adding/removing a shard only remaps ~1/N of keys. Avoids the ~80% remapping problem of naive modulo hashing.

Co-location

Designing the data model so related data (user + their posts + their follows) all share the same shard key and land on the same shard. Eliminates cross-shard joins.
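A sketch of what co-location buys at the application layer. The table names and modulo routing are illustrative (a real router would use consistent hashing or a directory), and `query` is a stand-in for a per-shard database call:

```python
NUM_SHARDS = 8

def shard_for(user_id: int) -> int:
    # Simple modulo routing for illustration only.
    return user_id % NUM_SHARDS

def query(shard: int, sql: str, *params):
    # Stand-in for a real per-shard database call.
    return {"shard": shard, "sql": sql, "params": params}

def fetch_profile_page(user_id: int) -> dict:
    shard = shard_for(user_id)
    # profile, posts, and follows are all keyed by user_id, so every
    # read for this page hits one shard -- no cross-shard join.
    return {
        "profile": query(shard, "SELECT * FROM profiles WHERE user_id = %s", user_id),
        "posts":   query(shard, "SELECT * FROM posts WHERE user_id = %s", user_id),
        "follows": query(shard, "SELECT * FROM follows WHERE follower_id = %s", user_id),
    }
```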

Virtual nodes (vnodes)

Each physical shard owns multiple positions on the consistent hashing ring. Ensures even key distribution and spreads load when a shard is added or removed.


The hotspot trap#

A good shard key distributes rows evenly — not request volume

user_id passes all four shard-key rules. But if one user (a celebrity) gets 10 million reads per second, their shard is still a hotspot. Fix: cache the hot row in Redis, so the shard only handles cache misses.
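The cache-in-front-of-the-shard fix looks like this. A plain dict stands in for Redis (real code would SET with a TTL), and `shard_read` is an illustrative name for the expensive per-shard lookup:

```python
cache: dict = {}  # stand-in for Redis; real code would SET with a TTL

def read_user(user_id: int, shard_read):
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]        # celebrity reads are absorbed here
    row = shard_read(user_id)    # only cache misses reach the shard
    cache[key] = row
    return row
```

Ten million reads of the same celebrity row become one shard read plus ten million cache hits.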


Quick comparison#

Strategy             Good for                                        Problem
Range-based          Data locality, archive queries                  Sequential inserts hotspot the latest shard
Hash-based           Even distribution, general use                  No co-location control; naive % breaks on topology change
Directory-based      Full placement control                          SPOF + extra network hop on every query
Consistent hashing   Any distributed system with changing topology   Slightly more complex than modulo
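You can verify the "naive % breaks on topology change" row directly. This sketch uses sequential integers as stand-ins for hash values (the residue math is the same either way):

```python
def moved_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    # A key moves whenever its shard under the old modulus differs
    # from its shard under the new one.
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys
```

Going from 4 to 5 shards with `hash % N` remaps 80% of keys; consistent hashing would remap only ~1/5 of them.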