Redis Data Structures#
Redis is an in-memory data store. Everything lives in RAM. That's why it's fast — no disk, no file system, just memory.
flowchart LR
A["DB query<br/>~10ms<br/>disk I/O + query planning + network"] -->|"100x faster"| B["Redis GET<br/>~0.1ms<br/>RAM lookup + network hop"]
style A fill:#f8d7da,stroke:#dc3545,color:#000
style B fill:#d4edda,stroke:#28a745,color:#000
A natural question: isn't a file sitting on the same server's disk faster than a network hop to Redis?
flowchart LR
A["Same server disk<br/>1–10ms<br/>seek time + OS page cache + storage controller"]
B["Redis over network<br/>~0.1ms<br/>network + RAM lookup"]
A -->|"Redis still wins"| B
style A fill:#f8d7da,stroke:#dc3545,color:#000
style B fill:#d4edda,stroke:#28a745,color:#000
Disk I/O has to move through physical layers: storage controller, kernel buffers, OS scheduling. A network hop to a RAM store skips all of that.
The only thing faster than Redis is local in-process cache — that's why two-level caching exists:
flowchart LR
A["Local RAM<br/>nanoseconds<br/>no network at all"] -->|slower| B["Redis<br/>~0.1ms<br/>shared across all servers"] -->|slower| C["Same server disk<br/>1–10ms"] -->|slower| D["DB<br/>10–100ms<br/>disk + query planning"]
style A fill:#d4edda,stroke:#28a745,color:#000
style B fill:#c3e6cb,stroke:#28a745,color:#000
style C fill:#fff3cd,stroke:#ffc107,color:#000
style D fill:#f8d7da,stroke:#dc3545,color:#000
Redis is essentially shared RAM over the network. All your servers read and write to one place, so they all see consistent data. Local cache can't do that: if Server 1 updates a value, Server 2 still has the old one.
Redis isn't just a key-value store that holds strings. It has built-in data structures — each one solves a specific problem at scale that a plain string or a DB query can't handle efficiently.
String#
The simplest structure. A key maps to a single value.
SET session:abc123 "token-xyz"
GET session:abc123 → "token-xyz"
Strings can also act as atomic counters:
INCR post:42:views → 1
INCR post:42:views → 2
One command, atomic, no race conditions. This is how Redis handles like counts, view counts, and rate-limiting counters.
Use for: cached values, session tokens, counters.
Hash#
Key maps to a mini key-value map. Perfect for objects with multiple fields.
HSET user:123 name "John" age 28 city "NYC"
HGET user:123 name → "John"
HGETALL user:123 → { name: John, age: 28, city: NYC }
HSET user:123 city "LA" ← update one field, nothing else touched
Why not just serialize the whole object to JSON and store as a String?
At 10 million users, 1 million city updates per day:
flowchart LR
subgraph String["String approach — wasteful"]
S1["GET user:123<br/>fetch entire 1KB JSON"] --> S2["Deserialize + update city<br/>in app memory"]
S2 --> S3["SET user:123<br/>write entire 1KB back"]
S3 --> S4["1M updates × 2KB = 2GB<br/>wasted network traffic/day"]
style S4 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph Hash["Hash approach — efficient"]
H1["HSET user:123 city 'LA'<br/>~10 bytes on the wire"]
H1 --> H2["1M updates × 10 bytes = 10MB/day<br/>200x less data ✓"]
style H2 fill:#d4edda,stroke:#28a745,color:#000
end
Use for: user profiles, product details, shopping carts.
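The traffic numbers in the comparison can be sanity-checked with a few lines of arithmetic (assumed sizes, as in the diagram: a 1KB JSON blob per user, roughly 10 bytes on the wire per HSET):

```python
updates_per_day = 1_000_000

# String approach: read the whole ~1KB object, write the whole ~1KB back
string_bytes = updates_per_day * (1024 + 1024)

# Hash approach: only the changed field crosses the wire (~10 bytes)
hash_bytes = updates_per_day * 10

print(string_bytes // 10**6, "MB/day")  # ~2GB/day for Strings
print(hash_bytes // 10**6, "MB/day")    # 10 MB/day for Hashes
print(string_bytes // hash_bytes)       # ~200x difference
```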
List#
An ordered list of strings. Push to front or back, read a range.
LPUSH notifications:user:123 "someone liked your post"
LPUSH notifications:user:123 "someone followed you"
LRANGE notifications:user:123 0 9 → last 10 notifications, newest first
Already ordered because you push newest items to the front. No sorting needed at read time.
Why not a DB table?
50 million users × 100 notification events/day = 5 billion rows.
flowchart LR
subgraph DB["DB approach"]
D1["SELECT * FROM notifications<br/>WHERE user_id = 123<br/>ORDER BY created_at DESC LIMIT 10"]
D1 --> D2["index lookup + disk read + sort<br/>~10ms per request"]
D2 --> D3["at 1M reads/sec → DB hammered ✗"]
style D3 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph List["List approach"]
L1["LRANGE notifications:user:123 0 9"]
L1 --> L2["already ordered, already in RAM<br/>~0.1ms"]
L2 --> L3["100x faster, DB never touched ✓"]
style L3 fill:#d4edda,stroke:#28a745,color:#000
end
Use for: notification feeds, activity logs, recent history, task queues.
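The same newest-first feed can be sketched in pure Python with a double-ended queue; `feed`, `lpush`, and `lrange` below are stand-ins for one Redis List and its commands, not real client calls:

```python
from collections import deque

feed = deque()

def lpush(item):
    feed.appendleft(item)  # newest at the front, like LPUSH

def lrange(start, stop):
    # inclusive range, like Redis LRANGE
    return list(feed)[start:stop + 1]

lpush("someone liked your post")
lpush("someone followed you")
print(lrange(0, 9))  # newest first, no sort at read time
```

Pushing to the front is O(1) and a read never sorts anything, which is the whole point of the List shape.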
Sorted Set#
Like a Set but every member has a score. Redis keeps members sorted by score automatically.
ZADD leaderboard 9500 "alice"
ZADD leaderboard 8200 "bob"
ZADD leaderboard 9800 "charlie"
ZREVRANGE leaderboard 0 2 WITHSCORES → charlie(9800), alice(9500), bob(8200)
Why not a List?
Alice scores 500 more points:
flowchart LR
subgraph L["List"]
L1["fetch entire list"] --> L2["find alice"] --> L3["update score"] --> L4["re-sort entire list<br/>O(n log n) ✗"]
style L4 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph SS["Sorted Set"]
S1["ZADD leaderboard 10000 'alice'<br/>position updated automatically<br/>O(log n) ✓"]
style S1 fill:#d4edda,stroke:#28a745,color:#000
end
Why not a DB?
50 million players, scores updating every few seconds:
flowchart LR
subgraph DB["DB"]
D1["UPDATE scores SET score = 10000 WHERE user = 'alice'"]
D1 --> D2["SELECT ORDER BY score DESC LIMIT 10"]
D2 --> D3["disk write + sort on every update<br/>collapses at scale ✗"]
style D3 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph SortedSet["Sorted Set"]
S1["ZADD leaderboard 10000 'alice'"] --> S2["ZREVRANGE leaderboard 0 9"]
S2 --> S3["all in RAM, always sorted<br/>milliseconds ✓"]
style S3 fill:#d4edda,stroke:#28a745,color:#000
end
The Z prefix has no deep meaning: "S" was already taken by Sets, and the original author needed a free letter. Historical accident; don't look for logic.
Use for: leaderboards, trending posts, rate limiting (score = timestamp).
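A minimal sketch of the sorted-set behaviour, using a plain dict and sorting at read time (real Redis keeps a skip list, so updates land in order in O(log n) without any re-sort; `zrange_desc` mirrors ZREVRANGE):

```python
scores = {}

def zadd(member, score):
    scores[member] = score  # re-adding an existing member just moves it

def zrange_desc(start, stop):
    # top entries by score, highest first, inclusive range
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return ranked[start:stop + 1]

zadd("alice", 9500)
zadd("bob", 8200)
zadd("charlie", 9800)
zadd("alice", 10000)  # alice scores again: her position updates
print(zrange_desc(0, 2))
```

Re-adding an existing member with a new score is exactly what ZADD does on a score update.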
Set#
An unordered collection of unique strings. Duplicates ignored automatically.
SADD post:123:likes "alice"
SADD post:123:likes "bob"
SADD post:123:likes "alice" ← duplicate, silently ignored
SCARD post:123:likes → 2
Why not a List?
Post has 1 million likes. Alice tries to like it again:
flowchart LR
subgraph L["List"]
L1["Scan 1 million entries<br/>to check if alice already exists<br/>O(n) ✗"]
style L1 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph S["Set"]
S1["SADD handles uniqueness automatically<br/>O(1) ✓"]
style S1 fill:#d4edda,stroke:#28a745,color:#000
end
The killer feature is set operations:
SINTER followers:alice followers:bob → mutual followers
SUNION followers:alice followers:bob → everyone who follows either
SDIFF followers:alice followers:bob → follows alice but not bob
Finding mutual friends via a DB join is expensive at scale. SINTER in Redis is a RAM operation — milliseconds.
Use for: likes, tags, mutual friends, unique visitors.
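Python's built-in set operators mirror SINTER, SUNION, and SDIFF one-for-one; the follower lists below are made up:

```python
followers_alice = {"bob", "carol", "dave"}
followers_bob = {"carol", "erin"}

print(followers_alice & followers_bob)  # SINTER: mutual followers
print(followers_alice | followers_bob)  # SUNION: follows either
print(followers_alice - followers_bob)  # SDIFF: follows alice but not bob
```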
HyperLogLog#
Counts unique items approximately using fixed memory regardless of input size.
Why not just store every viewer in a Set?
flowchart LR
A["1 video, 5M viewers<br/>5M × 50 bytes = 250MB"] --> B["YouTube has 800M videos<br/>1M popular videos × 100K viewers × 50 bytes"]
B --> C["= 5 terabytes<br/>just for view counts ✗"]
style C fill:#f8d7da,stroke:#dc3545,color:#000
How HyperLogLog works: the leading-zeros trick.
Every user ID gets hashed into a binary number. Hash functions produce random-looking output, so leading zeros follow a probability pattern:
hash("user:1") → 00110101... (2 leading zeros)
hash("user:2") → 00001101... (4 leading zeros)
hash("user:3") → 10110010... (0 leading zeros)
hash("user:99999") → 00000001... (7 leading zeros)
Leading zeros get rarer the longer they are:
flowchart LR
A["1 leading zero<br/>1 in 2 hashes<br/>common"] --> B["5 leading zeros<br/>1 in 32 hashes<br/>rare"] --> C["10 leading zeros<br/>1 in 1,024 hashes<br/>very rare"] --> D["20 leading zeros<br/>1 in 1M hashes<br/>extremely rare"]
style A fill:#d4edda,stroke:#28a745,color:#000
style B fill:#c3e6cb,stroke:#28a745,color:#000
style C fill:#fff3cd,stroke:#ffc107,color:#000
style D fill:#f8d7da,stroke:#dc3545,color:#000
HyperLogLog only remembers one number, the maximum leading-zero count seen so far:
Processed 1,000,000 user IDs
→ rarest hash seen had 20 leading zeros
→ 20 leading zeros happens ~1 in 1,000,000 times
→ estimate: ~1,000,000 unique users ✓
Memory used: just the number "20" — a few bytes
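Both claims, that k leading zeros shows up in roughly 1 in 2^k hashes and that the rarest hash tracks log2 of the input size, are easy to check empirically (sha1 here is just a stand-in for whatever hash a real implementation uses):

```python
import hashlib

def leading_zeros(s, bits=32):
    # 32-bit hash of s, then count zeros before the first 1-bit
    h = int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")
    return bits - h.bit_length()

n = 100_000
zeros = [leading_zeros(f"user:{i}") for i in range(n)]

# about 1 in 32 hashes should have at least 5 leading zeros
at_least_5 = sum(1 for z in zeros if z >= 5) / n
# the rarest hash seen tracks log2(n): log2(100,000) is about 16.6
print(at_least_5, max(zeros))
```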
Why 12KB and not just a few bytes?
Tracking one global max is too noisy:
flowchart LR
A["Only 100 people watched<br/>but one hash had 20 leading zeros by luck"] --> B["Estimate: 1,000,000 viewers<br/>Actual: 100 viewers ✗"]
style B fill:#f8d7da,stroke:#dc3545,color:#000
The fix: split users into 16,384 buckets, track the max per bucket, then average:
Bucket 1 → max = 18
Bucket 2 → max = 19
Bucket 3 → max = 3 ← lucky outlier, averaged out
Bucket 4 → max = 20
...
Bucket 16,384 → max = 17
Average across all buckets → stable, accurate estimate
One lucky outlier barely moves the average
16,384 buckets × 6 bits per bucket (leading zeros on 64-bit hash fits in 6 bits)
= 98,304 bits = 12KB — fixed forever, whether 1 user or 1 billion
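The whole bucketed scheme fits in a short sketch. Assumptions: 1,024 buckets instead of Redis's 16,384, sha1 truncated to 64 bits instead of Redis's internal hash, and the standard bias-correction constant. This is the textbook estimator, not Redis's exact code:

```python
import hashlib

P = 10        # bucket-index bits
M = 1 << P    # 1,024 buckets

def hash64(s):
    # deterministic 64-bit hash (sha1 truncated, for illustration)
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def estimate(ids):
    buckets = [0] * M
    for x in ids:
        h = hash64(x)
        j = h >> (64 - P)              # first 10 bits pick a bucket
        w = h & ((1 << (64 - P)) - 1)  # remaining 54 bits
        # 1 + leading zeros of the 54-bit word
        rho = (64 - P) - w.bit_length() + 1
        buckets[j] = max(buckets[j], rho)
    alpha = 0.7213 / (1 + 1.079 / M)   # standard bias correction, m >= 128
    return alpha * M * M / sum(2.0 ** -b for b in buckets)

true_n = 50_000
est = estimate(f"user:{i}" for i in range(true_n))
print(round(est))  # close to 50,000
```

With 1,024 buckets the expected error is about 1.04/√1024 ≈ 3.3%, so the estimate lands within a few percent of the true count.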
Why Redis and not just write to a DB?
flowchart LR
subgraph DB["DB"]
D1["Every page load fires INSERT"] --> D2["Millions of writes/sec"] --> D3["DB collapses ✗"]
style D3 fill:#f8d7da,stroke:#dc3545,color:#000
end
subgraph Redis["Redis"]
R1["Every page load fires PFADD"] --> R2["RAM write, millions/sec easy"] --> R3["Read count once at end of day ✓"]
style R3 fill:#d4edda,stroke:#28a745,color:#000
end
PFADD visitors:2026-04-04 "user:1" ... "user:5000000"
PFCOUNT visitors:2026-04-04 → ~5,000,000
Memory: always 12KB — whether 1 user or 1 billion
Error: ~0.81% — 5,000,000 might come back as 5,040,500. Acceptable for view counts.
HyperLogLog tells you HOW MANY unique items — not WHICH ones. You get the count, never the list.
More buckets = more accuracy = more memory. Redis chose 16,384 as the sweet spot — 0.81% error at just 12KB.
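The 0.81% figure falls straight out of the standard HyperLogLog error formula, 1.04/√m:

```python
import math

m = 16_384                    # Redis's bucket count
error = 1.04 / math.sqrt(m)   # standard HLL relative-error formula
memory_kb = m * 6 / 8 / 1024  # 6 bits per bucket
print(f"{error:.4%} error at {memory_kb} KB")  # 0.8125% error at 12.0 KB
```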
Use for: unique visitors, distinct search queries, video view counts — anything where approximate is fine and memory matters.
Bitmap#
The problem: track which users logged in today. Three possible solutions.
Solution 1 — Redis Set
SADD active:2026-04-04 "user:123"
SADD active:2026-04-04 "user:456"
Memory: 5M users × 50 bytes = 250MB/day
Solution 2 — User ID as key, 1/0 as value
SET active:2026-04-04:123 1
SET active:2026-04-04:456 1
SET active:2026-04-04:789 0
Memory: 5M users × ~60 bytes = 300MB/day ← worse than Set
Each key carries Redis overhead — key string + value + metadata per entry.
Solution 3 — Bitmap
One key, one array. The array is indexed by user ID — just like a boolean array in DSA:
boolean[] active = new boolean[100_000_000];
active[123] = true; // user 123 logged in
active[456] = true; // user 456 logged in
active[789] = false; // user 789 didn't log in
Redis Bitmap is exactly this — but instead of boolean (1 byte each in Java), it uses actual bits (8x smaller):
flowchart LR
K["active:2026-04-04"] --> A["pos 0<br/>0"]
K --> B["pos 1<br/>0"]
K --> C["pos 2<br/>0"]
K --> D["..."]
K --> E["pos 123<br/>1 ✓"]
K --> F["..."]
K --> G["pos 456<br/>1 ✓"]
K --> H["..."]
K --> I["pos N<br/>0"]
style E fill:#d4edda,stroke:#28a745,color:#000
style G fill:#d4edda,stroke:#28a745,color:#000
style A fill:#f8f9fa,stroke:#adb5bd,color:#000
style B fill:#f8f9fa,stroke:#adb5bd,color:#000
style C fill:#f8f9fa,stroke:#adb5bd,color:#000
style I fill:#f8f9fa,stroke:#adb5bd,color:#000
SETBIT active:2026-04-04 123 1 ← user 123 logged in
SETBIT active:2026-04-04 456 1 ← user 456 logged in
GETBIT active:2026-04-04 123 → 1 ← did user 123 log in? yes
GETBIT active:2026-04-04 789 → 0 ← did user 789 log in? no
BITCOUNT active:2026-04-04 → 2 ← total active users today
How big is the array?
The array size is determined by the highest user ID, not the number of logged-in users: with IDs up to 100 million, that's 100M bits ÷ 8 = 12.5MB.
Once sized, adding 1 active user or 80M active users costs zero extra memory; the bits are already allocated and just flip from 0 to 1.
All three compared:
flowchart LR
subgraph Comparison["Memory — 5M users, highest ID = 100M"]
A["Set<br/>250MB<br/>scales with active users today"]
B["Key/Value<br/>300MB<br/>scales with active users today"]
C["Bitmap<br/>12.5MB fixed<br/>scales only with highest user ID ✓"]
end
style A fill:#f8d7da,stroke:#dc3545,color:#000
style B fill:#f8d7da,stroke:#dc3545,color:#000
style C fill:#d4edda,stroke:#28a745,color:#000
The position IS the user ID. You never store the ID itself; user 123's answer simply lives at position 123 in the array. That's where all the savings come from.
Whenever you see a DSA problem using a boolean array indexed by ID — that's a bitmap in disguise.
The one constraint: user IDs must be integers. You can't do SETBIT active "alice" — you need SETBIT active 123. UUID-based systems need a mapping layer first.
Use for: daily active users, feature flags per user (is user 123 in the beta?), any yes/no fact per user.
HyperLogLog vs Bitmap#
Two different questions — don't mix them up.
flowchart LR
subgraph HLL["HyperLogLog"]
H1["How many unique users visited today?<br/>→ gives you a count<br/>cannot ask 'did user 123 visit?'"]
end
subgraph BM["Bitmap"]
B1["Did user 123 visit today?<br/>→ yes/no per specific user<br/>+ total count via BITCOUNT"]
end
style HLL fill:#fff3cd,stroke:#ffc107,color:#000
style BM fill:#d4edda,stroke:#28a745,color:#000
Summary#
| Data Structure | Best for | Key commands |
|---|---|---|
| String | Single value, atomic counters | SET / GET / INCR |
| Hash | Object fields, update one at a time | HSET / HGET / HGETALL |
| List | Ordered, push/pop, feeds, queues | LPUSH / LRANGE |
| Sorted Set | Scored + auto-sorted, leaderboards | ZADD / ZRANGE |
| Set | Unique members, set operations | SADD / SCARD / SINTER |
| HyperLogLog | Approximate unique count, fixed 12KB | PFADD / PFCOUNT |
| Bitmap | Yes/no per user, minimal memory | SETBIT / GETBIT / BITCOUNT |