Access Patterns — Why They Drive Everything#
The core idea#
Schema serves queries. Access patterns are those queries — the actual reads your system needs to support. Every schema decision flows from them:
- Which columns to index
- What order columns go in a composite index
- Whether to normalize or denormalize
- Whether to embed or reference (document stores)
- What becomes the partition key (Cassandra)
- What to cache in Redis
Design schema without knowing your access patterns and you'll end up with a schema that looks reasonable on paper but requires expensive full table scans on your hottest read paths.
How access patterns determine indexes#
Every access pattern maps directly to an index decision.
Pattern: "load all posts by user_123, newest first"
→ needs composite index on(user_id, created_at DESC) → leftmost column = filter, second column = sort order Pattern: "load all comments on post_456, oldest first"
→ needs composite index on(post_id, created_at ASC) Pattern: "has user_123 liked post_456?"
→ composite PK on(user_id, post_id) handles this in O(1) The rule: your WHERE clause tells you what to index. Your ORDER BY tells you the column order.
How access patterns determine denormalization#
Normalization removes duplication — every fact lives in one place. But at scale, joins on hot read paths become expensive.
Normalized approach:
-- to render a post in the feed, you need:
SELECT posts.*, users.username, users.profile_pic
FROM posts
JOIN users ON users.id = posts.user_id
WHERE posts.post_id = '456';
One join per post. For 20 posts in a feed — 20 joins. At 500 million users all loading feeds simultaneously — the users table becomes a bottleneck.
Denormalized approach:
posts (
post_id UUID,
user_id UUID,
username VARCHAR(30), -- duplicated from users
profile_pic VARCHAR(500), -- duplicated from users
caption TEXT,
image_url VARCHAR(500),
created_at TIMESTAMP
)
Now rendering a post needs zero joins. The trade-off: if a user changes their username, you need to update it in both users and all their posts. The write path gets more complex.
Denormalization rule
Denormalize only when a specific read query would require expensive joins at scale. Always state the trade-off: "I'm duplicating username into posts to avoid a join on the hot feed read path — the cost is maintaining consistency on writes."
How access patterns determine caching#
The most expensive access pattern for Instagram is the feed:
SELECT posts.*, users.username, users.profile_pic
FROM posts
JOIN follows ON follows.following_id = posts.user_id
JOIN users ON users.id = posts.user_id
WHERE follows.follower_id = current_user
ORDER BY posts.created_at DESC
LIMIT 20;
A user follows 500 people. Each has thousands of posts. Two joins. Sorted by timestamp. At 500 million users — this query cannot run against the DB on every feed load.
Access pattern tells you this is the hottest read in the system → cache it.
Redis sorted set per user:
Feed load becomes:
Then fetch post details for those 20 IDs. No joins, no full table scan.
How access patterns determine partition keys (Cassandra)#
In Cassandra, the access pattern directly dictates your data model — one table per query pattern.
Access pattern: "last 50 messages in conversation X"
Partition key → conversation_id (route to the right node)
Clustering key → timestamp DESC (sort within the partition)
SELECT * FROM messages
WHERE conversation_id = 'abc'
ORDER BY timestamp DESC
LIMIT 50;
→ pure index scan on one node, no cross-partition query
If you had partitioned by user_id instead — fetching a conversation's messages would require hitting multiple nodes and merging results. Expensive and slow.
In Cassandra: partition key = "which node?", clustering key = "in what order?" Both are driven entirely by your access pattern.