Skip to content

Access Patterns — Why They Drive Everything#

The core idea#

Schema serves queries. Access patterns are those queries — the actual reads your system needs to support. Every schema decision flows from them:

  • Which columns to index
  • What order columns go in a composite index
  • Whether to normalize or denormalize
  • Whether to embed or reference (document stores)
  • What becomes the partition key (Cassandra)
  • What to cache in Redis

Design schema without knowing your access patterns and you'll end up with a schema that looks reasonable on paper but requires expensive full table scans on your hottest read paths.


How access patterns determine indexes#

Every access pattern maps directly to an index decision.

Pattern: "load all posts by user_123, newest first"

SELECT * FROM posts WHERE user_id = '123' ORDER BY created_at DESC LIMIT 20;
→ needs composite index on (user_id, created_at DESC) → leftmost column = filter, second column = sort order

Pattern: "load all comments on post_456, oldest first"

SELECT * FROM comments WHERE post_id = '456' ORDER BY created_at ASC LIMIT 20;
→ needs composite index on (post_id, created_at ASC)

Pattern: "has user_123 liked post_456?"

SELECT 1 FROM likes WHERE user_id = '123' AND post_id = '456';
→ composite PK on (user_id, post_id) handles this in O(1)

The rule: your WHERE clause tells you what to index. Your ORDER BY tells you the column order.


How access patterns determine denormalization#

Normalization removes duplication — every fact lives in one place. But at scale, joins on hot read paths become expensive.

Normalized approach:

-- to render a post in the feed, you need:
SELECT posts.*, users.username, users.profile_pic
FROM posts
JOIN users ON users.id = posts.user_id
WHERE posts.post_id = '456';

One join per post. For 20 posts in a feed — 20 joins. At 500 million users all loading feeds simultaneously — the users table becomes a bottleneck.

Denormalized approach:

posts (
    post_id      UUID,
    user_id      UUID,
    username     VARCHAR(30),   -- duplicated from users
    profile_pic  VARCHAR(500),  -- duplicated from users
    caption      TEXT,
    image_url    VARCHAR(500),
    created_at   TIMESTAMP
)

Now rendering a post needs zero joins. The trade-off: if a user changes their username, you need to update it in both users and all their posts. The write path gets more complex.

Denormalization rule

Denormalize only when a specific read query would require expensive joins at scale. Always state the trade-off: "I'm duplicating username into posts to avoid a join on the hot feed read path — the cost is maintaining consistency on writes."


How access patterns determine caching#

The most expensive access pattern for Instagram is the feed:

SELECT posts.*, users.username, users.profile_pic
FROM posts
JOIN follows ON follows.following_id = posts.user_id
JOIN users ON users.id = posts.user_id
WHERE follows.follower_id = current_user
ORDER BY posts.created_at DESC
LIMIT 20;

A user follows 500 people. Each has thousands of posts. Two joins. Sorted by timestamp. At 500 million users — this query cannot run against the DB on every feed load.

Access pattern tells you this is the hottest read in the system → cache it.

Redis sorted set per user:

KEY   → feed:user_123
VALUE → sorted set { post_id → timestamp, post_id → timestamp, ... }

Feed load becomes:

ZREVRANGE feed:user_123 0 19  → top 20 post IDs in O(log n)

Then fetch post details for those 20 IDs. No joins, no full table scan.


How access patterns determine partition keys (Cassandra)#

In Cassandra, the access pattern directly dictates your data model — one table per query pattern.

Access pattern: "last 50 messages in conversation X"

Partition key  → conversation_id   (route to the right node)
Clustering key → timestamp DESC    (sort within the partition)
SELECT * FROM messages
WHERE conversation_id = 'abc'
ORDER BY timestamp DESC
LIMIT 50;
 pure index scan on one node, no cross-partition query

If you had partitioned by user_id instead — fetching a conversation's messages would require hitting multiple nodes and merging results. Expensive and slow.

In Cassandra: partition key = "which node?", clustering key = "in what order?" Both are driven entirely by your access pattern.