Inverted Index
The Key Insight — Flip the Question#
A normal database index answers: "Does this document contain this word?"
An inverted index flips the question: "Which documents contain this word?"
This sounds like a small change. The performance difference is enormous.
With a normal approach, to answer "which products mention wireless?" you scan every product. With an inverted index, you look up "wireless" and instantly get the list of documents — like looking up a word in a book's index at the back, instead of reading the entire book.
Building the Index#
Say you have 3 product descriptions:
Doc 1: "wireless headphones bluetooth"
Doc 2: "noise cancelling headphones"
Doc 3: "wireless earbuds noise"
You build a lookup table mapping each word to the list of documents containing it:
wireless → [Doc 1, Doc 3]
headphones → [Doc 1, Doc 2]
bluetooth → [Doc 1]
noise → [Doc 2, Doc 3]
cancelling → [Doc 2]
earbuds → [Doc 3]
This is the inverted index. Each word is a key. The value is the list of document IDs that contain it.
Answering a Query#
User searches "wireless headphones". You look up both words:
Then take the intersection — documents that appear in both lists:
That's the entire lookup. No document scanning. No string comparison across 50 million rows. Just a hash map lookup followed by a set intersection.
Why this is fast
The lookup is O(1) — you go directly to the word in the index. The intersection is proportional to the size of the result lists, not the total document count. Even with 1 billion documents, the lookup time doesn't grow.
Compare this to SQL: O(n) full table scan every time, where n = total rows. The inverted index reduces this to effectively O(1) for the lookup phase.
The "Inverted" Part#
The name comes from the direction of the mapping:
It's "inverted" because you've flipped the natural direction. Instead of starting with a document and listing its words, you start with a word and list its documents.
What Gets Stored Per Entry#
In practice, each entry in the inverted index stores more than just document IDs. It also stores:
Positions matter for phrase queries — if a user searches "noise cancelling" as a phrase (not just both words anywhere), the index can verify that "noise" appears immediately before "cancelling" in the same document.
This is the core data structure that every search engine — Elasticsearch, Solr, Lucene, Google — is built on.