Geohash — Encoding 2D Location as a String#
The core idea#
What if we divide the entire world into a grid, give each cell a name, and store that name as an indexed column? Nearby locations would share the same cell name — or at least a common prefix — and finding nearby drivers becomes a simple prefix query on a B+ Tree index.
That's exactly what Geohash does.
How Geohash works — recursive splitting#
Start with a box representing the entire Earth. Split it in half vertically. Left half = "a", right half = "b":
Mumbai is on the right side → "b"
Split "b" in half horizontally. Top = "ba", bottom = "bb":
Mumbai is in the top half → "ba"
Split "ba" in half vertically again:
Mumbai is on the right → "bab"
Keep splitting. Every split makes the cell smaller and more precise. After 10-12 splits, a cell covers roughly a city block. Mumbai ends up with a code like "bab123" — a short string that encodes its precise location on Earth.
Why prefix matching works#
Two drivers both in Mumbai will both start with "bab...". A driver in Delhi will start with something completely different — "cde...". A driver in New York will start with "xyz...".
The longer the shared prefix, the closer the locations:
"bab" → somewhere in Mumbai region (large area)
"bab1" → narrower area within Mumbai
"bab12" → neighbourhood level
"bab123" → city block level
So instead of computing distance against all 1 million drivers, you:
- Compute the user's Geohash →
"bab12" - Run a prefix query:
This hits a normal B+ Tree index on the geohash column. O(log n). Only drivers in the same cell are returned — maybe 50-100 instead of 1 million.
The boundary edge case#
There's one problem. Two drivers can be physically 10 metres apart but sit on opposite sides of a cell boundary — giving them completely different Geohash prefixes:
d1 and d2 are right next to each other. But a prefix query on "ba" misses d2 entirely.
The fix — always query 8 neighbouring cells#
You don't just query your own cell. You query your cell + all 8 surrounding cells:
+-----+-----+-----+
| ba1 | ba2 | bb1 |
+-----+-----+-----+
| ba3 | YOU | bb2 |
+-----+-----+-----+
| ba4 | ba5 | bb3 |
+-----+-----+-----+
Geohash libraries expose a neighbors() function that returns the 8 surrounding cell codes in O(1):
import geohash
center = geohash.encode(19.08, 72.86, precision=5)
# → "bab12"
neighbours = geohash.neighbors("bab12")
# → ["bab13", "bab11", "bab09", "bab15", "bab0f", "bab0g", "bab18", "bab19"]
You don't need to know how the math works — the library handles it. You just query all 9 cells:
SELECT * FROM drivers
WHERE geohash IN ('bab12', 'bab13', 'bab11', 'bab09', 'bab15', 'bab0f', 'bab0g', 'bab18', 'bab19');
9 index lookups on the B+ Tree. Fast.
The full Uber query flow#
User opens app at (19.08, 72.86)
↓
Compute user's Geohash → "bab12"
↓
Get 8 neighbouring cell codes via geohash.neighbors()
↓
Query: WHERE geohash IN (9 cells) → hits B+ Tree index → returns ~50-100 drivers
↓
For those 50-100 drivers, compute exact distance
↓
Filter to drivers within 2km
↓
Return results to user
The expensive distance calculation only runs on ~50-100 candidates, not 1 million. That's the win.
Store both raw lat/lng and Geohash
Store raw lat and lng for display and exact distance calculation. Store the Geohash for proximity indexing. Both are needed.