Design Google Maps: The System Behind Real-Time Global Navigation
System design of Google Maps: geocoding, routing, traffic, and real-time updates at planetary scale.
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
Google Maps uses a tile-based map rendering system, a graph of road segments for routing, and real-time data pipelines for traffic. The routing engine runs Dijkstra/A* on a precomputed contraction hierarchy to return routes in milliseconds.
Think of Google Maps as a giant puzzle where every road is a puzzle piece. The pieces are stored in a massive grid (tiles). When you ask for directions, the system builds a custom path by connecting pieces, but it has already pre-solved the most common connections (contraction hierarchies) so it can give you an answer instantly. Traffic data comes from a separate system that watches how fast the pieces are moving.
Here's what everyone gets wrong: Google Maps doesn't compute your route from scratch. That would take seconds, not milliseconds. The secret is a precomputed hierarchy that shortcuts the entire graph. If you think it's just Dijkstra on OpenStreetMap data, you're missing the entire production playbook. The real challenge isn't the algorithm — it's the data. 20 petabytes of map data, updated in real-time from satellites, street view cars, and user reports. One stale road closure and you reroute an entire city into gridlock. By the end of this, you'll understand the architecture that serves 1 billion monthly active users with sub-100ms routing queries. You'll know how they handle traffic spikes during earthquakes, how they version map data without downtime, and why your GPS sometimes thinks you're in a river.
Why Tiles? The Map as a Distributed Cache
Before tiles, maps were monolithic images. Every zoom level required a new image. Google Maps popularized the tile grid: each tile is a 256x256 PNG at a specific zoom level. The key insight: tiles are immutable and cacheable. A tile for zoom level 12 at coordinates (1234,567) never changes unless the underlying map data changes. This allows aggressive CDN caching. The tile server just stitches vector data into PNGs on the fly, but the real magic is the tile pyramid: precompute tiles for popular zoom levels and serve from blob storage. For zoom levels 0-10, tiles are static. For 11-20, they're generated on demand and cached. The failure mode: a tile stampede when a new map version is deployed. All CDN caches invalidate simultaneously, and the tile server gets hammered. Mitigation: staggered cache invalidation and pre-warming.
Geocoding: From Address to Lat/Lng in Milliseconds
Geocoding is the reverse of what most people think. You don't parse the address and look up coordinates. You tokenize the address into a hierarchy: country, state, city, street, number. Each level is a key in a sorted string table (SSTable). The trick: use a trie for prefix matching. When someone types '1600 Amphitheatre Parkway', the system returns candidates in O(k) time where k is address length. The gotcha: ambiguous addresses. 'Main Street' exists in 1000 cities. The system uses geohash of the user's IP to bias results. The failure mode: a geocoding query that matches millions of results (e.g., '1' in Manhattan). The fix: limit results to 5 and require more specificity.
Routing: The Contraction Hierarchy Secret
Dijkstra on a graph with 500 million nodes is too slow. Google uses Contraction Hierarchies (CH). The idea: precompute 'shortcuts' between important nodes (highways, intersections). During query, you run bidirectional Dijkstra on the contracted graph. The result: sub-millisecond routing for long distances. The trade-off: precomputation takes hours and must be updated when roads change. The gotcha: CH doesn't handle real-time traffic well. Traffic changes edge weights, invalidating shortcuts. Solution: run CH on the static graph, then apply traffic as a post-processing penalty. This is why traffic-aware routes are slightly slower to compute.
Real-Time Traffic: The Stream Processing Pipeline
Traffic data comes from three sources: 1) GPS pings from Android phones (anonymized), 2) road sensors, 3) incident reports. The pipeline ingests millions of events per second into Kafka. A stream processor (Apache Beam) aggregates events into 5-minute windows per road segment. The output is a 'speed' value for each segment. The challenge: handling sparse data. Rural roads may have zero pings. Solution: use historical baselines and interpolate from nearby segments. The failure mode: a traffic jam causes all phones to stop moving, reducing pings. The system interprets this as 'no traffic' and clears the jam. Fix: use a minimum ping threshold; if pings drop below threshold, fall back to historical data.
Map Data Versioning: How to Update Without Downtime
Map data changes daily: new roads, closed roads, changed speed limits. Google uses a versioned tile system. Each tile has a version number. When new data is published, tiles are regenerated with a new version. The client requests tiles with a version parameter. If the version is stale, the server returns a redirect to the new version. This allows gradual rollout: 1% of users see new map, then 10%, etc. The gotcha: version mismatch between tiles. A user might see a new tile for one area and an old tile for another, causing visual seams. Fix: version per region, not per tile. All tiles in a region share the same version.
Handling Traffic Spikes: The Super Bowl Problem
When a major event happens (earthquake, Super Bowl), traffic queries spike 100x. The routing service must not fall over. Solution: 1) Use a CDN for static assets (tiles). 2) For routing, use a queue with backpressure. If queue depth exceeds threshold, return cached routes for popular origins/destinations. 3) Degrade gracefully: disable traffic-aware routing during peak. The failure mode: the queue grows unbounded, causing OOM. Fix: bounded queue with rejection. Return HTTP 503 with Retry-After header. The client retries with exponential backoff.
When Not to Use This Architecture
If you're building a navigation app for a single city with <100k users, you don't need CH or distributed tile servers. A single PostGIS database with pgRouting and a tile server like TileServer GL will work fine. The Google Maps architecture is overkill for small scale. The cost of maintaining a CH precomputation pipeline and a Kafka-based traffic pipeline is justified only at planetary scale. For most startups, use Mapbox or OpenStreetMap with a managed service. Don't reinvent the wheel.
The 4GB Container That Kept Dying
- Always profile connection pool memory overhead before tuning heap size.
- A 4GB container with 100 connections is a ticking bomb.
kubectl top pods -l app=routingkubectl logs -l app=routing --tail=100Key takeaways
Interview Questions on This Topic
How does Google Maps handle routing at global scale without OOM?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
That's Real World. Mark it forged?
4 min read · try the examples if you haven't