Senior 4 min · June 25, 2026

Design Google Maps: The System Behind Real-Time Global Navigation

System design of Google Maps: geocoding, routing, traffic, and real-time updates at planetary scale.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

Google Maps uses a tile-based map rendering system, a graph of road segments for routing, and real-time data pipelines for traffic. The routing engine runs Dijkstra/A* on a precomputed contraction hierarchy to return routes in milliseconds.

✦ Definition~90s read
What is Design Google Maps?

Google Maps is a distributed system that ingests map data, geocodes addresses, computes routes, and serves real-time traffic overlays to billions of users. It combines graph databases, spatial indexing, and stream processing.

Think of Google Maps as a giant puzzle where every road is a puzzle piece.
Plain-English First

Think of Google Maps as a giant puzzle where every road is a puzzle piece. The pieces are stored in a massive grid (tiles). When you ask for directions, the system builds a custom path by connecting pieces, but it has already pre-solved the most common connections (contraction hierarchies) so it can give you an answer instantly. Traffic data comes from a separate system that watches how fast the pieces are moving.

Here's what everyone gets wrong: Google Maps doesn't compute your route from scratch. That would take seconds, not milliseconds. The secret is a precomputed hierarchy that shortcuts the entire graph. If you think it's just Dijkstra on OpenStreetMap data, you're missing the entire production playbook. The real challenge isn't the algorithm — it's the data. 20 petabytes of map data, updated in real-time from satellites, street view cars, and user reports. One stale road closure and you reroute an entire city into gridlock. By the end of this, you'll understand the architecture that serves 1 billion monthly active users with sub-100ms routing queries. You'll know how they handle traffic spikes during earthquakes, how they version map data without downtime, and why your GPS sometimes thinks you're in a river.

Why Tiles? The Map as a Distributed Cache

Before tiles, maps were monolithic images. Every zoom level required a new image. Google Maps popularized the tile grid: each tile is a 256x256 PNG at a specific zoom level. The key insight: tiles are immutable and cacheable. A tile for zoom level 12 at coordinates (1234,567) never changes unless the underlying map data changes. This allows aggressive CDN caching. The tile server just stitches vector data into PNGs on the fly, but the real magic is the tile pyramid: precompute tiles for popular zoom levels and serve from blob storage. For zoom levels 0-10, tiles are static. For 11-20, they're generated on demand and cached. The failure mode: a tile stampede when a new map version is deployed. All CDN caches invalidate simultaneously, and the tile server gets hammered. Mitigation: staggered cache invalidation and pre-warming.

TileServer.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// io.thecodeforge — System Design tutorial

// Simplified tile server logic
class TileServer {
    // Cache: LRU with 10k entries per node
    private final Cache<TileKey, byte[]> tileCache = CacheBuilder.newBuilder()
        .maximumSize(10000)
        .expireAfterWrite(1, TimeUnit.HOURS)
        .build();

    public byte[] getTile(int zoom, int x, int y) {
        TileKey key = new TileKey(zoom, x, y);
        byte[] cached = tileCache.getIfPresent(key);
        if (cached != null) return cached;

        // Check blob storage (e.g., GCS) with CDN fallback
        byte[] tile = blobStorage.get(key);
        if (tile == null) {
            tile = renderTile(zoom, x, y); // expensive
            blobStorage.put(key, tile);
        }
        tileCache.put(key, tile);
        return tile;
    }
}
Output
Returns PNG bytes for tile. Cache hit ratio >99% for popular tiles.
Production Trap: Tile Stampede
When you deploy a new map version, all CDN caches invalidate. The tile server sees 100x traffic. Mitigate with staggered rollout and pre-warming: generate tiles for top 10% of requests before cutover.
Google Maps System Architecture Overview THECODEFORGE.IO Google Maps System Architecture Overview From geocoding to routing with real-time traffic and tile caching Tile-Based Map Cache Pre-rendered tiles as distributed cache Geocoding Service Address to lat/lng in milliseconds Contraction Hierarchy Routing Precomputed shortest path algorithm Real-Time Traffic Pipeline Stream processing for live data Map Data Versioning Zero-downtime updates via versioned layers ⚠ Traffic spikes (e.g., Super Bowl) can overwhelm routing Use auto-scaling and pre-warmed tile caches THECODEFORGE.IO
thecodeforge.io
Google Maps System Architecture Overview
Design Google Maps

Geocoding: From Address to Lat/Lng in Milliseconds

Geocoding is the reverse of what most people think. You don't parse the address and look up coordinates. You tokenize the address into a hierarchy: country, state, city, street, number. Each level is a key in a sorted string table (SSTable). The trick: use a trie for prefix matching. When someone types '1600 Amphitheatre Parkway', the system returns candidates in O(k) time where k is address length. The gotcha: ambiguous addresses. 'Main Street' exists in 1000 cities. The system uses geohash of the user's IP to bias results. The failure mode: a geocoding query that matches millions of results (e.g., '1' in Manhattan). The fix: limit results to 5 and require more specificity.

Geocoder.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// io.thecodeforge — System Design tutorial

// Simplified geocoding with trie
class Geocoder {
    private final TrieNode root = new TrieNode();

    public List<GeoResult> geocode(String address) {
        // Tokenize and traverse trie
        String[] tokens = address.toLowerCase().split(",\\s*|\\s+");
        TrieNode node = root;
        for (String token : tokens) {
            node = node.children.get(token);
            if (node == null) return Collections.emptyList();
        }
        // Return top 5 results by population bias
        return node.results.stream()
            .sorted(Comparator.comparingInt(r -> -r.population))
            .limit(5)
            .collect(Collectors.toList());
    }
}
Output
List of up to 5 GeoResult objects with lat/lng and address.
Senior Shortcut: Geohash Bias
Use the user's IP geohash prefix to filter results. For example, if user is in geohash '9q8', only return results with geohash starting with '9q8'. This reduces ambiguity by 90%.

Routing: The Contraction Hierarchy Secret

Dijkstra on a graph with 500 million nodes is too slow. Google uses Contraction Hierarchies (CH). The idea: precompute 'shortcuts' between important nodes (highways, intersections). During query, you run bidirectional Dijkstra on the contracted graph. The result: sub-millisecond routing for long distances. The trade-off: precomputation takes hours and must be updated when roads change. The gotcha: CH doesn't handle real-time traffic well. Traffic changes edge weights, invalidating shortcuts. Solution: run CH on the static graph, then apply traffic as a post-processing penalty. This is why traffic-aware routes are slightly slower to compute.

Router.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// io.thecodeforge — System Design tutorial

// Simplified CH-based router
class Router {
    private final Graph contractedGraph; // precomputed

    public Route computeRoute(Node from, Node to, TrafficModel traffic) {
        // Bidirectional Dijkstra on contracted graph
        PriorityQueue<Node> forwardQueue = new PriorityQueue<>();
        PriorityQueue<Node> backwardQueue = new PriorityQueue<>();
        // ... standard bidirectional Dijkstra
        // After path found, apply traffic penalties
        return applyTraffic(path, traffic);
    }

    private Route applyTraffic(Route path, TrafficModel traffic) {
        for (Edge edge : path.edges) {
            double penalty = traffic.getDelay(edge.id); // seconds
            edge.cost += penalty;
        }
        return path;
    }
}
Output
Route object with list of edges and total travel time including traffic.
Interview Gold: Why Not A*?
A with Euclidean heuristic works for small graphs but fails on road networks with obstacles (rivers, mountains). CH gives exact results and is faster for long distances. A is used for local routing within a city.
Contraction Hierarchy Query FlowTHECODEFORGE.IOContraction Hierarchy Query FlowBidirectional Dijkstra on a precomputed shortcut graphPrecompute ShortcutsAdd edges between important nodes (highways)Bidirectional DijkstraRun forward/backward on contracted graphMeet in MiddleCombine paths via shared shortcutsReturn RouteSub-millisecond response with full path⚠ Without CH, Dijkstra on 500M nodes takes secondsTHECODEFORGE.IO
thecodeforge.io
Contraction Hierarchy Query Flow
Design Google Maps

Real-Time Traffic: The Stream Processing Pipeline

Traffic data comes from three sources: 1) GPS pings from Android phones (anonymized), 2) road sensors, 3) incident reports. The pipeline ingests millions of events per second into Kafka. A stream processor (Apache Beam) aggregates events into 5-minute windows per road segment. The output is a 'speed' value for each segment. The challenge: handling sparse data. Rural roads may have zero pings. Solution: use historical baselines and interpolate from nearby segments. The failure mode: a traffic jam causes all phones to stop moving, reducing pings. The system interprets this as 'no traffic' and clears the jam. Fix: use a minimum ping threshold; if pings drop below threshold, fall back to historical data.

TrafficPipeline.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
// io.thecodeforge — System Design tutorial

// Simplified traffic aggregation
Pipeline p = Pipeline.create();
p.apply("ReadPings", KafkaIO.read().withTopic("gps-pings"))
 .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
 .apply("Aggregate", Mean.<String>perKey()) // key = road segment ID
 .apply("WriteTraffic", KafkaIO.write().withTopic("traffic-updates"));
p.run();
Output
Kafka topic 'traffic-updates' contains key-value pairs: segmentId -> average speed (km/h).
Never Do This: Ignore Sparse Data
If a road segment has <10 pings in a window, don't output a speed. Use historical baseline instead. Otherwise, you'll get random noise that causes phantom traffic jams.

Map Data Versioning: How to Update Without Downtime

Map data changes daily: new roads, closed roads, changed speed limits. Google uses a versioned tile system. Each tile has a version number. When new data is published, tiles are regenerated with a new version. The client requests tiles with a version parameter. If the version is stale, the server returns a redirect to the new version. This allows gradual rollout: 1% of users see new map, then 10%, etc. The gotcha: version mismatch between tiles. A user might see a new tile for one area and an old tile for another, causing visual seams. Fix: version per region, not per tile. All tiles in a region share the same version.

VersionedTile.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — System Design tutorial

// Versioned tile request
// Client request: GET /tile/12/1234/567?version=20240301
// Server response:
// If version matches current: 200 OK with tile
// If version is old: 302 Redirect to /tile/12/1234/567?version=20240315
class TileController {
    @GetMapping("/tile/{z}/{x}/{y}")
    public ResponseEntity<byte[]> getTile(
            @PathVariable int z, @PathVariable int x, @PathVariable int y,
            @RequestParam(defaultValue = "0") int version) {
        int currentVersion = versionManager.getCurrentVersion(z, x, y);
        if (version != currentVersion) {
            return ResponseEntity.status(HttpStatus.FOUND)
                .location(URI.create("/tile/" + z + "/" + x + "/" + y + "?version=" + currentVersion))
                .build();
        }
        byte[] tile = tileService.getTile(z, x, y);
        return ResponseEntity.ok(tile);
    }
}
Output
HTTP 200 with tile bytes or HTTP 302 redirect to new version.
Senior Shortcut: Staged Rollout
Use a feature flag to control which users see the new map version. Start with 1% of users, monitor error rates, then ramp up. Rollback is instant by flipping the flag.
Tile Versioning vs. Monolithic UpdateTHECODEFORGE.IOTile Versioning vs. Monolithic UpdateZero-downtime map data deploymentVersioned TilesEach tile has a version numberClient requests with version paramStale tiles served from cacheNew tiles generated offlineMonolithic UpdateEntire map image regeneratedAll clients must re-downloadDowntime during regenerationCache invalidated globallyVersioned tiles enable gradual rollout and cache reuseTHECODEFORGE.IO
thecodeforge.io
Tile Versioning vs. Monolithic Update
Design Google Maps

Handling Traffic Spikes: The Super Bowl Problem

When a major event happens (earthquake, Super Bowl), traffic queries spike 100x. The routing service must not fall over. Solution: 1) Use a CDN for static assets (tiles). 2) For routing, use a queue with backpressure. If queue depth exceeds threshold, return cached routes for popular origins/destinations. 3) Degrade gracefully: disable traffic-aware routing during peak. The failure mode: the queue grows unbounded, causing OOM. Fix: bounded queue with rejection. Return HTTP 503 with Retry-After header. The client retries with exponential backoff.

RoutingQueue.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// io.thecodeforge — System Design tutorial

// Bounded queue with backpressure
ExecutorService executor = new ThreadPoolExecutor(
    10, 100, 60, TimeUnit.SECONDS,
    new ArrayBlockingQueue<>(1000), // bounded queue
    new ThreadPoolExecutor.AbortPolicy() // reject if full
);

public CompletableFuture<Route> routeAsync(RouteRequest request) {
    return CompletableFuture.supplyAsync(() -> {
        if (executor.isShutdown()) throw new RejectedExecutionException();
        return router.computeRoute(request);
    }, executor);
}
Output
CompletableFuture that completes with Route or throws RejectedExecutionException.
Production Trap: Unbounded Queue
Never use an unbounded queue (e.g., LinkedBlockingQueue without capacity). Under load, it grows until OOM. Always use a bounded queue with a rejection policy.

When Not to Use This Architecture

If you're building a navigation app for a single city with <100k users, you don't need CH or distributed tile servers. A single PostGIS database with pgRouting and a tile server like TileServer GL will work fine. The Google Maps architecture is overkill for small scale. The cost of maintaining a CH precomputation pipeline and a Kafka-based traffic pipeline is justified only at planetary scale. For most startups, use Mapbox or OpenStreetMap with a managed service. Don't reinvent the wheel.

Senior Shortcut: Start Simple
For MVP, use Mapbox GL JS with a backend that queries OSRM for routing. That's a weekend project. Only build your own when you hit 10M daily active users.
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
Routing service containers in one region were OOM-killed every 30 minutes during peak hours.
Assumption
Assumed a memory leak in the routing algorithm. Spent days profiling heap dumps.
Root cause
The graph database client had a default connection pool of 100 connections per container. Each connection cached 50MB of precomputed route data. With 4GB heap, 80 connections consumed 4GB, leaving no room for actual routing. The OOM killer struck when a burst of requests triggered connection growth.
Fix
Reduced connection pool to 20 per container. Set connection idle timeout to 60 seconds. Added circuit breaker to fail fast when pool exhausted. Memory usage dropped to 1.5GB.
Key lesson
  • Always profile connection pool memory overhead before tuning heap size.
  • A 4GB container with 100 connections is a ticking bomb.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Routing returns 'No route found' for valid locations
Fix
1. Check graph database connectivity. 2. Verify that both origin and destination nodes exist in the graph. 3. Ensure contraction hierarchy is built for the region. 4. Check for disconnected components due to missing edges.
Symptom · 02
Tile server returning 404s for a region
Fix
1. Verify tile generation pipeline completed for that region. 2. Check blob storage for missing tiles. 3. Trigger manual tile regeneration for the affected zoom levels.
Symptom · 03
Traffic data showing 0 speed for all roads
Fix
1. Check Kafka consumer lag for traffic pipeline. 2. Verify GPS ping ingestion rate. 3. Ensure stream processor is not stuck on a bad event. 4. Restart the stream processor if lag > 10 minutes.
★ Google Maps System Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Routing 5xx errors
Immediate action
Check routing service CPU and memory
Commands
kubectl top pods -l app=routing
kubectl logs -l app=routing --tail=100
Fix now
kubectl scale deployment routing --replicas=10
Tile loading slow+
Immediate action
Check CDN cache hit ratio
Commands
curl -s -o /dev/null -w "%{http_code}" https://tiles.example.com/12/1234/567.png
kubectl exec -it tile-server-0 -- curl localhost:8080/metrics | grep tile_cache
Fix now
Purge CDN cache for the region
Traffic data stale+
Immediate action
Check Kafka consumer lag
Commands
kubectl exec kafka-client -- kafka-consumer-groups --bootstrap-server kafka:9092 --group traffic-group --describe
kubectl logs -l app=traffic-processor --tail=50
Fix now
kubectl delete pod -l app=traffic-processor (forces restart)
Geocoding returns empty results+
Immediate action
Check geocoding service health
Commands
curl http://geocoder:8080/health
kubectl logs -l app=geocoder --tail=20
Fix now
kubectl rollout restart deployment geocoder
FeatureGoogle MapsOpenStreetMap + OSRM
Routing speed<100ms for global routes~500ms for city-scale
Traffic dataReal-time from 1B+ devicesRequires third-party feed
Map data freshnessDaily updatesCommunity-driven, variable
CostPay per API callFree but self-hosted infrastructure

Key takeaways

1
Google Maps uses Contraction Hierarchies for sub-millisecond routing, not plain Dijkstra.
2
Tile pyramids with versioning enable zero-downtime map updates and aggressive CDN caching.
3
Real-time traffic requires a stream processing pipeline with fallback to historical data for sparse roads.
4
Always use bounded queues and backpressure to handle traffic spikes gracefully.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Google Maps handle routing at global scale without OOM?
Q02SENIOR
When would you choose A* over Contraction Hierarchies for routing?
Q03SENIOR
What happens when a traffic spike causes routing queue to overflow?
Q04JUNIOR
What is a tile pyramid and why is it used?
Q05SENIOR
How would you debug a sudden spike in routing latency?
Q06SENIOR
Design a system to serve real-time traffic updates to 1 billion users.
Q01 of 06SENIOR

How does Google Maps handle routing at global scale without OOM?

ANSWER
They use Contraction Hierarchies (CH) to precompute shortcuts. During query, bidirectional Dijkstra runs on the contracted graph, which has far fewer nodes. The precomputation is done offline and takes hours, but queries are sub-millisecond.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How does Google Maps calculate route time so fast?
02
What's the difference between Google Maps and OpenStreetMap routing?
03
How do I implement real-time traffic in my app?
04
What happens when Google Maps traffic data is wrong?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Real World. Mark it forged?

4 min read · try the examples if you haven't

Previous
Design Zoom
33 / 40 · Real World
Next
Design Airbnb