Senior 4 min · June 25, 2026

Design Google Maps: The System Behind Real-Time Global Navigation

Q: How does Google Maps calculate route time so fast?

It uses Contraction Hierarchies, a precomputed graph that shortcuts long distances. The query runs bidirectional Dijkstra on this contracted graph, which is orders of magnitude smaller than the full road network.

Q: What's the difference between Google Maps and OpenStreetMap routing?

Google Maps uses proprietary traffic data from 1B+ devices and a global Contraction Hierarchy. OpenStreetMap relies on community data and OSRM, which is slower and lacks real-time traffic.

Q: How do I implement real-time traffic in my app?

Ingest GPS pings into Kafka, aggregate by road segment in 5-minute windows using Apache Beam, and store results in a key-value store. Serve via a CDN with short TTL. Use historical baselines for roads with sparse data.

Q: What happens when Google Maps traffic data is wrong?

The system uses multiple sources (GPS, sensors, reports) and cross-validates. If a source is anomalous, it's downweighted. For example, if GPS pings drop suddenly, the system falls back to historical data to avoid clearing a real traffic jam.

System design of Google Maps: geocoding, routing, traffic, and real-time updates at planetary scale.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Google Maps uses a tile-based map rendering system, a graph of road segments for routing, and real-time data pipelines for traffic. The routing engine runs Dijkstra/A* on a precomputed contraction hierarchy to return routes in milliseconds.

✦ Definition~90s read

What is Design Google Maps?

Google Maps is a distributed system that ingests map data, geocodes addresses, computes routes, and serves real-time traffic overlays to billions of users. It combines graph databases, spatial indexing, and stream processing.

★

Think of Google Maps as a giant puzzle where every road is a puzzle piece.

Plain-English First

Think of Google Maps as a giant puzzle where every road is a puzzle piece. The pieces are stored in a massive grid (tiles). When you ask for directions, the system builds a custom path by connecting pieces, but it has already pre-solved the most common connections (contraction hierarchies) so it can give you an answer instantly. Traffic data comes from a separate system that watches how fast the pieces are moving.

Here's what everyone gets wrong: Google Maps doesn't compute your route from scratch. That would take seconds, not milliseconds. The secret is a precomputed hierarchy that shortcuts the entire graph. If you think it's just Dijkstra on OpenStreetMap data, you're missing the entire production playbook. The real challenge isn't the algorithm — it's the data. 20 petabytes of map data, updated in real-time from satellites, street view cars, and user reports. One stale road closure and you reroute an entire city into gridlock. By the end of this, you'll understand the architecture that serves 1 billion monthly active users with sub-100ms routing queries. You'll know how they handle traffic spikes during earthquakes, how they version map data without downtime, and why your GPS sometimes thinks you're in a river.

Why Tiles? The Map as a Distributed Cache

Before tiles, maps were monolithic images. Every zoom level required a new image. Google Maps popularized the tile grid: each tile is a 256x256 PNG at a specific zoom level. The key insight: tiles are immutable and cacheable. A tile for zoom level 12 at coordinates (1234,567) never changes unless the underlying map data changes. This allows aggressive CDN caching. The tile server just stitches vector data into PNGs on the fly, but the real magic is the tile pyramid: precompute tiles for popular zoom levels and serve from blob storage. For zoom levels 0-10, tiles are static. For 11-20, they're generated on demand and cached. The failure mode: a tile stampede when a new map version is deployed. All CDN caches invalidate simultaneously, and the tile server gets hammered. Mitigation: staggered cache invalidation and pre-warming.

TileServer.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified tile server logic
class TileServer {
    // Cache: LRU with 10k entries per node
    private final Cache<TileKey, byte[]> tileCache = CacheBuilder.newBuilder()
        .maximumSize(10000)
        .expireAfterWrite(1, TimeUnit.HOURS)
        .build();

    public byte[] getTile(int zoom, int x, int y) {
        TileKey key = new TileKey(zoom, x, y);
        byte[] cached = tileCache.getIfPresent(key);
        if (cached != null) return cached;

        // Check blob storage (e.g., GCS) with CDN fallback
        byte[] tile = blobStorage.get(key);
        if (tile == null) {
            tile = renderTile(zoom, x, y); // expensive
            blobStorage.put(key, tile);
        }
        tileCache.put(key, tile);
        return tile;
    }
}

Output

Returns PNG bytes for tile. Cache hit ratio >99% for popular tiles.

Production Trap: Tile Stampede

When you deploy a new map version, all CDN caches invalidate. The tile server sees 100x traffic. Mitigate with staggered rollout and pre-warming: generate tiles for top 10% of requests before cutover.

thecodeforge.io

Google Maps System Architecture Overview

Design Google Maps

Geocoding: From Address to Lat/Lng in Milliseconds

Geocoding is the reverse of what most people think. You don't parse the address and look up coordinates. You tokenize the address into a hierarchy: country, state, city, street, number. Each level is a key in a sorted string table (SSTable). The trick: use a trie for prefix matching. When someone types '1600 Amphitheatre Parkway', the system returns candidates in O(k) time where k is address length. The gotcha: ambiguous addresses. 'Main Street' exists in 1000 cities. The system uses geohash of the user's IP to bias results. The failure mode: a geocoding query that matches millions of results (e.g., '1' in Manhattan). The fix: limit results to 5 and require more specificity.

Geocoder.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified geocoding with trie
class Geocoder {
    private final TrieNode root = new TrieNode();

    public List<GeoResult> geocode(String address) {
        // Tokenize and traverse trie
        String[] tokens = address.toLowerCase().split(",\\s*|\\s+");
        TrieNode node = root;
        for (String token : tokens) {
            node = node.children.get(token);
            if (node == null) return Collections.emptyList();
        }
        // Return top 5 results by population bias
        return node.results.stream()
            .sorted(Comparator.comparingInt(r -> -r.population))
            .limit(5)
            .collect(Collectors.toList());
    }
}

Output

List of up to 5 GeoResult objects with lat/lng and address.

Senior Shortcut: Geohash Bias

Use the user's IP geohash prefix to filter results. For example, if user is in geohash '9q8', only return results with geohash starting with '9q8'. This reduces ambiguity by 90%.

Routing: The Contraction Hierarchy Secret

Dijkstra on a graph with 500 million nodes is too slow. Google uses Contraction Hierarchies (CH). The idea: precompute 'shortcuts' between important nodes (highways, intersections). During query, you run bidirectional Dijkstra on the contracted graph. The result: sub-millisecond routing for long distances. The trade-off: precomputation takes hours and must be updated when roads change. The gotcha: CH doesn't handle real-time traffic well. Traffic changes edge weights, invalidating shortcuts. Solution: run CH on the static graph, then apply traffic as a post-processing penalty. This is why traffic-aware routes are slightly slower to compute.

Router.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified CH-based router
class Router {
    private final Graph contractedGraph; // precomputed

    public Route computeRoute(Node from, Node to, TrafficModel traffic) {
        // Bidirectional Dijkstra on contracted graph
        PriorityQueue<Node> forwardQueue = new PriorityQueue<>();
        PriorityQueue<Node> backwardQueue = new PriorityQueue<>();
        // ... standard bidirectional Dijkstra
        // After path found, apply traffic penalties
        return applyTraffic(path, traffic);
    }

    private Route applyTraffic(Route path, TrafficModel traffic) {
        for (Edge edge : path.edges) {
            double penalty = traffic.getDelay(edge.id); // seconds
            edge.cost += penalty;
        }
        return path;
    }
}

Output

Route object with list of edges and total travel time including traffic.

Interview Gold: Why Not A*?

A with Euclidean heuristic works for small graphs but fails on road networks with obstacles (rivers, mountains). CH gives exact results and is faster for long distances. A is used for local routing within a city.

thecodeforge.io

Contraction Hierarchy Query Flow

Design Google Maps

Real-Time Traffic: The Stream Processing Pipeline

Traffic data comes from three sources: 1) GPS pings from Android phones (anonymized), 2) road sensors, 3) incident reports. The pipeline ingests millions of events per second into Kafka. A stream processor (Apache Beam) aggregates events into 5-minute windows per road segment. The output is a 'speed' value for each segment. The challenge: handling sparse data. Rural roads may have zero pings. Solution: use historical baselines and interpolate from nearby segments. The failure mode: a traffic jam causes all phones to stop moving, reducing pings. The system interprets this as 'no traffic' and clears the jam. Fix: use a minimum ping threshold; if pings drop below threshold, fall back to historical data.

TrafficPipeline.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified traffic aggregation
Pipeline p = Pipeline.create();
p.apply("ReadPings", KafkaIO.read().withTopic("gps-pings"))
 .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
 .apply("Aggregate", Mean.<String>perKey()) // key = road segment ID
 .apply("WriteTraffic", KafkaIO.write().withTopic("traffic-updates"));
p.run();

Output

Kafka topic 'traffic-updates' contains key-value pairs: segmentId -> average speed (km/h).

Never Do This: Ignore Sparse Data

If a road segment has <10 pings in a window, don't output a speed. Use historical baseline instead. Otherwise, you'll get random noise that causes phantom traffic jams.

Map Data Versioning: How to Update Without Downtime

Map data changes daily: new roads, closed roads, changed speed limits. Google uses a versioned tile system. Each tile has a version number. When new data is published, tiles are regenerated with a new version. The client requests tiles with a version parameter. If the version is stale, the server returns a redirect to the new version. This allows gradual rollout: 1% of users see new map, then 10%, etc. The gotcha: version mismatch between tiles. A user might see a new tile for one area and an old tile for another, causing visual seams. Fix: version per region, not per tile. All tiles in a region share the same version.

VersionedTile.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Versioned tile request
// Client request: GET /tile/12/1234/567?version=20240301
// Server response:
// If version matches current: 200 OK with tile
// If version is old: 302 Redirect to /tile/12/1234/567?version=20240315
class TileController {
    @GetMapping("/tile/{z}/{x}/{y}")
    public ResponseEntity<byte[]> getTile(
            @PathVariable int z, @PathVariable int x, @PathVariable int y,
            @RequestParam(defaultValue = "0") int version) {
        int currentVersion = versionManager.getCurrentVersion(z, x, y);
        if (version != currentVersion) {
            return ResponseEntity.status(HttpStatus.FOUND)
                .location(URI.create("/tile/" + z + "/" + x + "/" + y + "?version=" + currentVersion))
                .build();
        }
        byte[] tile = tileService.getTile(z, x, y);
        return ResponseEntity.ok(tile);
    }
}

Output

HTTP 200 with tile bytes or HTTP 302 redirect to new version.

Senior Shortcut: Staged Rollout

Use a feature flag to control which users see the new map version. Start with 1% of users, monitor error rates, then ramp up. Rollback is instant by flipping the flag.

thecodeforge.io

Tile Versioning vs. Monolithic Update

Design Google Maps

Handling Traffic Spikes: The Super Bowl Problem

When a major event happens (earthquake, Super Bowl), traffic queries spike 100x. The routing service must not fall over. Solution: 1) Use a CDN for static assets (tiles). 2) For routing, use a queue with backpressure. If queue depth exceeds threshold, return cached routes for popular origins/destinations. 3) Degrade gracefully: disable traffic-aware routing during peak. The failure mode: the queue grows unbounded, causing OOM. Fix: bounded queue with rejection. Return HTTP 503 with Retry-After header. The client retries with exponential backoff.

RoutingQueue.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Bounded queue with backpressure
ExecutorService executor = new ThreadPoolExecutor(
    10, 100, 60, TimeUnit.SECONDS,
    new ArrayBlockingQueue<>(1000), // bounded queue
    new ThreadPoolExecutor.AbortPolicy() // reject if full
);

public CompletableFuture<Route> routeAsync(RouteRequest request) {
    return CompletableFuture.supplyAsync(() -> {
        if (executor.isShutdown()) throw new RejectedExecutionException();
        return router.computeRoute(request);
    }, executor);
}

Output

CompletableFuture that completes with Route or throws RejectedExecutionException.

Production Trap: Unbounded Queue

Never use an unbounded queue (e.g., LinkedBlockingQueue without capacity). Under load, it grows until OOM. Always use a bounded queue with a rejection policy.

When Not to Use This Architecture

If you're building a navigation app for a single city with <100k users, you don't need CH or distributed tile servers. A single PostGIS database with pgRouting and a tile server like TileServer GL will work fine. The Google Maps architecture is overkill for small scale. The cost of maintaining a CH precomputation pipeline and a Kafka-based traffic pipeline is justified only at planetary scale. For most startups, use Mapbox or OpenStreetMap with a managed service. Don't reinvent the wheel.

Senior Shortcut: Start Simple

For MVP, use Mapbox GL JS with a backend that queries OSRM for routing. That's a weekend project. Only build your own when you hit 10M daily active users.

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

Routing service containers in one region were OOM-killed every 30 minutes during peak hours.

Assumption

Assumed a memory leak in the routing algorithm. Spent days profiling heap dumps.

Root cause

The graph database client had a default connection pool of 100 connections per container. Each connection cached 50MB of precomputed route data. With 4GB heap, 80 connections consumed 4GB, leaving no room for actual routing. The OOM killer struck when a burst of requests triggered connection growth.

Fix

Reduced connection pool to 20 per container. Set connection idle timeout to 60 seconds. Added circuit breaker to fail fast when pool exhausted. Memory usage dropped to 1.5GB.

Key lesson

Always profile connection pool memory overhead before tuning heap size.
A 4GB container with 100 connections is a ticking bomb.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

Routing returns 'No route found' for valid locations

→

Fix

1. Check graph database connectivity. 2. Verify that both origin and destination nodes exist in the graph. 3. Ensure contraction hierarchy is built for the region. 4. Check for disconnected components due to missing edges.

Symptom · 02

Tile server returning 404s for a region

→

Fix

1. Verify tile generation pipeline completed for that region. 2. Check blob storage for missing tiles. 3. Trigger manual tile regeneration for the affected zoom levels.

Symptom · 03

Traffic data showing 0 speed for all roads

→

Fix

1. Check Kafka consumer lag for traffic pipeline. 2. Verify GPS ping ingestion rate. 3. Ensure stream processor is not stuck on a bad event. 4. Restart the stream processor if lag > 10 minutes.

★ Google Maps System Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

Routing 5xx errors−

Immediate action

Check routing service CPU and memory

Commands

kubectl top pods -l app=routing

kubectl logs -l app=routing --tail=100

Fix now

kubectl scale deployment routing --replicas=10

Tile loading slow+

Traffic data stale+

Geocoding returns empty results+

Feature	Google Maps	OpenStreetMap + OSRM
Routing speed	<100ms for global routes	~500ms for city-scale
Traffic data	Real-time from 1B+ devices	Requires third-party feed
Map data freshness	Daily updates	Community-driven, variable
Cost	Pay per API call	Free but self-hosted infrastructure

Key takeaways

Google Maps uses Contraction Hierarchies for sub-millisecond routing, not plain Dijkstra.

Tile pyramids with versioning enable zero-downtime map updates and aggressive CDN caching.

Real-time traffic requires a stream processing pipeline with fallback to historical data for sparse roads.

Always use bounded queues and backpressure to handle traffic spikes gracefully.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does Google Maps handle routing at global scale without OOM?

Q02SENIOR

When would you choose A* over Contraction Hierarchies for routing?

Q03SENIOR

What happens when a traffic spike causes routing queue to overflow?

Q04JUNIOR

What is a tile pyramid and why is it used?

Q05SENIOR

How would you debug a sudden spike in routing latency?

Q06SENIOR

Design a system to serve real-time traffic updates to 1 billion users.

Q01 of 06SENIOR

How does Google Maps handle routing at global scale without OOM?

ANSWER

They use Contraction Hierarchies (CH) to precompute shortcuts. During query, bidirectional Dijkstra runs on the contracted graph, which has far fewer nodes. The precomputation is done offline and takes hours, but queries are sub-millisecond.

FAQ · 4 QUESTIONS

Frequently Asked Questions

How does Google Maps calculate route time so fast?

What's the difference between Google Maps and OpenStreetMap routing?

How do I implement real-time traffic in my app?

What happens when Google Maps traffic data is wrong?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Real World. Mark it forged?

4 min read · try the examples if you haven't