Uber System Design — Cassandra Tombstone Staleness
Cassandra tombstones caused 10x dispatch delays in Uber's 2019 location blackout.
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
- Uber's backend is a microservices architecture handling 20M trips/day with 99.99% availability.
- Location tracking uses H3 geospatial indexing to store driver GPS pings sent every 4 seconds.
- Matching runs a real-time auction: finds nearby drivers via geohash prefix, then assigns based on ETA and surge.
- Surge pricing recalculates every 5 minutes from supply/demand curves, then broadcasts via push.
- Payment uses idempotency keys across sharded PostgreSQL shards with saga compensation on failures.
- Performance gotcha: Cassandra read-repair on stale replicas caused riders to see ghost drivers far away.
Imagine a city with thousands of taxi drivers all driving around, and millions of people raising their hands asking for a ride. Someone needs to constantly watch where every driver is, instantly find the closest one to each person, connect them, track the whole trip, and charge the right amount at the end — all in under 5 seconds, for millions of people at once. That 'someone' is the Uber backend. It's basically an incredibly fast, constantly-updated map crossed with a matchmaking engine crossed with a payment system — all stitched together without a single point of failure.
Uber processes roughly 20 million trips per day across 70+ countries. At peak hours in a city like New York, the system is simultaneously tracking hundreds of thousands of driver GPS pings per second, matching riders to drivers in under 2 seconds, calculating surge multipliers in real-time, and processing payments across dozens of currencies. Getting any one of those wrong at scale doesn't just cause a bug — it causes someone standing in the rain at 2am. That's the real pressure behind this design.
The core problem Uber solves isn't 'connecting riders to drivers' — that's too simple. The real problem is: how do you maintain a globally consistent, real-time view of moving objects (drivers), efficiently query that view by proximity, run a two-sided marketplace matching algorithm under millisecond constraints, handle partial failures gracefully, and do all of this across data centers on multiple continents while remaining cheap enough to be profitable? Each of those sub-problems alone is a PhD thesis. Together, they're one of the most instructive system design challenges you'll encounter.
By the end of this article you'll be able to walk into any senior engineering interview and articulate the full Uber architecture — from the geospatial indexing strategy that makes proximity search fast, to the matching algorithm trade-offs, to why Uber moved away from a monolith to a domain-oriented microservices architecture, and the exact database and messaging choices that make all of it work in production. More importantly, you'll understand why each decision was made, not just what it was.
What Tombstone Staleness Actually Means in Cassandra
Tombstone staleness is the accumulation of deletion markers (tombstones) that outlive their usefulness, degrading read performance and causing latency spikes. In Cassandra, a delete doesn't remove data immediately — it writes a tombstone with a timestamp. During compaction, tombstones older than gc_grace_seconds (default 10 days) are purged. Until then, every read must scan past these tombstones, increasing I/O and CPU. The core mechanic: tombstones are not garbage collected until compaction, and if compaction falls behind, stale tombstones accumulate linearly with write volume.
In practice, tombstone staleness manifests as steadily rising read latencies and timeouts, especially on range queries. Cassandra's read path must merge all SSTables for a partition — tombstones add overhead proportional to their count. A single partition with 100,000 tombstones can cause multi-second reads. The key property: tombstones are not free; they consume disk space and memory in the memtable and row cache. The compaction strategy (SizeTiered, Leveled, or TimeWindow) determines how aggressively tombstones are reclaimed — Leveled Compaction is most efficient for tombstone cleanup.
Use tombstone staleness awareness when designing systems with high delete rates — time-series data with TTLs, session stores, or queue-like patterns. It matters because ignoring tombstone accumulation leads to cascading failures: slow reads cause client retries, which increase load, which further delays compaction. The rule of thumb: keep tombstones per partition under 1,000 for predictable latency. Monitor 'org.apache.cassandra.metrics.Table.TombstoneScannedHistogram' in production.
High-Level Architecture Overview
Uber operates a domain-oriented microservices architecture. Each domain — location, matching, payment, pricing, trip management — owns its data and exposes APIs via an API gateway (Envoy). Services communicate asynchronously through Kafka topics for event-driven flows, and synchronously through gRPC for low-latency queries.
The architecture is regionally isolated: each city or metro area runs its own stack. Data centers are replicated across multiple regions, with Cassandra providing multi-master replication for location data and PostgreSQL shards for transactional trip/ payment data.
- Driver app → GPS pipeline — every 4 seconds, driver location sent via WebSocket to location-ingestion service.
- Rider app → request — HTTP request to matching service via gateway.
- Matching service — looks up nearby drivers via geospatial index, runs auction.
- Surge pricing service — consumes supply/demand Kafka topics, computes multipliers.
- Payment service — idempotent capture after trip end, uses saga pattern across payment providers.
Geospatial Indexing & Location Tracking
Every driver sends a GPS ping every 4 seconds. At 20 million trips per day, that's roughly 2.5 million pings per second at peak. Storing and querying these points in real time requires a geospatial indexing system that can answer "Who is within 500 meters of (lat, lng)?" in under 10 milliseconds.
Uber originally used Google S2 but later developed H3, a hexagonal hierarchical grid. Each driver's location is assigned an H3 cell at resolution 9 (hexagons ~0.1 km²). The matching service then queries all drivers in the same cell and adjacent cells (hex ring), then calculates ETA via OSRM (Open Source Routing Machine).
Storage: The location table in Cassandra uses driver_id as partition key and epoch_minute as clustering key, with a TTL of 10 minutes. A secondary table by h3_cell allows fast proximity searches: SELECT driver_id, lat, lng, ts FROM driver_location WHERE h3_cell = ? AND epoch_min = ?.
- Hexagons have uniform neighbor distance, unlike squares (grid distortion).
- Use resolution 9 (0.1 km²) for city-level accuracy; lower resolution for long-distance dispatch.
- TTL on location rows prevents stale data from living in Cassandra read-repair.
- Secondary index on h3_cell + epoch_min allows fast partition scans.
Matching Algorithm (Ride Dispatch)
When a rider requests a ride, the matching service must find the best driver within 2 seconds. The process is:
- Filter eligible drivers — those whose acceptance rate > 80%, not on a trip, within surge zone.
- Proximity query — find drivers in the same H3 hex ring (radius ~1 km). If too few, expand to ring 2.
- Cost computation — for each candidate, compute ETA (via OSRM routing service) and surge multiplier.
- Auction — Uber uses a second-price auction (Vickrey): the rider pays the lowest winning bid, the driver gets their bid price. This incentivizes truthful bidding.
- Dispatch — send the rider's request to the top 3 drivers simultaneously (but avoid over-dispatching by reserving the driver for 15 seconds).
The algorithm is optimized for throughput: most cities can dispatch in under 1 second at p99.
Surge Pricing Engine
Surge pricing adjusts fares based on real-time supply (available drivers) and demand (ride requests). The calculation runs every 5 minutes per geographic zone (a set of H3 cells).
Algorithm: - Compute surge_multiplier = max(1.0, demand / (supply * target_coverage)) - Where target_coverage is the desired driver-to-rider ratio (e.g., 0.5 for 1 driver per 2 riders). - The multiplier is smoothed using an exponential moving average to avoid sudden spikes. - If supply drops below a threshold, the zone is marked "surge".
Implementation: A separate Kafka stream processor consumes ride_request and driver_online events per zone, aggregates, then broadcasts the multiplier to a Redis cache. The matching service reads the multiplier from Redis, and the rider app displays the surge notification before confirming.
Uber also uses heatmaps to proactively send drivers notifications about potential surge areas.
Payment & Trip Execution
Uber's payment system processes tens of millions of transactions daily across 50+ currencies. The core challenge is exactly-once payment capture — you never want to charge a rider twice or miss a driver payout.
The solution: idempotency keys. Before initiating any payment, the client generates a UUID (the idempotency key) and sends it along with the request. The payment service stores this key in a Redis set with a short TTL. If the same key appears again, the service returns the previous response without re-executing.
For cross-region payments (e.g., rider in New York pays for a trip in Paris), the system uses a saga pattern with compensating actions. The steps: 1. Capture rider payment (source account) 2. Payout to driver (destination account) 3. Apply Uber commission 4. If any step fails, compensate: reverse capture, refund driver percentage.
The trip execution state machine runs on a Kafka-backed stream: states go from REQUESTED → MATCHED → EN_ROUTE → ON_TRIP → COMPLETED → SETTLED.
Scaling, Fault Tolerance & Real-World Incidents
Uber's system must survive: single data center failure, sudden demand spikes (New Year's Eve), driver app disconnections, network partitions, and rogue deployments. Key strategies:
- Regional isolation: each city runs independent stacks. If one region fails, others are unaffected.
- Graceful degradation: if the matching service cannot compute ETAs, it falls back to linear distance matching. If payment fails, riders can still complete the trip and pay later.
- Auto-scaling: all stateless services (matching, pricing, ETA) scale based on CPU and request queue depth. Cassandra and Redis clusters are sharded and replicated.
- Chaos engineering: Uber runs regular failure drills: kill random pods, inject latency into Kafka, throttle Cassandra nodes.
- Circuit breakers: every synchronous call (gRPC) has a circuit breaker. When error rate exceeds 50%, the circuit opens and the caller uses a fallback (e.g., cached data).
A real incident from 2020: A bug in the H3 library caused all new driver pings to be placed in the same hex cell. Suddenly, all riders in a city saw drivers at a single point. The fix required a hotpatch rolled out via the driver app's feature switch system.
- Assume every network call can fail, every dependency can slow down, every message can be lost.
- Design fallbacks that still offer a reasonable user experience (e.g., distance-only matching).
- Test failures proactively: kill a container, throttle a database, partition a network.
- Monitor the right metrics: p99 latency, error rates, consumer lag, cache hit ratio.
The 100-Millisecond Rule: Why WebSockets Beat Polling for Driver Location
Every Uber backend engineer knows that latency kills the rider experience. When you open the app and watch your driver approach, that blue car icon updates because of WebSocket push, not HTTP polling. Polling adds 300ms-1s of overhead per request. At Uber’s scale, that’s millions of wasted requests per minute. Instead, the driver’s mobile app sends GPS coordinates every 3-5 seconds via a persistent WebSocket connection. The server validates, updates Redis (for fast reads), and pushes to subscribed riders. That means 100ms end-to-end. If you use polling, your system collapses under load. WebSockets also reduce bandwidth by 80% compared to REST polling. The trade-off: connection management. You need a load balancer that supports sticky sessions or a distributed pub/sub like Kafka to fan out updates. Uber uses WebSockets with long-lived connections and fallback to Server-Sent Events for firewall-busted clients. Never poll when you can push. Your users’ thumbs will thank you.
H3 Hexagons: How Uber Discretizes the Earth for Geofencing and Dispatch
Uber doesn’t use lat/lng pairs for geofencing. That’s amateur hour. They use Uber H3, a hexagon-based spatial index. Why hexagons? They tessellate better than squares (no jagged corners) and have uniform distance properties—each edge of a hexagon is roughly the same length. That’s critical for surge pricing zones and driver dispatch. The H3 grid resolves queries like “find all drivers within 500 meters” in O(log n) by indexing hexagon IDs. Resolution 10 means each hexagon is about 0.2 square kilometers—perfect for city-level matching. Uber pre-computes hexagon neighbors, so dispatch only checks 7 hexes (center + 6 neighbors) rather than scanning all drivers. That’s a 100x speedup. They also use H3 for pricing: surge zones snap to hexagon clusters, not arbitrary lat/lng polygons. If you’re building geospatial systems, stop reinventing the wheel. Use H3. It’s open-source. It’s battle-tested. It’s the reason your Uber knows exactly how many drivers are on the next block.
The Two-Phase Commit Trap: Why Uber Uses Saga Pattern for Payments
Do not use distributed transactions for ride payments. That’s a recipe for deadlocks and downtime. Uber processes millions of payments per hour across drivers, riders, promotions, and surge adjustments. A two-phase commit across database shards would block for seconds, and any coordinator failure leaves locks forever. Instead, Uber uses the Saga pattern: a sequence of local transactions with compensating actions on failure. When a ride ends, the saga orchestrates: (1) charge rider card, (2) credit driver wallet, (3) apply promo discount. Each step writes to its own shard. If step 2 fails, step 1 is reversed via a refund transaction. No global locks. No two-phase commit. Uber’s payment saga runs on Apache Kafka for durability and ordering. Each event is idempotent—if a message is retried, the system ignores duplicates via a unique ride_id. This is how Uber processes $14 billion in payments annually without locking a single database row. Sagas are hard to debug, but they are the only way to scale payments. If you try two-phase commit at Uber’s volume, you will learn what production paging looks like at 3 AM.
The 2019 Location Data Blackout
- Eventual consistency is not safe for geo-fencing queries that need seconds-fresh data.
- Always test read-repair behavior under compaction load before upgrades.
- Add defensive timestamp validation in downstream services.
kubectl exec -it matcher-service-0 -- curl localhost:8080/metrics | grep dispatch_latencydocker compose logs matcher-service --tail=100 | grep 'TIMEOUT'Key takeaways
Common mistakes to avoid
4 patternsMemorising syntax before understanding the concept
Skipping practice and only reading theory
Ignoring constraints like latency and consistency
Assuming a monolith can't be scaled
Interview Questions on This Topic
How would you design Uber's location service to handle 5 million GPS pings per second?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
That's Real World. Mark it forged?
9 min read · try the examples if you haven't