Counting Inversions — 64-bit Overflow at 50K Items
32-bit inversion counts overflow at ~46K items.
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
- Inversion count measures how unsorted an array is: integer between 0 and n(n-1)/2
- Merge sort counts inversions for free during the merge step, adding O(1) per comparison
- Performance: O(n log n) time, O(n) auxiliary space (for merge)
- Production insight: integer overflow when n > 10^4 in 32-bit systems — use 64-bit
- Biggest mistake: using naive O(n²) n-squared algorithm for large datasets
- Real use: Kendall tau distance for ranking similarity, minimum adjacent swaps to sort
An inversion is a pair of positions (i, j) where i < j but arr[i] > arr[j] — an out-of-order pair. A sorted array has 0 inversions. A reverse-sorted array has n(n-1)/2 inversions. Counting inversions measures how "unsorted" an array is. Merge sort counts them for free during the merge step — when you take an element from the right half before the left half, every remaining element in the left half forms an inversion with it.
Counting inversions is a classic application of extracting extra information during a divide-and-conquer combine step at no asymptotic cost. The merge step of merge sort compares elements — when a right element is taken before a left element, it has to jump over every remaining left element. Adding those numbers gives you the total inversion count.
This pattern — piggybacking computation on an existing algorithm — appears across algorithm design. Inversions for collaborative filtering, closest pair during a geometric divide, maximum subarray in a recurrence. The lesson: if you already have the structure, ask what else you can compute for free.
In production, this matters. A naive O(n²) solution on an array of 100,000 items takes 5 billion comparisons. Merge sort does fewer than 2 million. That's the difference between a sub-second response and a dropped request.
What is an Inversion?
A pair (i, j) is an inversion if i < j but arr[i] > arr[j].
Example: arr = [2, 4, 1, 3, 5] Inversions: (2,1), (4,1), (4,3) → count = 3
- Measuring how similar two preference rankings are
- Minimum swaps to sort an array = number of inversions
- Kendall tau distance between two permutations
Modified Merge Sort
During the merge step of merge sort, when we pick an element from the right subarray before exhausting the left subarray, every remaining element in the left subarray forms an inversion with it.
Why This Works
When merging two sorted halves, any element right[j] that is taken before left[i] forms inversions with all of left[i..end]. This is because: 1. left is sorted → all remaining elements left[i..] are ≥ left[i] > right[j] 2. All these elements come before right[j] in the original array (they're in the left half)
So inversions += len(left) - i counts them all in O(1).
Application — Minimum Swaps to Sort
The minimum number of adjacent swaps needed to sort an array equals the number of inversions. Each adjacent swap reduces the inversion count by exactly 1.
Handling Large Data: Overflow and Recursion Depth
For arrays larger than 100,000 elements, two issues arise: integer overflow and recursion depth. Python's integers are arbitrary precision so overflow isn't an issue, but recursion depth may become a problem if the recursion limit is too low. Java's int overflows at n > 10^4; use long. Java recursion limit is typically ~10000, which is safe for arrays up to ~2^10000 but stack memory may be exhausted earlier. The iterative merge sort avoids both issues.
Real-World Application: Kendall Tau Distance
Kendall tau distance measures the similarity between two rankings. It is the number of discordant pairs divided by total pairs. The number of discordant pairs is exactly the inversion count when one ranking is reordered to match the order of the other. This metric is used in collaborative filtering, ensemble methods, and statistical analysis.
Brute Force: The O(n²) You'll Write Once (Then Never Again)
Before you reach for merge sort, understand why the naive solution exists. It's the baseline. The thing you benchmark against. And in production, the thing that gets your ticket rejected at code review.
The brute force method is exactly what it sounds like: check every pair (i, j) where i < j, and count how many satisfy arr[i] > arr[j]. Two nested loops. That's it. No cleverness. No divide-and-conquer.
Why bother showing this? Because it reveals the inversion count's lower bound. You can't solve this by accident — every pair is a potential inversion. The O(n²) loop is your proof-of-correctness for the merge sort approach. You run it on a small dataset, confirm the count matches, then switch to the log-linear version for anything over a few thousand elements.
In the real world, this O(n²) shows up in security audits and small batch processing where n < 1000. Anything larger, and you're burning CPU cycles for no reason. Know it. Test against it. Then refactor.
Merge Sort Step: Where the Real Count Happens
The merge sort approach isn't magical. It exploits one simple property: when you merge two sorted halves, if an element from the right half is smaller than one from the left, that means all remaining elements in the left half are also larger (since both halves are sorted internally). That's a bulk inversion count — no need to check each pair individually.
Here's the execution: split the array recursively until you have single elements. Then, as you merge back up, count inversions in the merge step. Each time you pick an element from the right half before finishing the left half, add the number of remaining elements in the left half to your total. This is where the O(n log n) comes from — the merge step runs in linear time per level, and there are log n levels.
In production, this is the version you deploy. It handles arrays of 10 million elements without breaking a sweat. The space overhead is O(n) for the temporary array used during merge. That's acceptable for most systems. If you're memory-constrained, consider an in-place variant, but those are trickier to get right.
The main risk? Stack overflow from recursive depth on massive arrays. Use iterative merge sort or increase stack size if recursion is your bottleneck.
Inversion Count Overflow in Ranking Similarity Service
- Always use 64-bit integers for inversion counts when n > 10^4.
- Validate assumptions about data size in production — edge cases always exist.
- Add overflow detection to critical calculations: a negative count is a dead giveaway.
- Use arbitrary-precision integers (Python, BigInteger) when possible.
python -c 'n=100000; print(n*(n-1)//2)'Check type: type(count) should be int (Python has arbitrary precision). In Java, check if count is long.Key takeaways
Common mistakes to avoid
4 patternsUsing 32-bit integer for inversion count
Counting inversions incorrectly for equal elements
Using naive O(n²) algorithm for large arrays
Assuming recursion depth is always safe
Practice These on LeetCode
Interview Questions on This Topic
How does the merge step naturally count inversions?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
That's Recursion. Mark it forged?
4 min read · try the examples if you haven't