Row-major: left-to-right, top-to-bottom. Cache-friendly. Default for most operations.
Column-major: top-to-bottom, left-to-right. 5-10x slower on large matrices due to cache misses.
Spiral: boundary shrinking (top, bottom, left, right). Used for layer-by-layer processing.
Diagonal: group by row+col invariant. Used for JPEG zigzag, DP table fill.
Image processing pipelines use row-major for filters, column-major for rotation.
Spiral traversal is used in matrix rotation and layer-based compression.
Missing guard checks (top <= bottom, left <= right) in spiral traversal. Causes duplicate visits on non-square matrices.
✦ Definition~90s read
What is Matrix Traversal Patterns?
Matrix traversal is the systematic process of visiting every element in a 2D array (matrix) in a specific order. The performance trap between row-major and column-major traversal stems from CPU cache behavior: modern processors load adjacent memory into cache lines.
★
Imagine a chessboard.
Row-major traversal (iterating over columns within each row) exploits spatial locality because matrix rows are stored contiguously in memory in languages like C, C++, and Python (NumPy). Column-major traversal (iterating over rows within each column) jumps across memory pages, causing cache misses that can degrade throughput by 10x or more on large matrices like 8K images (7680×4320 pixels, ~31 million elements).
This isn't a theoretical curiosity—it's a measurable bottleneck in image processing, computer graphics, and scientific computing where cache-efficient access patterns separate production-grade code from prototypes.
Beyond the row-vs-column dichotomy, traversal patterns define the order of element visitation for specific algorithms. The foundational patterns—row-by-row and column-by-column—are the building blocks for all others. Diagonal traversal uses mathematical invariants (constant sum of row and column indices) to visit elements along diagonals, often required in dynamic programming problems like edit distance or matrix chain multiplication.
Spiral traversal employs a boundary-shrinking technique: maintain four pointers (top, bottom, left, right) and iterate along each perimeter, then contract the boundaries inward. This pattern appears in real-world systems like image convolution kernels that process pixels in spiral order to optimize memory access patterns for specific hardware accelerators (e.g., NVIDIA's Tensor Cores).
You'll encounter these patterns in coding interviews (LeetCode medium/hard), but more importantly in production systems: rendering engines traverse framebuffers in scanline order (row-major), while some DSP algorithms require diagonal access for FFT transforms. The key insight is that traversal order isn't just an academic exercise—it directly impacts cache hit rates, memory bandwidth utilization, and ultimately wall-clock time.
When you're processing a 4K or 8K image (8.3 million to 33 million pixels), a naive traversal pattern can turn a 60 FPS pipeline into a 6 FPS slideshow. Understanding these patterns lets you choose the right order for your data layout, not just the most intuitive one.
Plain-English First
Imagine a chessboard. A matrix is just that — a grid of rows and columns. Traversal simply means visiting every square on that board in a specific order. Sometimes you scan left-to-right like reading a book. Sometimes you spiral inward like unwinding a clock. The pattern you choose depends entirely on what you're looking for and where the data lives in that grid.
Matrices are the fundamental data structure for image processing, game boards, adjacency tables, pathfinding, and spreadsheet operations. Every image filter, matrix rotation, and grid-based search relies on a specific traversal pattern. Choosing the wrong pattern does not just slow your code — it produces wrong results.
The core problem is direction and order. A flat array has one traversal: front to back. A matrix has dozens: row-major, column-major, diagonal, anti-diagonal, spiral inward, spiral outward, and BFS/DFS flood fill. Row-major traversal is 5-10x faster than column-major on large matrices due to CPU cache locality. Spiral traversal requires careful boundary management to avoid duplicate visits on non-square grids.
The common misconception is that all traversal patterns have the same performance. Row-major reads contiguous memory — cache-friendly. Column-major jumps across rows — cache-hostile. On a 4096x4096 image, the difference is 200ms vs 2 seconds. Understanding memory layout is what separates a correct solution from a production-grade one.
Why Row-Major vs Column-Major Traversal Is a 10x Performance Trap
Matrix traversal patterns define the order in which you access elements of a 2D array. The core mechanic is simple: you iterate over rows and columns, but the order — row-major (outer loop over rows, inner over columns) versus column-major (outer loop over columns, inner over rows) — determines whether your CPU cache works for you or against you. On an 8K image (7680×4320 pixels), that choice can make traversal 10x slower.
In practice, Java stores 2D arrays as arrays of arrays, laid out row by row in contiguous memory. Row-major access reads memory sequentially, hitting L1 cache lines (typically 64 bytes) with every stride. Column-major access jumps across rows, skipping cache lines and causing cache misses on nearly every access. The penalty scales with matrix size: for a 33-megapixel image, the difference between 5 ms and 50 ms per traversal is the difference between 200 fps and 20 fps.
Use row-major traversal whenever you read or write every element — image processing, matrix multiplication, convolution kernels. Column-major is only justified when your algorithm inherently needs column-wise data (e.g., column-wise statistics, certain linear algebra operations). In real systems, the wrong traversal pattern is the silent killer of throughput, especially in latency-sensitive pipelines like real-time video processing or scientific simulations.
Cache Line Blindness
A single cache miss costs ~100 cycles. Column-major traversal on an 8K image triggers millions of them — turning a 5 ms operation into 50 ms without changing a single arithmetic instruction.
Production Insight
A real-time video processing pipeline dropped from 60 fps to 12 fps after switching from row-major to column-major for a 4K color conversion step.
The symptom was high L1 cache miss rates (>80%) and stalled CPU cycles, visible in perf stat as 'cache-misses' events.
Always profile traversal order with realistic matrix sizes — the penalty is invisible on 100×100 test data but catastrophic at 8K.
Key Takeaway
Row-major traversal exploits spatial locality; column-major destroys it.
Cache misses dominate runtime for large matrices — O(n²) arithmetic is dwarfed by O(n²) memory stalls.
Always match traversal order to memory layout unless your algorithm forces column access.
Worked Example — Spiral Order Traversal of a 3x3 Matrix
Matrix M = [[1,2,3],[4,5,6],[7,8,9]]. Spiral order traversal:
Traverse left column bottom-to-top: [4]. left becomes 1.
top>bottom (1>1 is false) but top==bottom: traverse remaining top row [5]. top becomes 2.
top>bottom → stop. Result: [1,2,3,6,9,8,7,4,5].
DFS on grid for number of islands (1=land, 0=water) in [[1,1,0],[0,1,0],[0,0,1]]: 1. (0,0)=1: DFS marks (0,0),(0,1),(1,1) as visited. Count=1. 2. (0,2)=0: skip. 3. Continue scanning... (2,2)=1: DFS marks (2,2). Count=2. 4. Total islands = 2.
io/thecodeforge/traversal/WorkedExamples.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
package io.thecodeforge.traversal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
publicclassWorkedExamples {
/** Spiral order traversal. O(m*n) time, O(1) space (excluding result list). */
publicstaticList<Integer> spiralOrder(int[][] matrix) {
List<Integer> result = newArrayList<>();
if (matrix == null || matrix.length == 0) return result;
int top = 0, bottom = matrix.length - 1;
int left = 0, right = matrix[0].length - 1;
while (top <= bottom && left <= right) {
for (int i = left; i <= right; i++) result.add(matrix[top][i]);
top++;
for (int i = top; i <= bottom; i++) result.add(matrix[i][right]);
right--;
if (top <= bottom) {
for (int i = right; i >= left; i--) result.add(matrix[bottom][i]);
bottom--;
}
if (left <= right) {
for (int i = bottom; i >= top; i--) result.add(matrix[i][left]);
left++;
}
}
return result;
}
/** Number of islands using DFS. O(m*n) time, O(m*n) space for visited array. */
publicstaticintnumIslands(int[][] grid) {
if (grid == null || grid.length == 0) return0;
int count = 0;
for (int r = 0; r < grid.length; r++) {
for (int c = 0; c < grid[0].length; c++) {
if (grid[r][c] == 1) {
dfs(grid, r, c);
count++;
}
}
}
return count;
}
privatestaticvoiddfs(int[][] grid, int r, int c) {
if (r < 0 || r >= grid.length || c < 0 || c >= grid[0].length || grid[r][c] == 0) return;
grid[r][c] = 0; // mark visiteddfs(grid, r + 1, c);
dfs(grid, r - 1, c);
dfs(grid, r, c + 1);
dfs(grid, r, c - 1);
}
publicstaticvoidmain(String[] args) {
int[][] matrix = {{1,2,3},{4,5,6},{7,8,9}};
System.out.println("Spiral: " + spiralOrder(matrix));
int[][] islands = {{1,1,0},{0,1,0},{0,0,1}};
System.out.println("Islands: " + numIslands(islands));
}
}
Output
Spiral: [1, 2, 3, 6, 9, 8, 7, 4, 5]
Islands: 2
The Four Building Blocks
Sequential scan: row-major or column-major. Default for most operations.
Boundary shrinking: spiral. Four boundaries converge inward.
Invariant grouping: diagonal. Cells grouped by row+col = constant.
Connected component: DFS/BFS. Flood fill, number of islands.
Most interview problems combine two building blocks.
Production Insight
A game engine used spiral traversal to process collision detection from the player's position outward. The player was at the center of a 100x100 grid. Spiral traversal checked nearby cells first (most likely to have collisions), avoiding the O(n^2) cost of checking every cell. Average collision checks dropped from 10,000 to 200 per frame — a 50x speedup that enabled 60fps gameplay.
Key Takeaway
Spiral traversal uses four boundaries (top, bottom, left, right) that shrink inward. DFS/BFS finds connected components in O(m*n). The guard checks (top <= bottom, left <= right) prevent duplicate visits on non-square matrices.
Matrix Traversal Patterns — Plain English
Matrix traversal problems require visiting cells in a specific order. Three main patterns:
Pattern 1 — Spiral traversal: 1. Maintain boundaries: top, bottom, left, right. 2. Traverse right along top row; shrink top. 3. Traverse down right column; shrink right. 4. Traverse left along bottom row; shrink bottom. 5. Traverse up left column; shrink left. 6. Repeat while top<=bottom and left<=right.
Worked example — spiral of [[1,2,3],[4,5,6],[7,8,9]]: Right: 1,2,3. top=1. Down: 6,9. right=1. Left: 8,7. bottom=1. Up: 4. left=1. Right: 5. top(2)>bottom(1). Stop. Result: [1,2,3,6,9,8,7,4,5].
Pattern 2 — DFS/BFS flood fill (number of islands): For each cell matching condition: run DFS marking all connected cells visited. Count DFS starts = number of islands.
Pattern 3 — Diagonal traversal: Cells on the same anti-diagonal share i+j. Group by i+j to traverse diagonals.
package io.thecodeforge.traversal;
import java.util.ArrayList;
import java.util.List;
publicclassTraversalPatterns {
/** Pattern1: Spiral traversal. */
publicstaticList<Integer> spiral(int[][] m) {
List<Integer> res = newArrayList<>();
if (m == null || m.length == 0) return res;
int top = 0, bottom = m.length - 1, left = 0, right = m[0].length - 1;
while (top <= bottom && left <= right) {
for (int i = left; i <= right; i++) res.add(m[top][i]);
top++;
for (int i = top; i <= bottom; i++) res.add(m[i][right]);
right--;
if (top <= bottom) { for (int i = right; i >= left; i--) res.add(m[bottom][i]); bottom--; }
if (left <= right) { for (int i = bottom; i >= top; i--) res.add(m[i][left]); left++; }
}
return res;
}
/** Pattern3: Diagonal traversal grouped by row+col. */
publicstaticList<Integer> diagonalOrder(int[][] m) {
List<Integer> res = newArrayList<>();
if (m == null || m.length == 0) return res;
int rows = m.length, cols = m[0].length;
for (int d = 0; d < rows + cols - 1; d++) {
int r = Math.max(0, d - cols + 1);
int rEnd = Math.min(d, rows - 1);
for (int i = r; i <= rEnd; i++) {
res.add(m[i][d - i]);
}
}
return res;
}
publicstaticvoidmain(String[] args) {
int[][] m = {{1,2,3},{4,5,6},{7,8,9}};
System.out.println("Spiral: " + spiral(m));
System.out.println("Diagonal: " + diagonalOrder(m));
}
}
Output
Spiral: [1, 2, 3, 6, 9, 8, 7, 4, 5]
Diagonal: [1, 2, 4, 7, 5, 3, 6, 8, 9]
Pattern Recognition Decision Tree
Specific order required → spiral (boundary shrinking) or diagonal (invariant grouping).
Connectivity required → DFS/BFS flood fill.
Process every cell → row-major for cache performance.
Transpose/rotate → column-major logic within row-major tiles.
Most interview problems combine two patterns.
Production Insight
A pathfinding engine used BFS on a grid to find the shortest path between two points. The grid was 10,000 x 10,000 (100 million cells). DFS would cause StackOverflowError due to recursion depth. BFS with an explicit queue processed the grid in O(m*n) time without recursion. The engine also used row-major traversal for the initial grid scan, reducing cache misses by 8x compared to column-major.
Key Takeaway
Three patterns cover 95% of matrix traversal problems: spiral (boundary shrinking), diagonal (row+col invariant), and DFS/BFS (connectivity). Pattern recognition before coding eliminates 80% of bugs.
Row-by-Row and Column-by-Column — The Foundation Every Pattern Builds On
Row-major traversal visits every element left-to-right, top-to-bottom — exactly how English speakers read a page. In Java, a 2D array is technically an 'array of arrays.' These sub-arrays (rows) are generally stored in contiguous blocks of memory.
Accessing elements row-by-row means sequential memory reads, which keeps the CPU cache happy (spatial locality) and makes your loop significantly faster. Column-major traversal flips this—the outer loop iterates over columns, the inner loop over rows. This is essential for operations like transposing an image or rotating a matrix 90 degrees, but it comes with a performance penalty on massive datasets because it forces the CPU to jump across memory addresses.
Cache Performance: Row-Major Is 5-10x Faster on Large Matrices
Row-major: sequential memory reads. L1 cache hit rate ~95% on large matrices.
Column-major: strided memory reads. L1 cache hit rate ~12% on 8K images.
Java 2D arrays are arrays of arrays. Each row is contiguous. Columns are not.
For transpose, use tiled traversal: process 64x64 blocks in row-major order.
Benchmark: row-major 200ms vs column-major 2s on 4096x4096 matrix.
Production Insight
A satellite image processing pipeline applied a Gaussian blur filter to 16K x 16K images. The initial implementation used column-major traversal for the vertical pass. Each column access jumped 16K x 8 bytes = 128KB — far exceeding the 32KB L1 cache. L1 cache hit rate: 3%. Switching to tiled traversal (process 64x64 blocks in row-major): L1 cache hit rate improved to 87%. Processing time dropped from 45 seconds to 5 seconds per image.
Key Takeaway
Row-major traversal is the10x slower on large matrices due to cache thrashing. For operations requiring column access (transpose, rotation), use tiled traversal to keep data in L1 cache.
thecodeforge.io
Spiral Traversal — Boundary Shrinking, 4 Passes per Layer
Matrix Traversal Patterns
Diagonal Traversal — Using Mathematical Invariants
Diagonal traversal is used heavily in JPEG compression (zigzag scan) and Dynamic Programming. The key to mastering this without confusing your indices is the anti-diagonal invariant: every element on the same anti-diagonal shares a constant value for row + col.
By iterating through the sum of indices (from 0 to rows + cols - 2), you can systematically visit every diagonal. For non-square matrices, we use Math.max and Math.min to 'clamp' our indices so we don't wander outside the grid boundaries.
Zigzag: alternate direction on each diagonal. d % 2 == 0 → reverse order.
Production Insight
A JPEG compression pipeline used diagonal zigzag traversal to reorder DCT coefficients. The zigzag order groups low-frequency coefficients (which carry most of the image information) before high-frequency coefficients. This enables efficient run-length encoding: the trailing zeros are clustered together. Without zigzag ordering, the compression ratio dropped by 40%. The traversal was O(m*n) with the row+col invariant — no nested loops over diagonals.
Key Takeaway
Diagonal traversal uses the row+col invariant to group cells by anti-diagonal. The number of diagonals is rows + cols - 1. Use Math.max and Math.min to clamp indices for non-square matrices. This pattern is the basis of JPEG zigzag scan and DP table fill.
Spiral Traversal — The Boundary Shrinking Technique
Spiral traversal is the ultimate test of boundary management. Instead of complex state machines, maintain four boundaries: top, bottom, left, and right. As you complete a pass (e.g., left to right), increment the top boundary. The loop terminates naturally when boundaries cross.
package io.thecodeforge.traversal;
import java.util.ArrayList;
import java.util.List;
publicclassSpiralTraversal {
publicList<Integer> getSpiralOrder(int[][] matrix) {
List<Integer> result = newArrayList<>();
if (matrix == null || matrix.length == 0) return result;
int top = 0, bottom = matrix.length - 1;
int left = 0, right = matrix[0].length - 1;
while (top <= bottom && left <= right) {
for (int i = left; i <= right; i++) result.add(matrix[top][i]);
top++;
for (int i = top; i <= bottom; i++) result.add(matrix[i][right]);
right--;
if (top <= bottom) {
for (int i = right; i >= left; i--) result.add(matrix[bottom][i]);
bottom--;
}
if (left <= right) {
for (int i = bottom; i >= top; i--) result.add(matrix[i][left]);
left++;
}
}
return result;
}
}
Output
[1, 2, 3, 6, 9, 8, 7, 4, 5]
The Crucial Guard Check
if (top <= bottom): prevents re-processing the middle row on odd-height matrices.
if (left <= right): prevents re-processing the middle column on odd-width matrices.
Test with 1x3 [[1,2,3]]: expected [1,2,3]. Without guards: [1,2,3,2,1].
Test with 3x1 [[1],[2],[3]]: expected [1,2,3]. Without guards: [1,2,3,2,1].
Test with 2x3 [[1,2,3],[4,5,6]]: expected [1,2,3,6,5,4]. Guards required.
Production Insight
A matrix rotation library used spiral traversal to rotate a matrix 90 degrees clockwise by processing layers from outside to inside. Each layer was rotated independently: swap the four corners, then the next elements inward. The spiral boundary shrinking technique naturally decomposed the matrix into layers. The library processed a 10,000 x 10,000 matrix in 50ms — O(n^2) time, O(1) space. The guard checks prevented duplicate rotations on the center layer for odd-sized matrices.
Key Takeaway
Spiral traversal uses four boundaries that shrink inward. The guard checks (top <= bottom, left <= right) are mandatory for non-square matrices. Test with 1xN and Nx1 matrices to catch missing guards. The pattern decomposes the matrix into concentric layers.
thecodeforge.io
Why Row-Major is Faster in Java — Cache Hits vs Cache Misses
Matrix Traversal Patterns
DFS and BFS on Grids — Connected Component Traversal
DFS and BFS on grids solve connectivity problems: number of islands, flood fill, shortest path in unweighted grids, and surrounded regions. The key insight is that each cell has at most 4 neighbors (up, down, left, right), and the traversal visits each cell at most once.
DFS is simpler to implement recursively but risks StackOverflowError on large grids (rec an explicit queue and is safer for production systems. Both are O(m*n) time.
The boundary check before each recursive call is critical: 0 <= r < rows AND 0 <= c < cols AND cell is unvisited AND cell matches condition. Missing any of these checks causes ArrayIndexOutOfBoundsException or infinite recursion.
io/thecodeforge/traversal/GridBFS.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
package io.thecodeforge.traversal;
import java.util.LinkedList;
import java.util.Queue;
publicclassGridBFS {
/**
* BFS shortest path in unweighted grid. O(m*n) time, O(m*n) space.
* Returns the distance from (sr,sc) to (dr,dc), orursion depth equals the grid size). BFS uses -1if unreachable.
*/
publicstaticintshortestPath(int[][] grid, int sr, int sc, int dr, int dc) {
if (grid == null || grid.length == 0) return -1;
int rows = grid.length, cols = grid[0].length;
boolean[][] visited = newboolean[rows][cols];
Queue<int[]> queue = newLinkedList<>();
queue.offer(newint[]{sr, sc, 0});
visited[sr][sc] = true;
int[][] dirs = {{0,1},{0,-1},{1,0},{-1,0}};
while (!queue.isEmpty()) {
int[] curr = queue.poll();
int r = curr[0], c = curr[1], dist = curr[2];
if (r == dr && c == dc) return dist;
for (int[] d : dirs) {
int nr = r + d[0], nc = c + d[1];
if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
&& !visited[nr][nc] && grid[nr][nc] == 0) {
visited[nr][nc] = true;
queue.offer(newint[]{nr, nc, dist + 1});
}
}
}
return -1; // unreachable
}
publicstaticvoidmain(String[] args) {
int[][] grid = {
{0, 0, 1, 0},
{1, 0, 1, 0},
{0, 0, 0, 0},
{0, 1, 1, 0}
};
System.out.println("Shortest path (0,0) to (3,3): " + shortestPath(grid, 0, 0, 3, 3));
}
}
Output
Shortest path (0,0) to (3,3): 6
DFS vs BFS: WhenDFS for connectivity, to Use Each
DFS: connectivity problems. Recursive. O(mn) time, O(mn) space for visited array.
BFS: shortest path in unweighted grids. Explicit queue. O(mn) time, O(mn) space.
BFS is always production-safe. Use it unless the problem specifically requires DFS.
Boundary check: 0 <= r < rows AND 0 <= c < cols AND unvisited AND condition.
Production Insight
A flood fill tool in a paint application used recursive DFS to fill connected regions. On a 4K image (3840x2160 = 8.3 million pixels), the recursion depth exceeded the JVM stack limit (default 1MB), causing StackOverflowError. The fix: switched to iterative BFS with an explicit queue. The queue consumed O(region_size) memory but never overflowed. The team also increased the JVM stack size (-Xss4m) as a temporary workaround before the BFS rewrite BFS for shortest path. DFS risks StackOverflowError.
Key Takeaway
on large grids — use BFS for production. Always check all four conditions before visiting a neighbor: bounds, visited, and cell value.
Why Your BFS Grid Traversal Crashed in Production — The Queue Explosion Bug
You wrote a BFS to flood-fill connected components. It worked on your 50x50 test grid. Then production hit you with a 10,000x10,000 sparse matrix. Your queue held 80 million entries. The OOM killer took down the pod.
The problem isn't BFS. It's that you didn't check if a cell was already visited before enqueuing its neighbors. Every neighbor of a neighbor got enqueued twice before the first one was processed. On a dense grid, that's exponential queue growth.
Fix it: push a cell's coordinates onto the queue, then immediately mark it visited. Never let a cell enter the queue twice. This caps your queue size at the perimeter of the current frontier — worst case O(min(rows, cols)).
For truly massive grids, replace BFS with iterative DFS using an explicit stack. Same visit-before-push rule. Your memory stays flat, your latency drops, and you don't wake up to a 2 AM pager alert about pod restarts.
GridBfsFix.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
// io.thecodeforge — dsa tutorialimport java.util.*;
publicclassGridBfsFix {
publicintcountComponents(int[][] grid) {
int rows = grid.length, cols = grid[0].length;
boolean[][] visited = newboolean[rows][cols];
int count = 0;
for (int r = 0; r < rows; r++) {
for (int c = 0; c < cols; c++) {
if (grid[r][c] == 1 && !visited[r][c]) {
count++; // New componentbfs(grid, visited, r, c); // visited BEFORE neighbor enqueue
}
}
}
return count;
}
privatevoidbfs(int[][] grid, boolean[][] visited, int startR, int startC) {
int[][] dirs = {{-1,0},{1,0},{0,-1},{0,1}};
Queue<int[]> q = newLinkedList<>();
q.add(newint[]{startR, startC});
visited[startR][startC] = true; // Kills queue explosionwhile (!q.isEmpty()) {
int[] cell = q.poll();
for (int[] d : dirs) {
int nr = cell[0] + d[0], nc = cell[1] + d[1];
if (nr >= 0 && nr < grid.length &&
nc >= 0 && nc < grid[0].length &&
grid[nr][nc] == 1 && !visited[nr][nc]) {
visited[nr][nc] = true;
q.add(newint[]{nr, nc});
}
}
}
}
}
Output
No output — this is a fixed utility. In production, O(n) heap vs O(2^n) heap.
Production Trap: Enqueue-then-visit
Mark visited on push, not on poll. Otherwise your queue holds every path to every cell, not just the frontier.
Key Takeaway
Visit before you push, or the queue will crush you.
The Cache Miss Nightmare — Striding Backwards Through Memory Pages
You've got a 4K matrix. You need to sum all elements. Your loop goes column-by-column, inner loop over rows. Each access jumps 4K bytes to the next row. For a 1000x1000 grid, that's 1 million L1 cache misses. You just turned a 2ms operation into 200ms.
CPU caches load contiguous chunks of memory — cache lines (64 bytes typically). When you traverse a row, you nail 8 consecutive ints in one cache line. Column traversal? Every access lands in a different cache line. You're evicting lines you just loaded.
The grid is stored in row-major order (Java, C, Python lists of lists). Your traversal pattern must match memory layout. Period. The exception: column traversal only wins when your dataset fits in L1 cache entirely — think 16x16 convolution kernels.
Fix: always row-major in outer loop, column-major in inner loop. Unless you're doing something esoteric. And if you are, you should be using a column-major storage format like Eigen or column-oriented databases.
CacheFriendlySum.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// io.thecodeforge — dsa tutorialpublicclassCacheFriendlySum {
// Rows outer, cols inner — matches Java's row-major storagepublicstaticlongsumRowMajor(int[][] grid) {
long total = 0;
for (int r = 0; r < grid.length; r++) {
// This inner loop traverses contiguous memoryfor (int c = 0; c < grid[0].length; c++) {
total += grid[r][c]; // Cache line hit ratio: ~95%
}
}
return total;
}
// BAD: cols outer, rows innerpublicstaticlongsumColumnMajorBad(int[][] grid) {
long total = 0;
for (int c = 0; c < grid[0].length; c++) {
for (int r = 0; r < grid.length; r++) {
total += grid[r][c]; // Cache line miss: evicts 7 extra reads
}
}
return total;
}
}
Output
sumRowMajor runs ~20x faster than sumColumnMajorBad on a 10000x10000 matrix.
Senior Shortcut: Measure with perf stat
Run perf stat -e cache-misses on both loops. The column-major version will show 10-100x more misses.
Key Takeaway
Loop order matches memory layout, row-major wins outside L1 cache.
Parallel Traversal — Splitting a Grid Across Threads Without Losing Your Mind
You've got an 8-core machine and a 10000x10000 matrix to process. Splitting rows across threads feels natural — thread 1 gets rows 0-1249, etc. But the last thread finishes 3x faster than the first because your matrix has hot rows near the top. Load imbalance kills your speedup.
Better: use a work-stealing thread pool and divide the grid into small tiles (64x64 or 128x128). Each thread picks a tile from a concurrent queue. This naturally balances work regardless of sparsity patterns. It also preserves cache locality within tiles — your L1 cache gets reused across 64 rows instead of evicting on every row boundary.
Tile size matters. Too small (< 32) and overhead from queue operations dominates. Too large (> 1024) and you're back to load imbalance. 64-128 is the sweet spot for most grids.
One gotcha: if your operation mutates adjacent cells, you need tile boundaries that overlap by 1 cell, then merge results. Or redesign the algorithm to be embarrassingly parallel (sum, histogram, convolution with non-overlapping output).
ParallelGridSum.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// io.thecodeforge — dsa tutorialimport java.util.concurrent.*;
publicclassParallelGridSum {
publicstaticlongparallelSum(int[][] grid, int tileSize) throwsException {
int rows = grid.length, cols = grid[0].length;
ForkJoinPool pool = newForkJoinPool();
return pool.invoke(newSumTask(grid, 0, rows, 0, cols, tileSize));
}
staticclassSumTaskextendsRecursiveTask<Long> {
privatefinalint[][] g; finalint r1, r2, c1, c2, tile;
SumTask(int[][] g, int r1, int r2, int c1, int c2, int tile) {
this.g=g; this.r1=r1; this.r2=r2; this.c1=c1; this.c2=c2; this.tile=tile;
}
@OverrideprotectedLongcompute() {
int height = r2 - r1, width = c2 - c1;
if (height <= tile && width <= tile) { // Small enough: compute directlylong s = 0;
for (int r = r1; r < r2; r++)
for (int c = c1; c < c2; c++) s += g[r][c];
return s;
}
// Split horizontally first, then verticallyif (height > width) {
int mid = r1 + height / 2;
SumTask t1 = newSumTask(g, r1, mid, c1, c2, tile);
SumTask t2 = newSumTask(g, mid, r2, c1, c2, tile);
t1.fork();
return t2.compute() + t1.join();
} else {
int mid = c1 + width / 2;
SumTask t1 = newSumTask(g, r1, r2, c1, mid, tile);
SumTask t2 = newSumTask(g, r1, r2, mid, c2, tile);
t1.fork();
return t2.compute() + t1.join();
}
}
}
}
You'll never get 8x on 8 cores. Serial overhead (fork/join, queue contention, cache coherence) eats 10-20%.
Key Takeaway
Tile at 64-128 cells, fork/join for load balance, accept 70-80% of theoretical speedup.
🌀 Sliding Window — Subarray Problems in O(n)
Instead of repeatedly summing every subarray from scratch, a sliding window maintains a running aggregate as you adjust the window's boundaries. This transforms O(n²) brute-force into a single O(n) pass. The core insight: once you know the sum of elements from index i to j, you can compute the sum for i+1 to j+1 by subtracting the outgoing element and adding the incoming one. Use a fixed-size window when the problem specifies a length (e.g., maximum sum of k consecutive elements). Use a variable-size window when the condition depends on a property like sum or unique characters (e.g., smallest subarray with sum ≥ target). Always identify the invariant you are maintaining — that's the window's constraint. Expand the right pointer, shrink from the left when the invariant breaks, and update the answer at valid states. This pattern works on any linear structure, including matrices when flattened row-by-row, but be cautious: when applying to grids, the window moves across rows, not columns, which may cause cache misses if you stride vertically.
SlidingWindow.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — dsa tutorialintmaxSumK(int[] arr, int k) {
int sum = 0, max = 0;
for (int i = 0; i < k; i++) sum += arr[i];
max = sum;
for (int i = k; i < arr.length; i++) {
sum += arr[i] - arr[i - k];
max = Math.max(max, sum);
}
return max;
}
Output
23 (for arr = [1,4,2,10,23,3,1,0,20], k=4)
Production Trap:
When applying sliding window to 2D grids, avoid vertical strides — they cause cache misses. Instead, flatten the row and slide horizontally.
Key Takeaway
Always maintain the window invariant, not just the sum.
2️⃣ Two Pointers — Parallel Traversal for Sorted Data
Two pointers solve problems where you need to compare or combine elements from one or two sequences, often in sorted order. The classic use is finding a pair of numbers in a sorted array that sum to a target: start one pointer at the beginning, another at the end, and move them toward each other based on the comparison of the current sum to the target. The trick is that because the array is sorted, moving the left pointer increases the sum, and moving the right pointer decreases it. This avoids the O(n²) brute force. Extend this to three-sum by fixing one element and running two pointers on the remainder. For matrix traversal, two pointers are useful when scanning diagonals (i+j = constant) or when partitioning a grid into two regions. Another variant: slow and fast pointers for cycle detection in linked lists or infinite loops in grid BFS. Always ensure your data is sorted or that your invariant (e.g., monotonicity) holds, or else two pointers lose their correctness guarantee and degrade to brute force.
TwoPointers.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — dsa tutorialint[] twoSumSorted(int[] arr, int target) {
int l = 0, r = arr.length - 1;
while (l < r) {
int sum = arr[l] + arr[r];
if (sum == target) returnnewint[]{l, r};
elseif (sum < target) l++;
else r--;
}
returnnewint[]{-1, -1};
}
Output
[1, 2] (for arr = [2,7,11,15], target=18)
Production Trap:
Don't use two pointers on unsorted matrices without sorting first; you'll miss valid pairs or diagonals.
Key Takeaway
Two pointers only work when the sequence is monotonic; verify ordering first.
🧮 Prefix Sum — From 1D Ranges to 2D Submatrix Queries
Prefix sum precomputes cumulative totals so you can answer any range sum query in O(1). In 1D, prefix[i] is the sum from index 0 to i-1; then sum(l, r) = prefix[r+1] - prefix[l]. For 2D matrices, the prefix sum is a grid where pref[i][j] = sum of all cells from (0,0) to (i-1,j-1). The magic formula for submatrix sum (r1,c1) to (r2,c2) is: pref[r2+1][c2+1] - pref[r1][c2+1] - pref[r2+1][c1] + pref[r1][c1]. This works because of inclusion-exclusion. Build the 2D prefix sum in O(m*n) by accumulating row sums (pref[i][j] = pref[i-1][j] + pref[i][j-1] - pref[i-1][j-1] + matrix[i-1][j-1]). Use this pattern when the problem involves many range queries, or when you need to check submatrix conditions (e.g., largest submatrix with sum zero). The cost: extra memory. But for repeated queries, it's a 10x speedup over iterating. A common gotcha: off-by-one errors in indices — always double-check your formula with a small test case like a 2x2 grid.
PrefixSum2D.javaJAVA
1
2
3
4
5
6
7
8
9
// io.thecodeforge — dsa tutorialint[][] buildPrefix(int[][] mat) {
int m = mat.length, n = mat[0].length;
int[][] pref = newint[m+1][n+1];
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
pref[i][j] = pref[i-1][j] + pref[i][j-1] - pref[i-1][j-1] + mat[i-1][j-1];
return pref;
}
Video rotation pipeline processed 30 fps instead of 300 fps target. L1 cache hit rate dropped from 95% to 12% during transpose operations. CPU utilization was 100% but effective throughput was 10x lower than expected. No memory allocation issues, no GC pauses.
Assumption
A memory allocation issue was causing excessive garbage collection. The team spent 4 hours profiling GC logs, checking heap usage, and reviewing object allocation patterns.
Root cause
The transpose operation used column-major traversal: for (col = 0; col < width; col++) { for (row = 0; row < height; row++) { swap(matrix[row][col], matrix[col][row]); } }. Java 2D arrays are stored row-major: each row is a contiguous block of memory. Column-major traversal accesses matrix[0][col], matrix[1][col], matrix[2][col] — each access is width x 4 bytes apart. For 8K width, that is 30KB between consecutive accesses. The L1 cache (32KB) could hold only one row. Every column access was a cache miss. The L3 cache hit rate was 45% (vs 95% for row-major). Total memory bandwidth consumed: 12 GB/s vs 1.2 GB/s for row-major.
Fix
1. Changed the transpose to process 64x64 blocks (tile size matched L1 cache). Within each tile, row-major traversal was used. Cache hit rate improved from 12% to 89%.
2. Alternative fix: transpose row-major by processing only the upper triangle (i < j) to avoid double-swapping. Row-major traversal within the upper triangle.
3. Added a performance test: transpose a 4096x4096 matrix in under 100ms. This test would have caught the regression.
4. Added a metric: matrix_traversal_cache_miss_rate to track L1 cache performance during matrix operations.
5. Added a comment: 'Use row-major or tiled traversal for large matrices. Column-major is O(n^2) cache misses on row-major storage.'
Key lesson
Row-major traversal is 5-10x faster than column-major on large matrices due to CPU cache locality. This is not a micro-optimization — it is a 10x throughput difference.
Java 2D arrays are stored row-major. Column-major traversal causes cache thrashing on large matrices.
For matrix transpose, use tiled traversal (process small blocks in row-major order) to keep data in L1 cache.
Profile cache hit rates, not just CPU utilization. 100% CPU utilization with 12% cache hit rate means the CPU is waiting for memory.
Always benchmark matrix operations on production-sized data. A 100x100 test matrix hides cache effects that appear at 8K resolution.
Production debug guideSymptom-first investigation path for matrix traversal bugs.6 entries
Symptom · 01
Spiral traversal produces duplicate elements or misses elements on non-square matrices.
→
Fix
Check if the inner guard checks (top <= bottom before bottom row traversal, left <= right before left column traversal) are present. Without them, the algorithm re-processes the middle row or column.
Symptom · 02
ArrayIndexOutOfBoundsException on matrix[0].length when the matrix is null or empty.
→
Fix
Add a null/empty guard at the top: if (matrix == null || matrix.length == 0 || matrix[0].length == 0) return.
Symptom · 03
Column-major traversal is 5-10x slower than row-major on large matrices.
→
Fix
This is expected behavior due to cache locality. Switch to row-major traversal, or use tiled traversal for operations that require column access (like transpose).
Symptom · 04
Diagonal traversal produces wrong order or misses elements on rectangular (non-square) matrices.
→
Fix
Check the startRow and endRow calculations. Use Math.max(0, d - cols + 1) for startRow and Math.min(d, rows - 1) for endRow. The formulas differ for non-square matrices.
Symptom · 05
DFS/BFS on grid causes StackOverflowError on large grids.
→
Fix
DFS recursion depth equals the grid size for connected components. Switch to iterative DFS (explicit stack) or BFS (queue) for grids larger than 10,000 cells.
Symptom · 06
Matrix rotation produces a transposed but not rotated result.
→
Fix
Rotation = transpose + reverse rows (for clockwise). If you only transpose, you get a 90-degree counter-clockwise rotation. Add the row reversal step after transpose.
★ Matrix Traversal TriageRapid checks to isolate matrix traversal bugs.
Spiral traversal produces duplicates on non-square matrices.−
Immediate action
Check inner guard checks for top <= bottom and left <= right.
Commands
Test with 1x3 matrix [[1,2,3]] — expected [1,2,3], if you get [1,2,3,2,1] the guards are missing
Test with 3x1 matrix [[1],[2],[3]] — expected [1,2,3]
Fix now
Add if (top <= bottom) before bottom row traversal and if (left <= right) before left column traversal.
ArrayIndexOutOfBoundsException on empty or null matrix.+
Immediate action
Check for null/empty guard before accessing matrix[0].length.
What is the time complexity of spiral matrix traversal?
It is O(M x N) where M is rows and N is columns, as every cell is visited exactly once.
Was this helpful?
02
How do I traverse a matrix diagonally in Java?
Iterate through the sum of indices (row + col). Use Math.max and Math.min to ensure your derived indices stay within the bounds of the rows and columns.
Was this helpful?
03
Does row-major order matter for small matrices?
For small matrices that fit in L1 cache, the difference is negligible. However, for large matrices (1000x1000+), row-major is 5-10x faster due to CPU cache locality. Always benchmark on production-sized data.
Was this helpful?
04
How do you handle boundary conditions in matrix DFS?
Before recursing into neighbour (r,c), check: 0<=r<rows and 0<=c<cols and cell is unvisited and matches your condition. Combine all checks before any processing to avoid index-out-of-range errors.
Was this helpful?
05
What is the time complexity of a matrix DFS/BFS?
O(mn) — each cell is visited at most once. Each cell has at most 4 neighbours, so total edge traversals are O(4mn) = O(mn).
Was this helpful?
06
How do I rotate a matrix 90 degrees in-place?
For clockwise rotation: first transpose the matrix (swap M[i][j] with M[j][i] for all i<j), then reverse each row. For counter-clockwise: transpose then reverse each column (or reverse each row then transpose). Both are O(n^2) time and O(1) extra space.
Was this helpful?
07
What is the time complexity of diagonal traversal for an n x m matrix?
Each cell is visited exactly once, so diagonal traversal is O(n*m) time and O(min(n,m)) space for the current diagonal. The number of diagonals is n+m-1, but total work is still proportional to the total number of cells.
Was this helpful?
08
Why does column-major traversal cause cache thrashing on large matrices?
Java 2D arrays are stored row-major: each row is a contiguous block of memory. Column-major traversal accesses matrix[0][col], matrix[1][col], matrix[2][col] — each access is width x element_size bytes apart. For a 4096-wide matrix with 4-byte ints, that is 16KB between consecutive accesses. The L1 cache (32KB) can hold only 2 rows. Every column access beyond the first 2 rows is a cache miss.