Longest Common Subsequence: DP Deep Dive With Real Examples
Every time you run 'git diff', your terminal isn't doing magic — it's solving a variant of the Longest Common Subsequence problem. LCS is the algorithmic backbone of file comparison tools, bioinformatics sequence alignment, plagiarism detectors, and auto-correct engines. It's one of those problems that looks academic until you realize half the software you use daily depends on it. Understanding it at a deep level doesn't just help you ace interviews — it gives you a mental model for an entire family of DP problems.
The core problem: given two sequences, find the length (and optionally the actual content) of their longest common subsequence — where a subsequence means characters that appear in the same relative order, but don't have to be contiguous. That last part is what makes it tricky. A naive recursive approach explodes exponentially because the same subproblems get recomputed thousands of times. Dynamic programming crushes that by storing results so each subproblem is solved exactly once.
By the end of this article you'll understand how the DP recurrence is derived from first principles, how to build and read the DP table, how to reconstruct the actual subsequence (not just its length), how to optimize from O(m×n) space down to O(min(m,n)), and the exact edge cases and performance traps that trip up experienced engineers in production code and interviews alike.
Building the Recurrence From Scratch — The 'Why' Behind the Table
Before writing a single line of code, let's derive the recurrence honestly instead of just memorizing it.
Let text1 = 'AGCAT' and text2 = 'GAC'. Define dp[i][j] as the length of the LCS of the first i characters of text1 and the first j characters of text2.
Base case: if either string is empty (i=0 or j=0), the LCS is 0. There's nothing to match.
Recursive case: look at the last character of each current prefix. - If text1[i-1] == text2[j-1], those characters are part of the LCS. We include them and the answer grows by 1: dp[i][j] = dp[i-1][j-1] + 1. - If they don't match, we can't use both. We try skipping the last character of text1, or skipping the last character of text2, and take the better result: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
That's the entire recurrence. Two rules. The elegance is that we're reducing a problem of size (i,j) to strictly smaller subproblems — which guarantees no circular dependencies when we fill the table row by row, left to right.
Time complexity is O(m×n). Space is O(m×n) for the naive version, but we'll shrink that significantly later. Every cell depends only on the cell diagonally above-left, the cell above, and the cell to the left — which is the key insight for the space optimization.
public class LCSTableBuilder { /** * Builds the full DP table for LCS and prints it so you can * visually trace exactly what the algorithm is doing. * * @param text1 First input sequence * @param text2 Second input sequence * @return Length of the Longest Common Subsequence */ public static int buildAndPrintTable(String text1, String text2) { int rows = text1.length(); int cols = text2.length(); // dp[i][j] = LCS length of text1[0..i-1] and text2[0..j-1] // We allocate (rows+1) x (cols+1) to naturally handle the base case: // row 0 and col 0 are all zeros — empty prefix has LCS of 0 with anything. int[][] dp = new int[rows + 1][cols + 1]; // Fill table bottom-up: row by row, left to right for (int i = 1; i <= rows; i++) { for (int j = 1; j <= cols; j++) { if (text1.charAt(i - 1) == text2.charAt(j - 1)) { // Characters match — extend the LCS found without these two chars dp[i][j] = dp[i - 1][j - 1] + 1; } else { // No match — best we can do is skip one character from either string dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); } } } // Print the table with headers for readability System.out.println("\nDP Table (rows = '" + text1 + "', cols = '" + text2 + "'):"); System.out.print(" "); // top-left corner padding System.out.print(" ε "); // ε represents the empty string base case for (char c : text2.toCharArray()) { System.out.printf(" %c ", c); } System.out.println(); for (int i = 0; i <= rows; i++) { // Row label: ε for base row, otherwise the current character if (i == 0) System.out.print(" ε "); else System.out.printf(" %c ", text1.charAt(i - 1)); for (int j = 0; j <= cols; j++) { System.out.printf(" %d ", dp[i][j]); } System.out.println(); } int lcsLength = dp[rows][cols]; System.out.println("\nLCS Length: " + lcsLength); return lcsLength; } public static void main(String[] args) { buildAndPrintTable("AGCAT", "GAC"); } }
DP Table (rows = 'AGCAT', cols = 'GAC'):
ε G A C
ε 0 0 0 0
A 0 0 1 1
G 0 1 1 1
C 0 1 1 2
A 0 1 2 2
T 0 1 2 2
LCS Length: 2
Reconstructing the Actual Subsequence — Not Just Its Length
Length is fine for interview warm-ups, but production use cases (diff tools, sequence alignment) need the actual sequence. Reconstruction means backtracking through the DP table from dp[m][n] to dp[0][0], following the decisions that built each cell.
The logic mirrors the fill logic in reverse: - If text1[i-1] == text2[j-1], this character is part of the LCS. Record it, then move diagonally to dp[i-1][j-1]. - If the cell above (dp[i-1][j]) is greater than or equal to the cell to the left (dp[i][j-1]), move up — we skipped a character from text1. - Otherwise, move left — we skipped a character from text2.
Since we're backtracking, the characters are collected in reverse order. Use a StringBuilder and reverse at the end — don't prepend to a String on each step, that's O(n²) due to string immutability in Java.
One important subtlety: when dp[i-1][j] equals dp[i][j-1], both directions are equally valid — they lead to different-but-equally-valid LCS strings. This is why LCS is not necessarily unique. A problem that asks for 'the' LCS is technically underspecified when there are ties. Your tie-breaking rule (prefer going up, or going left) determines which valid LCS you return.
public class LCSReconstructor { /** * Returns the actual LCS string (not just its length) by building * the DP table and then backtracking through it. * * Time: O(m * n) for table fill + O(m + n) for backtracking * Space: O(m * n) for the table */ public static String findLCS(String text1, String text2) { int rows = text1.length(); int cols = text2.length(); int[][] dp = new int[rows + 1][cols + 1]; // --- Phase 1: Fill the DP table (identical to the basic version) --- for (int i = 1; i <= rows; i++) { for (int j = 1; j <= cols; j++) { if (text1.charAt(i - 1) == text2.charAt(j - 1)) { dp[i][j] = dp[i - 1][j - 1] + 1; } else { dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); } } } // --- Phase 2: Backtrack from dp[rows][cols] to dp[0][0] --- StringBuilder lcsReversed = new StringBuilder(); int i = rows; int j = cols; while (i > 0 && j > 0) { if (text1.charAt(i - 1) == text2.charAt(j - 1)) { // This character is part of the LCS — record it and go diagonal lcsReversed.append(text1.charAt(i - 1)); i--; j--; } else if (dp[i - 1][j] >= dp[i][j - 1]) { // Moving up was at least as good — skip text1's character i--; } else { // Moving left was better — skip text2's character j--; } } // Characters were added in reverse order during backtracking return lcsReversed.reverse().toString(); } public static void main(String[] args) { String sequence1 = "ABCBDAB"; String sequence2 = "BDCAB"; String lcs = findLCS(sequence1, sequence2); System.out.println("Sequence 1 : " + sequence1); System.out.println("Sequence 2 : " + sequence2); System.out.println("LCS : " + lcs); System.out.println("LCS Length : " + lcs.length()); // Demonstrate non-uniqueness: a different valid LCS also exists System.out.println("\nNote: 'BCAB' is also a valid LCS of the same length."); System.out.println("Tie-breaking in backtracking determines which one you get."); } }
Sequence 2 : BDCAB
LCS : BCAB
LCS Length : 4
Note: 'BCAB' is also a valid LCS of the same length.
Space Optimization: From O(m×n) to O(min(m,n)) — With the Catch Nobody Mentions
The full DP table is impractical for large inputs. If both strings are 10,000 characters long, you're allocating 100 million integers — roughly 400 MB. That's a hard no in most production environments.
The key observation: when filling row i, you only ever look at row i-1 (the row above) and the current row i (to the left). You never look further back. So you can drop the 2D table and use just two 1D arrays: 'previousRow' and 'currentRow', rolling them forward after each outer loop iteration.
But here's the catch nobody mentions: with two-row optimization, you still can't backtrack to reconstruct the sequence. You've thrown away the decision history. If you need the actual LCS string AND you need space optimization, you have to use Hirschberg's algorithm — a divide-and-conquer approach that reconstructs the sequence in O(m×n) time but O(min(m,n)) space. That's worth knowing by name in an interview even if you don't implement it on a whiteboard.
For the common case where only the length matters (or where strings are short enough for reconstruction), the two-row optimization is the right tool. Also: orient your strings so the shorter one is the columns dimension — that's what 'min(m,n)' means in practice.
public class LCSSpaceOptimized { /** * Computes LCS length using only O(min(m, n)) space. * * Key trick: ensure the shorter string drives the column dimension * so we allocate the smallest possible 1D arrays. * * Time: O(m * n) * Space: O(min(m, n)) — only two rows in memory at once */ public static int lcsLengthOptimized(String text1, String text2) { // Always make text2 the shorter string to minimize array size if (text1.length() < text2.length()) { String temp = text1; text1 = text2; text2 = temp; } int cols = text2.length(); // previousRow represents dp[i-1][0..cols] // currentRow represents dp[i][0..cols] int[] previousRow = new int[cols + 1]; // implicitly all zeros = base case int[] currentRow = new int[cols + 1]; for (int i = 1; i <= text1.length(); i++) { // Reset currentRow for this iteration // (col 0 stays 0 — empty prefix base case) for (int j = 1; j <= cols; j++) { if (text1.charAt(i - 1) == text2.charAt(j - 1)) { // Match: take the diagonal value from previousRow currentRow[j] = previousRow[j - 1] + 1; } else { // No match: best of skipping from either direction currentRow[j] = Math.max(previousRow[j], currentRow[j - 1]); } } // Roll: current becomes previous for the next iteration. // Swap references instead of copying — O(1) operation. int[] swap = previousRow; previousRow = currentRow; currentRow = swap; // reuse the old previousRow array memory } // After all rows, previousRow holds the final row's values return previousRow[cols]; } public static void main(String[] args) { // Test with the classic example System.out.println("LCS('ABCBDAB', 'BDCAB') = " + lcsLengthOptimized("ABCBDAB", "BDCAB")); // Edge cases System.out.println("LCS('', 'ABC') = " + lcsLengthOptimized("", "ABC")); System.out.println("LCS('ABC', 'ABC') = " + lcsLengthOptimized("ABC", "ABC")); System.out.println("LCS('ABC', 'XYZ') = " + lcsLengthOptimized("ABC", "XYZ")); // Large-ish input to show the approach stays snappy String longText1 = "AGGTAB".repeat(500); // 3000 chars String longText2 = "GXTXAYB".repeat(400); // 2800 chars long start = System.currentTimeMillis(); int result = lcsLengthOptimized(longText1, longText2); long elapsed = System.currentTimeMillis() - start; System.out.println("LCS(3000-char, 2800-char) = " + result + " [computed in " + elapsed + "ms]"); } }
LCS('', 'ABC') = 0
LCS('ABC', 'ABC') = 3
LCS('ABC', 'XYZ') = 0
LCS(3000-char, 2800-char) = 2000 [computed in 47ms]
LCS Variants and Real-World Applications — What It Actually Powers
LCS is a template, not just a standalone problem. Once you understand it deeply, a cluster of related problems become obvious reductions:
Shortest Common Supersequence (SCS): The shortest string that contains both text1 and text2 as subsequences. Length = m + n - LCS(m,n). You can reconstruct it by modifying the LCS backtracking to also include non-matching characters from both strings.
Minimum Edit Distance (Levenshtein): Counts insertions, deletions, and substitutions. It's a different recurrence but uses the same table structure. LCS gives you the 'free' edits (matches), so: min deletions + insertions = m + n - 2 * LCS(m,n) when only insert/delete are allowed.
Diff tools (Unix diff, Git): Git's diff uses Myers' algorithm, a space-efficient variant in the LCS family. Understanding LCS lets you reason about why diffs show what they show.
Bioinformatics: DNA and protein sequence alignment are LCS variants with affine gap penalties (gap opening costs more than gap extension). BLAST and Smith-Waterman are production-scale relatives.
Plagiarism detection: Comparing token streams (not just characters) using LCS gives a normalized similarity score: 2 * LCS / (m + n), bounded between 0 and 1.
The production gotcha with all of these: character-level LCS on large documents is slow. Real tools tokenize first (by word, line, or AST node) to shrink the effective string lengths dramatically before running DP.
/** * Demonstrates three practical derivations from LCS: * 1. Minimum insertions + deletions to convert text1 -> text2 * 2. Shortest Common Supersequence length * 3. Normalized similarity score (useful for plagiarism detection) */ public class LCSApplications { /** Computes LCS length using the space-optimized two-row approach. */ private static int lcsLength(String text1, String text2) { int rows = text1.length(); int cols = text2.length(); int[] previousRow = new int[cols + 1]; int[] currentRow = new int[cols + 1]; for (int i = 1; i <= rows; i++) { for (int j = 1; j <= cols; j++) { if (text1.charAt(i - 1) == text2.charAt(j - 1)) { currentRow[j] = previousRow[j - 1] + 1; } else { currentRow[j] = Math.max(previousRow[j], currentRow[j - 1]); } } int[] swap = previousRow; previousRow = currentRow; currentRow = swap; } return previousRow[cols]; } /** * Min number of character-level insertions + deletions to transform * source into target. (No substitutions — pure insert/delete model) */ public static int minInsertionsAndDeletions(String source, String target) { int lcs = lcsLength(source, target); // Characters in source NOT in LCS must be deleted int deletions = source.length() - lcs; // Characters in target NOT in LCS must be inserted int insertions = target.length() - lcs; return deletions + insertions; } /** * Length of the Shortest Common Supersequence. * SCS contains both strings as subsequences with minimum total length. */ public static int shortestCommonSupersequenceLength(String text1, String text2) { int lcs = lcsLength(text1, text2); // Each LCS character is shared — count it once, not twice return text1.length() + text2.length() - lcs; } /** * Returns a similarity ratio in [0.0, 1.0]. * 1.0 means identical, 0.0 means nothing in common. * Formula: 2 * LCS / (|text1| + |text2|) — Sørensen–Dice style */ public static double similarityScore(String text1, String text2) { if (text1.isEmpty() && text2.isEmpty()) return 1.0; if (text1.isEmpty() || text2.isEmpty()) return 0.0; int lcs = lcsLength(text1, text2); return (2.0 * lcs) / (text1.length() + text2.length()); } public static void main(String[] args) { String original = "SUNDAY"; String revised = "SATURDAY"; System.out.println("Source : " + original); System.out.println("Target : " + revised); System.out.println(); System.out.println("Min edits (insert+delete) : " + minInsertionsAndDeletions(original, revised)); System.out.println("Shortest Common Superseq : length " + shortestCommonSupersequenceLength(original, revised)); System.out.printf("Similarity score : %.2f (%.0f%% similar)%n", similarityScore(original, revised), similarityScore(original, revised) * 100); System.out.println(); // Demonstrate identical strings System.out.printf("Similarity('ABC', 'ABC') : %.2f%n", similarityScore("ABC", "ABC")); // Demonstrate completely different strings System.out.printf("Similarity('ABC', 'XYZ') : %.2f%n", similarityScore("ABC", "XYZ")); } }
Target : SATURDAY
Min edits (insert+delete) : 4
Shortest Common Superseq : length 9
Similarity score : 0.57 (57% similar)
Similarity('ABC', 'ABC') : 1.00
Similarity('ABC', 'XYZ') : 0.00
| Aspect | Full 2D DP Table | Two-Row Space Optimization |
|---|---|---|
| Time Complexity | O(m × n) | O(m × n) |
| Space Complexity | O(m × n) | O(min(m, n)) |
| Can Reconstruct LCS String? | Yes — backtrack the table | No — history is discarded |
| Memory for 10k × 10k strings | ~400 MB (int[][]) | ~80 KB (two int[]) |
| Implementation Complexity | Simple — straightforward 2D loop | Moderate — swap trick required |
| When to Use | Need actual sequence OR debugging | Only need the length, or huge inputs |
| Hirschberg Extension Needed? | No — backtrack handles it | Yes — to reconstruct with O(n) space |
🎯 Key Takeaways
- The LCS recurrence has exactly two rules: match → dp[i-1][j-1]+1, no match → max(dp[i-1][j], dp[i][j-1]). Derive it from first principles once and you'll never need to memorize it again.
- Reconstruction requires the full 2D table — the moment you space-optimize to two rows, you lose the ability to backtrack. If you need both the sequence AND low memory, Hirschberg's algorithm is the answer.
- LCS is not unique. Whenever dp[i-1][j] == dp[i][j-1] during backtracking, you have a tie. Your tie-breaking direction determines which valid LCS you return — always document or clarify this behavior.
- In production, always tokenize inputs (by line, word, or AST node) before running LCS DP. Character-level LCS on large texts is algorithmically correct but practically unusable due to table size and cache pressure.
⚠ Common Mistakes to Avoid
- ✕Mistake 1: Off-by-one errors in the DP table indices — Symptom: dp[i][j] reads text1.charAt(i) instead of text1.charAt(i-1), causing ArrayIndexOutOfBoundsException or silently wrong results — Fix: remember dp[i][j] represents the first i characters of text1, so the character being examined is always at index i-1 (zero-based). Draw the table with a row 0 and column 0 of all zeros and it becomes obvious.
- ✕Mistake 2: Confusing LCS (subsequence) with LCS (substring) — Symptom: the solution only checks contiguous matches, returning shorter-than-expected results — Fix: these are different problems. Longest Common Substring requires characters to be contiguous and uses dp[i][j] = dp[i-1][j-1] + 1 only on match, resetting to 0 on mismatch (and tracking a global max). LCS allows gaps. Know which problem you're solving before writing a single line.
- ✕Mistake 3: Using string concatenation instead of StringBuilder during backtracking — Symptom: reconstruction runs in O(n²) time due to Java's immutable String creating a new object on every prepend — Fix: collect characters into a StringBuilder via append(), then call reverse() once at the end. For sequences of length 10,000, this is the difference between microseconds and seconds.
Interview Questions on This Topic
- QGiven the LCS table is O(m×n) space, can you reduce it to O(n) without losing the ability to reconstruct the actual sequence — and if so, how? (Tests knowledge of Hirschberg's algorithm and the tradeoff between space and reconstructability.)
- QHow would you use LCS to implement a basic line-level diff between two text files? Walk through the algorithm and explain what 'added' and 'removed' lines correspond to in the LCS framework. (Tests ability to apply LCS to a real engineering problem.)
- QIf both input strings are identical, what does the LCS table look like, and what is the time complexity of backtracking through it to reconstruct the sequence? (Tricky because backtracking an identical-strings table always goes diagonal — O(n) — but candidates often say O(m×n) out of habit.)
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.