The longest common subsequence (LCS) problem is a classic computer science problem with applications in fields like bioinformatics, natural language processing, and data analysis. The goal is to find the longest subsequence that is common between two given sequences. This how-to guide will provide a comprehensive walkthrough on solving the LCS problem in Python, including finding the length of the LCS and reconstructing the actual LCS string.
Table of Contents
Open Table of Contents
Overview of the Longest Common Subsequence Problem
Formally, given two sequences X and Y, the longest common subsequence problem aims to find the longest sequence Z that is a subsequence of both X and Y.
A subsequence does not need to be contiguous - it can be obtained by deleting zero or more elements from the original sequence without changing the order of the remaining elements.
For example, let:
X = "AGGTAB"
Y = "GXTXAYB"
One of the LCS is “GTAB” - it is the longest sequence present in both X and Y while preserving the order of elements. Other possible LCS are “GTB”, “GAB”, etc.
The LCS problem has many real-world applications:
-
Bioinformatics: Finding similarities between DNA sequences or proteins. The LCS can identify conserved regions and analyzes genetic connections.
-
Text Processing: Used in spell checkers, plagiarism detection, natural language processing, etc. to find similarities between textual documents.
-
Data Analysis: Analyzing data versions, temporal changes, data lineage, etc. by finding the commonalities between data sets.
-
Computer Science: Used in diff utilities like git diff to highlight changes between code versions or files. Also has applications in file comparison, data reconciliation, etc.
The LCS problem can be solved using dynamic programming and memoization. The overall algorithm involves building a 2D matrix to store lengths of LCSes of substring pairs and using it to reconstruct the actual LCS.
Finding the Length of the Longest Common Subsequence
Let’s break down the steps to find just the length of the LCS between two strings X and Y:
1. Initialize the LCS matrix
We will build an (m x n) matrix lcs_mat to store the LCS lengths, where m and n are the lengths of sequences X and Y respectively.
Initialize all elements of lcs_mat to 0.
X = "AGGTAB"
Y = "GXTXAYB"
m = len(X)
n = len(Y)
lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
This initializes a 2D array with m+1 rows and n+1 columns filled with 0s.
2. Build the LCS matrix in a bottom-up manner
We will now progressively build up the LCS matrix by comparing substring pairs of X and Y:
for i in range(m):
for j in range(n):
if X[i] == Y[j]:
lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
else:
lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
- We traverse the matrix row-wise from bottom to top.
- If the current pair of characters match, we copy the diagonal value from lcs_mat[i][j] and add 1 to it.
- Else we take the maximum of the values from left and top cells.
This builds up the length matrix incrementally based on the LCS values of previous substring pairs.
3. The final LCS length is computed
The length of the LCS between X and Y will be the value in the last cell of lcs_mat:
return lcs_mat[m][n]
This will give us the max length of possible LCS between the two strings.
Let’s see a complete example to find just the LCS length:
def lcs_length(X, Y):
m = len(X)
n = len(Y)
# initialize LCS matrix
lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
# build the matrix in bottom-up manner
for i in range(m):
for j in range(n):
if X[i] == Y[j]:
lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
else:
lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
# LCS will be last element in matrix
return lcs_mat[m][n]
X = "AGGTAB"
Y = "GXTXAYB"
print("Length of LCS is:", lcs_length(X, Y))
This prints 4 as length of the LCS between strings “AGGTAB” and “GXTXAYB”.
The time complexity of this algorithm is O(m x n) as we iterate through the entire LCS matrix of size m x n. The space complexity is also O(m x n) due to the matrix.
Reconstructing the Longest Common Subsequence
In many cases, finding just the LCS length is not sufficient - we also need to reconstruct the actual longest common subsequence between X and Y.
This can be done by maintaining a parallel traceback matrix and tracing back the path that gave the maximum length.
The steps are:
1. Initialize the LCS and traceback matrices
We initialize both the familiar LCS length matrix lcs_mat, as well as a traceback matrix tb_mat:
# X and Y sequences
X = "AGGTAB"
Y = "GXTXAYB"
m = len(X)
n = len(Y)
# initialize matrices
lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
tb_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
2. Build the matrices bottom-up
We build up both lcs_mat and tb_mat row-by-row:
for i in range(m):
for j in range(n):
if X[i] == Y[j]:
lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
tb_mat[i+1][j+1] = 1 # diagonal
else:
lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
tb_mat[i+1][j+1] = 2 if lcs_mat[i+1][j+1] == lcs_mat[i+1][j] else 3
-
lcs_mat is built exactly like before to store LCS lengths.
-
tb_mat stores the direction of the maximum path:
- 1 for diagonal cell (match)
- 2 for left cell (top greater)
- 3 for top cell (left greater)
This matrix will help us trace back the path iteratively.
3. Traceback from end cell
We initialize the LCS string and start from the end cell:
i = m
j = n
lcs = ""
while i > 0 and j > 0:
if tb_mat[i][j] == 1: # diagonal move
lcs += X[i-1]
i -= 1
j -= 1
elif tb_mat[i][j] == 2: # move left
i -= 1
else: # move up
j -= 1
We move through tb_mat based on the direction values and append matched characters to the LCS string accordingly.
4. Return reconstructed LCS
The lcs string generated by tracing back tb_mat will contain the LCS in reverse order. We simply reverse it to get the final LCS:
return lcs[::-1]
Here is the full program:
def lcs(X, Y):
m = len(X)
n = len(Y)
# initialize matrices
lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
tb_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
# build matrics in bottom-up manner
for i in range(m):
for j in range(n):
if X[i] == Y[j]:
lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
tb_mat[i+1][j+1] = 1
else:
lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
tb_mat[i+1][j+1] = 2 if lcs_mat[i+1][j+1] == lcs_mat[i+1][j] else 3
# traceback to find LCS
i = m
j = n
lcs = ""
while i > 0 and j > 0:
if tb_mat[i][j] == 1:
lcs += X[i-1]
i -= 1
j -= 1
elif tb_mat[i][j] == 2:
i -= 1
else:
j -= 1
return lcs[::-1]
X = "AGGTAB"
Y = "GXTXAYB"
print("LCS is", lcs(X, Y))
This dynamically generates both the LCS length matrix and traceback matrix to reconstruct the longest common subsequence “GTAB”.
The overall time and space complexity is still O(m x n). This provides an optimal quadratic solution for the LCS problem.
Further Improvements
Some refinements can be made to the above algorithm:
-
Maintain only one current row of lcs_mat instead of the full matrix to reduce space complexity.
-
Use memorization to store results of subproblems. This can speed up execution time for repeated subsequence comparisons.
-
Allow flexibility in sequencing by assigning scores to matches/mismatches and taking the maximum score path. Useful for DNA sequence alignment.
-
Return all possible LCSs instead of just one longest sequence.
-
Make the algorithm work for multiple input sequences instead of just two.
Applications of LCS in Python
The LCS algorithm has many applications for Python developers:
-
Version control - Finding differences between code versions to highlight changes. The diff utility internally uses LCS.
-
Text comparisons - Plagiarism detection, spell checking by finding commonality between texts. Useful in NLP.
-
Bioinformatics - DNA or protein sequence alignment to find conserved regions and mutations. Helps analyze evolutionary connections.
-
Data analysis - Data integration, data lineage analysis, temporal change detection by finding commonalities between data sets.
-
Security - Analyzing similarities between malware variants or intrusions by computing LCS.
-
Compression - Encoding common substrings instead of repetitive text can help compression.
-
Education - Helps learn concepts like dynamic programming, memoization, recursion, matrices, and backtracking.
Conclusion
This guide covered a step-by-step process to find the longest common subsequence between two strings in Python. We looked at:
-
The LCS problem definition and its wide range of real-world applications
-
A dynamic programming solution to incrementally build the LCS matrix
-
Traceback technique to reconstruct the actual LCS
-
Ideas for further optimizing the algorithm
-
Usage of LCS in diverse domains like version control, bioinformatics, NLP, etc.
The LCS problem is a foundational computer science challenge that comes up frequently in technical interviews. I hope this comprehensive Python guide provided both conceptual clarity and implementable code samples to help master this classic algorithm.