Skip to content

Finding the Longest Common Subsequence in Python

Updated: at 05:12 AM

The longest common subsequence (LCS) problem is a classic computer science problem with applications in fields like bioinformatics, natural language processing, and data analysis. The goal is to find the longest subsequence that is common between two given sequences. This how-to guide will provide a comprehensive walkthrough on solving the LCS problem in Python, including finding the length of the LCS and reconstructing the actual LCS string.

Table of Contents

Open Table of Contents

Overview of the Longest Common Subsequence Problem

Formally, given two sequences X and Y, the longest common subsequence problem aims to find the longest sequence Z that is a subsequence of both X and Y.

A subsequence does not need to be contiguous - it can be obtained by deleting zero or more elements from the original sequence without changing the order of the remaining elements.

For example, let:

X = "AGGTAB"
Y = "GXTXAYB"

One of the LCS is “GTAB” - it is the longest sequence present in both X and Y while preserving the order of elements. Other possible LCS are “GTB”, “GAB”, etc.

The LCS problem has many real-world applications:

The LCS problem can be solved using dynamic programming and memoization. The overall algorithm involves building a 2D matrix to store lengths of LCSes of substring pairs and using it to reconstruct the actual LCS.

Finding the Length of the Longest Common Subsequence

Let’s break down the steps to find just the length of the LCS between two strings X and Y:

1. Initialize the LCS matrix

We will build an (m x n) matrix lcs_mat to store the LCS lengths, where m and n are the lengths of sequences X and Y respectively.

Initialize all elements of lcs_mat to 0.

X = "AGGTAB"
Y = "GXTXAYB"

m = len(X)
n = len(Y)

lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]

This initializes a 2D array with m+1 rows and n+1 columns filled with 0s.

2. Build the LCS matrix in a bottom-up manner

We will now progressively build up the LCS matrix by comparing substring pairs of X and Y:

for i in range(m):
    for j in range(n):
        if X[i] == Y[j]:
            lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
        else:
            lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])

This builds up the length matrix incrementally based on the LCS values of previous substring pairs.

3. The final LCS length is computed

The length of the LCS between X and Y will be the value in the last cell of lcs_mat:

return lcs_mat[m][n]

This will give us the max length of possible LCS between the two strings.

Let’s see a complete example to find just the LCS length:

def lcs_length(X, Y):

    m = len(X)
    n = len(Y)

    # initialize LCS matrix
    lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]

    # build the matrix in bottom-up manner
    for i in range(m):
        for j in range(n):
            if X[i] == Y[j]:
                lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
            else:
                lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])

    # LCS will be last element in matrix
    return lcs_mat[m][n]

X = "AGGTAB"
Y = "GXTXAYB"
print("Length of LCS is:", lcs_length(X, Y))

This prints 4 as length of the LCS between strings “AGGTAB” and “GXTXAYB”.

The time complexity of this algorithm is O(m x n) as we iterate through the entire LCS matrix of size m x n. The space complexity is also O(m x n) due to the matrix.

Reconstructing the Longest Common Subsequence

In many cases, finding just the LCS length is not sufficient - we also need to reconstruct the actual longest common subsequence between X and Y.

This can be done by maintaining a parallel traceback matrix and tracing back the path that gave the maximum length.

The steps are:

1. Initialize the LCS and traceback matrices

We initialize both the familiar LCS length matrix lcs_mat, as well as a traceback matrix tb_mat:

# X and Y sequences
X = "AGGTAB"
Y = "GXTXAYB"

m = len(X)
n = len(Y)

# initialize matrices
lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
tb_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]

2. Build the matrices bottom-up

We build up both lcs_mat and tb_mat row-by-row:

for i in range(m):
    for j in range(n):
        if X[i] == Y[j]:
            lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
            tb_mat[i+1][j+1] = 1 # diagonal
        else:
            lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
            tb_mat[i+1][j+1] = 2 if lcs_mat[i+1][j+1] == lcs_mat[i+1][j] else 3

This matrix will help us trace back the path iteratively.

3. Traceback from end cell

We initialize the LCS string and start from the end cell:

i = m
j = n
lcs = ""

while i > 0 and j > 0:

    if tb_mat[i][j] == 1: # diagonal move
        lcs += X[i-1]
        i -= 1
        j -= 1

    elif tb_mat[i][j] == 2: # move left
        i -= 1

    else: # move up
        j -= 1

We move through tb_mat based on the direction values and append matched characters to the LCS string accordingly.

4. Return reconstructed LCS

The lcs string generated by tracing back tb_mat will contain the LCS in reverse order. We simply reverse it to get the final LCS:

return lcs[::-1]

Here is the full program:

def lcs(X, Y):

    m = len(X)
    n = len(Y)

    # initialize matrices
    lcs_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]
    tb_mat = [[0 for _ in range(n+1)] for _ in range(m+1)]

    # build matrics in bottom-up manner
    for i in range(m):
        for j in range(n):
            if X[i] == Y[j]:
                lcs_mat[i+1][j+1] = lcs_mat[i][j] + 1
                tb_mat[i+1][j+1] = 1
            else:
                lcs_mat[i+1][j+1] = max(lcs_mat[i+1][j], lcs_mat[i][j+1])
                tb_mat[i+1][j+1] = 2 if lcs_mat[i+1][j+1] == lcs_mat[i+1][j] else 3

    # traceback to find LCS
    i = m
    j = n
    lcs = ""

    while i > 0 and j > 0:

        if tb_mat[i][j] == 1:
            lcs += X[i-1]
            i -= 1
            j -= 1

        elif tb_mat[i][j] == 2:
            i -= 1

        else:
            j -= 1

    return lcs[::-1]

X = "AGGTAB"
Y = "GXTXAYB"

print("LCS is", lcs(X, Y))

This dynamically generates both the LCS length matrix and traceback matrix to reconstruct the longest common subsequence “GTAB”.

The overall time and space complexity is still O(m x n). This provides an optimal quadratic solution for the LCS problem.

Further Improvements

Some refinements can be made to the above algorithm:

Applications of LCS in Python

The LCS algorithm has many applications for Python developers:

Conclusion

This guide covered a step-by-step process to find the longest common subsequence between two strings in Python. We looked at:

The LCS problem is a foundational computer science challenge that comes up frequently in technical interviews. I hope this comprehensive Python guide provided both conceptual clarity and implementable code samples to help master this classic algorithm.