KMP Algorithm

Algorithms

The Knuth-Morris-Pratt algorithm is a string searching algorithm which uses a longest prefix suffix array to speed up the search.

Algorithm

The algorithm preprocesses the pattern to create a longest prefix suffix (LPS) array, which helps skip characters when a mismatch occurs.

A state of failure

So when you are matching a search string S to a word W there are actually a set number of ways the match can fail. In order to compare strings you need to loop your search string and your word. Depending on how far you get in the search string there are different states of failure. For example, the search string could fail at character 3, or character 4 or at the first character. The point of this is that there are partial matches that can happen in the search string. Now after a partial match the question is "where is the best place to go?"

Some limiting cases

To understand the value of the KMP algorithm I think it is useful to consider some limiting cases. First, there is the useless KMP. For illustration let's take the sentence: "Find a substring with KMP" Our search string will be KMP. Now let's build our failure array or our longest-prefix-suffix array. We can quickly see that the LPS array or failure array will be full of zeros. This means that the algorithm reverts to nothing more than brute force

Now let's envision a highly structured search string.....

Why it seems non-intuitive

When we think of string searching, our first thought is often of natural language. For example, if we were going to grep a string "the cat sat on the mat" we might search for cat. KMP is optimized for words like "the catatatatatatat satatatatatatat on the matatatatatatat" and search strings atat. Very clearly these are not natural language, but there are strings like DNA sequences which look more like this example. Once you understand this, KMP becomes much more simple.

Longest Prefix Suffix Array

If I had to sum this up succinctly it would be something like this: If the search string has a repeating substring you don't re-match that substring. Note that most natural language does not have have repeateateateating substrings. How you avoid rechecking the substring is the tricky part. You avoid rechecking a repeating substring by perfroming a scan of the search string and this will tell you how to jump. The first step is to create a list of all the substrings. Now from this list of substrings you count the longest prefix and suffix.

Time Complexity

The time complexity is $O (m + n)$ , where $m$ is the length of the pattern and $n$ is the length of the text.

Implementation

def compute_lps(pattern):
    """Compute the longest prefix suffix (lps) array for the pattern."""
    m = len(pattern)
    lps = [0] * m
    length = 0
    i = 1
    
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    
    return lps

def kmp_search(text, pattern):
    """Perform KMP search for the pattern in the text."""
    n = len(text)
    m = len(pattern)
    lps = compute_lps(pattern)
    
    i = 0  # index for text
    j = 0  # index for pattern
    
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        
        if j == m:
            print("Pattern found at index {}".format(i - j))
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1

# Example usage:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
kmp_search(text, pattern)  # Output: Pattern found at index 10