Module VI

String Matching Algorithms

Master string matching with Naïve algorithm, Rabin-Karp, KMP algorithm, and the Longest Common Subsequence (LCS) problem.

1. Introduction to String Matching

String Matching Problem
Given a text T of length n and a pattern P of length m, find all occurrences of P in T. Return the starting indices where P matches in T.
🎯
Exam Tip (100% Frequency):
  • LCS appears in EVERY paper with different strings - master the DP table method!
  • KMP appears in 83% of papers - focus on computing the LPS (prefix) table
  • Naïve algorithm is simpler but know its O(mn) complexity

1.1 Terminology

  • Text (T): The string in which we search (length n)
  • Pattern (P): The string we search for (length m)
  • Shift (s): Position in text where pattern is compared
  • Valid shift: Position where pattern matches in text

1.2 Complexity Comparison

Algorithm    Preprocessing   Matching                        Total (Worst)
Naïve        None            O((n-m+1)m)                     O(nm)
Rabin-Karp   O(m)            O(n) avg, O((n-m+1)m) worst     O(nm)
KMP          O(m)            O(n)                            O(n+m)

2. Naïve String Matching Algorithm

Naïve (Brute Force) Algorithm
The simplest string matching algorithm. Check pattern at every position in text. If mismatch occurs, shift pattern by 1 and restart comparison.

2.1 Algorithm

Pseudocode
function NAIVE_STRING_MATCH(T, P):
    n = length(T)
    m = length(P)

    for s = 0 to n - m:
        // Check if pattern matches at shift s
        j = 0
        while j < m AND T[s + j] == P[j]:
            j = j + 1

        if j == m:
            print "Pattern found at shift " + s
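The pseudocode above translates almost line for line into Python; here is a minimal sketch (the function name naive_match is mine):

```python
def naive_match(text, pattern):
    """Return all shifts where pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare pattern against text[s:s+m] character by character
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:              # All m characters matched at shift s
            shifts.append(s)
    return shifts

print(naive_match("ABCABDABC", "ABC"))  # [0, 6]
```

This mirrors the worked example below: the only matches of "ABC" in "ABCABDABC" are at shifts 0 and 6.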
📝 Example: Find "ABC" in "ABCABDABC"
Naïve String Matching Steps
Text:    A B C A B D A B C
Index:   0 1 2 3 4 5 6 7 8

Pattern: A B C

Shift 0: A B C A B D A B C
         A B C              ✓ MATCH at s=0!
         ↑ ↑ ↑

Shift 1: A B C A B D A B C
           A B C            ✗ B≠A at j=0
           ↑

Shift 2: A B C A B D A B C
             A B C          ✗ C≠A at j=0
             ↑

Shift 3: A B C A B D A B C
               A B C        ✗ A=A, B=B, D≠C at j=2
               ↑ ↑ ✗

Shift 4: A B C A B D A B C
                 A B C      ✗ B≠A
                 ↑

Shift 5: A B C A B D A B C
                   A B C    ✗ D≠A
                   ↑

Shift 6: A B C A B D A B C
                     A B C  ✓ MATCH at s=6!
                     ↑ ↑ ↑

Pattern found at shifts: 0, 6

2.2 Time Complexity Analysis

Best Case: O(n)

  • First character of the pattern mismatches at every shift
  • One comparison per shift: only (n-m+1) comparisons needed

Worst Case: O(nm)

  • Text: "AAAAAAAAAA"
  • Pattern: "AAAAB"
  • Pattern almost matches at every position
  • m comparisons at each of n-m+1 positions
Time Complexity: O((n-m+1) × m) = O(nm)
Space Complexity: O(1)

3. Rabin-Karp Algorithm

Rabin-Karp Algorithm
Uses hashing to find pattern matches. Compare hash values instead of characters. If hash matches, verify with character-by-character comparison (to handle hash collisions).

3.1 Key Idea: Rolling Hash

Compute hash of pattern once. Then "roll" the hash over text using:

h(T[s+1...s+m]) = d × (h(T[s...s+m-1]) - T[s] × d^(m-1)) + T[s+m]
Where d = number of characters in alphabet (e.g., 256 for ASCII)
All operations modulo prime q to avoid overflow
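A quick sanity check of the rolling-hash identity with d = 10 and q = 13 (the helper name window_hash is mine):

```python
d, q = 10, 13            # Radix (decimal digits) and prime modulus
text, m = "31415", 2     # Compare windows "31" and "14"
h = pow(d, m - 1, q)     # d^(m-1) mod q = 10, weight of the leading digit

def window_hash(s):
    """Direct hash of a window, computed digit by digit."""
    t = 0
    for ch in s:
        t = (d * t + int(ch)) % q
    return t

t0 = window_hash(text[0:2])   # hash("31") = 31 mod 13 = 5
# Roll: drop the leading '3', bring in the next digit '4'
t1 = (d * (t0 - int(text[0]) * h) + int(text[2])) % q
assert t1 == window_hash(text[1:3])   # hash("14") = 14 mod 13 = 1
print(t0, t1)  # 5 1
```

Python's % always returns a non-negative result, so no separate "if t < 0" correction is needed here.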

3.2 Algorithm

Pseudocode
function RABIN_KARP(T, P, d, q):
    n = length(T)
    m = length(P)
    h = d^(m-1) mod q    // High-order digit multiplier
    p = 0                 // Hash of pattern
    t = 0                 // Hash of current text window

    // Preprocessing: compute hash of pattern and first window
    for i = 0 to m-1:
        p = (d × p + P[i]) mod q
        t = (d × t + T[i]) mod q

    // Matching
    for s = 0 to n - m:
        if p == t:
            // Hash match - verify character by character
            if P[0..m-1] == T[s..s+m-1]:
                print "Pattern found at shift " + s

        // Compute hash for next window (rolling hash)
        if s < n - m:
            t = (d × (t - T[s] × h) + T[s + m]) mod q
            if t < 0:
                t = t + q
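Putting the pieces together, a runnable Python version of the pseudocode above (the defaults d = 256 and q = 101 are my choice; any radix covering the alphabet and any reasonably large prime work):

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all shifts where pattern occurs in text (Rabin-Karp)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)            # Weight of the high-order character
    p = t = 0
    for i in range(m):              # Hashes of pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # On a hash hit, verify character by character (handles collisions)
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:               # Roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts

print(rabin_karp("31415926535", "26"))  # [6]
```

Spurious hits cost an extra O(m) verification each, which is where the O(nm) worst case comes from.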
📝 Example: Rabin-Karp with d=10, q=13

Text: "31415926535"

Pattern: "26"

Step 1: Compute pattern hash

p = (2 × 10 + 6) mod 13 = 26 mod 13 = 0

Step 2: Compute hash of first window "31"

t = (3 × 10 + 1) mod 13 = 31 mod 13 = 5

Step 3: Slide and compare

  • Window "31": hash = 5 ≠ p = 0, no match
  • Window "14": roll: t = (10 × (5 − 3 × 10) + 4) mod 13 = (−246) mod 13 = 1 (check: 14 mod 13 = 1)
  • Window "26" (shift 6): hash = 26 mod 13 = 0 = p → verify characters → match at shift 6
  • Window "65" (shift 7): hash = 65 mod 13 = 0 = p, but "65" ≠ "26" → spurious hit, rejected by verification

3.3 Time Complexity

Case Complexity When
Average O(n + m) Few hash collisions
Worst O(nm) Many hash collisions (spurious hits)

4. Knuth-Morris-Pratt (KMP) Algorithm

KMP Algorithm
An efficient string matching algorithm that uses information from previous comparisons to skip unnecessary comparisons. Preprocesses the pattern to build a Longest Proper Prefix which is also Suffix (LPS) array.
🎯
Exam Tip (83% Frequency): The key skill tested is computing the prefix table (LPS array). Make sure you can calculate it step by step and show the valid shifts during matching.

4.1 LPS Array (Prefix Function)

LPS Array
LPS[i] = Length of the longest proper prefix of P[0..i] which is also a suffix.

Proper prefix: Prefix that is not the entire string.
📝 Understanding LPS with Example

Pattern: "ABABACA"

i 0 1 2 3 4 5 6
P[i] A B A B A C A
LPS[i] 0 0 1 2 3 0 1

Explanation:

  • LPS[0] = 0 (single char, no proper prefix)
  • LPS[1] = 0 ("AB" - no matching prefix/suffix)
  • LPS[2] = 1 ("ABA" - "A" is both prefix and suffix)
  • LPS[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
  • LPS[4] = 3 ("ABABA" - "ABA" is both prefix and suffix)
  • LPS[5] = 0 ("ABABAC" - no matching prefix/suffix)
  • LPS[6] = 1 ("ABABACA" - "A" is both prefix and suffix)

4.2 LPS Computation Algorithm

Pseudocode
function COMPUTE_LPS(P):
    m = length(P)
    LPS[0] = 0
    len = 0        // Length of previous longest prefix suffix
    i = 1

    while i < m:
        if P[i] == P[len]:
            len = len + 1
            LPS[i] = len
            i = i + 1
        else:
            if len != 0:
                len = LPS[len - 1]   // Fall back
                // Don't increment i
            else:
                LPS[i] = 0
                i = i + 1

    return LPS
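The same procedure in runnable Python (renaming len to length, since len is a Python built-in):

```python
def compute_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    m = len(pattern)
    lps = [0] * m
    length = 0          # Length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # Fall back; do NOT advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABACA"))  # [0, 0, 1, 2, 3, 0, 1]
```

The output matches the worked table for "ABABACA" above.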

4.3 KMP Matching Algorithm

Pseudocode
function KMP_SEARCH(T, P):
    n = length(T)
    m = length(P)
    LPS = COMPUTE_LPS(P)

    i = 0    // Index in text
    j = 0    // Index in pattern

    while i < n:
        if P[j] == T[i]:
            i = i + 1
            j = j + 1

        if j == m:
            print "Pattern found at index " + (i - j)
            j = LPS[j - 1]    // Look for next match

        else if i < n AND P[j] != T[i]:
            if j != 0:
                j = LPS[j - 1]    // Skip comparisons using LPS
            else:
                i = i + 1
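A self-contained Python version of the full search, restructured slightly so the j == m check happens immediately after a successful comparison (behavior is unchanged):

```python
def compute_lps(pattern):
    """Longest proper prefix-suffix lengths for each pattern prefix."""
    m = len(pattern)
    lps = [0] * m
    length, i = 0, 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # Fall back without advancing i
        else:
            i += 1
    return lps

def kmp_search(text, pattern):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)
    matches = []
    i = j = 0                   # i indexes text, j indexes pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:          # Full match ends at i-1
                matches.append(i - j)
                j = lps[j - 1]  # Continue searching for overlaps
        elif j != 0:
            j = lps[j - 1]      # Skip comparisons using the LPS table
        else:
            i += 1
    return matches

print(kmp_search("badbabababadaab", "ababada"))  # [6]
```

Note that the text pointer i never moves backwards, which is exactly why the matching phase is O(n).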
Solved Example (83% Frequency) 10 Marks
Give the pseudo code for KMP String Matching Algorithm. Use KMP algorithm to find pattern "ababada" in text "badbabababadaab". Show the prefix table and the valid shifts.
Solution

Step 1: Compute LPS (Prefix Table) for pattern "ababada"

i 0 1 2 3 4 5 6
P[i] a b a b a d a
LPS[i] 0 0 1 2 3 0 1

Step 2: KMP Matching on "badbabababadaab"

KMP Matching Steps
Text:    b a d b a b a b a b a d a a b
Index:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Pattern: a b a b a d a

Step 1: i=0, j=0
Text[0]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 2: i=1, j=0
Text[1]='a' = Pattern[0]='a' ✓
i++, j++

Step 3: i=2, j=1
Text[2]='d' ≠ Pattern[1]='b'
j=LPS[0]=0, i stays

Step 4: i=2, j=0
Text[2]='d' ≠ Pattern[0]='a'
j=0, so i++

Step 5: i=3, j=0
Text[3]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 6: i=4, j=0
Text[4]='a' = Pattern[0]='a' ✓
Continue matching...

Text[4..8] = 'ababa' matches Pattern[0..4] = 'ababa' ✓

Step 7: i=9, j=5
Text[9]='b' ≠ Pattern[5]='d'
j=LPS[4]=3, i stays (the prefix 'aba' is already known to match)

Step 8: continue from i=9, j=3
Text[9]='b' = Pattern[3]='b' ✓ → i=10, j=4
Text[10]='a' = Pattern[4]='a' ✓ → i=11, j=5
Text[11]='d' = Pattern[5]='d' ✓ → i=12, j=6
Text[12]='a' = Pattern[6]='a' ✓ → i=13, j=7

j == m (7), MATCH FOUND at index i - j = 13 - 7 = 6
Text[6..12] = "ababada" = Pattern ✓

Pattern found at index 6!
LPS Table: [0, 0, 1, 2, 3, 0, 1]
Pattern "ababada" found at index 6

4.4 Why KMP is Efficient

KMP vs Naïve: Using LPS to Skip
Naïve: After mismatch at position 5, restart from position 1
       Very wasteful - we already know positions 1-4!

KMP:   After mismatch, use LPS to skip redundant comparisons
       LPS tells us: "These characters already match, skip them!"

Example:
Text:     A B A B A C A B A B
Pattern:  A B A B A D
                    ↑ mismatch at index 5

Naïve shifts by 1:  _ A B A B A D
                      ↑ restart from beginning

KMP uses LPS[4]=3:  _ _ _ A B A B A D
                          ↑ skip first 3 chars (we know they match!)

4.5 Time Complexity

Phase Complexity
LPS Computation O(m)
Matching O(n)
Total O(n + m)

5. Longest Common Subsequence (LCS)

Longest Common Subsequence
Given two sequences X and Y, find the longest subsequence that appears in both. A subsequence is formed by deleting some (or no) elements without changing order.

Note: Subsequence ≠ Substring. Subsequence elements need not be contiguous.
🎯
Exam Tip (100% Frequency): LCS appears in EVERY exam with different strings! You must know:
  1. How to fill the DP table
  2. How to backtrack to find the actual LCS string

5.1 DP Recurrence

LCS[i][j] =
  0                                  if i = 0 or j = 0
  LCS[i-1][j-1] + 1                  if X[i] = Y[j]
  max(LCS[i-1][j], LCS[i][j-1])      if X[i] ≠ Y[j]

5.2 Algorithm

Pseudocode
function LCS_LENGTH(X, Y):
    m = length(X)
    n = length(Y)
    c[0..m][0..n] = 0    // Initialize DP table

    // Fill the table
    for i = 1 to m:
        for j = 1 to n:
            if X[i] == Y[j]:
                c[i][j] = c[i-1][j-1] + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])

    return c[m][n]    // Length of LCS

function PRINT_LCS(c, X, Y, i, j):
    if i == 0 or j == 0:
        return ""
    if X[i] == Y[j]:
        return PRINT_LCS(c, X, Y, i-1, j-1) + X[i]
    if c[i-1][j] > c[i][j-1]:
        return PRINT_LCS(c, X, Y, i-1, j)
    else:
        return PRINT_LCS(c, X, Y, i, j-1)
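Both routines combined into one runnable Python function, using an iterative backtrack instead of recursion (ties go left, matching PRINT_LCS above; the name lcs is mine):

```python
def lcs(X, Y):
    """Return (length, subsequence) of the Longest Common Subsequence."""
    m, n = len(X), len(Y)
    # c[i][j] = LCS length of X[0..i-1] and Y[0..j-1]
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Backtrack: diagonal on a match, otherwise toward the larger value
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] > c[i][j - 1]:
            i -= 1                      # Go up
        else:
            j -= 1                      # Go left (including ties)
    return c[m][n], "".join(reversed(out))

print(lcs("AMERICA", "ARMENIA"))  # (5, 'AMEIA')
```

When several LCSs of the same length exist, the tie-breaking rule decides which one is returned; with ties going left this reproduces the "AMEIA" answer from the worked example below.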
Solved Example (100% Frequency) 10 Marks
Find the LCS for following strings:
X = "AMERICA"
Y = "ARMENIA"
Solution

Step 1: Create DP Table

      ε  A  R  M  E  N  I  A
  ε   0  0  0  0  0  0  0  0
  A   0  1  1  1  1  1  1  1
  M   0  1  1  2  2  2  2  2
  E   0  1  1  2  3  3  3  3
  R   0  1  2  2  3  3  3  3
  I   0  1  2  2  3  3  4  4
  C   0  1  2  2  3  3  4  4
  A   0  1  2  2  3  3  4  5

Step 2: Backtrack to Find LCS

  • Start at c[7][7] = 5
  • X[7]='A' = Y[7]='A' → Add 'A', go diagonal to c[6][6]
  • X[6]='C' ≠ Y[6]='I' → c[5][6]=4 > c[6][5]=3, go up
  • X[5]='I' = Y[6]='I' → Add 'I', go diagonal to c[4][5]
  • X[4]='R' ≠ Y[5]='N' → c[3][5]=3 = c[4][4]=3, go left
  • X[4]='R' ≠ Y[4]='E' → c[3][4]=3 > c[4][3]=2, go up
  • X[3]='E' = Y[4]='E' → Add 'E', go diagonal to c[2][3]
  • X[2]='M' = Y[3]='M' → Add 'M', go diagonal to c[1][2]
  • X[1]='A' ≠ Y[2]='R' → c[0][2]=0 < c[1][1]=1, go left
  • X[1]='A' = Y[1]='A' → Add 'A', go diagonal to c[0][0]

Reading backwards: A, M, E, I, A → LCS = "AMEIA"

LCS Length = 5
LCS = "AMEIA"
📝 More LCS Examples (From Past Papers)
  • "ACBAED" and "ABCABE" → LCS = "ABAE" (length 4)
  • "AGGTAB" and "GXTXAYB" → LCS = "GTAB" (length 4)
  • "SAVANT" and "ADVENT" → LCS = "AVNT" (length 4)

5.3 Time & Space Complexity

Aspect Complexity
Time O(m × n)
Space O(m × n) or O(min(m,n)) with optimization

6. Applications of String Matching

6.1 Text Searching

  • Find/Replace in text editors
  • Search engines (keyword matching)
  • Plagiarism detection
  • Log file analysis

6.2 DNA Sequencing

  • Sequence alignment: Finding similarities between DNA sequences
  • Gene finding: Locating specific patterns in genome
  • Mutation detection: Comparing sequences to find differences
  • LCS application: Finding common subsequences in protein/DNA sequences
📝 DNA Matching Example

DNA Sequence 1: ATCGTACGATCG

DNA Sequence 2: ATCGACGATCG

LCS: ATCGACGATCG (finding common genetic material)

This helps identify evolutionary relationships and shared genes.

6.3 Data Compression

  • LZ77/LZ78: Find repeated patterns and reference earlier occurrences
  • Run-length encoding: Pattern detection for compression
  • Dictionary-based compression: Building pattern dictionaries

6.4 Other Applications

  • Spell checkers: Finding similar words (edit distance)
  • Autocomplete: Prefix matching in tries
  • Network security: Intrusion detection (pattern matching in packets)
  • Version control: Diff algorithms use LCS

7. Practice Questions

Question 1 (100% Frequency) 10 Marks
Find the LCS for strings X = "ACBAED" and Y = "ABCABE"
View Answer

Build 7×7 DP table (including row/column 0).

LCS Length = 4

LCS = "ABAE"

Question 2 (83% Frequency) 10 Marks
Explain KMP algorithm with example. Compute the prefix table for pattern "AABAACAABAA".
View Answer

LPS Table for "AABAACAABAA":

Index: 0 1 2 3 4 5 6 7 8 9 10

Char: A A B A A C A A B A A

LPS: 0 1 0 1 2 0 1 2 3 4 5

Question 3 (50% Frequency) 10 Marks
Explain Naïve string matching algorithm with example.
View Answer

See Section 2 for algorithm and example.

Key points:

  • Check pattern at every position
  • Shift by 1 on mismatch
  • Time: O(nm) worst case
Question 4 10 Marks
Give the Rabin-Karp Algorithm for string matching. Explain its working.
View Answer

See Section 3 for algorithm.

Key concepts:

  • Use hash values to compare
  • Rolling hash: O(1) to compute next hash
  • Verify matches to handle collisions
  • O(n+m) average, O(nm) worst

Key Points Summary

📌 Remember for Exam
  • Naïve: O(nm) - check pattern at every position, shift by 1
  • Rabin-Karp: O(n+m) avg - use hash values, verify on match
  • KMP: O(n+m) guaranteed - use LPS table to skip comparisons
  • LPS[i] = longest proper prefix that is also suffix of pattern[0..i]
  • LCS: DP approach, O(mn) - if chars match go diagonal+1, else max(up, left)
  • LCS backtrack: match → diagonal, else go direction of larger value
  • KMP advantage: never re-examine text characters
  • Applications: Text search, DNA sequencing, compression, version control