Module VI

String Matching Algorithms

Master string matching with Naïve algorithm, Rabin-Karp, KMP algorithm, and the Longest Common Subsequence (LCS) problem.

1. Introduction to String Matching

String Matching Problem
Given a text T of length n and a pattern P of length m, find all occurrences of P in T. Return the starting indices where P matches in T.
🎯
Exam Tip (100% Frequency):
  • LCS appears in EVERY paper with different strings - master the DP table method!
  • KMP appears in 83% of papers - focus on computing the LPS (prefix) table
  • Naïve algorithm is simpler but know its O(mn) complexity

1.1 Terminology

  • Text (T): The string in which we search (length n)
  • Pattern (P): The string we search for (length m)
  • Shift (s): Position in text where pattern is compared
  • Valid shift: Position where pattern matches in text

1.2 Complexity Comparison

Algorithm    Preprocessing  Matching               Total
Naïve        None           O((n-m+1)m)            O(nm) worst
Rabin-Karp   O(m)           O(n) avg, O(nm) worst  O(n+m) avg, O(nm) worst
KMP          O(m)           O(n)                   O(n+m)

2. Naïve String Matching Algorithm

Naïve (Brute Force) Algorithm
The simplest string matching algorithm. Check pattern at every position in text. If mismatch occurs, shift pattern by 1 and restart comparison.

2.1 Algorithm

Pseudocode
function NAIVE_STRING_MATCH(T, P):
    n = length(T)
    m = length(P)

    for s = 0 to n - m:
        // Check if pattern matches at shift s
        j = 0
        while j < m AND T[s + j] == P[j]:
            j = j + 1

        if j == m:
            print "Pattern found at shift " + s
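The pseudocode above maps almost line-for-line onto Python. A minimal sketch (the function name naive_string_match is mine, not from the notes), returning the shifts instead of printing them:

```python
def naive_string_match(text, pattern):
    """Brute force: try every shift, compare character by character."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched at shift s
            shifts.append(s)
    return shifts

print(naive_string_match("ABCABDABC", "ABC"))  # [0, 6]
```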
📝 Example: Find "ABC" in "ABCABDABC"
Naïve String Matching Steps
Text:    A B C A B D A B C
Index:   0 1 2 3 4 5 6 7 8

Pattern: A B C

Shift 0: A B C A B D A B C
         A B C              ✓ MATCH at s=0!
         ↑ ↑ ↑

Shift 1: A B C A B D A B C
           A B C            ✗ B≠A at j=0
           ↑

Shift 2: A B C A B D A B C
             A B C          ✗ C≠A at j=0
             ↑

Shift 3: A B C A B D A B C
               A B C        ✗ A=A, B=B, D≠C at j=2
               ↑ ↑ ✗

Shift 4: A B C A B D A B C
                 A B C      ✗ B≠A
                 ↑

Shift 5: A B C A B D A B C
                   A B C    ✗ D≠A
                   ↑

Shift 6: A B C A B D A B C
                     A B C  ✓ MATCH at s=6!
                     ↑ ↑ ↑

Pattern found at shifts: 0, 6

2.2 Time Complexity Analysis

Best Case: O(n)

  • Occurs when the first character of the pattern mismatches at every shift
  • Only (n-m+1) comparisons needed in total
  • (Finding the pattern at the first position is also cheap if only the first occurrence is required)

Worst Case: O(nm)

  • Text: "AAAAAAAAAA"
  • Pattern: "AAAAB"
  • Pattern almost matches at every position
  • m comparisons at each of n-m+1 positions
Time Complexity: O((n-m+1) × m) = O(nm)
Space Complexity: O(1)
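The O(nm) bound can be checked concretely with a small Python sketch (my own instrumentation, not part of the notes) that counts character comparisons on the adversarial input above:

```python
def naive_comparisons(text, pattern):
    """Count the character comparisons the naive matcher performs."""
    n, m = len(text), len(pattern)
    count = 0
    for s in range(n - m + 1):
        for j in range(m):
            count += 1                  # one comparison of T[s+j] vs P[j]
            if text[s + j] != pattern[j]:
                break
    return count

# Pattern almost matches at every shift: (n-m+1) shifts x m comparisons each
print(naive_comparisons("AAAAAAAAAA", "AAAAB"))  # 30 = (10-5+1) * 5
```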

3. Rabin-Karp Algorithm

Rabin-Karp Algorithm
Uses hashing to find pattern matches. Compare hash values instead of characters. If hash matches, verify with character-by-character comparison (to handle hash collisions).

3.1 Key Idea: Rolling Hash

Compute hash of pattern once. Then "roll" the hash over text using:

h(T[s+1...s+m]) = d × (h(T[s...s+m-1]) - T[s] × d^(m-1)) + T[s+m]
Where d = number of characters in alphabet (e.g., 256 for ASCII)
All operations modulo prime q to avoid overflow

3.2 Algorithm

Pseudocode
function RABIN_KARP(T, P, d, q):
    n = length(T)
    m = length(P)
    h = d^(m-1) mod q    // High-order digit multiplier
    p = 0                 // Hash of pattern
    t = 0                 // Hash of current text window

    // Preprocessing: compute hash of pattern and first window
    for i = 0 to m-1:
        p = (d × p + P[i]) mod q
        t = (d × t + T[i]) mod q

    // Matching
    for s = 0 to n - m:
        if p == t:
            // Hash match - verify character by character
            if P[0..m-1] == T[s..s+m-1]:
                print "Pattern found at shift " + s

        // Compute hash for next window (rolling hash)
        if s < n - m:
            t = (d × (t - T[s] × h) + T[s + m]) mod q
            if t < 0:
                t = t + q
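A Python sketch of the pseudocode (the defaults d=256 for ASCII and the prime q=101 are illustrative choices, not prescribed by the notes). Python's % operator never returns a negative result, so the pseudocode's `if t < 0` correction is not needed here:

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Rabin-Karp: compare rolling hashes; verify to reject spurious hits."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # hash pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:   # verify on hash match
            shifts.append(s)
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts

print(rabin_karp("31415926535", "26"))  # [6]
```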
📝 Example: Rabin-Karp with d=10, q=13

Text: "31415926535"

Pattern: "26"

Step 1: Compute pattern hash

p = (2 × 10 + 6) mod 13 = 26 mod 13 = 0

Step 2: Compute hash of first window "31"

t = (3 × 10 + 1) mod 13 = 31 mod 13 = 5

Step 3: Slide and compare

  • Window "31": hash = 5 ≠ p = 0, no match
  • Window "14": t = (10 × (5 − 3 × 10) + 4) mod 13 = (−246) mod 13 = 1, no match
  • Windows "41", "15", "59", "92": hashes 2, 2, 7, 1, no match
  • Window "26" at shift 6: hash = 26 mod 13 = 0 = p → verify characters → match at shift 6
  • Window "65" at shift 7: hash = 65 mod 13 = 0 = p, but "65" ≠ "26" → spurious hit, rejected by verification
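The window hashes above can be reproduced with a short Python script (a sketch assuming the characters are decimal digits, as in this example):

```python
d, q = 10, 13
text, pattern = "31415926535", "26"
m = len(pattern)
h = pow(d, m - 1, q)                  # high-order multiplier: 10^1 mod 13 = 10

p = 0
for ch in pattern:                    # hash of "26": 26 mod 13 = 0
    p = (d * p + int(ch)) % q

t = 0
for ch in text[:m]:                   # hash of first window "31": 31 mod 13 = 5
    t = (d * t + int(ch)) % q

hashes = [t]
for s in range(len(text) - m):        # roll the hash across the text
    t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    hashes.append(t)

print(p)       # 0
print(hashes)  # [5, 1, 2, 2, 7, 1, 0, 0, 1, 9]
```

The hash equals p = 0 at shifts 6 and 7: shift 6 is the true match "26", while shift 7 ("65", since 65 mod 13 = 0) is a spurious hit that character-by-character verification rejects.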

3.3 Time Complexity

Case     Complexity  When
Average  O(n + m)    Few hash collisions
Worst    O(nm)       Many hash collisions (spurious hits)

4. Knuth-Morris-Pratt (KMP) Algorithm

KMP Algorithm
An efficient string matching algorithm that uses information from previous comparisons to skip unnecessary comparisons. Preprocesses the pattern to build a Longest Proper Prefix which is also Suffix (LPS) array.
🎯
Exam Tip (83% Frequency): The key skill tested is computing the prefix table (LPS array). Make sure you can calculate it step by step and show the valid shifts during matching.

4.1 LPS Array (Prefix Function)

LPS Array
LPS[i] = Length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i].

Proper prefix: Prefix that is not the entire string.
📝 Understanding LPS with Example

Pattern: "ABABACA"

i       0  1  2  3  4  5  6
P[i]    A  B  A  B  A  C  A
LPS[i]  0  0  1  2  3  0  1

Explanation:

  • LPS[0] = 0 (single char, no proper prefix)
  • LPS[1] = 0 ("AB" - no matching prefix/suffix)
  • LPS[2] = 1 ("ABA" - "A" is both prefix and suffix)
  • LPS[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
  • LPS[4] = 3 ("ABABA" - "ABA" is both prefix and suffix)
  • LPS[5] = 0 ("ABABAC" - no matching prefix/suffix)
  • LPS[6] = 1 ("ABABACA" - "A" is both prefix and suffix)

4.2 LPS Computation Algorithm

Pseudocode
function COMPUTE_LPS(P):
    m = length(P)
    LPS[0] = 0
    len = 0        // Length of previous longest prefix suffix
    i = 1

    while i < m:
        if P[i] == P[len]:
            len = len + 1
            LPS[i] = len
            i = i + 1
        else:
            if len != 0:
                len = LPS[len - 1]   // Fall back
                // Don't increment i
            else:
                LPS[i] = 0
                i = i + 1

    return LPS
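A direct Python rendering of COMPUTE_LPS (the pseudocode's len is renamed length here, since len is a Python builtin):

```python
def compute_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    m = len(pattern)
    lps = [0] * m
    length = 0                 # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABACA"))      # [0, 0, 1, 2, 3, 0, 1]
print(compute_lps("AABAACAABAA"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```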

4.3 KMP Matching Algorithm

Pseudocode
function KMP_SEARCH(T, P):
    n = length(T)
    m = length(P)
    LPS = COMPUTE_LPS(P)

    i = 0    // Index in text
    j = 0    // Index in pattern

    while i < n:
        if P[j] == T[i]:
            i = i + 1
            j = j + 1

        if j == m:
            print "Pattern found at index " + (i - j)
            j = LPS[j - 1]    // Look for next match

        else if i < n AND P[j] != T[i]:
            if j != 0:
                j = LPS[j - 1]    // Skip comparisons using LPS
            else:
                i = i + 1
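A Python sketch of KMP search. It folds the pseudocode's mismatch handling into a while loop over j, which is an equivalent formulation; compute_lps is included so the snippet is self-contained:

```python
def compute_lps(pattern):
    """lps[i] = longest proper prefix of pattern[0..i] that is also its suffix."""
    lps, length, i = [0] * len(pattern), 0, 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]    # fall back without advancing i
        else:
            i += 1
    return lps


def kmp_search(text, pattern):
    """Return every index where pattern occurs in text, in O(n + m)."""
    lps = compute_lps(pattern)
    matches, j = [], 0                  # j = index into pattern
    for i, ch in enumerate(text):       # i never moves backward in the text
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]              # skip comparisons using LPS
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):           # full match ending at position i
            matches.append(i - j + 1)
            j = lps[j - 1]              # continue, allowing overlapping matches
    return matches


print(kmp_search("badbabababadaab", "ababada"))  # [6]
```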
Solved Example (83% Frequency) 10 Marks
Give the pseudo code for KMP String Matching Algorithm. Use KMP algorithm to find pattern "ababada" in text "badbabababadaab". Show the prefix table and the valid shifts.
Solution

Step 1: Compute LPS (Prefix Table) for pattern "ababada"

i       0  1  2  3  4  5  6
P[i]    a  b  a  b  a  d  a
LPS[i]  0  0  1  2  3  0  1

Step 2: KMP Matching on "badbabababadaab"

KMP Matching Steps
Text:    b a d b a b a b a b a d a a b
Index:   0 1 2 3 4 5 6 7 8 9 ...

Pattern: a b a b a d a

Step 1: i=0, j=0
Text[0]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 2: i=1, j=0
Text[1]='a' = Pattern[0]='a' ✓
i++, j++

Step 3: i=2, j=1
Text[2]='d' ≠ Pattern[1]='b'
j=LPS[0]=0, i stays

Step 4: i=2, j=0
Text[2]='d' ≠ Pattern[0]='a'
j=0, so i++

Step 5: i=3, j=0
Text[3]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 6: i=4, j=0
Text[4]='a' = Pattern[0]='a' ✓
Continue matching...

Text[4..8] = 'ababa' matches Pattern[0..4] = 'ababa' ✓ (now i=9, j=5)

Step 7: i=9, j=5
Text[9]='b' ≠ Pattern[5]='d'
j=LPS[4]=3, i stays (the matched prefix "aba" is reused)

Step 8: i=9, j=3
Text[9]='b'  = Pattern[3]='b' ✓ → i=10, j=4
Text[10]='a' = Pattern[4]='a' ✓ → i=11, j=5
Text[11]='d' = Pattern[5]='d' ✓ → i=12, j=6
Text[12]='a' = Pattern[6]='a' ✓ → i=13, j=7

j == m (7), MATCH FOUND at index i-j = 13-7 = 6
Text[6..12] = "ababada" = Pattern ✓

Pattern found at index 6!
LPS Table: [0, 0, 1, 2, 3, 0, 1]
Pattern "ababada" found at index 6

4.4 Why KMP is Efficient

KMP vs Naïve: Using LPS to Skip
Naïve: After mismatch at position 5, restart from position 1
       Very wasteful - we already know positions 1-4!

KMP:   After mismatch, use LPS to skip redundant comparisons
       LPS tells us: "These characters already match, skip them!"

Example:
Text:     A B A B A C A B A B
Pattern:  A B A B A D
                    ↑ mismatch at index 5

Naïve shifts by 1:  _ A B A B A D
                      ↑ restart from beginning

KMP uses LPS[4]=3:  _ _ A B A B A D
                        ↑ resume at P[3]; the first 3 chars already match

4.5 Time Complexity

Phase            Complexity
LPS Computation  O(m)
Matching         O(n)
Total            O(n + m)

5. Longest Common Subsequence (LCS)

Longest Common Subsequence
Given two sequences X and Y, find the longest subsequence that appears in both. A subsequence is formed by deleting some (or no) elements without changing order.

Note: Subsequence ≠ Substring. Subsequence elements need not be contiguous.
🎯
Exam Tip (100% Frequency): LCS appears in EVERY exam with different strings! You must know:
  1. How to fill the DP table
  2. How to backtrack to find the actual LCS string

5.1 DP Recurrence

LCS[i][j] =
    0                                  if i = 0 or j = 0
    LCS[i-1][j-1] + 1                  if X[i] = Y[j]
    max(LCS[i-1][j], LCS[i][j-1])      if X[i] ≠ Y[j]

5.2 Algorithm

Pseudocode
function LCS_LENGTH(X, Y):
    m = length(X)
    n = length(Y)
    c[0..m][0..n] = 0    // Initialize DP table

    // Fill the table
    for i = 1 to m:
        for j = 1 to n:
            if X[i] == Y[j]:
                c[i][j] = c[i-1][j-1] + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])

    return c[m][n]    // Length of LCS

function PRINT_LCS(c, X, Y, i, j):
    if i == 0 or j == 0:
        return ""
    if X[i] == Y[j]:
        return PRINT_LCS(c, X, Y, i-1, j-1) + X[i]
    if c[i-1][j] > c[i][j-1]:
        return PRINT_LCS(c, X, Y, i-1, j)
    else:
        return PRINT_LCS(c, X, Y, i, j-1)
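Both functions can be combined in one Python sketch: fill the table, then backtrack iteratively (on ties it moves left, the same rule as PRINT_LCS above; Python strings are 0-indexed, hence the i-1/j-1 offsets):

```python
def lcs(X, Y):
    """LCS via DP table plus backtracking to recover one LCS string."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:          # characters match: extend diagonal
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Backtrack from c[m][n]
    chars, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            chars.append(X[i - 1])            # include the matched character
            i, j = i - 1, j - 1
        elif c[i - 1][j] > c[i][j - 1]:
            i -= 1                            # go up
        else:
            j -= 1                            # go left (also on ties)
    return "".join(reversed(chars))


print(lcs("AMERICA", "ARMENIA"))  # AMEIA
```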
Solved Example (100% Frequency) 10 Marks
Find the LCS for following strings:
X = "AMERICA"
Y = "ARMENIA"
Solution

Step 1: Create DP Table

      ε  A  R  M  E  N  I  A
  ε   0  0  0  0  0  0  0  0
  A   0  1  1  1  1  1  1  1
  M   0  1  1  2  2  2  2  2
  E   0  1  1  2  3  3  3  3
  R   0  1  2  2  3  3  3  3
  I   0  1  2  2  3  3  4  4
  C   0  1  2  2  3  3  4  4
  A   0  1  2  2  3  3  4  5

Step 2: Backtrack to Find LCS

  • Start at c[7][7] = 5
  • X[7]='A' = Y[7]='A' → Add 'A', go diagonal to c[6][6]
  • X[6]='C' ≠ Y[6]='I' → c[5][6]=4 > c[6][5]=3, go up
  • X[5]='I' = Y[6]='I' → Add 'I', go diagonal to c[4][5]
  • X[4]='R' ≠ Y[5]='N' → c[3][5]=3 = c[4][4]=3, go left
  • X[4]='R' ≠ Y[4]='E' → c[3][4]=3 > c[4][3]=2, go up
  • X[3]='E' = Y[4]='E' → Add 'E', go diagonal to c[2][3]
  • X[2]='M' = Y[3]='M' → Add 'M', go diagonal to c[1][2]
  • X[1]='A' ≠ Y[2]='R' → c[0][2]=0 < c[1][1]=1, go left
  • X[1]='A' = Y[1]='A' → Add 'A', go diagonal to c[0][0]

Reading backwards: A, M, E, I, A → LCS = "AMEIA"

LCS Length = 5
LCS = "AMEIA"
📝 More LCS Examples (From Past Papers)
  • "ACBAED" and "ABCABE" → LCS = "ABAE" (length 4)
  • "AGGTAB" and "GXTXAYB" → LCS = "GTAB" (length 4)
  • "SAVANT" and "ADVENT" → LCS = "AVNT" (length 4)

5.3 Time & Space Complexity

Aspect  Complexity
Time    O(m × n)
Space   O(m × n), or O(min(m,n)) with optimization

6. Applications of String Matching

6.1 Text Searching

  • Find/Replace in text editors
  • Search engines (keyword matching)
  • Plagiarism detection
  • Log file analysis

6.2 DNA Sequencing

  • Sequence alignment: Finding similarities between DNA sequences
  • Gene finding: Locating specific patterns in genome
  • Mutation detection: Comparing sequences to find differences
  • LCS application: Finding common subsequences in protein/DNA sequences
📝 DNA Matching Example

DNA Sequence 1: ATCGTACGATCG

DNA Sequence 2: ATCGACGATCG

LCS: ATCGACGATCG (finding common genetic material)

This helps identify evolutionary relationships and shared genes.

6.3 Data Compression

  • LZ77/LZ78: Find repeated patterns and reference earlier occurrences
  • Run-length encoding: Pattern detection for compression
  • Dictionary-based compression: Building pattern dictionaries

6.4 Other Applications

  • Spell checkers: Finding similar words (edit distance)
  • Autocomplete: Prefix matching in tries
  • Network security: Intrusion detection (pattern matching in packets)
  • Version control: Diff algorithms use LCS

7. Practice Questions

Question 1 (100% Frequency) 10 Marks
Find the LCS for strings X = "ACBAED" and Y = "ABCABE"
View Answer

Build 7×7 DP table (including row/column 0).

LCS Length = 4

LCS = "ABAE"

Question 2 (83% Frequency) 10 Marks
Explain KMP algorithm with example. Compute the prefix table for pattern "AABAACAABAA".
View Answer

LPS Table for "AABAACAABAA":

Index:  0  1  2  3  4  5  6  7  8  9  10
Char:   A  A  B  A  A  C  A  A  B  A  A
LPS:    0  1  0  1  2  0  1  2  3  4  5

Question 3 (50% Frequency) 10 Marks
Explain Naïve string matching algorithm with example.
View Answer

See Section 2 for algorithm and example.

Key points:

  • Check pattern at every position
  • Shift by 1 on mismatch
  • Time: O(nm) worst case
Question 4 10 Marks
Give the Rabin-Karp Algorithm for string matching. Explain its working.
View Answer

See Section 3 for algorithm.

Key concepts:

  • Use hash values to compare
  • Rolling hash: O(1) to compute next hash
  • Verify matches to handle collisions
  • O(n+m) average, O(nm) worst

Key Points Summary

📌 Important Definitions
  • String Matching Problem: Given text T (length n) and pattern P (length m), find all occurrences of P in T
  • Shift: Position in text where pattern comparison begins; valid shift means pattern matches at that position
  • Proper Prefix: A prefix of a string that is not the entire string itself
  • Proper Suffix: A suffix of a string that is not the entire string itself
  • Subsequence: A sequence derived by deleting some (or no) elements without changing order; NOT necessarily contiguous
  • Substring: A contiguous sequence of characters within a string
📌 Naïve String Matching
  • Method: Check pattern at every position in text; on mismatch, shift pattern by 1 and restart
  • Best Case: O(n) - first character of the pattern mismatches at every shift
  • Worst Case: O(nm) - pattern almost matches at every position (e.g., T="AAAA...A", P="AAAB")
  • Space: O(1) - no preprocessing
  • Advantage: Simple to implement, no preprocessing required
  • Disadvantage: Inefficient; re-examines characters already compared
📌 Rabin-Karp Algorithm
  • Method: Use hashing; compare hash values instead of characters; verify on hash match
  • Rolling Hash: Compute next window's hash in O(1) using previous hash
  • Formula: h(new) = d × (h(old) - T[s] × d^(m-1)) + T[s+m]
  • Spurious Hit: Hash match but actual mismatch (collision) - requires verification
  • Average Case: O(n + m) - few collisions
  • Worst Case: O(nm) - many spurious hits
  • Use Case: Multiple pattern matching, plagiarism detection
📌 KMP Algorithm (83% Exam Frequency)
  • Method: Preprocess pattern to build LPS array; use it to skip redundant comparisons
  • LPS[i]: Length of longest proper prefix of P[0..i] which is also a suffix
  • Key Insight: After mismatch, LPS tells how many characters already match - don't re-compare them
  • Preprocessing: O(m) to build LPS table
  • Matching: O(n) - never re-examines text characters
  • Total: O(n + m) - guaranteed, not just average
  • Advantage: Never backtracks in text; optimal for streaming data
📌 LPS Table Construction
  • LPS[0]: Always 0 (single character has no proper prefix)
  • Building: Compare P[i] with P[len]; if match, LPS[i] = len + 1; if not, fallback to LPS[len-1]
  • Example: Pattern "AABAACAABAA" → LPS = [0,1,0,1,2,0,1,2,3,4,5]
  • Using LPS: On mismatch at j, set j = LPS[j-1] instead of restarting
📌 Longest Common Subsequence (100% Exam Frequency)
  • Problem: Find longest subsequence present in both strings X and Y
  • Recurrence: LCS[i][j] = LCS[i-1][j-1] + 1 if X[i]=Y[j]; else max(LCS[i-1][j], LCS[i][j-1])
  • Base Case: LCS[0][j] = 0 for all j; LCS[i][0] = 0 for all i
  • Traceback: If X[i]=Y[j], go diagonal and include character; else go direction of larger value
  • Time Complexity: O(m × n)
  • Space Complexity: O(m × n), can be optimized to O(min(m,n))
📌 LCS Example Solutions (Common in Exams)
  • "AMERICA" and "ARMENIA": LCS = "AMEIA" (length 5)
  • "ACBAED" and "ABCABE": LCS = "ABAE" (length 4)
  • "AGGTAB" and "GXTXAYB": LCS = "GTAB" (length 4)
📌 Time Complexity Comparison
  • Naïve: O(nm) worst, O(n) best
  • Rabin-Karp: O(n+m) average, O(nm) worst
  • KMP: O(n+m) guaranteed
  • LCS: O(m × n)
📌 Applications
  • Text Searching: Find/Replace in editors, search engines, log analysis
  • DNA Sequencing: Sequence alignment, gene finding, mutation detection
  • Data Compression: LZ77/LZ78 algorithms find repeated patterns
  • Version Control: Diff algorithms use LCS to find changes between files
  • Plagiarism Detection: Rabin-Karp for multiple pattern matching
  • Spell Checkers: Finding similar words using edit distance variants