Module VI

String Matching Algorithms

Master string matching with Naïve algorithm, Rabin-Karp, KMP algorithm, and the Longest Common Subsequence (LCS) problem.

1. Introduction to String Matching

String Matching Problem
Given a text T of length n and a pattern P of length m, find all occurrences of P in T. Return the starting indices where P matches in T.
🎯
Exam Tip (100% Frequency):
  • LCS appears in EVERY paper with different strings - master the DP table method!
  • KMP appears in 83% of papers - focus on computing the LPS (prefix) table
  • Naïve algorithm is simpler but know its O(mn) complexity

1.1 Terminology

  • Text (T): The string in which we search (length n)
  • Pattern (P): The string we search for (length m)
  • Shift (s): Position in text where pattern is compared
  • Valid shift: Position where pattern matches in text

1.2 Complexity Comparison

Algorithm    Preprocessing  Matching               Total
Naïve        None           O((n-m+1)m)            O(nm) worst
Rabin-Karp   O(m)           O(n) avg, O(nm) worst  O(n+m) avg, O(nm) worst
KMP          O(m)           O(n)                   O(n+m)

2. Naïve String Matching Algorithm

Naïve (Brute Force) Algorithm
The simplest string matching algorithm. Check pattern at every position in text. If mismatch occurs, shift pattern by 1 and restart comparison.

2.1 Algorithm

Pseudocode
function NAIVE_STRING_MATCH(T, P):
    n = length(T)
    m = length(P)

    for s = 0 to n - m:
        // Check if pattern matches at shift s
        j = 0
        while j < m AND T[s + j] == P[j]:
            j = j + 1

        if j == m:
            print "Pattern found at shift " + s
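The pseudocode above maps almost line-for-line onto Python. A minimal sketch (the function name naive_string_match is mine, not from the notes), returning the shifts instead of printing them:

```python
def naive_string_match(text, pattern):
    """Brute force: try every shift, compare character by character."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched at shift s
            shifts.append(s)
    return shifts

print(naive_string_match("ABCABDABC", "ABC"))  # [0, 6]
```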
📝 Example: Find "ABC" in "ABCABDABC"
Naïve String Matching Steps
Text:    A B C A B D A B C
Index:   0 1 2 3 4 5 6 7 8

Pattern: A B C

Shift 0: A B C A B D A B C
         A B C              ✓ MATCH at s=0!
         ↑ ↑ ↑

Shift 1: A B C A B D A B C
           A B C            ✗ B≠A at j=0
           ↑

Shift 2: A B C A B D A B C
             A B C          ✗ C≠A at j=0
             ↑

Shift 3: A B C A B D A B C
               A B C        ✗ A=A, B=B, D≠C at j=2
               ↑ ↑ ✗

Shift 4: A B C A B D A B C
                 A B C      ✗ B≠A
                 ↑

Shift 5: A B C A B D A B C
                   A B C    ✗ D≠A
                   ↑

Shift 6: A B C A B D A B C
                     A B C  ✓ MATCH at s=6!
                     ↑ ↑ ↑

Pattern found at shifts: 0, 6

2.2 Time Complexity Analysis

Best Case: O(n)

  • Occurs when the first character of the pattern mismatches at every shift
  • Only (n-m+1) comparisons needed in total
  • (Finding the pattern at the first position is also cheap if only the first occurrence is required)

Worst Case: O(nm)

  • Text: "AAAAAAAAAA"
  • Pattern: "AAAAB"
  • Pattern almost matches at every position
  • m comparisons at each of n-m+1 positions
Time Complexity: O((n-m+1) × m) = O(nm)
Space Complexity: O(1)
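The O(nm) bound can be checked concretely with a small Python sketch (my own instrumentation, not part of the notes) that counts character comparisons on the adversarial input above:

```python
def naive_comparisons(text, pattern):
    """Count the character comparisons the naive matcher performs."""
    n, m = len(text), len(pattern)
    count = 0
    for s in range(n - m + 1):
        for j in range(m):
            count += 1                  # one comparison of T[s+j] vs P[j]
            if text[s + j] != pattern[j]:
                break
    return count

# Pattern almost matches at every shift: (n-m+1) shifts x m comparisons each
print(naive_comparisons("AAAAAAAAAA", "AAAAB"))  # 30 = (10-5+1) * 5
```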

3. Rabin-Karp Algorithm

Rabin-Karp Algorithm
Uses hashing to find pattern matches. Compare hash values instead of characters. If hash matches, verify with character-by-character comparison (to handle hash collisions).

3.1 Key Idea: Rolling Hash

Compute hash of pattern once. Then "roll" the hash over text using:

h(T[s+1...s+m]) = d × (h(T[s...s+m-1]) - T[s] × d^(m-1)) + T[s+m]
Where d = number of characters in alphabet (e.g., 256 for ASCII)
All operations modulo prime q to avoid overflow

3.2 Algorithm

Pseudocode
function RABIN_KARP(T, P, d, q):
    n = length(T)
    m = length(P)
    h = d^(m-1) mod q    // High-order digit multiplier
    p = 0                 // Hash of pattern
    t = 0                 // Hash of current text window

    // Preprocessing: compute hash of pattern and first window
    for i = 0 to m-1:
        p = (d × p + P[i]) mod q
        t = (d × t + T[i]) mod q

    // Matching
    for s = 0 to n - m:
        if p == t:
            // Hash match - verify character by character
            if P[0..m-1] == T[s..s+m-1]:
                print "Pattern found at shift " + s

        // Compute hash for next window (rolling hash)
        if s < n - m:
            t = (d × (t - T[s] × h) + T[s + m]) mod q
            if t < 0:
                t = t + q
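A Python sketch of the pseudocode (the defaults d=256 for ASCII and the prime q=101 are illustrative choices, not prescribed by the notes). Python's % operator never returns a negative result, so the pseudocode's `if t < 0` correction is not needed here:

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Rabin-Karp: compare rolling hashes; verify to reject spurious hits."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # hash pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:   # verify on hash match
            shifts.append(s)
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts

print(rabin_karp("31415926535", "26"))  # [6]
```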
📝 Example: Rabin-Karp with d=10, q=13

Text: "31415926535"

Pattern: "26"

Step 1: Compute pattern hash

p = (2 × 10 + 6) mod 13 = 26 mod 13 = 0

Step 2: Compute hash of first window "31"

t = (3 × 10 + 1) mod 13 = 31 mod 13 = 5

Step 3: Slide and compare

  • Window "31": hash = 5 ≠ p = 0, no match
  • Window "14": t = (10 × (5 − 3 × 10) + 4) mod 13 = (−246) mod 13 = 1, no match
  • Windows "41", "15", "59", "92": hashes 2, 2, 7, 1, no match
  • Window "26" at shift 6: hash = 26 mod 13 = 0 = p → verify characters → match at shift 6
  • Window "65" at shift 7: hash = 65 mod 13 = 0 = p, but "65" ≠ "26" → spurious hit, rejected by verification
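The window hashes above can be reproduced with a short Python script (a sketch assuming the characters are decimal digits, as in this example):

```python
d, q = 10, 13
text, pattern = "31415926535", "26"
m = len(pattern)
h = pow(d, m - 1, q)                  # high-order multiplier: 10^1 mod 13 = 10

p = 0
for ch in pattern:                    # hash of "26": 26 mod 13 = 0
    p = (d * p + int(ch)) % q

t = 0
for ch in text[:m]:                   # hash of first window "31": 31 mod 13 = 5
    t = (d * t + int(ch)) % q

hashes = [t]
for s in range(len(text) - m):        # roll the hash across the text
    t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    hashes.append(t)

print(p)       # 0
print(hashes)  # [5, 1, 2, 2, 7, 1, 0, 0, 1, 9]
```

The hash equals p = 0 at shifts 6 and 7: shift 6 is the true match "26", while shift 7 ("65", since 65 mod 13 = 0) is a spurious hit that character-by-character verification rejects.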

3.3 Time Complexity

Case     Complexity  When
Average  O(n + m)    Few hash collisions
Worst    O(nm)       Many hash collisions (spurious hits)

4. Knuth-Morris-Pratt (KMP) Algorithm

KMP Algorithm
An efficient string matching algorithm that uses information from previous comparisons to skip unnecessary comparisons. Preprocesses the pattern to build a Longest Proper Prefix which is also Suffix (LPS) array.
🎯
Exam Tip (83% Frequency): The key skill tested is computing the prefix table (LPS array). Make sure you can calculate it step by step and show the valid shifts during matching.

4.1 LPS Array (Prefix Function)

LPS Array
LPS[i] = Length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i].

Proper prefix: Prefix that is not the entire string.
📝 Understanding LPS with Example

Pattern: "ABABACA"

i       0  1  2  3  4  5  6
P[i]    A  B  A  B  A  C  A
LPS[i]  0  0  1  2  3  0  1

Explanation:

  • LPS[0] = 0 (single char, no proper prefix)
  • LPS[1] = 0 ("AB" - no matching prefix/suffix)
  • LPS[2] = 1 ("ABA" - "A" is both prefix and suffix)
  • LPS[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
  • LPS[4] = 3 ("ABABA" - "ABA" is both prefix and suffix)
  • LPS[5] = 0 ("ABABAC" - no matching prefix/suffix)
  • LPS[6] = 1 ("ABABACA" - "A" is both prefix and suffix)

4.2 LPS Computation Algorithm

Pseudocode
function COMPUTE_LPS(P):
    m = length(P)
    LPS[0] = 0
    len = 0        // Length of previous longest prefix suffix
    i = 1

    while i < m:
        if P[i] == P[len]:
            len = len + 1
            LPS[i] = len
            i = i + 1
        else:
            if len != 0:
                len = LPS[len - 1]   // Fall back
                // Don't increment i
            else:
                LPS[i] = 0
                i = i + 1

    return LPS
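A direct Python rendering of COMPUTE_LPS (the pseudocode's len is renamed length here, since len is a Python builtin):

```python
def compute_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    m = len(pattern)
    lps = [0] * m
    length = 0                 # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABACA"))      # [0, 0, 1, 2, 3, 0, 1]
print(compute_lps("AABAACAABAA"))  # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```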

4.3 KMP Matching Algorithm

Pseudocode
function KMP_SEARCH(T, P):
    n = length(T)
    m = length(P)
    LPS = COMPUTE_LPS(P)

    i = 0    // Index in text
    j = 0    // Index in pattern

    while i < n:
        if P[j] == T[i]:
            i = i + 1
            j = j + 1

        if j == m:
            print "Pattern found at index " + (i - j)
            j = LPS[j - 1]    // Look for next match

        else if i < n AND P[j] != T[i]:
            if j != 0:
                j = LPS[j - 1]    // Skip comparisons using LPS
            else:
                i = i + 1
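A Python sketch of KMP search. It folds the pseudocode's mismatch handling into a while loop over j, which is an equivalent formulation; compute_lps is included so the snippet is self-contained:

```python
def compute_lps(pattern):
    """lps[i] = longest proper prefix of pattern[0..i] that is also its suffix."""
    lps, length, i = [0] * len(pattern), 0, 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]    # fall back without advancing i
        else:
            i += 1
    return lps


def kmp_search(text, pattern):
    """Return every index where pattern occurs in text, in O(n + m)."""
    lps = compute_lps(pattern)
    matches, j = [], 0                  # j = index into pattern
    for i, ch in enumerate(text):       # i never moves backward in the text
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]              # skip comparisons using LPS
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):           # full match ending at position i
            matches.append(i - j + 1)
            j = lps[j - 1]              # continue, allowing overlapping matches
    return matches


print(kmp_search("badbabababadaab", "ababada"))  # [6]
```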
Solved Example (83% Frequency) 10 Marks
Give the pseudo code for KMP String Matching Algorithm. Use KMP algorithm to find pattern "ababada" in text "badbabababadaab". Show the prefix table and the valid shifts.
Solution

Step 1: Compute LPS (Prefix Table) for pattern "ababada"

i       0  1  2  3  4  5  6
P[i]    a  b  a  b  a  d  a
LPS[i]  0  0  1  2  3  0  1

Step 2: KMP Matching on "badbabababadaab"

KMP Matching Steps
Text:    b a d b a b a b a b a d a a b
Index:   0 1 2 3 4 5 6 7 8 9 ...

Pattern: a b a b a d a

Step 1: i=0, j=0
Text[0]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 2: i=1, j=0
Text[1]='a' = Pattern[0]='a' ✓
i++, j++

Step 3: i=2, j=1
Text[2]='d' ≠ Pattern[1]='b'
j=LPS[0]=0, i stays

Step 4: i=2, j=0
Text[2]='d' ≠ Pattern[0]='a'
j=0, so i++

Step 5: i=3, j=0
Text[3]='b' ≠ Pattern[0]='a'
j=0, so i++

Step 6: i=4, j=0
Text[4]='a' = Pattern[0]='a' ✓
Continue matching...

Text[4..8] = 'ababa' matches Pattern[0..4] = 'ababa' ✓ (now i=9, j=5)

Step 7: i=9, j=5
Text[9]='b' ≠ Pattern[5]='d'
j=LPS[4]=3, i stays (the matched prefix "aba" is reused)

Step 8: i=9, j=3
Text[9]='b'  = Pattern[3]='b' ✓ → i=10, j=4
Text[10]='a' = Pattern[4]='a' ✓ → i=11, j=5
Text[11]='d' = Pattern[5]='d' ✓ → i=12, j=6
Text[12]='a' = Pattern[6]='a' ✓ → i=13, j=7

j == m (7), MATCH FOUND at index i-j = 13-7 = 6
Text[6..12] = "ababada" = Pattern ✓

Pattern found at index 6!
LPS Table: [0, 0, 1, 2, 3, 0, 1]
Pattern "ababada" found at index 6

4.4 Why KMP is Efficient

KMP vs Naïve: Using LPS to Skip
Naïve: After mismatch at position 5, restart from position 1
       Very wasteful - we already know positions 1-4!

KMP:   After mismatch, use LPS to skip redundant comparisons
       LPS tells us: "These characters already match, skip them!"

Example:
Text:     A B A B A C A B A B
Pattern:  A B A B A D
                    ↑ mismatch at index 5

Naïve shifts by 1:  _ A B A B A D
                      ↑ restart from beginning

KMP uses LPS[4]=3:  _ _ A B A B A D
                        ↑ resume at P[3]; the first 3 chars already match

4.5 Time Complexity

Phase            Complexity
LPS Computation  O(m)
Matching         O(n)
Total            O(n + m)

5. Longest Common Subsequence (LCS)

Longest Common Subsequence
Given two sequences X and Y, find the longest subsequence that appears in both. A subsequence is formed by deleting some (or no) elements without changing order.

Note: Subsequence ≠ Substring. Subsequence elements need not be contiguous.
🎯
Exam Tip (100% Frequency): LCS appears in EVERY exam with different strings! You must know:
  1. How to fill the DP table
  2. How to backtrack to find the actual LCS string

5.1 DP Recurrence

LCS[i][j] =
    0                                  if i = 0 or j = 0
    LCS[i-1][j-1] + 1                  if X[i] = Y[j]
    max(LCS[i-1][j], LCS[i][j-1])      if X[i] ≠ Y[j]

5.2 Algorithm

Pseudocode
function LCS_LENGTH(X, Y):
    m = length(X)
    n = length(Y)
    c[0..m][0..n] = 0    // Initialize DP table

    // Fill the table
    for i = 1 to m:
        for j = 1 to n:
            if X[i] == Y[j]:
                c[i][j] = c[i-1][j-1] + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])

    return c[m][n]    // Length of LCS

function PRINT_LCS(c, X, Y, i, j):
    if i == 0 or j == 0:
        return ""
    if X[i] == Y[j]:
        return PRINT_LCS(c, X, Y, i-1, j-1) + X[i]
    if c[i-1][j] > c[i][j-1]:
        return PRINT_LCS(c, X, Y, i-1, j)
    else:
        return PRINT_LCS(c, X, Y, i, j-1)
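Both functions can be combined in one Python sketch: fill the table, then backtrack iteratively (on ties it moves left, the same rule as PRINT_LCS above; Python strings are 0-indexed, hence the i-1/j-1 offsets):

```python
def lcs(X, Y):
    """LCS via DP table plus backtracking to recover one LCS string."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:          # characters match: extend diagonal
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Backtrack from c[m][n]
    chars, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            chars.append(X[i - 1])            # include the matched character
            i, j = i - 1, j - 1
        elif c[i - 1][j] > c[i][j - 1]:
            i -= 1                            # go up
        else:
            j -= 1                            # go left (also on ties)
    return "".join(reversed(chars))


print(lcs("AMERICA", "ARMENIA"))  # AMEIA
```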
Solved Example (100% Frequency) 10 Marks
Find the LCS for following strings:
X = "AMERICA"
Y = "ARMENIA"
Solution

Step 1: Create DP Table

      ε  A  R  M  E  N  I  A
  ε   0  0  0  0  0  0  0  0
  A   0  1  1  1  1  1  1  1
  M   0  1  1  2  2  2  2  2
  E   0  1  1  2  3  3  3  3
  R   0  1  2  2  3  3  3  3
  I   0  1  2  2  3  3  4  4
  C   0  1  2  2  3  3  4  4
  A   0  1  2  2  3  3  4  5

Step 2: Backtrack to Find LCS

  • Start at c[7][7] = 5
  • X[7]='A' = Y[7]='A' → Add 'A', go diagonal to c[6][6]
  • X[6]='C' ≠ Y[6]='I' → c[5][6]=4 > c[6][5]=3, go up
  • X[5]='I' = Y[6]='I' → Add 'I', go diagonal to c[4][5]
  • X[4]='R' ≠ Y[5]='N' → c[3][5]=3 = c[4][4]=3, go left
  • X[4]='R' ≠ Y[4]='E' → c[3][4]=3 > c[4][3]=2, go up
  • X[3]='E' = Y[4]='E' → Add 'E', go diagonal to c[2][3]
  • X[2]='M' = Y[3]='M' → Add 'M', go diagonal to c[1][2]
  • X[1]='A' ≠ Y[2]='R' → c[0][2]=0 < c[1][1]=1, go left
  • X[1]='A' = Y[1]='A' → Add 'A', go diagonal to c[0][0]

Reading backwards: A, M, E, I, A → LCS = "AMEIA"

LCS Length = 5
LCS = "AMEIA"
📝 More LCS Examples (From Past Papers)
  • "ACBAED" and "ABCABE" → LCS = "ABAE" (length 4)
  • "AGGTAB" and "GXTXAYB" → LCS = "GTAB" (length 4)
  • "SAVANT" and "ADVENT" → LCS = "AVNT" (length 4)

5.3 Time & Space Complexity

Aspect  Complexity
Time    O(m × n)
Space   O(m × n), or O(min(m,n)) with optimization

6. Applications of String Matching

6.1 Text Searching

  • Find/Replace in text editors
  • Search engines (keyword matching)
  • Plagiarism detection
  • Log file analysis

6.2 DNA Sequencing

  • Sequence alignment: Finding similarities between DNA sequences
  • Gene finding: Locating specific patterns in genome
  • Mutation detection: Comparing sequences to find differences
  • LCS application: Finding common subsequences in protein/DNA sequences
📝 DNA Matching Example

DNA Sequence 1: ATCGTACGATCG

DNA Sequence 2: ATCGACGATCG

LCS: ATCGACGATCG (finding common genetic material)

This helps identify evolutionary relationships and shared genes.

6.3 Data Compression

  • LZ77/LZ78: Find repeated patterns and reference earlier occurrences
  • Run-length encoding: Pattern detection for compression
  • Dictionary-based compression: Building pattern dictionaries

6.4 Other Applications

  • Spell checkers: Finding similar words (edit distance)
  • Autocomplete: Prefix matching in tries
  • Network security: Intrusion detection (pattern matching in packets)
  • Version control: Diff algorithms use LCS

7. Practice Questions

Question 1 (100% Frequency) 10 Marks
Find the LCS for strings X = "ACBAED" and Y = "ABCABE"
View Answer

Build 7×7 DP table (including row/column 0).

LCS Length = 4

LCS = "ABAE"

Question 2 (83% Frequency) 10 Marks
Explain KMP algorithm with example. Compute the prefix table for pattern "AABAACAABAA".
View Answer

LPS Table for "AABAACAABAA":

Index:  0  1  2  3  4  5  6  7  8  9  10
Char:   A  A  B  A  A  C  A  A  B  A  A
LPS:    0  1  0  1  2  0  1  2  3  4  5

Question 3 (50% Frequency) 10 Marks
Explain Naïve string matching algorithm with example.
View Answer

See Section 2 for algorithm and example.

Key points:

  • Check pattern at every position
  • Shift by 1 on mismatch
  • Time: O(nm) worst case
Question 4 10 Marks
Give the Rabin-Karp Algorithm for string matching. Explain its working.
View Answer

See Section 3 for algorithm.

Key concepts:

  • Use hash values to compare
  • Rolling hash: O(1) to compute next hash
  • Verify matches to handle collisions
  • O(n+m) average, O(nm) worst

Key Points Summary

📌 Important Definitions
  • String Matching Problem: Given text T (length n) and pattern P (length m), find all occurrences of P in T
  • Shift: Position in text where pattern comparison begins; valid shift means pattern matches at that position
  • Proper Prefix: A prefix of a string that is not the entire string itself
  • Proper Suffix: A suffix of a string that is not the entire string itself
  • Subsequence: A sequence derived by deleting some (or no) elements without changing order; NOT necessarily contiguous
  • Substring: A contiguous sequence of characters within a string
📌 Naïve String Matching
  • Method: Check pattern at every position in text; on mismatch, shift pattern by 1 and restart
  • Best Case: O(n) - first character of the pattern mismatches at every shift
  • Worst Case: O(nm) - pattern almost matches at every position (e.g., T="AAAA...A", P="AAAB")
  • Space: O(1) - no preprocessing
  • Advantage: Simple to implement, no preprocessing required
  • Disadvantage: Inefficient; re-examines characters already compared
📌 Rabin-Karp Algorithm
  • Method: Use hashing; compare hash values instead of characters; verify on hash match
  • Rolling Hash: Compute next window's hash in O(1) using previous hash
  • Formula: h(new) = d × (h(old) - T[s] × d^(m-1)) + T[s+m]
  • Spurious Hit: Hash match but actual mismatch (collision) - requires verification
  • Average Case: O(n + m) - few collisions
  • Worst Case: O(nm) - many spurious hits
  • Use Case: Multiple pattern matching, plagiarism detection
📌 KMP Algorithm (83% Exam Frequency)
  • Method: Preprocess pattern to build LPS array; use it to skip redundant comparisons
  • LPS[i]: Length of longest proper prefix of P[0..i] which is also a suffix
  • Key Insight: After mismatch, LPS tells how many characters already match - don't re-compare them
  • Preprocessing: O(m) to build LPS table
  • Matching: O(n) - never re-examines text characters
  • Total: O(n + m) - guaranteed, not just average
  • Advantage: Never backtracks in text; optimal for streaming data
📌 LPS Table Construction
  • LPS[0]: Always 0 (single character has no proper prefix)
  • Building: Compare P[i] with P[len]; if match, LPS[i] = len + 1; if not, fallback to LPS[len-1]
  • Example: Pattern "AABAACAABAA" → LPS = [0,1,0,1,2,0,1,2,3,4,5]
  • Using LPS: On mismatch at j, set j = LPS[j-1] instead of restarting
📌 Longest Common Subsequence (100% Exam Frequency)
  • Problem: Find longest subsequence present in both strings X and Y
  • Recurrence: LCS[i][j] = LCS[i-1][j-1] + 1 if X[i]=Y[j]; else max(LCS[i-1][j], LCS[i][j-1])
  • Base Case: LCS[0][j] = 0 for all j; LCS[i][0] = 0 for all i
  • Traceback: If X[i]=Y[j], go diagonal and include character; else go direction of larger value
  • Time Complexity: O(m × n)
  • Space Complexity: O(m × n), can be optimized to O(min(m,n))
📌 LCS Example Solutions (Common in Exams)
  • "AMERICA" and "ARMENIA": LCS = "AMEIA" (length 5)
  • "ACBAED" and "ABCABE": LCS = "ABAE" (length 4)
  • "AGGTAB" and "GXTXAYB": LCS = "GTAB" (length 4)
📌 Time Complexity Comparison
  • Naïve: O(nm) worst, O(n) best
  • Rabin-Karp: O(n+m) average, O(nm) worst
  • KMP: O(n+m) guaranteed
  • LCS: O(m × n)
📌 Applications
  • Text Searching: Find/Replace in editors, search engines, log analysis
  • DNA Sequencing: Sequence alignment, gene finding, mutation detection
  • Data Compression: LZ77/LZ78 algorithms find repeated patterns
  • Version Control: Diff algorithms use LCS to find changes between files
  • Plagiarism Detection: Rabin-Karp for multiple pattern matching
  • Spell Checkers: Finding similar words using edit distance variants