1. Introduction to String Matching
- LCS appears in EVERY paper with different strings - master the DP table method!
- KMP appears in 83% of papers - focus on computing the LPS (prefix) table
- Naïve algorithm is simpler but know its O(mn) complexity
1.1 Terminology
- Text (T): The string in which we search (length n)
- Pattern (P): The string we search for (length m)
- Shift (s): Position in text where pattern is compared
- Valid shift: Position where pattern matches in text
1.2 Complexity Comparison
| Algorithm | Preprocessing | Matching | Total (Worst) |
|---|---|---|---|
| Naïve | None | O((n-m+1)m) | O(nm) |
| Rabin-Karp | O(m) | O(n) avg, O(nm) worst | O(nm) worst, O(n+m) avg |
| KMP | O(m) | O(n) | O(n+m) |
2. Naïve String Matching Algorithm
2.1 Algorithm
function NAIVE_STRING_MATCH(T, P):
    n = length(T)
    m = length(P)
    for s = 0 to n - m:
        // Check if pattern matches at shift s
        j = 0
        while j < m AND T[s + j] == P[j]:
            j = j + 1
        if j == m:
            print "Pattern found at shift " + s
Text:    A B C A B D A B C
Index:   0 1 2 3 4 5 6 7 8
Pattern: A B C

Shift 0: A B C A B D A B C
         A B C              ✓ MATCH at s=0!
Shift 1: A B C A B D A B C
           A B C            ✗ B≠A at j=0
Shift 2: A B C A B D A B C
             A B C          ✗ C≠A at j=0
Shift 3: A B C A B D A B C
               A B C        ✗ A=A, B=B, D≠C at j=2
Shift 4: A B C A B D A B C
                 A B C      ✗ B≠A at j=0
Shift 5: A B C A B D A B C
                   A B C    ✗ D≠A at j=0
Shift 6: A B C A B D A B C
                     A B C  ✓ MATCH at s=6!

Pattern found at shifts: 0, 6
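The pseudocode above translates directly to runnable Python; this sketch (the function name `naive_search` is mine) reproduces the trace:

```python
def naive_search(text, pattern):
    """Return all shifts where pattern occurs in text (naive O(nm) scan)."""
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):
        # Compare pattern to text[s:s+m] character by character
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched at shift s
            matches.append(s)
    return matches

print(naive_search("ABCABDABC", "ABC"))  # → [0, 6]
```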
2.2 Time Complexity Analysis
Best Case: O(n)
- First character of the pattern mismatches at every shift
- Only one comparison at each of the (n-m+1) shifts
- (Finding the pattern at the very first position costs just m comparisons)
Worst Case: O(nm)
- Text: "AAAAAAAAAA"
- Pattern: "AAAAB"
- Pattern almost matches at every position
- m comparisons at each of n-m+1 positions
3. Rabin-Karp Algorithm
3.1 Key Idea: Rolling Hash
Compute the hash of the pattern once. Then "roll" the hash over the text: given the hash t of window T[s..s+m-1], the next window's hash is
    t' = (d × (t - T[s] × h) + T[s+m]) mod q,   where h = d^(m-1) mod q
All operations are done modulo a prime q to keep values small and avoid overflow.
3.2 Algorithm
function RABIN_KARP(T, P, d, q):
    n = length(T)
    m = length(P)
    h = d^(m-1) mod q    // High-order digit multiplier
    p = 0                // Hash of pattern
    t = 0                // Hash of current text window
    // Preprocessing: compute hash of pattern and first window
    for i = 0 to m-1:
        p = (d × p + P[i]) mod q
        t = (d × t + T[i]) mod q
    // Matching
    for s = 0 to n - m:
        if p == t:
            // Hash match - verify character by character
            if P[0..m-1] == T[s..s+m-1]:
                print "Pattern found at shift " + s
        // Compute hash for next window (rolling hash)
        if s < n - m:
            t = (d × (t - T[s] × h) + T[s + m]) mod q
            if t < 0:
                t = t + q
Text: "31415926535"
Pattern: "26"
Step 1: Compute pattern hash
p = (2 × 10 + 6) mod 13 = 26 mod 13 = 0
Step 2: Compute hash of first window "31"
t = (3 × 10 + 1) mod 13 = 31 mod 13 = 5
Step 3: Slide and compare
- Window "31": hash=5, p=0, no match
- Window "14": t = (10 × (5 - 3 × 10) + 4) mod 13 = -246 mod 13 = 1
- Continue until hash matches, then verify
- Match found at shift 6 ("26")
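The worked example can be checked with a short Python sketch. Assumptions: the text consists of decimal digits (so each character maps to its digit value), d=10 and q=13 as in the example, and the function name `rabin_karp` is mine. Note that Python's `%` never returns a negative result, so the `t < 0` correction from the pseudocode is unnecessary here:

```python
def rabin_karp(text, pattern, d=10, q=13):
    """Rabin-Karp search over a digit string; returns shifts where pattern occurs.
    d = radix, q = prime modulus (limits spurious hits)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)              # high-order multiplier d^(m-1) mod q
    p = t = 0
    for i in range(m):                # hashes of pattern and first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:   # verify on hash hit
            matches.append(s)
        if s < n - m:                 # roll: drop T[s], bring in T[s+m]
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches

print(rabin_karp("31415926535", "26"))  # → [6]
```

Note that the window "65" (shift 7) also hashes to 0 mod 13, the same as the pattern; the character-by-character verification is what rejects this spurious hit.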
3.3 Time Complexity
| Case | Complexity | When |
|---|---|---|
| Average | O(n + m) | Few hash collisions |
| Worst | O(nm) | Many hash collisions (spurious hits) |
4. Knuth-Morris-Pratt (KMP) Algorithm
4.1 LPS Array (Prefix Function)
Proper prefix: a prefix that is not the entire string.
LPS[i]: length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i].
Pattern: "ABABACA"
| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| P[i] | A | B | A | B | A | C | A |
| LPS[i] | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
Explanation:
- LPS[0] = 0 (single char, no proper prefix)
- LPS[1] = 0 ("AB" - no matching prefix/suffix)
- LPS[2] = 1 ("ABA" - "A" is both prefix and suffix)
- LPS[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
- LPS[4] = 3 ("ABABA" - "ABA" is both prefix and suffix)
- LPS[5] = 0 ("ABABAC" - no matching prefix/suffix)
- LPS[6] = 1 ("ABABACA" - "A" is both prefix and suffix)
4.2 LPS Computation Algorithm
function COMPUTE_LPS(P):
    m = length(P)
    LPS[0] = 0
    len = 0    // Length of previous longest prefix suffix
    i = 1
    while i < m:
        if P[i] == P[len]:
            len = len + 1
            LPS[i] = len
            i = i + 1
        else:
            if len != 0:
                len = LPS[len - 1]    // Fall back
                // Don't increment i
            else:
                LPS[i] = 0
                i = i + 1
    return LPS
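A runnable Python version of COMPUTE_LPS (I renamed `len` to `length` since `len` is a Python built-in):

```python
def compute_lps(pattern):
    """LPS[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    m = len(pattern)
    lps = [0] * m
    length = 0          # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABACA"))  # → [0, 0, 1, 2, 3, 0, 1]
```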
4.3 KMP Matching Algorithm
function KMP_SEARCH(T, P):
    n = length(T)
    m = length(P)
    LPS = COMPUTE_LPS(P)
    i = 0    // Index in text
    j = 0    // Index in pattern
    while i < n:
        if P[j] == T[i]:
            i = i + 1
            j = j + 1
        if j == m:
            print "Pattern found at index " + (i - j)
            j = LPS[j - 1]    // Look for next match
        else if i < n AND P[j] != T[i]:
            if j != 0:
                j = LPS[j - 1]    // Skip comparisons using LPS
            else:
                i = i + 1
Step 1: Compute LPS (Prefix Table) for pattern "ababada"
| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| P[i] | a | b | a | b | a | d | a |
| LPS[i] | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
Step 2: KMP Matching on "badbababadaab"
Text:    b a d b a b a b a d a a b
Index:   0 1 2 3 4 5 6 7 8 9 10 11 12
Pattern: a b a b a d a

Step 1: i=0, j=0: Text[0]='b' ≠ Pattern[0]='a', j=0, so i++
Step 2: i=1, j=0: Text[1]='a' = Pattern[0]='a' ✓, i++, j++
Step 3: i=2, j=1: Text[2]='d' ≠ Pattern[1]='b', j=LPS[0]=0, i stays
Step 4: i=2, j=0: Text[2]='d' ≠ Pattern[0]='a', j=0, so i++
Step 5: i=3, j=0: Text[3]='b' ≠ Pattern[0]='a', j=0, so i++
Step 6: i=4, j=0: Text[4]='a' = Pattern[0]='a' ✓, continue matching...
        Text[4..8] = "ababa" matches Pattern[0..4] = "ababa" ✓
        i=9:  Text[9]='d' = Pattern[5]='d' ✓
        i=10: Text[10]='a' = Pattern[6]='a' ✓, i++, j++
        Now j == m (7): MATCH FOUND at index i - j = 11 - 7 = 4
        Check: Text[4..10] = "ababada" = Pattern ✓
Pattern "ababada" found at index 4
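The full search, as a runnable Python sketch (the function name `kmp_search` is mine; the LPS construction from Section 4.2 is inlined so the block is self-contained):

```python
def kmp_search(text, pattern):
    """Return all indices where pattern occurs in text, in O(n + m) time."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Build the LPS table (Section 4.2)
    lps = [0] * m
    length, i = 0, 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            length = lps[length - 1]   # fall back; do not advance i
        else:
            i += 1
    # Scan the text; i never moves backwards
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                matches.append(i - j)
                j = lps[j - 1]         # keep looking (overlapping matches)
        elif j:
            j = lps[j - 1]             # reuse the already-matched prefix
        else:
            i += 1
    return matches

print(kmp_search("badbababadaab", "ababada"))  # → [4]
```

Because `j = lps[j - 1]` is also applied after a full match, overlapping occurrences are reported as well.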
4.4 Why KMP is Efficient
Naïve: After mismatch at position 5, restart from position 1
Very wasteful - we already know positions 1-4!
KMP: After mismatch, use LPS to skip redundant comparisons
LPS tells us: "These characters already match, skip them!"
Example:
Text:    A B A B A C A B A B
Pattern: A B A B A D
                   ↑ mismatch at index 5 (C ≠ D)
Naïve shifts by 1:   _ A B A B A D
                     (restarts comparisons from the beginning of the pattern)
KMP uses LPS[4]=3:   _ _ A B A B A D
                     (shift by 5-3=2; the first 3 pattern chars already align
                      with known matches, so resume comparing at j=3)
4.5 Time Complexity
| Phase | Complexity |
|---|---|
| LPS Computation | O(m) |
| Matching | O(n) |
| Total | O(n + m) |
5. Longest Common Subsequence (LCS)
Note: Subsequence ≠ Substring. Subsequence elements need not be contiguous.
Know both:
- How to fill the DP table
- How to backtrack to find the actual LCS string
5.1 DP Recurrence
LCS[i][j] =
    0                                  if i = 0 or j = 0
    LCS[i-1][j-1] + 1                  if X[i] = Y[j]
    max(LCS[i-1][j], LCS[i][j-1])      if X[i] ≠ Y[j]
5.2 Algorithm
function LCS_LENGTH(X, Y):
    m = length(X)
    n = length(Y)
    c[0..m][0..n] = 0    // Initialize DP table
    // Fill the table
    for i = 1 to m:
        for j = 1 to n:
            if X[i] == Y[j]:
                c[i][j] = c[i-1][j-1] + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])
    return c[m][n]    // Length of LCS

function PRINT_LCS(c, X, Y, i, j):
    if i == 0 or j == 0:
        return ""
    if X[i] == Y[j]:
        return PRINT_LCS(c, X, Y, i-1, j-1) + X[i]
    if c[i-1][j] > c[i][j-1]:
        return PRINT_LCS(c, X, Y, i-1, j)
    else:
        return PRINT_LCS(c, X, Y, i, j-1)
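Both functions combine into one runnable Python sketch (the name `lcs` is mine; on ties the backtrack goes left, matching PRINT_LCS above, so it recovers the same LCS when several exist):

```python
def lcs(X, Y):
    """Return (length, one LCS string) via the DP recurrence of Section 5.1."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:            # chars match: diagonal + 1
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Backtrack from c[m][n] to recover one LCS
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] > c[i][j - 1]:         # strictly larger above: go up
            i -= 1
        else:                                   # otherwise (incl. ties): go left
            j -= 1
    return c[m][n], "".join(reversed(out))

print(lcs("AMERICA", "ARMENIA"))  # → (5, 'AMEIA')
```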
X = "AMERICA"
Y = "ARMENIA"
Step 1: Create DP Table
| | ε | A | R | M | E | N | I | A |
|---|---|---|---|---|---|---|---|---|
| ε | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| M | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
| E | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 3 |
| R | 0 | 1 | 2 | 2 | 3 | 3 | 3 | 3 |
| I | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 4 |
| C | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 4 |
| A | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 5 |
Step 2: Backtrack to Find LCS
- Start at c[7][7] = 5
- X[7]='A' = Y[7]='A' → Add 'A', go diagonal to c[6][6]
- X[6]='C' ≠ Y[6]='I' → c[5][6]=4 > c[6][5]=3, go up
- X[5]='I' = Y[6]='I' → Add 'I', go diagonal to c[4][5]
- X[4]='R' ≠ Y[5]='N' → c[3][5]=3 = c[4][4]=3, go left
- X[4]='R' ≠ Y[4]='E' → c[3][4]=3 > c[4][3]=2, go up
- X[3]='E' = Y[4]='E' → Add 'E', go diagonal to c[2][3]
- X[2]='M' = Y[3]='M' → Add 'M', go diagonal to c[1][2]
- X[1]='A' ≠ Y[2]='R' → c[0][2]=0 < c[1][1]=1, go left
- X[1]='A' = Y[1]='A' → Add 'A', go diagonal to c[0][0]
Reading backwards: A, M, E, I, A → LCS = "AMEIA"
LCS = "AMEIA"
- "ACBAED" and "ABCABE" → LCS = "ABAE" (length 4)
- "AGGTAB" and "GXTXAYB" → LCS = "GTAB" (length 4)
- "SAVANT" and "ADVENT" → LCS = "AVNT" (length 4)
5.3 Time & Space Complexity
| Aspect | Complexity |
|---|---|
| Time | O(m × n) |
| Space | O(m × n); O(min(m,n)) with two-row optimization (length only, no backtracking) |
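The space optimization mentioned in the table keeps only two DP rows. A sketch (the name `lcs_length` is mine; this computes the length only, since backtracking needs the full table):

```python
def lcs_length(X, Y):
    """LCS length in O(min(m, n)) space using two rolling rows."""
    if len(X) < len(Y):
        X, Y = Y, X                 # iterate over the longer string's rows
    prev = [0] * (len(Y) + 1)       # row i-1 of the DP table
    for i in range(1, len(X) + 1):
        curr = [0] * (len(Y) + 1)   # row i, built from prev
        for j in range(1, len(Y) + 1):
            if X[i - 1] == Y[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

print(lcs_length("AMERICA", "ARMENIA"))  # → 5
```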
6. Applications of String Matching
6.1 Text Searching
- Find/Replace in text editors
- Search engines (keyword matching)
- Plagiarism detection
- Log file analysis
6.2 DNA Sequencing
- Sequence alignment: Finding similarities between DNA sequences
- Gene finding: Locating specific patterns in genome
- Mutation detection: Comparing sequences to find differences
- LCS application: Finding common subsequences in protein/DNA sequences
DNA Sequence 1: ATCGTACGATCG
DNA Sequence 2: ATCGACGATCG
LCS: ATCGACGATCG (finding common genetic material)
This helps identify evolutionary relationships and shared genes.
6.3 Data Compression
- LZ77/LZ78: Find repeated patterns and reference earlier occurrences
- Run-length encoding: Pattern detection for compression
- Dictionary-based compression: Building pattern dictionaries
6.4 Other Applications
- Spell checkers: Finding similar words (edit distance)
- Autocomplete: Prefix matching in tries
- Network security: Intrusion detection (pattern matching in packets)
- Version control: Diff algorithms use LCS
7. Practice Questions
Q1. Find the LCS of "ACBAED" and "ABCABE".
Build a 7×7 DP table (including row/column 0).
LCS Length = 4
LCS = "ABAE"
Q2. Compute the LPS table for "AABAACAABAA":
Index: 0 1 2 3 4 5 6 7 8 9 10
Char: A A B A A C A A B A A
LPS: 0 1 0 1 2 0 1 2 3 4 5
Q3. Explain the naïve string matching algorithm. See Section 2 for the algorithm and an example.
Key points:
- Check pattern at every position
- Shift by 1 on mismatch
- Time: O(nm) worst case
Q4. Explain the Rabin-Karp algorithm. See Section 3 for the algorithm.
Key concepts:
- Use hash values to compare
- Rolling hash: O(1) to compute next hash
- Verify matches to handle collisions
- O(n+m) average, O(nm) worst
Key Points Summary
- Naïve: O(nm) - check pattern at every position, shift by 1
- Rabin-Karp: O(n+m) avg - use hash values, verify on match
- KMP: O(n+m) guaranteed - use LPS table to skip comparisons
- LPS[i] = longest proper prefix that is also suffix of pattern[0..i]
- LCS: DP approach, O(mn) - if chars match go diagonal+1, else max(up, left)
- LCS backtrack: match → diagonal, else go direction of larger value
- KMP advantage: never re-examine text characters
- Applications: Text search, DNA sequencing, compression, version control