Spelling similarity
- Typos
- Variants in spelling
Edit operations
- Insertion
- Deletion
- Substitution
- Multiple edits
Levenshtein method
- Based on dynamic programming
- Insertions, deletions and substitutions usually have a cost of 1
Example
We want to calculate the edit distance between "strength" and "trend".
Definitions
- s1(i) : ith character in string s1
- s2(j) : jth character in string s2
- D(i,j) : edit distance between a prefix of s1 of length i and a prefix of s2 of length j
- t(i,j) : cost of aligning the ith character in string s1 with the jth character in string s2
Recursive dependencies
D(i, 0) = i
D(0, j) = j
D(i, j) = min{ D(i - 1, j) + 1,
               D(i, j - 1) + 1,
               D(i - 1, j - 1) + t(i, j) }
Simple edit distance
t(i, j) = 0 iff s1(i) = s2(j)
t(i, j) = 1 otherwise
Initialization
Recursion
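The initialization and recursion steps can be sketched in Python as follows. This is a minimal illustration of the definitions above, not a reference implementation; the function names levenshtein_table and levenshtein are assumptions chosen for this example. The table D has one extra row and column for the empty prefixes, so D[i][j] corresponds to D(i, j).

```python
def levenshtein_table(s1, s2):
    """Build the table D, where D[i][j] is the edit distance between
    the first i characters of s1 and the first j characters of s2."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: D(i, 0) = i and D(0, j) = j.
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    # Recursion: fill the table row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # t(i, j) is 0 when the characters match, 1 otherwise.
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
    return D


def levenshtein(s1, s2):
    """Return the edit distance, i.e. the bottom-right cell of the table."""
    return levenshtein_table(s1, s2)[-1][-1]


print(levenshtein("strength", "trend"))  # 4
```

For the running example, one optimal alignment deletes "s", keeps "tren", substitutes "g" with "d", and deletes "t" and "h", for a total cost of 4.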
Other costs
- Damerau modification
- Swaps of two adjacent characters also have a cost of 1 (people often transpose adjacent characters when typing)
- Lev( cats, cast ) = 2
- Dam( cats, cast ) = 1
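The Damerau modification can be sketched by adding one extra case to the recursion. The version below is the restricted variant (often called optimal string alignment), which is an assumption made here for brevity: it allows a swap of two adjacent characters but does not allow further edits to the swapped pair. The function name osa_distance is chosen for this example.

```python
def osa_distance(s1, s2):
    """Edit distance with insertions, deletions, substitutions and
    swaps of two adjacent characters, each with a cost of 1."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
            # Extra case: the last two characters of both prefixes
            # are the same pair in swapped order.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[m][n]


print(osa_distance("cats", "cast"))  # 1
```

Without the extra case the same table gives 2 for this pair, which matches the Lev and Dam values listed above.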
Other edit distances
Dist( sit down, sit clown ) = 1?
- Model errors common in optical character recognition (OCR), e.g. "d" is likely to be misread as "cl"
Dist( qeather, weather ) = 1, Dist( leather, weather ) = 2?
- Model spelling errors caused by fat-finger typing, e.g. q and w are adjacent on the keyboard, whereas l and w are far apart (so l for w is unlikely to be a typo)
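A keyboard-aware distance of this kind can be sketched by making the substitution cost t(i, j) depend on the two characters. Everything specific here is an assumption for illustration: the names keyboard_cost and weighted_distance, the simplified adjacency rule (neighbours on the same QWERTY row, lowercase letters only), and the costs of 1 for adjacent keys and 2 for all other substitutions, which were picked to reproduce the two values above.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def keys_adjacent(a, b):
    """True if a and b are next to each other on the same QWERTY row
    (a simplified notion of adjacency, assumed for this sketch)."""
    for row in QWERTY_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False


def keyboard_cost(a, b):
    """Substitution cost: 0 for a match, 1 for adjacent keys, 2 otherwise."""
    if a == b:
        return 0
    return 1 if keys_adjacent(a, b) else 2


def weighted_distance(s1, s2, sub_cost=keyboard_cost):
    """Edit distance with unit insert/delete costs and a
    character-dependent substitution cost."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[m][n]


print(weighted_distance("qeather", "weather"))  # 1
print(weighted_distance("leather", "weather"))  # 2
```

Passing a different sub_cost function gives other variants. The OCR case, where one character is confused with two, would additionally need a one-to-two substitution step that this sketch does not include.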
Edit distance is also used in bioinformatics to compare genetic sequences and amino acid sequences.