Week 2-5 Spelling similarity: edit distance

Spelling similarity

  • Typos
  • Variants in spelling

Edit operations

  • Insertion
  • Deletion
  • Substitution
  • Multiple edits

Levenshtein method

  • Based on dynamic programming
  • Insertions, deletions and substitutions usually have a cost of 1

Example

We want to calculate the edit distance between "strength" and "trend".

Definitions

  • s1(i) : ith character in string s1
  • s2(j) : jth character in string s2
  • D(i,j) : edit distance between a prefix of s1 of length i and a prefix of s2 of length j
  • t(i,j) : cost of aligning the ith character in string s1 with the jth character in string s2

Recursive dependencies

D(i, 0) = i
D(0, j) = j
D(i, j) = min{D( i - 1, j ) + 1, 
              D( i, j - 1 ) + 1,
              D( i - 1, j - 1 ) + t( i, j )}

Simple edit distance

t( i, j ) = 0 iff s1( i ) = s2( j )
t( i, j ) = 1 otherwise 
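
The recurrence and the simple cost function above can be sketched directly in Python. This is a minimal illustration rather than a reference implementation; the function name `edit_distance` is an assumption made for this example, and the table `D` is indexed exactly as in the definitions (prefix lengths, so row 0 and column 0 stand for the empty prefix).

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with a cost of 1 per insertion, deletion or substitution."""
    m, n = len(s1), len(s2)
    # D[i][j] = edit distance between the first i characters of s1
    # and the first j characters of s2.
    D = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: D(i, 0) = i and D(0, j) = j.
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # t(i, j) = 0 iff the ith character of s1 equals the jth of s2.
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # delete s1(i)
                D[i][j - 1] + 1,      # insert s2(j)
                D[i - 1][j - 1] + t,  # align s1(i) with s2(j)
            )
    return D[m][n]


print(edit_distance("strength", "trend"))  # 4
```

For "strength" and "trend", one optimal sequence is: delete s, substitute g with d, delete t, delete h.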

Initialization

Fill in the border cells of the table from the base cases: D(i, 0) = i and D(0, j) = j.

Recursion

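Fill in the remaining cells one at a time. Each D(i, j) depends only on the three neighbouring cells D(i - 1, j), D(i, j - 1) and D(i - 1, j - 1). The last cell filled, D(8, 5), is the edit distance between "strength" and "trend".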
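
To inspect the intermediate values, the whole table can be built and printed. The sketch below is only an illustration: the names `edit_table` and `format_table` are made up for this example, rows are assumed to correspond to prefixes of s1 and columns to prefixes of s2, and `#` marks the empty prefix.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """Return the full (len(s1)+1) x (len(s2)+1) table of prefix distances."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: the border cells come from the base cases.
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    # Recursion: every other cell is computed from three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def format_table(s1: str, s2: str, D: list[list[int]]) -> str:
    """Render the table with s2 across the top and s1 down the side."""
    lines = ["   " + "  ".join("#" + s2)]
    for i, row in enumerate(D):
        label = "#" if i == 0 else s1[i - 1]
        lines.append(label + " " + " ".join(f"{v:2d}" for v in row))
    return "\n".join(lines)


table = edit_table("strength", "trend")
print(format_table("strength", "trend", table))
```

The bottom-right entry, `table[-1][-1]`, is the distance between the two full strings.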

Other costs

  • Damerau modification
    • swaps of two adjacent characters also have a cost of 1 (people often transpose adjacent characters when typing)
    • Lev( cats, cast ) = 2
    • Dam( cats, cast ) = 1
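
One common way to add the swap operation is the restricted variant often called "optimal string alignment", which extends the Levenshtein recurrence with one extra case. The sketch below assumes that variant (the notes do not say which formulation is meant), and the function name `osa_distance` is made up for this example.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Edit distance where swapping two adjacent characters also costs 1.

    This is the restricted (optimal string alignment) variant: a swapped
    pair of characters is not edited again afterwards.
    """
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
            # Extra case: the last two characters of both prefixes are
            # the same pair in opposite order, so one swap aligns them.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[m][n]


print(osa_distance("cats", "cast"))  # 1
```

Without the extra case the function reduces to plain Levenshtein distance, which gives 2 for the same pair.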

Other edit distances

  • Dist( sit down, sit clown ) = 1?
    • models errors common in optical character recognition (OCR), e.g. d is likely to be misread as cl
  • Dist( qeather, weather ) = 1, Dist( leather, weather ) = 2?
    • models spelling errors caused by "fat fingers": q and w are adjacent on the keyboard, whereas l and w are far apart (so typing l for w is unlikely to be a typo)
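
The keyboard example can be sketched by making the substitution cost t(i, j) depend on the two characters. Everything specific here is an assumption for illustration: the adjacency model only counts horizontal neighbours on a QWERTY layout, non-adjacent substitutions are given a cost of 2, and the names `keyboard_sub_cost` and `weighted_edit_distance` are made up. The OCR case (d versus cl) would need a one-to-two substitution and is not covered by this sketch.

```python
# Hypothetical adjacency model: horizontal neighbours on a QWERTY keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ADJACENT = set()
for row in QWERTY_ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.add((a, b))
        ADJACENT.add((b, a))


def keyboard_sub_cost(a: str, b: str) -> int:
    """Assumed costs: 0 for a match, 1 for neighbouring keys, 2 otherwise."""
    if a == b:
        return 0
    return 1 if (a, b) in ADJACENT else 2


def weighted_edit_distance(s1: str, s2: str, sub_cost=keyboard_sub_cost) -> int:
    """Edit distance with unit insert/delete costs and a pluggable t(i, j)."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[m][n]


print(weighted_edit_distance("qeather", "weather"))  # 1
print(weighted_edit_distance("leather", "weather"))  # 2
```

Passing a different `sub_cost` function is enough to model other error sources, as long as each error replaces one character with one character.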

Edit distance is also used in bioinformatics to compare genetic sequences and amino acid sequences.
