Writing a Spelling Corrector Step by Step: A Toy Version (Python)
This code example is a translation of Norvig's; see his original article.
import re
from collections import Counter

def words(text):
    "Split the text into lowercase word tokens."
    return re.findall(r'\w+', text.lower())

# big.txt can be any large text file; I used the collected works of Shakespeare.
WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for `word`."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate the candidate corrections for a misspelled `word`."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
>>> correction('korrectud')
'corrected'
>>> correction('speling')
'spelling'
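To get a feel for how many strings edits1 generates, the sketch below counts each kind of edit for one word. The helper name edit_counts is hypothetical (it is not part of the original code); it simply repeats the list comprehensions from edits1 and reports their sizes.

```python
def edit_counts(word, letters='abcdefghijklmnopqrstuvwxyz'):
    """Count the one-edit strings for `word`, per edit type (illustrative helper)."""
    # Every way to cut the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return {
        'deletes': len(deletes),
        'transposes': len(transposes),
        'replaces': len(replaces),
        'inserts': len(inserts),
        # The set removes duplicates, e.g. replacing a letter with itself.
        'distinct': len(set(deletes + transposes + replaces + inserts)),
    }

counts = edit_counts('somthing')
print(counts)
# {'deletes': 8, 'transposes': 7, 'replaces': 208, 'inserts': 234, 'distinct': 442}
```

For a word of length n the raw total is n + (n - 1) + 26n + 26(n + 1) = 54n + 25, so the set grows linearly with the word length; edits2 applies edits1 again to each of those strings, which is why it is returned as a generator rather than a list.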
How It Works (Theory): The Noise Model
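The function correction(w) tries to choose the most likely correction for the misspelled word w. There is no way to know for certain which one is right (for example, should "lates" be corrected to "late", "latest", "lattes", or something else?), so we can only use probabilities to assess the likelihood. Among all candidate corrections, we look for the correction c that maximizes the probability of c given the original word w: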
$$\operatorname*{argmax}_{c \in \text{candidates}} P(c \mid w)$$
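As a small illustration of the argmax, the sketch below picks a correction for "lates" from three candidates. The counts are made up for the example (they do not come from big.txt), and, like the code in this post, it ranks candidates by word frequency alone, which is a simplification of P(c | w).

```python
from collections import Counter

# Made-up counts for illustration only; not taken from big.txt.
toy_counts = Counter({'late': 120, 'latest': 40, 'lattes': 2})
total = sum(toy_counts.values())

def prob(word):
    """Relative frequency of `word` in the toy counts."""
    return toy_counts[word] / total

# Three known words that are one edit away from "lates".
candidates_for_lates = ['late', 'latest', 'lattes']

# argmax over the candidates: keep the one with the highest probability.
best = max(candidates_for_lates, key=prob)
print(best)  # late
```

With a different corpus the counts, and therefore the winner, could change; the point is only that the choice is made by comparing probabilities, not by certainty.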
By Bayes' theorem, this is equivalent to:
a