Writing a Spelling Corrector Step by Step: A Toy Version (Python)


24thAUG



Article link: https://blog.csdn.net/iwanthn/article/details/68948061


This code example is translated from Peter Norvig's essay "How to Write a Spelling Corrector".

import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

# big.txt can be any large text file; I used the collected works of Shakespeare.
WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`: its count divided by the total word count N."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for `word`."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate candidate corrections for `word`, preferring the smallest edit distance."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

correction('korrectud')  # --> 'corrected'
correction('speling')    # --> 'spelling'
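
The listing above needs a big.txt file before it can run. To try the candidate logic without one, here is a minimal self-contained sketch that swaps the corpus for a tiny hand-made dictionary. The name TOY_WORDS and its counts are invented for illustration and are not from the original essay. Because every candidate is divided by the same total N, ranking by raw count gives the same winner as ranking by P.

```python
from collections import Counter

# Hypothetical toy word counts standing in for big.txt (assumed for illustration only).
TOY_WORDS = Counter({'the': 50, 'late': 6, 'spelling': 5,
                     'later': 4, 'latest': 3, 'spewing': 1})

def edits1(word):
    "All strings that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def known(words):
    "The subset of `words` found in the toy dictionary."
    return {w for w in words if w in TOY_WORDS}

def candidates(word):
    "Known word itself, else known words at distance 1, else distance 2, else the word."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    "Pick the candidate with the highest count (same ranking as the highest P)."
    return max(candidates(word), key=lambda w: TOY_WORDS[w])

print(sorted(candidates('lates')))  # ['late', 'later', 'latest']
print(correction('lates'))          # late
print(correction('speling'))        # spelling
print(correction('zzzzz'))          # zzzzz
```

With these made-up counts, 'lates' has three known candidates at edit distance 1 and the most frequent one, 'late', wins. A word with no known neighbour within two edits, such as 'zzzzz', is returned unchanged.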

How It Works, Theory Part: The Noise Model

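The function correction(w) tries to choose the most likely correction for the misspelled word $w$. There is no way to know for certain which one is right (for example, should "lates" be corrected to "late", "latest", "lattes", or something else?), so we can only use probabilities to judge how likely each one is. Among all the candidate corrections, the correction $c$ we want is the one with the highest probability given the original word $w$: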