Spelling Correction in 27 Lines of Python


weixin_39746241


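Xu You (徐宥) once translated Norvig's essay, but Norvig has since updated the code.

To be clear up front, this is not an industrial-strength spelling corrector. It is a toy spelling corrector that Peter Norvig (Director of Research, Google) wrote, and explained, during a long-haul flight.

spell.py: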

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

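How it works

The corrector works from the text file big.txt. I wrote a small big.txt of my own for this walkthrough. The program first builds the Counter object WORDS from it, for example: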

>>> WORDS

Counter({'a': 7, 'and': 7, 'of': 5, 'spelling': 4, 'in': 4, ..., 'code': 1})
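As a minimal sketch of this step, the snippet below builds the same kind of Counter from a short inline string instead of reading big.txt. The sample text is made up for illustration; only words() and the Counter call come from spell.py.

```python
import re
from collections import Counter

def words(text):
    "Lowercase `text` and return every run of word characters."
    return re.findall(r'\w+', text.lower())

# Hypothetical stand-in for the contents of big.txt.
sample_text = "The cat and the dog. The end!"

WORDS = Counter(words(sample_text))
print(WORDS.most_common(1))  # [('the', 3)]
print(WORDS['missing'])      # 0: a Counter returns 0 for unseen words
```

Because the text is lowercased before counting, "The" and "the" are counted as the same word.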

The function P returns the probability of a word, that is, its relative frequency in big.txt:

>>> P('they')

0.025
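To make the arithmetic concrete, here is a small sketch of P with hand-written counts. The counts are hypothetical and are not taken from any real big.txt.

```python
from collections import Counter

# Hypothetical word counts standing in for the Counter built from big.txt.
WORDS = Counter({'the': 3, 'they': 1, 'cat': 4})

def P(word, N=sum(WORDS.values())):
    "Probability of `word`: its count divided by the total number of tokens."
    return WORDS[word] / N

print(P('they'))     # 1 / 8 = 0.125
print(P('unknown'))  # 0.0, because a Counter returns 0 for missing keys
```

One detail worth noticing: the default argument N is evaluated once, when the function is defined, so the total is not recomputed on every call. The flip side is that N would go stale if WORDS were modified afterwards.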

The function known takes a collection of words and returns the subset that appears in big.txt:

>>> known(['I','you','they'])

{'you', 'they'}
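One plausible reason 'I' is missing from the result above is that words() lowercases the text, so an uppercase 'I' can never be a key in WORDS (it is also possible that the sample big.txt simply never contained the word). The sketch below uses a hypothetical vocabulary to show the effect.

```python
from collections import Counter

# Hypothetical vocabulary; all keys are lowercase, as words() would produce them.
WORDS = Counter({'you': 2, 'they': 1, 'i': 5})

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

print(known(['I', 'you', 'they']))  # 'I' is dropped: only lowercase 'i' is a key
print(known(['xyzzy']))             # set(): nothing matched
```

The empty set returned in the second call is falsy, which is what lets candidates chain several known(...) calls together with `or`.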

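The function edits1 returns every string that can be obtained from a word with a single edit, where an edit deletes one letter, transposes two adjacent letters, replaces one letter, or inserts one letter: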

>>> edits1('yo')

{'yc', 'yh', 'wyo', 'bo', 'yt', 'yio', 'yu', 'io', 'yz', 'yoy', 'ypo', 'yob', 'yy', 'ygo', 'syo', 'to', 'vyo', 'eo', 'xo', 'yuo', 'yb', 'yoc', 'yf', 'yao', 'yo', 'yon', 'yro', 'po', 'tyo', 'ymo', 'ryo', 'yox', 'yoi', 'no', 'yyo', 'yw', 'mo', 'yow', 'yho', 'dyo', 'nyo', 'yg', 'cyo', 'fo', 'yk', 'yov', 'yq', 'uyo', 'yoe', 'qo', 'yv', 'oy', 'kyo', 'ys', 'yol', 'yot', 'ya', 'yqo', 'iyo', 'yx', 'yd', 'lo', 'yno', 'yoj', 'yod', 'yfo', 'yko', 'ylo', 'yj', 'yeo', 'gyo', 'ayo', 'yoh', 'lyo', 'you', 'yok', 'ydo', 'ywo', 'ao', 'ybo', 'oo', 'y', 'pyo', 'yr', 'yoo', 'ye', 'yco', 'yto', 'do', 'so', 'yoz', 'vo', 'hyo', 'yxo', 'o', 'yn', 'ym', 'ko', 'yl', 'yi', 'yop', 'yos', 'uo', 'wo', 'yzo', 'jo', 'yof', 'yso', 'yjo', 'yoa', 'zo', 'go', 'oyo', 'yoq', 'yom', 'myo', 'yp', 'yvo', 'jyo', 'co', 'yog', 'xyo', 'ro', 'eyo', 'qyo', 'fyo', 'zyo', 'ho', 'yor', 'byo'}
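The size of this set can be checked by counting. For a word of length n the code generates n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, so 54n+25 strings before duplicates are removed. The sketch below repeats edits1 from spell.py so that it runs on its own and verifies the count for 'yo'.

```python
def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

n = len('yo')
raw_total = n + (n - 1) + 26 * n + 26 * (n + 1)  # count before deduplication
print(raw_total)              # 133
print(len(edits1('yo')))      # 130 distinct strings
print('yo' in edits1('yo'))   # True: replacing a letter with itself gives the word back
```

The three missing strings are duplicates: 'yo' is produced twice by replacement, and 'yyo' and 'yoo' are each produced twice by insertion.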

Likewise, the function edits2 returns every string that is two edits away from a word, by applying edits1 to each result of edits1.
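A short sketch of what edits2 yields, using 'yu' as an arbitrary example input. Note that edits2 returns a generator that may repeat strings, so the example wraps it in set(...) before testing membership.

```python
def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

two_away = set(edits2('yu'))   # materialize the generator and drop duplicates
print('you' in two_away)       # True: one insertion away, and still included
print('your' in two_away)      # True: two insertions away
print('yours' in two_away)     # False: would need three insertions
```

Strings that are only one edit away also show up in edits2, because edits1 includes the word itself (a letter replaced by the same letter), so the second pass can reach anything the first pass could.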

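So candidates(word) works as follows: if word itself appears in big.txt, return it; otherwise,

if any strings one edit away from word appear in big.txt, return those (there may be many); otherwise,

if any strings two edits away from word appear in big.txt, return those (there may be many); otherwise,

return word itself.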
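The fall-through order described above can be sketched with a tiny vocabulary. The WORDS counts below are hypothetical; the functions are the ones from spell.py, repeated so the example runs without big.txt.

```python
from collections import Counter

# Hypothetical vocabulary standing in for the Counter built from big.txt.
WORDS = Counter({'you': 3, 'your': 2, 'spelling': 1})

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    # `or` returns the first non-empty (truthy) operand, so later,
    # more expensive steps run only when the earlier ones find nothing.
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

print(sorted(candidates('you')))    # ['you']: already a known word
print(sorted(candidates('yu')))     # ['you']: one edit away
print(sorted(candidates('yr')))     # ['you', 'your']: both are two edits away
print(sorted(candidates('qqqqq')))  # ['qqqqq']: nothing known within two edits
```

A consequence of this design is that a known word one edit away always beats a known word two edits away, no matter how frequent the latter is.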

Finally, correction(word) takes the collection of candidates returned by candidates(word) and returns the one with the highest probability.
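The selection step on its own is just max with P as the key. The sketch below isolates it, using hypothetical counts and a hand-written candidate set in place of the output of candidates(word).

```python
from collections import Counter

# Hypothetical counts, chosen only to illustrate the selection step.
WORDS = Counter({'the': 80, 'they': 15, 'then': 5})

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

# Stand-in for what candidates(word) might return for some misspelling.
candidate_set = {'they', 'then', 'the'}

best = max(candidate_set, key=P)
print(best)  # the, since P('the') = 0.8 is the largest
```

When candidates falls back to [word] for an unknown word, max simply returns that word, since it is the only element.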

Understanding the program

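First, let us interpret the program in terms of probability theory. Given a word w, the function correction(w) picks, from a set of candidates, the word c that is most likely to be the intended correction, that is,

$$\operatorname*{argmax}_{c \,\in\, \text{candidates}} P(c \mid w)$$

By Bayes' theorem, this is equal to

$$\operatorname*{argmax}_{c \,\in\, \text{candidates}} \frac{P(c)\, P(w \mid c)}{P(w)}$$

Because P(w) is the same for every candidate c, we can drop it, which leaves

$$\operatorname*{argmax}_{c \,\in\, \text{candidates}} P(c)\, P(w \mid c)$$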
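A quick numeric check shows why dropping P(w) is harmless: dividing every score by the same positive constant cannot change which candidate wins. All numbers below are made up for illustration and are not measured from any corpus, and p_w is computed over just these three candidates to keep the example small.

```python
# Made-up probabilities for a hypothetical misspelling w = 'thew'.
prior      = {'the': 0.07,  'thaw': 0.0001, 'threw': 0.0005}  # P(c)
likelihood = {'the': 0.001, 'thaw': 0.01,   'threw': 0.008}   # P(w|c)

# P(w) does not depend on c; here it is summed over the three candidates only.
p_w = sum(prior[c] * likelihood[c] for c in prior)

with_pw    = max(prior, key=lambda c: prior[c] * likelihood[c] / p_w)
without_pw = max(prior, key=lambda c: prior[c] * likelihood[c])
print(with_pw, without_pw)  # the the
```

Both rankings agree, so the program only ever needs the product P(c) P(w|c).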

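So our program breaks down into four parts:

Selection mechanism: argmax, which picks the candidate with the highest probability of being correct.

Candidate model: candidates, which proposes the candidate corrections.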