Text Preprocessing | (5) A Simple Text Correction Example

郭畅小渣渣

Article link: https://blog.csdn.net/qq_40276310/article/details/109975552

The previous section left us with a small question: given a corpus of English text, how can we correct spelling errors?

First, we take a corpus file, "bayes_train_text.txt", and count how often each word occurs in it.

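import re
import collections

# Extract every word in the corpus, lowercased
def words(text):
    return re.findall("[a-z]+", text.lower())

# Words not in the corpus default to a count of 1, so no prior probability is 0
def train(features):
    model = collections.defaultdict(lambda: 1)  # missing keys default to 1
    for f in features:
        model[f] += 1  # count occurrences
    return model

# Read the corpus and close the file when done
with open("bayes_train_text.txt") as corpus_file:
    words_N = train(words(corpus_file.read()))
print(words_N)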

This prints the mapping from each word in the corpus to its smoothed count.
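To make the shape of this mapping concrete, here is a minimal sketch that runs the same two functions on a short inline string. The sample sentence is an assumption for illustration and stands in for the corpus file.

```python
import collections
import re

def words(text):
    # Lowercase the text and keep runs of the letters a-z.
    return re.findall("[a-z]+", text.lower())

def train(features):
    # Unseen keys start at 1, so a word seen k times ends up with count k + 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

sample_text = "The cat and the hat."  # hypothetical stand-in for the corpus file
counts = train(words(sample_text))

print(dict(counts))      # {'the': 3, 'cat': 2, 'and': 2, 'hat': 2}
print("dog" in counts)   # False: a membership test does not create an entry
print(counts["dog"])     # 1: indexing an unseen word returns the default...
print("dog" in counts)   # True: ...and also stores it in the mapping
```

Because indexing a `defaultdict` inserts the missing key, dictionary lookups for filtering purposes are safer with `in` than with `[]`.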

Next, we generate every candidate within edit distance 1 and 2 of the misspelled word:

# English alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz"

# All strings at edit distance 1 from the word
def edits1(word):
    n = len(word)
    # Deletion: remove one letter
    s1 = [word[0:i] + word[i+1:] for i in range(n)]
    # Transposition: swap two adjacent letters
    s2 = [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)]
    # Substitution: replace one letter
    s3 = [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet]
    # Insertion: add one letter
    s4 = [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet]
    edits1_words = set(s1 + s2 + s3 + s4)
    return edits1_words

# All strings at edit distance 2 from the word
def edits2(word):
    edits2_words = set(e2 for e1 in edits1(word) for e2 in edits1(e1))
    return edits2_words
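As a quick check of what these candidate sets contain, the sketch below repeats the two functions in a self-contained form and tests the two misspellings used later in this post. The constant name `ALPHABET` is chosen here for illustration.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # Every string reachable by one deletion, transposition,
    # substitution, or insertion.
    n = len(word)
    deletes = [word[:i] + word[i+1:] for i in range(n)]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)]
    replaces = [word[:i] + c + word[i+1:] for i in range(n) for c in ALPHABET]
    inserts = [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # Apply edits1 twice to reach strings within two edits.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

print("and" in edits1("annd"))  # True: delete one "n"
print("the" in edits1("het"))   # False: no single edit turns "het" into "the"
print("the" in edits2("het"))   # True: e.g. "het" -> "hte" -> "the"
print(len(edits1("het")), len(edits2("het")))  # sizes of the two candidate sets
```

For a word of length n, the four lists hold 54n + 25 strings before deduplication, so the distance-2 set grows quickly and most of its members are not real words.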

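With the candidates in hand, we need a rule for picking the most likely correction. Since we have no corpus of historical error/correction pairs, we rank candidates purely by how often each word appears in the corpus: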

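# Keep only candidates that appear in the dictionary
def known(words):
    return set(w for w in words if w in words_N)

# Return the most frequent known candidate, or None if the word is already known
def correct(word):
    if word not in words_N:
        candidates = known(edits1(word)) | known(edits2(word))
        # Fall back to the input when no candidate is known (max raises on an empty set)
        return max(candidates, key=lambda w: words_N[w], default=word)
    else:
        return None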
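The selection step can be looked at in isolation. The following sketch ranks a hand-written candidate set against a small frequency table. The counts, the candidate set, and the helper names `rank` and `pick` are all hypothetical and serve only to show the filtering and ordering.

```python
# Hypothetical smoothed counts for a few dictionary words (illustrative values only).
word_counts = {"the": 500, "he": 120, "her": 80, "hat": 6, "get": 40}

def rank(candidates, counts):
    # Keep only known words, then sort from most to least frequent.
    known_words = [w for w in candidates if w in counts]
    return sorted(known_words, key=lambda w: counts[w], reverse=True)

def pick(word, candidates, counts):
    # Fall back to the input word when no candidate is in the dictionary.
    ranked = rank(candidates, counts)
    return ranked[0] if ranked else word

candidates = {"the", "he", "her", "hat", "hxt"}
print(rank(candidates, word_counts))                # ['the', 'he', 'her', 'hat']
print(pick("het", candidates, word_counts))         # the
print(pick("zzzz", {"zzz", "zzzzz"}, word_counts))  # zzzz
```

The last call covers the case where nothing in the candidate set is a known word. Python's built-in `max` raises `ValueError` on an empty collection unless a `default` is supplied, so this case needs explicit handling.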

Let's run a couple of simple experiments:

print(correct("het"))
print(correct("annd"))

# Output:
# the
# and
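The results above depend on the corpus file, which is not included here. As a self-contained sketch of the whole pipeline, the version below trains on a two-sentence inline corpus. The sample text, the constant names, and passing the model as a parameter are assumptions made for illustration.

```python
import collections
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical stand-in for the corpus file; any English text would do.
SAMPLE_CORPUS = """
The quick brown fox jumps over the lazy dog and the cat.
The dog and the fox ran, and then the cat sat on the mat.
"""

def words(text):
    return re.findall("[a-z]+", text.lower())

def train(features):
    # Counts start at 1 for smoothing, as in the code above.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

def edits1(word):
    n = len(word)
    deletes = [word[:i] + word[i+1:] for i in range(n)]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)]
    replaces = [word[:i] + c + word[i+1:] for i in range(n) for c in ALPHABET]
    inserts = [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

def correct(word, model):
    # Known words need no correction; None mirrors the behaviour shown earlier.
    if word in model:
        return None
    candidates = {w for w in edits1(word) | edits2(word) if w in model}
    # Most frequent known candidate, or the input itself if none is known.
    return max(candidates, key=lambda w: model[w], default=word)

model = train(words(SAMPLE_CORPUS))
print(correct("het", model))   # the
print(correct("annd", model))  # and
print(correct("the", model))   # None
```

With such a small corpus the dictionary has only a handful of words, so a realistic test needs a much larger text.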

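As these examples show, this simple corrector, which follows Bayes' rule with corpus word frequency as the prior, handles common spelling errors quite well.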
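The Bayesian reading of the procedure can be written out explicitly. As a sketch, let $w$ be the observed (possibly misspelled) word and $c$ a candidate correction; both symbols are introduced here for illustration:

```latex
\hat{c} = \arg\max_{c} P(c \mid w)
        = \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
        = \arg\max_{c} P(w \mid c)\, P(c)
```

The denominator $P(w)$ is the same for every candidate, so it drops out. The code in this post takes $P(c)$ to be proportional to the smoothed corpus count of $c$ and, lacking any data on real errors, treats $P(w \mid c)$ as equal for every known candidate within edit distance 2, so the choice reduces to picking the most frequent candidate.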

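Beyond this, one of the strongest text correction models at the time of writing is Soft-Masked BERT, presented at ACL 2020 by researchers from ByteDance AI Lab and Fudan University in the paper: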

“Spelling Error Correction with Soft-Masked BERT”