Word Spelling Correction

CCChenhao997

Category: NLP. Tags: spelling correction

Article link: https://blog.csdn.net/qq_35687547/article/details/100998586

Original post: http://chenhao.space/post/409250ae.html

Required datasets:

spell-errors.txt

testdata.txt

vocab.txt

Vocabulary

# Vocabulary
with open('vocab.txt') as f:
    vocab = {line.rstrip() for line in f}  # a set gives O(1) average-time membership tests

Generating all candidates

# Generate all candidates
def generate_candidates(word):
    '''
    word: the given (misspelled) input
    Returns all valid candidates, i.e. those found in the vocabulary
    '''
    # Generate words at edit distance 1
    # 1. insert  2. delete  3. replace
    # Assume an alphabet of 26 lowercase letters
    letters = 'abcdefghijklmnopqrstuvwxyz'

    splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
    # insert
    inserts = [L+c+R for L, R in splits for c in letters]
    # delete
    deletes = [L+R[1:] for L, R in splits if R]   # `if R` skips the empty right part
    # replace
    replaces = [L+c+R[1:] for L, R in splits if R for c in letters]  # replace the first character of R with c

    candidates = set(inserts+deletes+replaces)

    # Filter out words that are not in the vocabulary
    return [w for w in candidates if w in vocab]
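When distance-1 generation finds nothing in the vocabulary, one option is to widen the search to edit distance 2. The sketch below is only an illustration of that idea: `edits1`, `candidates_with_fallback`, and `toy_vocab` are hypothetical names, and the toy vocabulary stands in for vocab.txt.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return every string one insert, delete, or replace away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    return set(inserts + deletes + replaces)

def candidates_with_fallback(word, vocab):
    """Try edit distance 1 first; widen to distance 2 only if that finds nothing."""
    near = {w for w in edits1(word) if w in vocab}
    if near:
        return sorted(near)
    # Apply edits1 twice to reach every string within two edits.
    far = {w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in vocab}
    return sorted(far)

# Toy vocabulary (an assumption for illustration, not the contents of vocab.txt).
toy_vocab = {"business", "against", "sell"}

print(candidates_with_fallback("busines", toy_vocab))   # found at distance 1
print(candidates_with_fallback("bussines", toy_vocab))  # needs the distance-2 fallback
```

The distance-2 set grows roughly quadratically with the distance-1 set, so it is only computed when the cheaper search comes back empty.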

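Loading the corpus

from nltk.corpus import reuters   # needs the corpus data: run nltk.download('reuters') once

# Load the corpus
categories = reuters.categories()   # categories of the Reuters corpus
corpus = reuters.sents(categories=categories)   # sents() returns the sentences of the given categories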

Building the language model

# Build the language model: bigram
term_count = {}
bigram_count = {}
for doc in corpus:
    doc = ['<s>'] + doc   # '<s>' marks the start of a sentence
    for i in range(0, len(doc)-1):
        # bigram: [i, i+1]
        term = doc[i]          # term is the i-th word of doc
        bigram = doc[i:i+2]    # bigram is the pair of words at positions i and i+1

        if term in term_count:
            term_count[term] += 1   # term already seen: increment its count
        else:
            term_count[term] = 1    # first occurrence: start the count at 1

        bigram = ' '.join(bigram)   # bigrams are keyed by the string "w1 w2"
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1
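To see what these counts give you, here is a minimal sketch of a bigram probability with add-one (Laplace) smoothing on a toy corpus. The corpus, the names `unigram`, `bigram`, and `log_bigram_prob`, and the use of `collections.Counter` with tuple keys (instead of the "w1 w2" string keys above) are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy tokenized corpus; an assumption standing in for reuters.sents().
toy_corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]

unigram = Counter()
bigram = Counter()
for sent in toy_corpus:
    sent = ["<s>"] + sent
    unigram.update(sent[:-1])            # count every token that starts a bigram
    bigram.update(zip(sent, sent[1:]))   # count adjacent pairs as tuples

vocab_size = len(unigram)

def log_bigram_prob(prev, word):
    """log P(word | prev) with add-one smoothing; unseen pairs get a small nonzero mass."""
    return math.log((bigram[(prev, word)] + 1.0) / (unigram[prev] + vocab_size))

print(math.exp(log_bigram_prob("like", "tea")))     # seen twice after "like"
print(math.exp(log_bigram_prob("like", "juice")))   # never seen, still nonzero
```

A `Counter` returns 0 for missing keys, which removes the explicit if/else bookkeeping while producing the same counts.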

Probability of a user typo

# Estimate how likely each misspelling is: the channel probability
channel_prob = {}

with open('spell-errors.txt') as f:
    for line in f:
        items = line.split(":")  # split on ":" into the correct word and its misspellings
        correct = items[0].strip()
        mistakes = [item.strip() for item in items[1].strip().split(",")]
        channel_prob[correct] = {}
        for mis in mistakes:
            # Uniform distribution over the listed misspellings of this word
            channel_prob[correct][mis] = 1.0/len(mistakes)
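The same construction can be tried without the data file. The sketch below wraps it in a hypothetical helper, `build_channel_prob`, and feeds it made-up sample lines in the "correct: mistake1, mistake2" layout; the skipping of malformed lines is an added assumption, not part of the code above.

```python
def build_channel_prob(lines):
    """Map each correct word to {misspelling: probability}, uniform over its misspellings."""
    channel = {}
    for line in lines:
        if ":" not in line:
            continue                      # skip blank or malformed lines
        correct, _, rest = line.partition(":")
        mistakes = [m.strip() for m in rest.split(",") if m.strip()]
        if not mistakes:
            continue                      # nothing listed after the colon
        channel[correct.strip()] = {m: 1.0 / len(mistakes) for m in mistakes}
    return channel

# Hypothetical sample lines, not taken from spell-errors.txt.
sample = ["raining: rainning, raning", "writings: writtings", ""]
channel = build_channel_prob(sample)

print(channel["raining"])     # two misspellings share the probability mass equally
print(channel["writings"])    # a single misspelling gets all of it
```

Because the probabilities are uniform per correct word, a word with many recorded misspellings assigns a lower probability to each one.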

Correction

import numpy as np

V = len(term_count)   # vocabulary size, used for smoothing

with open("testdata.txt", 'r') as file:
    for line in file:
        items = line.rstrip().split('\t')    # the sentence is stored in items[2]
        words = ['<s>'] + items[2].split()   # all words of the sentence; '<s>' matches the language model
        for idx in range(1, len(words)):
            word = words[idx]
            if word not in vocab:   # a word missing from the vocabulary is treated as misspelled
                # Replace word with the correct word
                # Step 1: generate all valid candidates
                candidates = generate_candidates(word)

                # If there are no candidates:
                # one option is to generate more, e.g. edit distance no greater than 2
                # TODO: generate more candidates when needed
                if len(candidates) < 1:
                    continue   # not recommended

                probs = []
                prev_word = words[idx-1]   # the word before the misspelling

                # Compute a score for every candidate:
                # score = p(correct) * p(mistake|correct)
                # which we compare in log space: log p(correct) + log p(mistake|correct)
                # and return the candidate with the highest score
                for candi in candidates:
                    prob = 0
                    # a. channel probability
                    if candi in channel_prob and word in channel_prob[candi]:
                        prob += np.log(channel_prob[candi][word])
                    else:   # word not in channel_prob[candi]
                        prob += np.log(0.0001)

                    # b. language model probability (add-one smoothing)
                    bigram = prev_word + ' ' + candi   # same "w1 w2" key as in bigram_count
                    if bigram in bigram_count:
                        prob += np.log((bigram_count[bigram] + 1.0) / (term_count[prev_word] + V))
                    else:
                        prob += np.log(1.0 / V)
                    # TODO: also consider [word, post_word]; only [pre_word, word] is used above
                    # prob += np.log(bigram probability)

                    probs.append(prob)
                max_idx = probs.index(max(probs))
                print(word, candidates[max_idx])
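The TODO about also using the following word could be approached roughly as below. This is a sketch under assumptions: `context_log_prob` is a hypothetical helper, the toy counts are invented, and add-one smoothing is applied to both bigrams, which is one reasonable choice rather than the only one.

```python
import math

def context_log_prob(prev_word, candidate, next_word, bigram_count, term_count, vocab_size):
    """Add-one-smoothed log P(candidate | prev_word) + log P(next_word | candidate)."""
    def log_p(first, second):
        pair = bigram_count.get(first + " " + second, 0)   # "w1 w2" string keys
        return math.log((pair + 1.0) / (term_count.get(first, 0) + vocab_size))

    score = log_p(prev_word, candidate)
    if next_word is not None:            # the last word of a sentence has no right neighbour
        score += log_p(candidate, next_word)
    return score

# Hypothetical counts, keyed as "w1 w2" strings.
toy_terms = {"<s>": 2, "i": 2, "like": 2, "lake": 1}
toy_bigrams = {"<s> i": 2, "i like": 2, "like tea": 2, "lake water": 1}
V_toy = len(toy_terms)

like = context_log_prob("i", "like", "tea", toy_bigrams, toy_terms, V_toy)
lake = context_log_prob("i", "lake", "tea", toy_bigrams, toy_terms, V_toy)
print(like > lake)   # "like" fits between "i" and "tea" better than "lake"
```

The returned value would be added to the channel log probability in place of the single left-context term.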

Results

protectionst protectionist
products. products
long-run, long-run
gain. gain
17, 17
retaiation retaliation
cost. coste
busines, business
ltMC.T. ltMC.T
U.S., U.S.
Murtha, Murtha
worried. worried
seriousnyss seriousness
aganst against
us, us
named. named
year, year
sewll sell
dlrs, dlrs
world's worlds
largest. largest
markets, markets
importsi imports
Products, Products
Retaliation, Retaliation
Group. Group
Korea's Koreans
Korea, Korea
Japan. Japan
Koreva Korea
U.S., U.S.
1985. 1985
Malaysia, Malaysian
......

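Appendix: basic corpus functions

Example	Description
fileids()	the files of the corpus
fileids([categories])	the files of the corpus corresponding to these categories
categories()	the categories of the corpus
categories([fileids])	the categories of the corpus corresponding to these files
raw()	the raw content of the corpus
raw(fileids=[f1,f2,f3])	the raw content of the specified files
raw(categories=[c1,c2])	the raw content of the specified categories
words()	the words of the whole corpus
words(fileids=[f1,f2,f3])	the words of the specified files
words(categories=[c1,c2])	the words of the specified categories
sents()	the sentences of the whole corpus
sents(fileids=[f1,f2,f3])	the sentences of the specified files
sents(categories=[c1,c2])	the sentences of the specified categories
abspath(fileid)	the location of the given file on disk
encoding(fileid)	the encoding of the file (if known)
open(fileid)	open a stream for reading the given corpus file
root	the path to the root of the locally installed corpus (a property, not a method call)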
