NLP for Beginners: Text Preprocessing

GlassySky0816

Article link: https://blog.csdn.net/qq_38784098/article/details/104577949

I. Spell correction

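1. A misspelled word usually looks similar to the intended word, which is why the slip happens. Whether safari is actually easy to mistype as saferi would have to be backed by statistics; to simplify the problem, we assume that the closer two spellings are, the more likely the error, and we measure closeness with edit distance. Here "close" means an edit distance of at most 2; saferi and safari are at edit distance 1. The definition of edit distance is covered in more detail below.

2. Many correct words may be close to the typo. Semantics aside, the most likely one is the word that is used most often, so the second criterion is word frequency.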
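To make the first criterion concrete, here is a minimal sketch of one common way to compute edit distance (the optimal string alignment variant, which counts insertions, deletions, replacements, and swaps of adjacent letters). The function name edit_distance is hypothetical and is not part of the article's code.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Allowed single edits: insert, delete, replace, swap two adjacent letters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a and the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # adjacent transposition, e.g. "ab" <-> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(edit_distance("saferi", "safari"))  # 1: replace 'e' with 'a'
```

A candidate would then count as "close" when edit_distance(typo, candidate) <= 2.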

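The following is a machine learning approach to spell checking based on Bayes' theorem. Its main idea is the two points above: enumerate all possible correct spellings, then use edit distance and word frequency to choose the most probable one as the correction.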

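Principle:

Let w be the misspelled word the user typed and c the correctly spelled word the user meant to type. Then:

P(c | w): the probability that the user meant c, given that w was typed.

P(w | c): the probability that the user types w when meaning c; it depends on the edit distance.

P(c): the probability that the correct word is c, which can be taken to be the usage frequency of c; it has to be estimated from data.

By Bayes' theorem:

P(c | w) = P(w | c) * P(c) / P(w)

Within a single correction w is fixed, so P(w) is a constant and can be ignored. Comparing P(c | w) across candidates is the same as comparing P(w | c) * P(c).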
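As a small illustration of this comparison, the sketch below ranks three candidates for the typo "saferi". All probabilities are hypothetical numbers picked for the example, not measured data, and the helper name best_correction is an assumption.

```python
# Hypothetical values for the typo w = "saferi" (illustration only).
prior = {"safari": 0.0004, "safer": 0.0010, "safe": 0.0030}   # P(c)
likelihood = {"safari": 0.05, "safer": 0.01, "safe": 0.001}    # P(w | c)

def best_correction(prior, likelihood):
    """Return the candidate c that maximizes P(w | c) * P(c)."""
    return max(prior, key=lambda c: likelihood[c] * prior[c])

# P(w) would divide every score by the same number, so it is left out.
best = best_correction(prior, likelihood)
print(best)  # safari
```

Here "safe" is the most frequent word, but its low P(w | c) outweighs that advantage, so "safari" wins.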

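1) P(c)

P(c) is replaced by the "usage frequency": we count how often each word occurs in a sufficiently large text corpus (which serves as the dictionary). The frequencies can also be normalized to make them easier to compare.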
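The counting step can be sketched as follows. A one-line inline text stands in for the large corpus (an assumption for illustration), and the helper name frequency is hypothetical.

```python
import re
from collections import Counter

# A tiny inline text stands in for a large corpus (illustration only).
text = "The cat sat on the mat. The dog sat too."

tokens = re.findall(r"\w+", text.lower())  # lowercase word tokens
counts = Counter(tokens)                   # raw occurrence counts
total = sum(counts.values())               # total number of tokens

def frequency(word):
    """Relative frequency of `word`; 0.0 for words that never occur."""
    return counts[word] / total

print(counts["the"], total)  # 3 10
```

Dividing by the total is the normalization mentioned above; it does not change which word ranks highest.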

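2) P(w | c)

P(w | c) is approximated by lambda * editDist, where lambda is a constant. The code below simplifies this further: any known word at a smaller edit distance is preferred over every word at a larger one.

Only editDist = 1 and editDist = 2 are considered.

editDist1: the strings at edit distance 1 are built from the following steps:

(1) splits: split word into a front half and a back half at every position. This is a helper step used by the other operations rather than an edit in itself. For example, 'abc' is split into [('', 'abc'), ('a', 'bc'), ('ab', 'c'), ('abc', '')].

(2) deletes: all strings formed by deleting one letter of word. For example, the deletes of 'abc' are ['bc', 'ac', 'ab'].

(3) transposes: all strings formed by swapping two adjacent letters of word. For example, the transposes of 'abc' are ['bac', 'acb'].

(4) replaces: all strings formed by replacing one letter of word with each of the 26 letters in turn, so the original word itself also appears in the list. For example, the replaces of 'abc' are ['abc', 'bbc', 'cbc', ..., 'abx', 'aby', 'abz'], 78 strings in total (26 × 3).

(5) inserts: all strings formed by inserting one letter at any position of word, including before the first letter and after the last. For example, the inserts of 'abc' are ['aabc', 'babc', 'cabc', ..., 'abcx', 'abcy', 'abcz'], 104 strings in total (26 × 4).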
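The counts quoted in the list can be checked directly. The sketch below applies the same list comprehensions that the article's edits1 function uses to the word 'abc' and prints how many strings each step produces.

```python
letters = "abcdefghijklmnopqrstuvwxyz"
word = "abc"

# Helper: every way to cut the word into a left part L and a right part R.
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
# Drop the first letter of R.
deletes = [L + R[1:] for L, R in splits if R]
# Swap the first two letters of R.
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
# Replace the first letter of R with each of the 26 letters.
replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
# Insert each of the 26 letters between L and R.
inserts = [L + c + R for L, R in splits for c in letters]

print(len(splits), len(deletes), len(transposes), len(replaces), len(inserts))
# 4 3 2 78 104
```

Because the lists overlap (for instance 'abc' appears three times in replaces), the set of distinct strings is smaller than the sum of these counts.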

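editDist2 applies the same edit operations again to every string in the editDist1 set, which yields all strings at edit distance 2, whether or not they are real words. Note that in the code below a string missing from the dictionary has a count of 0 in the Counter and is filtered out by known(), rather than being assigned a smoothed count of 1.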

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
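Since big.txt may not be at hand, here is a self-contained sketch that restates the functions above against a toy inline corpus so the behavior can be tried directly. The corpus sentence is an assumption made only for this example; with a real corpus the results can differ.

```python
import re
from collections import Counter

# Toy corpus standing in for big.txt (an assumption made only for this example).
CORPUS = "we went on a safari . the safari was safe . safer roads are safe roads ."
WORDS = Counter(re.findall(r"\w+", CORPUS.lower()))
N = sum(WORDS.values())

def edits1(word):
    """All strings one edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """The subset of `words` that occur in the toy corpus."""
    return {w for w in words if w in WORDS}

def candidates(word):
    """Prefer the word itself, then distance-1 words, then distance-2 words."""
    edits2 = (e2 for e1 in edits1(word) for e2 in edits1(e1))
    return known([word]) or known(edits1(word)) or known(edits2) or [word]

def correction(word):
    """Most frequent candidate according to the toy corpus."""
    return max(candidates(word), key=lambda w: WORDS[w] / N)

print(correction("saferi"))  # safari
print(correction("raods"))   # roads
print(correction("safe"))    # safe (already a known word)
```

For "saferi" both "safari" (replace) and "safer" (delete) are one edit away; "safari" wins because it occurs twice in the toy corpus and "safer" only once.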

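The corpus file big.txt read by the code above is available at:

http://norvig.com/big.txt

There is also a ready-made Python library:

pip install pyenchant

For usage details, see:

http://pythonhosted.org/pyenchant/tutorial.html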

II. Filtering words (stop words)

import jieba

# Segment the text, then filter out stop words
def stripdata(Test):
    # cut_all=True selects jieba's full mode, which emits every word it can find;
    # the HMM for unseen words only applies to the default accurate mode
    seg_list = jieba.cut(Test, cut_all=True)

    # Join the tokens with "/" and remove stop words
    line = "/".join(seg_list)
    word = stripword(line)
    # Print the remaining keywords
    print("\nKeywords:\n" + word)

# Stop-word filtering
def stripword(seg):
    print("Removing stop words:\n")
    wordlist = []

    # Load the stop-word list, one word per line
    with open('stopword.txt', 'r', encoding='utf-8') as stop:
        stopword = set(stop.read().split("\n"))

    # key_word.txt receives the surviving keywords
    with open('key_word.txt', 'w', encoding='utf-8') as keyword:
        for key in seg.split('/'):
            key = key.strip()
            # Drop stop words, single characters, and duplicates
            if key not in stopword and len(key) > 1 and key not in wordlist:
                wordlist.append(key)
                print(key)
                keyword.write(key + "\n")

    return '/'.join(wordlist)

def creat():
    # Read the raw text and run segmentation plus stop-word filtering
    with open('raw.txt', 'r', encoding='utf-8') as Rawdata:
        text = Rawdata.read()
    stripdata(text)
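The filtering rule itself does not depend on jieba or on files. The sketch below isolates it as a pure function so it can be tried without stopword.txt or raw.txt; the function name remove_stopwords, the token list, and the stop-word set are all hypothetical stand-ins.

```python
def remove_stopwords(tokens, stopwords):
    """Keep tokens that are not stop words, are longer than one character,
    and have not been seen before (original order is preserved)."""
    seen = set()
    kept = []
    for token in tokens:
        token = token.strip()
        if token in stopwords or len(token) <= 1 or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return kept

# Hypothetical pre-segmented tokens and stop-word set (illustration only).
tokens = ["我们", "的", "自然", "语言", "处理", "的", "自然", "是", "课程"]
stopwords = {"的", "是", "我们"}
print(remove_stopwords(tokens, stopwords))  # ['自然', '语言', '处理', '课程']
```

Tracking seen tokens in a set keeps the duplicate check fast on long texts, while the list preserves the order in which keywords first appear.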
