NLP Text Noise Preprocessing (with Spell Checking)

Damien_J

Last modified 2023-10-26 15:53:35

Categories: Python, AI. Tags: natural language processing, artificial intelligence

First published 2023-08-02 17:16:39

Article link: https://blog.csdn.net/Damien_J_Scott/article/details/132066741

I recently revised my text preprocessing method, so I am recording it here.

First, install the required dependencies:

pip install pyenchant

pip install nltk

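pyenchant checks whether a word is spelled correctly, so that misspelled words in your text can be dropped. nltk handles tokenization. The tokenizer and stop-word data need to be downloaded as well: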

python -m nltk.downloader punkt
python -m nltk.downloader stopwords

from nltk.corpus import stopwords
import nltk
import enchant
import re

def is_spelled_correctly(word, language='en_US'):
    # Note: a new dictionary object is created on every call
    spell_checker = enchant.Dict(language)
    return spell_checker.check(word)

def preprocess_text(text):
    # Drop hyphens, turn underscores into spaces, strip digits, then collapse non-word characters into spaces
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # Split camelCase tokens, then keep correctly spelled parts that are not stop words
    words = [item for word in words for item in re.findall(r'[A-Z]+[a-z]*|[a-z]+', word)
             if is_spelled_correctly(item) and item.lower() not in stop_words]
    return ' '.join(words).lower()

if __name__ == '__main__':
    print(preprocess_text('ServiceHandlerId caedbe-85432-xssc-dsdabffdddbea An exception of some microservice TargetDownService occurred and was test #@/*-sss '))
# service handler id exception target service occurred test

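Lowercasing happens last so that a run of joined words such as ServiceHandlerId is not thrown out by the spell check. Only while the camelCase is still intact can re.findall(r'[A-Z]+[a-z]*|[a-z]+', word) split it into separate words, so case is handled at the very end.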
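To make the ordering argument concrete, here is a minimal sketch that runs the same regular expression on a token before and after lowercasing. The helper name split_camel is only for illustration and is not part of the original code.

```python
import re

# Same pattern as in preprocess_text: an uppercase run followed by
# lowercase letters, or a purely lowercase run.
CAMEL_PARTS = re.compile(r'[A-Z]+[a-z]*|[a-z]+')

def split_camel(token):
    """Return the camelCase parts of a token, keeping their original case."""
    return CAMEL_PARTS.findall(token)

print(split_camel('ServiceHandlerId'))          # ['Service', 'Handler', 'Id']
print(split_camel('targetDownService'))         # ['target', 'Down', 'Service']
# Lowercasing first destroys the word boundaries:
print(split_camel('ServiceHandlerId'.lower()))  # ['servicehandlerid']
# Limitation: an acronym glued to the next word is not split.
print(split_camel('HTTPServer'))                # ['HTTPServer']
```

The last case shows that the pattern treats a run of capitals plus the following lowercase letters as one part, so tokens like HTTPServer would reach the spell checker unsplit.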

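Improvement 1:

During later testing I found that this gets very slow once the data volume grows. After the following optimization it became much faster: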

from nltk.corpus import stopwords
import nltk
import enchant
import re

# Create the dictionary once at import time instead of once per word
spell_checker = enchant.Dict('en_US')

def memoize(func):
    cache = {}
    def wrapper(*args):
        # Only call the wrapped function for arguments not seen before
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def check_spelling(word):
    return spell_checker.check(word)


def preprocess_text(text):
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    words = [item for word in words for item in re.findall(r'[A-Z]+[a-z]*|[a-z]+', word)
             if check_spelling(item) and item.lower() not in stop_words]
    return ' '.join(words).lower()

if __name__ == '__main__':
    print(preprocess_text('ServiceHandlerId caedbe-85432-xssc-dsdabffdddbea An exception of some microservice TargetDownService occurred and was test #@/*-sss '))
# service handler id exception target service occurred test

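This uses memoization, an optimization technique that stores function calls and their results in a dictionary. Here it caches the spell-check result for each word, so a word that appears many times is only checked once.

With this in place, the speed stays acceptable even when the data volume is large.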
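The standard library offers the same caching behaviour through functools.lru_cache, which could replace the hand-written decorator. The sketch below uses a hypothetical slow_check function as a stand-in for spell_checker.check so that it runs without pyenchant installed; the call counter is only there to show how often the real check is executed.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying check really runs

def slow_check(word):
    """Hypothetical stand-in for spell_checker.check(word)."""
    global calls
    calls += 1
    return word.lower() in {'service', 'handler', 'exception'}

@lru_cache(maxsize=None)  # unbounded cache, like the dict-based memoize
def check_spelling(word):
    return slow_check(word)

tokens = ['service', 'handler', 'service', 'xssc', 'service']
results = [check_spelling(t) for t in tokens]

print(results)                           # [True, True, True, False, True]
print(calls)                             # 3 (one real check per distinct word)
print(check_spelling.cache_info().hits)  # 2
```

For a very large vocabulary, a bounded maxsize would cap memory use at the cost of some repeated checks.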

Improvement 2:

Use spellchecker (the pyspellchecker package), which is much faster than enchant:

pip install pyspellchecker

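from nltk.corpus import stopwords
from spellchecker import SpellChecker
import nltk
import re

spell = SpellChecker()

def preprocess_text(text):
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # spell.known() returns the recognized words as a lowercased set, so part order within one token is not guaranteed
    words = [item for word in words for item in spell.known(re.findall(r'[A-Z]+[a-z]*|[a-z]+', word)) if item.lower() not in stop_words]
    return ' '.join(words).lower()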

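Differences:

SpellChecker is a spell-checking library based on edit distance. It loads a dictionary into memory and can quickly check a given list of words. enchant is a spell-checking library written in C that can use different backends, such as aspell, hunspell, and ispell, to check whether a word exists in a dictionary. In my experience SpellChecker is faster than enchant, especially when the word list is large.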
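The speed difference depends on the data and the installed backend, so it is worth measuring on your own corpus. The following is a rough sketch of a timing harness, not a rigorous benchmark. The function time_checker and the two toy lookup functions are hypothetical; to compare the real libraries, pass in something like enchant.Dict('en_US').check and lambda w: bool(spell.known([w])) instead.

```python
import time

def time_checker(check, words, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` passes."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        for w in words:
            check(w)
        best = min(best, time.perf_counter() - start)
    return best

# Toy stand-ins so the sketch runs without either library installed.
vocabulary = {'service', 'handler', 'exception'}
vocabulary_list = sorted(vocabulary)

def set_lookup(word):
    return word in vocabulary        # hash lookup

def list_lookup(word):
    return word in vocabulary_list   # linear scan

sample = ['service', 'xssc', 'handler', 'dsdab'] * 1000
timings = {
    'set_lookup': time_checker(set_lookup, sample),
    'list_lookup': time_checker(list_lookup, sample),
}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.6f} s')
```

Taking the best of several passes reduces the influence of one-off delays such as dictionary loading or other processes on the machine.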

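Further improvement

After using this for a while, I found that some strings are not correctly spelled words but are still useful for my downstream classification, so I added a list of exceptions: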

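from nltk.corpus import stopwords
from spellchecker import SpellChecker
import nltk
import re

spell = SpellChecker()
# Terms to keep even though the spell checker rejects them
exceptions_words = ['jwt', 'json']  # ...... (add your own terms)

def preprocess_text(text):
    text = re.sub(
        r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', ' ').replace('_', ' ')))
    split_text = re.findall(r'[A-Z]+[a-z]*|[a-z]+', text)
    new_sentence = ' '.join(split_text)
    words = nltk.word_tokenize(new_sentence)
    stop_words = set(stopwords.words('english'))
    # known() returns lowercased matches, so test for a non-empty result
    # instead of membership of the original-cased word
    words = [word for word in words
             if (spell.known([word]) and word.lower() not in stop_words)
             or word.lower() in exceptions_words]
    return ' '.join(words).lower()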
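The filtering rule itself (keep a word if it is whitelisted, or if it is known and not a stop word) can be tried in isolation. The sketch below is an illustration under assumptions: filter_words, the toy dictionary, and the toy stop-word set are hypothetical stand-ins for the spell checker and the NLTK stop words, so it runs without either library.

```python
def filter_words(words, is_known, stop_words, exceptions):
    """Keep whitelisted words, plus known words that are not stop words."""
    exceptions = frozenset(e.lower() for e in exceptions)  # O(1) lookups
    kept = []
    for word in words:
        lower = word.lower()
        if lower in exceptions or (is_known(word) and lower not in stop_words):
            kept.append(lower)
    return kept

# Hypothetical stand-ins for SpellChecker and the NLTK stop-word list.
toy_dictionary = {'token', 'expired', 'the', 'service'}
toy_stop_words = {'the'}

def is_known(word):
    return word.lower() in toy_dictionary

tokens = ['The', 'JWT', 'token', 'expired', 'xssc', 'Service']
print(filter_words(tokens, is_known, toy_stop_words, ['jwt', 'json']))
# ['jwt', 'token', 'expired', 'service']
```

Here JWT survives only because of the whitelist, xssc is dropped as unknown, and The is dropped as a stop word.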
