Common NLP Processing Techniques and the Corresponding spaCy Usage

Author: Ding_xiaofei

Article link: https://blog.csdn.net/ding_xiaofei/article/details/80234373


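Introduction

This post focuses on processing English text; NLP processing for Chinese may be covered in a later post. It covers the following topics:

Rule-based preprocessing
General preprocessing
Common usage of the spaCy library
pointer-generator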

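About preprocessing

Preprocessing is the foundation of many NLP tasks, and good preprocessing has a significant effect on downstream results. Start with tokenization. Chinese word segmentation is a notoriously hard problem. English is much simpler, but it still has pitfalls: contractions such as what's and can't are hard for a basic tokenizer to split correctly, and when a contraction carries negation, a bad split can seriously distort how the whole sentence is understood later on. The approach here is therefore to use ordinary regular-expression substitutions to replace the contractions that are hard to tokenize. The code in the rule-based preprocessing section below shows this kind of preprocessing and will hopefully give you some ideas.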
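
To make the problem concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper `naive_tokenize` and the sample sentence are hypothetical and do not come from the original post; the helper stands in for "a basic tokenizer" and is not how any particular library tokenizes.

```python
import re


def naive_tokenize(sentence):
    # Hypothetical helper: keep runs of word characters, drop everything else.
    return re.findall(r"\w+", sentence)


sentence = "I can't open what's attached"

# The apostrophe splits each contraction, so the negation in "can't"
# is reduced to a stray "t" token.
tokens = naive_tokenize(sentence)
print(tokens)

# Expanding the contractions first keeps "not" as its own token.
expanded = sentence.replace("can't", "can not").replace("what's", "what is")
expanded_tokens = naive_tokenize(expanded)
print(expanded_tokens)
```

With the contractions expanded before tokenization, the negation survives as the separate word "not", which is the effect the substitution rules aim for.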

Rule-based preprocessing

Replacing some contractions and symbols

import re


def clean_text(text):
        """
        Clean text
        :param text: the string of text
        :return: text string after cleaning
        """
        # unit
        text = re.sub(r"(\d+)kgs ", lambda m: m.group(1) + ' kg ', text)        # e.g. 4kgs => 4 kg
        text = re.sub(r"(\d+)kg ", lambda m: m.group(1) + ' kg ', text)         # e.g. 4kg => 4 kg
        text = re.sub(r"(\d+)k ", lambda m: m.group(1) + '000 ', text)          # e.g. 4k => 4000
        text = re.sub(r"\$(\d+)", lambda m: m.group(1) + ' dollar ', text)
        text = re.sub(r"(\d+)\$", lambda m: m.group(1) + ' dollar ', text)

        # contractions and abbreviations
        text = re.sub(r"can\'t", "can not", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"what\'s", "what is", text)
        text = re.sub(r"What\'s", "what is", text)
        text = re.sub(r"\'ve ", " have ", text)
        text = re.sub(r"n\'t", " not ", text)
        text = re.sub(r"i\'m", "i am ", text)
        text = re.sub(r"I\'m", "i am ", text)
        text = re.sub(r"\'re", " are ", text)
        text = re.sub(r"\'d", " would ", text)
        text = re.sub(r"\'ll", " will ", text)
        text = re.sub(r"c\+\+", "cplusplus", text)
        text = re.sub(r"c \+\+", "cplusplus", text)
        text = re.sub(r"c \+ \+", "cplusplus", text)
        text = re.sub(r"c#", "csharp", text)
        text = re.sub(r"f#", "fsharp", text)
        text = re.sub(r"g#", "gsharp", text)
        text = re.sub(r" e mail ", " email ", text)
        text = re.sub(r" e \- mail ", " email ", text)
        text = re.sub(r" e\-mail ", " email ", text)
        text = re.sub(r",000", '000', text)
        text = re.sub(r"\'s", " ", text)

        # spelling correction
        text = re.sub(r"ph\.d", "phd", text)
        text = re.sub(r"PhD", "phd", text)
        text = re.sub(r"pokemons", "pokemon", text)
        text = re.sub(r"pokémon", "pokemon", text)
        text = re.sub(r"pokemon go ", "pokemon-go ", text)
        text = re.sub(r" e g ", " eg ", text)
        text = re.sub(r" b g ", " bg ", text)
        text = re.sub(r" 9 11 ", " 911 ", text)
        text = re.sub(r" j k ", " jk ", text)
        # (the remaining substitution rules are cut off in the source)
        return text
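
The same idea can also be organized as an ordered rule table, which makes the dependency between rules easier to see. The sketch below is an illustrative variant, not the author's implementation: the names `RULES`, `COMPILED` and `apply_rules`, the small subset of rules, and the use of `\b` word boundaries with `re.IGNORECASE` are all assumptions made for this example.

```python
import re

# Hypothetical subset of rules. Order matters: "4k" must be expanded before
# the "$" rule runs, and specific contractions must come before the generic
# "n't" rule.
RULES = [
    (r"(\d+)k\b", r"\g<1>000"),      # 4k => 4000
    (r"\$(\d+)", r"\1 dollar "),     # $4000 => 4000 dollar
    (r"\bcan't\b", "can not"),
    (r"\bwhat's\b", "what is"),
    (r"\bi'm\b", "i am"),
    (r"n't\b", " not"),
    (r"'ve\b", " have"),
    (r"'re\b", " are"),
    (r"'ll\b", " will"),
]

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = [(re.compile(p, re.IGNORECASE), r) for p, r in RULES]


def apply_rules(text):
    # Apply every rule in order, then collapse the extra spaces that the
    # replacements introduce.
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s+", " ", text).strip()


print(apply_rules("What's this? I can't pay $4k and they don't care"))
```

Keeping the rules in a list avoids repeating `re.sub` on every line, and the word boundaries reduce the chance that a pattern fires inside an unrelated word. The trade-off is that the behavior differs slightly from the inline version above, so the two should not be treated as interchangeable.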
