Deep Learning in Practice (5): NLP Data Preprocessing

icebird_craft

Article link: https://blog.csdn.net/icestorm_rain/article/details/108591250

NLP Data Preprocessing

Preface
Common Data Preprocessing Techniques

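Preface

How do you become an excellent NLP engineer? It's not all about training! Many people find that their model performs well on the training set but poorly on the test set, and some cannot even fit the training set. When starting a project, the first thing a good NLP engineer does is not train a model but inspect the data: what it contains, which features the model is likely to need most, and whether some unimportant information can be removed. Data preprocessing is an essential skill for NLP engineers and for many other deep learning practitioners.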

Common Data Preprocessing Techniques

Tokenisation

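Tokenisation is one of the most commonly used techniques in NLP. Current models cannot obtain an embedding directly from a raw sentence: whether it is the earlier word2vec or the currently popular BERT, a tokenizer is required. Here is the simplest tokenisation code: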

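# Download the Punkt tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead)
import nltk
nltk.download('punkt')
from nltk import word_tokenize

sentence = " I love python! "
# Tokenise, then re-join the tokens with single spaces
sentence = ' '.join(word_tokenize(sentence))
print(sentence)  # I love python !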
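
To make the link between tokens and embeddings more concrete, here is a minimal sketch of what happens after tokenisation: each token is mapped to an integer id, which is what an embedding layer actually looks up. The `simple_tokenize` and `build_vocab` helpers are hypothetical names for this illustration; the regex is a toy rule, not the algorithm NLTK uses.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy rule)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = simple_tokenize(" I love python! ")
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]
print(tokens)  # ['I', 'love', 'python', '!']
print(ids)     # [0, 1, 2, 3]
```

In a real project the vocabulary would be built from the whole training corpus, usually with an extra id reserved for unknown words.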

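Lowercasing and True-casing

Letter case also matters. A case-sensitive vocabulary can introduce unnecessary noise, so when case carries no information your model needs, you can convert all words to lowercase.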

text = "I love python"
text = text.lower()
print(text)  # i love python
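
The noise argument can be checked on a tiny example. The sketch below counts distinct tokens with and without lowercasing; the three-sentence `corpus` and the `vocabulary` helper are made up for this illustration.

```python
corpus = ["I love Python", "i LOVE python", "Python is fun"]

def vocabulary(sentences, lowercase):
    """Collect the set of distinct whitespace-separated tokens."""
    vocab = set()
    for sentence in sentences:
        if lowercase:
            sentence = sentence.lower()
        vocab.update(sentence.split())
    return vocab

cased = vocabulary(corpus, lowercase=False)
uncased = vocabulary(corpus, lowercase=True)
print(len(cased), len(uncased))  # 8 5
```

Without lowercasing, `Python` and `python` are separate vocabulary entries, each seen fewer times during training.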

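If case does matter for your text but the text is not correctly cased, for example in country names, place names, and personal names such as China or David, you can restore the casing with the truecase package: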

# pip install truecase
import truecase

text = "I love python"
# get_true_case returns the text with its predicted casing restored
print(truecase.get_true_case(text))
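
To give a rough idea of how casing can be restored statistically, here is a toy sketch that learns the most frequent surface form of each word from a few correctly cased sentences. It is not the algorithm used by the truecase package; the `train_casing` and `toy_truecase` helpers and the reference sentences are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def train_casing(sentences):
    """Count how often each surface form appears for every lowercased word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the first token: sentence-initial capitals say little about true case
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return counts

def toy_truecase(text, counts):
    """Replace each token by its most frequent cased form; unknown tokens stay as-is."""
    restored = []
    for token in text.split():
        forms = counts.get(token.lower())
        restored.append(forms.most_common(1)[0][0] if forms else token)
    return " ".join(restored)

reference = [
    "We visited China last year",
    "My friend David lives in China",
    "Then David moved to Paris",
]
counts = train_casing(reference)
print(toy_truecase("david flew to china", counts))  # David flew to China
```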

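Stopword Removal

Stopwords are words that occur very frequently in almost any text. Because they are so common, they usually carry little useful information and are typically removed.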

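import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords_list = set(stopwords.words('english'))

text = " What? You don't love python?"
tokens = text.split()
# Build a new list instead of removing items while iterating,
# and compare in lowercase because the NLTK stopword list is lowercase
tokens = [word for word in tokens if word.lower() not in stopwords_list]
print(tokens)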
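
One detail worth noting is that splitting on whitespace leaves punctuation attached, so a token like `What?` never matches the stopword `what`. The following sketch strips surrounding punctuation before the lookup. It uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption for this example) so that it runs without downloading NLTK data.

```python
import string

# Tiny hand-picked stopword set for illustration only (not NLTK's list)
TOY_STOPWORDS = {"what", "you", "do", "don't", "the", "a", "is"}

def remove_stopwords(text, stopwords):
    """Drop tokens whose lowercased, punctuation-stripped form is a stopword."""
    kept = []
    for token in text.split():
        normalised = token.strip(string.punctuation).lower()
        if normalised and normalised not in stopwords:
            kept.append(token)
    return kept

print(remove_stopwords(" What? You don't love python?", TOY_STOPWORDS))
# ['love', 'python?']
```

Running a tokenizer such as `word_tokenize` first is another way to separate punctuation from words before filtering.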

Stemming and Lemmatisation

Stemming and lemmatisation are defined as follows:
Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are attached. Stemming a word may result in the word not actually looking like a word. For example, some stemming algorithms may stem trouble, troubling, troubled as troubl.

Lemmatisation
Lemmatisation attempts to reduce tokens to a word that belongs in the language. The basic form of the word is called a lemma, and is the canonical form of a set of words. For example, runs, running, ran are all forms of the word run.
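In other words, stemming extracts and keeps the stem of a word, and the stem is not necessarily a meaningful word, whereas lemmatisation reduces a word to its base dictionary form, for example running to run. If tense does not matter for your task, lemmatisation is worth trying; if the stem alone is enough, stemming also works. Current implementations of both are still imperfect: they rely on hand-written rules, dictionaries, or statistical methods, so some words will be handled incorrectly.
Stemming code:

# STEMMING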
import nltk
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemming_word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}".format("Word","Stemmed variant"))
print()

for word in stemming_word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))
'''
Output:
Word                Stemmed variant     

friend              friend              
friendship          friendship          
friends             friend              
friendships         friendship          
stabil              stabil              
destabilize         destabil            
misunderstanding    misunderstand       
railroad            railroad            
moonlight           moonlight           
football            footbal   
'''
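
To see why a stem may not look like a real word, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the `TOY_SUFFIXES` list and the `toy_stem` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical suffix list, checked in order; not the Porter rule set
TOY_SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["trouble", "troubling", "troubled", "troubles"]:
    print(word, "->", toy_stem(word))  # every form maps to troubl
```

All four forms collapse to `troubl`, which is useful for grouping them even though it is not an English word.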

Lemmatisation code:

# LEMMATISATION
import nltk
from nltk.stem import WordNetLemmatizer
import string
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

to_lemmatize_sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# lemmatisation requires punctuation removal
to_lemmatize_sentence = "".join([c for c in to_lemmatize_sentence if c not in string.punctuation])
to_lemmatize_sentence = to_lemmatize_sentence.split(" ")

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))
'''
Output:
Word                Lemma               

He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun    
'''
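
In the output above, verbs such as running and eating are left unchanged, and was becomes wa, because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed (for example `pos="v"` for verbs). The toy sketch below shows why the part of speech matters for a dictionary-based lemmatiser. The `TOY_LEMMAS` lexicon and the `toy_lemmatize` helper are hypothetical and stand in for WordNet only for illustration.

```python
# Hypothetical mini lexicon keyed by (word, part of speech); a real lemmatiser
# such as WordNet ships a much larger dictionary plus morphological rules.
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("has", "v"): "have",
    ("running", "v"): "run",
    ("eating", "v"): "eat",
    ("hours", "n"): "hour",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

for word, pos in [("was", "v"), ("running", "v"), ("running", "n"), ("hours", "n")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

With the real lemmatiser, a part-of-speech tagger is typically run first so that each word can be looked up with the right tag.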