2020.11 - Learning NLTK Text Preprocessing Tools

Reference:

Author: 藏经阁了知
Link: https://www.jianshu.com/p/32258d3b02f6
Source: Jianshu (简书)

Contents

0. NLTK installation & model download

1. Tokenization

2. Part-of-speech tagging

   2.1 Stemming

   2.2 Lemmatization (basic)

   2.3 POS tagging

3. Lemmatization (advanced)

4. Removing stop words, a custom character-removal list, and case conversion

5. Text analysis

   5.1 Word frequency counts + line plot

   5.2 Word cloud (may require extra dependencies on Windows; still to be tried)


0. NLTK installation & model download

Installation itself should just be pip install nltk. After installing, you may still see errors saying that some module or resource is missing when you run the code; in that case, download the missing data with the following (the error message names the resource to download):

import nltk
nltk.download('punkt') # replace 'punkt' with whatever resource the error message reports as missing
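
The later examples in this note rely on a few more resources, so they can be downloaded up front. A minimal sketch; the resource names below are the ones recent NLTK versions use and may differ in older releases:

import nltk

nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
nltk.download('wordnet')                     # lexical database used by WordNetLemmatizer
nltk.download('stopwords')                   # stop-word lists used by nltk.corpus.stopwords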

1. Tokenization

The example below uses NLTK to tokenize a simple English sentence. One remark: for a simple split like this, the result is close to what str.split(" ") would give, except that word_tokenize also separates punctuation (such as the trailing periods) into its own tokens. For Chinese text, words are not separated by spaces, so to get a character-level split you can first insert spaces with " ".join() and then call .split(" "). The open question left here is whether NLTK offers anything reliable for word-level Chinese segmentation; for Chinese, jieba is probably the most widely used segmenter (see the short jieba sketch after the output below).

from nltk import word_tokenize

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
tokenWords = word_tokenize(sentence) 
print(tokenWords)

#sentence = "有记者提问,意大利米兰国家癌症中心研究显示,新冠病毒自去年9月就在意大利传播。有观点认为新冠病毒在武汉暴发之前就已经在海外传播。中方是否认同这一观点?"
#sentence = " ".join(sentence) # 中文分不开,所以需要先join空格再进行split
#print(sentence.split(" "))

output:

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']

['有', '记', '者', '提', '问', ',', '意', '大', '利', '米', '兰', '国', '家', '癌', '症', '中', '心', '研', '究', '显', '示', ',', '新', '冠', '病', '毒', '自', '去', '年', '9', '月', '就', '在', '意', '大', '利', '传', '播', '。', '有', '观', '点', '认', '为', '新', '冠', '病', '毒', '在', '武', '汉', '暴', '发', '之', '前', '就', '已', '经', '在', '海', '外', '传', '播', '。', '中', '方', '是', '否', '认', '同', '这', '一', '观', '点', '?']
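
On the Chinese-segmentation question above: as far as I know NLTK does not ship a general-purpose Chinese word segmenter out of the box, so a dedicated library such as jieba is the usual choice. A minimal sketch, assuming jieba has been installed with pip install jieba:

import jieba  # third-party Chinese segmenter, independent of NLTK

sentence = "有记者提问,意大利米兰国家癌症中心研究显示,新冠病毒自去年9月就在意大利传播。"
words = jieba.lcut(sentence)  # lcut returns the segmentation as a list of words
print(words)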

2. Part-of-speech tagging

2.1 Stemming

For English, stemming reduces each word to its stem (root form). It may not be especially useful in routine text analysis, but it can be helpful in information retrieval.

from nltk.stem.lancaster import LancasterStemmer
from nltk import word_tokenize

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
tokenWords = word_tokenize(sentence)
print("tokenWords", tokenWords)

lancasterStemmer = LancasterStemmer() # Lancaster stemmer (an aggressive stemming algorithm)
wordsStemmer = [lancasterStemmer.stem(tokenWord) for tokenWord in tokenWords]
print("wordsStemmer", wordsStemmer)

output:

tokenWords ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']

wordsStemmer ['dbscan', '-', 'density-based', 'spat', 'clust', 'of', 'apply', 'with', 'nois', '.', 'find', 'cor', 'sampl', 'of', 'high', 'dens', 'and', 'expand', 'clust', 'from', 'them', '.', 'good', 'for', 'dat', 'which', 'contain', 'clust', 'of', 'simil', 'dens']
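
As the output shows, the Lancaster stemmer is quite aggressive (e.g. 'spat', 'dat', 'simil'). For comparison, a sketch using NLTK's PorterStemmer, which is generally more conservative:

from nltk import word_tokenize
from nltk.stem import PorterStemmer

sentence = "Finds core samples of high density and expands clusters from them."
porterStemmer = PorterStemmer()
wordsStemmer = [porterStemmer.stem(tokenWord) for tokenWord in word_tokenize(sentence)]
print(wordsStemmer)  # expect forms such as 'find', 'sampl', 'densiti', 'expand', 'cluster'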

2.2 Lemmatization (basic)

Lemmatization restores inflected forms (past tense, future tense, third-person singular, and so on) to the base form. With this basic usage, however, some words are left in their original form: WordNetLemmatizer.lemmatize treats every word as a noun by default and assumes the given form is already the lemma. Supplying the verb part of speech when calling the function fixes this; that is covered in the advanced version later on.

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
tokenWords = word_tokenize(sentence)
print("tokenWords", tokenWords)

wordnetLematizer = WordNetLemmatizer()
wordsLematizer = [wordnetLematizer.lemmatize(tokenWord) for tokenWord in tokenWords]
print(wordsLematizer)

output:

tokenWords ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expands', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'cluster', 'of', 'similar', 'density']
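
To see the default-noun behaviour in isolation, compare lemmatizing a single word with and without an explicit part of speech (a small illustration; 'v' is the WordNet code for verb):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('expands'))           # 'expands' - unchanged, treated as a noun by default
print(lemmatizer.lemmatize('expands', pos='v'))  # 'expand'  - reduced correctly once told it is a verb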

2.3 POS tagging

POS tagging maps each word to a concrete part of speech such as verb or noun.

from nltk import word_tokenize,pos_tag

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
tokenWords = word_tokenize(sentence)  # tokenize
print("tokenWords", tokenWords)
labeledTokenWords = pos_tag(tokenWords)     # POS tagging
print("labeledTokenWords", labeledTokenWords)

output:

tokenWords ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']

labeledTokenWords [('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]

The POS tag set is listed below (Reference: https://blog.csdn.net/weixin_38627015/article/details/88727790):

CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective (ordinal numbers are also tagged JJ)
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal auxiliary
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TO      to (as preposition or infinitive marker)
UH      Interjection
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present
WDT     Wh-determiner (relative: whose, which; interrogative: what, which, whose)
WP      Wh-pronoun (who, whose, which)
WP$     Possessive wh-pronoun
WRB     Wh-adverb (how, where, when)
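
Instead of consulting a table, the tag definitions can also be queried from NLTK itself; this is a small sketch and typically requires the 'tagsets' resource (nltk.download('tagsets')):

import nltk

nltk.help.upenn_tagset('VBZ')   # prints the definition and examples for the VBZ tag
nltk.help.upenn_tagset('NN.*')  # a regular expression can match several tags at once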

3. Lemmatization (advanced)

By first tagging each word's part of speech and then lemmatizing with that tag, inflected forms are reduced correctly and verbs are no longer treated as nouns.

from nltk import word_tokenize,pos_tag
from nltk.stem import WordNetLemmatizer
wordsLematizer = []
wordnetLematizer = WordNetLemmatizer()

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
tokenWords = word_tokenize(sentence)  # tokenize
labeledTokenWords = pos_tag(tokenWords)     # POS tagging
print(labeledTokenWords)

for word, tag in labeledTokenWords:
    # map the Penn Treebank tag prefix to a WordNet POS so that
    # lemmatization uses the correct word class (the default is noun)
    tempPos = ''
    if tag.startswith('NN'):
        tempPos = 'n'
    elif tag.startswith('VB'):
        tempPos = 'v'
    elif tag.startswith('JJ'):
        tempPos = 'a'
    elif tag.startswith('R'):
        tempPos = 'r'

    if tempPos:
        wordLematizer = wordnetLematizer.lemmatize(word, pos=tempPos)
    else:
        wordLematizer = wordnetLematizer.lemmatize(word)

    wordsLematizer.append(wordLematizer)

print(wordsLematizer)

output:

[('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
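
The tag-mapping branch above can also be factored into a small helper built on the constants from nltk.corpus.wordnet (wordnet.NOUN is 'n', wordnet.VERB is 'v', and so on). This is just an illustrative refactoring; the helper name getWordnetPos is made up here:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def getWordnetPos(treebankTag):
    # map a Penn Treebank tag to the corresponding WordNet POS; fall back to noun
    if treebankTag.startswith('VB'):
        return wordnet.VERB
    if treebankTag.startswith('JJ'):
        return wordnet.ADJ
    if treebankTag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
sentence = "Finds core samples of high density and expands clusters from them."
lemmas = [lemmatizer.lemmatize(word, pos=getWordnetPos(tag))
          for word, tag in pos_tag(word_tokenize(sentence))]
print(lemmas)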

4. Removing stop words, a custom character-removal list, and case conversion

#encoding:utf-8
from nltk.corpus import stopwords
from nltk import word_tokenize,pos_tag
from nltk.stem import WordNetLemmatizer

wordsLematizer = []
wordnetLematizer = WordNetLemmatizer()

sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density?"
tokenWords = word_tokenize(sentence)  # tokenize
labeledTokenWords = pos_tag(tokenWords)     # POS tagging

for word, tag in labeledTokenWords:
    # map the Penn Treebank tag prefix to a WordNet POS so that
    # lemmatization uses the correct word class (the default is noun)
    tempPos = ''
    if tag.startswith('NN'):
        tempPos = 'n'
    elif tag.startswith('VB'):
        tempPos = 'v'
    elif tag.startswith('JJ'):
        tempPos = 'a'
    elif tag.startswith('R'):
        tempPos = 'r'

    if tempPos:
        wordLematizer = wordnetLematizer.lemmatize(word, pos=tempPos)
    else:
        wordLematizer = wordnetLematizer.lemmatize(word)

    wordsLematizer.append(wordLematizer)

cleanedWords = [word for word in wordsLematizer if word not in stopwords.words('english')]
print("原始词:", wordsLematizer)
print("去除停用词后:", cleanedWords)


# remove special characters via a custom list
characters = [',', '.','DBSCAN', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%','-','...','^','{','}']
wordsList = [word for word in cleanedWords if word not in characters]
print("自定义characters列表,并去除包含在列表中的内容", wordsList)



# case conversion
# normalize case so the same word is not counted twice just because of different capitalization
lowerWordsList = [x.lower() for x in wordsList]
print("大小写转化(小写):", lowerWordsList)

upperWordsList = [x.upper() for x in wordsList]
print("大小写转化(大写):", upperWordsList)

output:

Original words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density', '?']

After removing stop words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', '.', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', '.', 'Good', 'data', 'contain', 'cluster', 'similar', 'density', '?']

After removing items in the custom characters list: ['Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']

Lowercased: ['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']

Uppercased: ['DENSITY-BASED', 'SPATIAL', 'CLUSTERING', 'APPLICATIONS', 'NOISE', 'FINDS', 'CORE', 'SAMPLE', 'HIGH', 'DENSITY', 'EXPAND', 'CLUSTER', 'GOOD', 'DATA', 'CONTAIN', 'CLUSTER', 'SIMILAR', 'DENSITY']
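
One small note on the stop-word filter above: stopwords.words('english') returns a list, and inside the list comprehension it is rebuilt and scanned linearly for every word. On larger texts it is worth converting it to a set once; a minor optimization sketch reusing the wordsLematizer list from the code above:

from nltk.corpus import stopwords

stopSet = set(stopwords.words('english'))   # build the stop-word set once
cleanedWords = [word for word in wordsLematizer if word not in stopSet]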

5. Text analysis

5.1 Word frequency counts + line plot

#encoding:utf-8
# after the preprocessing above we have a clean word list ready for text analysis / text mining

# word frequency counts
from nltk import FreqDist
import matplotlib

lowerWordsList = ['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']

# FreqDist builds the frequency distribution of the tokens in the text,
# essentially a mapping from each token to the number of times it occurs
freq = FreqDist(lowerWordsList) # behaves like a dict
print("frequency",freq)
for key, value in freq.items():
    print(str(key) + ':' + str(value))

# visualization: line plot
freq.plot(20, cumulative=False) # the saved figure is quite sharp; handy for math-modeling contests and similar write-ups

output:

density-based:1
spatial:1
clustering:1
applications:1
noise:1
finds:1
core:1
sample:1
high:1
density:2
expand:1
cluster:2
good:1
data:1
contain:1
similar:1
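
Since FreqDist subclasses collections.Counter, the usual Counter methods are available as well; for example, most_common returns the top-n tokens directly:

print(freq.most_common(3))  # e.g. [('density', 2), ('cluster', 2), ('density-based', 1)]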

5.2 Word cloud (may require extra dependencies on Windows; still to be tried)
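
The word-cloud part is left untried above. A minimal sketch, assuming the third-party wordcloud package is installed (pip install wordcloud) together with matplotlib; on Windows it may additionally need a prebuilt wheel that matches the Python version:

from wordcloud import WordCloud   # third-party package, not part of NLTK
import matplotlib.pyplot as plt
from nltk import FreqDist

lowerWordsList = ['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']
freq = FreqDist(lowerWordsList)

# FreqDist behaves like a dict of word -> count, which is what generate_from_frequencies expects
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(freq)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()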

 
