Natural Language Processing | (4) English Text Processing with NLTK

CoreJT

Article link: https://blog.csdn.net/sdu_hao/article/details/86752556

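In this post we use NLTK to perform some basic processing on English text. Later posts cover more advanced models and methods, but these basics are worth mastering first: they are the preprocessing steps that turn raw data into input for those more advanced models and tools.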

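1. Introduction to NLTK

2. English Tokenization

3. Stop Words

4. Part-of-Speech Tagging

5. Chunking

6. Named Entity Recognition

7. Stemming and Lemmatizing

8. WordNet and Word Senses

Complete code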

1. Introduction to NLTK

2. English Tokenization

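import nltk
from nltk import word_tokenize, sent_tokenize
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
# In a Jupyter notebook, use the `%matplotlib inline` magic instead;
# it is not valid syntax in a plain Python script.

# Load the data: read the whole text file into a single string
# (the encoding is an assumption; adjust it to match your file)
with open('./data/text.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()
# Check the type
print("Type of corpus:", type(corpus))

# Split the text into sentences; returns a list
# nltk.download('punkt')  # one-time download; newer NLTK releases may ask for 'punkt_tab'
sentences = sent_tokenize(corpus)
print(sentences)

# Split the text into word tokens; returns a list
words = word_tokenize(corpus)
print(words[:20])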
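
`sent_tokenize` and `word_tokenize` rely on the downloaded Punkt models. When that data is not available, NLTK's purely regex-based tokenizers can serve as a rough stand-in. The sketch below is an illustration rather than part of the original walkthrough; the sample sentence is made up, and the results are coarser than `word_tokenize` (for example, contractions are split at the apostrophe).

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sample = "Don't panic. NLTK is fun!"  # made-up sample text

# wordpunct_tokenize splits on the regex \w+|[^\w\s]+ (runs of word
# characters, or runs of punctuation), so no model download is needed.
punct_tokens = wordpunct_tokenize(sample)
print(punct_tokens)

# RegexpTokenizer keeps only what the pattern matches; here, words only.
word_only = RegexpTokenizer(r"\w+").tokenize(sample)
print(word_only)
```

Dropping punctuation with `RegexpTokenizer` is convenient for word counts, but it also discards the sentence boundaries that later steps such as chunking depend on.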

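3. Stop Words

For how stop-word lists are produced and collected in machine learning, see the Zhihu discussion "How to collect stop words in machine learning".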

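# Import NLTK's built-in stop-word lists
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download to local disk
stop_words = stopwords.words('english')  # all built-in English stop words
print(stop_words[:10])  # show the first 10

# Drop stop words with a list comprehension. The built-in list is
# lowercase, so compare on w.lower() to also catch tokens such as "The".
stop_set = set(stop_words)  # a set makes each membership test O(1)
filter_corpus = [w for w in words if w.lower() not in stop_set]
print(filter_corpus[:20])

print("Number of stop-word tokens removed:", len(words) - len(filter_corpus))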
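
To see which stop words account for the removed tokens, the filtering step can be paired with a frequency count. This is a minimal sketch that runs without any NLTK data: `demo_stop_words` and `demo_tokens` are hypothetical stand-ins for `stopwords.words('english')` and `words`.

```python
from collections import Counter

# Hypothetical stand-ins so the example runs without NLTK data
demo_stop_words = {"the", "is", "a", "of", "and"}
demo_tokens = ["The", "cat", "is", "a", "friend", "of", "the", "dog", "."]

# Case-insensitive comparison: the stop list is lowercase, the tokens are not
kept = [t for t in demo_tokens if t.lower() not in demo_stop_words]
removed = Counter(t.lower() for t in demo_tokens if t.lower() in demo_stop_words)

print(kept)
print(removed.most_common(3))
print("removed:", len(demo_tokens) - len(kept))
```

Note that punctuation tokens such as "." survive the filter, since they are not stop words; they need a separate check if they should be dropped too.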

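4. Part-of-Speech Tagging

# Part-of-speech tagging
from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')  # one-time download to local disk
# Note: the tagger uses neighboring words as context, so tagging the
# unfiltered `words` list is generally more reliable than tagging after
# stop-word removal.
tags = pos_tag(filter_corpus)
print(tags[:20])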

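The tag codes (NLTK's default English tagger uses the Penn Treebank tag set) and their meanings are listed in the following correspondence table: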
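
As a quick reference, a few of the most common Penn Treebank tags can be kept in a small lookup table. The subset below is a hand-picked selection, not the full tag set, and `describe_tags` is a hypothetical helper introduced only for illustration.

```python
# A hand-picked subset of Penn Treebank POS tags (not the full set)
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}

def describe_tags(tagged):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown (not in this subset)"))
            for word, tag in tagged]

# Hand-tagged example in the same (word, tag) format that pos_tag returns
example = [("CoreJT", "NNP"), ("studies", "VBZ"), ("at", "IN")]
described = describe_tags(example)
for row in described:
    print(row)
```

NLTK also provides `nltk.help.upenn_tagset()`, which prints the full list once its tag set documentation data has been downloaded.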

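5. Chunking

import nltk
from nltk.chunk import RegexpParser

# A grammar that matches noun phrases (NP):
# adjectives (JJ) + nouns (NN), or adjectives + noun + conjunction (CC) + nouns.
# <NN> matches only singular common nouns; <NN.*> would also cover NNS, NNP and NNPS.
pattern = """
    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    """

# Build the chunk parser
chunker = RegexpParser(pattern)

# A passage of text (a string)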
text = """
The National Wrestling Association was an early professional wrestling sanctioning body created in 1930 by 
the National Boxing Association (NBA) (now the World Boxing Association, WBA) as an attempt to create
a governing body for professional wrestling in the United States. The group created a number of "World" level 
championships as an attempt to clear up the professional wrestling rankings which at the time saw a number of 
different championships promoted as the "true world championship". The National Wrestling Association's NWA 
World Heavyweight Championship was later considered part of the historical lineage of the National Wrestling 
Alliance's NWA World Heavyweight Championship when then National Wrestling Association champion Lou Thesz 
won the National Wrestling Alliance championship, folding the original championship into one title in 1949."""


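# Split into sentences; returns a list
tokenized_sentences = nltk.sent_tokenize(text)
# Tokenize each sentence; returns a nested list
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentences]
# POS-tag each tokenized sentence
tagged_words = [nltk.pos_tag(tokens) for tokens in tokenized_words]

# Find the NP chunks defined above
word_tree = [chunker.parse(tagged) for tagged in tagged_words]

# Opens a pop-up window (needs Tk) showing the parse tree of the first sentence;
# on a headless machine, print(word_tree[0]) shows the same tree as text.
word_tree[0].draw()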
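
Beyond drawing the tree, the chunks themselves are usually what later steps need. The sketch below pulls the NP phrases out of a parse tree as plain strings. To stay runnable without the tagger data, it uses a hand-tagged sentence (an assumption standing in for `pos_tag` output) and only the first rule of the grammar above.

```python
from nltk.chunk import RegexpParser

# Hand-tagged sentence (an assumption, standing in for pos_tag output)
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"),
          ("lazy", "JJ"), ("dog", "NN")]

chunker = RegexpParser("NP: {<JJ>*<NN>+}")
tree = chunker.parse(tagged)

# Collect the words under every subtree labeled NP
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(noun_phrases)
```

The determiner "the" stays outside both chunks because the rule does not mention `<DT>`; adding an optional `<DT>?` at the front of the rule would include it.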

6. Named Entity Recognition

from nltk import ne_chunk, pos_tag, word_tokenize
# nltk.download('maxent_ne_chunker')  # one-time download to local disk
# nltk.download('words')
sentence = 'CoreJT studies at Stanford University.'
# Tokenize, POS-tag, then run named entity recognition on the sentence
print(ne_chunk(pos_tag(word_tokenize(sentence))))

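For named entity recognition, the Stanford CoreNLP modules are also highly recommended as the NER backend behind NLTK; in general they run faster and recognize entities more accurately.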
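
The tree printed by `ne_chunk` can be flattened into (label, text) pairs for downstream use. The sketch below shows one way to do that. `extract_entities` is a hypothetical helper, and `demo_tree` is built by hand in the shape `ne_chunk` returns; its labels are an assumption for illustration, not actual model output.

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Return (label, text) pairs for each labeled subtree directly under the root."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built tree in the shape ne_chunk returns (labels are assumed)
demo_tree = Tree("S", [
    Tree("PERSON", [("CoreJT", "NNP")]),
    ("studies", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Stanford", "NNP"), ("University", "NNP")]),
    (".", "."),
])

entities = extract_entities(demo_tree)
print(entities)
```

The same function can be applied to the real result, for example `extract_entities(ne_chunk(pos_tag(word_tokenize(sentence))))`, though the labels the model assigns may differ from the ones assumed here.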

7. Stemming and Lemmatizing

# One option is PorterStemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))
print(stemmer.stem('makes'))
print(stemmer.stem('tagged'))

# Another option is SnowballStemmer

from nltk.stem import SnowballStemmer
stemmer1 = SnowballStemmer('english')  # select the English stemmer
print(stemmer1.stem('growing'))
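
Running both stemmers over the same words makes it easier to compare them. The sketch below is only an illustration with a made-up word list; it also shows that a stem is not always a dictionary word (a rule-based stemmer reduces "studies" to "studi", for instance).

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Made-up word list for a side-by-side comparison
words_to_stem = ["running", "makes", "tagged", "growing", "studies"]
comparison = [(w, porter.stem(w), snowball.stem(w)) for w in words_to_stem]
for word, p, s in comparison:
    print(f"{word:10s} porter={p:8s} snowball={s}")
```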

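# Lemmatization is similar to stemming, but it looks words up in WordNet
# and returns a real dictionary form (lemma) instead of a truncated stem.
# Stemming is faster because it only applies a fixed set of rules.
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download to local disk
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('makes'))  # pos defaults to 'n' (noun)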
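
`lemmatize` treats every word as a noun unless a `pos` argument is given, so verbs and adjectives are often returned unchanged. A common workaround is to pass in the tag from `pos_tag`, converted to WordNet's one-letter codes. This is a minimal sketch: `penn_to_wordnet_pos` is a hypothetical helper, and the tagged tokens are written by hand so the block runs without any NLTK data.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the one-letter POS codes WordNet uses."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective (wordnet.ADJ)
    if penn_tag.startswith("VB"):
        return "v"  # verb (wordnet.VERB)
    if penn_tag.startswith("RB"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # noun (wordnet.NOUN), also the lemmatizer's default

# Hand-tagged tokens standing in for pos_tag output
tagged = [("makes", "VBZ"), ("better", "JJR"), ("dogs", "NNS"), ("quickly", "RB")]
pos_pairs = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_pairs)

# With the WordNet data downloaded, each pair can be passed on as:
#   WordNetLemmatizer().lemmatize(word, pos=pos)
```

Falling back to "n" for every other tag mirrors the lemmatizer's own default, so words with tags outside these four groups behave as they did before.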

8. WordNet and Word Senses

from nltk.corpus import wordnet as wn

print(wn.synsets('man'))  # all senses (synsets) of the word "man"
print(wn.synsets('man')[0].definition())  # definition of the first sense
print(wn.synsets('man')[1].definition())  # definition of the second sense

print(wn.synsets('dog'))  # all senses of the word "dog"
print(wn.synsets('dog')[0].definition())  # definition of the first sense
# Example sentence for the first sense
dog = wn.synsets('dog')[0]
# or: dog = wn.synset('dog.n.01')
print(dog.examples()[0])

# Hypernyms (more general concepts) of dog
print(dog.hypernyms())  # canines, domestic animals
