English Text Tokenization with NLTK

Latest recommended article published on 2024-02-03 12:03:58

SK-Berry

Tags: nltk python

Article link: https://blog.csdn.net/sk_berry/article/details/105240317

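1. Installing NLTK

First, open a terminal (Anaconda Prompt) and install nltk:

pip install nltk

Then open a Python terminal or Spyder (bundled with Anaconda) and run the following to download the NLTK data packages: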

import nltk
nltk.download()

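Note: for detailed steps or other installation methods, see the article "Anaconda3安装jieba库和NLTK库" (Installing the jieba and NLTK libraries with Anaconda3).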
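
Calling `nltk.download()` with no arguments opens an interactive downloader. As a lighter alternative, the sketch below fetches individual resources only when they are missing. The helper name `ensure_nltk_data` and the choice of resources (`punkt` for the tokenizers, `stopwords` for the stop-word lists) are assumptions made for illustration; recent NLTK releases may look for `punkt_tab` instead of `punkt`, so adjust the mapping to your installed version.

```python
# Resource names mapped to the path nltk.data.find() uses to locate them.
# NOTE: this selection is an assumption for the sketch, not a list from the article.
REQUIRED_RESOURCES = {
    "punkt": "tokenizers/punkt",       # models behind word_tokenize / sent_tokenize
    "stopwords": "corpora/stopwords",  # stop-word lists
}


def ensure_nltk_data(resources=None):
    """Download each listed resource only if it is not already on disk."""
    import nltk  # imported lazily so the helper can be defined before nltk is installed

    resources = REQUIRED_RESOURCES if resources is None else resources
    fetched = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)   # raises LookupError when the resource is missing
        except LookupError:
            nltk.download(name)    # non-interactive download of a single resource
            fetched.append(name)
    return fetched


# Call once before tokenizing, for example:
# ensure_nltk_data()
```

Downloading by name avoids the interactive window and keeps the setup reproducible in scripts.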

2. Word and Sentence Tokenization with NLTK

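English sentences are essentially made up of words, spaces, and punctuation, so splitting the text on spaces and punctuation is enough to produce a list of tokens. This makes the task comparatively simple:
(1) Word tokenization: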

from nltk import word_tokenize     # splits text into word and punctuation tokens
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

Output:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']
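
To illustrate why splitting on whitespace alone is not enough, the following standard-library sketch compares `str.split()` with a simple regular expression that also separates punctuation. The helper name `simple_tokenize` and its pattern are illustrative assumptions, not what `word_tokenize` does internally; NLTK's tokenizer additionally handles cases such as contractions.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "I was just a kid, and loved it very much! What a fantastic song!"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
print(text.split())

# The regex version emits each punctuation mark as its own token.
print(simple_tokenize(text))
```

For plain sentences like this one the regex yields the same kind of token list as shown in the output above, but it is only a rough approximation for real text.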

(2) Sentence tokenization:

from nltk import sent_tokenize    # splits text into sentences at sentence-ending punctuation
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentences = sent_tokenize(paragraph)
print(sentences)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

Note: both word_tokenize and sent_tokenize return their results as a Python list.
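
The idea of splitting at sentence-ending punctuation can be sketched with the standard library alone. The helper `naive_sent_split` below is a hypothetical name used for illustration and is not how `sent_tokenize` is implemented; NLTK relies on a pre-trained Punkt model that is designed to cope with cases such as abbreviations, which the naive rule gets wrong.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
print(naive_sent_split(paragraph))

# The naive rule also splits after the abbreviation "Dr.", producing a bogus sentence.
print(naive_sent_split("Dr. Smith loved that song. So did I."))
```

The second call shows the limitation: a period does not always end a sentence, so a purely rule-based split is only suitable for simple text.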

3. Removing Punctuation After Tokenization

from nltk import word_tokenize
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize the lowercased text
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # punctuation marks to filter out
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # drop punctuation tokens
print('\n[Result after removing punctuation:]')
print(cutwords2)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']
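
A hand-written punctuation list can miss marks such as quotes and hyphens, as well as multi-character tokens like `...`. As a possible variation, the sketch below uses `string.punctuation` from the standard library and drops any token made up entirely of punctuation characters. The helper name `strip_punctuation` and the sample tokens are assumptions for illustration.

```python
import string


def strip_punctuation(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]


tokens = ['what', 'a', 'fantastic', 'song', '!', '...', "it's"]
print(strip_punctuation(tokens))
```

Tokens that merely contain punctuation, such as "it's", are kept, since only tokens consisting solely of punctuation are removed.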

4. Removing Stop Words After Tokenization

from nltk import word_tokenize
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".l
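
As a minimal sketch of the idea in this section, stop-word removal can be written as the same kind of list-comprehension filter used for punctuation in section 3. The small `SAMPLE_STOPWORDS` set and the helper `remove_stopwords` are assumptions for illustration only; with the NLTK stop-word corpus downloaded, `set(stopwords.words('english'))` would supply a much fuller list.

```python
# Illustrative stop-word set (an assumption for this sketch, not NLTK's list).
SAMPLE_STOPWORDS = {"the", "i", "that", "was", "in", "on", "a",
                    "and", "it", "very", "what", "just"}


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ['the', 'first', 'time', 'i', 'heard', 'that', 'song',
          'was', 'in', 'hawaii', 'on', 'radio']
print(remove_stopwords(tokens))
```

Using a set rather than a list keeps each membership test fast, which matters once the stop-word list and the text grow.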
