英文文本分词处理(NLTK)

1、NLTK的安装

首先,打开终端(Anaconda Prompt)安装nltk:

pip install nltk

打开Python终端或是Anaconda 的Spyder并输入以下内容来安装 NLTK 包

import nltk
nltk.download()

注意: 详细操作或其他安装方式请查看 Anaconda3安装jieba库和NLTK库

2、NLTK分词和分句

 由于英语的句子基本上就是由标点符号、空格和词构成,那么只要根据空格和标点符号将词语分割成数组即可,所以相对来说简单很多:
(1)分词:

from nltk import word_tokenize     #以空格形式实现分词
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

运行结果:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']

(2)分句:

from nltk import sent_tokenize    #以符号形式实现分句
sentences = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentence = sent_tokenize(sentences )
print(sentence)

运行结果:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

注意: NLTK分词或者分句以后,都会自动形成列表的形式

3、NLTK分词后去除标点符号
from nltk import word_tokenize
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   #分词
print('【NLTK分词结果:】')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   #定义标点符号列表
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   #去除标点符号
print('\n【NLTK分词后去除符号结果:】')
print(cutwords2)

运行结果:

【NLTK分词结果:】
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

【NLTK分词后去除符号结果:】
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']
4、NLTK分词后去除停用词
from nltk import word_tokenize
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".l
  • 89
    点赞
  • 311
    收藏
    觉得还不错? 一键收藏
  • 7
    评论
评论 7
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值