Processing Text with NLTK (Part 1)

  • Exploring NLTK's built-in texts
  • Normalization, stemming, and lemmatization
  • Word and sentence tokenization

Importing the packages

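import nltk
nltk.download()    ## opens the interactive downloader; fetch the "book" collection
from nltk.book import *    ## loads the sample texts text1 ... text9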

Inspecting text7

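text7
## number of distinct tokens
vocab1 = set(text7)
len(vocab1)
## build the frequency distribution
dist = FreqDist(text7)
## look up how often a given word occurs
dist["four"]
## keep words longer than 5 characters that occur more than 100 times
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]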
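To make the filtering step concrete without downloading any corpus, here is a minimal sketch that uses `collections.Counter`, which `FreqDist` subclasses in current NLTK releases. The `tokens` list is a made-up stand-in for text7, and the threshold is lowered to 3 so the toy data produces a result.

```python
from collections import Counter

# Toy token list standing in for text7 (an assumption for illustration).
tokens = ["market", "shares", "market", "the", "the", "the",
          "trading", "market", "shares", "trading", "trading"]

counts = Counter(tokens)   # maps each token to its number of occurrences
vocab = set(tokens)        # distinct tokens

# Keep words longer than 5 characters that occur at least 3 times.
frequent = sorted(w for w in vocab if len(w) > 5 and counts[w] >= 3)
print(frequent)
```

Like `FreqDist`, a `Counter` returns 0 for a word it has never seen, so the lookup inside the comprehension never raises a `KeyError`.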

Normalization, stemming, and lemmatization

1. Normalization and stemming

input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

output:

['list', 'list', 'list', 'list', 'list'] ## every form reduces to the stem "list"
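To illustrate what a stemmer does in principle, the sketch below strips a few common suffixes by hand. It is deliberately naive and is not the Porter algorithm, which applies a much larger set of ordered rules. The names `SUFFIXES` and `naive_stem` are hypothetical and exist only for this example.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
SUFFIXES = ("ings", "ing", "ed", "s")  # checked in this order, longest first

def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "List listed lists listing listings".split()
stems = [naive_stem(w) for w in words]
print(stems)
```

On this particular input the toy rule happens to agree with the stemmer output shown above; on general text it would make many mistakes that a real stemmer avoids.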

2. Lemmatization

udhr = nltk.corpus.udhr.words('English-Latin1')    ## the 'English-Latin1' file of the udhr corpus
[porter.stem(t) for t in udhr[:20]]     ## stem with the PorterStemmer created above
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]    ## lemmatization returns valid words, so most tokens come back unchanged

3. Tokenization

text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

output:

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

Tokenizing with nltk.word_tokenize

nltk.word_tokenize(text11)

output:

['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']

Unlike the plain split, "n't" and "." are now separate tokens.
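The splitting behaviour can be approximated with a single regular expression. The sketch below is a rough stand-in written for this example, not NLTK's actual tokenizer, which handles many more cases; `TOKEN_PATTERN` is a hypothetical name.

```python
import re

# Alternatives are tried in order:
#   1. a word immediately followed by "n't" (stops before the contraction)
#   2. the contraction "n't" itself
#   3. an ordinary run of word characters
#   4. any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

text = "Children shouldn't drink a sugary drink before bed."
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

For this sentence the pattern yields the same ten tokens as the output above. It will diverge from NLTK on inputs such as prices or abbreviations, which is one reason to prefer the library tokenizer in practice.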

Sentence splitting

text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)

output:

['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']
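To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that splits after every '.', '!' or '?'. The sketch below uses only the standard library; the rule is an assumption chosen for contrast, not something NLTK does internally.

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Naive rule: split on whitespace that directly follows '.', '!' or '?'.
pieces = re.split(r"(?<=[.!?])\s+", text)

for piece in pieces:
    print(repr(piece))
```

The naive rule returns five pieces because it breaks after the abbreviation "U.S.", whereas the result above keeps that sentence whole. The decimal point in "$2.99" is safe in both cases, since no whitespace follows it.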