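Processing Text with NLTK (Part 1)
- Exploring NLTK's built-in texts
- Normalization, stemming, and lemmatization
- Word and sentence tokenization
Importing the packages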
import nltk
nltk.download()
from nltk.book import *
Inspecting text7
text7
len(set(text7)) ## number of distinct tokens in text7
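## Count word frequencies
dist = FreqDist(text7)
## Look up how many times a given word occurs
dist["four"]
## Keep words longer than 5 characters that occur more than 100 times
vocab1 = dist.keys()
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]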
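The frequency filter above can be tried without downloading any corpus. The sketch below is only an approximation: it uses `collections.Counter` (which `FreqDist` subclasses) in place of `FreqDist`, and the token list and the lower threshold of 2 are made up for illustration; they are not taken from text7.

```python
from collections import Counter

# Hypothetical sample tokens standing in for text7.
tokens = ("the company said the market fell while investors "
          "in the market waited and the company waited").split()

dist = Counter(tokens)   # maps each word to its count, like FreqDist(tokens)
vocab = dist.keys()      # the distinct words

# Keep words longer than 5 characters that occur at least twice.
frequent = sorted(w for w in vocab if len(w) > 5 and dist[w] >= 2)

print(len(set(tokens)))  # number of distinct tokens
print(dist["the"])       # count of a single word
print(frequent)
```

Iterating over the distinct words (`vocab`) rather than over every token avoids duplicates in the result.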
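Normalization, stemming, and lemmatization
1. Normalization and stemming
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ') ## normalize: lowercase, then split on spaces
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1] ## stem each word
output:
['list', 'list', 'list', 'list', 'list'] ## every form reduces to the stem 'list'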
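2. Lemmatization
udhr = nltk.corpus.udhr.words('English-Latin1') ## the 'English-Latin1' file of the udhr corpus
[porter.stem(t) for t in udhr[:20]] ## stem with the PorterStemmer created above
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]] ## lemmatizing, unlike stemming, leaves real words; most tokens come back unchanged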
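To make the difference between the two steps concrete, here is a deliberately crude sketch. It is neither the Porter algorithm nor WordNet: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical names and data invented for this illustration. The point is only that a stemmer chops suffixes by rule (and may leave a non-word), while a lemmatizer maps a word to a known base form and otherwise leaves it alone.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ings", "ing", "ed", "s")  # hypothetical rules, longest first

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary.
TOY_LEMMAS = {"rights": "right", "children": "child"}

def toy_lemmatize(word):
    """Return the known base form, or the word unchanged if unknown."""
    return TOY_LEMMAS.get(word, word)

stems = [toy_stem(w) for w in "List listed lists listing listings".split()]
print(stems)
print(toy_stem("this"), toy_lemmatize("this"))
print(toy_lemmatize("rights"))
```

Rule-based stripping turns "this" into the non-word "thi", whereas the lookup leaves it intact; real stemmers and lemmatizers are far more careful, but the trade-off is of the same kind.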
3. Tokenization
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ') ## naive split on spaces
output:
['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']
Tokenizing with nltk.word_tokenize
nltk.word_tokenize(text11)
output:
['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
Here "n't" and "." are split off as separate tokens.
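The splitting behaviour seen here can be imitated with a small regular expression. This is only a sketch for this one kind of sentence, not NLTK's actual tokenizer rules; `TOKEN_PATTERN` and `simple_tokenize` are names made up for the example, and the pattern handles only the "n't" contraction and single punctuation marks.

```python
import re

# Illustrative pattern only, not NLTK's actual tokenizer rules.
# Alternatives are tried left to right:
#   \w+(?=n't)  a word stem directly followed by "n't" ("should" in "shouldn't")
#   n't         the contraction itself
#   \w+         any other run of word characters
#   [^\w\s]     a single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def simple_tokenize(text):
    """Return the tokens of text, splitting off "n't" and punctuation."""
    return TOKEN_PATTERN.findall(text)

text11 = "Children shouldn't drink a sugary drink before bed."
tokens = simple_tokenize(text11)
print(tokens)
```

On this sentence the sketch yields the same ten tokens as the output shown above, but it makes no attempt to cover other contractions, abbreviations, or numbers with decimal points.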
Sentence splitting
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
sentences
output:
['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']
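Why not simply split at every period? The sketch below compares two hand-written rules on the same text. Neither is how `sent_tokenize` works (it relies on the pretrained Punkt model); the regular expressions are assumptions chosen for this example only.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the U.S. "
          "costs $2.99. Is this the third sentence? Yes, it is!")

# Naive rule: break at whitespace that follows ".", "!" or "?".
naive = re.split(r"(?<=[.!?])\s+", text12)

# Heuristic: additionally require the next character to be an uppercase letter.
heuristic = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text12)

print(len(naive))      # the abbreviation "U.S." causes an extra break
print(naive[1])
print(len(heuristic))
print(heuristic)
```

The naive rule cuts the second sentence in two after "U.S.", while the uppercase heuristic happens to recover the four sentences here. It would still fail on text such as "the U.S. Government", which is why a trained sentence tokenizer is the safer choice.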