Processing Text with NLTK (Part 1)

LBWNB、

Category: Text Mining in Python study notes. Tags: nltk, python, natural language processing, nlp

Article link: https://blog.csdn.net/qq_38356492/article/details/109004539

Viewing NLTK's built-in texts
Normalization, stemming, and lemmatization
Word tokenization and sentence splitting

Importing the packages

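import nltk
nltk.download()            ## opens the NLTK downloader; fetch the data used by nltk.book
from nltk.book import *    ## loads the sample texts text1 ... text9, plus FreqDist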

Inspecting the text text7

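text7
## Number of unique tokens
len(set(text7))
## Count word frequencies
dist = FreqDist(text7)
## Look up how many times a given word occurs
dist["four"]
## Keep words longer than 5 characters that occur more than 100 times
vocab1 = dist.keys()
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]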
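
The length-and-frequency filter above can be tried without downloading any corpus. NLTK's FreqDist is a subclass of collections.Counter, so the minimal sketch below uses Counter as a stand-in. The token list and the two thresholds (`min_length`, `min_count`) are hypothetical values chosen for a tiny sample; they are not taken from text7.

```python
from collections import Counter

# Hypothetical sample tokens, standing in for a large text such as text7.
tokens = ("market shares market company shares market "
          "the the the company investor").split()

# Counter plays the role of FreqDist here: token -> number of occurrences.
dist = Counter(tokens)

# Hypothetical thresholds, lowered to suit the tiny sample.
min_length = 5
min_count = 1

# Same filter as in the notes: long enough and frequent enough.
freqwords = sorted(w for w in dist if len(w) > min_length and dist[w] > min_count)

print(dist["market"])  # 3
print(freqwords)       # ['company', 'market', 'shares']
```

"the" is frequent but too short, and "investor" is long but occurs only once, so both are filtered out.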

Normalization, stemming, and lemmatization

1. Normalization and stemming

input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

output:

['list', 'list', 'list', 'list', 'list'] ## every form reduces to the stem "list"

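2. Lemmatization

udhr = nltk.corpus.udhr.words('English-Latin1')    ## the 'English-Latin1' file of the UDHR corpus
[porter.stem(t) for t in udhr[:20]]     ## stem with the PorterStemmer created earlier
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]    ## lemmatization returns valid words, so the result stays close to the original text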

3. Tokenization

text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

output:

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

Tokenization with nltk.word_tokenize

nltk.word_tokenize(text11)

output:

['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']

"n't" and "." are split off as separate tokens.
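
To see why splitting on spaces is not enough, the sketch below compares it with a simple regular expression from the standard library. The pattern is a simplified assumption made for illustration and is not how nltk.word_tokenize works: it separates punctuation from words, but it keeps "shouldn't" as a single token instead of splitting off "n't".

```python
import re

text11 = "Children shouldn't drink a sugary drink before bed."

# Splitting on spaces leaves the final period glued to "bed".
by_space = text11.split(" ")

# Simplified pattern (an assumption for illustration): a run of word
# characters with optional apostrophe parts, or a single punctuation mark.
token_pattern = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
by_regex = token_pattern.findall(text11)

print(by_space[-1])   # bed.
print(by_regex[-2:])  # ['bed', '.']
```

Handling contractions the way the output above does requires tokenizer-specific rules, which is one reason to prefer the library tokenizer over a hand-written pattern.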

Sentence splitting

text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)

output:

['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']
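
For contrast, here is a minimal sketch of a naive splitter that breaks at any whitespace following ".", "!" or "?". The rule is an assumption made for illustration, not the method used by nltk.sent_tokenize. On the same text it wrongly cuts the second sentence after the abbreviation "U.S.", producing five pieces instead of four.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the U.S. "
          "costs $2.99. Is this the third sentence? Yes, it is!")

# Naive rule: break at whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text12)

for s in naive_sentences:
    print(s)

print(len(naive_sentences))  # 5
```

The period inside "$2.99" causes no split because it is not followed by whitespace, but the one ending "U.S." is, so the naive rule treats it as a sentence boundary.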
