Sentence Tokenization
NLTK's word tokenization works at the sentence level, so for a document you first split the text into sentences, and then tokenize each sentence into words.
from nltk.tokenize import sent_tokenize
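# note: sent_tokenize relies on the Punkt model; download it once with nltk.download('punkt')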
text = """Hello Mr. Smith, how are you doing today? The weather is great, and
city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and \ncity is awesome.The sky is pinkish-blue.', "You shouldn't eat cardboard"]
Note that "awesome.The sky is pinkish-blue." is not split into two sentences: without a space after the period, the Punkt tokenizer does not detect a sentence boundary there.
Word Tokenization
import nltk
sent = "Study hard and improve every day."
token = nltk.word_tokenize(sent)
print(token)
['Study', 'hard', 'and', 'improve', 'every', 'day', '.']
Removing Punctuation
Call a punctuation-removal function on each token: string.punctuation contains all ASCII punctuation characters, and the function replaces each of them in the token with a space (a sketch of such a function is shown below).
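A minimal sketch of such a helper, assuming a hypothetical function name remove_punct and reusing the tokens from the word-tokenization example above; it builds a translation table from string.punctuation and maps every punctuation character to a space.
import string

# hypothetical helper: replace every punctuation character in a token with a space,
# then strip the surrounding whitespace
def remove_punct(token):
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return token.translate(table).strip()

tokens = ['Study', 'hard', 'and', 'improve', 'every', 'day', '.']
cleaned = [remove_punct(t) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation
print(cleaned)
['Study', 'hard', 'and', 'improve', 'every', 'day']
Tokens that consist only of punctuation (such as the final '.') become empty strings after the replacement, so they are filtered out.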