Natural Language Processing (NLP), Part 1: Tokenization, Sentence Splitting, and Stemming

You need the nltk package; Anaconda installs it by default.
You also need the NLTK corpora: http://www.nltk.org/data.html

Natural language basics:

1. Tokenization (word segmentation)
Chinese writes no spaces between words, so segmentation is ambiguous, e.g.:
鱼香肉丝里面多放点辣椒 (put extra chili in the yuxiang shredded pork)
对称加密需要DES处理引擎 (symmetric encryption needs a DES processing engine)
天儿冷了多穿点 (it's getting cold, dress warmly)
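The sentences above can be cut in different places depending on the word list in use. A toy forward-maximum-matching segmenter (with a made-up dictionary; real segmenters such as jieba use statistical models, so this is only a sketch of the idea) illustrates why a dictionary matters:

```python
# Toy forward maximum matching: greedily take the longest dictionary
# word starting at the current position, falling back to one character.
def fmm_segment(text, dictionary, max_len=4):
    words = []
    i = 0
    while i < len(text):
        # try the longest candidate first
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in dictionary:
                words.append(cand)
                i += j
                break
    return words

toy_dict = {'鱼香肉丝', '里面', '放点', '辣椒'}
print(fmm_segment('鱼香肉丝里面多放点辣椒', toy_dict))
# ['鱼香肉丝', '里面', '多', '放点', '辣椒']
```

With a different dictionary the same string would split differently, which is exactly the ambiguity the examples above are meant to show.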

Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuation to see it in action.

import nltk.tokenize as tk
doc = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuation to see it in action."
print(doc)
print('-'*72)

# sentence tokenization
tokens = tk.sent_tokenize(doc)
for token in tokens:
    print(token)
print('_'*72)

# word tokenization
tokens = tk.word_tokenize(doc)
for token in tokens:
    print(token)
print('_'*72)

# words and punctuation as separate tokens
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for token in tokens:
    print(token)
print('_'*72)
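The main difference between `word_tokenize` and `WordPunctTokenizer` shows up on contractions: `WordPunctTokenizer` splits purely on the regex `\w+|[^\w\s]+`, so "Let's" becomes three tokens. As a sketch, the same split can be reproduced with the standard library alone:

```python
import re

# WordPunctTokenizer splits on alphanumeric runs and non-whitespace,
# non-alphanumeric runs; re.findall with the same pattern mimics it.
pattern = r"\w+|[^\w\s]+"
print(re.findall(pattern, "Let's see how it works!"))
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

So pick `word_tokenize` when you want linguistically aware splitting, and `WordPunctTokenizer` when you want a fast, purely pattern-based split.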
2. Stemming

plays/playing/player -> play
Porter: lenient, simple, and fast, but fairly crude
Lancaster: the strictest and most aggressive; slower, and its stems are often not valid words
Snowball: sits between the other two in both precision and efficiency
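All three stemmers share the same core idea: strip suffixes according to an ordered rule list. A toy sketch (made-up rules, far simpler than the real Porter rule set, which also checks syllable-count conditions) shows the mechanism:

```python
# Toy suffix stripper: try suffixes in order, strip the first match,
# but only if the remaining stem keeps at least 3 characters.
SUFFIXES = ['ing', 'ly', 'es', 's', 'ed']

def toy_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print(toy_stem('playing'))  # play
print(toy_stem('wolves'))   # wolv
print(toy_stem('is'))       # is (too short to strip)
```

Note that `wolves` becomes `wolv`, not `wolf`: stemmers do not guarantee dictionary words, only that related forms collapse to the same stem.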

import nltk.stem.porter as pt 
import nltk.stem.lancaster as lc 
import nltk.stem.snowball as sb 
words = ['table','probably','wolves','playing','is','dog','the','beaches','grounded','dreamt','envision']
# stem with the Porter stemmer
stemmer = pt.PorterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)

# stem with the Lancaster stemmer
stemmer = lc.LancasterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)

# stem with the Snowball stemmer
stemmer = sb.SnowballStemmer('english')
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)
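The three loops above print long columns that are hard to compare. Collapsing them into one side-by-side table makes the differences between the stemmers easy to scan:

```python
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']
stemmers = [('PORTER', pt.PorterStemmer()),
            ('LANCASTER', lc.LancasterStemmer()),
            ('SNOWBALL', sb.SnowballStemmer('english'))]

# header row, then one row per word with each stemmer's output
print('{:>10}'.format('WORD') +
      ''.join('{:>12}'.format(name) for name, _ in stemmers))
for word in words:
    print('{:>10}'.format(word) +
          ''.join('{:>12}'.format(s.stem(word)) for _, s in stemmers))
```

In the output you can see, for example, that all three map `playing` to `play`, while the aggressive Lancaster stemmer cuts words like `probably` much shorter than Porter does.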