Stemming

1. Word tokenization is the process of breaking a large body of text down into individual words.

First, install NLTK.
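
A minimal setup sketch: NLTK itself is installed with pip, and the Punkt tokenizer models that word_tokenize() and sent_tokenize() rely on are downloaded separately (recent NLTK releases may ask for the resource 'punkt_tab' instead of 'punkt'):

pip install nltk

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize()/sent_tokenize()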

Next, use the word_tokenize() method to split a paragraph into individual words:

import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
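
Note that word_tokenize() does more than split on whitespace: it separates punctuation and contractions following Penn Treebank conventions. A quick sketch, using an example sentence of my own:

import nltk
print(nltk.word_tokenize("Don't hesitate!"))

which should print the contraction and the punctuation as separate tokens:

['Do', "n't", 'hesitate', '!']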

To split text into sentences, use sent_tokenize():

import nltk
sentence_data = "Sun rises in the east. Sun sets in the west."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)
['Sun rises in the east.', 'Sun sets in the west.']
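
Because sent_tokenize() uses the pre-trained Punkt model, it does not naively split at every period; common abbreviations are recognized. A quick sketch with my own example text:

import nltk
sentence_data = "Mr. Smith went to Washington. He stayed for a week."
print (nltk.sent_tokenize(sentence_data))

which should keep "Mr." inside the first sentence:

['Mr. Smith went to Washington.', 'He stayed for a week.']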

----------------------------------------------------------------------

Stemming and Lemmatization in Python

In natural language processing, we often encounter situations where two or more words share a common root. For example, the three words agreed, agreeing and agreeable share the root word agree. A search involving any of these words should treat them as the same word, namely the root word, so it becomes essential to link all words back to their root. The NLTK library provides methods to do this linking and to output the root word.

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, tokenize into words
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the root (stem) of each word
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))
Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room
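
Porter is only one of several stemmers that ship with NLTK. As a sketch, the SnowballStemmer (an updated "Porter2" algorithm) and the more aggressive LancasterStemmer can be run side by side on the same tokens:

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # updated Porter rules for English
lancaster = LancasterStemmer()         # the most aggressive of the three

for w in ["originated", "comforts", "learning"]:
    print(w, porter.stem(w), snowball.stem(w), lancaster.stem(w))

Lancaster often cuts words down further than Porter does, which can conflate unrelated words; Snowball is the usual default choice for English today.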

Lemmatization is similar to stemming, but it brings context to the words, so it goes a step further and links words with similar meanings to one word. For example, if a paragraph has words like cars, trains and automobile, lemmatization links them all to automobile. The program below uses the WordNet lexical database for lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room
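
Note that "originated" and "learning" were left unchanged: lemmatize() assumes each word is a noun unless told otherwise. Passing pos='v' makes it treat the word as a verb, as this small sketch shows:

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The default POS is 'n' (noun); pass pos='v' to lemmatize as a verb
print(lemmatizer.lemmatize("originated", pos="v"))  # originate
print(lemmatizer.lemmatize("learning", pos="v"))    # learn
print(lemmatizer.lemmatize("are", pos="v"))         # be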