from sklearn.feature_extraction.text import CountVectorizer
corpus=['Jobs was the chairman of Apple Inc., and he was very famous',
'I like to use apple computer',
'And I also like to eat apple']
We now want to vectorize the three sentences in the list above.
vectorizer = CountVectorizer()
print("Document vectorization without stop-word filtering:")
print(vectorizer.fit_transform(corpus).todense())  # show the full (dense) matrix
print(vectorizer.vocabulary_)
The documents are converted into a matrix of 3 rows and 17 columns, where each entry records how many times a word occurs in a document. (Single-letter tokens such as "I" are dropped by CountVectorizer's default tokenizer, which only keeps tokens of two or more characters.)
Next, load the stop-word list from the NLTK package to process the documents further.
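To see which word each of the 17 columns corresponds to, one can list the learned vocabulary in column order. A minimal sketch (it re-fits a fresh CountVectorizer on the same corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Jobs was the chairman of Apple Inc., and he was very famous',
          'I like to use apple computer',
          'And I also like to eat apple']

vec = CountVectorizer()
X = vec.fit_transform(corpus)

# vocabulary_ maps each token to its column index;
# columns are ordered alphabetically by token
for word, col in sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, word, X[:, col].toarray().ravel())
```

Reading the per-column counts this way makes it easy to confirm, for example, that "apple" appears once in every document.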
import nltk
nltk.download("stopwords")
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords)
With stop words configured, re-vectorize the documents. Note that stop_words="english" below uses scikit-learn's built-in English stop-word list, which differs slightly from the NLTK list loaded above; to use the NLTK list instead, pass stop_words=stopwords.
vectorizer1 = CountVectorizer(stop_words="english")
print("after stopwords removal:")
print(vectorizer1.fit_transform(corpus).todense())
print(vectorizer1.vocabulary_)
The matrix now has 3 rows and 8 columns.
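To check exactly which of the original 17 words the built-in English stop list removed, one can compare the two vocabularies directly. A small sketch on the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Jobs was the chairman of Apple Inc., and he was very famous',
          'I like to use apple computer',
          'And I also like to eat apple']

vec_all = CountVectorizer().fit(corpus)
vec_en = CountVectorizer(stop_words="english").fit(corpus)

# Tokens in the unfiltered vocabulary that were filtered out as stop words
dropped = sorted(set(vec_all.vocabulary_) - set(vec_en.vocabulary_))
print(dropped)
```

Note that scikit-learn's list is fairly aggressive: it also removes words such as "inc" and "very" that many other stop lists keep.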
In addition, the n-gram setting controls how documents are split into features. The following code counts both single words and pairs of adjacent words as features, i.e. ngram_range=(1, 2).
vectorizer2 = CountVectorizer(ngram_range=(1,2))
print("N-gram mode:")
print(vectorizer2.fit_transform(corpus).todense())
print(vectorizer2.vocabulary_)