CountVectorizer: Word Frequency Counting

YPL_ZML

Category: Machine Learning

Article link: https://blog.csdn.net/YPL_ZML/article/details/93906264

from sklearn.feature_extraction.text import CountVectorizer
import jieba
# Instantiate a CountVectorizer object (English example, left commented out)
# con_vec = CountVectorizer(min_df=1)


# Prepare the text data
# text = ['This is the first document.', 'This is the second second document.', 'And the third one.',
#         'Is this the first document?', ]

# Count how many times each word occurs in each document
# X = con_vec.fit_transform(text)
# feature__name = con_vec.get_feature_names_out()
# print(feature__name)
# print(X)
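"""
print(X) shows the sparse matrix one entry per line, for example:
(0, 1)	1
first value:  index of the sentence (document) the entry belongs to
second value: index of the word in the feature-name list
1:            number of times that word occurs in that sentence
"""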
# Convert the sparse counts into a dense document-by-word count matrix.
# print(X.toarray())
# stop_words lists unimportant words to exclude from the vocabulary
con_vec = CountVectorizer(min_df=1, stop_words=['之后', '玩完'])
text = '今天天气真好,我要去北京天安门玩，要去景山攻牙之后，玩完大明劫'
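# Segment the sentence with jieba in accurate mode (cut_all=False)
text_list = jieba.cut(text, cut_all=False)
# Join the tokens with commas so CountVectorizer can split them apart again.
# Its default token pattern keeps only tokens of two or more characters,
# so single-character words are dropped from the vocabulary.
text_list = ",".join(text_list)
# fit_transform expects an iterable of documents, so wrap the string in a list
context = [text_list]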
print(context)

X = con_vec.fit_transform(context)
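# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature__name = con_vec.get_feature_names_out()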
print(feature__name)
print(X.toarray())
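
To make the meaning of the count matrix concrete, here is a minimal standard-library sketch of the same idea, applied to the English sentences from the commented-out example above. It is not scikit-learn's implementation. The names `count_matrix` and `TOKEN_PATTERN` are hypothetical, and the regular expression only approximates CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default token pattern
# (tokens made of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_matrix(docs, stop_words=()):
    """Return (sorted vocabulary, dense count rows) for a list of strings.

    Hypothetical helper for illustration only.
    """
    stop = set(stop_words)
    # Lowercase, tokenize, and drop stop words for every document
    tokenized = [
        [tok for tok in TOKEN_PATTERN.findall(doc.lower()) if tok not in stop]
        for doc in docs
    ]
    # The vocabulary is the sorted set of all remaining tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n  # rows[doc][word] = occurrences of word in doc
        rows.append(row)
    return vocab, rows


docs = ['This is the first document.', 'This is the second second document.',
        'And the third one.', 'Is this the first document?']
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row corresponds to one sentence and each column to one vocabulary word, so the entry `(0, 1)  1` described earlier means that word number 1 (`'document'` here) occurs once in sentence 0, while `'second'` gets a count of 2 in sentence 1.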