Python: a quick and simple TF-IDF vector-space representation of documents using sklearn's built-in modules

phoebe_IT

Category: Algorithm Implementation

Article link: https://blog.csdn.net/u012448083/article/details/50955530

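This post uses sklearn's TfidfVectorizer (from sklearn.feature_extraction.text import TfidfVectorizer) to extract a vocabulary from a set of documents and then, based on that vocabulary, represent each document as a TF-IDF weighted vector in vector space.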
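To make the "TF-IDF principle" concrete, here is a minimal sketch that rebuilds the weights by hand and compares them with TfidfVectorizer under its default settings (raw term counts, smoothed IDF, L2 row normalization). The toy corpus and the names `docs`, `terms`, and `manual` are assumptions made for illustration and are not part of the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; tokens are separated by whitespace.
docs = ["apple banana apple", "banana cherry", "banana date"]

vectorizer = TfidfVectorizer()
sk_matrix = vectorizer.fit_transform(docs).toarray()

# Terms ordered by their column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Raw term frequency: how often each term occurs in each document.
tf = np.array([[doc.split().count(t) for t in terms] for doc in docs], dtype=float)

# Document frequency and the smoothed IDF used when smooth_idf=True (the default).
n_docs = len(docs)
df = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight and L2-normalize each row (norm="l2" is the default).
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(terms)
print(np.allclose(sk_matrix, manual))
```

A term that occurs in every document ("banana" here) gets the smallest IDF, so it contributes less to each vector than terms that occur in a single document.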

'''
Meaning of min_df:
min_df is used for removing terms that appear too infrequently. For example:
• min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".
• min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.

Meaning of max_df:
max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:
• max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
• max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.

'''
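The effect of these two thresholds can be seen on a small corpus. The sketch below is only illustrative: the four English sentences and the helper `vocab` are hypothetical and do not come from the original post. It uses CountVectorizer, which accepts the same min_df and max_df parameters as TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "the" occurs in all 4 documents,
# "cat" and "ran" in 2 each, every other word in 1.
docs = ["the cat sat", "the cat ran", "the dog ran", "the bird sang"]

def vocab(**kwargs):
    """Return the sorted vocabulary learned with the given thresholds."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

print(vocab())                       # defaults: nothing is filtered out
print(vocab(min_df=2))               # drop terms found in fewer than 2 documents
print(vocab(max_df=0.5))             # drop terms found in more than 50% of documents
print(vocab(min_df=2, max_df=0.5))   # apply both filters
```

With both filters active only "cat" and "ran" survive: "the" is too frequent, and the remaining words are too rare.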

# coding=utf-8
mydoclist = [u'温馨 提示 ： 家庭 畅享 套餐 介绍 、 主卡 添加 / 取消 副 卡 短信 办理 方式 , 可 点击 文档 左上方  短信  图标 即可 将 短信 指令 发送给 客户',
u'客户 申请 i 我家 ， 家庭 畅享 计划  后 ， 可 选择 设置 1 - 6 个 同一 归属 地 的 中国移动 网 内 号码 作为 亲情 号码 ， 组建 一个 家庭 亲情 网  家庭 内 ',
u'所有 成员 可 享受 本地 互打 免费 优惠 ， 家庭 主卡 号码 还 可 享受 省内 / 国内 漫游 接听 免费 的 优惠']
from sklearn.feature_extraction.text import CountVectorizer

# Equivalent two-step route (left commented out): count terms first, then apply TF-IDF weighting.
# count_vectorizer = CountVectorizer(min_df=1)
# term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
# print("Vocabulary:", count_vectorizer.vocabulary_)
#
# from sklearn.feature_extraction.text import TfidfTransformer
#
# tfidf = TfidfTransformer(norm="l2")
# tfidf.fit(term_freq_matrix)
#
# tf_idf_matrix = tfidf.transform(term_freq_matrix)
# print(tf_idf_matrix.todense())


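from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the corpus and build the TF-IDF matrix.
# Note: the default token_pattern keeps only tokens of two or more characters,
# so single-character words are dropped from the vocabulary.
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

# vocabulary_ is a dict mapping each term to its column index.
print(' '.join(tfidf_vectorizer.vocabulary_))
print(tfidf_matrix.todense())

# Represent a new document with the learned vocabulary; unseen terms are ignored.
new_docs = [u'一个 客户 号码 只能 办理 一种 家庭 畅享 计划 套餐 ， 且 只能 加入 一个 家庭网']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)  # holds TF-IDF weights, despite the name
print(tfidf_vectorizer.vocabulary_, type(tfidf_vectorizer.vocabulary_))

# Terms sorted by column index, so they line up with the matrix columns.
vocab_by_index = sorted(tfidf_vectorizer.vocabulary_.items(), key=lambda d: d[1])
print(' '.join(term for term, _ in vocab_by_index))
print(sorted(tfidf_vectorizer.vocabulary_.values()))
print(vocab_by_index)

print(new_term_freq_matrix.todense())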
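Two points in the listing above are easy to miss, so here is a small sketch of both. First, TfidfVectorizer should give the same matrix as the commented-out route of CountVectorizer followed by TfidfTransformer. Second, with the default token_pattern, single-character words such as 可 do not enter the vocabulary; passing `token_pattern=r"(?u)\b\w+\b"` is one common way to keep them. The three short sentences are adapted from the article's corpus for brevity, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Pre-segmented sentences (tokens separated by spaces), shortened from the article's data.
docs = [
    u'家庭 主卡 号码 可 享受 优惠',
    u'客户 可 设置 亲情 号码',
    u'家庭 成员 可 享受 免费 优惠',
]

# One-step route: vocabulary, counts and TF-IDF weighting in a single object.
one_step = TfidfVectorizer(min_df=1).fit_transform(docs)

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer(min_df=1).fit_transform(docs)
two_step = TfidfTransformer(norm="l2").fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print("identical matrices:", same)

# The default token_pattern requires at least two word characters per token.
default_vocab = TfidfVectorizer().fit(docs).vocabulary_
print(u'可' in default_vocab)

# A pattern that also accepts single-character tokens.
keep_single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs).vocabulary_
print(u'可' in keep_single)
```

Whether single-character words should be kept depends on the task; for pre-segmented Chinese text many of them carry meaning, so it is worth checking the learned vocabulary before relying on the vectors.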
