【Task 2 - Study TF-IDF theory and practice: represent text with TF-IDF】
TF-IDF Review
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document within a collection or corpus; it acts as a weighting factor for each word/token. It is defined as the product of term frequency and inverse document frequency:
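Writing the product out explicitly, the classic textbook form (sklearn's TfidfVectorizer uses this up to smoothing and normalization details) is:

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}
```

where tf(t, d) is the number of times term t occurs in document d, D is the corpus, and N = |D| is the total number of documents. A term that appears in many documents gets a small idf and is down-weighted.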
代码实践
I wrote my own TF-IDF implementation before; it ran fine on small data but fell over on anything larger. Better to just use the ready-made one from sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
from time import time

t0 = time()
vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training set only
X_train_tfidf = vectorizer.fit_transform(X_train['word_seg'])
# Reuse that vocabulary on the validation set. Calling fit_transform here
# (as in my first attempt) rebuilds the vocabulary from the validation data
# and produces a feature space incompatible with the training matrix.
X_valid_tfidf = vectorizer.transform(X_valid['word_seg'])
duration = time() - t0
print("done in %fs" % duration)
print("%d documents, %d features" % X_train_tfidf.shape)
print("%d documents, %d features" % X_valid_tfidf.shape)
# done in 125.134721s
# 92049 documents, 824243 features
# 10228 documents, 824243 features
# (transform keeps the 824243 training columns; the original buggy run
# printed 254316 features because fit_transform refit on the valid set)
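The fit-on-train / transform-on-valid distinction is easy to check on a toy corpus (the documents below are made-up stand-ins for the competition's `word_seg` column):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus in place of the real word_seg data
train_docs = ["cat sat mat", "dog sat log", "cat dog"]
valid_docs = ["mat dog", "bird sat"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_docs)  # learns vocabulary from train
X_va = vectorizer.transform(valid_docs)      # reuses that same vocabulary

# Train vocabulary has 5 tokens: cat, dog, log, mat, sat
print(X_tr.shape)  # (3, 5)
# Valid matrix gets the SAME 5 columns; the unseen token "bird" is ignored
print(X_va.shape)  # (2, 5)
```

Because `transform` maps onto the training vocabulary, a classifier fit on `X_tr` can score `X_va` directly; with a second `fit_transform`, the column counts (and meanings) would diverge, exactly as in the 824243-vs-254316 mismatch above.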
Recap: 【达观杯】Data Competition Learning Notes (Part 1)
https://blog.csdn.net/weixin_42317507/article/details/89047074