【Task 2 - Study TF-IDF theory and practice: represent text with TF-IDF】
TF-IDF Review
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document within a collection or corpus; it acts as a weighting factor for each word/token. It is defined as the product of term frequency and inverse document frequency:
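Writing the product out explicitly, the classic textbook form (sklearn's TfidfVectorizer uses this up to smoothing and normalization details) is:

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}
```

where tf(t, d) is the number of times term t occurs in document d, D is the corpus, and N = |D| is the total number of documents. A term that appears in many documents gets a small idf and is down-weighted.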
代码实践
I wrote my own TF-IDF implementation before; it ran fine on small data but fell over on anything larger. Better to just use the ready-made one from sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
from time import time

t0 = time()
vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training set only
X_train_tfidf = vectorizer.fit_transform(X_train['word_seg'])
# Reuse that vocabulary on the validation set. Calling fit_transform here
# (as in my first attempt) rebuilds the vocabulary from the validation data
# and produces a feature space incompatible with the training matrix.
X_valid_tfidf = vectorizer.transform(X_valid['word_seg'])
duration = time() - t0
print("done in %fs" % duration)
print("%d documents, %d features" % X_train_tfidf.shape)
print("%d documents, %d features" % X_valid_tfidf.shape)
# done in 125.134721s
# 92049 documents, 824243 features
# 10228 documents, 824243 features
# (transform keeps the 824243 training columns; the original buggy run
# printed 254316 features because fit_transform refit on the valid set)
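The fit-on-train / transform-on-valid distinction is easy to check on a toy corpus (the documents below are made-up stand-ins for the competition's `word_seg` column):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus in place of the real word_seg data
train_docs = ["cat sat mat", "dog sat log", "cat dog"]
valid_docs = ["mat dog", "bird sat"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_docs)  # learns vocabulary from train
X_va = vectorizer.transform(valid_docs)      # reuses that same vocabulary

# Train vocabulary has 5 tokens: cat, dog, log, mat, sat
print(X_tr.shape)  # (3, 5)
# Valid matrix gets the SAME 5 columns; the unseen token "bird" is ignored
print(X_va.shape)  # (2, 5)
```

Because `transform` maps onto the training vocabulary, a classifier fit on `X_tr` can score `X_va` directly; with a second `fit_transform`, the column counts (and meanings) would diverge, exactly as in the 824243-vs-254316 mismatch above.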
Recap: 【达观杯】Data Competition Learning Notes (Part 1)
https://blog.csdn.net/weixin_42317507/article/details/89047074