Understanding sklearn's TF-IDF method
TF-IDF is a very useful method for text feature extraction. Feature selection matters a great deal for text classification: the goal is to pick out the word elements that best capture a document's meaning. Good feature selection not only reduces the size of the problem but also improves classification performance; the choice of features has a major impact on a text classification system.
Applying TF-IDF for feature extraction turns each sentence into a vector in which every word receives a weight, and these vectors can then be used for downstream training tasks.
💡 Machine-learning training needs feature vectors, and feature vectors need weights; with only raw text and no weights, there is nothing to train on. Feature extraction is essentially the job of finding reasonable weights and assembling them into a vector.
Let's look at the example from the official documentation:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # X is a sparse matrix
X
(0, 1) 0.46979138557992045
(0, 2) 0.5802858236844359
(0, 6) 0.38408524091481483
(0, 3) 0.38408524091481483
(0, 8) 0.38408524091481483
(1, 5) 0.5386476208856763
(1, 1) 0.6876235979836938
(1, 6) 0.281088674033753
(1, 3) 0.281088674033753
(1, 8) 0.281088674033753
(2, 4) 0.511848512707169
(2, 7) 0.511848512707169
(2, 0) 0.511848512707169
(2, 6) 0.267103787642168
(2, 3) 0.267103787642168
(2, 8) 0.267103787642168
(3, 1) 0.46979138557992045
(3, 2) 0.5802858236844359
(3, 6) 0.38408524091481483
(3, 3) 0.38408524091481483
(3, 8) 0.38408524091481483
X is a sparse matrix; the todense() method converts it to an ordinary dense matrix:
X = X.todense()
X
matrix([[0. , 0.46979139, 0.58028582, 0.38408524, 0. ,
0. , 0.38408524, 0. , 0.38408524],
[0. , 0.6876236 , 0. , 0.28108867, 0. ,
0.53864762, 0.28108867, 0. , 0.28108867],
[0.51184851, 0. , 0. , 0.26710379, 0.51184851,
0. , 0.26710379, 0.51184851, 0.26710379],
[0. , 0.46979139, 0.58028582, 0.38408524, 0. ,
0. , 0.38408524, 0. , 0.38408524]])
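Where do these numbers come from? By default, TfidfVectorizer uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then L2-normalizes each row. A minimal sketch reproducing the value 0.58028582 for 'first' in the first document (the document frequencies are counted from the corpus above):

```python
import math

n = 4  # number of documents in the corpus

def idf(df):
    # sklearn's default smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Document 0: 'This is the first document.' — each term occurs once,
# so tf-idf before normalization is just idf(df) for each term.
raw = {
    'this': idf(4),      # 'this' appears in all 4 documents
    'is': idf(4),
    'the': idf(4),
    'first': idf(2),     # 'first' appears in 2 documents
    'document': idf(3),  # 'document' appears in 3 documents
}

norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
tfidf_first = raw['first'] / norm
print(round(tfidf_first, 8))  # matches 0.58028582 from the matrix above
```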
The matrix-index question
If you look closely, you will notice that the indices in the matrix do not line up with the words of the original corpus. What is going on, and how should the return value of TfidfVectorizer.fit_transform(corpus) be interpreted?
The bag-of-words model
Recall that TF-IDF is built on the bag-of-words model.
The vocabulary of the bag-of-words model, i.e. all the words, can be inspected with the vectorizer.get_feature_names() method (renamed get_feature_names_out() in newer scikit-learn versions):
vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Build a dictionary mapping each word to its index in the bag-of-words vocabulary:
feature_names = vectorizer.get_feature_names()
word2index_dict = dict(zip(feature_names, range(len(feature_names))))
word2index_dict
{'and': 0,
'document': 1,
'first': 2,
'is': 3,
'one': 4,
'second': 5,
'the': 6,
'third': 7,
'this': 8}
Now the output can be understood. Each entry (A, B) C of the sparse matrix returned by TfidfVectorizer.fit_transform(corpus) means:
A: document index B: index of the word in the bag-of-words vocabulary C: TF-IDF score of word B in document A
⚠️ Note that B is the index of the word in the vocabulary, not the column order in which the entries happen to be printed.
⚠️ Although the vocabulary indices are shared across the whole matrix, each word's TF-IDF value is computed per row, i.e. per document.
Build a dictionary mapping each word of a text to its TF-IDF value:
def get_tfidf_words_dict(text):
    """Return a {word: tf-idf score} dict for a single text."""
    # transform() scores the text against the vocabulary learned by fit_transform()
    tfidf_matrix = vectorizer.transform([text]).todense()
    feature_names = vectorizer.get_feature_names()
    feature_index = tfidf_matrix[0, :].nonzero()[1]
    tfidf_scores = zip([feature_names[i] for i in feature_index],
                       [tfidf_matrix[0, x] for x in feature_index])
    return dict(tfidf_scores)

for i in range(len(corpus)):
    d = get_tfidf_words_dict(corpus[i])
    print(d)
{'document': 0.46979138557992045, 'first': 0.5802858236844359, 'is': 0.38408524091481483, 'the': 0.38408524091481483, 'this': 0.38408524091481483}
{'document': 0.6876235979836938, 'is': 0.281088674033753, 'second': 0.5386476208856763, 'the': 0.281088674033753, 'this': 0.281088674033753}
{'and': 0.511848512707169, 'is': 0.267103787642168, 'one': 0.511848512707169, 'the': 0.267103787642168, 'third': 0.511848512707169, 'this': 0.267103787642168}
{'document': 0.46979138557992045, 'first': 0.5802858236844359, 'is': 0.38408524091481483, 'the': 0.38408524091481483, 'this': 0.38408524091481483}
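One more practical point: transform() only assigns weights to words that are already in the fitted vocabulary; words never seen during fitting are silently ignored. A quick sketch with a made-up sentence ('brand' and 'new' are not in the vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# 'brand' and 'new' never appeared in the corpus, so they receive no weight
# ('a' is also dropped: the default token pattern ignores single-character tokens)
vec = vectorizer.transform(['This is a brand new document.']).todense()
print(vec)  # only the 'document', 'is' and 'this' columns are non-zero
```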