【346】TF-IDF

Ref: 文本挖掘预处理之向量化与Hash Trick

Ref: 文本挖掘预处理之TF-IDF

Ref: sklearn.feature_extraction.text.CountVectorizer

Ref: TF-IDF与余弦相似性的应用(一):自动提取关键词

Ref: TF-IDF与余弦相似性的应用(二):找出相似文章

Ref: TF-IDF与余弦相似性的应用(三):自动摘要

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus=["I come to China to travel", 
    "This is a car polupar in China",          
    "I love tea and Apple ",   
    "The work is to write some papers in science"]
>>> vectorizer=CountVectorizer()
>>> transformer = TfidfTransformer()
>>> tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
>>> print(tfidf)
  (0, 16)	0.4424621378947393
  (0, 15)	0.697684463383976
  (0, 4)	0.4424621378947393
  (0, 3)	0.348842231691988
  (1, 14)	0.45338639737285463
  (1, 9)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 5)	0.3574550433419527
  (1, 3)	0.3574550433419527
  (1, 2)	0.45338639737285463
  (2, 12)	0.5
  (2, 7)	0.5
  (2, 1)	0.5
  (2, 0)	0.5
  (3, 18)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 15)	0.2811316284405006
  (3, 13)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 10)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 6)	0.2811316284405006
  (3, 5)	0.2811316284405006
>>> print(vectorizer.get_feature_names())
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']

Note: an entry such as (0, 16) means document 0 (the first sentence) and the term at index 16, which is "travel"; the other entries follow the same pattern.
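The same result can usually be obtained in one step with TfidfVectorizer, which bundles CountVectorizer and TfidfTransformer. A minimal sketch with default parameters (nothing here comes from the original post, it is just the standard shortcut):

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer = CountVectorizer + TfidfTransformer in a single estimator
tfidf_vec = TfidfVectorizer()             # defaults: smooth_idf=True, norm='l2'
tfidf2 = tfidf_vec.fit_transform(corpus)  # same (4, 19) sparse matrix as above
print(tfidf2.toarray())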

Continuing from the output above, we can pull out the TF-IDF value of each term. The tfidf variable holds a (4, 19) matrix: one row per sentence, one column per term.

>>> tfidf_array = tfidf.toarray()    # convert the sparse matrix to a dense array, then walk it row by row
>>> names_list = vectorizer.get_feature_names()    # list of feature names
>>> for i in range(0, len(corpus)):
	print(corpus[i],'\n')
	tmp_list = tfidf_array[i].tolist()
	for j in range(0, len(names_list)):
		if tmp_list[j] != 0:
			if len(names_list[j])>=7:
				print(names_list[j],'\t',tmp_list[j])
			else:
				print(names_list[j],'\t\t',tmp_list[j])
	print('')

	
I come to China to travel 

china 		 0.348842231691988
come 		 0.4424621378947393
to 		 0.697684463383976
travel 		 0.4424621378947393

This is a car polupar in China 

car 		 0.45338639737285463
china 		 0.3574550433419527
in 		 0.3574550433419527
is 		 0.3574550433419527
polupar 	 0.45338639737285463
this 		 0.45338639737285463

I love tea and Apple  

and 		 0.5
apple 		 0.5
love 		 0.5
tea 		 0.5

The work is to write some papers in science 

in 		 0.2811316284405006
is 		 0.2811316284405006
papers 		 0.3565798233381452
science 	 0.3565798233381452
some 		 0.3565798233381452
the 		 0.3565798233381452
to 		 0.2811316284405006
work 		 0.3565798233381452
write 		 0.3565798233381452

>>> 
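To make the numbers above less magical: by default TfidfTransformer uses the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. A minimal sketch that recomputes the first sentence's weights by hand (the counts and document frequencies below are read off the outputs above):

import math

n = 4  # number of documents in the corpus
# term: (count in sentence 0, number of documents containing the term)
terms = {'china': (1, 2), 'come': (1, 1), 'to': (2, 2), 'travel': (1, 1)}

# smoothed idf, as used by TfidfTransformer(smooth_idf=True) by default
idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, (_, df) in terms.items()}
raw = {t: tf * idf[t] for t, (tf, _) in terms.items()}

# L2-normalize the row (norm='l2' is the default)
norm = math.sqrt(sum(v * v for v in raw.values()))
print({t: round(v / norm, 6) for t, v in raw.items()})
# expected: china 0.348842, come 0.442462, to 0.697684, travel 0.442462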

Getting TF (Term Frequency)

>>> X = vectorizer.fit_transform(corpus)
>>> X.toarray()
array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
       [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]],
      dtype=int64)
>>> vector_array = X.toarray()
>>> for i in range(0, len(corpus)):
	print(corpus[i],'\n')
	tmp_list = vector_array[i].tolist()
	for j in range(0, len(names_list)):
		if tmp_list[j] != 0:
			if len(names_list[j])>=7:
				print(names_list[j],'\t',tmp_list[j])
			else:
				print(names_list[j],'\t\t',tmp_list[j])
	print('')

I come to China to travel 

china 		 1
come 		 1
to 		 2
travel 		 1

This is a car polupar in China 

car 		 1
china 		 1
in 		 1
is 		 1
polupar 	 1
this 		 1

I love tea and Apple  

and 		 1
apple 		 1
love 		 1
tea 		 1

The work is to write some papers in science 

in 		 1
is 		 1
papers 		 1
science 	 1
some 		 1
the 		 1
to 		 1
work 		 1
write 		 1

>>> 
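If a tabular view is easier to read, the raw counts and the TF-IDF weights can be laid out side by side with pandas (pandas is not used in the original post; this is only a convenience sketch reusing X, tfidf and names_list from the session above):

import pandas as pd

tf_df = pd.DataFrame(X.toarray(), columns=names_list)          # raw term counts
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=names_list)   # TF-IDF weights
print(tf_df)
print(tfidf_df.round(4))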
