Python article classification TF-IDF example: sharing a small tool I use myself, a Python implementation of TF-IDF document similarity


weixin_39638708


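About this post: published by 方法SEO顾问 on 2016-03-11 12:53:11, 3402 characters in total. When reposting, please credit: Sharing a small tool I use myself: a Python implementation of TF-IDF document similarity_【方法SEO顾问】. If the articles on my site have helped you, please leave a good review on Baidu Koubei (百度口碑)!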

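First of all, thanks to the 方法 sharing platform, haha.

Let me start with the approach.

1. I pulled the titles of some articles from my database.

2. I segment each title into words with jieba.

3. I use some third-party libraries to compute the term-frequency vectors (in which the calculation…

4. For every pair of documents, I compute the cosine similarity of their term-frequency vectors, using the following formula:

5. Based on manual inspection of the computed results, I set a threshold that serves as the parameter for similarity-based recommendations.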
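For step 4, the standard cosine similarity of two term-weight vectors A and B of length n can be written as follows. This is the textbook definition, and it matches what the script's CosValues function accumulates term by term (Numb for the numerator, Agen and Bgen for the two sums of squares):

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Because TF-IDF weights are never negative, the value stays between 0 (no shared terms) and 1 (same direction).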

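Libraries you need to install:

sklearn, jieba, simplejson, plus a Traditional-to-Simplified Chinese conversion package. You can also tweak the code so that this last package is not needed.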

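Test results:

1. Two identical titles give the maximum value, 1.0.

2. Two completely different titles give the minimum value, 0.0.

3. 150 titles: computed in 0.0xxx seconds, which is acceptable.

4. 100,000 titles: about 40 seconds, which is very slow. If I produce an optimized version later I will post it, since picking out similar articles from a large collection is the real requirement.

5. In my experience, a score above 0.5 already indicates fairly similar titles.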
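The first two results follow directly from the definition of cosine similarity. Here is a minimal, standard-library-only sketch that reproduces them on a few made-up, already-segmented titles. The function names and sample titles are hypothetical, and the weighting formula is assumed to mirror scikit-learn's default TfidfTransformer settings (smoothed idf, L2 normalisation); it is an illustration, not the script's own code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed idf, ln((1 + n) / (1 + df)) + 1 (assumed to match
        # scikit-learn's default TfidfTransformer settings).
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-segmented titles.
titles = [
    ["python", "tf-idf", "相似性"],
    ["python", "tf-idf", "相似性"],
    ["jieba", "分词", "教程"],
    ["python", "分词", "教程"],
]
vecs = tfidf_vectors(titles)
print(round(cosine(vecs[0], vecs[1]), 6))  # identical titles
print(cosine(vecs[0], vecs[2]))            # no shared terms
print(cosine(vecs[2], vecs[3]))            # partial overlap
```

Titles that share only some of their words land strictly between the two extremes, which is where a threshold such as 0.5 comes into play.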

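Notes on the script package:

1. My site is in Traditional Chinese, and jieba could not segment the titles taken from it, so I convert them to Simplified Chinese before segmenting. This requires an extra conversion package.

2. I originally wrote this for my own use, so it shuffles data through JSON in the middle. That now seems unnecessary; feel free to change it.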
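As one way to act on note 2, the intermediate JSON file could be skipped by tokenizing in memory. The sketch below is only an illustration under assumptions: `load_tokenized_titles` and its `tokenize` parameter are hypothetical names, the input is assumed to be one JSON object per line with a "title" field, and a plain whitespace splitter stands in for jieba so that the example runs without it.

```python
import io
import json


def load_tokenized_titles(lines, tokenize):
    """Turn JSON-lines records into space-joined token strings, in memory.

    `lines` is any iterable of JSON strings with a "title" field;
    `tokenize` is a callable returning an iterable of tokens
    (for example `lambda s: jieba.cut(s, cut_all=True)`).
    """
    skip = {"", "\n", "\n\n", "_", ",", "|"}  # same tokens the script drops
    docs = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        title = json.loads(line)["title"]
        tokens = [tok for tok in tokenize(title) if tok not in skip]
        docs.append(" ".join(tokens))
    return docs


# Stand-in input and tokenizer so the sketch runs without jieba installed.
sample = io.StringIO(
    '{"id": 1, "title": "python tf-idf demo"}\n'
    '{"id": 2, "title": "jieba | cut demo"}\n'
)
docs = load_tokenized_titles(sample, tokenize=lambda s: s.split(" "))
print(docs)
```

The returned list has the same shape as what ChangeJsonIntoList produces, so it could be handed to the TF-IDF step directly.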

The code is below. I think it can be extended a little and used directly.

#!/usr/bin/env python3
# daoxin 2016-3

import json
import math
import sys
import time

import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Traditional-to-Simplified Chinese conversion module; download it from git.
# A Python 3 compatible copy of zhtools is assumed here.
sys.path.append('/Users/movespeed/Desktop/Python/fanti_jianti')  # change this path to your own
from zhtools.langconv import Converter
# End of conversion module.

# Change these file paths to your own.
TITLE_FILE = '/Users/movespeed/Desktop/Python/title.json'
JIEBA_FILE = '/Users/movespeed/Desktop/Python/jiebafile.json'

# Tokens dropped from the jieba output.
SKIP_TOKENS = {'', '\n', '\n\n', '_', ',', '|'}


def cutWord():
    """Segment every title and save the results to a JSON-lines file."""
    with open(TITLE_FILE, encoding='utf-8') as f, \
            open(JIEBA_FILE, 'w', encoding='utf-8') as jiebafile:
        for line in f:
            record = json.loads(line)
            # Convert to Simplified Chinese first, because jieba could not
            # segment the Traditional Chinese titles.
            title = Converter('zh-hans').convert(record['title'])
            seg_list = jieba.cut(title, cut_all=True)  # jieba full-mode segmentation
            result = [seg for seg in seg_list if seg not in SKIP_TOKENS]
            jsoninfo = json.dumps({"id": record['id'],
                                   "title": title,
                                   "cut_word": result,
                                   "tf_idf": None})
            jiebafile.write(jsoninfo + '\n')


def ChangeJsonIntoList(filelist):
    """Turn each JSON line into one string of words separated by single spaces."""
    vectorList = []
    for doc in filelist:
        vectorList.append(' '.join(json.loads(doc)['cut_word']))
    return vectorList


def Tf_Idf(vectorList):
    """Return the TF-IDF weights as a dense array, one row per document."""
    # CountVectorizer's default token pattern keeps only tokens of two or
    # more characters, so single-character words are ignored.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(vectorList)
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(counts)
    return tfidf.toarray()


def CosValues(data1, data2):
    """Print the cosine similarity of documents data1 and data2 (row numbers)."""
    tfidf__ = Tf_Idf(vectorList)
    # print(tfidf__)      # TF-IDF matrix for all documents
    # print(tfidf__[16])  # vector of a single document
    Numb = 0
    Agen = 0
    Bgen = 0
    for v, k in zip(tfidf__[data1], tfidf__[data2]):
        Numb += v * k
        Agen += v ** 2
        Bgen += k ** 2
    print('Cosine similarity:', Numb / (math.sqrt(Agen) * math.sqrt(Bgen)))


start = time.perf_counter()

# cutWord()  # run the segmenter first to generate the segmented JSON file

with open(JIEBA_FILE, encoding='utf-8') as filelist:
    vectorList = ChangeJsonIntoList(filelist)

print(vectorList[17])  # eyeball document A
print('---------')
print(vectorList[16])  # eyeball document B

CosValues(17, 16)  # similarity of two documents; the arguments are their row numbers

end = time.perf_counter()
print('Elapsed:', end - start, 'seconds')

# Test results:
#   maximum: 1.0
#   minimum: 0.0
#   150 documents: 0.0xx seconds
#   100,000 documents: 40 seconds

Zip package download. It contains test data; tweak it a little and you can run your own tests.
