Implementing tf-idf by hand

Latest recommended article published on 2021-06-29 13:52:29

weixin_34174422

Tags: python, artificial intelligence

Original link: https://my.oschina.net/wangzonghui/blog/2991118

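During a research project I used sklearn to compute tf-idf. When I wanted to use tf-idf as the weights for word2vec vectors, sklearn turned out to be awkward to work with (perhaps I just did not understand it well enough), so on impulse I implemented tf-idf by hand.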

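Theory: tf (term frequency) is the number of times a word occurs in a sentence or document. The code below divides that count by the document's total word count.

idf (inverse document frequency) is the total number of documents divided by the number of documents containing the word plus 1, followed by taking the logarithm. The +1 is optional; it guards against a document count of 0.

tf-idf: tf * idf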
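To make the definitions concrete, here is a minimal sketch on a made-up three-document corpus. The corpus, the variable names, and the helper `tf_idf` are illustrative assumptions and not part of the original code. It follows the textual definition above: tf is normalized by document length, and the document count includes every document containing the word, plus 1.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
]

def tf_idf(term, doc, docs):
    """tf-idf of `term` in `doc`, using idf = log(N / (df + 1))."""
    tf = doc.count(term) / len(doc)           # share of the document's tokens
    df = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / (df + 1))      # +1 smoothing as described above
    return tf * idf

score_apple = tf_idf("apple", docs[0], docs)    # df = 1, so idf = log(3/2)
score_banana = tf_idf("banana", docs[0], docs)  # df = 2, so idf = log(3/3) = 0
print(score_apple, score_banana)
```

With this smoothing, a word that appears in every document gets a slightly negative idf, because the denominator N + 1 exceeds N. Whether that matters depends on how the scores are used downstream.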

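Use cases: tf-idf is mainly a weighting technique for information retrieval and data mining, and it can be used to compare the similarity of two documents.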
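The article does not say which similarity measure it has in mind. One common choice is cosine similarity between the two documents' tf-idf vectors, sketched below. The function name `cosine_similarity` and the weights in `doc_a` and `doc_b` are made up for illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    shared = set(vec_a) & set(vec_b)          # only shared words contribute to the dot product
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                            # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weights for two documents.
doc_a = {"python": 0.5, "tfidf": 0.3}
doc_b = {"python": 0.5, "word2vec": 0.4}
similarity = cosine_similarity(doc_a, doc_b)
print(similarity)
```

Storing each document as a dict keeps the vectors sparse, which matches the `{word: value}` shape the functions in this article return.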

The core code is as follows:

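import math


def dealIdf(list_context):
    '''
    Count, for each word of each document, how many OTHER documents contain it.
    list_context: [[tokens of document], [tokens of document], ...]
    Returns one dict {word: number of other documents containing it} per document.
    '''
    # Build each document's vocabulary once so membership tests are O(1).
    doc_sets = [set(row) for row in list_context]

    result_idf = []
    for index, words in enumerate(doc_sets):
        line = {}
        for word in words:
            count = 0
            for other_index, other in enumerate(doc_sets):
                # Skip by position, not by equality, so that a duplicate
                # document still counts as another document.
                if other_index == index:
                    continue
                if word in other:
                    count += 1
            line[word] = count
        result_idf.append(line)

    return result_idf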
            
def dealTf(list_context):
    '''
    Compute tf (term frequency) for each document.
    list_context: [[tokens of document], [tokens of document], ...]
    Returns one dict {word: occurrences / total words in the document} per document.
    '''
    result_tf = []
    for row in list_context:
        size = len(row)  # total number of words in the document
        counts = {}
        for word in row:
            counts[word] = counts.get(word, 0) + 1
        # Normalize raw counts by document length; an empty document yields {}.
        line = {word: count / size for word, count in counts.items()}
        result_tf.append(line)
    return result_tf
	

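def tfIdf(leng, line, row_idf, row_tf):
    '''
    leng: total number of documents
    line: list of words in the current document
    row_idf: dealIdf result for the current document
             {word: number of other documents containing it}
    row_tf: dealTf result for the current document {word: tf}
    Returns the tf-idf of each word in line, in the same order.
    '''
    result_tfIdf = []
    for token in line:
        # row_idf excludes the current document, so +1 adds it back and
        # keeps the denominator from ever being 0.
        idf = math.log(leng / (row_idf[token] + 1))
        result_tfIdf.append(row_tf[token] * idf)

    return result_tfIdf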
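The motivation mentioned at the start, using tf-idf as weights for word2vec vectors, could look roughly like the sketch below. The original article does not show this step, so everything here is an assumption: `weighted_sentence_vector` is an illustrative name, and the hard-coded 2-d vectors stand in for lookups in a trained word2vec model.

```python
def weighted_sentence_vector(tokens, weights, vectors):
    """Average word vectors, weighting each token by its tf-idf score.

    Assumes `vectors` is a non-empty {word: list of floats} mapping.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token, weight in zip(tokens, weights):
        if token not in vectors:
            continue  # skip out-of-vocabulary words
        for i, value in enumerate(vectors[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0:
        return total  # nothing to normalize by; return the unnormalized sum
    return [value / weight_sum for value in total]

# Hypothetical 2-d embeddings standing in for a trained word2vec model.
vectors = {"apple": [1.0, 0.0], "banana": [0.0, 1.0]}
tokens = ["apple", "banana", "kiwi"]
weights = [0.75, 0.25, 0.5]  # e.g. a list shaped like the one tfIdf returns
sentence_vec = weighted_sentence_vector(tokens, weights, vectors)
print(sentence_vec)
```

Because `tfIdf` returns one score per token in the same order as the input list, its result can be zipped directly with the tokens, which is what this sketch relies on.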

Adapt the code to your own use case as needed.

Reposted from: https://my.oschina.net/wangzonghui/blog/2991118