Principles of TF-IDF

Reposted from: https://blog.csdn.net/sun_brother/article/details/80360112

Introduction to TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method widely used in NLP to evaluate how important a word is to a particular document within a document collection or corpus; it is commonly used to extract text features, i.e. keywords. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. As a formula: tfidf = tf * idf.

To summarize: the more often a word appears in one document, and the less often it appears in all the other documents, the better that word represents the content of that document.

For example, a function word such as '的' ('of' in Chinese) appears many times in every document and is specific to none of them, so its importance, and hence its TF-IDF value, is small. In a collection made up of sports news and entertainment news, however, 'basketball' will appear many times in articles about basketball and rarely in entertainment articles, so it represents the content of its documents well and receives a relatively large TF-IDF value.
Applications
Various forms of TF-IDF weighting are commonly used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, web search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.
TF: Term Frequency
Term frequency is the number of times a given term appears in a document. To keep the measure from being biased toward long documents (the same term is likely to appear more times in a long document than in a short one, regardless of its importance), the raw count is normalized by the document's total number of terms:

TF = (number of occurrences of the term in the document) / (total number of terms in the document)
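As a quick illustration of this normalized count, here is a minimal sketch (the toy document is made up for illustration and is not part of the corpus used later):

```python
from collections import Counter

def term_frequency(word, tokens):
    # TF = occurrences of the word in the document / total tokens in the document
    counts = Counter(tokens)
    return counts[word] / len(tokens)

doc = ["basketball", "is", "a", "team", "sport", "basketball"]
print(term_frequency("basketball", doc))  # 2 occurrences out of 6 tokens ≈ 0.3333
```

A word that never appears in the document simply gets a TF of 0, since `Counter` returns 0 for missing keys.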

IDF: Inverse Document Frequency
Some common words, such as '的', appear in large numbers in every document; the TF formula gives them large weights, yet they say nothing about a document's topic. What we want are words that appear often in one document but rarely in the others, since only those words reflect the document's topic. TF alone clearly cannot capture this, but inverse document frequency can.
The idea behind IDF: the fewer documents contain a term, the larger its IDF, meaning the term discriminates well between categories. The IDF of a term is computed as:

IDF = log( (total number of documents in the corpus) / (number of documents containing the term + 1) )

where the +1 in the denominator guards against division by zero when no document contains the term.
A high frequency of a term within a particular document, combined with a low document frequency of that term across the whole collection, yields a high TF-IDF value. TF-IDF therefore tends to filter out common terms and keep important ones:

TF-IDF = TF * IDF
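Putting the two parts together, here is a minimal worked sketch on a made-up three-document corpus. It uses the natural log and the +1 denominator smoothing described above (note that the full script below uses log base 2 without the +1):

```python
import math
from collections import Counter

docs = [
    ["football", "is", "a", "team", "sport"],
    ["basketball", "is", "a", "team", "sport"],
    ["volleyball", "is", "a", "team", "sport"],
]
counts = [Counter(d) for d in docs]

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, counts):
    containing = sum(1 for c in counts if word in c)
    return math.log(len(counts) / (containing + 1))

def tfidf(word, count, counts):
    return tf(word, count) * idf(word, counts)

# "football" appears in only one document while "team" appears in all three,
# so "football" scores much higher in document 0
print(tfidf("football", counts[0], counts))  # (1/5) * log(3/2) ≈ 0.0811
print(tfidf("team", counts[0], counts))      # (1/5) * log(3/4) < 0
```

Note that with the +1 smoothing, a term appearing in every document gets a negative score, which pushes such common terms to the bottom of the ranking.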

 

The first part of the script below implements TF-IDF by hand; the second part does the same with gensim.

Reposted from: https://www.cnblogs.com/jclian91/p/9895410.html

import nltk
import math
import string
from nltk.corpus import stopwords     # stopword list
from collections import Counter       # word counting
from gensim import corpora, models

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Split the text into sentences and words, and strip punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # strip punctuation
                tokens.append(word)
    return tokens

# Remove stopwords from the raw text and build a count dictionary,
# i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]    # remove stopwords
    count = Counter(filtered)
    return count


# Compute TF
def tf(word, count):
    return count[word] / sum(count.values())

# Count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Compute IDF (log base 2; no +1 smoothing here, since every word being
# scored comes from some document's count and so appears at least once)
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))

# Compute TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# L2-normalize the (word, score) vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector





# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}

    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    # list of (word, score) pairs
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# get the filtered word list of each text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    # list of (token id, score) pairs
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

 
