The TF-IDF Algorithm

Latest recommended article published on 2024-06-15 19:03:40

月亮&&六便士

Article link: https://blog.csdn.net/therain123/article/details/126432922

Three models are commonly used to extract text features: the bag-of-words model, TF-IDF, and word2vec.

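Bag-of-words model: essentially similar to one-hot encoding. The words of all documents are put into one bag to form the feature set. For a given document or sentence, a feature takes the value 1 if the corresponding word appears and 0 otherwise. The model can also record how many times the word appears instead of only flagging whether it appears.
TF-IDF: TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for analyzing keywords. It evaluates how important a word is to one document within a document collection or corpus. A word's importance increases with the number of times it appears in the document and decreases with the number of documents in the corpus that contain it. This weighting keeps common words from crowding out keywords and improves the relevance of the extracted keywords to the article.

word2vec: an application of neural networks (the network itself is shallow rather than deep). It maps each word to a vector in a high-dimensional space so that words with similar meaning or part of speech are mapped to nearby vectors.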
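
To make the bag-of-words description concrete, here is a minimal sketch in plain Python. The two sample sentences and the variable names are made up for illustration; it builds the shared vocabulary and then produces both the 0/1 presence vectors and the count vectors mentioned above.

```python
from collections import Counter

# Two toy documents (illustrative only)
docs = ["graph minors survey", "graph trees graph"]
tokenized = [d.split() for d in docs]

# The "bag": every distinct word across all documents, in a fixed order
vocab = sorted({token for doc in tokenized for token in doc})

# Count variant: how many times each vocabulary word occurs in each document
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Binary variant: 1 if the word occurs at all, otherwise 0
binary = [[int(c > 0) for c in row] for row in counts]

print(vocab)
print(binary)
print(counts)
```

The two variants differ only where a word repeats inside a document, as "graph" does in the second sentence.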

I. Principle

TF-IDF stands for term frequency-inverse document frequency and consists of two parts: TF (term frequency) and IDF (inverse document frequency).

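TF (term frequency) measures how often a given term appears in a document, normalized by the document's length. It is computed as: TF = (number of times the term appears in the document) / (total number of terms in the document)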

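IDF (inverse document frequency) reduces the influence of terms that are common across all documents but say little about any one of them. IDF reflects how widely a term is spread across the corpus: a term that appears in many documents should have a low IDF, and a term that appears in few documents should have a high IDF. The fewer the documents that contain a term, the larger its IDF and the stronger its ability to distinguish documents. IDF is computed as:
IDF = ln(total number of documents in the corpus / (number of documents containing the term + 1))
The +1 keeps the denominator from being 0 when no document contains the term.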

TF-IDF = TF × IDF
The larger the TF-IDF value, the more important the term is to the document.
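
The formulas above can be sketched directly in plain Python. This is only an illustration under the definitions given in this section (natural logarithm, +1 in the denominator); the function name tf_idf and the toy token lists are assumptions, not part of any library.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        result.append({
            # TF = count / total terms, IDF = ln(N / (df + 1))
            term: (count / total) * math.log(n_docs / (df[term] + 1))
            for term, count in counts.items()
        })
    return result

# Toy corpus (illustrative only)
docs = [
    ["graph", "minors", "survey"],
    ["graph", "trees"],
    ["human", "computer", "interface"],
    ["user", "system", "graph"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "graph" occurs in three of the four documents, so its IDF is ln(4/4) = 0 and it contributes nothing, while "minors" occurs in only one document and receives a positive weight. Note that with this particular smoothing a term occurring in every document gets a slightly negative IDF; libraries generally use other variants of the formula.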

II. Implementation

Implementation with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
corpus =["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]
vectorizer = TfidfVectorizer()
tdm = vectorizer.fit_transform(corpus)
space = vectorizer.vocabulary_
print(space)
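
The numbers sklearn produces will not match the formula from section I exactly. With its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer uses IDF = ln((1 + N) / (1 + df)) + 1 and then scales each document row to unit Euclidean length. The sketch below checks both points on a small made-up corpus; the sample sentences and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only)
docs = ["graph minors survey", "graph trees", "human computer interface"]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
X = vec.fit_transform(docs).toarray()

n = len(docs)
df = 2  # "graph" appears in two of the three documents
expected_idf = np.log((1 + n) / (1 + df)) + 1

col = vec.vocabulary_["graph"]  # column index of the term "graph"
print(vec.idf_[col], expected_idf)

# Every row should have Euclidean length 1 after L2 normalization
print(np.linalg.norm(X, axis=1))
```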

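Parameters and methods:

vocabulary_: a mapping from each feature (term) to its column index in the TF-IDF matrix. Printing it, as in the example above, shows which column of the matrix corresponds to each term.

stop_words: the stop-word set. With 'english', the words defined in ENGLISH_STOP_WORDS are ignored; with a list, the words in that list are ignored.

max_df: terms whose document frequency (df) exceeds this upper limit are ignored. A float between 0 and 1 is read as a proportion of documents; an int is read as an absolute document count.

get_feature_names(): returns the list of features (terms), ordered by column index. This method was deprecated in scikit-learn 1.0 and removed in 1.2; newer versions provide get_feature_names_out() instead.

fit: learns the vocabulary and the IDF values from the data.

transform: converts the data into a TF-IDF matrix (a sparse matrix) using the vocabulary and IDF values already learned.

fit_transform: learns from the data and converts it into matrix form in one call; equivalent to fit followed by transform.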

Implementation with gensim

# -*- coding: utf-8 -*-
from gensim import corpora, models, similarities
import logging
from collections import defaultdict
 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
# Documents
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
 
# 1. Tokenize and remove stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print('-----------1----------')
print(texts)
# [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
# ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
# ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'],
# ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]
 
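# 2. Count term frequencies
frequency = defaultdict(int)  # maps each token to its count, defaulting to 0
# Walk through the tokenized texts and count how often each token occurs in the whole corpus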
for text in texts:
    for token in text:
        frequency[token] += 1
# Keep only tokens that occur more than once in the corpus
texts = [[token for token in text if frequency[token] > 1] for text in texts]
print('-----------2----------')
print(texts)
# [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system',
# 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
 
# 3. Build the dictionary (a mapping between words and their ids)
dictionary = corpora.Dictionary(texts)
# print(dictionary)
# Dictionary(12 unique tokens: ['time', 'computer', 'graph', 'minors', 'trees']...)
# Print the dictionary: each key is a word and each value is that word's id
print('-----------3----------')
print(dictionary.token2id)
# {'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
 
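# 4. Convert the documents to compare into vectors (bag-of-words representation)
# Documents to compare
test_doc = ["Human computer interaction",'A survey of user opinion of computer system response time']
# Tokenize each document; doc2bow counts each distinct word, maps it to its id, and returns a sparse vector (words not in the dictionary are dropped)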
new_vecs = [dictionary.doc2bow(doc.lower().split()) for doc in test_doc]
print('-----------4----------')
# print(new_vecs)
# [[(0, 1), (2, 1)], ...]
 
# 5. Build the corpus
# Convert every document into a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]
print('-----------5----------')
print(corpus)
# [[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
 
# 6. Initialize the model
# Create a TF-IDF model; it converts vectors from bag-of-words integer counts to TF-IDF real-valued weights
tfidf = models.TfidfModel(corpus)
# Test
test_doc_bow = [(0, 1), (1, 1)]
print('-----------6----------')
print(tfidf[test_doc_bow])
# [(0, 0.7071067811865476), (1, 0.7071067811865476)]
 
print('-----------7----------')
# Convert the whole corpus to its TF-IDF representation
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
 
# 7. Build the similarity index
index = similarities.MatrixSimilarity(corpus_tfidf)
 
print('-----------8----------')
# 8. Compute similarities
new_vec_tfidf_ls = [tfidf[new_vec] for new_vec in new_vecs]  # convert the documents to compare into TF-IDF vectors
 
print('-----------9----------')
# Compute the similarity between each document to compare and every document in the corpus
sims = [index[new_vec_tfidf] for new_vec_tfidf in new_vec_tfidf_ls]
print(sims)
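
To show what steps 6 to 8 compute, here is a plain-Python sketch that does not use gensim. It assumes gensim's default TfidfModel settings (raw term count as TF, log base 2 of N/df as IDF, each vector scaled to unit length) and the cosine similarity that MatrixSimilarity returns. The helper names tfidf_vector and cosine are made up for this illustration, and the document frequencies are read off the corpus printed in step 5.

```python
import math

def tfidf_vector(bow, doc_freq, n_docs):
    """Weight a bag-of-words vector [(term_id, count), ...] and scale it to unit length."""
    weighted = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as [(term_id, weight), ...]."""
    dv = dict(v)
    dot = sum(w * dv.get(tid, 0.0) for tid, w in u)
    norm_u = math.sqrt(sum(w * w for _, w in u))
    norm_v = math.sqrt(sum(w * w for _, w in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# 'human' (id 0), 'interface' (id 1) and 'computer' (id 2) each occur
# in 2 of the 9 corpus documents, per the output of step 5.
doc_freq = {0: 2, 1: 2, 2: 2}
n_docs = 9

query = tfidf_vector([(0, 1), (1, 1)], doc_freq, n_docs)             # test vector from step 6
first_doc = tfidf_vector([(0, 1), (1, 1), (2, 1)], doc_freq, n_docs) # first corpus document

print(query)
print(cosine(query, first_doc))
```

Because both terms in the test vector have the same document frequency, they end up with equal weights after normalization, which is why step 6 prints two identical values.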