gensim Study Notes (2) - Topics and Transformations (TF-IDF, LSI)

Johnson0722

Category: NLP. Tags: natural language processing, LSI, tf-idf, gensim

Article link: https://blog.csdn.net/John_xyz/article/details/54744413

Topics and Transformations

Configure logging

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

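In the previous tutorial we used the bag-of-words model to represent documents as vectors. This section discusses transformations between vector spaces. First, load the data that was saved to disk.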

>>> import os
>>> from gensim import corpora, models, similarities
>>> if os.path.exists("/tmp/deerwester.dict"):
...     dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
...     corpus = corpora.MmCorpus('/tmp/deerwester.mm')
...     print("Used files generated from first tutorial")
... else:
...     print("Please run first tutorial to generate data set")

TF-IDF Transformation

Transformations are standard Python objects, usually initialized by training on a corpus.

>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

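Different transformations take different parameters. For TF-IDF, training simply means going through the training corpus once and computing the document frequency of each feature. Let us look at an example first, and then explain how the TF-IDF model transforms a vector.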

>>> doc_bow = [(0, 1), (1, 1)]
>>> print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)]

Why do we get this result? First, let us look at the training corpus:

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

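TF-IDF is computed as follows.

Term frequency (tf):
Within a given document, the term frequency (tf) is the frequency with which a given term occurs in that document. It is the term count normalized by document length, so that it is not biased toward long documents. (The same term is likely to have a higher count in a long document than in a short one, regardless of how important it is.) For a term $t_{i}$ in a particular document, its importance can be expressed as:
$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
where $n_{i,j}$ is the number of times the term occurs in document $d_{j}$, and the denominator is the total number of occurrences of all terms in document $d_{j}$.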

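Inverse document frequency (idf) measures how informative a term is across the whole corpus. The idf of a term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the base-10 logarithm of the quotient:
$\mathrm{idf}_{i} = \lg \frac{|D|}{|\{j : t_{i} \in d_{j}\}|}$
$|D|$: the total number of documents in the corpus.
$|\{j : t_{i} \in d_{j}\}|$: the number of documents containing the term $t_{i}$ (that is, the number of documents with $n_{i,j} \neq 0$). If the term does not occur in the corpus at all, the denominator is zero, so $1 + |\{j : t_{i} \in d_{j}\}|$ is commonly used instead.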

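TF-IDF:
$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$
A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, produces a high tf-idf weight. tf-idf therefore tends to filter out common terms and keep the important ones.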

Note that in the tf-idf computation, tf is local (computed per document), while idf is global (computed over the whole corpus).
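To make the formulas concrete, here is a minimal sketch that applies them to the nine-document corpus listed earlier. The function names `term_frequency` and `inverse_document_frequency` are made up for this illustration and are not part of gensim; the idf helper assumes the term occurs in at least one document.

```python
import math

# The nine bag-of-words documents listed above, as (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

def term_frequency(doc, term_id):
    """tf: count of the term divided by the total term count of the document."""
    total = sum(count for _, count in doc)
    return dict(doc).get(term_id, 0) / total

def inverse_document_frequency(corpus, term_id):
    """idf: log10(number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document (no +1 smoothing here).
    """
    containing = sum(1 for doc in corpus if term_id in dict(doc))
    return math.log10(len(corpus) / containing)

doc = corpus[3]                                # [(1, 1), (5, 2), (8, 1)]
tf = term_frequency(doc, 5)                    # local: 2 occurrences out of 4
idf = inverse_document_frequency(corpus, 5)    # global: term 5 is in 3 of 9 documents
tfidf = tf * idf
print(round(tf, 4), round(idf, 4), round(tfidf, 4))   # 0.5 0.4771 0.2386
```

Changing any other document in the corpus can change `idf` but never `tf`, which is exactly the local/global distinction.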

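Rationale and Limitations of TF-IDF

The tf-idf algorithm rests on one assumption: the words that best distinguish a document are those that occur frequently in it but rarely in the other documents of the collection. Using the term frequency tf as the coordinate in feature space therefore captures what texts of the same kind have in common. To account for how well a word separates categories, tf-idf further assumes that the fewer documents a word appears in, the better it discriminates between categories. This is why the inverse document frequency idf is introduced: the product of tf and idf is used as the coordinate value, so idf adjusts the tf weight in order to emphasize important words and suppress minor ones. In essence, however, idf is a weighting that tries to suppress noise, and it simply assumes that words with a low document frequency are important and words with a high document frequency are useless, which is clearly not always true. The simple structure of idf cannot faithfully reflect the importance of a word or the distribution of feature terms, so it adjusts the weights only imperfectly, and the precision of tf-idf is limited.

In addition, tf-idf does not capture positional information. For Web documents, the weighting should reflect the HTML structure: a feature term reflects the content of the page to a different degree depending on the tag it appears in, so its weight should be computed differently. Feature terms in different parts of a page should therefore be given different coefficients, which are then multiplied by the term frequency to improve the text representation.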

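With this formula in mind, the result is easy to understand. TF-IDF has many variants, but the idea is always the same: if a word or phrase has a high frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories and is suitable for classification. Note that gensim's TfidfModel, with its default settings, uses the raw term count as tf and a base-2 logarithm for idf, and then normalizes each vector to unit length. In the example, terms 0 and 1 each appear in 2 of the 9 documents, so they get equal weights, which become $1/\sqrt{2} \approx 0.7071$ after normalization.
From now on, tfidf can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TF-IDF real-valued weights).
We can also apply the tfidf transformation to the whole corpus.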

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
...     print(doc)
[(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]
[(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]
[(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.70710678118654746), (10, 0.70710678118654746)]
[(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]
[(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]

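In this example we transformed the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized (that is, trained), it can be used on any vector, provided the vector comes from the same vector space.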
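The weights printed above can be reproduced by hand. The sketch below assumes gensim's default TF-IDF scheme as I understand it (raw term count times log2 of the inverse document frequency, followed by L2 normalization), which differs from the textbook formula with normalized tf and a base-10 logarithm. The helper names `document_frequencies` and `tfidf_vector` are hypothetical, and this is an illustration rather than gensim's actual implementation.

```python
import math

corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

def document_frequencies(corpus):
    """'Training': count in how many documents each term id appears."""
    dfs = {}
    for doc in corpus:
        for term_id, _ in doc:
            dfs[term_id] = dfs.get(term_id, 0) + 1
    return dfs

def tfidf_vector(doc, dfs, num_docs):
    """Weight = count * log2(num_docs / df), then scale the vector to unit length."""
    weights = [(term_id, count * math.log2(num_docs / dfs[term_id]))
               for term_id, count in doc if term_id in dfs]   # skip unseen terms
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(term_id, w / norm) for term_id, w in weights]

dfs = document_frequencies(corpus)
print(tfidf_vector([(0, 1), (1, 1)], dfs, len(corpus)))   # both weights ~0.7071
print(tfidf_vector(corpus[3], dfs, len(corpus)))          # ~0.4918, ~0.7185, ~0.4918
```

Under this assumption the fourth document comes out as roughly 0.4918, 0.7185, 0.4918, matching the corresponding line of the gensim output.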

Note:

Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
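The lazy behaviour described in this note can be illustrated without gensim. The class `LazyTransformedCorpus` and the toy transformation `double_counts` below are hypothetical stand-ins, not gensim code; they only mimic the idea that nothing is computed until the corpus is iterated, and that every pass repeats the work.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper object returned by model[corpus]."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # Nothing is stored: each document is transformed when it is requested.
        for doc in self.corpus:
            yield self.transform(doc)

call_log = []

def double_counts(doc):
    """Toy transformation that also records every invocation."""
    call_log.append(doc)
    return [(term_id, 2 * count) for term_id, count in doc]

corpus = [[(0, 1), (1, 1)], [(1, 3)]]
wrapped = LazyTransformedCorpus(double_counts, corpus)
print(len(call_log))          # 0: creating the wrapper did no work

first_pass = list(wrapped)    # the transformation runs here ...
second_pass = list(wrapped)   # ... and runs again on every new pass
print(len(call_log))          # 4: two documents, two passes
```

With a costly transformation, the repeated work is what makes it worthwhile to store the result once. In gensim this is usually done by serializing the transformed corpus to disk, for example with `corpora.MmCorpus.serialize`, and iterating over the saved file afterwards.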

LSI Transformation

Let us first use a simple example to understand the basic principle of the LSI model.
Suppose we have three documents d1, d2 and d3:
d1: Shipment of gold damaged in a fire.
d2: Delivery of silver arrived in a silver truck.
d3: Shipment of gold arrived in a truck.
To keep the problem as simple as possible, we do not remove stop words, and all words are lowercased.
Problem: Use Latent Semantic Indexing (LSI) to rank these documents for the query "gold silver truck".

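Step 1: Build the term-document matrix A, in which each column represents a document and each row represents a term. The query matrix q has a single column, which represents the query.
(Figure omitted: the term-document matrix A and the query vector q.)
Step 2: Decompose A with the singular value decomposition, $A = USV^T$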

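Here is how I understand the three matrices U, S and V produced by the decomposition:

Each column of U represents a topic and each row represents a term of the corpus. The entries of U can be read as the relatedness between each topic and each term.
Each row of V represents a document, that is, d1, d2 and d3 in the example, and each column represents a topic. The entries of V are the relatedness between each document and each topic.
S is a diagonal matrix whose diagonal entries, the singular values, can be read as the strength of each topic. Each singular value is the square root of an eigenvalue of $A^TA$, and the larger it is, the more important the topic.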

Step 3: Set k=2, which gives the following low-rank approximation.
(Figure omitted: the rank-2 approximation of A.)

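Step 4: In the two-dimensional space (k=2) we obtain a vector representation of each document. Each row of V is the vector of one document, because each entry of V is the relatedness between a document and a topic.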

d1(-0.4945, 0.6492)
d2(-0.6458, -0.7194)
d3(-0.5817, 0.2469)

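Step 5: In the same way, we can derive the vector representation of the query "gold silver truck".
(Figure omitted: the query vector in the reduced space.)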

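Step 6: Next, we can use a similarity measure (cosine similarity) to compare the query "gold silver truck" with each document.
(Figure omitted: the cosine similarities between the query and each document.)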
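The six steps can be sketched with NumPy. This is a minimal illustration under some assumptions: the terms are listed in alphabetical order, the helper names `count_vector` and `cosine` are made up here, and the query is folded into the reduced space with the common formula $q_k = q^T U_k S_k^{-1}$ (assumed, since the original figures are not available).

```python
import numpy as np

terms = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]
query = "gold silver truck"

def count_vector(text):
    """Raw term counts of a lowercased text, in the order given by `terms`."""
    words = text.split()
    return [words.count(t) for t in terms]

# Step 1: term-document matrix A (rows = terms, columns = documents) and query q.
A = np.array([count_vector(d) for d in docs], dtype=float).T
q = np.array(count_vector(query), dtype=float)

# Step 2: singular value decomposition A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Steps 3 and 4: keep k = 2 dimensions; the rows of V_k are the document vectors.
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Step 5: fold the query into the same space, q_k = q^T U_k S_k^{-1}.
q_k = q @ U_k / s_k

# Step 6: rank the documents by cosine similarity to the query.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(q_k, d) for d in V_k]
ranking = sorted(range(len(docs)), key=lambda j: sims[j], reverse=True)
print(np.round(V_k, 4))                      # document vectors (signs may be flipped)
print([round(x, 4) for x in sims])
print(["d%d" % (j + 1) for j in ranking])    # ['d2', 'd3', 'd1']
```

The signs of the singular vectors are not unique, so the printed coordinates may differ in sign from the values in Step 4. The cosine similarities are unaffected, because the query and the documents flip together, and under these assumptions the ranking comes out as d2, d3, d1.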

By now you should have a basic understanding of the LSI model. Next, let us see how to use the LSI transformation in gensim.

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

We transform the corpus_tfidf obtained above into a two-dimensional space via LSI, since num_topics = 2. We can use models.LsiModel.print_topics() to see how each topic relates to the words.

>>> lsi.print_topics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"

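As the output shows, the first topic, topic0, is most strongly related to "trees", "graph" and "minors", while the second topic, topic1, is most strongly related to "system", "user" and so on. The overall sign of an LSI topic is arbitrary, so it is the magnitudes and relative signs of the weights that matter.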

>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"
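One way to read the output above is to ask which topic has the larger absolute weight in each document. The short sketch below copies the printed coordinates by hand; the helper `dominant_topic` is a made-up name for this illustration.

```python
# LSI coordinates copied from the output above: (topic 0, topic 1) per document.
corpus_lsi_values = [
    (-0.066, 0.520), (-0.197, 0.761), (-0.090, 0.724),
    (-0.076, 0.632), (-0.102, 0.574),
    (-0.703, -0.161), (-0.877, -0.168), (-0.910, -0.141), (-0.617, 0.054),
]

def dominant_topic(vector):
    """Index of the topic with the largest absolute weight in a document."""
    return max(range(len(vector)), key=lambda t: abs(vector[t]))

labels = [dominant_topic(v) for v in corpus_lsi_values]
print(labels)   # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The first five documents, which are about human-computer interaction, lean toward topic1, while the last four, which are about trees and graphs, lean toward topic0.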

Of course, we can also save the model to disk so that it can be loaded later.

>>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')

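Available transformations

Gensim includes several very popular vector space model algorithms.

TF-IDF transformation:

The TF-IDF model is initialized with a bag-of-words training corpus. The transformed vector has the same dimensionality as the input vector.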

>>> model = models.TfidfModel(corpus, normalize=True)

Latent Semantic Indexing, LSI (or sometimes LSA) transformation:

It transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, target dimensionality of 200–500 is recommended as a “golden standard”.

>>> model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

>>> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
>>> ...
>>> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> lsi_vec = model[tfidf_vec]
>>> ...

Beyond these, there are several other transformations, such as Random Projections (RP), Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP), which are not covered here. For details, see http://radimrehurek.com/gensim/tut2.html