Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis

Reposted 2013-12-02 13:43:52


Non-negative Matrix Factorization (NMF) is frequently confused with Probabilistic Latent Semantic Analysis (PLSA). Both methods are applied to document clustering, and they have similarities as well as differences. This article explains both in the simplest way possible.

How it was discovered

Imagine having 5 documents: 2 about the environment, 2 about the U.S. Congress, and 1 about both, that is, about the legislative process of protecting the environment. We need to write a program that reliably identifies the category of each document and also returns the degree to which each document belongs to each category. For this elementary example we limit our vocabulary to 5 words: AIR, WATER, POLLUTION, DEMOCRAT, REPUBLICAN. Category ENVIRONMENT and category CONGRESS may each contain all 5 words, but with different probabilities. We understand that the word POLLUTION is more likely to appear in an article about ENVIRONMENT than in an article about CONGRESS, but in theory it can appear in both. Suppose that after examining our data we built the following document-term table:

document/word  air  water  pollution  democrat  republican
doc 1            3      2          8         0           0
doc 2            1      4         12         0           0
doc 3            0      0          0        10          11
doc 4            0      0          0         8           5
doc 5            1      1          1         1           1

We distinguish our categories by the groups of words assigned to them. We decide that category ENVIRONMENT should normally contain only the words AIR, WATER and POLLUTION, and that category CONGRESS should contain only the words DEMOCRAT and REPUBLICAN. We build another matrix, each row of which represents a category and contains the counts, summed over all documents, of only the words assigned to that category.

categories   air  water  pollution  democrat  republican
ENVIRONMENT    5      7         21         0           0
CONGRESS       0      0          0        19          17

We convert the values from frequencies to probabilities by dividing each value by the sum of its row, which turns each row into a probability distribution.

Matrix W
categories   air   water  pollution  democrat  republican
ENVIRONMENT  0.15  0.21   0.64       0         0
CONGRESS     0     0      0          0.53      0.47
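The construction of W can be sketched in a few lines of Python with NumPy. This is only an illustration of the hand-made grouping described above; the names `counts`, `word_mask` and `category_counts` are made up for this example.

```python
import numpy as np

# Document-term counts from the table above.
# Rows: doc 1..5; columns: air, water, pollution, democrat, republican.
counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

# Hand-made assignment of words to categories (1 = word belongs to category).
# Rows: ENVIRONMENT, CONGRESS. This mask is an assumption of the toy example.
word_mask = np.array([
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

# Sum the counts over all documents, keeping only the words of each category.
category_counts = word_mask * counts.sum(axis=0)

# Divide every row by its sum so that it becomes a probability distribution.
W = category_counts / category_counts.sum(axis=1, keepdims=True)

print(category_counts)
print(np.round(W, 2))
```

Rounded to two digits, the rows of `W` should reproduce the ENVIRONMENT and CONGRESS rows of Matrix W.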

Now we create another matrix that contains the probability distribution over categories within each document. It looks as follows:

Matrix D
document  ENVIRONMENT  CONGRESS
doc 1     1.0          0.0
doc 2     1.0          0.0
doc 3     0.0          1.0
doc 4     0.0          1.0
doc 5     0.6          0.4

It shows that the first two documents are about the environment, the next two are about Congress, and the last document is about both. The ratios 0.6 and 0.4 for the last document come from its 3 words in the environment category and its 2 words in the congress category. Now we multiply the two matrices and compare the result with the original data in normalized form. Normalization here means dividing each element of a row by the sum of that row. The two matrices are shown below for comparison:

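Product of D * W
document  air   water  pollution  democrat  republican
doc 1     0.15  0.21   0.64       0.0       0.0
doc 2     0.15  0.21   0.64       0.0       0.0
doc 3     0.0   0.0    0.0        0.53      0.47
doc 4     0.0   0.0    0.0        0.53      0.47
doc 5     0.09  0.13   0.38       0.21      0.19

Normalized data N
document  air   water  pollution  democrat  republican
doc 1     0.23  0.15   0.62       0.0       0.0
doc 2     0.06  0.24   0.71       0.0       0.0
doc 3     0.0   0.0    0.0        0.48      0.52
doc 4     0.0   0.0    0.0        0.62      0.38
doc 5     0.2   0.2    0.2        0.2       0.2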

The correlation is obvious. The technical problem is to find constrained matrices W and D (given only the number of categories) whose product is the best match to the original data in its normalized form N.
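The comparison can be reproduced with a short NumPy sketch. It uses the rounded hand-made W and D from this example; `max_gap` is just one possible way, chosen here for illustration, to put a number on how close the two matrices are.

```python
import numpy as np

# Document-term counts (rows: doc 1..5; columns: air, water, pollution,
# democrat, republican).
counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

# Hand-made matrices from the example (values rounded as in the text).
W = np.array([
    [0.15, 0.21, 0.64, 0.00, 0.00],   # ENVIRONMENT
    [0.00, 0.00, 0.00, 0.53, 0.47],   # CONGRESS
])
D = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.6, 0.4],
])

# Normalized data: every row of counts divided by its own sum.
N = counts / counts.sum(axis=1, keepdims=True)

# Product of the two small matrices, same shape as N.
product = D @ W

# Largest element-wise difference between the approximation and the data.
max_gap = np.abs(product - N).max()

print(np.round(product, 2))
print(np.round(N, 2))
print(max_gap)
```

Both `product` and `N` have rows that sum to 1.0, which is the constraint the factorization has to respect.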

Likelihood functions

In the above example we obtained the decomposition matrices by combining words into groups by hand. Obviously, we cannot do that when the vocabulary is over a hundred thousand words, the number of documents is over a million, and the number of categories is unknown but presumed to be relatively large (say, 100). The generic approach used in both NMF and PLSA is maximization of a likelihood function. For a beginner it is very hard to understand what a likelihood function means and how such functions are constructed. They come from probability theory and are usually treated as a technical subject. Here I try to explain one of them.

Let us say we know that in document one the word AIR occurs 3 times, the word WATER occurs 2 times and the word POLLUTION occurs 8 times. If we ask what the probability is that a randomly selected word from document one is AIR, the answer is simple: 3/13. We reach the same simple conclusion for the word WATER, P{word = WATER} = 2/13, and for the word POLLUTION, P{word = POLLUTION} = 8/13. Let us consider a function of three unknown probabilities

L = 3 * log(P1) + 2 * log(P2) + 8 * log(P3)

where P1 is the probability of the word AIR, P2 is the probability of the word WATER and P3 is the probability of the word POLLUTION. Suppose we do not know the probabilities. Having this function and the constraint P1 + P2 + P3 = 1.0, we can estimate the probabilities by looking for the constrained maximum of L. The Lagrange method works well in this case. The values P1, P2 and P3 that maximize L are exactly the above probabilities 3/13, 2/13 and 8/13. This is how it works, and the above function is called the log-likelihood: it is the logarithm of the likelihood P1^3 * P2^2 * P3^8, and both reach their maximum at the same point. Obviously, we do not need to do this for such a simple case; it only illustrates what a likelihood function is. In general, likelihood functions are used to estimate probabilities after an experiment has already been conducted and the frequencies of occurrence are known, as in the above example: we know the frequencies and are looking for the probabilities.
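As a worked check of the Lagrange step, take L in its logarithmic form, L = 3 log P1 + 2 log P2 + 8 log P3, and let λ be the multiplier for the constraint (Λ and λ are symbols introduced here for illustration). The constrained maximum then follows in three lines:

```latex
\begin{aligned}
\Lambda(P_1, P_2, P_3, \lambda)
  &= 3\log P_1 + 2\log P_2 + 8\log P_3 - \lambda\,(P_1 + P_2 + P_3 - 1), \\
\frac{\partial \Lambda}{\partial P_1} = \frac{3}{P_1} - \lambda = 0,\quad
\frac{\partial \Lambda}{\partial P_2} = \frac{2}{P_2} - \lambda = 0,\quad
\frac{\partial \Lambda}{\partial P_3} = \frac{8}{P_3} - \lambda = 0
  &\;\Longrightarrow\; P_1 = \frac{3}{\lambda},\; P_2 = \frac{2}{\lambda},\; P_3 = \frac{8}{\lambda}, \\
P_1 + P_2 + P_3 = \frac{13}{\lambda} = 1
  &\;\Longrightarrow\; \lambda = 13,\quad
  P_1 = \frac{3}{13},\; P_2 = \frac{2}{13},\; P_3 = \frac{8}{13}.
\end{aligned}
```

The maximizing probabilities are simply the observed frequencies divided by the total word count of the document.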


Some articles present Non-negative Matrix Factorization as Probabilistic Latent Semantic Analysis, but the two are not the same. The likelihood function for NMF is built on the conditional probabilities P(wi | dj), while the likelihood function for PLSA is different: it is built on the joint probabilities P(wi , dj).
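In commonly used notation the two log-likelihoods can be written roughly as below. The symbols are assumptions made here, not taken from the original formulas: n(w_i, d_j) is the count of word w_i in document d_j, z_k is a category and K is the number of categories.

```latex
\begin{aligned}
L_{\mathrm{NMF}}  &= \sum_{i}\sum_{j} n(w_i, d_j)\,\log P(w_i \mid d_j),
  &\quad P(w_i \mid d_j) &= \sum_{k=1}^{K} P(w_i \mid z_k)\,P(z_k \mid d_j), \\
L_{\mathrm{PLSA}} &= \sum_{i}\sum_{j} n(w_i, d_j)\,\log P(w_i, d_j),
  &\quad P(w_i, d_j) &= \sum_{k=1}^{K} P(z_k)\,P(w_i \mid z_k)\,P(d_j \mid z_k).
\end{aligned}
```

The first form normalizes within each document, the second over the whole table, which is where the extra factor P(z) enters.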

For example, the conditional probability P(wi | dj) for the first element of the data is 3/13, but the joint probability P(wi , dj) is 3/69, because the whole table contains 69 words. The NMF algorithm is designed on the presumption that the probabilities in each row sum to 1.0, which is not true for joint probabilities. Maximizing the likelihood function for PLSA and applying the Expectation-Maximization (EM) algorithm to obtain a numerical solution leads to a set of E-step and M-step update equations.
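For reference, the EM updates for PLSA in its standard symmetric form are usually written as follows. The notation is an assumption here: n(d, w) is the count of word w in document d, and the quantity computed in the E-step is the posterior over categories, written P(z | d, w).

```latex
\begin{aligned}
\text{E-step:}\quad
P(z \mid d, w) &= \frac{P(z)\,P(d \mid z)\,P(w \mid z)}
                       {\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')} \\[4pt]
\text{M-step:}\quad
P(w \mid z) &= \frac{\sum_{d} n(d, w)\,P(z \mid d, w)}
                    {\sum_{d}\sum_{w'} n(d, w')\,P(z \mid d, w')} \\
P(d \mid z) &= \frac{\sum_{w} n(d, w)\,P(z \mid d, w)}
                    {\sum_{d'}\sum_{w} n(d', w)\,P(z \mid d', w)} \\
P(z) &= \frac{\sum_{d}\sum_{w} n(d, w)\,P(z \mid d, w)}
             {\sum_{d}\sum_{w} n(d, w)}
\end{aligned}
```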

These formulas express everything via functions. P(w|z) and P(d|z) are functions of two variables; their values are similar to the elements of matrices W and D. P(z) is the distribution function for categories. In matrix notation it is a diagonal matrix Z with the probability of each category on the principal diagonal. The E-step quantity P(z|d,w) is a function of three variables. The numerator of the E-step expression is the product of single elements of W, D and Z. The denominator of the E-step is a scalar: the inner product of a row of D and a column of W, weighted by the corresponding diagonal elements of Z.

We can think of P(z|d,w) as a set of matrices, one for each given z. Each of them starts from a matrix of the first rank (a column of D times a row of W, scaled by the diagonal element of Z) and is divided element-wise by the common denominator. The size of each of these matrices matches the size of the document-term matrix. The numerators in the M-step expressions are Hadamard products of the source data and these matrices. In the computation of P(z) we add all elements of this Hadamard product. In the computation of P(w|z) we add within columns only (over documents), and in the computation of P(d|z) we add within rows only (over words). The denominators in the M-step perform normalization, i.e. they make the sums of all elements in rows equal to 1.0 to match the definition of probabilities.
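The matrix reading of the E-step and M-step can be sketched in NumPy as below. This is a minimal illustration of plain EM for symmetric PLSA under the standard update formulas, not the code of the demo mentioned in this article; the function name `plsa_em` and its parameters are made up for the example.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iter=200, seed=0):
    """Plain EM for symmetric PLSA on a document-term count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random positive start, normalized so each row is a distribution.
    p_z = np.full(n_topics, 1.0 / n_topics)         # P(z), diagonal of Z
    p_d_z = rng.random((n_topics, n_docs)) + 1e-3   # P(d|z)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)) + 1e-3  # P(w|z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step numerator: P(z) * P(d|z) * P(w|z), shape (topics, docs, words).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        # E-step denominator: sum over z for every (d, w) cell.
        total = joint.sum(axis=0, keepdims=True)
        posterior = joint / np.where(total > 0, total, 1.0)  # P(z|d,w)

        # M-step: Hadamard product of the data with each posterior slice.
        weighted = counts[None, :, :] * posterior

        # P(w|z): add over documents, then normalize each row.
        p_w_z = weighted.sum(axis=1)
        row = p_w_z.sum(axis=1, keepdims=True)
        p_w_z = p_w_z / np.where(row > 0, row, 1.0)

        # P(d|z): add over words, then normalize each row.
        p_d_z = weighted.sum(axis=2)
        row = p_d_z.sum(axis=1, keepdims=True)
        p_d_z = p_d_z / np.where(row > 0, row, 1.0)

        # P(z): add all elements, divide by the total word count.
        p_z = weighted.sum(axis=(1, 2)) / counts.sum()

    return p_z, p_d_z, p_w_z


counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

p_z, p_d_z, p_w_z = plsa_em(counts, n_topics=2)
print(np.round(p_z, 2))
print(np.round(p_w_z, 2))
print(np.round(p_d_z, 2))
```

The three-dimensional array `weighted` is affordable only for toy data like this; a practical implementation would loop over categories or use sparse counts instead.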

Both algorithms, NMF and PLSA, return approximately the same result if Z is not involved (set to a constant and not updated). Since NMF uses relative values (normalized within rows) and PLSA uses absolute values, the results match when the row sums of the document-term matrix, i.e. the document lengths, are approximately equal. The result of PLSA is skewed when some documents are larger than others. That is neither good nor bad, just different.

P(z) brings some computational stability issues. It does not affect P(w|z), because it is filtered out in the normalization step, but it affects the computation of P(d|z) by skewing the values in columns and destroying the result. When the computation does converge, it converges to the same result as without P(z). For that reason I simply removed P(z) from the computation of P(d|z). This is the only difference I introduced in my DEMO (link at the top). I found this computational issue mentioned in other papers as well. Some even introduced a new parameter, called inverse computational temperature, and use it to stabilize P(z). That is overkill; there is no need for a new theory. The problem is the fast changes in P(z), and it can be solved by dampening these values during the computation, for example by averaging them with the values from a few previous steps. Some implementations look as if they use P(z) but actually do not. I found one example of PLSA in Java: although P(z) is computed in the iterations, its values are always the same, so it does not affect the result.
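One possible reading of the damping idea is to blend each newly computed P(z) with the value from the previous step. The sketch below is an assumption about how that could look, not the method used in the demo; `damped_p_z` and its `keep` weight are hypothetical names.

```python
import numpy as np


def damped_p_z(previous_p_z, proposed_p_z, keep=0.7):
    """Blend the new P(z) with the previous one to slow down fast changes.

    `keep` is a hypothetical damping weight in [0, 1); 0 means no damping.
    """
    previous_p_z = np.asarray(previous_p_z, dtype=float)
    proposed_p_z = np.asarray(proposed_p_z, dtype=float)
    blended = keep * previous_p_z + (1.0 - keep) * proposed_p_z
    # Renormalize so that the result is still a probability distribution.
    return blended / blended.sum()


old = np.array([0.5, 0.5])
new = np.array([0.9, 0.1])
print(damped_p_z(old, new))
```

In an EM loop this function would replace the direct assignment of P(z) at the end of each M-step; averaging over several previous steps would work in the same spirit.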

I am not the only one who noticed that NMF and PLSA are computationally close. There is even a theoretical proof that these two methods are equivalent. I found them close but not equivalent. To see the difference, simply multiply any row of the data matrix by a constant. The difference comes from the use of the additional term P(z).
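The row-scaling experiment is easy to try on the toy data. The sketch below only compares the two kinds of input, the row-normalized conditional probabilities and the joint probabilities; the helper names are made up for this example.

```python
import numpy as np

counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

# Multiply one row (doc 1) by a constant, as if the document were 10x longer.
scaled = counts.copy()
scaled[0] *= 10


def conditional(x):
    """P(w|d): every row divided by its own sum."""
    return x / x.sum(axis=1, keepdims=True)


def joint(x):
    """P(w,d): every element divided by the grand total."""
    return x / x.sum()


same_conditional = np.allclose(conditional(counts), conditional(scaled))
same_joint = np.allclose(joint(counts), joint(scaled))

print(same_conditional)  # True: row normalization cancels the constant
print(same_joint)        # False: the longer document now carries more weight
```

The row-normalized input is blind to document length, while the joint distribution is not, which is why the two methods can disagree on data with very uneven documents.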

