LSA Overview
Latent Semantic Analysis (LSA), simply put, projects words and documents into a concept space and then clusters them there, enabling semantic-level retrieval and similar functionality.
The core of LSA consists of the following steps:
- parse stage: represent each document as a bag of words, ignoring stop words and punctuation. For example, the parse(self, doc) function in the code below outputs a dictionary whose keys are words and whose values are lists of the indices of the documents in which each word occurs (a word may occur several times in the same document, so the values in a list are not necessarily unique). A minimal sketch of this stage follows the list.
- build stage: construct the count matrix, whose rows are words and whose columns are documents; each entry is the frequency of the word in the document.
- SVD stage: decompose the count matrix with SVD, keep the dimensions with the larger singular values, and project words and documents into the concept space.
- picture stage: visualize the projection in two dimensions to reveal clustering patterns.
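To make the parse stage concrete, here is a minimal sketch on two invented toy documents (the names docs and wdict mirror the class attribute below, but the inputs are illustrative, not the book titles used later):

```python
# Minimal sketch of the parse stage: map each word to the list of document
# indices in which it occurs (toy documents invented for illustration).
docs = ["stock market investing", "value investing stock"]
wdict = {}
for dcount, doc in enumerate(docs):
    for w in doc.lower().split():
        wdict.setdefault(w, []).append(dcount)
print(wdict)
# {'stock': [0, 1], 'market': [0], 'investing': [0, 1], 'value': [1]}
```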
The use of LSA rests on the following assumptions:
- Documents are represented as bags of words: only the frequency of each word in a document matters, not word order.
- Words expressing the same concept (identical or similar content) always end up clustered together.
- Polysemy is ignored: each word is assumed to carry exactly one meaning.
LSA Notes
- After obtaining the count matrix, it is best to apply TF-IDF weighting, which captures how important each word is to each document (the TFIDF method in the code below implements this).
- The example below drops the first SVD dimension, because that dimension captures an average quantity: roughly, how many words a document contains on average, or in how many documents a word occurs on average. Since this carries little useful information, it is discarded. A more general approach is to normalize the columns of the count matrix first, in which case the first dimension need not be dropped; the drawback is that this makes the sparse matrix dense. A sketch of that normalization follows this list.
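A minimal sketch of that column normalization, assuming A is the word-by-document count matrix produced by build(); the helper name normalize_columns is hypothetical:

```python
import numpy as np

def normalize_columns(A):
    # L2-normalize each document (column) of the count matrix so that the
    # first SVD dimension no longer just encodes document length.
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # guard against all-zero columns
    return A / norms
```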
LSA Pros and Cons
Pros
- Words and documents are both mapped into the same concept space, so clustering can be done in that space, and words and documents can be queried against each other (for example, retrieving the documents that lie closest to a given word in the concept space).
- The concept space has far fewer dimensions than the original matrix, and those dimensions carry more information and less noise.
- LSA is a global algorithm, which makes it easy to discover patterns that are otherwise hard to observe.
Cons
- It assumes a Gaussian distribution and the Frobenius norm, which does not fit every problem; for example, word counts in documents follow a Poisson distribution rather than a Gaussian one.
- It cannot handle polysemy: every word is assumed to have exactly one meaning.
- It depends heavily on SVD, which is computationally expensive.
LSA Example
The nine document titles used are:
- The Neatest Little Guide to Stock Market Investing
- Investing For Dummies, 4th Edition
- The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
- The Little Book of Value Investing
- Value Investing: From Graham to Buffett and Beyond
- Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
- Investing in Real Estate, 5th Edition
- Stock Investing For Dummies
- Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
The count matrix can be printed with the printA method in the code below (rows are words, columns are titles).
After the SVD decomposition, the dimensions are ranked by importance according to the squares of the singular values on the diagonal of S; the picture0 method in the code below plots this ranking as a bar chart.
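To spell out the projection step, here is a minimal sketch of truncating the SVD to the top k singular values (the function concept_space and the default k=3 are illustrative assumptions, not part of the original code):

```python
from scipy.linalg import svd

def concept_space(A, k=3):
    # Truncated SVD: keep only the k largest singular values.
    U, S, Vt = svd(A, full_matrices=False)
    # Row i of U[:, :k] places word i in the concept space;
    # column j of Vt[:k, :] places document j there.
    return U[:, :k], S[:k], Vt[:k, :]
```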
Clustering the book titles on dimensions 2 and 3 gives the following simple grouping (red and blue refer to the colors in the tile plot drawn by picture1):
| Dim2 | Dim3 | Titles     |
|------|------|------------|
| red  | red  | 7, 9       |
| red  | blue | 6          |
| blue | red  | 2, 4, 5, 8 |
| blue | blue | 1, 3       |
Clustering the book title matrix together with the word matrix, again on dimensions 2 and 3, gives the scatter plot produced by the picture2 method. The complete code is below:
```python
%matplotlib inline
import numpy as np
from numpy import zeros
from scipy.linalg import svd
# The following imports are needed for TF-IDF.
from math import log
from numpy import asarray, sum
import matplotlib.pyplot as plt
titles = ["The Neatest Little Guide to Stock Market Investing",
"Investing For Dummies, 4th Edition",
"The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
"The Little Book of Value Investing",
"Value Investing: From Graham to Buffett and Beyond",
"Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
"Investing in Real Estate, 5th Edition",
"Stock Investing For Dummies",
"Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''
class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0
    def parse(self, doc):
        words = doc.split()
        for w in words:
            # Lowercase each word and strip the ignored punctuation.
            w = w.lower().translate(str.maketrans('', '', self.ignorechars))
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                # Note: wdict['book'] does become [0, 0] if 'book' occurs
                # twice in document 0 (via the append branch above).
                self.wdict[w] = [self.dcount]
        self.dcount += 1
    def build(self):
        # Keep only words that occur more than once across the corpus.
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
    def picture0(self):
        '''
        Bar chart of each singular value's importance, measured as its
        squared share of the total.
        '''
        plt.bar(range(len(self.S)), (self.S**2) / sum(self.S**2), align="center")
        plt.xticks(range(len(self.S)))
        plt.title("The Importance of Each Singular Value")
        plt.xlabel("Singular Values")
        plt.ylabel("Importance")
    def picture1(self):
        '''
        Tile plot of the top 3 dimensions of Vt for each book title.
        '''
        plt.set_cmap('bwr')
        plt.pcolor(-1 * self.Vt[0:3, :])
        plt.colorbar()
        plt.yticks(np.arange(3) + 0.5, ['Dim1', 'Dim2', 'Dim3'])
        plt.xticks(np.arange(9) + 0.5, ['T' + str(i) for i in range(1, 10)])
        plt.gca().invert_yaxis()
        plt.gca().set_aspect('equal')
        plt.xlabel("Book Titles")
        plt.ylabel("Dimensions")
        plt.title("Top 3 Dimensions of Each Book Title")
    def picture2(self):
        '''
        Scatter plot of words and titles projected onto dimensions 2 and 3
        of the concept space, with each point annotated.
        '''
        TitleX = -1 * self.Vt[1, :]
        TitleY = -1 * self.Vt[2, :]
        WordX = -1 * self.U[:, 1]
        WordY = -1 * self.U[:, 2]
        # Plot the words and annotate them in red.
        Words = self.keys
        plt.plot(WordX, WordY, 'rs')
        for i in range(len(Words)):
            plt.annotate(Words[i], xy=(WordX[i], WordY[i]), xytext=(2, 6),
                         textcoords='offset points', color='red')
        # Plot the titles and annotate them in blue.
        Titles = ['T' + str(i) for i in range(1, 10)]
        plt.plot(TitleX, TitleY, 'bo')
        for i in range(len(TitleX)):
            plt.annotate(Titles[i], xy=(TitleX[i], TitleY[i]), xytext=(2, 2),
                         textcoords='offset points', color='blue')
        plt.title('XY plots of Words and Titles')
        plt.xlabel('Dimension 2')
        plt.ylabel('Dimension 3')
    def TFIDF(self):
        # Reweight each count by term frequency times inverse document
        # frequency: (count / words in doc j) * log(N / docs containing word i).
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
    def printA(self):
        print('Here is the count matrix')
        print(self.A)
    def printSVD(self):
        print('Here are the singular values')
        print(self.S)
        print('Here are the first 3 columns of the U matrix')
        print(-1 * self.U[:, 0:3])
        print('Here are the first 3 rows of the Vt matrix')
        print(-1 * self.Vt[0:3, :])
```
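Finally, a short usage sketch assuming the class above (the variable name mylsa is arbitrary): parse each title, build and print the count matrix, then run the SVD and plot.

```python
# Drive the pipeline end to end on the nine titles.
mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
mylsa.calc()
mylsa.printSVD()
mylsa.picture0()
# Call picture1() and picture2() in separate notebook cells to get the
# tile plot and the scatter plot without overdrawing the same figure.
```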