Latent Semantic Analysis (LSA) Tutorial 潜语义分析LSA介绍 第三部分 (转载)

Part 4 - Clustering by Color

用颜色聚类

We can also turnthe numbers into colors. For instance, here is a color display that correspondsto the first 3 dimensions of the Titles matrix that we showed above. Itcontains exactly the same information, except that blue shows negative numbers,red shows positive numbers, and numbers close to 0 are white. For example,Title 9, which is strongly positive in all 3 dimensions, is also strongly redin all 3 dimensions.

我们可以把数字转换为颜色。例如,下图表示了标题矩阵3个维度的颜色分布。除了蓝色表示负值,红色表示正值,它包含了和矩阵同样的信息。例如,标题9在所有三个维度上正数值都较大,那么它在3个维度上都会很红。

 

 

 

We can use thesecolors to cluster the titles. We ignore the first dimension for clusteringbecause all titles are red. In the second dimension, we have the followingresult.

我们能够利用这些颜色来把标题聚类。我们在聚类中忽略第一维度,因为所有的都是红色。在第二维度,我们有如下结果:

Dim2

Titles

red

6-7, 9

blue

1-5, 8

Using the thirddimension, we can split each of these groups again the same way. For example,looking at the third dimension, title 6 is blue, but title 7 and title 9 arestill red. Doing this for both groups, we end up with these 4 groups.

在加上第三维度,我们可以继续划分。例如,在维度3上,标题6是蓝色,但是标题7和标题9依然是红色。最终我们得到如下几个分组:

Dim2

Dim3

Titles

red

red

7, 9

red

blue

6

blue

red

2, 4-5, 8

blue

blue

1, 3

It’s interestingto compare this table with what we get when we graph the results in the nextsection.


Part 5 - Clustering by Value

按值聚类

Leaving out thefirst dimension, as we discussed, let's graph the second and third dimensionsusing a XY graph. We'll put the second dimension on the X axis and the thirddimension on the Y axis and graph each word and title. It's interesting tocompare the XY graph with the table we just created that clusters thedocuments.

去掉维度1,让我们用xy轴坐标图来画出第二维和第三维。第二维作为X、第三维作为Y,并且把每个词和标题都画上去。比较下这个图和刚才聚类的表格会非常有意思。

In the graphbelow, words are represented by red squares and titles are represented by bluecircles. For example the word "book" has dimension values (0.15,-0.27, 0.04). We ignore the first dimension value 0.15 and graph"book" to position (x = -0.27, y = 0.04) as can be seen in the graph.Titles are similarly graphed.

在下图中,词表示为红色方形,标题表示为蓝色圆圈。例如,词“book”有坐标值(0.15, -0.27,0.04)。这里我们忽略第一维度0.15 把点画在(x = -0.27, y =0.04)。标题也是一样。

 

 

 

One advantage ofthis technique is that both words and titles are placed on the same graph. Notonly can we identify clusters of titles, but we can also label the clusters bylooking at what words are also in the cluster. For example, the lower leftcluster has titles 1 and 3 which are both about stock market investing. Thewords "stock" and "market" are conveniently located in thecluster, making it easy to see what the cluster is about. Another example isthe middle cluster which has titles 2, 4, 5, and, to a somewhat lesser extent,title 8. Titles 2, 4, and 5 are close to the words "value" and"investing" which summarizes those titles quite well.

这个技术的一个有点是词和标题都在一张图上。不仅我们可以区分标题的聚类,而且我们可以把聚类中的词给这个聚类打上标签。例如左下的聚类中有标题1和标题3都是关于股票市场投资(stock market investing)的。Stock和market可以方便的定位在这个聚类中,让描述这个聚类变得容易。其它也类似。

Advantages, Disadvantages, and Applications of LSA

LSA的优势、劣势以及应用

Latent SemanticAnalysis has many nice properties that make it widely applicable to manyproblems.

1.    First, the documents and words end up being mapped to thesame concept space. In this space we can cluster documents, cluster words, andmost importantly, see how these clusters coincide so we can retrieve documentsbased on words and vice versa.

2.    Second, the concept space has vastly fewer dimensionscompared to the original matrix. Not only that, but these dimensions have beenchosen specifically because they contain the most information and least noise.This makes the new concept space ideal for running further algorithms such astesting different clustering algorithms.

3.    Last, LSA is an inherently global algorithm that looks attrends and patterns from all documents and all words so it can find things thatmay not be apparent to a more locally based algorithm. It can also be usefullycombined with a more local algorithm such as nearest neighbors to become moreuseful than either algorithm by itself.

LSA有着许多优良的品质使得其应用广泛:

         1. 第一,文档和词被映射到了同一个“概念空间”。在这个空间中,我们可以把聚类文档,聚类词,最重要的是可以知道不同类型的聚类如何联系在一起的,这样我们可以通过词来寻找文档,反之亦然。

         2. 第二,概念空间比较原始矩阵来说维度大大减少。不仅如此,这种维度数量是刻意为之的,因为他们包含了大部分的信息和最少的噪音。这使得新产生的概念空间对于运行之后的算法非常理想,例如尝试不同的聚类算法。

         3. 最后LSA天生是全局算法,它从所有的文档和所有的词中找到一些东西,而这是一些局部算法不能完成的。它也能和一些更局部的算法(最近邻算法nearest neighbors)所结合发挥更大的作用。

There are a fewlimitations that must be considered when deciding whether to use LSA. Some ofthese are:

1.    LSA assumes a Gaussian distribution and Frobenius normwhich may not fit all problems. For example, words in documents seem to followa Poisson distribution rather than a Gaussian distribution.

2.    LSA cannot handle polysemy (words with multiple meanings)effectively. It assumes that the same word means the same concept which causesproblems for words like bank that have multiple meanings depending on whichcontexts they appear in.

3.    LSA depends heavily on SVD which is computationallyintensive and hard to update as new documents appear. However recent work hasled to a new efficient algorithm which can update SVD based on new documents ina theoretically exact sense.

当选择使用LSA时也有一些限制需要被考量。其中的一些是:

         1. LSA假设Gaussiandistribution and Frobenius norm,这些假设不一定适合所有问题。例如,文章中的词符合Poissondistribution而不是Gaussian distribution。

         2. LSA不能够有效解决多义性(一个词有多个意思)。它假设同样的词有同样的概念,这就解决不了例如bank这种词需要根据语境才能确定其具体含义的。

         3. LSA严重依赖于SVD,而SVD计算复杂度非常高并且对于新出现的文档很难去做更新。然而,近期出现了一种可以更新SVD的非常有效的算法。

In spite ofthese limitations, LSA is widely used for finding and organizing searchresults, grouping documents into clusters, spam filtering, speech recognition,patent searches, automated essay evaluation, etc.

不考虑这些限制,LSA被广泛应用在发现和组织搜索结果,把文章聚类,过滤作弊,语音识别,专利搜索,自动文章评价等应用之上。

As an example,iMetaSearch uses LSA to map search results and words to a “concept” space.Users can then find which results are closest to which words and vice versa.The LSA results are also used to cluster search results together so that yousave time when looking for related results.

例如,iMetaSearch利用LSA去把搜索结果和词映射到“概念”空间。用户可以知道那些结果和那些词更加接近,反之亦然。LSA的结果也被用在把搜索结果聚在一起,这样你可以节约寻找相关结果的时间。

转载于:https://www.cnblogs.com/appler/archive/2012/02/02/2335911.html

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值