Notes on Chapter 18 of Introduction to Information Retrieval (in English)

Matrix decompositions and latent semantic indexing

Term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.

  1. develop a class of operations from linear algebra, known as matrix decompositions
  2. use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
  3. examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing

Linear algebra review

  • eigenvalues of C

For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying

Cx = λx

are called the eigenvalues of C; a nonzero vector x satisfying this equation for an eigenvalue λ is a corresponding (right) eigenvector.

The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.

In a similar fashion, the left eigenvectors of C are the M-vectors y such that

y^T C = λ y^T

The number of nonzero eigenvalues of C is at most rank(C).

Note:

  1. If we write x as a linear combination of eigenvectors, x = Σ_i a_i x_i, then Cx = Σ_i a_i λ_i x_i. Hence the effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small (see the numpy sketch below).
  2. For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real.
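
A minimal numpy sketch of these points (the diagonal matrix and test vector are made up, in the spirit of the chapter's running example):

```python
import numpy as np

# A made-up real symmetric matrix; its eigenvalues are its
# diagonal entries and its eigenvectors are the standard basis.
S = np.array([[30.0, 0.0, 0.0],
              [0.0, 20.0, 0.0],
              [0.0, 0.0, 1.0]])

eigvals, eigvecs = np.linalg.eigh(S)   # eigh: symmetric matrices, ascending order
print(eigvals)                         # [ 1. 20. 30.]

# Principal eigenvector: the one for the eigenvalue of largest magnitude.
principal = eigvecs[:, np.argmax(np.abs(eigvals))]
print(principal)

# Effect of a small eigenvalue on a matrix-vector product:
x = np.array([2.0, 4.0, 6.0])
print(S @ x)                           # [60. 80.  6.]
# The component along the eigenvector with eigenvalue 1 contributes
# little compared with those for eigenvalues 30 and 20.
```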

Matrix decompositions

a square matrix can be factored into the product of matrices derived from its eigenvectors

  • Two theorems
  1. Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

S = U Λ U^(-1)

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order:

Λ = diag(λ_1, λ_2, …, λ_M),  λ_i ≥ λ_{i+1}

If the eigenvalues are distinct, then this decomposition is unique.

Indeed, stacking the eigenvector equations S u_i = λ_i u_i column by column gives S U = U Λ; because the columns of U are linearly independent, U is invertible, and S = U Λ U^(-1) follows.

  2. Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

S = Q Λ Q^T

where the columns of Q are the orthonormal eigenvectors of S and Λ is the diagonal matrix of eigenvalues of S. Further, all entries of Q are real and Q^(-1) = Q^T.

We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
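
A quick numpy check of the second theorem on a made-up symmetric matrix (eigh returns orthonormal eigenvectors, so Q^(-1) = Q^T):

```python
import numpy as np

# Made-up real symmetric matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh handles symmetric matrices: real eigenvalues,
# orthonormal eigenvectors in the columns of Q.
lam, Q = np.linalg.eigh(S)
Lam = np.diag(lam)

# Q is orthogonal, so Q^(-1) = Q^T.
assert np.allclose(Q @ Q.T, np.eye(2))

# Symmetric diagonal decomposition: S = Q Λ Q^T.
assert np.allclose(Q @ Lam @ Q.T, S)
```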

Term–document matrices and singular value decompositions

An M × N term-document matrix C is in general not square, and even a square term-document matrix is very unlikely to be symmetric, so the decompositions above do not apply directly; the SVD extends them to arbitrary matrices.

  • Theorem SVD

Let r be the rank of the M × N matrix C. Then there is a singular value decomposition (SVD) of C of the form

C = U Σ V^T

where the columns of U are the orthogonal eigenvectors of C C^T, the columns of V are the orthogonal eigenvectors of C^T C, and Σ is the diagonal matrix whose entries Σ_ii = σ_i = sqrt(λ_i) are the singular values of C, the λ_i being the common nonzero eigenvalues of C C^T and C^T C in decreasing order.

  • Illustration of the SVD

[Figure: the shapes of U, Σ, and V^T in the SVD of C, drawn for the two cases below.]

There are two cases:

  1. M > N
  2. M < N
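
A small numpy sketch of both cases on made-up random matrices; full_matrices=False gives the reduced SVD, in which the zero rows/columns of Σ are dropped:

```python
import numpy as np

rng = np.random.default_rng(0)

for M, N in [(5, 3), (3, 5)]:          # case M > N, then M < N
    C = rng.standard_normal((M, N))
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    # Reduced SVD shapes: U is M x r, s holds the r singular values
    # in decreasing order, Vt is r x N; here r = min(M, N) because a
    # random matrix has full rank with probability 1.
    print(M, N, U.shape, s.shape, Vt.shape)
    assert np.allclose(U @ np.diag(s) @ Vt, C)
```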

Low-rank approximations

  • Frobenius norm

Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix C_k of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − C_k, defined to be

||X||_F = sqrt( Σ_{i=1}^{M} Σ_{j=1}^{N} X_ij^2 )

The Frobenius norm of X measures the discrepancy between C_k and C; our goal is to find a matrix C_k that minimizes this discrepancy.

When k is far smaller than r (the rank of C), we refer to C_k as a low-rank approximation.

  • The SVD can be used to solve the low-rank matrix approximation problem.
    We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:

  1. Given C, construct its SVD as above: C = U Σ V^T.
  2. Derive from Σ the matrix Σ_k formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.
  3. Compute and output C_k = U Σ_k V^T as the rank-k approximation to C.

The rank of C_k is at most k.

This procedure yields the matrix of rank at most k with the lowest possible Frobenius error:

min_{Z: rank(Z) ≤ k} ||C − Z||_F = ||C − C_k||_F = sqrt( σ_{k+1}^2 + … + σ_r^2 )

  • the form of C_k

C_k = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_k u_k v_k^T

where u_i and v_i are the i-th columns of U and V, respectively. Thus, u_i v_i^T is a rank-1 matrix, so we have just expressed C_k as the sum of k rank-1 matrices, each weighted by a singular value.
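
A numpy sketch of the three-step procedure on a made-up matrix, checking the Frobenius error against the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((6, 4))        # made-up stand-in for a term-document matrix
k = 2

# Step 1: SVD of C.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Step 2: replace all but the k largest singular values by zeros.
s_k = np.where(np.arange(len(s)) < k, s, 0.0)

# Step 3: the rank-k approximation C_k = U Sigma_k V^T.
C_k = U @ np.diag(s_k) @ Vt
assert np.linalg.matrix_rank(C_k) == k

# The Frobenius error is the root of the sum of the squared
# discarded singular values.
err = np.linalg.norm(C - C_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```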

Latent semantic indexing

LSI: the low-rank approximation to C yields a new representation for each document in the collection. We cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in it. This process is known as latent semantic indexing.

  1. use SVD to construct a low-rank approximation Ck to the term-document matrix

  2. map each row/column to a k-dimensional space

  3. use the new k-dimensional LSI representation to compute similarities between vectors

The query is folded into the LSI space via

q_k = Σ_k^(-1) U_k^T q

which maps a query vector q into its representation q_k in the k-dimensional LSI space; document vectors are mapped into the same space.
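
A minimal end-to-end numpy sketch of the three steps above (the tiny term-document matrix and the query are made up): documents land on the columns of V_k^T, the query is folded in with the mapping just given, and similarities are cosines in the k-dimensional space.

```python
import numpy as np

# Made-up 5-term x 4-document count matrix.
C = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document j is represented by column j of Sigma_k^(-1) U_k^T C,
# which works out to column j of Vt_k.
docs_k = Vt_k                      # k x N document representations

# Fold the query into the LSI space: q_k = Sigma_k^(-1) U_k^T q.
q = np.array([1, 0, 1, 0, 0], dtype=float)   # made-up query vector
q_k = np.diag(1.0 / s_k) @ U_k.T @ q

# Cosine similarity between the query and each document.
sims = (docs_k.T @ q_k) / (
    np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.argsort(-sims))           # documents ranked by LSI similarity
```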

Note:

  • The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection.
  • A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
  • LSI works best in applications where there is little overlap between queries and documents.

  • soft clustering

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
