Problems with the simple vector approaches to similarity
- looking for hidden similarities in data
- based on matrix decomposition
Assume that we have 7 documents with 9 terms, e.g. Document 1 contains term 6 and term 9.
The term-document matrix is then
A ∈ ℝ^(9×7): each column represents a document and each row represents a term.
Remark: we have to normalize our matrix before applying the SVD.
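As a minimal sketch of the setup above: the incidence data below is made up for illustration (only Document 1 containing term 6 and term 9 comes from the text), and normalization here means scaling each document column to unit Euclidean length.

```python
import numpy as np

# Hypothetical 9x7 binary term-document matrix A:
# rows = terms (1..9), columns = documents (1..7).
# Column 0 (Document 1) contains term 6 and term 9, as in the text;
# the other columns are made-up incidence data for illustration.
A = np.array([
    [0, 1, 0, 1, 0, 0, 1],  # term 1
    [0, 0, 1, 0, 1, 0, 0],  # term 2
    [0, 0, 0, 1, 0, 1, 0],  # term 3
    [0, 1, 0, 0, 1, 0, 1],  # term 4
    [0, 0, 1, 0, 0, 1, 0],  # term 5
    [1, 0, 0, 1, 0, 0, 1],  # term 6  (in Document 1)
    [0, 1, 1, 0, 1, 0, 0],  # term 7
    [0, 0, 0, 1, 0, 1, 0],  # term 8
    [1, 1, 0, 0, 0, 0, 1],  # term 9  (in Document 1)
], dtype=float)

# Normalize each document (column) to unit Euclidean length before the SVD.
A_norm = A / np.linalg.norm(A, axis=0)
```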
Apply the SVD decomposition A = UΣVᵀ and truncate to rank 2:
U₂Σ₂ is the rank-2 approximation of the TERMS (each term as a point in 2 dimensions),
Σ₂V₂ᵀ is the rank-2 approximation of the DOCUMENTS (each document as a point in 2 dimensions).
AᵀA ∈ ℝ^(7×7) is the document-document similarity matrix. AAᵀ ∈ ℝ^(9×9) is the term-term similarity matrix.
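The decomposition and the similarity matrices can be sketched as follows; the matrix A here is random placeholder data standing in for a normalized 9×7 term-document matrix.

```python
import numpy as np

# Placeholder for a normalized 9x7 term-document matrix.
rng = np.random.default_rng(0)
A = rng.random((9, 7))
A = A / np.linalg.norm(A, axis=0)   # unit-length document columns

# Full SVD, then keep only the top-2 singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U2, S2, Vt2 = U[:, :k], np.diag(s[:k]), Vt[:k, :]

term_coords = U2 @ S2   # 9x2: each term as a point in 2 dimensions
doc_coords = S2 @ Vt2   # 2x7: each document as a point in 2 dimensions

A2 = U2 @ S2 @ Vt2      # rank-2 approximation of A

doc_sim = A.T @ A       # 7x7 document-document similarity matrix
term_sim = A @ A.T      # 9x9 term-term similarity matrix
```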
Latent semantic indexing (LSI, essentially identical to LSA)
- Dimensionality reduction = identification of hidden (latent) concepts
- query matching in latent space
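Query matching in latent space can be sketched via the standard LSI folding-in step, q̂ = Σ₂⁻¹U₂ᵀq, which maps a query (a term vector) into the same 2-dimensional space as the documents; the matrix A is again random placeholder data, and the query containing term 6 and term 9 is an assumed example.

```python
import numpy as np

# Placeholder for a normalized 9x7 term-document matrix.
rng = np.random.default_rng(0)
A = rng.random((9, 7))
A = A / np.linalg.norm(A, axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, S2, Vt2 = U[:, :2], np.diag(s[:2]), Vt[:2, :]

# Fold a query (9-dim term vector) into the 2-D latent space.
q = np.zeros(9)
q[[5, 8]] = 1.0                        # query containing term 6 and term 9
q_hat = np.linalg.inv(S2) @ U2.T @ q   # 2-D query coordinates

# Rank documents by cosine similarity in latent space.
docs = Vt2                             # 2x7 latent document coordinates
cos = (docs.T @ q_hat) / (np.linalg.norm(docs, axis=0) * np.linalg.norm(q_hat))
ranking = np.argsort(-cos)             # best-matching documents first
```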