# Dimensionality reduction

• looking for hidden similarities in data
• based on matrix decomposition

# Example

• Assume that we have 7 documents and 9 terms.

For example, document 1 contains term 6 and term 9.

• The document-term matrix is then $M \in \mathbb{R}^{9 \times 7}$: each column represents a document and each row represents a term.

Remark: the matrix has to be normalized before applying the SVD.

• Apply the singular value decomposition (SVD)

$M_{9 \times 7} = U_{9 \times 9} \Sigma_{9 \times 7} V^T$

• $\Sigma$: the diagonal matrix of singular values

• Rank-2 $\Sigma$, written $\Sigma_2$: keep the two largest singular values and set the others to zero

$U\Sigma_2$ is the rank-2 representation of the terms (2 dimensions), and
$\Sigma_2 V^T$ is the rank-2 representation of the documents (2 dimensions).
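The decomposition and the rank-2 truncation above can be sketched with NumPy. This is a minimal illustration, not a reference implementation: only document 1 (terms 6 and 9) comes from the example, the other documents are made up, and unit-length column normalization is one assumed reading of the remark about normalizing.

```python
import numpy as np

# Hypothetical toy corpus: 7 documents over 9 terms (1-based term ids).
# Only document 1 = {6, 9} is taken from the text; the rest is invented.
docs = {
    1: [6, 9],
    2: [1, 2],
    3: [5, 6, 8],
    4: [2, 3, 4],
    5: [1, 3, 4],
    6: [6, 7, 9],
    7: [5, 7, 8],
}

# Build the 9x7 matrix: rows are terms, columns are documents.
M = np.zeros((9, 7))
for d, terms in docs.items():
    for t in terms:
        M[t - 1, d - 1] = 1.0

# Assumed normalization: scale every document column to unit length.
M /= np.linalg.norm(M, axis=0, keepdims=True)

# Full SVD: U is 9x9, s holds the 7 singular values, Vt is 7x7.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

k = 2
term_coords = U[:, :k] * s[:k]           # 9x2, nonzero columns of U @ Sigma_2
doc_coords = s[:k, None] * Vt[:k, :]     # 2x7, nonzero rows of Sigma_2 @ V^T
M2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 approximation of M

print(term_coords.shape, doc_coords.shape, np.linalg.matrix_rank(M2))
```

Each row of `term_coords` places one term in the 2-dimensional space, and each column of `doc_coords` places one document there.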

## Question

What do $A^TA$ and $AA^T$ mean if $A \in \mathbb{R}^{9 \times 7}$ is a document-term matrix (terms as rows, documents as columns)?

• $A^TA \in \mathbb{R}^{7 \times 7}$ is the document-document similarity matrix.
• $AA^T \in \mathbb{R}^{9 \times 9}$ is the term-term similarity matrix.
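A short NumPy sketch can make the two products concrete. The matrix `A` below is random and purely hypothetical. Entry $(i, j)$ of $A^TA$ is the dot product of document columns $i$ and $j$; if the columns are scaled to unit length first, these entries become cosine similarities.

```python
import numpy as np

# Hypothetical 9x7 count matrix: rows are terms, columns are documents.
rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(9, 7)).astype(float)

doc_sim = A.T @ A    # 7x7: dot products between document columns
term_sim = A @ A.T   # 9x9: dot products between term rows

# Link to the SVD: the eigenvalues of A^T A are the squared singular values of A.
singular_values = np.linalg.svd(A, compute_uv=False)
eigenvalues = np.sort(np.linalg.eigvalsh(doc_sim))[::-1]

print(doc_sim.shape, term_sim.shape)
print(np.allclose(eigenvalues, singular_values ** 2))
```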

# Latent semantic indexing (LSI, identical to LSA)

• Dimensionality reduction = identification of hidden (latent) concepts
• Queries are matched against documents in the latent space
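Query matching in the latent space can be sketched as follows. The notes do not spell out the projection, so this uses a common fold-in formulation, $\hat{q} = \Sigma_k^{-1} U_k^T q$, with cosine similarity against the rows of $V_k$. The function name `lsi_rank` and the toy matrix are hypothetical.

```python
import numpy as np

def lsi_rank(M, q, k=2):
    """Rank the documents (columns of M) against a term-space query q in a k-dim latent space."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk: one latent row per document
    q_hat = (Uk.T @ q) / sk                     # fold the query into the latent space
    denom = np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12
    sims = (Vk @ q_hat) / denom                 # cosine similarity per document
    return np.argsort(-sims), sims

# Hypothetical 9x7 matrix with two topic blocks:
# documents 0-3 use terms 0-4, documents 4-6 use terms 5-8.
M = np.zeros((9, 7))
M[:5, :4] = [[1, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 1, 1],
             [1, 1, 0, 1],
             [0, 0, 1, 1]]
M[5:, 4:] = [[1, 1, 0],
             [1, 0, 1],
             [0, 1, 1],
             [1, 1, 1]]

# Query consisting only of term 0.
q = np.zeros(9)
q[0] = 1.0

order, sims = lsi_rank(M, q, k=2)
print(np.round(sims, 3))
```

In this toy setup, documents 0 to 3 should all score close to 1, including document 2, which does not contain term 0 at all. Documents 4 to 6 should score close to 0. That is the point of matching in the latent space: a document can match a query through a shared concept rather than a shared term.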

