Dimensionality reduction and topic modeling

0 Abstract

Bag-of-words (BOW) representations struggle with synonymy and polysemy.
Dimensionality reduction addresses this: it represents each document in a lower-dimensional space that reflects underlying concepts.

Two forms of dimensionality reduction:

  • Latent semantic indexing (LSI), using spectral decomposition
  • Topic modeling (PLSI & LDA), using probabilistic models to find co-occurrence patterns that correspond to semantic topics

Then a survey of advances in applying these techniques to large and evolving datasets and in incorporating network and contextual information.

1 Introduction

  • Index of Hebrew difficulties
    • suppress differences that were not significant
    • preserve differences that might affect the semantics
  • core challenges in automated text mining
    • synonymy & polysemy
  • Bag of words (BOW)
    • counts term frequencies while ignoring word order
    • high-dimensional and sparse (term-document matrix); see the sketch after this list
  • relationship between Clustering, Reduction and Topic modeling
    • (Discriminative method) Clustering & soft clustering: based on similarity; finds natural clusters but is hard to interpret; soft clustering associates a document with multiple clusters
    • (Discriminative method) Dimensionality reduction: operates on the BOW representation, retains more of the original information and the coupling between terms, but is still hard to interpret
    • (Generative method) Topic modeling: combines soft clustering and dimensionality reduction
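
To make the BOW item above concrete, here is a minimal sketch (my own toy example, not from the paper) of building a sparse term-count matrix with scikit-learn's CountVectorizer; the document strings are illustrative assumptions.

```python
# Bag-of-words sketch (assumed example): term counts per document, ignoring
# word order, stored as a sparse matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
# Rows = documents, columns = vocabulary terms (the transpose of the
# term-document orientation used in the LSI notation below).
X = vectorizer.fit_transform(docs)          # scipy.sparse matrix, shape (3, |V|)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(X.toarray())                          # mostly zeros: high-dimensional & sparse
```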

2 Latent semantic indexing

Projects documents into a semantic space, so that analysis operates at the conceptual level.

  • Overcomes synonymy and polysemy (helps term-based information retrieval)

  • Applied since the 1980s to:

    • information retrieval
    • assigning papers to reviewers
    • cross-lingual retrieval
  • Based on a low-rank (truncated) SVD approximation of the term-document matrix; the low-rank space is interpreted as a space of semantic concepts

  • Procedure (a small numerical sketch follows these notes)

    • SVD
      $X = U \Sigma V^{T}$
    • rank-$K$ (low-rank) approximation: the truncated SVD minimizes the approximation error under both the spectral and Frobenius norms
      $\hat{X} = \hat{U} \hat{\Sigma} \hat{V}^{T} = \begin{bmatrix} \boldsymbol{U}_{1} & \ldots & \boldsymbol{U}_{K} \end{bmatrix} \begin{bmatrix} \sigma_{1} & & \\ & \ddots & \\ & & \sigma_{K} \end{bmatrix} \begin{bmatrix} \boldsymbol{V}_{1}^{T} \\ \vdots \\ \boldsymbol{V}_{K}^{T} \end{bmatrix}$
    • document and term representations
      $\boldsymbol{X}_{d} = \hat{U} \hat{\Sigma} \hat{\boldsymbol{X}}_{d}, \qquad \boldsymbol{T}_{v} = \hat{V} \hat{\Sigma} \hat{\boldsymbol{T}}_{v}$
    • application
      • Information retrieval
        $\hat{\boldsymbol{q}} = \hat{\Sigma}^{-1} \hat{U}^{T} \boldsymbol{q}$
      • Document similarity: one approach [63] resolves the non-identifiability of the SVD
      • Term similarity
  • Implementation

    • Term-Document Matrix
      • TF-IDF weighting
      • language pyramid model [70]
    • Computation
      • Lanczos algorithm (for sparse matrices)
    • Handling corpus changes
      • Fold-in: compute representations for new documents/terms from the original decomposition; efficient, O(KN) (sketched in the second code block after these notes)
      • Updating the semantic space: [8] (1995), [52] (1994), [74] (1999)
        • [74]: decompose $\left[\hat{X} \;\; X^{\prime}\right]$ instead of $\left[X \;\; X^{\prime}\right]$
  • Some Analysis
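
A small numerical sketch of the LSI procedure above (an illustrative toy example I added, not code from the survey): build a tiny term-document matrix, take its rank-$K$ truncated SVD, and fold a query into the concept space via $\hat{\boldsymbol{q}} = \hat{\Sigma}^{-1} \hat{U}^{T} \boldsymbol{q}$. The matrix entries, $K = 2$, and the query vector are arbitrary assumptions.

```python
# Toy LSI sketch (assumed example): truncated SVD of a small term-document
# matrix, reduced document representations, and query folding.
import numpy as np

# Term-document matrix X: rows = terms, columns = documents.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 1.],
])

K = 2  # number of latent concepts (arbitrary choice for this toy example)

# Full SVD: X = U Sigma V^T, then keep the top-K singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]
X_hat = U_k @ S_k @ Vt_k          # best rank-K approximation of X

# Reduced document representations: X_d = U_k S_k x_hat_d, so
# x_hat_d = S_k^{-1} U_k^T X_d (equivalently, the d-th column of Vt_k).
doc_repr = np.linalg.inv(S_k) @ U_k.T @ X        # shape (K, n_docs)

# Fold a query (a bag-of-words vector over the same vocabulary) into the
# concept space: q_hat = Sigma^{-1} U^T q.
q = np.array([1., 0., 0., 0., 1.])
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Rank documents by cosine similarity to the query in concept space.
scores = (doc_repr.T @ q_hat) / (
    np.linalg.norm(doc_repr, axis=0) * np.linalg.norm(q_hat) + 1e-12
)
print("document scores:", scores)
```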
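A companion sketch for the fold-in step under corpus changes (again an assumed toy example): a new document column is projected with the original $\hat{U}$ and $\hat{\Sigma}$, so the semantic space is reused rather than recomputed; the cost is O(KN) per new document, at the price of the space slowly drifting away from the true SVD of the grown corpus.

```python
# Fold-in sketch (assumed example): project new documents into an existing
# K-dimensional concept space without recomputing the SVD.
import numpy as np

def fold_in(X_new, U_k, S_k):
    """Project new term-document columns (terms x new_docs) into the concept
    space: x_hat = S_k^{-1} U_k^T x  -- O(K*N) per document."""
    return np.linalg.inv(S_k) @ U_k.T @ X_new

# Reuse the same toy matrix as above to obtain U_k and S_k.
X_old = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 1.],
])
U, s, Vt = np.linalg.svd(X_old, full_matrices=False)
U_k, S_k = U[:, :2], np.diag(s[:2])

X_new = np.array([[1., 0., 2., 0., 1.]]).T   # one new document over 5 terms
print(fold_in(X_new, U_k, S_k))              # its 2-D concept coordinates
```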
