Dimensionality reduction and topic modeling

0 Abstract

Bag-of-words (BOW) representations struggle with synonymy and polysemy.
Dimensionality reduction addresses this: it represents each document in a lower-dimensional space that reflects underlying concepts.

Two forms of dimensionality reduction:

  • Latent semantic indexing (LSI), using spectral decomposition
  • Topic modeling (PLSI & LDA), using probabilistic models to find co-occurrence patterns that correspond to semantic topics

Then a survey of advances in applying these techniques to large and evolving datasets and in incorporating network and contextual information.

1 Introduction

  • Index of Hebrew difficulties
    • suppress differences that were not significant
    • preserve differences that might affect the semantics
  • core challenges in automated text mining
    • synonymy & polysemy
  • Bag of words (BOW)
    • counts term frequencies while ignoring word order
    • high-dimensional and sparse (term-document matrix); see the sketch after this list
  • relationship between Clustering, Reduction and Topic modeling
    • (Discriminative method) Clustering & soft clustering: based on similarity; finds natural clusters but is hard to interpret; soft clustering associates a document with multiple clusters
    • (Discriminative method) Dimensionality reduction: operates on the BOW representation, retains more of the original information and the coupling between terms, but is still hard to interpret
    • (Generative method) Topic modeling: combines soft clustering and dimensionality reduction
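
To make the BOW item above concrete, here is a minimal sketch (my own toy example, not from the paper) of building a sparse term-count matrix with scikit-learn's CountVectorizer; the document strings are illustrative assumptions.

```python
# Bag-of-words sketch (assumed example): term counts per document, ignoring
# word order, stored as a sparse matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
# Rows = documents, columns = vocabulary terms (the transpose of the
# term-document orientation used in the LSI notation below).
X = vectorizer.fit_transform(docs)          # scipy.sparse matrix, shape (3, |V|)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(X.toarray())                          # mostly zeros: high-dimensional & sparse
```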

2 Latent semantic indexing

Projects documents into a semantic space, so that analysis operates at the conceptual level.

  • Overcomes synonymy and polysemy (helps term-based information retrieval)

  • Applied since the 1980s to:

    • information retrieval
    • assigning papers to reviewers
    • cross-lingual retrieval
  • Based on a low-rank (truncated) SVD approximation of the term-document matrix; the low-rank space is interpreted as a space of semantic concepts

  • Procedure (a small numerical sketch follows these notes)

    • SVD
      $X = U \Sigma V^{T}$
    • rank-$K$ (low-rank) approximation: the truncated SVD minimizes the approximation error under both the spectral and Frobenius norms
      $\hat{X} = \hat{U} \hat{\Sigma} \hat{V}^{T} = \begin{bmatrix} \boldsymbol{U}_{1} & \ldots & \boldsymbol{U}_{K} \end{bmatrix} \begin{bmatrix} \sigma_{1} & & \\ & \ddots & \\ & & \sigma_{K} \end{bmatrix} \begin{bmatrix} \boldsymbol{V}_{1}^{T} \\ \vdots \\ \boldsymbol{V}_{K}^{T} \end{bmatrix}$
    • document and term representations
      $\boldsymbol{X}_{d} = \hat{U} \hat{\Sigma} \hat{\boldsymbol{X}}_{d}, \qquad \boldsymbol{T}_{v} = \hat{V} \hat{\Sigma} \hat{\boldsymbol{T}}_{v}$
    • application
      • Information retrieval
        $\hat{\boldsymbol{q}} = \hat{\Sigma}^{-1} \hat{U}^{T} \boldsymbol{q}$
      • Document similarity: one approach [63] resolves the non-identifiability of the SVD
      • Term similarity
  • Implementation

    • Term-Document Matrix
      • TF-IDF weighting
      • language pyramid model [70]
    • Computation
      • Lanczos algorithm (for sparse matrices)
    • Handling corpus changes
      • Fold-in: compute representations for new documents/terms from the original decomposition; efficient, O(KN) (sketched in the second code block after these notes)
      • Updating the semantic space: [8] (1995), [52] (1994), [74] (1999)
        • [74]: decompose $\left[\hat{X} \;\; X^{\prime}\right]$ instead of $\left[X \;\; X^{\prime}\right]$
  • Some Analysis
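
A small numerical sketch of the LSI procedure above (an illustrative toy example I added, not code from the survey): build a tiny term-document matrix, take its rank-$K$ truncated SVD, and fold a query into the concept space via $\hat{\boldsymbol{q}} = \hat{\Sigma}^{-1} \hat{U}^{T} \boldsymbol{q}$. The matrix entries, $K = 2$, and the query vector are arbitrary assumptions.

```python
# Toy LSI sketch (assumed example): truncated SVD of a small term-document
# matrix, reduced document representations, and query folding.
import numpy as np

# Term-document matrix X: rows = terms, columns = documents.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 1.],
])

K = 2  # number of latent concepts (arbitrary choice for this toy example)

# Full SVD: X = U Sigma V^T, then keep the top-K singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]
X_hat = U_k @ S_k @ Vt_k          # best rank-K approximation of X

# Reduced document representations: X_d = U_k S_k x_hat_d, so
# x_hat_d = S_k^{-1} U_k^T X_d (equivalently, the d-th column of Vt_k).
doc_repr = np.linalg.inv(S_k) @ U_k.T @ X        # shape (K, n_docs)

# Fold a query (a bag-of-words vector over the same vocabulary) into the
# concept space: q_hat = Sigma^{-1} U^T q.
q = np.array([1., 0., 0., 0., 1.])
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Rank documents by cosine similarity to the query in concept space.
scores = (doc_repr.T @ q_hat) / (
    np.linalg.norm(doc_repr, axis=0) * np.linalg.norm(q_hat) + 1e-12
)
print("document scores:", scores)
```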
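A companion sketch for the fold-in step under corpus changes (again an assumed toy example): a new document column is projected with the original $\hat{U}$ and $\hat{\Sigma}$, so the semantic space is reused rather than recomputed; the cost is O(KN) per new document, at the price of the space slowly drifting away from the true SVD of the grown corpus.

```python
# Fold-in sketch (assumed example): project new documents into an existing
# K-dimensional concept space without recomputing the SVD.
import numpy as np

def fold_in(X_new, U_k, S_k):
    """Project new term-document columns (terms x new_docs) into the concept
    space: x_hat = S_k^{-1} U_k^T x  -- O(K*N) per document."""
    return np.linalg.inv(S_k) @ U_k.T @ X_new

# Reuse the same toy matrix as above to obtain U_k and S_k.
X_old = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 1.],
])
U, s, Vt = np.linalg.svd(X_old, full_matrices=False)
U_k, S_k = U[:, :2], np.diag(s[:2])

X_new = np.array([[1., 0., 2., 0., 1.]]).T   # one new document over 5 terms
print(fold_in(X_new, U_k, S_k))              # its 2-D concept coordinates
```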
