Simultaneous sparse estimation of canonical vectors in the p>> N setting

The paper can be found at
http://arxiv.org/abs/1403.6095

1. Goal

In this paper, the authors are concerned with multi-group classification via linear discriminant analysis (LDA) and want to estimate the $G-1$ canonical vectors simultaneously rather than in a sequential fashion.

2. Method

2.1 Ratio-trace version of LDA

Suppose $\Sigma_b \in \mathbb{R}^{p\times p}$ and $\Sigma_w \in \mathbb{R}^{p\times p}$ are the population between-group and within-group covariance matrices. Note that the ratio-trace version of linear discriminant analysis deals with the following optimization problem:

$$\max_{V} \operatorname{Tr}\left((V^T\Sigma_w V)^{-1} V^T\Sigma_b V\right)$$

which can be solved via the generalized eigendecomposition of $\Sigma_w^{-1}\Sigma_b$.

Proof: Taking the derivative with respect to $V$ and setting it to zero,

$$\frac{\partial}{\partial V}\operatorname{Tr}\left((V^T\Sigma_w V)^{-1} V^T\Sigma_b V\right) = -2\,\Sigma_w V (V^T\Sigma_w V)^{-1} V^T\Sigma_b V (V^T\Sigma_w V)^{-1} + 2\,\Sigma_b V (V^T\Sigma_w V)^{-1} = 0$$

The derivation of this matrix derivative can be found at
http://blog.csdn.net/comeyan/article/details/50514610

we have

$$\begin{aligned}
\Sigma_w V (V^T\Sigma_w V)^{-1} V^T\Sigma_b V &= \Sigma_b V \\
\Rightarrow\quad \Sigma_w^{-1}\Sigma_b V &= V (V^T\Sigma_w V)^{-1} V^T\Sigma_b V \\
\Rightarrow\quad \Sigma_w^{-1}\Sigma_b V &= V B \Theta B^{-1} \\
\Rightarrow\quad \Sigma_w^{-1}\Sigma_b V B &= V B \Theta
\end{aligned}$$

Note that:

  • The third equivalence holds because we can simultaneously diagonalize $V^T\Sigma_w V$ and $V^T\Sigma_b V$ using a matrix $B$, which is to say ($\Theta$ is a diagonal matrix)
    $$B^{-1} V^T \Sigma_w V B = I, \qquad B^{-1} V^T \Sigma_b V B = \Theta$$

    So
    $$V^T \Sigma_w V = I, \qquad V^T \Sigma_b V = B \Theta B^{-1}$$

    How to diagonalize two matrices simultaneously can be found at
    http://blog.csdn.net/comeyan/article/details/50521034
  • The fourth equivalence tells us that the columns of $VB$ are the generalized eigenvectors of $\Sigma_w^{-1}\Sigma_b$.

Summary:
From the above analysis, we know that solving the ratio-trace version of LDA amounts to solving a generalized eigendecomposition problem, which can be done by diagonalizing the two matrices simultaneously.
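
For concreteness, here is a minimal numerical sketch of this route (my own illustration, not code from the paper): `scipy.linalg.eigh` solves the generalized eigenproblem $\Sigma_b v = \theta\,\Sigma_w v$ directly, which is equivalent to eigendecomposing $\Sigma_w^{-1}\Sigma_b$.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma_w = A @ A.T + p * np.eye(p)   # within-group covariance (positive definite)
M = rng.standard_normal((p, 2))
Sigma_b = M @ M.T                   # between-group covariance, rank 2

# Generalized eigendecomposition: Sigma_b v = theta * Sigma_w v
theta, V = eigh(Sigma_b, Sigma_w)   # eigenvalues in ascending order

# The same eigenpairs satisfy Sigma_w^{-1} Sigma_b v = theta * v
residual = np.linalg.solve(Sigma_w, Sigma_b) @ V - V * theta
print(np.allclose(residual, 0))     # True, up to numerical error
```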

2.2 Model

Since the ratio-trace version of LDA reduces to a generalized eigendecomposition of $\Sigma_w^{-1}\Sigma_b$, and the eigenvectors are unique only up to normalization, the authors take advantage of the uniqueness of the eigenspace to define a scale-invariant classification rule.

Notations to be used:

  • $G$ is the total number of groups.
  • $\operatorname{rank}(\Sigma_b) = G-1$.
  • $\mu_i,\ i=1,2,\dots,G$ is the mean of each group.
  • $\pi_i,\ i=1,2,\dots,G$ is the prior probability of each group.

The goal of this paper is to find the $G-1$ eigenvectors $\Phi$ corresponding to the non-zero eigenvalues of $\Sigma_w^{-1}\Sigma_b$.

Note that these $G-1$ vectors can be expressed in closed form.
Proposition 1 (population version). The following decomposition holds: $\Sigma_b = \Delta\Delta^T$, where for $r = 1, 2, \dots, G-1$ the $r$-th column of $\Delta$ has the form

$$\Delta_r = \frac{\sqrt{\pi_{r+1}}\left(\sum_{i=1}^{r} \pi_i (\mu_i - \mu_{r+1})\right)}{\sqrt{\sum_{i=1}^{r}\pi_i \sum_{i=1}^{r+1}\pi_i}}$$

Proposition 2 (sample version). The following decomposition holds: $\hat\Sigma_b = DD^T$, where for $r = 1, 2, \dots, G-1$ the $r$-th column of $D$ has the form

$$D_r = \frac{\sqrt{n_{r+1}}\left(\sum_{i=1}^{r} n_i (\hat\mu_i - \hat\mu_{r+1})\right)}{\sqrt{N \sum_{i=1}^{r} n_i \sum_{i=1}^{r+1} n_i}}$$

The proof of the sample version (which uses orthogonal contrasts for unbalanced data) can be found at
http://blog.csdn.net/COMEYAN/article/details/50521276

$\Sigma_b$ is formulated as this low-rank decomposition because it has a closed form and an intuitive interpretation in terms of the differences between the group means. A numerical check of Proposition 2 is sketched below.
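
The following sketch (my own, not from the paper) builds $D$ from Proposition 2 on simulated unbalanced data and verifies $\hat\Sigma_b = DD^T$; it assumes the convention $\hat\Sigma_b = \sum_g (n_g/N)(\hat\mu_g - \hat\mu)(\hat\mu_g - \hat\mu)^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, G = 6, 3
n = np.array([8, 13, 5])                       # unbalanced group sizes
N = n.sum()
X = [rng.standard_normal((n_g, p)) + g for g, n_g in enumerate(n)]

mu = np.array([Xg.mean(axis=0) for Xg in X])   # group means, G x p
mu_bar = (n[:, None] * mu).sum(axis=0) / N     # overall mean

# hat{Sigma}_b = sum_g (n_g / N) (mu_g - mu_bar)(mu_g - mu_bar)^T
Sigma_b = sum(n[g] / N * np.outer(mu[g] - mu_bar, mu[g] - mu_bar)
              for g in range(G))

# D from Proposition 2: column r built from the first r+1 group means
D = np.zeros((p, G - 1))
for r in range(1, G):                          # r = 1, ..., G-1
    num = np.sqrt(n[r]) * sum(n[i] * (mu[i] - mu[r]) for i in range(r))
    den = np.sqrt(N * n[:r].sum() * n[:r + 1].sum())
    D[:, r - 1] = num / den

print(np.allclose(D @ D.T, Sigma_b))           # True
```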

The closed form of the generalized eigenvectors $\Phi$ is then given as follows.

Proposition 3. Define $\Delta$ and $D$ as in the last two propositions. There exists an orthogonal matrix $P \in \mathcal{O}_{G-1}$ such that

$$\Phi = \Sigma_w^{-1}\Delta P$$

Moreover, if $\hat\Sigma_w$ is nonsingular, there exists an orthogonal matrix $R \in \mathcal{O}_{G-1}$ such that
$$\hat\Phi = \hat\Sigma_w^{-1} D R$$

If we use the Mahalanobis distance for classification, the classification rule is free of the orthogonal factor, so we can use a simpler matrix instead, as illustrated below:

$$\tilde\Phi = \Sigma_w^{-1}\Delta$$
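
As a quick illustration of this invariance (my own sketch, with made-up covariances and means), nearest-centroid classification in the projected space under the Mahalanobis metric returns the same label whether we use $\tilde\Phi$ or $\tilde\Phi R$ for an orthogonal $R$:

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(4)
p, k, G = 6, 2, 3
A = rng.standard_normal((p, p))
Sigma_w = A @ A.T + p * np.eye(p)       # within-group covariance (PD)
mus = rng.standard_normal((G, p))       # group means
Delta = rng.standard_normal((p, k))     # stand-in for Delta
Phi = np.linalg.solve(Sigma_w, Delta)   # Phi_tilde = Sigma_w^{-1} Delta
R = ortho_group.rvs(k, random_state=0)  # random orthogonal factor

def classify(x, V):
    """Nearest centroid in the projected space, Mahalanobis metric."""
    M = np.linalg.inv(V.T @ Sigma_w @ V)
    dists = [(V.T @ (x - m)) @ M @ (V.T @ (x - m)) for m in mus]
    return int(np.argmin(dists))

x = rng.standard_normal(p)
print(classify(x, Phi) == classify(x, Phi @ R))  # True: the rule is scale-free
```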

Now it is sufficient to estimate $\tilde\Phi = \Sigma_w^{-1}\Delta$, which can be characterized as

$$\tilde\Phi = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\left\|\Sigma_w^{1/2} V - \Sigma_w^{-1/2}\Delta\right\|_F^2 = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\Sigma_w V - 2V^T\Delta\right)$$

(setting the gradient $\Sigma_w V - \Delta$ to zero recovers $V = \Sigma_w^{-1}\Delta$).

Sample version
Model step 1

$$\tilde V = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\hat\Sigma_w V - 2V^T D\right)$$

To get a sparse solution, a group-lasso penalty on the rows of $V$ is added.
Model step 2

$$\hat V = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\hat\Sigma_w V - 2V^T D\right) + \lambda\sum_{i=1}^{p}\|V_i\|_2$$

where $V_i$ denotes the $i$-th row of $V$.

But the objective function can be unbounded below when $\hat\Sigma_w$ is singular, so regularization of $\hat\Sigma_w$ is needed: $\tilde\Sigma_w = \hat\Sigma_w + \rho I$.

Model step 3

$$\hat V = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\tilde\Sigma_w V - 2V^T D\right) + \lambda\sum_{i=1}^{p}\|V_i\|_2 = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\hat\Sigma_w V\right) + \frac{\rho}{2}\left\|V - \tfrac{1}{\rho} D\right\|_F^2 + \lambda\sum_{i=1}^{p}\|V_i\|_2$$

(the two objectives agree up to the additive constant $-\frac{1}{2\rho}\|D\|_F^2$).

But the second form shows that this shrinks $\hat V$ toward $\frac{1}{\rho} D$, which is not what we want. So, replacing $\hat\Sigma_w$ by the total covariance $\hat\Sigma_t = \hat\Sigma_w + DD^T$, we have the final model

Model step 4

$$\hat V = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\hat\Sigma_t V - 2V^T D\right) + \lambda\sum_{i=1}^{p}\|V_i\|_2 = \operatorname*{argmin}_{V\in\mathbb{R}^{p\times(G-1)}} \frac12\operatorname{tr}\left(V^T\hat\Sigma_w V\right) + \frac12\left\|V^T D - I\right\|_F^2 + \lambda\sum_{i=1}^{p}\|V_i\|_2$$
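
To see that the two forms in Model step 4 really are the same objective up to an additive constant, and that $\hat\Sigma_t = \hat\Sigma_w + DD^T$ is exactly what makes the completion of squares work, here is a small numerical check (my own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 7, 2                          # k = G - 1
A = rng.standard_normal((p, p))
Sigma_w = A @ A.T / p                # stand-in for hat{Sigma}_w
D = rng.standard_normal((p, k))
Sigma_t = Sigma_w + D @ D.T          # total = within + between (Proposition 2)

V = rng.standard_normal((p, k))      # an arbitrary candidate V

f1 = 0.5 * np.trace(V.T @ Sigma_t @ V - 2 * V.T @ D)
f2 = (0.5 * np.trace(V.T @ Sigma_w @ V)
      + 0.5 * np.linalg.norm(V.T @ D - np.eye(k)) ** 2)

# The two objectives differ by the constant -(G-1)/2, independent of V
print(np.isclose(f1, f2 - k / 2))    # True
```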

  • the first term minimizes the within-group variability,
  • the second term controls the level of the between-group variability,
  • the third term induces sparsity.

3. Theory

The model has a model selection property, and the misclassification error of the resulting rule coincides with that of the population rule.

4. Algorithm

A block coordinate descent algorithm is used to obtain the solution, as it takes advantage of warm starts when solving over a range of tuning parameters and is one of the fastest algorithms for smooth losses with separable regularizers. A sketch is given below.
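
Below is a minimal sketch of such a block coordinate descent for the Model step 4 objective. The row update is the standard blockwise soft-thresholding for group-lasso penalties; the function names, stopping rule, and toy data are mine, not the paper's implementation.

```python
import numpy as np

def group_soft_threshold(r, lam):
    """Shrink the whole row toward zero; return 0 if its norm is below lam."""
    nrm = np.linalg.norm(r)
    return np.zeros_like(r) if nrm <= lam else (1.0 - lam / nrm) * r

def bcd_sparse_lda(Sigma_t, D, lam, n_iter=500, tol=1e-8):
    """Block coordinate descent for
       min_V 0.5 * tr(V' Sigma_t V) - tr(V' D) + lam * sum_i ||V_i||_2,
       cycling over the p rows of V (assumes Sigma_t has a positive diagonal)."""
    p, k = D.shape
    V = np.zeros((p, k))
    for _ in range(n_iter):
        V_old = V.copy()
        for i in range(p):
            # partial residual: d_i minus the contribution of all other rows
            r = D[i] - Sigma_t[i] @ V + Sigma_t[i, i] * V[i]
            V[i] = group_soft_threshold(r, lam) / Sigma_t[i, i]
        if np.linalg.norm(V - V_old) <= tol * (1 + np.linalg.norm(V_old)):
            break
    return V

# Toy usage: most rows of V_hat are zeroed out for a large enough lambda
rng = np.random.default_rng(3)
p, k = 20, 2
A = rng.standard_normal((p, 2 * p))
Sigma_t = A @ A.T / (2 * p)
D = rng.standard_normal((p, k))
V_hat = bcd_sparse_lda(Sigma_t, D, lam=0.5)
print((np.linalg.norm(V_hat, axis=1) > 0).sum(), "nonzero rows")
```

Warm starts fit naturally here: when solving over a decreasing sequence of $\lambda$ values, the previous $\hat V$ is used as the initial value for the next one.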
