A Brief Overview of Common Unsupervised Dimension Reduction Methods

Unsupervised Dimension Reduction

High-dimensional data is always difficult to work with. On one hand, it demands tremendous computational resources; on the other hand, it is much harder to visualize and interpret than low-dimensional data. Dimension reduction is therefore one of the key techniques for dealing with it.

Linear Dimension Reduction

In order to reduce the dimension of the samples, i.e. to transform $\{x_i\}_{i=1}^n$ into $\{z_i\}_{i=1}^n$ with as little loss of information as possible, we can use a linear transformation:

$$z_i = T x_i$$

Before doing that, it is necessary to make sure the training set $\{x_i\}_{i=1}^n$ has zero mean, i.e. to centralize it. If that does not hold, we shift the frame first:

$$x_i \leftarrow x_i - \frac{1}{n}\sum_{i'=1}^{n} x_{i'}$$
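As a minimal MATLAB sketch of this centralization step (the toy data here is assumed purely for illustration; the same repmat idiom reappears in the demos below):

n=100;
x=randn(n,2);                    % toy data: one sample per row
x=x-repmat(mean(x),[n,1]);       % subtract the column-wise mean from every row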

Principal Component Analysis (PCA)

PCA, as you will see below, is the simplest linear dimension reduction method. Suppose that $z_i$ is the orthogonal projection of $x_i$; we then require that $TT^T = I_m$. In the same spirit as least squares methods, we try to keep the loss of information as small as possible, i.e. we minimize:

$$\sum_{i=1}^{n} \|T^TTx_i - x_i\|^2 = -\mathrm{tr}(TCT^T) + \mathrm{tr}(C)$$

where $C$ is the covariance matrix of the training set:

$$C = \sum_{i=1}^{n} x_i x_i^T$$
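This identity is worth checking once. Expanding the squared norm and using $TT^T = I_m$:

$$\|T^TTx_i - x_i\|^2 = \|x_i\|^2 - 2x_i^TT^TTx_i + x_i^TT^T(TT^T)Tx_i = \|x_i\|^2 - x_i^TT^TTx_i$$

and summing over $i$ yields $\mathrm{tr}(C) - \mathrm{tr}(TCT^T)$, since $\sum_{i=1}^{n} x_i^TT^TTx_i = \mathrm{tr}\big(T\big(\sum_{i=1}^{n} x_ix_i^T\big)T^T\big)$.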

In summary, PCA is defined as

$$\max_{T \in \mathbb{R}^{m \times d}} \mathrm{tr}(TCT^T) \quad \text{s.t.} \quad TT^T = I_m$$

Consider the eigenvalue problem of $C$:

$$C\xi = \lambda\xi$$

Denote the eigenvalues and corresponding eigenvectors by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ and $\xi_1, \dots, \xi_d$ respectively. The solution is then:

$$T = (\xi_1, \dots, \xi_m)^T$$

Here is a simple example:

n=100;                                   % number of samples
%x=[2*randn(n,1) randn(n,1)];
x=[2*randn(n,1) 2*round(rand(n,1))-1+randn(n,1)/3];  % two elongated clusters
x=x-repmat(mean(x),[n,1]);               % centralize the training set
[t,v]=eigs(x'*x,1);                      % leading eigenvector of C = x'*x

figure(1); clf; hold on; axis([-6 6 -6 6]);
plot(x(:,1),x(:,2),'rx');                % samples
plot(9*[-t(1) t(1)], 9*[-t(2) t(2)]);    % principal direction found by PCA

[Figure: the toy samples and the principal direction found by PCA]

Locality Preserving Projections

In PCA, the cluster structure of the original training set may be destroyed; locality preserving projections (LPP), another linear dimension reduction method, are designed to preserve it.
Define the similarity between $x_i$ and $x_{i'}$ as $W_{i,i'} \ge 0$: when the two samples are highly similar, $W_{i,i'}$ takes a large value, and vice versa. Since similarity is symmetric, we require $W_{i,i'} = W_{i',i}$. There are several common forms of similarity, such as the Gaussian similarity:

$$W_{i,i'} = \exp\left(-\frac{\|x_i - x_{i'}\|^2}{2t^2}\right)$$

where t>0 is a tunable parameter.
For the purpose of preserving the cluster structure, it is natural to require that similar $x_i$ be transformed to similar $z_i$. That is to say, we ought to minimize:

$$\frac{1}{2}\sum_{i,i'=1}^{n} W_{i,i'}\|Tx_i - Tx_{i'}\|^2$$

However, to avoid the trivial solution $T = 0$, we impose the constraint

$$TXDX^TT^T = I_m$$

where $X = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$ and $D$ is the diagonal matrix

$$D_{i,i'} = \begin{cases} \sum_{i''=1}^{n} W_{i,i''} & (i = i') \\ 0 & (i \ne i') \end{cases}$$

If we set $L = D - W$ (the graph Laplacian), then the optimization problem can be written as

$$\min_{T \in \mathbb{R}^{m \times d}} \mathrm{tr}(TXLX^TT^T) \quad \text{s.t.} \quad TXDX^TT^T = I_m$$
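The trace form follows from expanding the pairwise sum and using the definition of $D$:

$$\frac{1}{2}\sum_{i,i'=1}^{n} W_{i,i'}\|Tx_i - Tx_{i'}\|^2 = \sum_{i=1}^{n} D_{i,i}\,x_i^TT^TTx_i - \sum_{i,i'=1}^{n} W_{i,i'}\,x_i^TT^TTx_{i'} = \mathrm{tr}\big(TX(D-W)X^TT^T\big)$$

which is exactly $\mathrm{tr}(TXLX^TT^T)$.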

So how do we solve it? As in PCA, consider the (now generalized) eigenvalue problem:

$$XLX^T\xi = \lambda XDX^T\xi$$

Denote the generalized eigenvalues and eigenvectors by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ and $\xi_1, \dots, \xi_d$ respectively. The solution is then given by the eigenvectors with the smallest eigenvalues:

$$T = (\xi_d, \xi_{d-1}, \dots, \xi_{d-m+1})^T$$

n=100;                                   % number of samples
%x=[2*randn(n,1) randn(n,1)];
x=[2*randn(n,1) 2*round(rand(n,1))-1+randn(n,1)/3];  % two elongated clusters
x=x-repmat(mean(x),[n,1]);               % centralize the training set
x2=sum(x.^2,2);                          % squared norms of the samples
W=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'));     % Gaussian similarity (2*t^2=1)
D=diag(sum(W,2)); L=D-W;                 % degree matrix and graph Laplacian
z=x'*D*x;                                % X*D*X'
z=(z+z')/2;                              % symmetrize against round-off
[t,v]=eigs(x'*L*x,z,1,'sm');             % smallest generalized eigenvector

figure(1); clf; hold on; axis([-6 6 -6 6]);
plot(x(:,1),x(:,2),'rx');                % samples
plot(9*[-t(1) t(1)], 9*[-t(2) t(2)]);    % projection direction found by LPP

[Figure: the toy samples and the projection direction found by LPP]

Kernelized PCA

Let us turn to methods of nonlinear dimension reduction. Due to space limits, we will not analyze them as deeply as the linear ones.
When it comes to nonlinearity, kernel functions are sure to be highlighted. Take the Gaussian kernel function for example:

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2h^2}\right)$$

Here, instead of the eigenvalues of $C$ as in PCA, we consider the eigenvalue problem of the kernel matrix, $K\alpha = \lambda\alpha$, where the $(i, i')$-th element of $K$ is $K(x_i, x_{i'})$; hence $K \in \mathbb{R}^{n \times n}$. Note that the dimension of the kernel matrix $K$ depends only on the number of samples.
However, centralization is still necessary; it is applied to the kernel matrix directly:

$$K \leftarrow HKH$$

where

$$H = I_n - 1_{n \times n}/n$$

and $1_{n \times n}$ is the matrix whose elements are all one. The final outcome of kernelized PCA is:

$$(z_1, \dots, z_n) = \left(\frac{1}{\sqrt{\lambda_1}}\alpha_1, \dots, \frac{1}{\sqrt{\lambda_m}}\alpha_m\right)^T HKH$$

where $\alpha_1, \dots, \alpha_m$ are the eigenvectors corresponding to the $m$ largest eigenvalues of $HKH$.
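Since the two linear methods above each end with a small MATLAB demo, here is a minimal sketch of kernelized PCA in the same style; the bandwidth h=1 and target dimension m=2 are assumed values chosen for illustration, not prescribed by the text above:

n=100;                                   % number of samples
x=[2*randn(n,1) 2*round(rand(n,1))-1+randn(n,1)/3];  % same toy data as before
x2=sum(x.^2,2);                          % squared norms of the samples
h=1;                                     % Gaussian kernel bandwidth (assumed)
K=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x')/(2*h^2));  % kernel matrix
H=eye(n)-ones(n,n)/n;                    % centering matrix H = I_n - 1/n
Kc=H*K*H;                                % centralized kernel matrix HKH
Kc=(Kc+Kc')/2;                           % symmetrize against round-off
m=2;                                     % target dimension (assumed)
[A,V]=eigs(Kc,m);                        % m leading eigenvectors/eigenvalues
z=(A*diag(1./sqrt(diag(V))))'*Kc;        % embedded samples (z_1,...,z_n)

Each column of z is the m-dimensional embedding of one sample.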
