Coursera-Mathematics for Machine Learning: PCA Week4-1

Introduction

We have:

  • a dataset $X$ in $\mathbb{R}^D$ consisting of $n$ vectors ($n$ training examples)

  • $n$ vectors $X_1, \ldots, X_n$, where each $X_i$ is a $D$-dimensional vector

Objective:

  • find a low-dimensional representation of the data that is as similar to $X$ as possible.

Three important concepts

1. linear combination of the basis vectors

The first one is that every vector in $\mathbb{R}^D$ can be represented as a linear combination of the basis vectors.

$X_n$ can be written as $X_n = \sum_{i=1}^{D} \beta_{in} b_i$ (a small numerical sketch follows the list below).

  • a linear combination of the $D$ basis vectors $b_i$
  • $b_1, \ldots, b_D$ are an orthonormal basis of $\mathbb{R}^D$
  • orthonormal: the $b_i$ are mutually orthogonal (perpendicular to each other) and have unit length
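A minimal NumPy sketch of this decomposition (the basis and vector below are made-up examples; the only assumption is that the $b_i$ form an orthonormal basis of $\mathbb{R}^D$):

```python
import numpy as np

# Hypothetical orthonormal basis of R^3: the standard basis rotated in the xy-plane.
theta = np.pi / 6
b1 = np.array([np.cos(theta), np.sin(theta), 0.0])
b2 = np.array([-np.sin(theta), np.cos(theta), 0.0])
b3 = np.array([0.0, 0.0, 1.0])
B = np.column_stack([b1, b2, b3])   # columns are the basis vectors

x = np.array([2.0, -1.0, 4.0])      # an arbitrary vector in R^3
beta = B.T @ x                      # coefficients beta_i = b_i^T x
x_rebuilt = B @ beta                # sum_i beta_i * b_i

print(np.allclose(x, x_rebuilt))    # True: the linear combination recovers x exactly
```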

2. orthogonal projection onto 1-D subspace

We can interpret $\beta_{in}$ as the orthogonal projection of $X_n$ onto the one-dimensional subspace spanned by the $i$-th basis vector.
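Concretely, because the basis vectors are orthonormal (unit length and mutually orthogonal), this projection coordinate is just a dot product,

$$\beta_{in} = b_i^T X_n,$$

and the projection of $X_n$ onto the subspace spanned by $b_i$ is $\beta_{in} b_i = b_i b_i^T X_n$.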

3. orthogonal projection of X onto the subspace spanned by the M basis vectors

If we have orthonormal basis vectors $b_1, \ldots, b_M$ in $\mathbb{R}^D$ and we define $B = (b_1, \ldots, b_M)$ to be the matrix whose columns are these orthonormal basis vectors, then the projection of $X$ onto the subspace they span can be written as $\widetilde{X} = B B^T X$.

That means $\widetilde{X}$ is the orthogonal projection of $X$ onto the subspace spanned by the $M$ basis vectors.

And $B^T X$ gives the coordinates of $\widetilde{X}$ with respect to the basis vectors collected in the matrix $B$. These coordinates are also called the code.
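A minimal NumPy sketch of this projection (the matrix $B$ and the vector $x$ are made-up examples; the only assumption is that the columns of $B$ are orthonormal):

```python
import numpy as np

# Hypothetical example: project a vector in R^3 onto the 2-D subspace
# spanned by two orthonormal basis vectors (the columns of B).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([3.0, 2.0, 5.0])

code = B.T @ x          # coordinates ("code") of the projection w.r.t. the columns of B
x_tilde = B @ code      # orthogonal projection: x_tilde = B B^T x

print(code)             # [3. 2.]
print(x_tilde)          # [3. 2. 0.]
```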

PCA

The key idea in PCA

To find a lower-dimensional representation $\widetilde{X}_n$ that can be expressed using fewer basis vectors, say $M$.

Assumptions:

  • The data is centered, i.e., the dataset has mean zero.
  • $b_1, \ldots, b_D$ are an orthonormal basis of $\mathbb{R}^D$.

Generally, we can write any $\widetilde{X}_n$ in the following way:

$$\widetilde{X}_n = \sum_{i=1}^{M} \beta_{in} b_i + \sum_{i=M+1}^{D} \beta_{in} b_i.$$

This entire expression still lives in $\mathbb{R}^D$. We took our general way of writing any vector in $\mathbb{R}^D$, which comes from property one, and split that sum into two sums: one living in an $M$-dimensional subspace and the other in a $(D-M)$-dimensional subspace, which is the orthogonal complement of the first.

In PCA, we ignore the second term, i.e., we drop the sum running from $M+1$ to $D$.

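Dropping the second sum leaves only the component in the $M$-dimensional subspace:

$$\widetilde{X}_n = \sum_{i=1}^{M} \beta_{in} b_i.$$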

the principal subspace

We then call the subspace spanned by the basis vectors $b_1$ to $b_M$ the principal subspace. So $b_1$ to $b_M$ span the principal subspace.

Although $\widetilde{X}_n$ is still a $D$-dimensional vector, it lives in an $M$-dimensional subspace of $\mathbb{R}^D$, and only $M$ coordinates, $\beta_{1n}$ to $\beta_{Mn}$, are necessary to represent it.

These are the coordinates of the vector $\widetilde{X}_n$. The $\beta_{in}$ are also called the code, or the coordinates, of $\widetilde{X}_n$ with respect to the basis vectors $b_1$ to $b_M$.

Objective

The setting is now as follows: assuming we have data $X_1$ to $X_n$, we want to find parameters $\beta_{in}$ and orthonormal basis vectors $b_i$ such that the average squared reconstruction error is minimised.

the average squared reconstruction error

$J$: the average squared reconstruction error
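Written out (a sketch consistent with the course's definition; $N$ denotes the number of data points, indexed by $n$):

$$J = \frac{1}{N} \sum_{n=1}^{N} \left\lVert X_n - \widetilde{X}_n \right\rVert^2.$$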

example

Let’s have a look at an example. We have data living in two dimensions, and we want to find a good one-dimensional subspace such that the average squared reconstruction error between the original data points and their corresponding projections is minimised.

Here I’m plotting the original dataset together with its projections onto one-dimensional subspaces, cycling through a couple of candidate subspaces. You can see that some of these projections are significantly more informative than others, and in PCA we are going to find the best one. Our approach is to compute the partial derivatives of $J$ with respect to the parameters.
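A small NumPy sketch of this sweep (the toy dataset and the grid of candidate directions are made up for illustration): it evaluates the average squared reconstruction error of projecting onto each candidate one-dimensional subspace and keeps the best one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0],
                                          [1.0, 0.5]])   # toy 2-D dataset
X = X - X.mean(axis=0)                                   # center the data (PCA assumption)

best_angle, best_err = None, np.inf
for angle in np.linspace(0.0, np.pi, 180, endpoint=False):
    b = np.array([np.cos(angle), np.sin(angle)])         # unit vector spanning a 1-D subspace
    X_tilde = np.outer(X @ b, b)                         # projections b b^T x_n for every x_n
    err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))    # average squared reconstruction error J
    if err < best_err:
        best_angle, best_err = angle, err

print(f"best direction angle: {best_angle:.2f} rad, error: {best_err:.3f}")
```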

The parameters are the $\beta_{in}$ and the $b_i$.

We set the partial derivatives of $J$ with respect to these parameters to zero and solve for the optimal parameters. One observation we can already make is that the parameters enter this loss function only through $\widetilde{X}_n$.

This means that in order to get our partial derivatives, we need to apply the chain rule:

$$\frac{\partial J}{\partial \beta_{in}} = \frac{\partial J}{\partial \widetilde{X}_n}\,\frac{\partial \widetilde{X}_n}{\partial \beta_{in}}, \qquad
\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial \widetilde{X}_n}\,\frac{\partial \widetilde{X}_n}{\partial b_i}.$$

We can already compute the first part.
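Assuming the squared-norm form of $J$ given above, this first factor works out to

$$\frac{\partial J}{\partial \widetilde{X}_n} = -\frac{2}{N}\left(X_n - \widetilde{X}_n\right)^T.$$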
