Machine Learning 07 - Unsupervised Learning

I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, to review and consolidate what I learn.
My knowledge is limited, so if you find mistakes or have suggestions, please bear with me and point them out.

7.1 Clustering

7.1.1 K-means algorithm

Intuition

The K-means algorithm alternates between two steps:

  • Cluster assignment step
  • Move centroid step

The algorithm is illustrated in the figure below:

(Figure: the two steps of K-means)

Symbols

  • $c^{(i)}$ : index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
  • $\mu_k$ : cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
  • $\mu_{c^{(i)}}$ : cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Optimization objective

$$\min_{c^{(1)},\dots,c^{(m)},\,\mu_1,\dots,\mu_K} J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
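
As a concrete reference, here is a minimal Octave sketch of this cost; the function name kmeans_cost and the data layout (X is m x n with one example per row, c holds cluster indices, mu holds centroids row-wise) are my own choices, not from the course.

% Distortion J: average squared distance from each example to its assigned centroid.
% X : m x n data matrix, c : m x 1 cluster indices, mu : K x n centroid matrix
function J = kmeans_cost(X, c, mu)
  m = size(X, 1);
  J = sum(sum((X - mu(c, :)).^2)) / m;
end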

K-means algorithm - Algorithm 4

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$
Repeat {
    for $i = 1$ to $m$
        $c^{(i)}$ := index (from $1$ to $K$) of the cluster centroid closest to $x^{(i)}$
    for $k = 1$ to $K$
        $\mu_k$ := average (mean) of the points assigned to cluster $k$
}
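
A minimal Octave sketch of this loop (the name run_kmeans, the fixed iteration count, and the initialization from K random examples are my own choices):

% One run of K-means: alternate cluster assignment and centroid moves.
% X : m x n data matrix, K : number of clusters
function [c, mu] = run_kmeans(X, K, max_iters)
  m   = size(X, 1);
  idx = randperm(m);
  mu  = X(idx(1:K), :);               % initialize centroids from K random examples
  c   = zeros(m, 1);
  for iter = 1:max_iters
    for i = 1:m                       % cluster assignment step
      [~, c(i)] = min(sum((mu - X(i, :)).^2, 2));
    end
    for k = 1:K                       % move centroid step
      if any(c == k)
        mu(k, :) = mean(X(c == k, :), 1);
      end
    end
  end
end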

7.1.2 Important tricks

We choose the $K$ cluster centroids at random, and different initializations can converge to different solutions, so K-means may get stuck in a local optimum.

For example:

(Figure: local optima resulting from different random initializations)

Random Initialization

For $i = 1$ to $100$ {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.
    Compute the cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$
}
Pick the clustering that gave the lowest cost $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$.

For $K = 2$ to $10$, running many random initializations works well and can noticeably improve the result; when $K$ is large, a single run is already likely to find a reasonably good solution.
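
A sketch of this procedure in Octave, reusing the hypothetical run_kmeans and kmeans_cost helpers from the sketches above (X and K are assumed to be defined):

% Run K-means 100 times from different random initializations
% and keep the clustering with the lowest distortion J.
best_J = Inf;
for t = 1:100
  [c, mu] = run_kmeans(X, K, 50);     % 50 iterations per run, chosen arbitrarily
  J = kmeans_cost(X, c, mu);
  if J < best_J
    best_J = J;  best_c = c;  best_mu = mu;
  end
end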

Number of Clusters

Choosing the number of clusters is largely a judgment call; it is often done by hand, based on experience or by inspecting the data.

One heuristic to try (though not always effective) is the elbow method: plot the cost $J$ against the number of clusters $K$ and choose the $K$ at the "elbow" of the curve.

(Figure: elbow method, cost J plotted against the number of clusters K)
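
One way to draw this curve in Octave, again assuming the run_kmeans and kmeans_cost helpers from the earlier sketches:

% Elbow method: plot distortion J against the number of clusters K.
Ks = 2:10;
Js = zeros(size(Ks));
for t = 1:numel(Ks)
  [c, mu] = run_kmeans(X, Ks(t), 50);
  Js(t)   = kmeans_cost(X, c, mu);
end
plot(Ks, Js, '-o');
xlabel('K (number of clusters)');
ylabel('Cost J');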

Sometimes K-means is run for some later/downstream purpose. In that case, evaluate K-means (and choose $K$) based on how well it serves that later purpose.

7.2 Dimensionality Reduction

7.2.1 Intuition

The intuition for reducing data from 2D to 1D and from 3D to 2D is shown below:
(Figure: dimensionality reduction intuition, 2D → 1D and 3D → 2D)

Applications: data compression, data visualization, ...

7.2.2 Principal Component Analysis

To reduce data from $n$ dimensions to $k$ dimensions, what PCA does is:

Find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data so as to minimize the projection error.

Principal Component Analysis - Algorithm 5

Preprocessing: feature scaling / mean normalization (ensure zero mean).
Compute the covariance matrix:
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$$
(written as Sigma in the code below)
Perform the singular value decomposition and project:
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1 : k);
z = Ureduce' * x;

Reconstruction from Compressed Representation

$$x^{(i)}_{\text{approx}} = U_{\text{reduce}}\, z^{(i)}, \quad i = 1, 2, \dots, m$$
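
Putting the whole pipeline together, a minimal Octave sketch that works on an m x n data matrix X (one example per row), assuming k has already been chosen; the variable names beyond those in the snippets above are illustrative:

% PCA: mean-normalize, compute covariance, project to k dimensions, reconstruct.
mu_x   = mean(X);                                 % 1 x n mean used for mean normalization
X_norm = X - mu_x;                                % zero-mean data (add feature scaling if needed)
Sigma  = (1 / size(X, 1)) * (X_norm' * X_norm);   % n x n covariance matrix
[U, S, V] = svd(Sigma);
Ureduce  = U(:, 1:k);                             % first k principal components
Z        = X_norm * Ureduce;                      % m x k compressed representation (rows are z(i)')
X_approx = Z * Ureduce';                          % reconstruction, still in mean-normalized space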

7.2.3 Choosing k

Here $k$ (the dimension of $z$) is also called the number of principal components.

Typically, choose $k$ to be the smallest value such that

$$\frac{\frac{1}{m}\sum_{i=1}^{m} \left\| x^{(i)} - x^{(i)}_{\text{approx}} \right\|^2}{\frac{1}{m}\sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01$$

The threshold $0.01$ means that $99\%$ of the variance is retained.

An easier way to compute this is shown below:

Choosing k - Algorithm 6

[U, S, V] = svd(Sigma);
Pick the smallest value of $k$ for which

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$$
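
In Octave, this check over the diagonal of S might look like the following sketch (the name variance_retained is my own):

% Choose the smallest k retaining at least 99% of the variance,
% using the singular values on the diagonal of S from svd(Sigma).
[U, S, V] = svd(Sigma);
s = diag(S);                              % singular values S11, ..., Snn
variance_retained = cumsum(s) / sum(s);
k = find(variance_retained >= 0.99, 1);   % smallest k meeting the threshold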

7.2.4 Advice for applying PCA

Supervised learning speedup

Given a dataset: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, with $x^{(i)} \in \mathbb{R}^n$

  • Extract the inputs to get an unlabeled dataset.
  • Apply the PCA algorithm.
  • Get the new training set.

Finally, we get the new training set: $(z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), \dots, (z^{(m)}, y^{(m)})$, with $z^{(i)} \in \mathbb{R}^k$

Note:

  • The mapping $x^{(i)} \to z^{(i)}$ should be defined by running PCA only on the training set.
  • The same mapping can then be applied to the cross-validation and test sets, as in the sketch below.
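
A sketch of this workflow in Octave, fitting the mapping on the training inputs only and reusing it unchanged for the other sets (all variable names here are illustrative):

% Learn the PCA mapping (mean and Ureduce) on the training inputs only.
mu_x      = mean(X_train);
Xn_train  = X_train - mu_x;
Sigma     = (1 / size(Xn_train, 1)) * (Xn_train' * Xn_train);
[U, S, V] = svd(Sigma);
Ureduce   = U(:, 1:k);
Z_train   = Xn_train * Ureduce;           % new training inputs z(i)
% Apply the same mu_x and Ureduce to the cross-validation and test sets.
Z_cv   = (X_cv   - mu_x) * Ureduce;
Z_test = (X_test - mu_x) * Ureduce;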

Bad use of PCA: to prevent overfitting

That is, using $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features to $k < n$.

Reason: PCA throws away some information that might be valuable, and it does so without looking at the labels $y^{(i)}$; if overfitting is the concern, regularization is the better tool.

Consider machine learning without PCA first

Before implementing PCA, first try running whatever you want to do with the raw/original data. Only if that does not work as well as desired should you then implement PCA.
