1 Clustering
1.1 Introduction to Unsupervised Learning
Unsupervised learning uses an unlabeled training set. We don’t have the vector of expected results y; we only have a dataset of features in which to find structure. Unsupervised learning is good for:
- Market segmentation
- Social network analysis
- Organizing computer clusters
- Astronomical data analysis
1.2 K-Means Algorithm
- Initialize K points, called centroids, randomly.
- Assign each example to the one of the K groups whose centroid it is closest to.
- Move each centroid to the average (mean) of all the examples assigned to its group.
- Repeat steps 2 and 3 until the clusters stop changing.
Main variables
K: number of clusters
Training set: x(1), x(2), …, x(m)  # no y(i)
n: number of features
x(i): n x 1 vector
Note: we do not use the x0 = 1 convention here
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
    for i = 1 to m:
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)   # assign-cluster step
    for k = 1 to K:
        mu(k) := average (mean) of the points assigned to cluster k           # move-centroid step
- Assign-cluster step:
c(i) := argmin over k of || x(i) - mu(k) ||^2
- Move-centroid step:
mu(k) := (1/|Ck|) * (x(k1) + x(k2) + … + x(k|Ck|))   # x(k1), …, x(k|Ck|) are the training examples assigned to cluster k, and |Ck| is how many there are
(a small runnable sketch of both steps follows at the end of this section)
If no example is assigned to a cluster centroid, we can randomly re-initialize that centroid to a new point or eliminate that cluster group.
K-means can still be useful on non-separated clusters, e.g. splitting T-shirt customers into S, M, and L sizes.
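A minimal Octave sketch of the two K-means steps on toy 2-D data (my own illustration, not from ex7; variable names are arbitrary and it relies on Octave's automatic broadcasting):
X = [1 1; 1.2 0.8; 5 5; 5.2 4.9];           % m = 4 toy examples, n = 2 features
K = 2;
idx = zeros(1, size(X,1));
centroids = X(randperm(size(X,1), K), :);   % random init: pick K distinct training examples
for iter = 1:10,
  % assign-cluster step: c(i) = index of the closest centroid
  for i = 1:size(X,1),
    [~, idx(i)] = min(sum((centroids - X(i,:)).^2, 2));
  end;
  % move-centroid step: mu(k) = mean of the examples assigned to cluster k
  for k = 1:K,
    centroids(k,:) = mean(X(idx == k, :), 1);
  end;
end;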
1.3 Optimization Objective
Cost function
J(c(1), …, c(m), mu(1), …, mu(K)) = (1/m) * sum over i of || x(i) - mu(c(i)) ||^2
J is called the distortion of the training examples.
Optimization objective
Minimize J over the c's and the mu's, i.e. minimize the average squared distance of every training example to its assigned centroid.
- In the assign-cluster step: minimize J over c(1), …, c(m), holding mu(1), …, mu(K) fixed.
- In the move-centroid step: minimize J over mu(1), …, mu(K), holding c(1), …, c(m) fixed.
Therefore it is not possible for the cost function to increase during K-means; J should decrease (or stay the same) on every iteration.
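A one-line Octave check of the distortion (my own sketch; it assumes X is m x n with examples as rows, idx holds c(i), and centroids holds the mu(k) as rows, as in ex7):
J = mean(sum((X - centroids(idx, :)).^2, 2));   % (1/m) * sum over i of ||x(i) - mu(c(i))||^2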
1.4 Recommended Method for Random Initialization
- Have K < m
- Randomly pick K training examples
- Set mu(1), …, mu(K) to these K training examples
K-means can get stuck in local optima. To decrease the chance, run K-means many times with different random initializations and keep the clustering with the lowest cost J; this helps especially when K is small (e.g. K = 2–10).
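A sketch of that multi-init loop (my own; the random init is inlined as above, and runkMeans is assumed to behave like the ex7 helper that returns the final centroids and assignments):
bestJ = Inf;
for t = 1:100,
  initial_centroids = X(randperm(size(X,1), K), :);         % random init as described above
  [centroids, idx] = runkMeans(X, initial_centroids, 10);   % assumed ex7-style helper
  J = mean(sum((X - centroids(idx, :)).^2, 2));             % distortion from section 1.3
  if (J < bestJ),
    bestJ = J; best_centroids = centroids; best_idx = idx;
  end;
end;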
1.5 Choosing the Number of Clusters
The elbow method: plot the distortion J against the number of clusters K and pick the K at the "elbow", where J stops decreasing rapidly. Often the curve is ambiguous, so K is also commonly chosen based on the downstream purpose.
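A sketch of the elbow plot (my own; bestKMeansJ is a hypothetical helper, not an ex7 function, wrapping the multi-init loop from section 1.4 and returning the lowest J found for a given K):
Ks = 1:10;
Js = arrayfun(@(K) bestKMeansJ(X, K), Ks);
plot(Ks, Js, '-o'); xlabel('number of clusters K'); ylabel('distortion J');   % look for the elbow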
2 Dimensionality Reduction
2.1 Motivation for Dimensionality Reduction
- Data Compression: 2D to 1D, or 3D to 2D, …, or nD to kD
- Visualization: lower-dimensional data can be plotted, and several features can be combined into one so that a plot outlines the main structure of the data.
2.2 Principal Component Analysis (PCA) Problem Formulation
- 2D (x1, x2) to 1D (z): find a direction onto which to project the data so that the projection error is small; a good direction z gives small projection errors, a bad direction z’ gives large ones.
- PCA vs. linear regression: linear regression minimizes the vertical distance to the line (the error in predicting y), while PCA minimizes the orthogonal projection distance and treats all features equally (there is no y).
2.3 PCA Algorithm
- Data preprocessing before PCA
- Mean normalization: mu(j) = (1/m) * sum over i of xj(i), then replace each xj(i) with xj(i) - mu(j)
- Feature scaling (if features are on different scales): xj(i) = xj(i) / s(j), where s(j) is the standard deviation (or range) of feature j
- Compute the “covariance matrix”: Sigma = (1/m) * X' * X (an n x n matrix)
- Compute the “eigenvectors” of the covariance matrix: [U, S, V] = svd(Sigma)
- Take the first k columns of U as Ureduce (n x k) and compute z(i) = Ureduce' * x(i) (see the sketch below)
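A minimal Octave sketch of these steps (my own; it assumes X is m x n with examples as rows, already mean-normalized and scaled, and that k has been chosen):
m = size(X, 1);
Sigma = (1/m) * (X' * X);     % n x n covariance matrix
[U, S, V] = svd(Sigma);       % columns of U are the principal directions
Ureduce = U(:, 1:k);          % keep the first k components (n x k)
Z = X * Ureduce;              % row i of Z is z(i) = Ureduce' * x(i)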
2.4 Reconstruction from Compressed Representation
Since z = Ureduce' * x,
x_approx = Ureduce * z   # x_approx is the approximation of the original data
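Continuing the sketch from 2.3 (my own):
X_approx = Z * Ureduce';      % m x n; row i is x_approx(i) = Ureduce * z(i)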
2.5 Choosing the Number of Principal Components
One way is to use the following convention:
- Compute the average squared projection error: (1/m) * sum over i of || x(i) - x_approx(i) ||^2
- Compute the total variation in the data: (1/m) * sum over i of || x(i) ||^2
- Choose k to be the smallest value such that:
(average squared projection error) / (total variation) <= 0.01, i.e. 99% of the variance is retained
Fortunately we can use the svd function to ease the process: with [U, S, V] = svd(Sigma), choose the smallest k such that sum(i=1..k) S(i,i) / sum(i=1..n) S(i,i) >= 0.99.
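A sketch of picking k from the diagonal of S returned by svd(Sigma) in section 2.3 (my own):
s = diag(S);                     % the diagonal entries S(1,1), ..., S(n,n)
retained = cumsum(s) / sum(s);   % variance retained for k = 1, 2, ..., n
k = find(retained >= 0.99, 1);   % smallest k retaining at least 99% of the variance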
2.6 Advice for applying PCA
- Speed up supervised learning
Suppose the training set is (x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))
Extract all the inputs: x(1), x(2), …, x(m)
Use PCA to reduce the dimension, giving z(1), z(2), …, z(m) (e.g. reduce n = 10000 to k = 1000)
Use the new training set: (z(1), y(1)), (z(2), y(2)), …, (z(m), y(m)); a pipeline sketch follows at the end of this list
- Bad use of PCA: to prevent overfitting
Because PCA throws away some of the information in x(i) without ever looking at the labels y(i).
Use regularization instead!
- When is the right time to use PCA?
First try with the original/raw data x(i) without PCA. Only if that doesn’t do what we want, implement PCA and consider using z(i).
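A pipeline sketch for the speed-up use case (my own; featureNormalize, pca, and projectData are assumed to behave like the ex7 helpers, and trainModel is a hypothetical placeholder for whatever supervised learner is used):
[X_norm, mu, sigma] = featureNormalize(X_train);   % learn the mapping on the training inputs only
[U, S] = pca(X_norm);                              % assumed ex7-style helper
Z_train = projectData(X_norm, U, k);               % z(i) replaces x(i)
model = trainModel(Z_train, y_train);              % hypothetical supervised learner
Z_new = projectData((X_new - mu) ./ sigma, U, k);  % apply the SAME mu, sigma, U to new inputs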
3 ex7
3.1 K-means find closest centroids
% my initial code here; the running result is ok, but it failed to pass the grader after submitting,
% although ex7.m got the expected answer.
%m = size(X, 1);
%for i = 1:m,
% xi = X(i,:);
% clist = zeros(1, size(X,2));
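%   note: clist is probably mis-sized here -- it should be zeros(1, K) rather than zeros(1, size(X,2));
%   when K < n the leftover zero entries win the min below, which may be why the grader rejected this version.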
%   for each x(i) (of the m examples), compute the distance from x(i) to each of the K centroids,
%   keep the K results in clist (a vector), then use min to get the index of the closest one.
% for j = 1:K,
% miuj = centroids(j,:);
% clist(j) = (xi-miuj) * (xi-miuj)';
% end;
% [vmin, indexmin] = min(clist);
% idx(i) = indexmin;
%end;
% had to reference an online solution for the version below, which then passed the grader.
% this method is much cleaner
m = size(X, 1);
for i = 1:m,
% start with centroid 1 as the current minimum, then compare against the rest
minIndex = 1;
minDist = (X(i,:)-centroids(1,:)) * (X(i,:)-centroids(1,:))';
for j = 2:K,
curDist = (X(i,:)-centroids(j,:)) * (X(i,:)-centroids(j,:))';
if (curDist < minDist),
minIndex = j;
minDist = curDist;
end;
end;
idx(i) = minIndex;
end;
Is there a vectorized (matrix) method?
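One possibility (my own sketch, not from the assignment): compute all m x K squared distances at once and take the row-wise minimum; it assumes X (m x n) and centroids (K x n) as in the exercise and uses Octave's automatic broadcasting.
dists = sum(X.^2, 2) + sum(centroids.^2, 2)' - 2 * X * centroids';  % dists(i,j) = ||X(i,:) - centroids(j,:)||^2
[~, idx] = min(dists, [], 2);                                       % idx(i) = index of the closest centroid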
3.2 K-means compute centroid means
% in order to use matrix operations, expand the vector idx into an indicator matrix, with a 1 in column idx(i) of row i
Idx_matrix = zeros(m, K);
for i = 1:m,
Idx_matrix(i,idx(i)) = 1;
end;
% in Idx_matrix' (K x m):
% the number of 1s in row k is the number of examples that are closest to centroid k
centroids = Idx_matrix' * X;
% this gives, for each k, the sum of the x(i) that are closest to centroid k, but not yet the mean,
% so next we count the number of x(i) closest to centroid k, aka |Ck|
idx_vec = zeros(K, 1);
for j = 1:K,
idx_vec(j,1) = sum(Idx_matrix(:,j)); % count the 1s in column j of Idx_matrix, i.e. row j of Idx_matrix'
end;
% dividing by |Ck| for each mu(k): sum(Idx_matrix) gives exactly these counts, so the code is simpler:
centroids = centroids ./ (sum(Idx_matrix))';
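One edge case from section 1.2: if a cluster ends up with no assigned examples, sum(Idx_matrix) contains a zero and the division above produces NaN for that centroid. A minimal guard (my own sketch, not required by the grader; reuses X, m, Idx_matrix, centroids from above):
counts = (sum(Idx_matrix))';                             % K x 1, number of examples per cluster
empty = find(counts == 0);                               % clusters with no assigned examples
centroids(empty, :) = X(randi(m, numel(empty), 1), :);   % re-initialize them to random examples (section 1.2)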