【K-Means Clustering and Principal Component Analysis】

我想吃虾虾

Published on 2020-08-16 11:31:07

Category: Machine Learning

Article link: https://blog.csdn.net/Chen1189/article/details/108029220

K-Means Clustering Summary

K-means is a method for automatically clustering similar data examples together (unsupervised learning).
Given a training set $\{x^{(1)},\ldots,x^{(m)}\}$, the goal is to group the data into a few cohesive clusters.
It is an iterative procedure that starts by guessing the initial centroids, then refines the guess by repeatedly assigning examples to their closest centroids and recomputing the centroids from those assignments.

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments (computeCentroids is defined further below)
    centroids = computeCentroids(X, idx, K);
end

The inner loop of the algorithm repeatedly carries out two steps:

Assign each training example to its closest centroid.
Recompute each centroid as the mean of the points assigned to it.

Random initialization

Run K-means several times from different random initializations to avoid ending up in a poor local optimum.
Pick the run that yields the lowest distortion (cost function).

% Initialize the centroids to be random examples

% Randomly reorder the indices of the examples
randidx = randperm(size(X,1));
% Take the first K examples as the initial centroids
centroids = X(randidx(1:K),:);
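
The random-initialization idea above can be sketched as a loop over several restarts, keeping the run with the lowest distortion $J = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}-\mu_{c^{(i)}}\rVert^2$. This is only a minimal, self-contained sketch: the toy data, the names `num_restarts`, `max_iters`, `best_cost` and `best_centroids`, and the fixed iteration budget are illustrative assumptions, not part of the original exercise code.

```matlab
% Toy data: two well-separated groups in 2-D (illustrative values)
X = [1.0 1.1; 0.9 1.0; 1.1 0.9; 5.0 5.1; 4.9 5.0; 5.1 4.9];
K = 2;
num_restarts = 5;   % assumed number of random initializations
max_iters = 10;     % assumed fixed iteration budget per run
m = size(X, 1);

best_cost = Inf;
best_centroids = [];
for r = 1:num_restarts
    % Random initialization: K distinct examples become the centroids
    randidx = randperm(m);
    centroids = X(randidx(1:K), :);

    for iter = 1:max_iters + 1
        % Assignment step: squared distance of every example to every centroid
        D = zeros(m, K);
        for k = 1:K
            delta = bsxfun(@minus, X, centroids(k, :));
            D(:, k) = sum(delta .^ 2, 2);
        end
        [min_dist, idx] = min(D, [], 2);
        if iter > max_iters
            break;   % the extra pass only refreshes assignments for the cost
        end

        % Update step: mean of the examples assigned to each centroid
        for k = 1:K
            members = X(idx == k, :);
            if ~isempty(members)   % leave an empty cluster's centroid unchanged
                centroids(k, :) = mean(members, 1);
            end
        end
    end

    % Distortion of this run: average squared distance to the assigned centroid
    cost = mean(min_dist);
    if cost < best_cost
        best_cost = cost;
        best_centroids = centroids;
    end
end
fprintf('lowest distortion: %.4f\n', best_cost);
```

On data this cleanly separated every restart is likely to reach the same solution; the comparison matters more on real data, where different starting points can settle into different local optima.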

Finding closest centroids

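$c^{(i)} := j \quad\text{that minimizes}\quad \lVert x^{(i)}-\mu_j \rVert^2$
where $c^{(i)}$ is the index of the centroid closest to $x^{(i)}$ and $\mu_j$ is the position of the $j$-th centroid.
MATLAB code: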

function idx = findClosestCentroids(X, centroids)

% Set K (number of centroids)
K = size(centroids, 1);
% Number of training examples (X is M x 2 here: 2-dimensional data, 2 columns)
M = size(X,1);

% You need to return the following variables correctly.
idx = zeros(M, 1);

for sample = 1:M
    % Each training example gets its own 1 x K vector of squared distances
    distance_array = zeros(1,K);
    for centroid = 1:K
        % Squared Euclidean distance ||x - mu||^2, written as an inner product.
        % Taking sqrt is unnecessary: it does not change which centroid is closest.
        delta = X(sample,:) - centroids(centroid,:);
        distance_array(1, centroid) = delta * delta';
    end
    [~, d_idx] = min(distance_array);
    idx(sample,1) = d_idx;
end

end
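
The same assignment can also be computed without the double loop. The sketch below is one possible vectorized variant, not the exercise's reference solution: it expands $\lVert x-\mu\rVert^2 = \lVert x\rVert^2 - 2x\mu^T + \lVert\mu\rVert^2$ and evaluates it for all example/centroid pairs at once. The toy values and the names `x_sq`, `c_sq` and `D` are assumptions made for illustration.

```matlab
% Toy inputs (illustrative values)
X = [1 1; 2 1; 8 8; 9 9];          % m x n examples
centroids = [1.5 1; 8.5 8.5];      % K x n centroids

% ||x - mu||^2 = ||x||^2 - 2*x*mu' + ||mu||^2 for every pair at once
x_sq = sum(X .^ 2, 2);             % m x 1
c_sq = sum(centroids .^ 2, 2)';    % 1 x K
D = bsxfun(@plus, x_sq, c_sq) - 2 * X * centroids';   % m x K squared distances

% Row-wise argmin gives c^(i) for every example
[~, idx] = min(D, [], 2);
disp(idx');
```

Because of rounding, entries of `D` can come out very slightly negative for points that coincide with a centroid; that does not affect the argmin.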

Computing centroid means

function centroids = computeCentroids(X, idx, K)

% Useful variables
[m, n] = size(X);
% m: number of training examples, n: number of dimensions

% You need to return the following variables correctly.
centroids = zeros(K, n);


% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.

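% idx -- an m x 1 vector holding the centroid index assigned to each example

for k = 1:K % loop over the centroids
    % Average down the rows (dimension 1) so that a cluster with a single
    % example still yields a 1 x n row rather than a scalar.
    % Note: a centroid with no assigned examples yields NaN here.
    centroids(k, :) = mean(X(idx == k, :), 1);
end

% Equivalent alternative inside the loop: column sums divided by member count
% centroids(k, :) = sum(X(idx == k, :), 1) / sum(idx == k);
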
end
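
A small worked example of the centroid update may help when checking the function by hand. It is a sketch on made-up data (the values and the name `members` are assumptions), and it also shows why passing the dimension to `mean` matters when a centroid owns exactly one example.

```matlab
% Toy data: centroid 3 owns a single example (illustrative values)
X = [1 1; 3 1; 8 8; 10 8; 9 11; 20 30];
idx = [1; 1; 2; 2; 2; 3];
K = 3;

centroids = zeros(K, size(X, 2));
for k = 1:K
    members = X(idx == k, :);
    if ~isempty(members)
        centroids(k, :) = mean(members, 1);   % average down the rows
    end
end
disp(centroids);

% Without the dimension argument, a single row is averaged across its columns
single_row_mean = mean(X(idx == 3, :));       % scalar 25, not the row [20 30]
disp(single_row_mean);
```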

Principal Component Analysis - Dimensionality reduction

PCA consists of two computational steps:

Compute the covariance matrix of the data, denoted Sigma: $\Sigma = \frac{1}{m} X^T X$.

Use MATLAB's svd (singular value decomposition) function to compute the eigenvectors $U_1, U_2,\ldots,U_n$ of Sigma. These correspond to the principal components of variation in the data.

 [U, S, V] = svd(Sigma);

U contains the principal components as its columns, and S is a diagonal matrix whose entries, sorted in decreasing order, are the variances of the data along those components.
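
To make the roles of U and S concrete, here is a minimal sketch on a tiny zero-mean data set whose covariance matrix can be worked out by hand. The data values are an illustrative assumption. Note that svd may return either sign for each column of U, so only the absolute values are predictable.

```matlab
% Zero-mean toy data: spread 2 along the first axis, 1 along the second
X = [2 0; -2 0; 0 1; 0 -1];
m = size(X, 1);

Sigma = (1 / m) * (X' * X);     % equals [2 0; 0 0.5]
[U, S, V] = svd(Sigma);

disp(diag(S)');                 % variances along the principal components
disp(U' * U);                   % identity: the columns of U are orthonormal
```

The first column of U points along the first axis (up to sign), which is the direction of largest variance in this data.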

Before running PCA, preprocess the data:

Normalize the data by subtracting the mean value of each feature from the dataset.
Scale each dimension so that all features are in the same range.
Feature scaling function in MATLAB (returns the normalized X, the mean mu, and the standard deviation sigma):

function [X_norm, mu, sigma] = featureNormalize(X)

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

end
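
The normalization can be checked on a small matrix. The sketch below repeats the same steps inline rather than calling the function, so it runs on its own; the data and the name `x_new` are assumptions for illustration. It also shows one reason for returning mu and sigma: a new example has to be scaled with the training statistics, not its own.

```matlab
% Two features on very different scales (illustrative values)
X = [1 100; 2 200; 3 300; 4 400];

mu = mean(X, 1);
X_centered = bsxfun(@minus, X, mu);
sigma = std(X_centered, 0, 1);
X_norm = bsxfun(@rdivide, X_centered, sigma);

disp(mean(X_norm, 1));          % close to 0 for every feature
disp(std(X_norm, 0, 1));        % close to 1 for every feature

% Normalizing a new example with the stored mu and sigma
x_new = [2.5 250];
x_new_norm = (x_new - mu) ./ sigma;
disp(x_new_norm);
```

A feature that is constant across all examples has sigma equal to 0, so the division would produce NaN or Inf; such columns need separate handling.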

PCA implementation

function [U, S] = pca(X)

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);


% Covariance matrix of the (already normalized) data:
% X' is n x m and X is m x n, so Sigma is n x n
Sigma = (1.0/m) * X' * X;


[U, S, ~] = svd(Sigma);

end

Dimensionality reduction

Project the data to a lower dimension: $x^{(i)}\rightarrow z^{(i)}$

Projecting the data onto the principal components

Given: the dataset X, the principal components U, and the desired number of dimensions K.
Project each example in X onto the top K components in U.
The top K components in U are the first K columns of U, that is, U_reduce = U(:, 1:K).

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only 
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of 
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K 
%               eigenvectors in U (first K columns). 
%               For the i-th example X(i,:), the projection on to the k-th 
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%

% Method 1: matrix multiplication
Ureduced = U(:, 1:K); % U: n x n, Ureduced: n x K
% X is m x n, so Z = X * Ureduced is m x K
Z = X * Ureduced;

% Method 2 (equivalent, one element at a time):
% for i = 1:size(X, 1)
%     for k = 1:K
%         x = X(i, :)';
%         Z(i, k) = x' * U(:, k);
%     end
% end

end
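
As a quick illustration of the projection, the sketch below reduces a 3-dimensional toy data set to K = 2 dimensions. It recomputes U inline so that it runs on its own; the data values are an assumption. It also checks a property that follows from the construction: the projected coordinates are uncorrelated, and their variances are the first K diagonal entries of S.

```matlab
% Zero-mean toy data with n = 3 (illustrative values)
X = [1 0 0; -1 0 0; 0 2 0; 0 -2 0; 0 0 3; 0 0 -3];
m = size(X, 1);
K = 2;

Sigma = (X' * X) / m;
[U, S, ~] = svd(Sigma);

U_reduce = U(:, 1:K);           % n x K
Z = X * U_reduce;               % m x K

disp(size(Z));
disp((Z' * Z) / m);             % covariance of the projected data
disp(S(1:K, 1:K));              % matches the matrix above
```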

Recover data

Project each example in Z back onto the original space to obtain an approximate reconstruction: $x_{\text{approx}}^{(i)} = U_{\text{reduce}}\, z^{(i)}$.

function X_rec = recoverData(Z, U, K)

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1)); % size should be mxn


% Method 1: matrix multiplication
Ureduced = U(:, 1:K); % n x K matrix
% Z is m x K, so X_rec = Z * Ureduced' is m x n
X_rec = Z * Ureduced';


% Method 2 (equivalent, one element at a time):
% for i = 1:size(Z, 1)
%     for j = 1:size(U, 1)
%         v = Z(i, :)';
%         X_rec(i, j) = v' * U(j, 1:K)';
%     end
% end

end
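
A round trip through projection and recovery shows what is lost. The sketch below is self-contained and uses made-up zero-mean data; the names `err1` and `err2` are assumptions. With K = 1 the component along the second principal direction is discarded, while with K equal to the full dimension the data comes back exactly, up to rounding.

```matlab
% Zero-mean toy data, mostly spread along the diagonal (illustrative values)
X = [2 2; -2 -2; 1 -1; -1 1];
m = size(X, 1);

Sigma = (X' * X) / m;               % equals [2.5 1.5; 1.5 2.5]
[U, S, ~] = svd(Sigma);

% K = 1: project, then map back with the transpose of the same column
Z1 = X * U(:, 1);
X_rec1 = Z1 * U(:, 1)';

% K = 2 (the full dimension): nothing is discarded
Z2 = X * U(:, 1:2);
X_rec2 = Z2 * U(:, 1:2)';

% Average squared reconstruction error per example
err1 = sum(sum((X - X_rec1) .^ 2)) / m;
err2 = sum(sum((X - X_rec2) .^ 2)) / m;
fprintf('K=1 error: %.4f, K=2 error: %.4f\n', err1, err2);
disp(X_rec1);
```

In this example the K = 1 error equals the second diagonal entry of S, which is the variance along the discarded component.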
