Machine Learning: Week 8

The K-means Algorithm (Unsupervised Learning)

In each iteration of the algorithm, two things are done (see the loop sketch below):

  1. Cluster assignment
  2. Move the cluster centroids

These two steps repeat until the algorithm converges.
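A minimal sketch of that loop (kMeansSketch is a hypothetical name; it assumes the findClosestCentroids and computeCentroids helpers implemented later in this post):

def kMeansSketch(X, centroids, max_iters):
    for _ in range(max_iters):
        # Step 1: cluster assignment -- assign each sample to its nearest centroid
        idx = findClosestCentroids(X, centroids)
        # Step 2: move each centroid to the mean of the samples assigned to it
        centroids = computeCentroids(X, idx, centroids.shape[0])
    return centroids, idx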

Randomly Initializing the Cluster Centroids

You can randomly pick K of the samples and use them as the initial cluster centroids, as in the snippet below.
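A minimal sketch (assuming X holds one sample per row and K is the number of clusters; np.random.permutation guarantees K distinct samples):

import numpy as np

centroids = X[np.random.permutation(X.shape[0])[:K], :]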

PCA

Here is a blog post that explains it fairly clearly. In brief, PCA mean-normalizes the data, forms the covariance matrix Sigma = (1/m) * X'X, and takes its top-K eigenvectors as the projection basis; see the sketch below.
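A minimal sketch of the core computation (pca_sketch is a hypothetical name; the same steps are implemented in pca.py further down):

import numpy as np

def pca_sketch(X):
    # X: one mean-normalized sample per row
    m = X.shape[0]
    Sigma = X.T.dot(X) / m           # covariance matrix
    U, S, Vh = np.linalg.svd(Sigma)  # the columns of U are the principal directions
    return U, S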

Programming Exercise 1

computeCentroids.py

# X is the sample matrix, idx holds each sample's cluster assignment, and K is the number of clusters

import numpy as np
def computeCentroids(X, idx, K):
    n = X.shape[1]
    new_centroids = np.zeros((K, n))
    for i in range(K):
        new_centroids[i] = np.mean(X[idx == i], axis=0)

    return new_centroids
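One caveat (not an issue for this assignment's data): if a cluster ends up with no assigned samples, np.mean over the empty selection returns NaN. A hedged variant (computeCentroidsSafe is a hypothetical name) that keeps such a centroid in place:

import numpy as np

def computeCentroidsSafe(X, idx, K, old_centroids):
    # Like computeCentroids, but leaves a centroid where it is when its
    # cluster is empty, instead of producing NaN from an empty mean.
    new_centroids = old_centroids.copy()
    for i in range(K):
        members = X[idx == i]
        if members.shape[0] > 0:
            new_centroids[i] = members.mean(axis=0)
    return new_centroids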

ex7.py

# Implement the K-means algorithm and apply it to image compression
'''
%% Machine Learning Online Class
%  Exercise 7 | Principal Component Analysis and K-Means Clustering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     pca.m
%     projectData.m
%     recoverData.m
%     computeCentroids.m
%     findClosestCentroids.m
%     kMeansInitCentroids.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%
'''
import scipy.io as scio
import numpy as np
import matplotlib.pyplot as plt

'''
%% ================= Part 1: Find Closest Centroids ====================
%  To help you implement K-Means, we have divided the learning algorithm
%  into two functions -- findClosestCentroids and computeCentroids. In this
%  part, you should complete the code in the findClosestCentroids function.
%
'''
from findClosestCentroids import *

print('Part1: Finding closest centroids.')

# Load an example dataset that we will be using
data = scio.loadmat(r'D:\课程相关\吴恩达机器学习\Andrew-NG-Meachine-Learning-master\Andrew-NG-Meachine-Learning-master\machi'
                    r'ne-learning-ex7\machine-learning-ex7\ex7\ex7data2.mat')
X = data['X']

# Select an initial set of centroids
K = 3 # split the data into three clusters
initial_centroids = np.array([[3,3], [6, 2], [8, 5]])

# Find the closest centroids for the examples using the
# initial_centroids
idx = findClosestCentroids(X, initial_centroids)

print('Closest centroids for the first 3 examples: ')
print(idx[:3])
print('(with 0-based indexing these should be 0, 2, 1, i.e. 1, 3, 2 in the 1-indexed MATLAB output)')

input('Program paused. Press enter to continue.')

'''
%% ===================== Part 2: Compute Means =========================
%  After implementing the closest centroids function, you should now
%  complete the computeCentroids function.
%
'''
from computeCentroids import *

print('Part2: Computing centroid means.')

# Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K)

print('Centroids computed after initial finding of closest centroids: ')
print(centroids)
print('(the centroids should be')
print('[ 2.428301 3.157924 ]')
print('[ 5.813503 2.633656 ]')
print('[ 7.119387 3.616684 ])')

input('Program paused. Press enter to continue.')


'''
%% =================== Part 3: K-Means Clustering ======================
%  After you have completed the two functions computeCentroids and
%  findClosestCentroids, you have all the necessary pieces to run the
%  kMeans algorithm. In this part, you will run the K-Means algorithm on
%  the example dataset we have provided.
%
'''
from runkMeans import *

print('Part3: Running K-Means clustering on example dataset.')

# Settings for running K-Means
K = 3
max_iters = 10
'''
% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% setting them to be random examples (as can be seen in
% kMeansInitCentroids).
'''
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])

# Run the K-Means algorithm. Passing plot=True tells our function to plot
# the progress of K-Means; it is disabled here.
centroids, idx = runkMeans(X, initial_centroids, max_iters, plot=False)
print('K-Means Done.')

input('Program paused. Press enter to continue.')
'''
%% ============= Part 4: K-Means Clustering on Pixels ===============
%  In this exercise, you will use K-Means to compress an image. To do this,
%  you will first run K-Means on the colors of the pixels in the image and
%  then you will map each pixel onto its closest centroid.
%
%  You should now complete the code in kMeansInitCentroids.m
%
'''
# In a straightforward 24-bit color representation of an image, each pixel is stored as
# three 8-bit unsigned integers specifying the red, green and blue intensity values.
# This encoding is called RGB. The image contains thousands of colors; in this exercise
# we reduce the number of colors to 16 (K). This reduction lets us represent the image
# efficiently: we only need to store the RGB values of the 16 selected colors, and for
# each pixel we store just the index of its color (4 bits suffice for 16 possibilities).
# We use K-means to choose the 16 colors: every pixel of the original image is treated
# as a data sample, and K-means groups the pixels into 16 color clusters.
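# Concretely, for the 128 x 128 image used here: the raw image takes
# 128 * 128 * 24 = 393,216 bits, while the compressed version takes
# 16 * 24 = 384 bits for the palette plus 128 * 128 * 4 = 65,536 bits
# for the indices, about 65,920 bits in total, roughly a factor of 6 smaller.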
from PIL import Image
from kMeansInitCentroids import *

print('Part4: Running K-Means clustering on pixels from an image.')

img = Image.open(r'D:/课程相关/吴恩达机器学习/Andrew-NG-Meachine-Learning-master/Andrew-NG-Meachine-Learning-master/machine-'
                 'learning-ex7/machine-learning-ex7/ex7/bird_small.png')

A = np.asarray(img)  # Load the image into a 3-D array: the first two dimensions give
# the pixel position and the third dimension holds the red, green and blue channels.
# A.shape == (128, 128, 3)
A = A / 255 # Divide by 255 so that all values are in the range 0 - 1

# Size of the image
img_size = A.shape

# Reshape the image into an Nx3 matrix where N = number of pixels.
# Each row will contain the Red, Green and Blue pixel values
# This gives us our dataset matrix X that we will use K-Means on.
X = A.reshape(img_size[0] * img_size[1], 3)
## flatten the image so that each pixel becomes one sample
# Run your K-Means algorithm on this data
# You should try different values of K and max_iters here
K = 16  # 16 colors, as described above
max_iters = 10

# When using K-Means, it is important to initialize the centroids
# randomly.
# You should complete the code in kMeansInitCentroids.m before proceeding
initial_centroids = kMeansInitCentroids(X, K)  # randomly initialize the cluster centroids

# Run K-Means
centroids, idx = runkMeans(X, initial_centroids, max_iters, False)  # run the K-means algorithm

input('Program paused. Press enter to continue.')

'''
%% ================= Part 5: Image Compression ======================
%  In this part of the exercise, you will use the clusters of K-Means to
%  compress an image. To do this, we first find the closest clusters for
%  each example. After that, we map each pixel onto the value of its
%  closest centroid.
'''
print('Part5: Applying K-Means to compress an image.')

# Find closest cluster members
idx = findClosestCentroids(X, centroids)
## Classify every pixel by the centroid it belongs to, as found by K-means
# Essentially, now we have represented the image X as in terms of the
# indices in idx.
idx = idx.astype(int)  # the indices must be integers to be used for array indexing
# We can now recover the image from the indices (idx) by mapping each pixel
# (specified by its index in idx) to the centroid value
X_recovered = centroids[idx]
## replace each sample (pixel) by the position of its cluster centroid
# Reshape the recovered image into proper dimensions
X_recovered = X_recovered.reshape((img_size[0], img_size[1], 3))

# Display the original image
plt.subplot(1, 2, 1)
plt.imshow(A)
plt.title('Original')

# Display the compressed image
plt.subplot(1, 2, 2)
plt.imshow(X_recovered)
plt.title('Compressed, with {} colors.'.format(K))

plt.show()

input('Program paused. Press enter to continue.')



findClosestCentroids.py

# The cluster centroids are given by initial_centroids; classify the samples in X against them.
# For each sample in X, subtract every centroid vector and compare the lengths of the
# difference vectors; the centroid with the smallest distance gives the assignment.
import numpy as np

def findClosestCentroids(X, initial_centroids):
    m = X.shape[0]  # number of samples
    idx = np.zeros(m, dtype=int)
    for i in range(m):
        x_sample = X[i, :]  # the i-th sample
        sub_vectors = x_sample - initial_centroids  # difference to every centroid
        dist = np.sum(sub_vectors ** 2, axis=1)  # squared distance to every centroid
        idx[i] = np.argmin(dist)
    return idx
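For larger datasets, a fully vectorized variant (a sketch; findClosestCentroidsVectorized is a hypothetical name, and it produces the same assignments using broadcasting instead of a Python loop):

import numpy as np

def findClosestCentroidsVectorized(X, centroids):
    # diff[i, j, :] is the difference between sample i and centroid j
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # for each sample, pick the centroid with the smallest squared distance
    return np.argmin(np.sum(diff ** 2, axis=2), axis=1)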

kMeansInitCentroids.py

import numpy as np

## X holds the candidate samples, K is the number of clusters
def kMeansInitCentroids(X, K):
    # Shuffle the sample indices and keep the first K samples as centroids
    random_range = np.random.permutation(X.shape[0])
    centroids = X[random_range[:K], :]
    return centroids

runkMeans.py

## I wasn't sure how to write this one, so the code was taken directly from another blog
import matplotlib.pyplot as plt
import numpy as np
from findClosestCentroids import *
from computeCentroids import *

def runkMeans(X, initial_centroids, max_iters, plot):  # plot controls whether to visualize
    # If plot is True, draw the clustering progress
    if plot:
        plt.figure()

    (m, n) = X.shape  # m samples, n features per sample
    K = initial_centroids.shape[0]  # number of cluster centroids
    centroids = initial_centroids
    previous_centroids = centroids
    idx = np.zeros(m)  # the cluster centroid index assigned to each sample

    # Run K-means
    for i in range(max_iters):  # outer loop
        print('K-Means iteration {}/{}'.format((i + 1), max_iters))

        idx = findClosestCentroids(X, centroids)  # inner step 1: assign each sample to its nearest centroid

        if plot:
            plot_progress(X, centroids, previous_centroids, idx, K, i)  # draw the current cluster assignments
            previous_centroids = centroids
            input('Press ENTER to continue')

        centroids = computeCentroids(X, idx, K)  # inner step 2: update the cluster centroids
    if plot:
        plt.show()
    return centroids, idx  # final centroid positions and each sample's centroid index


def plot_progress(X, centroids, previous, idx, K, i):
    plt.scatter(X[:, 0], X[:, 1], c=idx, s=15)  # color each sample by its cluster

    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', c='black', s=25)  # mark the centroids

    for j in range(centroids.shape[0]):  # connect each updated centroid to its previous position
        draw_line(centroids[j], previous[j])

    plt.title('Iteration number {}'.format(i + 1))

def draw_line(p1, p2):
    plt.plot(np.array([p1[0], p2[0]]), np.array([p1[1], p2[1]]), c='black', linewidth=1)
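K-means can converge to a local optimum that depends on the initialization. A common remedy (a sketch reusing runkMeans and kMeansInitCentroids above; bestOfNRuns is a hypothetical name) is to run it several times and keep the run with the lowest distortion:

import numpy as np

def bestOfNRuns(X, K, max_iters, n_runs=10):
    best_cost, best_result = None, None
    for _ in range(n_runs):
        centroids, idx = runkMeans(X, kMeansInitCentroids(X, K), max_iters, False)
        # Distortion: mean squared distance between each sample and its centroid
        cost = np.mean(np.sum((X - centroids[idx.astype(int)]) ** 2, axis=1))
        if best_cost is None or cost < best_cost:
            best_cost, best_result = cost, (centroids, idx)
    return best_result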


Programming Exercise 2

displayData.py

'''
%DISPLAYDATA Display 2D data in a nice grid
%   [h, display_array] = DISPLAYDATA(X, example_width) displays 2D data
%   stored in X in a nice grid. It returns the figure handle h and the
%   displayed array if requested.
'''
import numpy as np
import matplotlib.pyplot as plt
## Display the pixel matrix as images. I had seen this once before but still couldn't write it, so it was copied from another blog
def displayData(x):
    (m, n) = x.shape

    # Compute the width and height of each image
    example_width = np.round(np.sqrt(n)).astype(int)
    example_height = (n / example_width).astype(int)

    # Compute the number of rows and columns of images to display
    display_rows = np.floor(np.sqrt(m)).astype(int)
    display_cols = np.ceil(m / display_rows).astype(int)

    # Padding between the images
    pad = 1

    # The array in which the images will be laid out
    display_array = - np.ones((pad + display_rows * (example_height + pad),
                               pad + display_cols * (example_width + pad)))

    # Copy each example into its patch of the display array
    curr_ex = 0
    for j in range(display_rows):
        for i in range(display_cols):
            if curr_ex >= m:
                break

            # Copy the patch
            # Get the max value of the patch
            max_val = np.max(np.abs(x[curr_ex]))
            display_array[pad + j * (example_height + pad) + np.arange(example_height),
                          pad + i * (example_width + pad) + np.arange(example_width)[:, np.newaxis]] = \
                          x[curr_ex].reshape((example_height, example_width)) / max_val
            curr_ex += 1

        if curr_ex >= m:
            break

    # Display image
    plt.figure()
    plt.imshow(display_array, cmap='gray', extent=[-1, 1, -1, 1])
    plt.axis('off')


ex7_pca.py

'''
%% Machine Learning Online Class
%  Exercise 7 | Principal Component Analysis and K-Means Clustering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     pca.m
%     projectData.m
%     recoverData.m
%     computeCentroids.m
%     findClosestCentroids.m
%     kMeansInitCentroids.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%
'''
import scipy.io as scio
import matplotlib.pyplot as plt

'''
%% ================== Part 1: Load Example Dataset  ===================
%  We start this exercise by using a small dataset that is easy to
%  visualize
%
'''
print('Part1: Visualizing example dataset for PCA.')
'''
%  The following command loads the dataset. You should now have the
%  variable X in your environment
'''
data = scio.loadmat(r'D:\课程相关\吴恩达机器学习\Andrew-NG-Meachine-Learning-master\Andrew-NG-Meachine-Learning-master\machi'
                    r'ne-learning-ex7\machine-learning-ex7\ex7\ex7data1.mat')
X = data['X']
#  Visualize the example dataset
plt.scatter(X[:, 0], X[:, 1], edgecolors='blue', facecolors='none')
plt.axis([0.5, 6.5, 2, 8]) #axis square
plt.show()
input('Program paused. Press enter to continue.')

'''
%% =============== Part 2: Principal Component Analysis ===============
%  You should now implement PCA, a dimension reduction technique. You
%  should complete the code in pca.m
%
'''
from featureNormalize import *
from pca import *

print('Part2: Running PCA on example dataset.')

#  Before running PCA, it is important to first normalize X
X_norm, mu, sigma = featureNormalize(X)
# Returns, in order: the normalized features, the mean of each feature,
# and the standard deviation of each feature
#  Run PCA
U, S = pca(X_norm)

#  Draw the eigenvectors centered at the mean of the data (mu, returned by
#  featureNormalize). These lines show the directions of maximum variation.
plt.scatter(X[:, 0], X[:, 1], edgecolors='blue', facecolors='none')
p1 = mu + 1.5 * S[0] * U[:, 0]
p2 = mu + 1.5 * S[1] * U[:, 1]
plt.plot([mu[0], p1[0]], [mu[1], p1[1]], c='black', linewidth=2)
plt.plot([mu[0], p2[0]], [mu[1], p2[1]], c='black', linewidth=2)
plt.axis([0.5, 6.5, 2, 8])
plt.show()

print('Top eigenvector: ')
print(' U(:,0) = {} {}'.format(U[0, 0], U[1, 0]))
print('(you should expect to see -0.707107 -0.707107)')

input('Program paused. Press enter to continue.')


'''
%% =================== Part 3: Dimension Reduction ===================
%  You should now implement the projection step to map the data onto the
%  first k eigenvectors. The code will then plot the data in this reduced
%  dimensional space.  This will show you what the data looks like when
%  using only the corresponding eigenvectors to reconstruct it.
%
%  You should complete the code in projectData.m
%
'''
from projectData import *
from recoverData import *

print('Part3: Dimension reduction on example dataset.')


#  Project the data onto K = 1 dimension
K = 1
Z = projectData(X_norm, U, K)
print('Projection of the first example: ', Z[0])
print('(this value should be about 1.481274)')

X_rec  = recoverData(Z, U, K)
print('Approximation of the first example: {} {}\n'.format(X_rec[0, 0], X_rec[0, 1]))
print('(this value should be about  -1.047419 -1.047419)')
#  Plot the normalized dataset (returned from pca)
plt.scatter(X_norm[:, 0], X_norm[:, 1], edgecolors='blue', facecolors='none')
plt.axis([-4, 3, -4, 3]) #axis square

#  Draw lines connecting the projected points to the original points

plt.scatter(X_rec[:, 0], X_rec[:, 1], edgecolors='red', facecolors='none')
for i in range(X_norm.shape[0]):
    plt.plot([X_norm[i, 0],X_rec[i, 0]], [X_norm[i, 1], X_rec[i, 1]], color='black')

plt.show()
input('Program paused. Press enter to continue.')

'''
%% =============== Part 4: Loading and Visualizing Face Data =============
%  We start the exercise by first loading and visualizing the dataset.
%  The following code will load the dataset into your environment
%
'''
from displayData import *
print('Part4: Loading face dataset.')

# Load Face dataset

data = scio.loadmat(r'D:\课程相关\吴恩达机器学习\Andrew-NG-Meachine-Learning-master\Andrew-NG-Meachine-Learning-master\machi'
                    r'ne-learning-ex7\machine-learning-ex7\ex7\ex7faces.mat')
X = data['X']
#  Display the first 100 faces in the dataset
displayData(X[:100, :])
plt.show()
input('Program paused. Press enter to continue.')

'''
%% =========== Part 5: PCA on Face Data: Eigenfaces  ===================
%  Run PCA and visualize the eigenvectors which are in this case eigenfaces
%  We display the first 36 eigenfaces.
%
'''
print('Part5: Running PCA on face dataset. (this might take a minute or two...)')

#  Before running PCA, it is important to first normalize X by subtracting
#  the mean value from each feature
X_norm, mu, sigma = featureNormalize(X)

#  Run PCA
U, S = pca(X_norm)

#  Visualize the top 36 eigenvectors found
displayData(U[:, :36].T)
plt.show()
input('Program paused. Press enter to continue.')


'''
%% ============= Part 6: Dimension Reduction for Faces =================
%  Project images to the eigen space using the top k eigenvectors
%  If you are applying a machine learning algorithm
'''
print('Part6: Dimension reduction for face dataset.')

K = 100
Z = projectData(X_norm, U, K)

print('The projected data Z has a size of: ')
print(Z.shape)

input('Program paused. Press enter to continue.')

'''
%% ==== Part 7: Visualization of Faces after PCA Dimension Reduction ====
%  Project images to the eigen space using the top K eigen vectors and
%  visualize only using those K dimensions
%  Compare to the original input, which is also displayed
'''
print('Part7: Visualizing the projected (reduced dimension) faces.')

K = 100
X_rec  = recoverData(Z, U, K)

# Display normalized data

displayData(X_norm[:100,:])
plt.title('Original faces')

# Display reconstructed data from only k eigenfaces

displayData(X_rec[:100,:])
plt.title('Recovered faces')
plt.show()

input('Program paused. Press enter to continue.')


'''
%% === Part 8(a): Optional (ungraded) Exercise: PCA for Visualization ===
%  One useful application of PCA is to use it to visualize high-dimensional
%  data. In the last K-Means exercise you ran K-Means on 3-dimensional
%  pixel colors of an image. We first visualize this output in 3D, and then
%  apply PCA to obtain a visualization in 2D.

from skimage import io
from kMeansInitCentroids import *
from runkMeans import *

# Reload the image from the previous exercise and run K-Means on it
# For this to work, you need to complete the K-Means assignment first
A = io.imread('bird_small.png')
# If imread does not work for you, you can try instead
#   load ('bird_small.mat');

A = A / 255
img_size = A.shape
X = A.reshape((img_size[0] * img_size[1], 3))
K = 16
max_iters = 10
initial_centroids = kMeansInitCentroids(X, K)
centroids, idx = runkMeans(X, initial_centroids, max_iters, False)

#  Sample 1000 random indexes (since working with all the data is
#  too expensive). If you have a fast computer, you may increase this.
sel = np.random.randint(X.shape[0], size=1000)

#  Assign one color per centroid (plt.cm.hsv stands in for MATLAB's hsv(K) palette)
palette = plt.cm.hsv(np.linspace(0, 1, K))
colors = palette[idx[sel].astype(int)]

#  Visualize the data and centroid memberships in 3D
#  (ax.scatter on a 3-D axis replaces MATLAB's scatter3)
from mpl_toolkits.mplot3d import Axes3D  # needed on older matplotlib versions
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[sel, 0], X[sel, 1], X[sel, 2], s=10, c=colors)
plt.title('Pixel dataset plotted in 3D. Color shows centroid memberships')
plt.show()
input('Program paused. Press enter to continue.')


## === Part 8(b): Optional (ungraded) Exercise: PCA for Visualization ===
# Use PCA to project this cloud to 2D for visualization

# Subtract the mean to use PCA
X_norm, mu, sigma = featureNormalize(X)

# PCA and project the data to 2D
U, S = pca(X_norm)
Z = projectData(X_norm, U, 2)

# Plot in 2D (the MATLAB helper plotDataPoints was not ported; a plain scatter works)
plt.scatter(Z[sel, 0], Z[sel, 1], s=15, c=colors)
plt.title('Pixel dataset plotted in 2D, using PCA for dimensionality reduction')
plt.show()
input('Program paused. Press enter to continue.')
'''

featureNormalize.py

import numpy as np
## Returns the feature-scaled samples,
## the mean of each feature,
## and the standard deviation of each feature
def featureNormalize(X):
    X_mean = np.mean(X, axis=0)  # mean of each feature
    sigma = np.std(X, axis=0, ddof=1)  # ddof=1 gives the unbiased standard deviation of each column
    X_norm = (X - X_mean) / sigma  # avoid mutating the caller's array in place
    return X_norm, X_mean, sigma

kMeansInitCentroids.py

import numpy as np

## X holds the candidate samples, K is the number of clusters
def kMeansInitCentroids(X, K):
    # Shuffle the sample indices and keep the first K samples as centroids
    random_range = np.random.permutation(X.shape[0])
    centroids = X[random_range[:K], :]
    return centroids

pca.py

import scipy.linalg

def pca(X_norm):
    Sigma = X_norm.T.dot(X_norm) / X_norm.shape[0]  # compute the covariance matrix
    U, S, V = scipy.linalg.svd(Sigma)  # singular value decomposition (yields the eigenvectors)
    # Sigma is the matrix being decomposed;
    # U is a unitary matrix whose columns are the singular vectors,
    # S holds the singular values, sorted in non-increasing order,
    # V is a unitary matrix whose rows are the singular vectors.
    return U, S
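A common heuristic for picking K (chooseK is a hypothetical helper name) is to keep the smallest number of components that retains, say, 99% of the variance, which can be read off directly from the singular values S returned above:

import numpy as np

def chooseK(S, retain=0.99):
    # Fraction of variance retained by the first k components, for every k
    ratios = np.cumsum(S) / np.sum(S)
    # Smallest k whose retained-variance fraction reaches the threshold
    return int(np.searchsorted(ratios, retain) + 1)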

projectData.py

# X is the data whose dimension is to be reduced, U is the basis, K is the target dimension
def projectData(X_norm, U, K):
    U_reduce = U[:, :K]
    return X_norm.dot(U_reduce)

recoverData.py

def recoverData(Z, U, K):
    return Z.dot(U[:, :K].T)
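A quick round-trip check of projectData and recoverData (a sketch with toy values; with K equal to the full dimension the recovery is exact, while a smaller K gives an approximation):

import numpy as np
from pca import pca
from projectData import projectData
from recoverData import recoverData

X_norm = np.array([[1.0, 1.1], [-0.9, -1.0], [0.2, 0.1]])  # toy, roughly mean-centered data
U, S = pca(X_norm)
Z = projectData(X_norm, U, 1)   # compress each sample to 1 dimension
X_rec = recoverData(Z, U, 1)    # map back into the original 2-D space
print(np.abs(X_norm - X_rec).max())  # small, since the data is nearly 1-dimensional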