Unsupervised Learning
Task
- learning a distribution from samples (GMM / VAE)
- clustering (e.g. k-means)
- feature learning (e.g. PCA)
By purpose, unsupervised algorithms roughly fall into the three categories above; the boundary between the first and the third is sometimes blurry. The k-means algorithm belongs to the clustering category.
Clustering
Algorithm
Given: data set D = {x1, x2, …, xn}
Output: a grouping of the data into K clusters
- find assignments γn,k ∈ {0,1} s.t. ∑k∈[K] γn,k = 1 for each fixed n
- find the cluster centers μ1, …, μK ∈ R^D
If Euclidean distance is chosen as the measure, the best center z of a cluster minimizes ∑n || xn − z ||². Writing μ for the mean of the cluster:
min ∑n || xn − z ||² = min ∑n || xn − μ + μ − z ||²
= min [ ∑n || xn − μ ||² + ∑n || μ − z ||² + 2 ∑n ⟨ xn − μ, μ − z ⟩ ]
Because μ is the mean, ∑n (xn − μ) = 0, so the cross term vanishes and the minimum is attained at z = μ.
K-Means Algorithm
Steps
- initialize the centers μ1, …, μK
- assignment step: assign every point to its nearest center, forming K clusters: {γn,k}^(t+1) = argmin F({γn,k}, {μk}^t)
- update step: recompute the center of each cluster: {μk}^(t+1) = argmin F({γn,k}^(t+1), {μk})
- repeat until the cluster assignments no longer change or a maximum number of iterations is reached
Here F({γn,k}, {μk}) = ∑n ∑k γn,k || xn − μk ||² is the clustering objective being minimized.
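The two alternating updates can be sketched in a few lines of numpy (a minimal single iteration; the data and current centers are arbitrary toy values):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 4.0], [5.0, 5.0]])  # toy data
mu = np.array([[0.0, 0.0], [5.0, 5.0]])                         # current centers

# assignment step: gamma[n, k] = 1 iff center k is the nearest to point n
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances, shape (n, K)
gamma = np.zeros_like(d2)
gamma[np.arange(len(X)), d2.argmin(axis=1)] = 1

# update step: mu_k = mean of the points assigned to cluster k
mu = gamma.T @ X / gamma.sum(axis=0, keepdims=True).T

print(mu)  # each row is now the mean of one cluster
```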
Distance Measure
The choice of distance measure affects the result; the main measures are
- Euclidean distance
- Manhattan distance
- Chebyshev distance
Different ways to initialize
- randomly pick K points as the initial centers
- start with every point as its own cluster
- more elaborate schemes such as k-means++
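The three distance measures above can be compared on a single pair of points (a quick sketch in plain numpy; the two points are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(a - b))          # 3 + 4 = 7.0
chebyshev = np.max(np.abs(a - b))          # max(3, 4) = 4.0

print(euclidean, manhattan, chebyshev)  # prints 5.0 7.0 4.0
```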
Characteristics
- the algorithm is guaranteed to converge
- the result depends heavily on the initialization
- it easily gets stuck in a local optimum
- the exact clustering objective is NP-hard; k-means is a heuristic for it
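The sensitivity to initialization can be seen on a toy data set: with one starting point Lloyd's iteration reaches the optimal clustering, with another it converges to a much worse fixed point (a minimal sketch; the four-point data set and both initializations are chosen purely for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])  # two tight pairs

def lloyd(X, mu, iters=10):
    # plain Lloyd iteration: assign to nearest center, recompute the means
    for _ in range(iters):
        labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([X[labels == k].mean(axis=0) for k in range(len(mu))])
    cost = np.sum((X - mu[labels]) ** 2)  # final value of the objective
    return mu, cost

_, good = lloyd(X, np.array([[0.0, 0.0], [10.0, 0.0]]))  # finds the left/right pairs
_, bad = lloyd(X, np.array([[5.0, 0.0], [5.0, 1.0]]))    # stuck at a poor fixed point

print(good, bad)  # prints 1.0 100.0
```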
Python implementation
# -*- coding: utf-8 -*-
# dataSet: m test samples of dimension n
# disMeas: distance measure (Euclidean distance here)
# createInitCent: initialization of the k centers (random here)
# clusterAssment: row i = (index of the assigned center, squared distance to it)
# centCoids: the k centers, size k*n
import numpy as np
import math
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

def distEclud(vetA, vetB):
    return math.sqrt(sum(pow(vetA - vetB, 2)))

def randCent(dataSet, k):
    # draw k random centers inside the bounding box of the data
    n = np.shape(dataSet)[1]  # dimension of the data
    centCoids = np.mat(np.zeros((k, n)))  # k*n matrix of centers (a matrix, not an array)
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        # np.random.rand(m, n) draws an m*n array of uniform values in [0, 1)
        centCoids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centCoids

def kMeans(dataSet, k, disMeas=distEclud, createInitCent=randCent):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))
    centCoids = createInitCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = float('inf')
            minIndex = -1
            # find the nearest center
            for j in range(k):
                disIJ = disMeas(centCoids.A[j, :], dataSet[i, :])
                if disIJ < minDist:
                    minDist = disIJ
                    minIndex = j
            if minIndex != clusterAssment[i, 0]:
                clusterChanged = True  # update gamma
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):  # update mu: move each center to the mean of its cluster
            # matrix.A converts a matrix to an array; np.nonzero turns the boolean
            # array into integer index arrays of its nonzero entries
            pstInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            if len(pstInClust) > 0:  # guard against empty clusters
                # np.mean with axis=0 averages over the rows, giving a 1*n array
                centCoids[cent, :] = np.mean(pstInClust, axis=0)
    return centCoids, clusterAssment

def cout(y, y_test):
    # fraction of points whose cluster index equals the true label
    # (cluster indices are arbitrary, so this only works when they happen to line up)
    count = 0
    for i in range(len(y)):
        if y[i] == y_test[i]:
            count += 1
    return float(count) / len(y)

def draw(data, center):
    length = len(center)
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], s=25, c='b', alpha=0.4)
    for i in range(length):
        plt.scatter(center[i, 0], center[i, 1], c='r')
    plt.show()

def main():
    X, y = make_blobs(random_state=1)  # generate a toy data set
    centCoids, clusterAssment = kMeans(X, 3)
    accuracy = cout(y, clusterAssment[:, 0].A.ravel())
    print(centCoids)
    draw(X, centCoids)
    print("accuracy = ", round(accuracy, 2))

main()