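When we need to group scattered data points, one of the clustering methods we reach for most often is KMeans.
KMeans is a distance-based, exclusive clustering method: given a value of K, the algorithm divides the data into K partitions, and each partition is one cluster. No cluster is empty, and every point belongs to exactly one cluster.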
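Traditional KMeans clustering
- Randomly select K of the data points as the centroids of the K clusters;
- Measure the distance from every other point to each centroid, assign the point to the cluster whose centroid is nearest, and recompute the centroids of the affected clusters;
- Repeat the second step until the centroids stop changing or change by less than a threshold.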
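We generally use the Euclidean distance as the distance measure. To guard against poorly chosen initial points that keep the algorithm from converging, a maximum number of iterations is also set.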
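The steps above can be sketched compactly with NumPy. This is a minimal illustration rather than the implementation developed in this post: the function name `kmeans_sketch`, the parameter names `tol` and `seed`, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iterations=500, tol=1e-4, seed=0):
    """Illustrative K-means loop; names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    for _ in range(max_iterations):
        # Step 2a: Euclidean distance from every sample to every centroid, shape (n, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2b: recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 3: stop once no centroid moved by more than tol.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X, k=2)
```

The `max_iterations` cap plays the role described above: even if the centroids never settle within `tol`, the loop still ends.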
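The traditional KMeans implementation below relies on the following NumPy functions:
- np.linalg.norm: computes a norm, here the L2 norm
- np.tile: repeats an array the given number of times along each axis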
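A small example may make the two functions concrete. The arrays `sample` and `X` are made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

sample = np.array([0.0, 0.0])
X = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# With the default order, np.linalg.norm returns the L2 norm of a vector.
length = np.linalg.norm(np.array([3.0, 4.0]))   # 5.0

# np.tile repeats `sample` X.shape[0] times along axis 0 and once along
# axis 1, so `tiled` has the same shape as X: (3, 2).
tiled = np.tile(sample, (X.shape[0], 1))

# Row-wise L2 norm of the differences gives the Euclidean distance
# from `sample` to every row of X.
distances = np.linalg.norm(tiled - X, axis=1)   # [5.0, 10.0, 1.0]
```

NumPy broadcasting would let `sample - X` work without `np.tile`; the explicit tiling is shown here only because it is the function listed above.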
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
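def normalize(X, axis=-1, p=2):
    """
    Normalize the data set X so that each vector along `axis` has unit Lp norm.
    :param X: the input data set
    :param axis: the axis along which the norm is computed
    :param p: the order of the Lp norm
    :return: the normalized data set, same shape as X
    """
    lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))
    # Avoid division by zero: all-zero vectors are left unchanged.
    lp_norm[lp_norm == 0] = 1
    return X / np.expand_dims(lp_norm, axis)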
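def euclidean_distance(one_sample, X):
    """
    Calculate the squared Euclidean distance between one sample and every row of X.
    The square root is omitted; it does not change which row is nearest.
    :param one_sample: the sample to measure from
    :param X: all sample points in the data
    :return: an array of shape (X.shape[0],) holding the squared distances
    """
    # reshape returns a new array, so the result must be assigned back.
    one_sample = one_sample.reshape(1, -1)
    X = X.reshape(X.shape[0], -1)
    # Tile the sample to X's shape, subtract X, square, and sum over the features.
    distance = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distance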
class Kmeans():
    def __init__(self, X, k=2, max_iterations=500, varepsilon=0.0001):
        """
        :param X: the data set
        :param k: the number of clusters
        :param max_iterations: the maximum number of iterations
        :param varepsilon: the convergence threshold for centroid movement
        """
        self.X = X
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon
    def init_random_centroids(self):
        """
        Randomly choose k distinct sample points from self.X as the initial centroids.
        :return: an array containing k centroids, of shape (k, X.shape[1])
        """
        n_samples, n_features = np.shape(self.X)
        centroids = np.zeros((self.k, n_features))
        # Sample indices without replacement so the same point is never picked twice.
        indices = np.random.choice(n_samples, self.k, replace=False)
        for i, index in enumerate(indices):
            centroids[i] = self.X[index]
        return centroids
    def closest_centroid(self, sample, centroids):
        """
        Get the index of the centroid nearest to the sample.
        :param sample: the sample to assign
        :param centroids: all current centroids, one per row
        :return: the index of the nearest centroid
        """
        distances = euclidean_distance(sample, centroids)
        closest_index = np.argmin(distances)
        return closest_index