k-Medoids
Definition
k-Medoids is very similar to k-means; the difference lies in how the cluster centers are updated. k-means updates each center to the mean of its cluster, whereas k-medoids randomly selects k objects as the initial representative points of the k clusters, then repeatedly tries to replace a representative point with a non-representative point, keeping the replacement whenever it lowers the clustering cost, until the representatives with the lowest cost are found as the cluster centers. This partitioning is driven by the principle of minimizing the sum of dissimilarities between every object and its reference point, which is the basis of the k-medoids method.
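The criterion above is easy to compute directly: given a set of medoids and a cluster assignment, the cost is the sum of dissimilarities from each object to the medoid of its cluster. A minimal sketch using Euclidean distance (`total_cost` and its arguments are illustrative names, not part of the implementation below):

```python
import numpy as np

def total_cost(data, medoid_ids, labels):
    """Sum of dissimilarities between each object and the medoid of its cluster."""
    cost = 0.0
    for i, x in enumerate(data):
        medoid = data[medoid_ids[labels[i]]]   # the representative point of x's cluster
        cost += np.linalg.norm(x - medoid)     # Euclidean dissimilarity
    return cost

# Tiny example: two 1-D clusters, with medoids at indices 0 and 3
data = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
print(total_cost(data, [0, 3], labels))  # 0 + 1 + 1 + 0 = 2.0
```

k-medoids searches over choices of `medoid_ids` (restricted to actual data points) to make this quantity as small as possible.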
Main steps:
- Randomly select k objects as the initial medoids
- Assign each remaining object to the cluster of its nearest medoid, and compute the current total cost
- Randomly select a non-medoid object as a candidate to replace a medoid
- Compute the total cost S with the candidate as the medoid
- If S is less than the cost with the old medoid, replace the old medoid with the candidate, forming a new set of k clusters

Repeat the whole process until the medoids no longer change.
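The steps above can be sketched end-to-end in a few lines. This is a minimal, self-contained version using Euclidean distance that tries every non-medoid candidate in each cluster rather than a single random one; all names here (`k_medoids`, `trial`, etc.) are illustrative, not a definitive implementation:

```python
import numpy as np

def k_medoids(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    medoids = rng.choice(n, size=k, replace=False)  # step 1: random initial medoids
    # Precompute all pairwise Euclidean distances
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)       # step 2: assign to nearest medoid
        cost = dist[np.arange(n), medoids[labels]].sum()   # current total cost
        improved = False
        for i in range(k):
            for cand in np.where(labels == i)[0]:          # step 3: non-medoid candidates
                trial = medoids.copy()
                trial[i] = cand                            # tentatively swap medoid i
                trial_labels = np.argmin(dist[:, trial], axis=1)
                new_cost = dist[np.arange(n), trial[trial_labels]].sum()  # step 4
                if new_cost < cost:                        # step 5: keep swap if cost drops
                    medoids, cost, improved = trial, new_cost, True
        if not improved:  # no swap reduced the cost -> medoids stopped changing
            break
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Usage on two well-separated 2-D blobs
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
medoids, labels = k_medoids(data, k=2)
```

Note that after a swap is accepted, the candidate list for the remaining clusters still comes from the pre-swap assignment; the next outer iteration recomputes it, which keeps the sketch simple.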
Summary
- Compared with k-means, k-medoids is more robust to noise and outliers: a medoid is an actual data point, so it is not easily dragged around by extreme values
- Like k-means, k-medoids requires the number of clusters k to be specified in advance
- k-medoids is more expensive to run than k-means; when n and k are large, the computational cost grows far beyond that of k-means
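The robustness claim is easy to see numerically: a single outlier drags the mean far away from the bulk of the data, while the medoid, being an actual data point that minimizes the total dissimilarity, barely moves. A small 1-D illustration (variable names are illustrative):

```python
import numpy as np

points = np.array([0.0, 1.0, 2.0, 100.0])  # one extreme outlier

mean = points.mean()  # k-means style center: 25.75, dragged toward the outlier

# Medoid: the actual data point minimizing the sum of distances to all others
costs = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[np.argmin(costs)]  # stays at 1.0, inside the bulk of the data

print(mean, medoid)  # 25.75 1.0
```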
Code Implementation
```python
import numpy as np


class KMedoids():
    def __init__(self, data, k, r):
        self.data = data                                 # pandas DataFrame, one object per row
        self.k = k                                       # number of clusters
        self.r = r                                       # random seed
        self.E = np.empty(data.shape[0])                 # distance from each object to its medoid
        self.label = np.empty(data.shape[0], dtype=int)  # cluster index of each object
        self.centroids, self.centroids_id = self.__init_centroids(data, k, r)
        self.__convergence = False

    def __distance(self, x, y):
        # Euclidean distance
        return np.sqrt(np.sum((x - y) ** 2))

    def __init_centroids(self, data, k, r):
        # Randomly pick k distinct objects as the initial medoids
        rs = np.random.RandomState(r)
        num, dim = data.shape
        randid = rs.choice(num, size=k, replace=False)
        centroids = np.empty((k, dim))
        centroids_id = np.empty(k, dtype=int)
        for i, v in enumerate(randid):
            centroids[i] = data.iloc[v, :]
            centroids_id[i] = v
        return centroids, centroids_id

    def __converge(self, pre):
        # Converged when no medoid moved during the last iteration
        return np.array_equal(pre, self.centroids)

    def __update(self):
        rs = np.random.RandomState(self.r)
        for i in range(self.k):
            # Pick one random non-medoid object in this cluster as a candidate
            member_ids = np.where(self.label == i)[0]
            noncenter_ids = member_ids[member_ids != self.centroids_id[i]]
            if len(noncenter_ids) == 0:
                continue
            randid = rs.choice(noncenter_ids)
            point = self.data.iloc[randid, :]
            # Cost of the cluster with the current medoid
            origin_cost = np.sum(self.E[member_ids])
            # Cost of the cluster if `point` became the medoid
            new_cost = 0.0
            for j in member_ids:
                new_cost += self.__distance(point, self.data.iloc[j, :])
            # Keep the swap only if it lowers the cluster cost (S < 0)
            if new_cost - origin_cost < 0:
                self.centroids[i] = point
                self.centroids_id[i] = randid

    def fit(self):
        num, dim = self.data.shape
        iteration = 0
        while not self.__convergence:
            print('Iteration %d :' % iteration)
            pre_centroids = self.centroids.copy()
            # Assignment step: attach every object to its nearest medoid
            for i in range(num):
                min_dist, neighbour = np.inf, -1
                for j in range(self.k):
                    dist = self.__distance(self.data.iloc[i, :], self.centroids[j])
                    if dist < min_dist:
                        min_dist, neighbour = dist, j
                self.label[i] = neighbour
                self.E[i] = min_dist
            # Update step: try to swap each medoid with a random cluster member
            self.__update()
            # Check if converged
            self.__convergence = self.__converge(pre_centroids)
            iteration += 1
            if iteration == 200:
                break
        return self.label
```