sklearn clustering algorithms: MeanShift

patrickpdx

Article link: https://blog.csdn.net/Jinyindao243052/article/details/107340583

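Official reference documentation

How the algorithm works

Mean shift is an unsupervised clustering algorithm: it needs neither labels nor the number of clusters in advance. The passages below are taken from the scikit-learn documentation.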

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm that works by updating candidate centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates, which yields the final set of centroids.

Given a candidate centroid $x_i$ for iteration $t$, the candidate is updated according to the following equation:

$x_i^{t+1} = m(x_i^t)$

Here $N(x_i)$ is the neighborhood of samples within a given distance around $x_i$, $K$ is the kernel that weights each neighbor, and $m$ is computed for each centroid. The displacement $m(x_i) - x_i$ is the mean shift vector, which points towards the region of maximum increase in the density of points. $m$ is computed with the following equation, which effectively updates a centroid to be the mean of the samples within its neighborhood:

$m(x_i) = \frac{\sum_{x_j \in N(x_i)}K(x_j - x_i)x_j}{\sum_{x_j \in N(x_i)}K(x_j - x_i)}$
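As a rough illustration of the update above, here is a minimal NumPy sketch of a single step. It assumes a flat kernel ($K = 1$ for every sample inside the neighborhood), so $m(x_i)$ reduces to a plain average. The function name `mean_shift_step` and the toy data are made up for this example and are not part of scikit-learn.

```python
import numpy as np


def mean_shift_step(x, samples, bandwidth):
    """Return m(x) under a flat kernel: the mean of the samples within `bandwidth` of x."""
    distances = np.linalg.norm(samples - x, axis=1)
    neighbors = samples[distances <= bandwidth]  # N(x)
    if len(neighbors) == 0:
        # No sample in range: leave the candidate where it is.
        return x
    return neighbors.mean(axis=0)


# Toy data (hypothetical): three nearby points and one far away.
samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
x = np.array([0.0, 0.0])

x_next = mean_shift_step(x, samples, bandwidth=1.5)
print("x^t     =", x)
print("x^(t+1) =", x_next)
```

The point at (5, 5) lies outside the neighborhood, so the candidate moves to the average of the three nearby points only.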

The algorithm sets the number of clusters automatically. Instead of a cluster count, it relies on the parameter bandwidth, which dictates the size of the region to search through. This parameter can be set manually or estimated with the provided estimate_bandwidth function, which is called if the bandwidth is not set.
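To see how the bandwidth drives the number of clusters, the following sketch fits MeanShift with a few bandwidths around the estimated one. The data set, the scaling factors, and the `random_state` values are arbitrary choices for illustration, and the exact cluster counts depend on the data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Small synthetic data set (illustrative sizes, fixed seed for repeatability).
X, _ = make_blobs(n_samples=600, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.6, random_state=0)

# Bandwidth derived from pairwise distances in the data.
estimated = estimate_bandwidth(X, quantile=0.2, random_state=0)
print("estimated bandwidth: %.3f" % estimated)

# Refit with a smaller, the estimated, and a much larger bandwidth.
cluster_counts = {}
for name, factor in (("0.5x", 0.5), ("1x", 1.0), ("10x", 10.0)):
    ms = MeanShift(bandwidth=factor * estimated, bin_seeding=True).fit(X)
    cluster_counts[name] = len(ms.cluster_centers_)
    print("bandwidth %s -> %d cluster(s)" % (name, cluster_counts[name]))
```

A bandwidth much larger than the spread of the data places every sample in the same neighborhood, so all candidates should collapse into a single cluster. Smaller bandwidths tend to split the data into more clusters.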

The algorithm is not highly scalable, as it requires multiple nearest neighbor searches during its execution. It is guaranteed to converge; in practice, it stops iterating once the change in centroids is small.
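The stopping rule can be sketched as a loop that repeats the update until the candidate barely moves. This is a simplified illustration with a flat kernel and a brute-force neighbor search. The names `shift_until_converged`, `tol`, and `max_iter` and their default values are assumptions made for this example, not scikit-learn's internals.

```python
import numpy as np


def shift_until_converged(seed, samples, bandwidth, tol=1e-3, max_iter=300):
    """Repeat the mean shift update until the candidate moves less than `tol`."""
    x = np.asarray(seed, dtype=float)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # One neighbor search per iteration: this is the expensive part.
        distances = np.linalg.norm(samples - x, axis=1)
        neighbors = samples[distances <= bandwidth]
        if len(neighbors) == 0:
            break
        x_new = neighbors.mean(axis=0)
        moved = np.linalg.norm(x_new - x)
        x = x_new
        if moved < tol:
            break
    return x, iteration


# Hypothetical data: one Gaussian blob centered at (2, -1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(500, 2))

center, n_iter = shift_until_converged(samples[0], samples, bandwidth=1.0)
print("converged to", center, "after", n_iter, "iteration(s)")
```

Every seed runs its own loop, and every iteration searches the whole data set, which is why the cost grows quickly with the number of samples.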

Implementation

Reference:

sklearn.cluster.MeanShift

For usage, see my previous post, "sklearn clustering algorithms: affinity propagation".

Code:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

# #############################################################################
# Compute clustering with MeanShift

# Estimate the bandwidth from the data, using a 500-sample subset of X
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
# Label of each sample: a 1-D array with one entry per sample
labels = ms.labels_
# Cluster centers
cluster_centers = ms.cluster_centers_
# Distinct labels
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()