sklearn clustering algorithm: MeanShift

Official reference documentation

Algorithm principle

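Translation tool: https://cn.bing.com/translator, with manual corrections.
This is an unsupervised clustering algorithm: it needs neither labels nor a preset number of clusters.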

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm that works by updating candidate centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centroids.

Given a candidate centroid $x_i$ for iteration $t$, the candidate is updated according to the following equation:

$$x_i^{t+1} = m(x_i^t)$$

Here $N(x_i)$ is the neighborhood of samples within a given distance around $x_i$, $K$ is a kernel that weights each neighbor, and $m(x_i)$ is the kernel-weighted mean of that neighborhood. The displacement $m(x_i) - x_i$, the mean shift vector, is computed for each centroid and points towards the region of maximum increase in the density of points. $m$ is computed using the following equation, which effectively updates a centroid to be the mean of the samples within its neighborhood:
$$m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i)\, x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)}$$
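As a minimal sketch of this update, the following assumes a flat kernel (K = 1 for every sample inside the neighborhood), in which case the weighted mean reduces to the plain average of the neighbors. The helper name mean_shift_step and the toy data are illustrative and not part of scikit-learn.

```python
import numpy as np

def mean_shift_step(x, samples, bandwidth):
    """Return m(x): the mean of the samples within `bandwidth` of x.

    Assumes a flat kernel and at least one sample inside the neighborhood.
    """
    distances = np.linalg.norm(samples - x, axis=1)
    neighbors = samples[distances <= bandwidth]  # N(x)
    # With K = 1 inside N(x), the weighted mean is a plain mean.
    return neighbors.mean(axis=0)

# Three nearby points and one far-away point (toy data).
samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
x_old = np.array([0.0, 0.0])
x_new = mean_shift_step(x_old, samples, bandwidth=1.5)
print(x_new)
```

The far-away point lies outside the neighborhood, so the candidate moves to the average of the three nearby points only.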

The algorithm sets the number of clusters automatically. Instead of a cluster count, it relies on a parameter bandwidth, which dictates the size of the region to search through. This parameter can be set manually, or it can be estimated using the provided estimate_bandwidth function, which is called if the bandwidth is not set.
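To illustrate how bandwidth drives the number of clusters, here is a small sketch that fits MeanShift twice on the same synthetic data: once with the estimated bandwidth and once with a deliberately oversized one. The data sizes, the fixed random_state and the value 10.0 are assumptions chosen for the illustration.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.3,
                  random_state=0)

# Estimate a bandwidth from pairwise distances within a subsample.
estimated = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)
print("estimated bandwidth: %.3f" % estimated)

# Compare the estimated bandwidth with one that covers the whole data set.
n_clusters = {}
for bandwidth in (estimated, 10.0):
    ms = MeanShift(bandwidth=bandwidth).fit(X)
    n_clusters[bandwidth] = len(ms.cluster_centers_)
    print("bandwidth=%.3f -> %d clusters" % (bandwidth, n_clusters[bandwidth]))
```

With a bandwidth larger than the spread of the data, every neighborhood contains all samples, so all candidates collapse onto a single centroid.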

The algorithm is not highly scalable, as it requires multiple nearest-neighbor searches during execution. It is guaranteed to converge; in practice, it stops iterating once the change in centroids is small.
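The stopping rule can be sketched for a single candidate as follows. This is a simplified illustration with a flat kernel, not scikit-learn's implementation; the names mean_shift_seed, tol and max_iter, and their default values, are assumptions made for the example.

```python
import numpy as np

def mean_shift_seed(seed, samples, bandwidth, tol=1e-3, max_iter=300):
    """Shift one candidate centroid until it moves less than `tol`."""
    x = np.asarray(seed, dtype=float)
    for n_iter in range(1, max_iter + 1):
        # One neighbor search per iteration: this is the costly part.
        distances = np.linalg.norm(samples - x, axis=1)
        neighbors = samples[distances <= bandwidth]
        if len(neighbors) == 0:  # empty neighborhood: nothing to move toward
            break
        x_next = neighbors.mean(axis=0)
        shift = np.linalg.norm(x_next - x)
        x = x_next
        if shift < tol:  # change in the centroid is small: stop iterating
            break
    return x, n_iter

# One Gaussian blob around (2, -1); start from the first sample.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))
center, n_iter = mean_shift_seed(samples[0], samples, bandwidth=1.0)
print(center, n_iter)
```

Each iteration needs its own neighbor search over all samples, which is why the cost grows quickly with the number of samples and candidates.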

Implementation

Reference :

sklearn.cluster.MeanShift

For usage, see my previous blog post: sklearn clustering algorithm affinity propagation

Program:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

# #############################################################################
# Compute clustering with MeanShift

# The following bandwidth can be automatically detected using
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
# Label of each sample: a 1D array whose length is the number of samples
labels = ms.labels_
# Cluster centers
cluster_centers = ms.cluster_centers_
# Distinct label values
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
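Once fitted, the model can also assign new points to the nearest learned cluster center through its predict method. The sketch below is a separate, self-contained example; the smaller data set, the fixed random_state and the hand-picked bandwidth of 0.8 are assumptions for illustration and do not come from the program above.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.3,
                  random_state=0)

# Hand-picked bandwidth (an assumption), roughly the size of one blob.
ms = MeanShift(bandwidth=0.8, bin_seeding=True).fit(X)

# Assign unseen points to the closest cluster center.
new_points = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
new_labels = ms.predict(new_points)
nearest_centers = ms.cluster_centers_[new_labels]
print(new_labels)
print(nearest_centers)
```

The label values themselves are arbitrary indices into cluster_centers_, so only the mapping from label to center is meaningful.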

