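[About this series]
This series reviews machine learning methods: it summarizes each algorithm's workflow, examines the source code where useful, lists datasets suited to the algorithm, and notes the important tuning parameters.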
Case 1: Mean shift clustering (MeanShift)
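Dataset:
(1) The dataset is generated with numpy. It is a planar dataset of two-dimensional vectors, represented by a matrix A (2*10000).
(2) Three cluster centers are defined in centers: (1, 1), (-1, -1), and (1, -1).
(3) With the per-cluster standard deviation cluster_std set to 0.6, the clusters are only just separable (some points from neighbouring clusters overlap).
(4) Set the bandwidth. The estimate_bandwidth function estimates the bandwidth for the mean-shift algorithm. If no bandwidth argument is passed to MeanShift, MeanShift runs estimate_bandwidth automatically. The function is documented as follows: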
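Before reading the docstring, it may help to see steps (1) to (4) as code. The following is a minimal sketch, not the original author's script: it assumes the data are generated with scikit-learn's make_blobs, and the values quantile=0.2, n_samples=500, random_state=0, and bin_seeding=True are illustrative choices that do not come from the text above. Note that make_blobs returns an array of shape (10000, 2), one row per point, which is the transpose of the 2*10000 layout mentioned in (1).

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# (2) Three cluster centers in the plane.
centers = [[1, 1], [-1, -1], [1, -1]]

# (1) + (3) 10000 two-dimensional points, standard deviation 0.6 per cluster.
# random_state is an assumption, added only to make the sketch reproducible.
X, _ = make_blobs(n_samples=10000, centers=centers,
                  cluster_std=0.6, random_state=0)

# (4) Estimate the bandwidth from a subsample of the data.
# quantile and n_samples are illustrative values, not taken from the text.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)

# Fit mean shift with the estimated bandwidth.
# bin_seeding=True (an assumption) seeds from a coarse grid to speed up the fit.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)

labels = ms.labels_                    # cluster index for every point
cluster_centers = ms.cluster_centers_  # one row per estimated center
n_clusters = len(np.unique(labels))

print("estimated bandwidth:", bandwidth)
print("number of estimated clusters:", n_clusters)
```

Passing n_samples to estimate_bandwidth keeps the pairwise-distance computation small, which matters because the estimate is at least quadratic in the number of samples used. The signature and docstring of estimate_bandwidth, as quoted in the source, follow.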
def estimate_bandwidth(X, quantile=0.3, n_samples=None, random_state=0,
                       n_jobs=1):
    """Estimate the bandwidth to use with the mean-shift algorithm.

    Note that this function takes time at least quadratic in n_samples.
    For large datasets, it's wise to set that parameter to a small value.

    Parameters
    ----------
    X : array-like, shape=[n_samples, n_features]
        Input points.

    quantile : float, default 0.3
        should be between [0, 1]
        0.5 means that the median of all pairwise distances is used.

    n_samples : int, optional
        The number of samples to use. If not given, all samples are used.

    random_state : int or RandomState
        Pseudo-random number generator state used for random sampling.

    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run for neighbors search.
        If ``-1``, then the number of jobs is set to the number of CPU cores.

    Returns
    -------
    bandwidth : float
        The bandwidth parameter.
    """
    # Based on r