Learning Clustering Algorithms: sklearn.cluster.KMeans

Latest recommended article published on 2024-08-09 23:32:16

清萝卜头

Category: python. Tags: sklearn.cluster.KMea sklearn

Article link: https://blog.csdn.net/xiaoql520/article/details/78269539

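This article describes the input parameters of the sklearn.cluster.KMeans clustering algorithm in detail, including n_clusters, init, n_init, max_iter and tol, and explains what each parameter does and its default value. It also covers the convergence criterion, the option to precompute distances, and how to choose the number of processes. It closes with the estimator's attributes, such as cluster_centers_, labels_ and inertia_, along with related examples and references.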

Summary generated automatically by CSDN.

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

(I) Input parameters:

(1) n_clusters: the number of clusters to form, which is also the number of centroids to generate. Type: int, optional, default 8.

n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
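As a minimal sketch of how n_clusters is used, the snippet below fits KMeans on a tiny hand-made dataset. The data and the choice of two clusters are assumptions made purely for illustration; n_init is passed explicitly so the call behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated groups in 2-D.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])

# n_clusters sets both the number of clusters and the number of centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)  # one row per centroid: (2, 2)
print(kmeans.labels_)                 # cluster index assigned to each sample
```

The fitted estimator exposes one centroid per requested cluster in cluster_centers_, and labels_ holds the cluster index of every training sample.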

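(2) init: the method used to initialize the centroids. There are three options: 'k-means++', 'random', or an ndarray. The default is 'k-means++'.

'k-means++': picks the initial centroids in a smart way so that the iterations converge faster; see the notes for k_init for more details.
'random': picks the initial centroids at random from the training data.
If an ndarray is passed, it should have shape (n_clusters, n_features) and supplies the initial centroids.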

init : {'k-means++', 'random' or an ndarray}

Method for initialization, defaults to 'k-means++':

'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

'random': choose k observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
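The three init options can be tried side by side. This is only a sketch: the toy data and the starting centroids in `start` are assumptions chosen so the result is easy to check by hand. When an explicit ndarray is supplied, a single run (n_init=1) is enough, because every run would start from the same centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration).
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])

# Option 1: 'k-means++' (the default) spreads the initial centroids out.
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# Option 2: 'random' picks n_clusters rows of X as the initial centroids.
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

# Option 3: an explicit ndarray of shape (n_clusters, n_features).
start = np.array([[0.0, 0.0], [11.0, 5.0]])  # hypothetical starting centroids
km_arr = KMeans(n_clusters=2, init=start, n_init=1).fit(X)

# From these starting points the centroids settle at (1, 2) and (10, 2).
print(km_arr.cluster_centers_)
```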

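(3) n_init: the number of times the algorithm is run with different centroid seeds. Type: int, default 10. The final result is the best of these runs in terms of inertia.

(Note: each run starts from randomly generated centroid seeds, so some results are better than others. The algorithm is therefore run n_init times and the best result is kept.)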

n_init : int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
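To make the role of n_init concrete, the sketch below emulates it by hand: it performs several single runs (n_init=1) with different seeds and keeps the lowest inertia. The synthetic blobs and the number of seeds are assumptions for illustration, and the loop is not how scikit-learn implements n_init internally; it only shows the same idea of picking the best run.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Synthetic data (an assumption): three Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=center, scale=1.0, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

# Several independent single runs, each starting from different random centroids.
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

# n_init keeps the run with the lowest inertia; here that is the minimum.
best = min(inertias)
print(inertias)
print(best)
```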

(4) max_iter: the maximum number of iterations for a single run of the algorithm. Type: int, default 300.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

(5) tol: the tolerance that, together with inertia, determines the convergence criterion. Type: float, default 1e-4.

tol : float, default: 1e-4

Relative tolerance with regards to inertia to declare convergence.
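max_iter and tol together decide when a single run stops: either the iteration cap is reached or the change falls below the tolerance. The sketch below uses synthetic data (an assumption) and reads the fitted attribute n_iter_ to see how many iterations were actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic data (an assumption): three Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=center, scale=1.0, size=(100, 2))
               for center in ([0, 0], [6, 6], [0, 6])])

# A very small cap: the run stops after at most 2 iterations.
km_short = KMeans(n_clusters=3, n_init=1, max_iter=2, random_state=0).fit(X)

# The defaults: up to 300 iterations, stopping early once tol is satisfied.
km_full = KMeans(n_clusters=3, n_init=1, max_iter=300, tol=1e-4,
                 random_state=0).fit(X)

print(km_short.n_iter_)  # never larger than max_iter
print(km_full.n_iter_)   # usually far below 300 on easy data
```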

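(6) precompute_distances: whether to precompute distances, which is faster but uses more memory. There are three options (auto, True, False); the default is 'auto'.

'auto': do not precompute distances if n_samples * n_clusters exceeds 12 million.

True: always precompute distances.

False: never precompute distances.

This parameter trades memory for speed. With True, the whole distance matrix is kept in memory. With 'auto', precomputation is switched off (as with False) once n_samples * n_clusters exceeds 12e6. With False, the core computation is implemented in Cython.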
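The 'auto' rule is simple arithmetic, so it can be sketched in plain Python. Both helper functions below are hypothetical and only mirror the rule as documented above (they are not part of scikit-learn); the memory estimate assumes 8-byte double-precision values. Note also that precompute_distances belongs to the older signature shown at the top of this article and was removed in later scikit-learn releases, so the sketch does not pass it to KMeans.

```python
def would_precompute(n_samples, n_clusters, threshold=12e6):
    """Hypothetical helper: True if 'auto' would precompute distances."""
    # Documented rule: skip precomputation when n_samples * n_clusters > 12 million.
    return n_samples * n_clusters <= threshold


def distance_matrix_megabytes(n_samples, n_clusters, bytes_per_value=8):
    """Hypothetical helper: size of an (n_samples, n_clusters) float64 matrix in MB."""
    return n_samples * n_clusters * bytes_per_value / 1e6


print(would_precompute(100_000, 8))             # True: 800,000 entries
print(would_precompute(2_000_000, 8))           # False: 16,000,000 entries
print(distance_matrix_megabytes(1_500_000, 8))  # 96.0 (MB, right at the threshold)
```

At the 12 million entry threshold the distance matrix takes roughly 100 MB in double precision, which is the memory cost that the 'auto' setting is designed to cap.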