Learning resources for the scikit-learn package

http://scikit-learn.org/stable/modules/clustering.html#k-means

http://my.oschina.net/u/175377/blog/84420

K-Means clustering parameter reference:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)

n_clusters : int, optional, default: 8

The number of clusters to form, which is also the number of centroids to generate.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

n_init : int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia.
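The interplay of n_init, max_iter and inertia can be sketched in plain Python. This is only an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names (`squared_distance`, `single_run`, `best_of_n_init`) are hypothetical, initial centers are drawn at random from the data, and `tol` is compared against the total squared shift of the centers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_run(points, k, max_iter, tol, rng):
    """One k-means run from random initial centers; returns (inertia, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        shift = sum(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # simplified convergence test (assumption)
            break
    inertia = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return inertia, centers


def best_of_n_init(points, k, n_init=10, max_iter=300, tol=1e-4, seed=0):
    """Repeat the run n_init times and keep the result with the lowest inertia."""
    rng = random.Random(seed)
    runs = (single_run(points, k, max_iter, tol, rng) for _ in range(n_init))
    return min(runs, key=lambda result: result[0])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
inertia, centers = best_of_n_init(points, k=2, n_init=5)
```

Here inertia is taken as the sum of squared distances from each sample to its closest center, so a lower value means tighter clusters.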

init : {‘k-means++’, ‘random’ or an ndarray}

Method for initialization, defaults to ‘k-means++’:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose k observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
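To make the "smart way" behind ‘k-means++’ more concrete, here is a minimal sketch of the standard k-means++ seeding idea: the first center is picked uniformly at random, and each further center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_plus_plus` is hypothetical, the input is assumed to contain at least k distinct points, and scikit-learn's own routine may differ in detail.

```python
import random


def kmeans_plus_plus(points, k, seed=0):
    """Pick k initial centers from `points` (tuples) with k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform random choice
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        weights = [
            min(sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers)
            for p in points
        ]
        # Far-away points get a proportionally higher chance of being picked;
        # points that are already centers have weight 0 and cannot be picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_centers = kmeans_plus_plus(points, k=2)
```

The resulting list has shape (n_clusters, n_features) in spirit, so after conversion to an array it could also be passed as the ndarray form of init.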

precompute_distances : {‘auto’, True, False}

Precompute distances (faster but takes more memory).

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances
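The 100MB figure follows from simple arithmetic: 12 million double-precision values at 8 bytes each come to 96 MB. The sketch below restates the three settings as a small decision function; both helper names are hypothetical and only mirror the rule described above, they are not part of scikit-learn.

```python
def should_precompute(n_samples, n_clusters, precompute_distances="auto"):
    """Mirror the documented rule: 'auto' precomputes only up to 12 million entries."""
    if precompute_distances == "auto":
        return n_samples * n_clusters <= 12_000_000
    return bool(precompute_distances)  # True: always, False: never


def distance_matrix_megabytes(n_samples, n_clusters, bytes_per_value=8):
    """Memory for an n_samples x n_clusters matrix of doubles, in MB (10**6 bytes)."""
    return n_samples * n_clusters * bytes_per_value / 1e6


small = should_precompute(100_000, 8)        # 800,000 entries
large = should_precompute(2_000_000, 8)      # 16,000,000 entries
at_threshold_mb = distance_matrix_megabytes(1_500_000, 8)  # exactly 12 million entries
```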

tol : float, default: 1e-4

Relative tolerance with regard to inertia to declare convergence.

n_jobs : int

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
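The rule for negative n_jobs values is easy to check with a small helper. `cpus_used` is a hypothetical name used only for illustration; rejecting 0 and clamping the result to at least 1 are assumptions added here, not something the text above states.

```python
def cpus_used(n_jobs, n_cpus):
    """Number of CPUs used for a given n_jobs, following (n_cpus + 1 + n_jobs)."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")  # assumption
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        return max(n_cpus + 1 + n_jobs, 1)  # clamp is an assumption
    return n_jobs


all_cpus = cpus_used(-1, n_cpus=8)
all_but_one = cpus_used(-2, n_cpus=8)
serial = cpus_used(1, n_cpus=8)
```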

random_state : integer or numpy.RandomState, optional

The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.

verbose : int, default 0

Verbosity mode.

copy_x : boolean, default True

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

Attributes:

cluster_centers_ : array, [n_clusters, n_features]

Coordinates of cluster centers

labels_ : array, [n_samples]

Labels of each point

inertia_ : float

Sum of squared distances of samples to their closest cluster center.
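A short usage sketch ties the parameters and the fitted attributes together. The toy data is made up for illustration, and the call uses only a subset of the constructor arguments listed in the signature, on the assumption that not every argument is accepted by every scikit-learn release.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of three points each (made-up toy data).
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

model = KMeans(n_clusters=2, init="k-means++", n_init=10,
               max_iter=300, tol=1e-4, random_state=0)
model.fit(X)

print(model.cluster_centers_)  # one row of coordinates per cluster
print(model.labels_)           # cluster index assigned to each row of X
print(model.inertia_)          # sum of squared distances to the closest center
```

Fixing random_state to an integer makes the run repeatable, which is handy when comparing different settings of n_init or init.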