Today’s Job
Today’s job is mainly about reading the source of plot_color_quantization.py and k_means_.py under scikit-learn-0.15.2\sklearn\cluster in scikit-learn-0.15.2.
Gains
- pairwise_distances_argmin:
Compute minimum distances between one point and a set of points.
- shuffle:
Shuffle arrays or sparse matrices in a consistent way.
- Lloyd’s algorithm and Voronoi diagram
- check_random_state(seed):
Turn seed into a np.random.RandomState instance.
- inertia:
Sum of distances of samples to their closest cluster center.
- E-step and M-step:
Label assignment is also called the E-step of EM; computation of the means is also called the M-step of EM.
- _tolerance(X, tol):
Return a tolerance which is independent of the dataset.
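To tie the notes above together, here is a minimal pure-Python sketch of Lloyd’s algorithm. It is not the scikit-learn implementation. All function names (squared_distance, assign_labels, update_centers, scaled_tolerance, lloyd) are made up for illustration. random.Random(seed) stands in for check_random_state, the argmin inside assign_labels plays the role of pairwise_distances_argmin, and inertia is accumulated from squared Euclidean distances, which is an assumption of this sketch. The tolerance is scaled by the mean per-feature variance, which is my reading of what _tolerance does.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(points, centers):
    """E-step: closest-center index for each point, plus the total inertia."""
    labels = []
    inertia = 0.0
    for p in points:
        dists = [squared_distance(p, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        labels.append(best)
        inertia += dists[best]
    return labels, inertia


def update_centers(points, labels, centers):
    """M-step: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centers.append(old)  # empty cluster: keep the old center
            continue
        mean = tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        )
        new_centers.append(mean)
    return new_centers


def scaled_tolerance(points, tol):
    """Scale tol by the mean per-feature variance of the data."""
    n, dim = len(points), len(points[0])
    variances = []
    for d in range(dim):
        mean = sum(p[d] for p in points) / n
        variances.append(sum((p[d] - mean) ** 2 for p in points) / n)
    return tol * sum(variances) / dim


def lloyd(points, n_clusters, seed=0, max_iter=100, tol=1e-4):
    """Alternate E-step and M-step until the centers stop moving."""
    rng = random.Random(seed)  # stand-in for check_random_state(seed)
    centers = rng.sample(points, n_clusters)  # plain random init, not k-means++
    threshold = scaled_tolerance(points, tol)
    for _ in range(max_iter):
        labels, _ = assign_labels(points, centers)
        new_centers = update_centers(points, labels, centers)
        shift = sum(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            break
    # Recompute labels and inertia so they match the final centers.
    labels, inertia = assign_labels(points, centers)
    return centers, labels, inertia
```

Calling lloyd(points, 2) on a list of coordinate tuples returns the final centers, one label per point, and the inertia. Each center’s Voronoi cell is exactly the region of points that the E-step would assign to it.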
Questions to be solved
def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None):
    """Init n_clusters seeds according to k-means++
    Selects initial cluster centers for k-mean clustering in a smart way
    to speed up convergence. see: Arthur, D. and Vassilvitskii, S.
    "k-means++: the advantages of careful seeding". ACM-SIAM symposium
    on Discrete algorithms. 2007

    Version ported from http://www.stanford.edu/~darthur/kMeansppTest.zip,
    which is the implementation used in the aforementioned paper.