#by muzhen
'''
Algorithm:
initialize k centroids
for each sample in the full sample set:
    calculate the distance between the sample and each centroid using the distance function,
    select the centroid with the smallest distance to the sample and assign the sample to its cluster
update each of the k centroids to the mean of the samples assigned to it
repeat the assignment and update steps until no sample changes cluster or the loop count reaches the specified maximum.
'''
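The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn's implementation: the names lloyd_kmeans and squared_distance are hypothetical, the distance function is assumed to be squared Euclidean, and the initial centroids are simply k samples drawn at random.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(samples, k, max_iter=100, seed=0):
    """Plain k-means: random init, then alternate assignment and update."""
    rng = random.Random(seed)
    # Initialize the k centroids with k samples chosen at random (assumption).
    centroids = [list(p) for p in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample takes the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in samples
        ]
        if new_labels == labels:
            break  # no sample changed cluster, so stop early
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids, labels = lloyd_kmeans(points, k=2)
print(centroids, labels)
```

Which clustering this returns depends on the random initial centroids, which is exactly the weakness discussed in the Background section.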
Background
The k-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard,[4] the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.
However, the k-means algorithm has at least two major theoretic shortcomings:
- First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.[5]
- Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.
The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is, in expectation, O(log k)-competitive with the optimal k-means solution.
Example of a sub-optimal clustering
To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in R^2 that form an axis-aligned rectangle whose width is greater than its height.
If k = 2 and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm converges immediately, without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together—a suboptimal clustering because the width of the rectangle is greater than its height.
Now, consider stretching the rectangle horizontally to an arbitrary width. The standard k-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.
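The rectangle example can be checked numerically. The sketch below (hypothetical helper names, corner coordinates chosen for illustration) computes the k-means objective for the top/bottom clustering the algorithm gets stuck in and for the left/right clustering, for a few widths at fixed height.

```python
def kmeans_cost(clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for points in clusters:
        centroid = [sum(col) / len(points) for col in zip(*points)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in points
        )
    return total


def rectangle_costs(width, height):
    """Return (top/bottom cost, left/right cost) for the four corner points."""
    bl, br = (0.0, 0.0), (width, 0.0)
    tl, tr = (0.0, height), (width, height)
    top_bottom = kmeans_cost([[bl, br], [tl, tr]])  # the clustering k-means keeps
    left_right = kmeans_cost([[bl, tl], [br, tr]])  # the better clustering
    return top_bottom, left_right


for width in (2.0, 10.0, 100.0):
    top_bottom, left_right = rectangle_costs(width, height=1.0)
    print(width, top_bottom, left_right, top_bottom / left_right)
```

The top/bottom cost works out to width^2 and the left/right cost to height^2, so the ratio (width/height)^2 grows without bound as the rectangle is stretched.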
from sklearn.cluster import KMeans
KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
       precompute_distances='auto', verbose=0, random_state=None, copy_x=True,
       n_jobs=1, algorithm='auto')
Attributes
----------
cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers
labels_ :
Labels of each point
inertia_ : float
Sum of squared distances of samples to their closest cluster center.
Understanding inertia is necessary to understand the parameters!
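Since several parameters are defined in terms of inertia, a small sketch of how it could be computed and used to compare results may help. The function name inertia and the candidate centers are made up for illustration; the two candidates stand in for the outcomes of two separate runs, of which the one with the lower inertia is kept.

```python
def inertia(samples, centers):
    """Sum of squared Euclidean distances from each sample to its closest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        for p in samples
    )


samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
# Two candidate sets of centers, standing in for the results of two runs.
candidates = [
    [(5.0, 0.0), (5.0, 2.0)],
    [(0.0, 1.0), (10.0, 1.0)],
]
# Keep the candidate with the smallest inertia.
best = min(candidates, key=lambda centers: inertia(samples, centers))
print(best)
```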
Parameters
----------
n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of
centroids to generate.
max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a
single run.
n_init : int, default: 10
Number of times the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
init : {'k-means++', 'random' or an ndarray}
Method for initialization, defaults to 'k-means++':
'k-means++' : selects initial cluster centers for k-means
clustering in a smart way to speed up convergence. See section
Notes in k_init for more details.
The exact k-means++ algorithm is as follows:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
why k-means++?
The intuition behind this approach is that spreading out the k initial cluster centers is a good thing.
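The seeding steps listed above can be sketched with the standard library alone. This is an illustrative version of the textbook procedure, not scikit-learn's k_init; the names kmeans_pp_init and squared_distance are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(samples, k, seed=0):
    """Pick k initial centers using D(x)^2-weighted sampling."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly from the data points.
    centers = [rng.choice(samples)]
    while len(centers) < k:
        # Step 2: D(x)^2, the squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in samples]
        # Step 3: draw the next center with probability proportional to D(x)^2.
        # Already chosen points have weight 0, so they are never drawn again.
        centers.append(rng.choices(samples, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 8.0)]
print(kmeans_pp_init(points, k=3, seed=1))
```

Points far from every chosen center get large weights, which is how the procedure spreads the initial centers out.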
'random': choose k observations (rows) at random from data for
the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
algorithm : "auto", "full" or "elkan", default="auto"
K-means algorithm to use. The classical EM-style algorithm is "full".
The "elkan" variation is more efficient by using the triangle
inequality, but currently doesn't support sparse data. "auto" chooses
"elkan" for dense data and "full" for sparse data.
elkan: https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf
The optimized algorithm is based on the fact that most distance calculations in standard k-means are redundant.
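To make the triangle-inequality idea concrete: if the distance between the current best center b and another center c is at least 2 * d(x, b), then d(x, c) >= d(x, b), so d(x, c) never needs to be computed. The sketch below applies only this one bound for a single point; the function name is hypothetical, and a real implementation would compute the center-to-center distances once per iteration and reuse them for every sample.

```python
import math


def nearest_center_with_pruning(x, centers):
    """Return (index of a nearest center, number of point-center distances skipped)."""
    best = 0
    best_dist = math.dist(x, centers[0])
    skipped = 0
    for j in range(1, len(centers)):
        # If d(best, c_j) >= 2 * d(x, best), the triangle inequality gives
        # d(x, c_j) >= d(x, best), so c_j cannot be strictly closer than best.
        if math.dist(centers[best], centers[j]) >= 2 * best_dist:
            skipped += 1
            continue
        d = math.dist(x, centers[j])
        if d < best_dist:
            best, best_dist = j, d
    return best, skipped


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 0.3)]
print(nearest_center_with_pruning((0.1, 0.0), centers))
print(nearest_center_with_pruning((9.0, 0.0), centers))
```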
precompute_distances : {'auto', True, False}
Precompute distances (faster but takes more memory).
'auto' : do not precompute distances if n_samples * n_clusters > 12
million. This corresponds to about 100MB overhead per job using
double precision.
True : always precompute distances
False : never precompute distances
What does precompute_distances mean?
if precompute_distances:
    return _labels_inertia_precompute_dense(X, x_squared_norms,
                                            centers, distances)
_labels_inertia_precompute_dense:
Compute labels and inertia using a full distance matrix. This will overwrite the 'distances' array in-place.
Open question: does precompute_distances only apply to the "full" algorithm, and not to "elkan"?
tol : float, default: 1e-4
Relative tolerance with regards to inertia to declare convergence
def _tolerance(X, tol):
    """Return a tolerance which is independent of the dataset"""
    if sp.issparse(X):
        variances = mean_variance_axis(X, axis=0)[1]
    else:
        variances = np.var(X, axis=0)
    return np.mean(variances) * tol

tol = _tolerance(X, tol)

# excerpt from inside the k-means iteration loop
center_shift_total = squared_norm(centers_old - centers)
if center_shift_total <= tol:
    if verbose:
        print("Converged at iteration %d: "
              "center shift %e within tolerance %e"
              % (i, center_shift_total, tol))
    break
---<sklearn source code>
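The quoted source scales tol by the mean per-feature variance and then compares it with the total squared shift of the centers. A dense-only sketch of the same arithmetic using the standard library is given below; scaled_tolerance and has_converged are hypothetical names, and the sparse branch is left out.

```python
from statistics import mean, pvariance


def scaled_tolerance(X, tol):
    """Scale tol by the mean per-feature variance of dense data X."""
    # pvariance is the population variance, matching np.var's default (ddof=0).
    variances = [pvariance(column) for column in zip(*X)]
    return mean(variances) * tol


def has_converged(centers_old, centers, tol):
    """True when the total squared center shift is within the tolerance."""
    shift = sum(
        (a - b) ** 2
        for old, new in zip(centers_old, centers)
        for a, b in zip(old, new)
    )
    return shift <= tol


X = [(0.0, 0.0), (2.0, 4.0)]
tol = scaled_tolerance(X, 1e-4)  # per-feature variances are 1.0 and 4.0
print(tol)
print(has_converged([(0.0, 0.0)], [(0.01, 0.0)], tol))
print(has_converged([(0.0, 0.0)], [(0.1, 0.0)], tol))
```

Because the threshold is proportional to the data's variance, rescaling all features by a constant does not change when the loop stops.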
n_jobs : int
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is
used at all, which is useful for debugging. For n_jobs below -1,
(n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
are used.
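The n_jobs rule above is simple arithmetic, sketched here as a hypothetical helper named resolve_n_jobs (not a scikit-learn or joblib function). Clamping the result to at least one worker is an assumption added for safety and is not stated in the text above.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Translate an n_jobs setting into a worker count (hypothetical helper)."""
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        # Assumption: never return fewer than one worker.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for setting in (1, -1, -2):
    print(setting, resolve_n_jobs(setting, n_cpus=8))
```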
random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is
given, it fixes the seed. Defaults to the global numpy random
number generator.
from ..utils import check_random_state
random_state = check_random_state(random_state)
seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
---<sklearn source code> this code is only used when the k-means runs are parallelised
verbose : int, default 0
Verbosity mode.
copy_x : boolean, default True
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True, then the original data is not
modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
Methods
-------
fit(X[, y])           | Compute k-means clustering.
fit_predict(X[, y])   | Compute cluster centers and predict cluster index for each sample.
fit_transform(X[, y]) | Compute clustering and transform X to cluster-distance space.
get_params([deep])    | Get parameters for this estimator.
predict(X)            | Predict the closest cluster each sample in X belongs to.
score(X[, y])         | Opposite of the value of X on the K-means objective.
set_params(**params)  | Set the parameters of this estimator.
transform(X[, y])     | Transform X to a cluster-distance space.
Tips:
1. How to determine k?
2. How to choose the distance function?
3. How to evaluate the quality of the clustering?
'''