sklearn.cluster.KMeans

#sklearn.cluster.KMeans
#by muzhen

'''

Algorithm:

initialize k centroids
for sample in all samples:
    compute the distance between the sample and each centroid using a distance function,
    assign the sample to the cluster of the centroid with the smallest distance
update each of the k centroids to the mean of the samples assigned to it
repeat the assignment and update steps until no sample changes its cluster, or the iteration count reaches the given maximum.
'''
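
A minimal sketch of this loop in plain NumPy (illustrative only; the initialization here is a simple random choice of samples, not what sklearn uses by default):

import numpy as np

def simple_kmeans(X, k, max_iter=300, seed=0):
    # Toy version of the loop described above; X is an (n_samples, n_features) array.
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # naive random init
    labels = None
    for _ in range(max_iter):
        # assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no sample changed its cluster
        labels = new_labels
        # update step: each centroid moves to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels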

Background

The k-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard,[4] the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

However, the k-means algorithm has at least two major theoretic shortcomings:

  • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.[5]
  • Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

Example of a sub-optimal clustering

To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in R² that form an axis-aligned rectangle whose width is greater than its height.

If k = 2 and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm converges immediately, without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together—a suboptimal clustering because the width of the rectangle is greater than its height.

Now, consider stretching the rectangle horizontally to an arbitrary width. The standard k-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.
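
The rectangle example can be reproduced directly with KMeans by passing the two edge midpoints as initial centers (a small sketch; the coordinates are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

w, h = 10.0, 1.0                                  # wide, short rectangle
X = np.array([[0, 0], [w, 0], [0, h], [w, h]])
bad_init = np.array([[w / 2, 0.0], [w / 2, h]])   # midpoints of the bottom and top edges

km = KMeans(n_clusters=2, init=bad_init, n_init=1, max_iter=10).fit(X)
print(km.labels_)    # e.g. [0 0 1 1]: bottom pair vs. top pair, the suboptimal split
print(km.inertia_)   # 4 * (w/2)**2 = w**2; the optimal left/right split would give h**2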


from sklearn.cluster import KMeans



KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
       precompute_distances='auto', verbose=0, random_state=None, copy_x=True,
       n_jobs=1, algorithm='auto')

Attributes

----------
cluster_centers_ : array, [n_clusters, n_features]
    Coordinates of cluster centers


labels_ :
    Labels of each point


inertia_ : float
    Sum of squared distances of samples to their closest cluster center.

Understanding inertia_ is necessary for understanding the parameters!
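
A quick look at these attributes on toy data (the tiny array below is just for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, random_state=0).fit(X)

print(km.cluster_centers_)   # shape (n_clusters, n_features), e.g. [[10. 2.], [1. 2.]]
print(km.labels_)            # cluster index of each sample, e.g. [1 1 1 0 0 0]
print(km.inertia_)           # sum of squared distances to the closest center, 16.0 here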


Parameters

----------
n_clusters : int, optional, default: 8
    The number of clusters to form as well as the number of
    centroids to generate.


max_iter : int, default: 300
    Maximum number of iterations of the k-means algorithm for a
    single run.


n_init : int, default: 10
    Number of times the k-means algorithm will be run with different
    centroid seeds. The final result will be the best output of the
    n_init consecutive runs in terms of inertia.


init : {'k-means++', 'random' or an ndarray}
    Method for initialization, defaults to 'k-means++':



    'k-means++' : selects initial cluster centers for k-means
    clustering in a smart way to speed up convergence. See section
    Notes in k_init for more details.


The exact k-means++ algorithm is as follows:

  1. Choose one center uniformly at random from among the data points.
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

why k-means++?

The intuition behind this approach is that spreading out the k initial cluster centers is a good thing.
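
A hedged sketch of that initialization in NumPy (illustrative only, not sklearn's internal _k_init implementation):

import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: the first center is chosen uniformly at random from the data
    centers = [X[rng.randint(len(X))]]
    for _ in range(1, k):
        # Step 2: D(x)^2 = squared distance to the nearest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: draw the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)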


    'random': choose k observations (rows) at random from data for
    the initial centroids.


    If an ndarray is passed, it should be of shape (n_clusters, n_features)
    and gives the initial centers.


algorithm : "auto", "full" or "elkan", default="auto"
    K-means algorithm to use. The classical EM-style algorithm is "full".
    The "elkan" variation is more efficient by using the triangle
    inequality, but currently doesn't support sparse data. "auto" chooses
    "elkan" for dense data and "full" for sparse data.

elkan: https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf

The optimized algorithm is based on the fact that most distance calculations in standard k-means are redundant. 
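
The key bound can be stated in one line: if a point x is currently assigned to center c1 and d(c1, c2) >= 2 * d(x, c1), then the triangle inequality gives d(x, c2) >= d(x, c1), so d(x, c2) never has to be computed. A tiny illustration of that check (not Elkan's full bookkeeping of upper and lower bounds):

import numpy as np

def can_skip(x, c1, c2):
    # d(x, c2) >= d(c1, c2) - d(x, c1), so if d(c1, c2) >= 2 * d(x, c1)
    # then c2 cannot be closer to x than c1 is.
    return np.linalg.norm(c1 - c2) >= 2 * np.linalg.norm(x - c1)

x, c1, c2 = np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([10.0, 0.0])
print(can_skip(x, c1, c2))   # True: the distance to c2 is redundant and can be skipped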

precompute_distances : {'auto', True, False}
    Precompute distances (faster but takes more memory).


    'auto' : do not precompute distances if n_samples * n_clusters > 12
    million. This corresponds to about 100MB overhead per job using
    double precision.


    True : always precompute distances


    False : never precompute distances
What does precompute_distances actually mean?

if precompute_distances:
    return _labels_inertia_precompute_dense(X, x_squared_norms,
                                            centers, distances)

_labels_inertia_precompute_dense:

    Compute labels and inertia using a full distance matrix.
    This will overwrite the 'distances' array in-place.

So is precompute_distances tied to the 'full' (Lloyd) path, as opposed to 'elkan'?
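
Concretely, "precomputing" means building the full (n_clusters, n_samples) distance matrix up front and reading labels and inertia off it, roughly like this sketch (not the actual _labels_inertia_precompute_dense code):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def labels_inertia_dense(X, centers):
    # full distance matrix between every center and every sample
    all_distances = euclidean_distances(centers, X, squared=True)  # (n_clusters, n_samples)
    labels = all_distances.argmin(axis=0)                          # closest center per sample
    inertia = all_distances.min(axis=0).sum()                      # k-means objective on X
    return labels, inertia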


tol : float, default: 1e-4
    Relative tolerance with regards to inertia to declare convergence

def _tolerance(X, tol):
    """Return a tolerance which is independent of the dataset"""
    if sp.issparse(X):
        variances = mean_variance_axis(X, axis=0)[1]
    else:
        variances = np.var(X, axis=0)
    return np.mean(variances) * tol        

tol = _tolerance(X, tol)

center_shift_total = squared_norm(centers_old - centers)
if center_shift_total <= tol:
    if verbose:
        print("Converged at iteration %d: "
              "center shift %e within tolerance %e"
              % (i, center_shift_total, tol))
    break

                                                 ---<sklearn source code>
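
So the tol you pass to KMeans is rescaled by the mean per-feature variance of X before it is compared with the squared shift of the centers. A small worked example of that rescaling (the array is made up):

import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
tol = 1e-4
effective_tol = np.mean(np.var(X, axis=0)) * tol
print(effective_tol)   # per-feature variances [1.0, 4.0] -> mean 2.5 -> 2.5e-4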


n_jobs : int
    The number of jobs to use for the computation. This works by computing
    each of the n_init runs in parallel.


    If -1 all CPUs are used. If 1 is given, no parallel computing code is
    used at all, which is useful for debugging. For n_jobs below -1,
    (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
    are used.


random_state : integer or numpy.RandomState, optional
    The generator used to initialize the centers. If an integer is
    given, it fixes the seed. Defaults to the global numpy random
    number generator.


from ..utils import check_random_state

random_state = check_random_state(random_state)

seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)

                                            ---<sklearn source code>  (these seeds are only used when the n_init runs are parallelised)
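
A hedged illustration of what fixing random_state buys you: identical centers across runs (toy data only):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(42).rand(100, 2)

a = KMeans(n_clusters=3, random_state=0).fit(X)
b = KMeans(n_clusters=3, random_state=0).fit(X)
print(np.allclose(a.cluster_centers_, b.cluster_centers_))   # True: same seed, same centers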


verbose : int, default 0
    Verbosity mode.


copy_x : boolean, default True
    When pre-computing distances it is more numerically accurate to center
    the data first.  If copy_x is True, then the original data is not
    modified.  If False, the original data is modified, and put back before
    the function returns, but small numerical differences may be introduced
    by subtracting and then adding the data mean.


Methods

----------
fit(X[, y]) Compute k-means clustering.
fit_predict(X[, y]) Compute cluster centers and predict cluster index for each sample.
fit_transform(X[, y]) Compute clustering and transform X to cluster-distance space.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict the closest cluster each sample in X belongs to.
score(X[, y]) Opposite of the value of X on the K-means objective.   
set_params(**params) Set the parameters of this estimator.
transform(X[, y]) Transform X to a cluster-distance space.
Generally speaking, score(X) is the negative of the inertia computed on X; on the training data it equals -inertia_.
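
A short end-to-end sketch of the main methods on the same toy data as above (the printed values depend on the version and seed):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, random_state=0)

labels = km.fit_predict(X)            # fit, then the cluster index of each training sample
dists = km.transform(X)               # (n_samples, n_clusters) distances to each center
print(km.predict([[0, 0], [12, 3]]))  # closest cluster for new points
print(km.score(X), -km.inertia_)      # the two numbers coincide on the training data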



'''

Tips:

1. How do you determine k?
2. How do you choose the distance function?
3. How do you evaluate the quality of the clustering?
'''
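
For the first question, one common (though informal) approach is the elbow heuristic: run KMeans for a range of k, plot inertia_ against k, and look for the bend. A quick sketch, assuming synthetic blob data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia_')
plt.show()   # the "elbow" near k=4 suggests a reasonable number of clusters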