Clustering Analysis
K-means
Some basic terms
- centroid: the center (mean point) of the samples in a cluster
- medoid: the most representative or most frequently occurring point in a cluster
Steps
- Randomly pick $k$ centroids from the sample points as the initial cluster centers.
- Assign each sample to the nearest centroid $\mu^{(j)},\ j \in \{1, \dots, k\}$.
- Move each centroid to the center of the samples that were assigned to it.
- Repeat steps 2 and 3 until the cluster assignments do not change, a user-defined tolerance is met, or a maximum number of iterations is reached (a minimal sketch of this loop follows the list).
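A minimal NumPy sketch of these four steps, assuming the data is an `(n_samples, n_features)` array; the function name and parameters are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """A minimal k-means loop; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the samples assigned to it
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids move less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```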
SSE
Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia.
$$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = ||x - y||_2^2$$

$$SSE = \sum_{i=1}^{n}\sum_{j=1}^{k} w^{(i,j)} ||x^{(i)} - \mu^{(j)}||_2^2$$
Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in cluster $j$, and $w^{(i,j)} = 0$ otherwise.
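A short NumPy sketch of this formula, assuming `labels` and `centroids` as produced by the `kmeans` sketch above (the names are illustrative):

```python
import numpy as np

def cluster_sse(X, labels, centroids):
    """Within-cluster sum of squared errors (cluster inertia)."""
    # x^(i) - mu^(j) for the centroid each sample was assigned to
    diffs = X - centroids[labels]
    # sum of squared Euclidean distances over all samples
    return float((diffs ** 2).sum())
```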
K-means++
Place the initial centroids far away from each other via the k-means++ algorithm (a usage sketch follows the steps below).
Steps
- Initialize an empty set $M$ to store the $k$ centroids being selected.
- Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to $M$.
- For each sample $x^{(i)}$ that is not in $M$, find the minimum squared distance $d(x^{(i)}, M)^2$ to any of the centroids in $M$.
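In practice the full selection loop is rarely hand-coded; scikit-learn's `KMeans` supports this initialization via `init='k-means++'`. A small usage sketch (the data array `X` and the parameter values are assumptions):

```python
from sklearn.cluster import KMeans

# k-means with k-means++ initialization of the centroids
km = KMeans(n_clusters=3, init='k-means++', n_init=10,
            max_iter=300, tol=1e-4, random_state=0)
labels = km.fit_predict(X)   # cluster index for each sample
print(km.inertia_)           # within-cluster SSE (cluster inertia) of the fit
```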