本文结构安排
- 经典聚类算法:线性聚类 Kmeans
- 经典聚类算法:非线性聚类 DBSCAN、谱聚类
- 新兴聚类算法:DenPeak,RCC
K-means
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Given a set of observations ( x 1 , x 2 , . . . , x n ) (x_{1},x_{2},..., x_{n}) (x1,x2,...,xn),where each observation is a d-dimensional real vector,k-means clustering aims to partition the n observations into k ( ∈ n ) k(\in n) k(∈n) set ( S 1 , S 2 , . . . , S k ) (S_{1},S_{2},..., S_{k}) (S1,S2,...,Sk) so as to minimize the within-cluster sum of variance,the objective is to find:
a r g min s ∑ i = 1 k ∑ x ∈ S i ∣ ∣ x − μ i ∣ ∣ 2 arg\min_{s}\sum_{i=1}^{k}\sum_{x \in S_{i}}||x-\mu_{i}||^{2} argsmini=1∑kx∈Si∑∣∣x−μi∣∣2
where μ i \mu_{i} μi is the mean of points in S i S_{i} Si.
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm.It
is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scienti c literature.
Consider a set of points in some space to be clustered. For the purpose of DBSCAN clustering, the points are classi ed as core points,density-reachable points and outliers, as follows:
-
A point p is a core point if at least minPts points are within distance ε \varepsilon ε ( ε \varepsilon ε