Ch5 Clustering – Modern Statistics for Modern Biology
http://web.stanford.edu/class/bios221/book/Chap-Clustering.html
5.3 Distance
- Euclidean
- Manhattan (1 norm)
- Maximum (infty norm)
- Weighted Euclidean
- Minkowski (p norm)
- Edit, Hamming (string)
- Binary
- Jaccard
- Correlation based distance
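Most of the listed distances can be written in a few lines each. The sketch below is illustrative (the book itself computes these in R, e.g. with `dist`); the function names are our own:

```python
import math

x = [1.0, 0.0, 3.0]
y = [2.0, 2.0, 1.0]

def euclidean(u, v):
    # 2-norm: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # 1-norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(u, v))

def maximum(u, v):
    # infinity-norm: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(u, v))

def minkowski(u, v, p):
    # p-norm; p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def jaccard(a, b):
    # Jaccard distance between two sets of present features
    return 1 - len(a & b) / len(a | b)
```

For the vectors above, the Euclidean distance is 3.0 and the Manhattan distance is 5.0; Minkowski with p = 2 reproduces the Euclidean result.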
5.4 Nonparametric mixture detection
5.4.1 k-methods:
Besides the distance measure, the main choice to be made is the number of clusters k
- PAM
- k-means
- k-medoids
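The chapter works with R routines such as kmeans and pam; purely as an illustration of the assignment/update iteration behind the k-methods, here is a minimal Lloyd's-algorithm sketch:

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, n_iter=100, seed=0):
    # Initialize centers with k distinct data points.
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its points
        # (keep the old center if a cluster happens to be empty).
        new_centers = [
            centroid([p for p, l in zip(points, labels) if l == j])
            if any(l == j for l in labels) else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:   # assignments stable: converged
            break
        centers = new_centers
    return labels, centers

# Demo: two well-separated groups of points.
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels, centers = kmeans(blob_a + blob_b, k=2)
```

On these two well-separated groups, the algorithm assigns each group its own label. k-medoids (PAM) differs in the update step: the center is constrained to be one of the data points (a medoid) rather than the mean.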
5.4.2 Tight clusters with resampling
- Strong forms: repeating a clustering procedure multiple times on the same data, but with different starting points, yields groups of observations that are always grouped together; these are called strong forms (Diday and Brito 1989).
- Tight clusters: repeatedly subsampling the dataset and applying a clustering method yields groups of observations that are "almost always" grouped together; these are called tight clusters (Tseng and Wong 2005).
The study of strong forms or tight clusters facilitates the choice of the number of clusters.
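One way to see the resampling idea is to count, over repeated subsamples, how often each pair of observations lands in the same cluster; pairs that are "almost always" together form tight clusters. The sketch below is illustrative only (it is not the Tseng–Wong algorithm, and the toy 1-D clustering function is hypothetical):

```python
import random
from itertools import combinations

def consensus(points, cluster_fn, B=50, frac=0.7, seed=0):
    """For each pair (i, j): the fraction of subsamples containing both
    points in which they were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    counted = [[0] * n for _ in range(n)]
    for _ in range(B):
        idx = rng.sample(range(n), int(frac * n))      # subsample of the data
        labels = cluster_fn([points[i] for i in idx])  # cluster the subsample
        for a, b in combinations(range(len(idx)), 2):
            i, j = sorted((idx[a], idx[b]))
            counted[i][j] += 1
            together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / counted[i][j] if counted[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Toy 1-D "clustering": split the points at the midpoint of their range.
def split_at_mid(xs):
    mid = (min(xs) + max(xs)) / 2
    return [0 if x < mid else 1 for x in xs]

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
C = consensus(points, split_at_mid)
```

Here the pairs within each group of three have consensus 1 (always co-clustered: a tight cluster), while pairs across the two groups have consensus 0.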
5.5 Clustering examples: flow cytometry and mass cytometry
5.5.3 Density-based clustering
Data sets such as flow cytometry data, which contain only a few markers and a large number of cells, are amenable to density-based clustering.
It has the advantage of being able to cope with clusters that are not necessarily convex.
One implementation of such a method is called dbscan.
How does density-based clustering (dbscan) work?
The building block of dbscan is the concept of density-reachability: a point q is directly density-reachable from a point p if it is no further away than a given threshold ϵ, and if p is surrounded by sufficiently many points, so that one may consider p (and q) to be part of a dense region. We say that q is density-reachable from p if there is a sequence of points p1, …, pn with p1 = p and pn = q, such that each pi+1 is directly density-reachable from pi.
A cluster is then a subset of points that satisfy the following properties:
- All points within the cluster are mutually density-connected.
- If a point is density-connected to any point of the cluster, it is part of the cluster as well.
- Groups of points must have at least MinPts points to count as a cluster.
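The description above can be turned into a small illustrative implementation. This is a simplified sketch, not the code of the dbscan package; `eps` plays the role of ϵ and `min_pts` of MinPts:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may later become a border point)
            continue
        labels[i] = cluster           # i is a core point: start a new cluster
        seeds = list(nbrs)
        while seeds:                  # expand through density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is itself a core point: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels

# Demo: two dense groups and one isolated point (which becomes noise).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Because a cluster is grown by chaining directly density-reachable core points, it can take any shape the dense region has, which is why dbscan copes with non-convex clusters.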
5.6 Hierarchical clustering
5.7 Validating and choosing the number of clusters
We formalize this with the within-groups sum of squared distances (WSS):
WSS_k = \sum_{l=1}^{k} \sum_{x_i \in C_l} d^2(x_i, \bar{x}_l)
One idea is to look at WSS_k as a function of k. This will always be a decreasing function of k, but if there is a pronounced region where it decreases sharply and then flattens out, we call this an elbow and might take it as a potential sweet spot for the number of clusters.
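As a sketch, the WSS_k formula translates directly into code, given a partition of the points into groups (the function name is ours, not the book's):

```python
from collections import defaultdict

def wss(points, labels):
    """Within-groups sum of squared Euclidean distances to the group means."""
    groups = defaultdict(list)
    for p, l in zip(points, labels):
        groups[l].append(p)
    total = 0.0
    for pts in groups.values():
        # The group mean plays the role of \bar{x}_l in the formula.
        center = tuple(sum(c) / len(pts) for c in zip(*pts))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in pts)
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
wss_k2 = wss(pts, [0, 0, 1, 1])   # 4.0: a 2-cluster partition fits well
wss_k1 = wss(pts, [0, 0, 0, 0])   # 104.0: one cluster fits poorly
```

The sharp drop from 104 to 4 between k = 1 and k = 2 is exactly the kind of elbow the text describes.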
5.7.1 Using the gap statistic
Algorithm for computing the gap statistic (Tibshirani, Walther, and Hastie 2001):
- Cluster the data with k clusters and compute WSS_k for the various choices of k.
- Generate B plausible reference data sets, using Monte Carlo sampling from a homogeneous distribution, and redo Step 1 above for these new simulated data. This results in B new within-sum-of-squares values for the simulated data, W^*_{kb}, for b = 1, …, B.
- Compute the gap(k)-statistic:
gap(k) = \bar{l}_k - \log WSS_k, \qquad \text{with } \bar{l}_k = \frac{1}{B}\sum_b \log W^*_{kb}
Note that the first term is expected to be bigger than the second one if the clustering is good (i.e., the WSS is smaller); thus the gap statistic will be mostly positive and we are looking for its highest value.
We can use the standard deviation to help choose the best k.
\text{sd}_k^2 = \frac{1}{B-1}\sum_{b=1}^{B}\left(\log(W^*_{kb}) - \bar{l}_k\right)^2
Several choices are available; for instance, choose the smallest k such that
\text{gap}(k) \geq \text{gap}(k+1) - s'_{k+1}, \qquad \text{where } s'_{k+1} = \text{sd}_{k+1}\sqrt{1+1/B}.
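The whole procedure can be sketched as follows. Here `wss_fn(points, k)` stands for any routine that clusters the points into k groups and returns WSS_k (an assumption of this sketch, not a function from the chapter), and the reference sets are drawn uniformly over the data's bounding box:

```python
import math
import random

def gap_statistic(points, wss_fn, k_max=6, B=20, seed=0):
    """Return the smallest k with gap(k) >= gap(k+1) - s'_{k+1}."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    gaps, s_prime = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(wss_fn(points, k))
        # B reference data sets from a homogeneous (uniform) distribution.
        log_wstar = []
        for _ in range(B):
            ref = [tuple(rng.uniform(l, h) for l, h in zip(lo, hi))
                   for _ in points]
            log_wstar.append(math.log(wss_fn(ref, k)))
        lbar = sum(log_wstar) / B
        sd = math.sqrt(sum((v - lbar) ** 2 for v in log_wstar) / (B - 1))
        gaps.append(lbar - log_w)                  # gap(k) = lbar_k - log WSS_k
        s_prime.append(sd * math.sqrt(1 + 1 / B))  # s'_k = sd_k * sqrt(1 + 1/B)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s_prime[k]:
            return k
    return k_max

# Crude 1-D stand-in for a clustering routine: sort the points, cut them
# into k contiguous chunks, and sum squared deviations within each chunk.
def chunk_wss(points, k):
    xs = sorted(p[0] for p in points)
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    total = 0.0
    for a, b in zip(bounds, bounds[1:]):
        chunk = xs[a:b]
        m = sum(chunk) / len(chunk)
        total += sum((x - m) ** 2 for x in chunk)
    return total

# Two well-separated 1-D groups: the gap statistic picks k = 2.
points = [(i / 10,) for i in range(10)] + [(10 + i / 10,) for i in range(10)]
best_k = gap_statistic(points, chunk_wss, k_max=4, B=20, seed=0)
```

On this toy data, WSS_k drops sharply at k = 2 while the uniform reference sets show no such drop, so gap(2) dominates and the selection rule stops at k = 2.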