Ch5 Clustering – Modern Statistics for Modern Biology
http://web.stanford.edu/class/bios221/book/Chap-Clustering.html
5.3 Distance
- Euclidean
- Manhattan (1 norm)
- Maximum (infty norm)
- Weighted Euclidean
- Minkowski (p norm)
- Edit, Hamming (string)
- Binary
- Jaccard
- Correlation based distance
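Most of the listed distances can be written in a few lines each. The sketch below is illustrative (the book itself computes these in R, e.g. with `dist`); the function names are our own:

```python
import math

x = [1.0, 0.0, 3.0]
y = [2.0, 2.0, 1.0]

def euclidean(u, v):
    # 2-norm: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # 1-norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(u, v))

def maximum(u, v):
    # infinity-norm: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(u, v))

def minkowski(u, v, p):
    # p-norm; p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def jaccard(a, b):
    # Jaccard distance between two sets of present features
    return 1 - len(a & b) / len(a | b)
```

For the vectors above, the Euclidean distance is 3.0 and the Manhattan distance is 5.0; Minkowski with p = 2 reproduces the Euclidean result.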
5.4 Nonparametric mixture detection
5.4.1 k-methods:
Besides the distance measure, the main choice to be made is the number of clusters k
- PAM
- k-means
- k-medoids
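The chapter works with R routines such as kmeans and pam; purely as an illustration of the assignment/update iteration behind the k-methods, here is a minimal Lloyd's-algorithm sketch:

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, n_iter=100, seed=0):
    # Initialize centers with k distinct data points.
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its points
        # (keep the old center if a cluster happens to be empty).
        new_centers = [
            centroid([p for p, l in zip(points, labels) if l == j])
            if any(l == j for l in labels) else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:   # assignments stable: converged
            break
        centers = new_centers
    return labels, centers

# Demo: two well-separated groups of points.
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels, centers = kmeans(blob_a + blob_b, k=2)
```

On these two well-separated groups, the algorithm assigns each group its own label. k-medoids (PAM) differs in the update step: the center is constrained to be one of the data points (a medoid) rather than the mean.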
5.4.2 Tight clusters with resampling
- Strong forms: repeating a clustering procedure multiple times on the same data, but with different starting points, yields groups of observations that are always grouped together; these are called strong forms (Diday and Brito 1989).
- Tight clusters: repeatedly subsampling the dataset and applying a clustering method yields groups of observations that are "almost always" grouped together; these are called tight clusters (Tseng and Wong 2005).
The study of strong forms or tight clusters facilitates the choice of the number of clusters.
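One way to see the resampling idea is to count, over repeated subsamples, how often each pair of observations lands in the same cluster; pairs that are "almost always" together form tight clusters. The sketch below is illustrative only (it is not the Tseng–Wong algorithm, and the toy 1-D clustering function is hypothetical):

```python
import random
from itertools import combinations

def consensus(points, cluster_fn, B=50, frac=0.7, seed=0):
    """For each pair (i, j): the fraction of subsamples containing both
    points in which they were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    counted = [[0] * n for _ in range(n)]
    for _ in range(B):
        idx = rng.sample(range(n), int(frac * n))      # subsample of the data
        labels = cluster_fn([points[i] for i in idx])  # cluster the subsample
        for a, b in combinations(range(len(idx)), 2):
            i, j = sorted((idx[a], idx[b]))
            counted[i][j] += 1
            together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / counted[i][j] if counted[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Toy 1-D "clustering": split the points at the midpoint of their range.
def split_at_mid(xs):
    mid = (min(xs) + max(xs)) / 2
    return [0 if x < mid else 1 for x in xs]

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
C = consensus(points, split_at_mid)
```

Here the pairs within each group of three have consensus 1 (always co-clustered: a tight cluster), while pairs across the two groups have consensus 0.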
5.5 Clustering examples: flow cytometry and mass cytometry
5.5.3 Density-based clustering
Data sets such as flow cytometry data, which contain only a few markers and a large number of cells, are amenable to density-based clustering.
It has the advantage of being able to cope with clusters that are not necessarily convex.
One implementation of such a method is called dbscan.
How does density-based clustering (dbscan) work?
The building block of dbscan is the concept of density-reachability: a point q is directly density-reachable from a point p if it is no further away than a given threshold ϵ, and if p is surrounded by sufficiently many points, so that one may consider p (and q) to be part of a dense region. We say that q is density-reachable from p if there is a sequence of points p1, …, pn with p1 = p and pn = q, such that each pi+1 is directly density-reachable from pi.
A cluster is then a subset of points that satisfy the following properties:
- All points within the cluster are mutually density-connected.
- If a point is density-connected to any point of the cluster, it is part of the cluster as well.
- Groups of points must have at least MinPts points to count as a cluster.
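The description above can be turned into a small illustrative implementation. This is a simplified sketch, not the code of the dbscan package; `eps` plays the role of ϵ and `min_pts` of MinPts:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may later become a border point)
            continue
        labels[i] = cluster           # i is a core point: start a new cluster
        seeds = list(nbrs)
        while seeds:                  # expand through density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is itself a core point: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels

# Demo: two dense groups and one isolated point (which becomes noise).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Because a cluster is grown by chaining directly density-reachable core points, it can take any shape the dense region has, which is why dbscan copes with non-convex clusters.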
5.6 Hierarchical clustering
5.7 Validating and choosing the number of clusters
We formalize this with the within-groups sum of squared distances (WSS):
WSS_k = \sum_{l=1}^{k} \sum_{x_i \in C_l} d^2(x_i, \bar{x}_l)
One idea is to look at WSS_k as a function of k. This will always be a decreasing function of k, but if there is a pronounced region where it decreases sharply and then flattens out, we call this an elbow and might take it as a potential sweet spot for the number of clusters.
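As a sketch, the WSS_k formula translates directly into code, given a partition of the points into groups (the function name is ours, not the book's):

```python
from collections import defaultdict

def wss(points, labels):
    """Within-groups sum of squared Euclidean distances to the group means."""
    groups = defaultdict(list)
    for p, l in zip(points, labels):
        groups[l].append(p)
    total = 0.0
    for pts in groups.values():
        # The group mean plays the role of \bar{x}_l in the formula.
        center = tuple(sum(c) / len(pts) for c in zip(*pts))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in pts)
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
wss_k2 = wss(pts, [0, 0, 1, 1])   # 4.0: a 2-cluster partition fits well
wss_k1 = wss(pts, [0, 0, 0, 0])   # 104.0: one cluster fits poorly
```

The sharp drop from 104 to 4 between k = 1 and k = 2 is exactly the kind of elbow the text describes.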
5.7.1 Using the gap statistic
Algorithm for computing the gap statistic (Tibshirani, Walther, and Hastie 2001):
- Cluster the data with k clusters and compute WSS_k for the various choices of k.
- Generate B plausible reference data sets, using Monte Carlo sampling from a homogeneous distribution, and redo Step 1 above for these new simulated data. This results in B new within-sum-of-squares values for the simulated data, W^*_{kb}, for b = 1, …, B.
- Compute the gap(k)-statistic:
gap(k) = \bar{l}_k - \log WSS_k, \qquad \text{with } \bar{l}_k = \frac{1}{B}\sum_b \log W^*_{kb}
Note that the first term is expected to be bigger than the second one if the clustering is good (i.e., the WSS is smaller); thus the gap statistic will be mostly positive and we are looking for its highest value.
We can use the standard deviation to help choose the best k.
\text{sd}_k^2 = \frac{1}{B-1}\sum_{b=1}^{B}\left(\log(W^*_{kb}) - \bar{l}_k\right)^2
Several choices are available; for instance, choose the smallest k such that
\text{gap}(k) \geq \text{gap}(k+1) - s'_{k+1}, \qquad \text{where } s'_{k+1} = \text{sd}_{k+1}\sqrt{1+1/B}.
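The whole procedure can be sketched as follows. Here `wss_fn(points, k)` stands for any routine that clusters the points into k groups and returns WSS_k (an assumption of this sketch, not a function from the chapter), and the reference sets are drawn uniformly over the data's bounding box:

```python
import math
import random

def gap_statistic(points, wss_fn, k_max=6, B=20, seed=0):
    """Return the smallest k with gap(k) >= gap(k+1) - s'_{k+1}."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    gaps, s_prime = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(wss_fn(points, k))
        # B reference data sets from a homogeneous (uniform) distribution.
        log_wstar = []
        for _ in range(B):
            ref = [tuple(rng.uniform(l, h) for l, h in zip(lo, hi))
                   for _ in points]
            log_wstar.append(math.log(wss_fn(ref, k)))
        lbar = sum(log_wstar) / B
        sd = math.sqrt(sum((v - lbar) ** 2 for v in log_wstar) / (B - 1))
        gaps.append(lbar - log_w)                  # gap(k) = lbar_k - log WSS_k
        s_prime.append(sd * math.sqrt(1 + 1 / B))  # s'_k = sd_k * sqrt(1 + 1/B)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s_prime[k]:
            return k
    return k_max

# Crude 1-D stand-in for a clustering routine: sort the points, cut them
# into k contiguous chunks, and sum squared deviations within each chunk.
def chunk_wss(points, k):
    xs = sorted(p[0] for p in points)
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    total = 0.0
    for a, b in zip(bounds, bounds[1:]):
        chunk = xs[a:b]
        m = sum(chunk) / len(chunk)
        total += sum((x - m) ** 2 for x in chunk)
    return total

# Two well-separated 1-D groups: the gap statistic picks k = 2.
points = [(i / 10,) for i in range(10)] + [(10 + i / 10,) for i in range(10)]
best_k = gap_statistic(points, chunk_wss, k_max=4, B=20, seed=0)
```

On this toy data, WSS_k drops sharply at k = 2 while the uniform reference sets show no such drop, so gap(2) dominates and the selection rule stops at k = 2.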