Clustering Notes of MSMB

Ch5 Clustering – Modern Statistics for Modern Biology

http://web.stanford.edu/class/bios221/book/Chap-Clustering.html

5.3 Distance

  1. Euclidean
  2. Manhattan (1 norm)
  3. Maximum (∞ norm)
  4. Weighted Euclidean
  5. Minkowski (p norm)
  6. Edit, Hamming (string)
  7. Binary
  8. Jaccard
  9. Correlation based distance
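Most of the distances in this list are available in SciPy (the function names below are SciPy's, not the book's); a minimal sketch:

```python
# Sketch: computing several of the distances listed above with SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

d_euclid = distance.euclidean(x, y)        # 2-norm (Euclidean)
d_manhat = distance.cityblock(x, y)        # 1-norm (Manhattan)
d_maxim  = distance.chebyshev(x, y)        # infinity-norm (maximum)
d_minkow = distance.minkowski(x, y, p=3)   # general p-norm (Minkowski)

# Hamming and Jaccard operate on binary / categorical vectors
d_hammng = distance.hamming([1, 0, 1], [1, 1, 0])  # fraction of mismatches
d_jaccrd = distance.jaccard([1, 0, 1], [1, 1, 0])  # on the nonzero union

d_correl = distance.correlation(x, y)      # 1 - Pearson correlation
```

Note that SciPy's `hamming` returns the *fraction* of mismatching positions, not the raw count, and `correlation` is a dissimilarity (1 minus the correlation), so perfectly correlated vectors get distance 0.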

5.4 Nonparametric mixture detection

5.4.1 k-methods

Besides the distance measure, the main choice to be made is the number of clusters, k.

  • PAM
  • k-means
  • k-medoids
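A minimal k-means sketch with scikit-learn on toy 2-D data (the blob locations and seeds are arbitrary choices for illustration); k-medoids/PAM are available in R's cluster::pam, while scikit-learn itself ships k-means:

```python
# Sketch: k-means with scikit-learn; n_clusters is the k that must be chosen.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs, 50 points each
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sizes = sorted(np.bincount(km.labels_).tolist())
print(sizes)  # sizes of the two recovered clusters
```

Because k-means depends on random starting centers, `n_init` restarts the algorithm several times and keeps the best solution, which connects directly to the "strong forms" idea in the next section.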

5.4.2 Tight clusters with resampling

  • Strong forms:
    Repeating a clustering procedure multiple times on the same data, but with different starting points, creates strong forms (Diday and Brito 1989).
  • tight clusters
    Repeated subsampling of the dataset and applying a clustering method will result in groups of observations that are “almost always” grouped together; these are called tight clusters (Tseng and Wong 2005).

The study of strong forms or tight clusters facilitates the choice of the number of clusters.
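A simplified illustration of the resampling idea (not the exact Tseng-Wong procedure): cluster many random subsamples and record how often each pair of points lands in the same cluster; pairs that are "almost always" together form tight clusters. The subsample fraction, number of resamples, and toy data below are arbitrary choices:

```python
# Sketch: co-clustering frequency across subsamples as a tightness measure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),   # points 0..29: blob A
               rng.normal(4, 0.3, (30, 2))])  # points 30..59: blob B
n = len(X)

co = np.zeros((n, n))    # times a pair ended up in the same cluster
seen = np.zeros((n, n))  # times a pair appeared together in a subsample
B = 50
for b in range(B):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)  # 70% subsample
    labels = KMeans(n_clusters=2, n_init=5, random_state=b).fit_predict(X[idx])
    for i, li in zip(idx, labels):
        for j, lj in zip(idx, labels):
            seen[i, j] += 1
            co[i, j] += (li == lj)

freq = co / np.maximum(seen, 1)  # co-clustering frequency per pair
# points from the same blob should have frequency near 1,
# points from different blobs near 0
print(round(freq[0, 1], 2), round(freq[0, 30], 2))
```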

5.5 Clustering examples: flow cytometry and mass cytometry

5.5.3 Density-based clustering

Data sets such as flow cytometry data, which contain only a few markers but a large number of cells, are amenable to density-based clustering.
It has the advantage of being able to cope with clusters that are not necessarily convex.
One implementation of such a method is called dbscan.

How does density-based clustering (dbscan) work?

The building block of dbscan is the concept of density-reachability: a point q is directly density-reachable from a point p if it is not further away than a given threshold ϵ, and if p is surrounded by sufficiently many points such that one may consider p (and q) to be part of a dense region. We say that q is density-reachable from p if there is a sequence of points p1,…,pn with p1=p and pn=q, so that each pi+1 is directly density-reachable from pi.

A cluster is then a subset of points that satisfy the following properties:

  • All points within the cluster are mutually density-connected.
  • If a point is density-connected to any point of the cluster, it is part of the cluster as well.
  • Groups of points must have at least MinPts points to count as a cluster.
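A minimal sketch with scikit-learn's DBSCAN on two non-convex "moon" shapes (the eps and min_samples values are ad hoc choices for this toy data): `eps` plays the role of the threshold ϵ and `min_samples` of MinPts.

```python
# Sketch: DBSCAN recovers two non-convex clusters that k-means could not.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# points labeled -1 are noise (not density-reachable from any cluster)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)
```

Unlike the k-methods, the number of clusters is not specified in advance; it emerges from ϵ and MinPts, which instead encode what counts as a dense region.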

5.6 Hierarchical clustering

5.7 Validating and choosing the number of clusters

We formalize this with the within-groups sum of squared distances (WSS):
WSS_k = \sum_{l=1}^{k} \sum_{x_i \in C_l} d^2(x_i, \bar{x}_l)
One idea is to look at WSSk as a function of k. This will always be a decreasing function, but if there is a pronounced region where it decreases sharply and then flattens out, we call this an elbow and might take this as a potential sweet spot for the number of clusters.
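A sketch of the elbow idea: scikit-learn exposes WSS_k as the `inertia_` attribute of a fitted KMeans model (the three-blob toy data is an arbitrary choice):

```python
# Sketch: WSS_k as a function of k; the "elbow" is where the sharp drops stop.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# three well-separated blobs, so the true number of clusters is 3
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 3, 6)])

wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
for k, w in wss.items():
    print(k, round(w, 1))
# WSS decreases in k, drops sharply up to k = 3, then flattens out
```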

5.7.1 Using the gap statistic

Algorithm for computing the gap statistic (Tibshirani, Walther, and Hastie 2001):

  • Cluster the data with k clusters and compute WSS_k for the various choices of k.
  • Generate B plausible reference data sets, using Monte Carlo sampling from a homogeneous distribution, and redo Step 1 above for these new simulated data. This results in B new within-sum-of-squares for simulated data, W^*_{kb}, for b=1,…,B.

Compute the gap(k)-statistic:
gap(k) = \bar{l}_k - \log WSS_k \qquad \text{with } \bar{l}_k = \frac{1}{B} \sum_b \log W^*_{kb}
Note that the first term is expected to be bigger than the second one if the clustering is good (i.e., the WSS is smaller); thus the gap statistic will be mostly positive and we are looking for its highest value.

We can use the standard deviation to help choose the best k.
\text{sd}_k^2 = \frac{1}{B-1} \sum_{b=1}^{B} \left( \log(W^*_{kb}) - \bar{l}_k \right)^2
Several choices are available, for instance, to choose the smallest k such that
\text{gap}(k) \geq \text{gap}(k+1) - s'_{k+1} \qquad \text{where } s'_{k+1} = \text{sd}_{k+1} \sqrt{1 + 1/B}.
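The steps above can be sketched in a few lines (this is an illustrative re-implementation, not R's cluster::clusGap; the reference data are drawn uniformly over the data's bounding box, and the toy data and B are arbitrary choices):

```python
# Sketch: gap statistic with uniform reference data and the
# "smallest k such that gap(k) >= gap(k+1) - s'_{k+1}" rule.
import numpy as np
from sklearn.cluster import KMeans

def wss(X, k, seed=0):
    """WSS_k of a k-means fit (scikit-learn's inertia_)."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 4)])  # 2 blobs
B, ks = 20, range(1, 5)
lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box for reference data

gap, sd = {}, {}
for k in ks:
    # B reference data sets from a homogeneous (uniform) distribution
    log_wstar = [np.log(wss(rng.uniform(lo, hi, X.shape), k, seed=b))
                 for b in range(B)]
    lbar = np.mean(log_wstar)
    gap[k] = lbar - np.log(wss(X, k))
    sd[k] = np.std(log_wstar, ddof=1)

# smallest k with gap(k) >= gap(k+1) - sd_{k+1} * sqrt(1 + 1/B)
best = min(k for k in list(ks)[:-1]
           if gap[k] >= gap[k + 1] - sd[k + 1] * np.sqrt(1 + 1 / B))
print(best)
```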

Useful python library
