
In unsupervised learning we no longer rely on known input-output pairs; instead we perform density estimation and data clustering. This article introduces the histogram and kernel density estimation methods for density estimation and discusses their pros and cons. It then covers mixture models, in particular the Expectation-Maximization (EM) algorithm for parameter estimation, as used for example in video change detection. Next it presents several clustering methods, including K-means, ISODATA, mean shift, and spectral clustering, and demonstrates clustering in practice through image segmentation. Finally it covers graph-based clustering methods such as agglomerative and divisive clustering, and how minimum cuts can be used to improve clustering results.

Unsupervised Learning

Motivation: Previously, we adopted some assumptions:
 1. a known functional form that maps the observed input to the output;
 2. a training dataset to train the parameters of the model.
Without these assumptions, we have:
 1. unknown functional form: non-parametric density estimation, which only yields probability values without knowing the exact form of the PDF (probability density function);
 2. only data, no output: data clustering.

Density Estimation

  • Histogram: discretize the feature space into bins and count
    • Pro:
      • with an infinite amount of data, any density can be approximated arbitrarily well >> approaches the continuous case
      • computationally simple
    • Con:
      • curse of dimensionality: the number of bins, and thus the amount of data required, grows exponentially with the data dimensionality;
      • the bin size is hard to determine, and there may be no optimal size.
  • Kernel density estimation

    • given a data point x, we want to output its probability density Pr(x) (a NumPy sketch follows this list):
      $Pr(x) = \frac{K}{NV} = \frac{1}{N h^d} \sum_{i=1}^{N} k_h(x - x_i)$
    • where:
      • $h^d$ is the volume $V$ of the window in $d$-dimensional space;
      • $N$ is the total number of data points in the given dataset;
      • $k_h$ is the kernel function, where $h$ stands for the kernel width;
    • Understanding:

      Intuitively, the kernel function $k_h(x - x_i)$ defines a weight that depends on the distance of each data point $x_i$ from the examined location $x$ in feature space.
      The denominator $NV$ turns the weighted count into a density: the kernel sum plays the role of the count $K$, and dividing by $NV$ normalizes it per unit volume.

    • Different kernel functions:

      • Parzen window estimator: a uniform (box) kernel that simply counts the points falling inside a hypercube of edge length $h$ centred at $x$;
  • Bias-variance trade-off: concerns the influence radius of a point
    • For example, the bin width of a histogram, the kernel width of a kernel function, or the number of neighbours in kNN;
    • Too large a radius gives an over-smoothed estimate >> large bias: a multimodal density is mistakenly fitted as a single-peak Gaussian;
    • Too small a radius gives an overly variable estimate >> large variance: a single-peak Gaussian is mistakenly fitted as a multimodal function.
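
To make the kernel density estimate and the bias-variance trade-off concrete, here is a minimal NumPy sketch of KDE with a Gaussian kernel (one common choice of $k_h$; the formula above allows any kernel). The toy bimodal dataset and the bandwidth values are made-up illustrations, not part of the original notes.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate Pr(x) = 1/(N h^d) * sum_i k_h(x - x_i),
    evaluated at a single query point x with a Gaussian kernel of width h."""
    data = np.atleast_2d(data)                     # shape (N, d)
    N, d = data.shape
    diff = (data - np.atleast_1d(x)) / h           # scaled offsets to every sample
    k = np.exp(-0.5 * np.sum(diff**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (N * h**d)

# toy 1-D bimodal data: two Gaussian bumps centred at -2 and +2
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])[:, None]

# small h -> spiky estimate (high variance), large h -> over-smoothed (high bias)
for h in (0.1, 0.5, 3.0):
    print(f"h={h:>4}: Pr(x=0) ~ {gaussian_kde(np.array([0.0]), samples, h):.4f}")
```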

Mixture Model

  • Parameters we want to estimate with Expectation Maximization (EM)
    • the mean $\mu_i$, the covariance matrix $\Sigma_i$, and the occurrence probability (also called the mixing coefficient of the $i$-th Gaussian component) $w_i$;
    • Updating $\mu_i$, $\Sigma_i$ and the mixing coefficient $w_i$ (written $\pi_k$ below) requires the posterior probability (responsibility) of each data point $x_n$, computed after the three parameters are initialized: $\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
    • See PRML, page 435, and the EM sketch after this list.
  • EM for change detection in video:
    • For each pixel we fit a Gaussian mixture model;
    • then each newly arriving pixel value (from the video stream) is evaluated under the mixture model to obtain a probability, which is compared against a predefined threshold;
    • if the probability is too low, the pixel intensity has changed; otherwise the pixel remains stable.
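
Below is a minimal sketch of plain EM for a Gaussian mixture, implementing the responsibilities $\gamma(z_{nk})$ from the E-step and the standard M-step updates of $w_k$, $\mu_k$, $\Sigma_k$; the initialization scheme, iteration count, and small regularization term are illustrative choices, not prescribed by the notes. For change detection, such a mixture would be fitted per pixel and the density of each incoming pixel value compared against the threshold.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component Gaussian mixture to X (shape (N, d)) with plain EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]            # means initialized from data points
    sigma = np.stack([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d)] * K)
    w = np.full(K, 1.0 / K)                            # mixing coefficients

    for _ in range(n_iter):
        # E-step: gamma(z_nk) = w_k N(x_n|mu_k,Sigma_k) / sum_j w_j N(x_n|mu_j,Sigma_j)
        dens = np.stack([w[k] * multivariate_normal(mu[k], sigma[k]).pdf(X)
                         for k in range(K)], axis=1)   # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixing coefficients, means and covariances
        Nk = gamma.sum(axis=0)                         # effective number of points per component
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return w, mu, sigma
```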

Clustering

  • K-means clustering
    • assumption:
      • we know the number k of clusters
      • each cluster is defined by a cluster center
      • data points are assigned to the nearest cluster center
    • ISODATA clustering:
      • allows the number of clusters to change in each iteration
      • splitting of clusters with large standard deviation
      • merging of clusters whose centres are close together
      • dissolving of clusters with too few points
  • Mean shift
    • from each data point, do local gradient ascent on the density
    • call all data points that reach the same mode a cluster
    • How to do local gradient ascent:
      • place a circular window at the data point
      • compute (weighted) mean of all points within the window;
      • weights depend on the kernel function $k_h(x - x_i)$
      • move the window to the mean
      • iterate to convergence
    • Example: Image segmentation
      • Trick to achieve compact clusters: include the pixel coordinates in the feature vector $x = \{R, G, B, X, Y\}$ (see the mean-shift sketch after this list)
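
A minimal sketch of the mean-shift procedure listed above, using a flat (circular-window) kernel rather than a general $k_h$; the bandwidth, tolerance, and the simple mode-merging rule are illustrative assumptions. For image segmentation, each row of X would be a pixel feature vector $\{R, G, B, X, Y\}$.

```python
import numpy as np

def mean_shift(X, bandwidth, max_iter=100, tol=1e-3):
    """Shift every point uphill on the density estimate until convergence;
    points that end up at (approximately) the same mode form one cluster."""
    modes = X.astype(float).copy()
    for i in range(len(X)):
        m = modes[i]
        for _ in range(max_iter):
            # flat kernel: unweighted mean of all points inside the circular window
            in_window = np.linalg.norm(X - m, axis=1) < bandwidth
            new_m = X[in_window].mean(axis=0)
            if np.linalg.norm(new_m - m) < tol:        # window stopped moving
                break
            m = new_m
        modes[i] = m

    # merge modes that are closer than the bandwidth into shared cluster labels
    labels = np.empty(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < bandwidth:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)
```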

Graph-based Clustering

  • Agglomerative Clustering: let each point be its own cluster, then iteratively find the two most similar clusters and merge them;
    • Different ways to define the inter-cluster distance:
      • single linkage: based on the closest single point pair, not very robust; cluster boundaries follow low-density valleys and cluster shapes can be irregular;
      • complete linkage: based on the farthest point pair, aims to keep clusters compact; cluster shapes are more regular.
  • Divisive Clustering: let all points form one cluster, then iteratively split a cluster into its two most dissimilar parts.
  • Spectral Clustering: emphasizes connectivity rather than compactness.
    • Motivation: view the problem of clustering, or image segmentation, as a minimum graph cut problem.
    • A graph consists of vertices, edges, and edge weights. Each data point in clustering, or each pixel in an image, corresponds to a vertex in the graph, and each edge weight measures the similarity between two data points.
    • For the example of image segmentation, the weight can be calculated from the Euclidean distance between the feature vectors $x = \{R, G, B, X, Y\}$ of two pixels, using the formula $w_{ij} = \exp\left\{-\frac{\|x_i - x_j\|^2}{\sigma^2}\right\}$
    • Optimization goal: find the cut whose sum of edge weights is smallest: minimize $cut(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$;
    • Without additional constraints this objective is biased >> it tends to cut off small, isolated clusters.
    • To avoid this, we need to keep the clusters from becoming too small, so we put a measure of cluster size in the denominator:
      $Ncut(A, B) = cut(A, B) \cdot \left(\frac{1}{Vol(A)} + \frac{1}{Vol(B)}\right)$

      $Vol(A)$: volume of subgraph $A$: $Vol(A) = \sum_{i \in A} d_i$
      $d_i$: degree of vertex $i$: $d_i = \sum_j w_{ij}$
    • Procedure to solve spectral clustering
      • calculate graph Laplacian matrix:
        $L = D - W = \mathrm{diag}\{d_1, d_2, \ldots, d_n\} - \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}$
      • the graph Laplacian matrix satisfies $f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$ for any vector $f$;
      • By choosing $f$ as a specific cluster-indicator vector, one obtains $f^T L f = |V| \cdot Ncut(A, B)$, where $|V|$ is a constant.
      • According to the Rayleigh quotient, the maximum and minimum of $R(L, f) = \frac{f^T L f}{f^T f}$ equal the largest and smallest eigenvalues of $L$; since $f^T f$ is a constant, minimizing $R(L, f)$ is equivalent to minimizing $f^T L f = |V| \cdot Ncut(A, B)$.
      • However, the smallest eigenvalue of $L$ is 0 and its eigenvector (the constant vector) does not satisfy the required condition, so we take the eigenvector $v$ (an $n$-dimensional vector) of the second-smallest eigenvalue, whose elements are either positive or negative.
      • The positive elements correspond to one cluster, the negative elements to the other.
      • Generalization to 3 or more clusters: choose the $k$ eigenvectors with the $k$ smallest eigenvalues, stack them as a new $N \times k$ matrix, and run K-means on its rows to find the $k$ clusters >> clustering after the transformation (see the sketch after this list).
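
A minimal sketch of the spectral clustering recipe above: Gaussian affinities $w_{ij}$, the unnormalized Laplacian $L = D - W$, the $k$ eigenvectors of the smallest eigenvalues, and K-means on the embedded rows. The value of $\sigma$, the toy data, and the use of SciPy/scikit-learn helpers are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X into k groups via the graph Laplacian."""
    # affinity matrix: w_ij = exp(-||x_i - x_j||^2 / sigma^2)
    W = np.exp(-cdist(X, X, "sqeuclidean") / sigma**2)
    np.fill_diagonal(W, 0.0)                  # no self-edges

    D = np.diag(W.sum(axis=1))                # degree matrix, d_i = sum_j w_ij
    L = D - W                                 # unnormalized graph Laplacian

    # eigenvectors of the k smallest eigenvalues (eigh returns them in ascending order)
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                        # N x k spectral embedding of the data

    # K-means on the embedded points gives the final clusters
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# toy example: two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(np.bincount(spectral_clustering(X, k=2)))   # roughly 50 points per cluster
```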

Examples of clustering

  • Unsupervised learning
    • Requirement: different clusters should be well-separated and enough data should be provided.
  • Image segmentation:

