Unsupervised Learning
Motivation: Previously we adopted two assumptions:
1. a known functional form that maps the observed input to the output;
2. a training dataset used to fit the parameters of the model.
Without these assumptions we get:
1. unknown functional form: non-parametric density estimation, which only returns probability values without knowing the exact form of the PDF (probability density function);
2. only data, no output: data clustering.
Density Estimation
- Histogram: discretize the feature space into bins and count
- Pro:
- with an infinite amount of data, any density can be approximated arbitrarily well >> approaches the continuous density
- computationally simple
- Con:
- curse of dimensionality: the number of bins, and thus the amount of data required, grows exponentially with the data dimensionality;
- the bin size is hard to choose, and there is no universally optimal size.
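As a minimal sketch of the histogram estimator (assuming NumPy; the 1-D data below is synthetic and the bin count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data: two Gaussian bumps (purely illustrative).
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.5, 1.0, 500)])

# Histogram density estimate: discretize the axis into bins and count,
# normalizing so the bars integrate to 1 (density=True).
counts, edges = np.histogram(data, bins=30, density=True)

# The estimate is piecewise constant: to evaluate it at a query point x,
# find the bin containing x and read off its normalized count.
x = 0.0
bin_index = np.clip(np.searchsorted(edges, x) - 1, 0, len(counts) - 1)
print("histogram density estimate at x = 0:", counts[bin_index])
```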
Kernel density estimation
- given a query data point x, we want to estimate its probability density Pr(x);
$\Pr(x) = \frac{K}{NV} = \frac{1}{N h^d} \sum_{i=1}^{N} k_h(x - x_i)$
- where:
- $h^d$ is the volume of the kernel window in $d$-dimensional feature space;
- $N$ is the total number of data points in the given dataset;
- $k_h$ is the kernel function, where $h$ stands for the kernel width;
Understanding:
In general, the kernel function $k_h(x - x_i)$ defines a weight that depends on the distance between each data point $x_i$ and the examined location $x$ in feature space.
The denominator $NV$ provides the normalization that turns the count into a density, while the sum of kernel values plays the role of the count $K$. Different kernel functions:
- Parzen window estimator:
- the kernel is a uniform (box) function: count the data points falling inside a hypercube of side $h$ centred at the query point $x$, then divide by $N h^d$ to obtain Pr(x);
- Bias-variance trade-off: governed by the influence radius of a point
- examples: bin width of the histogram, kernel width of the kernel function, number of neighbours for kNN;
- too large a radius gives an over-smoothed estimate >> large bias: a multimodal density is mistakenly fitted as a single-peak Gaussian-like bump;
- too small a radius gives an overly variable estimate >> large variance: a single-peak density is mistakenly fitted as a multimodal function (see the sketch after this list).
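A minimal sketch of 1-D Gaussian kernel density estimation (assuming NumPy; the data and the bandwidth values are illustrative), showing how the kernel width $h$ drives the bias-variance trade-off:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal data: a reasonable h should reveal both peaks.
data = np.concatenate([rng.normal(-2.0, 0.4, 300), rng.normal(2.0, 0.4, 300)])

def kde(x, samples, h):
    """Pr(x) = 1/(N h) * sum_i k((x - x_i)/h) with a Gaussian kernel (d = 1)."""
    u = (x[:, None] - samples[None, :]) / h          # pairwise scaled distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
    return k.sum(axis=1) / (len(samples) * h)

grid = np.linspace(-5.0, 5.0, 1001)
for h in (0.03, 0.3, 3.0):                            # too small, reasonable, too large
    density = kde(grid, data, h)
    n_peaks = np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:]))
    print(f"h = {h}: {n_peaks} local maxima on the grid")
# Expected behaviour: a very large h over-smooths the two modes into one peak (high bias),
# while a very small h tends to produce spurious peaks (high variance).
```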
Mixture Model
- Parameters we want to estimate with Expectation Maximization (EM)
- the mean $\mu_i$, the covariance matrix $\Sigma_i$ and the occurrence probability (also called the mixing coefficient) $\pi_i$ of the $i$-th Gaussian component;
- after initializing these three parameters, updating $\mu_i$, $\Sigma_i$ and the mixing coefficient $\pi_i$ requires the posterior probability (responsibility) of each data point $x_n$: $\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$ (a minimal EM sketch follows this list).
- See PRML, page 435.
- EM for change detection in video:
- for each pixel we fit a Gaussian mixture model over time;
- each newly observed pixel value (from the video stream) is then evaluated under the mixture model, and we check whether its probability falls below a predefined threshold;
- if the calculated probability is too low, the pixel intensity has changed; otherwise it has remained stable.
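A minimal EM sketch for a 1-D Gaussian mixture, following the responsibility update above (assuming NumPy/SciPy; the data, K = 2 and the change-detection threshold are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Synthetic 1-D data from two Gaussians (an illustrative stand-in for pixel intensities).
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(5.0, 0.7, 200)])
N, K = len(x), 2

# Initialize mixing coefficients pi_k, means mu_k and variances var_k.
pi = np.full(K, 1.0 / K)
mu = np.array([x.min(), x.max()])
var = np.full(K, x.var())

for _ in range(100):
    # E-step: gamma(z_nk) = pi_k N(x_n|mu_k, var_k) / sum_j pi_j N(x_n|mu_j, var_j)
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(var[k])) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print("means:", mu, "variances:", var, "mixing coefficients:", pi)

# Change-detection idea: flag a new value as "changed" if its mixture density
# falls below a predefined threshold (both values here are illustrative).
x_new, threshold = 20.0, 1e-4
p = sum(pi[k] * norm.pdf(x_new, mu[k], np.sqrt(var[k])) for k in range(K))
print("changed" if p < threshold else "stable")
```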
Clustering
- K-means clustering (see the K-means sketch after this list)
- assumption:
- we know the number k of clusters
- each cluster is defined by a cluster center
- data points are assigned to the nearest cluster center
- ISODATA clustering:
- the number of clusters is allowed to change in each iteration
- splitting of clusters with large standard deviation
- merging of clusters with a small distance between centres
- dissolving clusters with too few points
- Mean shift (see the mean-shift sketch after this list)
- from each data point, do local gradient ascent on the density
- call all data points that reach the same mode a cluster
- How to do local gradient ascent:
- place a circular window at the data point
- compute (weighted) mean of all points within the window;
- weights depend on the kernel function $k_h(x - x_i)$
- move the window to the mean
- iterate to convergence
- Example: Image segmentation
- Trick to achieve compact clusters: append the pixel coordinates to the feature vector, $x = \{R, G, B, X, Y\}$
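A minimal K-means sketch matching the assumptions listed above (NumPy only; the data and k are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assign every data point to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        # (Empty clusters are not handled in this sketch.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 3), (0, 3))])
labels, centers = kmeans(X, k=3)
print("cluster centers:\n", centers)
```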
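And a minimal mean-shift sketch with a flat circular window (NumPy only; the window radius is an illustrative choice; a Gaussian kernel $k_h$ would instead weight the points inside the window):

```python
import numpy as np

def mean_shift(X, radius=0.8, iters=50, tol=1e-4):
    modes = X.copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, p in enumerate(modes):
            # Place a circular window at the current position and take the mean
            # of all points inside it (flat kernel).
            inside = np.linalg.norm(X - p, axis=1) < radius
            shifted[i] = X[inside].mean(axis=0)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Points whose windows converged to (nearly) the same mode form one cluster.
    labels = -np.ones(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < radius / 2:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0, 0), 0.3, size=(80, 2)), rng.normal((3, 3), 0.3, size=(80, 2))])
labels, centers = mean_shift(X)
print("number of clusters found:", len(centers))
```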
Graph-based Clustering
- Agglomerative Clustering: let each point be its own cluster; iteratively find the two most similar clusters and merge them (see the linkage sketch at the end of this section);
- Different ways to define the inter-cluster distance:
- single linkage: based on the closest single point pair, so not very robust; the cluster boundary follows low-density valleys and the cluster shape can be irregular.
- complete linkage: based on the farthest point pair, which aims to keep groups compact; the cluster shape is more regular.
- Divisive Clustering: let all points form one cluster; iteratively pick a cluster and split it into the two most dissimilar groups.
- Spectral Clustering: emphasize connectivity rather than compactness.
- Motivation: view the clustering (or image segmentation) problem as a minimum graph cut problem.
- A graph consists of vertices, edges and edge weights. Each data point in clustering, or each pixel in an image, corresponds to a vertex; the edges and their weights measure the similarity between data points.
- For the example of image segmentation, the weight can be computed from the Euclidean distance between the per-pixel feature vectors $x = \{R, G, B, X, Y\}$, using $w_{ij} = \exp\left(-\dfrac{\lVert x_i - x_j \rVert^2}{\sigma^2}\right)$
- Optimization goal: find the cut whose total edge weight is minimal: $\operatorname{minimize}\;\mathrm{cut}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$;
- Without additional constraints this criterion is biased >> it tends to cut off small isolated clusters.
- To avoid this, we control the cluster size (so it does not become too small) by putting a measure of cluster size in the denominator:
$\mathrm{Ncut}(A,B) = \mathrm{cut}(A,B) \cdot \left(\dfrac{1}{\mathrm{Vol}(A)} + \dfrac{1}{\mathrm{Vol}(B)}\right)$
$\mathrm{Vol}(A)$: volume of subgraph $A$: $\mathrm{Vol}(A) = \sum_{i \in A} d_i$
$d_i$: degree of each vertex, $d_i = \sum_j w_{ij}$
- Procedure to solve spectral clustering (a minimal sketch follows at the end of this section):
- calculate graph Laplacian matrix:
$L = D - W = \operatorname{diag}\{d_1, d_2, \ldots, d_n\} - \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}$
- the graph Laplacian satisfies $f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$ for an arbitrary vector $f$;
- By setting $f$ as a specific indicator vector, the following identity can be derived: $f^T L f = |V| \cdot \mathrm{NCut}(A,B)$. Here $|V|$ is a constant.
- According to the Rayleigh quotient: the maximum and minimum of $R(L, f) = \dfrac{f^T L f}{f^T f}$ are attained at the maximum and minimum eigenvalues of $L$; since $f^T f$ is a constant, minimizing $R(L, f)$ is equivalent to minimizing $f^T L f = |V| \cdot \mathrm{NCut}(A,B)$.
- However, since the minimum eigenvalue of $L$ equals 0 and its corresponding (constant) eigenvector does not satisfy the required condition, we instead take the eigenvector $v$ (an $n$-dimensional vector) of the second-smallest eigenvalue, whose elements are either positive or negative.
- The positive elements correspond to one cluster, while the negative elements correspond to the other.
- Generalization to 3 or more clusters: choose the $k$ eigenvectors with the $k$ smallest eigenvalues, arrange them as the columns of a new $N \times k$ matrix, then run K-means on its rows to find the $k$ clusters >> clustering after a spectral transformation.
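A minimal agglomerative-clustering sketch comparing single and complete linkage (assuming SciPy is available; the data is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Two elongated, well-separated groups of 2-D points.
X = np.vstack([rng.normal((0, 0), (2.0, 0.2), size=(100, 2)),
               rng.normal((0, 3), (2.0, 0.2), size=(100, 2))])

for method in ("single", "complete"):
    # Start with every point as its own cluster and iteratively merge the two most
    # similar clusters; "method" defines the inter-cluster distance.
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, "linkage cluster sizes:", np.bincount(labels)[1:])
# Single linkage can follow low-density valleys (good for elongated clusters but sensitive
# to noise bridges); complete linkage prefers compact, regular clusters.
```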
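And a minimal two-cluster spectral-clustering sketch following the Ncut procedure above (NumPy only; $\sigma$ and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal((0, 0), 0.4, size=(60, 2)), rng.normal((4, 0), 0.4, size=(60, 2))])

# Affinity matrix: w_ij = exp(-||x_i - x_j||^2 / sigma^2).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq_dists / sigma**2)
np.fill_diagonal(W, 0.0)  # no self-edges

# Degree vector and unnormalized graph Laplacian L = D - W.
d = W.sum(axis=1)
L = np.diag(d) - W

# The smallest eigenvalue of L is 0 (constant eigenvector), so the eigenvector of
# the second-smallest eigenvalue (the Fiedler vector) encodes the relaxed Ncut.
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = eigvecs[:, 1]

# Split by sign: positive entries form one cluster, negative entries the other.
labels = (fiedler > 0).astype(int)
print("cluster sizes:", np.bincount(labels))
# For k > 2 clusters: take the k eigenvectors with the smallest eigenvalues as an
# N x k matrix and run K-means on its rows.
```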
Example of clustering
- Clustering is unsupervised learning: no labelled outputs are required.
- Requirement: different clusters should be well-separated and enough data should be provided.
- Image segmentation: