Unsupervised Learning
Motivation: Previously we adopted two assumptions:
1. a known functional form that maps the observed input to the output;
2. a training dataset used to fit the parameters of the model.
Without these assumptions we get:
1. unknown functional form: non-parametric density estimation, which only returns probability values without knowing the exact form of the PDF (probability density function);
2. only data, no output: data clustering.
Density Estimation
- Histogram: discretize the feature space into bins and count
- Pro:
- with an infinite amount of data, any density can be approximated arbitrarily well >> approaches the continuous density
- computationally simple
- Con:
- curse of dimensionality: the number of bins, and thus the amount of data required, grows exponentially with the data dimensionality;
- the bin size is hard to choose, and there is no universally optimal size.
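As a minimal sketch of the histogram estimator (assuming NumPy; the 1-D data below is synthetic and the bin count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data: two Gaussian bumps (purely illustrative).
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.5, 1.0, 500)])

# Histogram density estimate: discretize the axis into bins and count,
# normalizing so the bars integrate to 1 (density=True).
counts, edges = np.histogram(data, bins=30, density=True)

# The estimate is piecewise constant: to evaluate it at a query point x,
# find the bin containing x and read off its normalized count.
x = 0.0
bin_index = np.clip(np.searchsorted(edges, x) - 1, 0, len(counts) - 1)
print("histogram density estimate at x = 0:", counts[bin_index])
```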
Kernel density estimation
- given a query data point x, we want to estimate its probability density Pr(x);
$\Pr(x) = \frac{K}{NV} = \frac{1}{N h^d} \sum_{i=1}^{N} k_h(x - x_i)$
- where:
- $h^d$ is the volume of the kernel window in $d$-dimensional feature space;
- $N$ is the total number of data points in the given dataset;
- $k_h$ is the kernel function, where $h$ stands for the kernel width;
Understanding:
In general, the kernel function $k_h(x - x_i)$ defines a weight that depends on the distance between each data point $x_i$ and the examined location $x$ in feature space.
The denominator $NV$ provides the normalization that turns the count into a density, while the sum of kernel values plays the role of the count $K$. Different kernel functions:
- Parzen window estimator:
- the kernel is a uniform (box) function: count the data points falling inside a hypercube of side $h$ centred at the query point $x$, then divide by $N h^d$ to obtain Pr(x);
- Bias-variance trade-off: governed by the influence radius of a point
- examples: bin width of the histogram, kernel width of the kernel function, number of neighbours for kNN;
- too large a radius gives an over-smoothed estimate >> large bias: a multimodal density is mistakenly fitted as a single-peak Gaussian-like bump;
- too small a radius gives an overly variable estimate >> large variance: a single-peak density is mistakenly fitted as a multimodal function (see the sketch after this list).
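A minimal sketch of 1-D Gaussian kernel density estimation (assuming NumPy; the data and the bandwidth values are illustrative), showing how the kernel width $h$ drives the bias-variance trade-off:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal data: a reasonable h should reveal both peaks.
data = np.concatenate([rng.normal(-2.0, 0.4, 300), rng.normal(2.0, 0.4, 300)])

def kde(x, samples, h):
    """Pr(x) = 1/(N h) * sum_i k((x - x_i)/h) with a Gaussian kernel (d = 1)."""
    u = (x[:, None] - samples[None, :]) / h          # pairwise scaled distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
    return k.sum(axis=1) / (len(samples) * h)

grid = np.linspace(-5.0, 5.0, 1001)
for h in (0.03, 0.3, 3.0):                            # too small, reasonable, too large
    density = kde(grid, data, h)
    n_peaks = np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:]))
    print(f"h = {h}: {n_peaks} local maxima on the grid")
# Expected behaviour: a very large h over-smooths the two modes into one peak (high bias),
# while a very small h tends to produce spurious peaks (high variance).
```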
Mixture Model
- Parameters we want to estimate with Expectation Maximization (EM)
- the mean $\mu_i$, the covariance matrix $\Sigma_i$ and the occurrence probability (also called the mixing coefficient) $\pi_i$ of the $i$-th Gaussian component;
- after initializing these three parameters, updating $\mu_i$, $\Sigma_i$ and the mixing coefficient $\pi_i$ requires the posterior probability (responsibility) of each data point $x_n$: $\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$ (a minimal EM sketch follows this list).
- See PRML, page 435.
- EM for change detection in video:
- for each pixel we fit a Gaussian mixture model over time;
- each newly observed pixel value (from the video stream) is then evaluated under the mixture model, and we check whether its probability falls below a predefined threshold;
- if the calculated probability is too low, the pixel intensity has changed; otherwise it has remained stable.
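A minimal EM sketch for a 1-D Gaussian mixture, following the responsibility update above (assuming NumPy/SciPy; the data, K = 2 and the change-detection threshold are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Synthetic 1-D data from two Gaussians (an illustrative stand-in for pixel intensities).
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(5.0, 0.7, 200)])
N, K = len(x), 2

# Initialize mixing coefficients pi_k, means mu_k and variances var_k.
pi = np.full(K, 1.0 / K)
mu = np.array([x.min(), x.max()])
var = np.full(K, x.var())

for _ in range(100):
    # E-step: gamma(z_nk) = pi_k N(x_n|mu_k, var_k) / sum_j pi_j N(x_n|mu_j, var_j)
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(var[k])) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print("means:", mu, "variances:", var, "mixing coefficients:", pi)

# Change-detection idea: flag a new value as "changed" if its mixture density
# falls below a predefined threshold (both values here are illustrative).
x_new, threshold = 20.0, 1e-4
p = sum(pi[k] * norm.pdf(x_new, mu[k], np.sqrt(var[k])) for k in range(K))
print("changed" if p < threshold else "stable")
```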
Clustering
- K-means clustering (see the K-means sketch after this list)
- assumption:
- we know the number k of clusters
- each cluster is defined by a cluster center
- data points are assigned to the nearest cluster center
- ISODATA clustering:
- the number of clusters is allowed to change in each iteration
- splitting of clusters with large standard deviation
- merging of clusters with a small distance between centres
- dissolving clusters with too few points
- Mean shift (see the mean-shift sketch after this list)
- from each data point, do local gradient ascent on the density
- call all data points that reach the same mode a cluster
- How to do local gradient ascent:
- place a circular window at the data point
- compute (weighted) mean of all points within the window;
- weights depend on the kernel function $k_h(x - x_i)$
- move the window to the mean
- iterate to convergence
- Example: Image segmentation
- Trick to achieve compact clusters: append the pixel coordinates to the feature vector, $x = \{R, G, B, X, Y\}$
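A minimal K-means sketch matching the assumptions listed above (NumPy only; the data and k are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assign every data point to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        # (Empty clusters are not handled in this sketch.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 3), (0, 3))])
labels, centers = kmeans(X, k=3)
print("cluster centers:\n", centers)
```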
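And a minimal mean-shift sketch with a flat circular window (NumPy only; the window radius is an illustrative choice; a Gaussian kernel $k_h$ would instead weight the points inside the window):

```python
import numpy as np

def mean_shift(X, radius=0.8, iters=50, tol=1e-4):
    modes = X.copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, p in enumerate(modes):
            # Place a circular window at the current position and take the mean
            # of all points inside it (flat kernel).
            inside = np.linalg.norm(X - p, axis=1) < radius
            shifted[i] = X[inside].mean(axis=0)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Points whose windows converged to (nearly) the same mode form one cluster.
    labels = -np.ones(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < radius / 2:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0, 0), 0.3, size=(80, 2)), rng.normal((3, 3), 0.3, size=(80, 2))])
labels, centers = mean_shift(X)
print("number of clusters found:", len(centers))
```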
Graph-based Clustering
- Agglomerative Clustering: let each point be its own cluster; iteratively find the two most similar clusters and merge them (see the linkage sketch at the end of this section);
- Different ways to define the inter-cluster distance:
- single linkage: based on the closest single point pair, so not very robust; the cluster boundary follows low-density valleys and the cluster shape can be irregular.
- complete linkage: based on the farthest point pair, which aims to keep groups compact; the cluster shape is more regular.
- Divisive Clustering: let all points form one cluster; iteratively pick a cluster and split it into the two most dissimilar groups.
- Spectral Clustering: emphasize connectivity rather than compactness.
- Motivation: view the clustering (or image segmentation) problem as a minimum graph cut problem.
- A graph consists of vertices, edges and edge weights. Each data point in clustering, or each pixel in an image, corresponds to a vertex; the edges and their weights measure the similarity between data points.
- For the example of image segmentation, the weight can be computed from the Euclidean distance between the per-pixel feature vectors $x = \{R, G, B, X, Y\}$, using $w_{ij} = \exp\left(-\dfrac{\lVert x_i - x_j \rVert^2}{\sigma^2}\right)$
- Optimization goal: find the cut whose total edge weight is minimal: $\operatorname{minimize}\;\mathrm{cut}(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$;
- Without additional constraints this criterion is biased >> it tends to cut off small isolated clusters.
- To avoid this, we control the cluster size (so it does not become too small) by putting a measure of cluster size in the denominator:
$\mathrm{Ncut}(A,B) = \mathrm{cut}(A,B) \cdot \left(\dfrac{1}{\mathrm{Vol}(A)} + \dfrac{1}{\mathrm{Vol}(B)}\right)$
$\mathrm{Vol}(A)$: volume of subgraph $A$: $\mathrm{Vol}(A) = \sum_{i \in A} d_i$
$d_i$: degree of each vertex, $d_i = \sum_j w_{ij}$
- Procedure to solve spectral clustering (a minimal sketch follows at the end of this section):
- calculate graph Laplacian matrix:
$L = D - W = \operatorname{diag}\{d_1, d_2, \ldots, d_n\} - \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}$
- the graph Laplacian satisfies $f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$ for an arbitrary vector $f$;
- By setting $f$ as a specific indicator vector, the following identity can be derived: $f^T L f = |V| \cdot \mathrm{NCut}(A,B)$. Here $|V|$ is a constant.
- According to the Rayleigh quotient: the maximum and minimum of $R(L, f) = \dfrac{f^T L f}{f^T f}$ are attained at the maximum and minimum eigenvalues of $L$; since $f^T f$ is a constant, minimizing $R(L, f)$ is equivalent to minimizing $f^T L f = |V| \cdot \mathrm{NCut}(A,B)$.
- However, since the minimum eigenvalue of $L$ equals 0 and its corresponding (constant) eigenvector does not satisfy the required condition, we instead take the eigenvector $v$ (an $n$-dimensional vector) of the second-smallest eigenvalue, whose elements are either positive or negative.
- The positive elements correspond to one cluster, while the negative elements correspond to the other.
- Generalization to 3 or more clusters: choose the $k$ eigenvectors with the $k$ smallest eigenvalues, arrange them as the columns of a new $N \times k$ matrix, then run K-means on its rows to find the $k$ clusters >> clustering after a spectral transformation.
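A minimal agglomerative-clustering sketch comparing single and complete linkage (assuming SciPy is available; the data is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Two elongated, well-separated groups of 2-D points.
X = np.vstack([rng.normal((0, 0), (2.0, 0.2), size=(100, 2)),
               rng.normal((0, 3), (2.0, 0.2), size=(100, 2))])

for method in ("single", "complete"):
    # Start with every point as its own cluster and iteratively merge the two most
    # similar clusters; "method" defines the inter-cluster distance.
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, "linkage cluster sizes:", np.bincount(labels)[1:])
# Single linkage can follow low-density valleys (good for elongated clusters but sensitive
# to noise bridges); complete linkage prefers compact, regular clusters.
```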
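And a minimal two-cluster spectral-clustering sketch following the Ncut procedure above (NumPy only; $\sigma$ and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal((0, 0), 0.4, size=(60, 2)), rng.normal((4, 0), 0.4, size=(60, 2))])

# Affinity matrix: w_ij = exp(-||x_i - x_j||^2 / sigma^2).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq_dists / sigma**2)
np.fill_diagonal(W, 0.0)  # no self-edges

# Degree vector and unnormalized graph Laplacian L = D - W.
d = W.sum(axis=1)
L = np.diag(d) - W

# The smallest eigenvalue of L is 0 (constant eigenvector), so the eigenvector of
# the second-smallest eigenvalue (the Fiedler vector) encodes the relaxed Ncut.
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = eigvecs[:, 1]

# Split by sign: positive entries form one cluster, negative entries the other.
labels = (fiedler > 0).astype(int)
print("cluster sizes:", np.bincount(labels))
# For k > 2 clusters: take the k eigenvectors with the smallest eigenvalues as an
# N x k matrix and run K-means on its rows.
```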
Example of clustering
- Clustering is unsupervised learning: no labelled outputs are required.
- Requirement: different clusters should be well-separated and enough data should be provided.
- Image segmentation: