Unsupervised Learning
Motivation: Previously, we adopted two assumptions:
1. a known functional form that maps the observed input to the output;
2. a training dataset to train the parameters of the model.
Without these assumptions, we have:
1. unknown functional form: non-parametric density estimation, which yields only probability values without knowing the exact form of the PDF (probability density function);
2. only data, no output labels: data clustering.
Density Estimation
- Histogram: discretize the feature space into bins and count the data points falling into each bin
- Pro:
- with an infinite amount of data, any density can be approximated arbitrarily well >> approaches the continuous density
- computationally simple
- Con:
- curse of dimensionality: the number of bins, and thus the amount of data required, grows exponentially with the data dimensionality;
- the bin size is hard to choose, and there is no universally optimal size (see the sketch below).
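A minimal sketch of the histogram estimator in NumPy, assuming 1-D data and equal-width bins; the bin count and the synthetic sample data are illustrative, not from the notes.

```python
import numpy as np

def histogram_density(data, n_bins=20):
    """Estimate a 1-D density by discretizing the space into equal-width bins."""
    counts, edges = np.histogram(data, bins=n_bins)
    widths = np.diff(edges)
    # Normalize: count / (N * bin volume), so the estimate integrates to 1.
    density = counts / (len(data) * widths)
    return density, edges

# Illustrative usage on synthetic bimodal data.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 1.0, 500)])
density, edges = histogram_density(data)
print((density * np.diff(edges)).sum())  # ~1.0: a valid density
```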
Kernel density estimation
- given a data point x, we want to output its probability density Pr(x);
$$\Pr(x) = \frac{K}{NV} = \frac{1}{N h^d} \sum_{i=1}^{N} k_h(x - x_i)$$
- where:
    - $K$ is the (kernel-weighted) number of points falling inside the volume $V$;
    - $h^d$ is the volume $V$ of the kernel window in $d$-dimensional space;
    - $N$ is the total number of data points;
    - $k_h$ is the kernel function, where $h$ stands for the kernel width.
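A minimal sketch of this estimator in NumPy, assuming a Gaussian kernel (the notes do not fix the kernel choice); the bandwidth `h` and test data are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel evaluated on the rows of u (the scaled offsets (x - x_i) / h)."""
    d = u.shape[1]
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)

def kde(x, data, h=0.5, kernel=gaussian_kernel):
    """Pr(x) = (1 / (N h^d)) * sum_i kernel((x - x_i) / h)."""
    data = np.atleast_2d(data)   # shape (N, d)
    N, d = data.shape
    u = (x - data) / h           # offset of x to every data point, in kernel widths
    return kernel(u).sum() / (N * h**d)

# Illustrative usage on 1-D standard normal samples.
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000).reshape(-1, 1)
print(kde(np.array([0.0]), data, h=0.3))  # near the true N(0, 1) peak of ~0.399
```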
Understanding:
In general, the kernel function $k_h(x - x_i)$ defines a weight that depends on the distance from each data point $x_i$ to the examined location $x$ in feature space.
The denominator $NV$ accounts for the term "density": dividing the (kernel-weighted) point count by the volume turns a count into a density. Different kernel functions:
- Parzen window estimator:
    - uses a hypercube (uniform) kernel: a data point contributes weight 1 if it lies inside the hypercube of side length $h$ centered at $x$, and 0 otherwise (see the sketch below);
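A sketch of the Parzen window as a hypercube kernel that can be plugged into the `kde` sketch above in place of `gaussian_kernel`; the function name is an assumption for illustration.

```python
import numpy as np

def parzen_kernel(u):
    """Hypercube (Parzen window) kernel: weight 1 inside the unit cube, 0 outside.
    Combined with the 1/(N h^d) factor in kde() above, this counts the
    points falling within the cube of side h centered at x."""
    return np.all(np.abs(u) <= 0.5, axis=1).astype(float)

# Usable with the sketch above, e.g.: kde(x, data, h=0.3, kernel=parzen_kernel)
```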
- Bias-variance trade-off: refers to the influence radius of a point;
    - for example, the bin width of a histogram, the kernel width of a kernel function, or the number of neighbors for kNN;
    - too large a radius yields an oversmoothed estimate >> large bias: a multimodal density is mistakenly fitted as a single-peak Gaussian;
    - too small a radius yields an overly variable estimate >> large variance: a single-peak Gaussian is mistakenly fitted as a multimodal function (see the sketch after this list).
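A short sketch illustrating the trade-off: a very large bandwidth smears two peaks into one (bias), a very small one produces spurious bumps (variance). The bandwidth values and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal ground truth: two well-separated 1-D Gaussians.
data = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
grid = np.linspace(-6, 6, 121)

def gaussian_kde_1d(g, data, h):
    """1-D Gaussian kernel density estimate evaluated at point g."""
    return np.exp(-0.5 * ((g - data) / h) ** 2).sum() / (len(data) * h * np.sqrt(2 * np.pi))

for h in (5.0, 0.05, 0.5):  # too smooth (bias), too wiggly (variance), reasonable
    est = np.array([gaussian_kde_1d(g, data, h) for g in grid])
    # Count local maxima of the estimate: 1 = oversmoothed, 2 = correct, many = overfit.
    n_peaks = np.sum((est[1:-1] > est[:-2]) & (est[1:-1] > est[2:]))
    print(f"h={h}: {n_peaks} local maxima")
```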
Mixture Model
- Parameters we want to estimate with Expectation Maximization (EM):
    - the mean $\mu_i$, the covariance matrix $\Sigma_i$, and the occurrence probability (also called the mixing coefficient) $w_i$ of the $i$-th Gaussian component;
- The updates of $\mu_i$, the covariance matrix $\Sigma_i$, and $w_i$ alternate between the E-step (computing each point's responsibility under each component) and the M-step (re-estimating the parameters from the responsibility-weighted data), as sketched below.
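A minimal sketch of one EM iteration for a Gaussian mixture, following the standard responsibility-weighted updates; the array shapes and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mu, Sigma, w):
    """One EM iteration for a Gaussian mixture model.
    X: (N, d) data; mu: (K, d) means; Sigma: (K, d, d) covariances; w: (K,) mixing coefficients."""
    N, K = X.shape[0], len(w)
    # E-step: responsibility r[n, k] proportional to w_k * N(x_n | mu_k, Sigma_k).
    r = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data.
    Nk = r.sum(axis=0)                    # effective number of points per component
    mu_new = (r.T @ X) / Nk[:, None]      # weighted means
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariances
    w_new = Nk / N                        # occurrence probabilities (mixing coefficients)
    return mu_new, Sigma_new, w_new
```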