There are two major issues to consider in clustering:
What is a good inter-cluster distance?
Single-link: Using the distance between the closest elements
Complete-link: Using the distance between the furthest elements
Group average: Using the average distance over all pairs of elements, one from each cluster
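The three linkage criteria above can be sketched directly from a pairwise distance matrix. A minimal example (function name and sample points are illustrative):

```python
import numpy as np

def cluster_distances(A, B):
    """Inter-cluster distances between clusters A and B (rows are points)."""
    # Pairwise Euclidean distance matrix via broadcasting
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return {
        "single-link": d.min(),      # distance between the closest pair
        "complete-link": d.max(),    # distance between the furthest pair
        "group-average": d.mean(),   # average over all cross-cluster pairs
    }

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(cluster_distances(A, B))  # single: 2.0, complete: 5.0, average: 3.5
```

Note that single-link tends to produce elongated "chained" clusters, while complete-link favors compact ones; group average sits in between.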
How many clusters are there?
It can be given a priori, e.g., the number of desired regions
It can be inferred indirectly by setting a distance threshold below which two points are considered to belong to the same cluster.
Two problems to avoid
Under-fitting: too few clusters
Over-fitting: too many clusters
Two clustering strategies
Agglomerative clustering
Divisive clustering
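Agglomerative clustering with a distance threshold combines both ideas above: merging stops once the closest pair of clusters is further apart than the threshold, so the number of clusters falls out of the threshold rather than being given a priori. A naive single-link sketch (the function name, data, and threshold are illustrative):

```python
import numpy as np

def agglomerative_single_link(X, threshold):
    """Bottom-up clustering: start with one cluster per point and repeatedly
    merge the closest pair (single-link distance) until the smallest
    inter-cluster distance exceeds the threshold."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: distance between the closest pair of members
                d = min(np.linalg.norm(X[a] - X[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        if d > threshold:          # stopping rule infers the cluster count
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

X = np.array([[0., 0.], [0.2, 0.], [5., 5.], [5.2, 5.]])
print(agglomerative_single_link(X, threshold=1.0))  # two clusters: [0, 1] and [2, 3]
```

Setting the threshold too large under-fits (too few clusters), too small over-fits (too many), matching the two problems noted above. Divisive clustering works in the opposite direction, recursively splitting a single all-inclusive cluster.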
Missing Data Problem: Example
Let us consider an example of a missing-data problem.
Assume that people can be classified into three groups according to physical size: big, medium, and small.
Each group is characterized by its population percentage and a 2-D Gaussian describing the joint distribution of weight and height.
The reason for using Gaussian distributions instead of hard thresholds is the uncertainty, or error, in weight and height measurements.
Now you are given data from a certain population along with two tasks:
Estimate the model parameters for each class
Classify each data point into one of three classes
That means the class labels are missing from the data we have collected, and we need to estimate them.
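The two tasks can be tackled jointly with the Expectation-Maximization (EM) algorithm, which alternates between soft-labeling the data and re-estimating the class parameters. A minimal 1-D sketch (the example above is 2-D; the data, initial values, and group parameters here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D "weights" drawn from three assumed groups
y = np.concatenate([rng.normal(50, 5, 300),   # small
                    rng.normal(70, 5, 400),   # medium
                    rng.normal(95, 5, 300)])  # big

K = 3
priors = np.full(K, 1 / K)             # population percentages (unknown)
mu = np.array([40., 60., 80.])         # initial guesses for the means
var = np.full(K, 25.)                  # initial guesses for the variances

for _ in range(50):
    # E-step: posterior responsibility of each class for each sample
    lik = np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    post = priors * lik
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft labels
    Nk = post.sum(axis=0)
    priors = Nk / len(y)
    mu = (post * y[:, None]).sum(axis=0) / Nk
    var = (post * (y[:, None] - mu) ** 2).sum(axis=0) / Nk

labels = post.argmax(axis=1)   # task 2: classify each point into one class
print(np.sort(mu))             # task 1: estimated means, near 50, 70, 95
```

Each iteration fills in the missing labels (softly) given the current parameters, then refits the parameters given those labels.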
Probabilistic Formulation
Prior probability: what you know before you even see the data or the observation. It is like your prior knowledge.
Likelihood function: a way to evaluate how likely a data sample is to have been generated by a certain class. It is like your evidence.
Posterior probability: based on what you see and what you know, the probability that a data sample y belongs to a certain class label. It is like the estimate of the missing data.
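The three quantities combine through Bayes' rule: the posterior is proportional to the prior times the likelihood. A small sketch for the size-group example (all priors and class parameters below are hypothetical):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """1-D Gaussian likelihood of observation y under N(mu, sigma^2)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical priors and class-conditional parameters: small, medium, big
priors = np.array([0.3, 0.5, 0.2])   # P(class), the population percentages
mus    = np.array([50., 70., 95.])   # class mean weights (assumed, in kg)
sigmas = np.array([5., 5., 5.])

y = 72.0                                      # an observed weight
lik = gaussian_pdf(y, mus, sigmas)            # likelihood P(y | class)
post = priors * lik / (priors * lik).sum()    # posterior P(class | y)
print(post.round(3))   # the "medium" class dominates for this sample
```

Classifying the sample then amounts to picking the class with the highest posterior, which is exactly the soft label that EM estimates for every data point.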