Table of Contents
- applications of cluster analysis
- Other distinctions between sets of clusters
- Types of clusters
- Characteristics of the input data are important
- Clustering algorithms
- K-means clustering
- K-means Clustering – Details
- Evaluating K-means Clusters
- K-means as an Optimization Problem
- Limitation of K-means
- Overcoming K-means limitations
- Solutions to initial centroids problem
- Updating Centers Incrementally
- Bisecting K-means
- Handling empty clusters
- Bisecting k-means
- Strengths of Hierarchical Clustering
- DBSCAN
applications of cluster analysis
- understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- summarization: reduce the size of large data sets
What is not cluster analysis?
- Supervised classification
  - Have class label information
- Simple segmentation
  - Dividing students into different registration groups alphabetically, by last name
- Results of a query
  - Groupings are a result of an external specification
  - Clustering is a grouping of objects based on the data
- Graph partitioning
- Association analysis
  - Local vs. global connections
Notion of a cluster can be ambiguous
Types of clusterings
- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional clustering
  - A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering
  - A set of nested clusters organized as a hierarchical tree
Other distinctions between sets of clusters
- Exclusive versus non-exclusive
  - In non-exclusive clusterings, points may belong to multiple clusters.
  - Can represent multiple classes or 'border' points
- Fuzzy versus non-fuzzy
  - In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  - Weights must sum to 1
  - Probabilistic clustering has similar characteristics
- Partial versus complete
  - In some cases, we only want to cluster some of the data
- Heterogeneous versus homogeneous
  - Clusters of widely different sizes, shapes, and densities
Types of clusters
- Well-separated clusters
  - A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
- Center-based clusters
  - A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster
  - The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
- Contiguous clusters (nearest neighbor or transitive)
  - A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
- Density-based clusters
  - A cluster is a dense region of points, separated from other regions of high density by low-density regions.
  - Used when the clusters are irregular or intertwined, and when noise and outliers are present.
- Shared-property or conceptual clusters
- Clusters defined by an objective function
  - Map the clustering problem to a different domain and solve a related problem in that domain
  - Example: K-means clustering
Characteristics of the input data are important
- Type of proximity or density measure
  - This is a derived measure, but central to clustering
- Sparseness
  - Dictates types of similarity
  - Adds to efficiency
- Attribute type
  - Dictates type of similarity
- Type of data
  - Dictates type of similarity
- Other characteristics, e.g., autocorrelation
  - Dictates type of similarity
- Dimensionality
- Noise and outliers
- Type of distribution
Clustering algorithms
- K-means and its variants
- hierarchical clustering
- density-based clustering
K-means clustering
- Partitional clustering approach
- Number of clusters, K, must be specified
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The basic algorithm is very simple (see the sketch below)
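A minimal runnable sketch of this setup, assuming scikit-learn is available; the data, K = 3, and random seeds below are illustrative choices, not part of the original notes:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three blobs around hand-picked centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# K must be specified up front; each cluster keeps a centroid.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # final centroids
print(km.labels_[:10])      # closest-centroid assignment of the first 10 points
```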
K-means Clustering – Details
- Initial centroids are often chosen randomly.
- Clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for common similarity measures mentioned above.
- Most of the convergence happens in the first few iterations.
- Often the stopping condition is changed to ‘Until relatively few points change clusters’
- Complexity is $O(n \times K \times I \times d)$
  - n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Evaluating K-means Clusters
- Most common measure is Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest cluster
  - To get SSE, we square these errors and sum them (see the sketch below):
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
  - $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$
    - Can show that $m_i$ corresponds to the center (mean) of the cluster
- Given two sets of clusters, we prefer the one with the smallest error
- One easy way to reduce SSE is to increase K, the number of clusters
  - A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
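A small sketch of the definition above (the function name is mine; scikit-learn exposes the same quantity on a fitted KMeans model as the `inertia_` attribute):

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE: sum over clusters of squared distances to the cluster centroid."""
    return sum(np.sum((X[labels == i] - m) ** 2)
               for i, m in enumerate(centroids))
```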
K-means as an Optimization Problem
- Objective: minimize the Sum of Squared Error (SSE)
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
- First we fix the centers; if the SSE is not optimal, each point $x_j$ is reassigned to its nearest center:
$$c_j = \arg\min_{i \in \{1,2,\dots,K\}} \mathrm{dist}(m_i, x_j)$$
- Then we fix the cluster assignment and derive the new center of each cluster as its mean:
$$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
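A numpy sketch of one round of this alternating optimization (it assumes no cluster becomes empty; handling that case is discussed further below):

```python
import numpy as np

def kmeans_step(X, m):
    """One round of the alternating optimization: assignment, then update."""
    # Fix the centers: assign each point x_j to its nearest center,
    #   c_j = argmin_i dist(m_i, x_j)
    c = np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
    # Fix the assignment: recompute each center as its cluster mean,
    #   m_i = (1/|C_i|) * sum of x in C_i   (assumes every cluster is non-empty)
    m_new = np.vstack([X[c == i].mean(axis=0) for i in range(len(m))])
    return c, m_new
```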
Limitation of K-means
- K-means has problems when clusters differ in:
  - Sizes
  - Densities
  - Non-globular shapes
- K-means has problems when the data contains outliers.
Overcoming K-means limitations
- One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together afterwards (one way to do this is sketched below).
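One hedged way to realize this idea in code: over-cluster with a large K, then stitch the pieces together by clustering the centroids. The helper name and parameter values are my own illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def overcluster_then_merge(X, k_pieces=20, k_final=3):
    """Find many small cluster pieces, then put the parts together."""
    km = KMeans(n_clusters=k_pieces, n_init=10, random_state=0).fit(X)
    merged = AgglomerativeClustering(n_clusters=k_final).fit(km.cluster_centers_)
    return merged.labels_[km.labels_]  # map piece labels to merged labels
```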
Solutions to initial centroids problem
- Multiple runs
  - Helps, but probability is not on your side
- Sample and use hierarchical clustering to determine initial centroids
- Select more than K initial centroids and then select among these initial centroids
  - Select the most widely separated
- Postprocessing
- Bisecting K-means
  - Not as susceptible to initialization issues
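Two of these solutions map directly onto scikit-learn's KMeans parameters; a sketch with illustrative values:

```python
from sklearn.cluster import KMeans

# Multiple runs: n_init random restarts; the run with the lowest SSE wins.
km_runs = KMeans(n_clusters=3, init="random", n_init=20, random_state=0)

# k-means++ seeding spreads the initial centroids far apart, one way of
# selecting widely separated candidates.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
```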
Updating Centers Incrementally
- In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
- An alternative is to update the centroids after each assignment (incremental approach)
  - Each assignment updates zero or two centroids
  - More expensive
  - Introduces an order dependency
  - Never get an empty cluster
  - Can use "weights" to change the impact
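A sketch of one incremental reassignment using the running-mean identity; moving a point touches exactly the old and new centroids (zero if it stays put). Function and variable names are mine:

```python
import numpy as np

def reassign(x, old, centroids, counts):
    """Move point x from cluster `old` to its nearest cluster, updating
    zero or two centroids via the running mean m +/- (x - m) / n."""
    new = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    if new == old:
        return old                                        # zero centroids updated
    # A singleton never moves (its centroid coincides with it), so
    # counts stay positive and no cluster ever becomes empty.
    counts[old] -= 1
    centroids[old] -= (x - centroids[old]) / counts[old]  # downdate old mean
    counts[new] += 1
    centroids[new] += (x - centroids[new]) / counts[new]  # update new mean
    return new
```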
Bisecting K-means
- Bisecting K-means algorithm
  - Variant of K-means that can produce a partitional or a hierarchical clustering
Handling empty clusters
- The basic K-means algorithm can yield empty clusters (a centroid to which no points are assigned).
- Several replacement strategies exist, e.g., choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE.
- The incremental approach above never produces an empty cluster.
Bisecting k-means
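The notes leave the algorithm itself to the figures; a minimal sketch under the usual formulation (repeatedly bisect the cluster with the highest SSE) might look like this. scikit-learn ≥ 1.1 also ships a ready-made `BisectingKMeans` class.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Start with one all-inclusive cluster; repeatedly split the cluster
    with the highest SSE using 2-means until k clusters remain.
    Recording the sequence of splits would yield the hierarchy."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        worst = max(range(len(clusters)),  # cluster with largest SSE
                    key=lambda i: ((X[clusters[i]]
                                    - X[clusters[i]].mean(axis=0)) ** 2).sum())
        idx = clusters.pop(worst)
        half = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[idx])
        clusters += [idx[half == 0], idx[half == 1]]
    labels = np.empty(len(X), dtype=int)
    for lab, idx in enumerate(clusters):
        labels[idx] = lab
    return labels
```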
Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters: any desired number can be obtained by "cutting" the dendrogram at the proper level.
- They may correspond to meaningful taxonomies (e.g., in the biological sciences).
Hierarchical Clustering
- Two main types of hierarchical clustering:
  - Agglomerative: start with the points as individual clusters and merge the closest pair of clusters at each step
  - Divisive: start with one all-inclusive cluster and split a cluster at each step (see the MST-based method below)
Agglomerative clustering algorithm
- More popular hierarchical clustering technique
- Basic algorithm is straightforward: compute the proximity matrix, let each data point be a cluster, then repeatedly merge the two closest clusters and update the proximity matrix until a single cluster (or the desired number) remains (see the sketch below)
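A naive sketch of that basic algorithm using single-link (MIN) proximity; it is roughly cubic and only for illustration:

```python
import numpy as np

def agglomerative_min(X, k):
    """Merge the closest pair of clusters (single link) until k remain."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # proximity matrix
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        # Single link: cluster distance = smallest point-to-point distance.
        _, i, j = min((d[np.ix_(a, b)].min(), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        clusters[i] += clusters.pop(j)   # merge; i < j keeps index i valid
    return clusters
```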
How to Define Inter-Cluster Similarity
Cluster Similarity: MIN or Single Link
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
Strength of MIN
- Can handle non-elliptical shapes
Cluster Similarity: MAX or Complete Linkage
- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
Cluster Similarity: Group Average
- Proximity of two clusters is the average of pairwise proximity between points in the two clusters (the sketch below covers all three linkages)
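These three definitions correspond directly to the `method` argument of SciPy's `linkage`; a short sketch with illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))            # illustrative data

Z_min = linkage(X, method="single")     # MIN / single link
Z_max = linkage(X, method="complete")   # MAX / complete linkage
Z_avg = linkage(X, method="average")    # group average

# Cut the single-link hierarchy into (at most) 3 flat clusters.
labels = fcluster(Z_min, t=3, criterion="maxclust")
```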
Hierarchical Clustering: Time and Space requirements
- O(N²) space, since it uses the proximity matrix (N is the number of points)
- O(N³) time in many cases: there are N steps, and at each step the proximity matrix must be searched and updated; this can be reduced to O(N² log N) for some approaches
Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone
- No global objective function is directly minimized
- Different schemes have problems with one or more of: sensitivity to noise and outliers, clusters of different sizes, and breaking large clusters
MST: Divisive Hierarchical Clustering
- Build MST (Minimum Spanning Tree) over the pairwise distances
- Use the MST to construct a hierarchy of clusters by repeatedly breaking the longest remaining edge (see the sketch below)
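A sketch of that divisive procedure with SciPy's MST routines, assuming a complete Euclidean distance graph; breaking the k−1 longest MST edges leaves k connected components:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

def mst_divisive(X, k):
    """Build the MST of the distance graph, then break its k-1 longest
    edges; the resulting connected components are the clusters."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    mst = minimum_spanning_tree(d).toarray()
    for _ in range(k - 1):                       # remove the longest edges
        mst[np.unravel_index(np.argmax(mst), mst.shape)] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels
```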
DBSCAN
- DBSCAN is a density-based algorithm
- Density = number of points within a specified radius (Eps)
- A point is a core point if it has more than a specified number of points (MinPts) within Eps
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is not a core point or a border point (the sketch below recovers all three kinds).
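A scikit-learn sketch that labels all three point types; the data and the `eps`/`min_samples` values here are illustrative, not recommended settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # illustrative data

db = DBSCAN(eps=0.3, min_samples=5).fit(X)       # Eps and MinPts
labels = db.labels_                              # -1 marks noise points

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True             # core points
border = (labels != -1) & ~core                  # clustered but not core
noise = labels == -1
```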
When DBSCAN Works Well
- Resistant to Noise
- Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
- Varying densities
- High-dimensional data
DBSCAN: Determining Eps and MinPts
- Idea is that for points in a cluster, their $k^{th}$ nearest neighbors are at roughly the same distance
- Noise points have their $k^{th}$ nearest neighbor at a farther distance
- So, plot the sorted distance of every point to its $k^{th}$ nearest neighbor and look for a sharp increase (sketched below)
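A sketch of that k-distance heuristic using scikit-learn's NearestNeighbors; k = 4 is just a common illustrative default, and the height of the knee in the sorted curve suggests a candidate Eps:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Plot every point's distance to its k-th nearest neighbor, sorted.
    A sharp knee separates cluster points from noise; its height is a
    candidate Eps (with MinPts = k)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: self is nearest
    dist, _ = nn.kneighbors(X)
    plt.plot(np.sort(dist[:, k]))
    plt.xlabel("points sorted by k-th nearest neighbor distance")
    plt.ylabel(f"{k}-th nearest neighbor distance")
    plt.show()
```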