Data Mining: A Complete Guide to Clustering Algorithms (k-means, DBSCAN, and Hierarchical Clustering)

Applications of cluster analysis
  • understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks
  • summarization: reduce the size of large data sets
What is not cluster analysis?
  • Supervised classification
    • Have class label information
  • Simple segmentation
    • Dividing students into different registration groups alphabetically, by last name
  • Results of a query
    • Groupings are a result of an external specification
    • Clustering is a grouping of objects based on the data
  • Graph partitioning
  • Association analysis
    • Local vs. global connections
Notion of a cluster can be ambiguous
Types of clusterings
  • A clustering is a set of clusters
    • Important distinction between hierarchical and partitional sets of clusters
  • Partitional clustering
    • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  • Hierarchical clustering
    • A set of nested clusters organized as a hierarchical tree
Other distinctions between sets of clusters
  • Exclusive versus non-exclusive
    • In non-exclusive clusterings, points may belong to multiple clusters
    • Can represent multiple classes or ‘border’ points
  • Fuzzy versus non-fuzzy
    • In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
    • Weights must sum to 1
    • Probabilistic clustering has similar characteristics
  • Partial versus complete
    • In some cases, we only want to cluster some of the data
  • Heterogeneous versus homogeneous
    • Clusters of widely different sizes, shapes, and densities
Types of clusters
  • well-separated clusters
    • a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
  • center-based clusters
    • a cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster than to the center of any other cluster
    • the center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
  • contiguous clusters (nearest neighbor or transitive)
    • a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
  • density-based clusters
    • a cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.
    • used when the clusters are irregular or intertwined, and when noise and outliers are present.
  • shared property or conceptual clusters
    • clusters that share some common property or represent a particular concept
  • clusters defined by an objective function
    • finds clusters that minimize or maximize an objective function; K-means (which minimizes SSE) is the classic example
    • one approach is to map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the input data are important
  • type of proximity or density measure
    • this is a derived measure, but central to clustering
  • sparseness
    • dictates types of similarity
    • adds to efficiency
  • attribute type
    • dictates type of similarity
  • type of data
    • dictates type of similarity
      • other characteristics, e.g., autocorrelation
  • dimensionality
  • noise and outliers
  • type of distribution
Clustering algorithms
  • K-means and its variants
  • hierarchical clustering
  • density-based clustering
K-means clustering
  • Partitional clustering approach
  • Number of clusters, K, must be specified
  • Each cluster is associated with a centroid (center point)
  • Each point is assigned to the cluster with the closest centroid
  • The basic algorithm is very simple
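As an illustration, here is a minimal sketch of running K-means on toy data. The use of scikit-learn's KMeans and the toy blobs are assumptions made for demonstration; the article itself does not prescribe a library.

```python
# A minimal sketch of running K-means, here via scikit-learn's KMeans
# (the library choice is an assumption; any implementation of the basic
# algorithm described above would do).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs in 2D
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# K must be specified up front; each cluster is represented by a centroid.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # each point is assigned to the closest centroid
```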

K-means Clustering – Details
  • Initial centroids are often chosen randomly.
    • Clusters produced vary from one run to another.
  • The centroid is (typically) the mean of the points in the cluster.
  • Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
  • K-means will converge for common similarity measures mentioned above.
  • Most of the convergence happens in the first few iterations.
    • Often the stopping condition is changed to ‘Until relatively few points change clusters’
  • Complexity is $O(n \times K \times I \times d)$
    • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Evaluating K-means Clusters
  • Most common measure is Sum of Squared Error (SSE)

    • For each point, the error is the distance to the nearest cluster
    • To get SSE, we square these errors and sum them:
      $$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
    • $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$
      • Can show that $m_i$ corresponds to the center (mean) of the cluster
    • Given two sets of clusters, we prefer the one with the smallest error
    • One easy way to reduce SSE is to increase K, the number of clusters
      • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
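A quick numeric check of this definition is shown below; scikit-learn is used only for convenience (an assumed library choice), and its `inertia_` attribute is exactly this SSE.

```python
# Sketch: compute the SSE of a K-means result by hand and compare it with
# scikit-learn's `inertia_`, which stores the same quantity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# SSE: for every point, squared distance to its assigned centroid, summed up
sse = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)

print(sse, km.inertia_)  # the two values agree up to floating-point error
```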
K-means as an Optimization Problem
  • Objective: Minimize the Sum of Squared Error (SSE)
    $$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
  • K-means optimizes this objective by alternating between two steps.
    First, with the centers fixed, each point $x_j$ is assigned to the cluster whose center is closest:
    $$c_j = \arg\min_{i \in \{1, 2, \dots, K\}} dist(m_i, x_j)$$
    Then, with the cluster assignments fixed, each center is recomputed as the mean of the points assigned to it:
    $$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

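The two alternating steps translate directly into code. Below is a from-scratch NumPy sketch of this alternation; it is illustrative only (fixed iteration count, random initialization, no empty-cluster handling).

```python
# From-scratch sketch of the assignment step and the center-update step.
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers by picking K distinct points at random
    m = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Step 1: fix centers, assign each point to the closest center
        dist2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        c = dist2.argmin(axis=1)
        # Step 2: fix assignments, move each center to the mean of its points
        for i in range(K):
            if np.any(c == i):
                m[i] = X[c == i].mean(axis=0)
    return c, m

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(3, 0.3, (60, 2))])
labels, centers = kmeans(X, K=2)
print(centers)
```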
Limitation of K-means
  • K-means has problems when clusters are different in:
    • Sizes
    • Densities
    • Non-globular shapes
  • K-means has problems when the data contains outliers.

Overcoming K-means limitations

  • One solution is to use many clusters.
    • This finds parts of the natural clusters, but the parts then need to be put back together.

Solutions to initial centroids problem
  • multiple runs
    • helps, but probability is not on your side
  • sample and use hierarchical clustering to determine initial centroids
  • select more than K initial centroids and then select among these initial centroids
    • select the most widely separated
  • postprocessing
  • bisecting K-means
    • not as susceptible to initialization issues
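To make the "multiple runs" idea concrete, the sketch below reruns K-means from several random initializations and keeps the lowest-SSE result. scikit-learn's `n_init` parameter (an assumed convenience, not something the article mandates) does exactly this.

```python
# Sketch of the "multiple runs" remedy for bad initial centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in ((0, 0), (4, 0), (2, 3))])

single = KMeans(n_clusters=3, init="random", n_init=1, random_state=7).fit(X)
multi = KMeans(n_clusters=3, init="random", n_init=20, random_state=7).fit(X)

# Keeping the best of 20 runs is typically at least as good as a single run
print(single.inertia_, multi.inertia_)
```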
Updating Centers Incrementally
  • In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
  • An alternative is to update the centroids after each assignment (incremental approach)
    • Each assignment updates zero or two centroids
    • More expensive
    • Introduces an order dependency
    • Never get an empty cluster
    • Can use “weights” to change the impact

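A minimal from-scratch sketch of the incremental idea follows, using a running-mean update after every single assignment. It performs one pass only, for illustration; the function name and initialization are my own assumptions.

```python
# Incremental (online) centroid updates: the winning centroid moves
# immediately after each point is assigned, via a running mean.
import numpy as np

def incremental_kmeans_pass(X, centers):
    """One pass over the data, updating a centroid after every assignment."""
    centers = centers.astype(float)           # work on a float copy
    counts = np.ones(len(centers))            # each seed centroid counts as one point
    labels = np.empty(len(X), dtype=int)
    for j, x in enumerate(X):
        i = int(np.argmin(((centers - x) ** 2).sum(axis=1)))  # closest centroid
        labels[j] = i
        counts[i] += 1
        # Running-mean update: only the winning centroid moves, by a shrinking step
        centers[i] += (x - centers[i]) / counts[i]
    return labels, centers

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels, centers = incremental_kmeans_pass(X, X[:3])
print(centers)
```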
Bisecting K-means
  • Bisecting K-means algorithm
    • Variant of K-means that can produce a partitional or a hierarchical clustering
Handling empty clusters
  • The basic K-means algorithm can yield empty clusters when no points end up assigned to a centroid.
  • The incremental approach above (updating the centroids after each assignment) never produces an empty cluster.
  • Another common remedy is to replace an empty centroid with the point that contributes most to the SSE, or with a point from the cluster that has the highest SSE.
Bisecting k-means
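Below is a from-scratch sketch of bisecting K-means. The rule for picking which cluster to split next (here, the one with the largest SSE) is a common choice and an assumption on my part; scikit-learn's KMeans is used only for the 2-way splits.

```python
# Bisecting K-means: repeatedly split one existing cluster in two until K clusters exist.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, seed=0):
    labels = np.zeros(len(X), dtype=int)          # start: one big cluster
    while labels.max() + 1 < K:
        # Pick the cluster with the largest SSE around its own mean
        sses = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in range(labels.max() + 1)]
        worst = int(np.argmax(sses))
        idx = np.where(labels == worst)[0]
        # Split it in two with ordinary K-means
        split = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[split == 1]] = labels.max() + 1  # one half keeps `worst`, the other gets a new id
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ((0, 0), (3, 0), (0, 3), (3, 3))])
print(np.bincount(bisecting_kmeans(X, K=4)))
```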
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of clusters
    • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
    • e.g., taxonomies in the biological sciences
Hierarchical Clustering
  • Two main types of hierarchical clustering
    • Agglomerative: start with the points as individual clusters and, at each step, merge the closest pair of clusters until only one cluster (or K clusters) remains
    • Divisive: start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point (or there are K clusters)
Agglomerative clustering algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
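As a concrete illustration of the basic algorithm (compute the proximity matrix, then repeatedly merge the two closest clusters), here is a sketch using SciPy's hierarchical clustering routines; the library and the toy data are assumptions.

```python
# Agglomerative clustering with SciPy: `linkage` performs the successive merges,
# `fcluster` cuts the resulting tree into a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in ((0, 0), (3, 0), (1.5, 3))])

Z = linkage(X, method="single")                  # merge history: (n-1) rows of [i, j, dist, size]
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(np.bincount(labels))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the nested tree of merges (needs matplotlib)
```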
How to Define Inter-Cluster Similarity

Cluster Similarity: MIN or Single Link

  • Similarity of two clusters is based on the two most similar (closest) points in the different clusters

Strength of MIN

Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

Cluster Similarity: Group Average
  • Proximity of two clusters is the average of pairwise proximity between points in the two clusters

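These three definitions map directly onto SciPy's `method` argument; the short sketch below (library choice assumed) runs single, complete, and group-average linkage on the same toy data.

```python
# Compare MIN ("single"), MAX ("complete") and group average ("average") linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ((0, 0), (4, 0))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes under each linkage rule
```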
Hierarchical Clustering: Time and Space requirements
Hierarchical Clustering: Problems and Limitations
MST: Divisive Hierarchical Clustering
  • Build MST (Minimum Spanning Tree)
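The sketch below assumes the usual MST-based divisive procedure (the article only names the MST step): build the minimum spanning tree of the complete distance graph, then repeatedly break the largest remaining edge, so the connected components become the clusters.

```python
# MST-based divisive clustering sketch using SciPy's graph routines.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (25, 2)) for c in ((0, 0), (3, 3))])

D = squareform(pdist(X))                      # full pairwise distance matrix
mst = minimum_spanning_tree(D).toarray()      # dense copy so edges can be zeroed out

K = 2
for _ in range(K - 1):                        # each removed edge adds one component
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0                             # cut the longest remaining MST edge
n_comp, labels = connected_components(mst, directed=False)
print(n_comp, np.bincount(labels))
```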
DBSCAN
  • DBSCAN is a density-based algorithm
    • Density = number of points within a specified radius (Eps)
    • A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
    • A noise point is any point that is not a core point or a border point.
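A minimal DBSCAN run with scikit-learn is sketched below (the library and parameter values are assumptions for illustration). `eps` is the radius Eps, `min_samples` plays the role of MinPts, and points labelled -1 are the noise points described above.

```python
# DBSCAN on two dense blobs plus uniform background noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.2, (80, 2)),
    rng.normal((3, 3), 0.2, (80, 2)),
    rng.uniform(-1, 4, (20, 2)),
])

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
# Indices of the core points are available via db.core_sample_indices_
```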
When DBSCAN Works Well
  • Resistant to Noise
  • Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
  • Varying densities
  • High-dimensional data
DBSCAN: Determining Eps and MinPts
  • Idea is that for points in a cluster, their $k^{th}$ nearest neighbors are at roughly the same distance
  • Noise points have their $k^{th}$ nearest neighbor at a farther distance
  • So, plot the sorted distance of every point to its $k^{th}$ nearest neighbor and look for a sharp increase (a “knee”) to choose Eps
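A short sketch of this k-distance heuristic follows; the choice k = MinPts = 5 and the use of scikit-learn's NearestNeighbors are assumptions.

```python
# k-distance heuristic: sorted distance of every point to its k-th nearest neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.2, (80, 2)),
               rng.normal((3, 3), 0.2, (80, 2)),
               rng.uniform(-1, 4, (20, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 0-th neighbor
dist, _ = nn.kneighbors(X)
k_dist = np.sort(dist[:, k])                      # distance to the k-th nearest neighbor, sorted

# Cluster points sit on the flat part of this curve; noise points form the rising tail.
print(np.round(k_dist[::20], 3))                  # coarse look at the curve; a plot works better
```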
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measures of Cluster Validity Via Correlation
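The heading above usually refers to correlating the proximity matrix with the cluster incidence matrix; the sketch below assumes that standard construction (library choices are also assumptions).

```python
# Validity via correlation: correlate pairwise distances with the cluster
# "incidence" matrix (1 if two points share a cluster, else 0). A strongly
# negative correlation means points in the same cluster tend to be close.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ((0, 0), (3, 3))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

prox = squareform(pdist(X))                                   # pairwise distances
incidence = (labels[:, None] == labels[None, :]).astype(float)

# Correlate the upper triangles only (ignore the trivial diagonal)
iu = np.triu_indices_from(prox, k=1)
corr = np.corrcoef(prox[iu], incidence[iu])[0, 1]
print(corr)   # distances and "same cluster" should be negatively correlated
```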