Clustering ----- birds of a feather flock together.
1. Finding groups of objects
Objects in the same group are similar to each other
Objects in different groups are dissimilar
2. Unsupervised learning
No labels
Data driven
3. Requirements: find clusters of arbitrary shape; be robust to noise and outliers
4. Common algorithms: K-means, K-medoids, DBSCAN, EM (Expectation Maximization)
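Clustering is learning by observation rather than learning by examples.
Clustering can serve as a stand-alone tool for seeing how the data are distributed, examining the characteristics of each cluster, and focusing further analysis on particular sets of clusters.
Cluster analysis can also serve as a preprocessing step for other data mining tasks (such as classification and association rule mining).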
Methods of cluster analysis
Partitioning methods:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
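
The partition-and-evaluate idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes Lloyd-style k-means with random initial centers drawn from the data, and the names k_means, sum_of_squared_errors and the sample points are made up for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sum_of_squared_errors(points, labels, centers):
    """Criterion being minimized: total squared distance to the assigned center."""
    return sum(squared_distance(p, centers[k]) for p, k in zip(points, labels))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd-style k-means (assumed variant): alternate assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # partition unchanged, so we have converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
sse = sum_of_squared_errors(points, labels, centers)
print(labels, round(sse, 4))
```

On data this cleanly separated the result does not depend on the seed; in general k-means only reaches a local minimum of the criterion, so several restarts are commonly used.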
Hierarchical methods:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
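
A bottom-up (agglomerative) decomposition can be sketched as follows. The sketch assumes single linkage (cluster distance is the distance between the closest pair of members) and a naive search over all cluster pairs; the function names and the sample points are illustrative only, and production methods such as BIRCH or CHAMELEON work quite differently.

```python
import math


def single_link_distance(cluster_a, cluster_b):
    """Single linkage: distance between the closest pair of points."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []  # history of (cluster_a, cluster_b, distance), i.e. the hierarchy
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, target_clusters=3)
print(clusters)
```

The merges list records the order and distance of each merge, which is the information a dendrogram displays. A divisive method such as DIANA runs in the opposite direction, starting from one cluster and splitting.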
Density-based methods:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
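
To make "connectivity and density" concrete, here is a compact sketch in the style of DBSCAN. It assumes Euclidean distance and a brute-force neighborhood search, and it counts a point as its own neighbor. The parameter names eps and min_pts follow common usage, while the helper names and the data are made up for the example.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Grow clusters outward from core points (points with >= min_pts neighbors)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical toy data: two dense squares and one outlier.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 10.0),
]
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels)
```

Because clusters grow through chains of dense neighborhoods, this family of methods can recover clusters of arbitrary shape and marks isolated points as noise instead of forcing them into a cluster.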
Grid-based methods:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based methods:
A model is hypothesized for each cluster, and the method tries to find the best fit of the data to that model
Typical methods: EM, SOM, COBWEB
Frequent-pattern-based methods:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
Constraint-based methods:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Link-based methods:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Properties a distance must satisfy:
Non-negativity: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
Symmetry: d(i, j) = d(j, i)
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
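
These three properties can be spot-checked numerically on a sample of points. The sketch below is only an empirical check on the given sample, not a proof; the helper check_metric_properties, the tolerance and the sample points are assumptions made for the illustration, and the sample is assumed to contain distinct points. Squared Euclidean distance is included as a familiar dissimilarity that fails the triangle inequality.

```python
import itertools
import math


def check_metric_properties(points, d, tol=1e-12):
    """Test the three distance properties on every pair/triple in the sample."""
    non_negative = all(
        (d(p, q) > 0) if p != q else (abs(d(p, q)) <= tol)
        for p in points
        for q in points
    )
    symmetric = all(
        abs(d(p, q) - d(q, p)) <= tol
        for p, q in itertools.combinations(points, 2)
    )
    triangle = all(
        d(p, q) <= d(p, r) + d(r, q) + tol
        for p, q, r in itertools.product(points, repeat=3)
    )
    return {
        "non_negativity": non_negative,
        "symmetry": symmetric,
        "triangle_inequality": triangle,
    }


def euclidean(p, q):
    return math.dist(p, q)


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical sample of distinct points; three of them are collinear.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(check_metric_properties(sample, euclidean))
print(check_metric_properties(sample, squared_euclidean))
```

For the collinear points (0, 0), (1, 0), (2, 0), squared Euclidean distance gives d = 4 for the outer pair but 1 + 1 = 2 through the middle point, so the triangle inequality fails even though the other two properties hold.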
Minkowski distance: 计