Notes on Cluster Analysis in Data Mining, a Coursera course taught by the data-mining authority Professor Jiawei Han.
Week 1
-Considerations for Cluster Analysis
- partitioning criteria (single level vs. hierarchical partitioning)
- separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
- similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
- clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])
Four issues:
-Quality
- deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
- clusters with arbitrary shape
- deal with noisy data
-Scalability
- clustering all the data instead of only samples
- high dimensionality
- incremental or stream clustering and insensitivity to input order
-Constraint-based clustering
- user-given preferences or constraints
-Interpretability and usability
Cluster Analysis Categorization:
-Technique-centered
- distance-based
- density-based and grid-based methods
- probabilistic and generative models
- leveraging dimensionality reduction methods
- high-dimensional clustering
- scalable techniques for cluster analysis
-Data type-centered
- clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.
-Additional insight-centered
- visual insights, semi-supervised, ensemble-based, validation-based.
Typical Clustering Methods:
-Distance-based
- partitioning algo.: k-means, k-medians, k-medoids
- hierarchical algo.: agglomerative vs. divisive method
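The partitioning idea (k-means) can be sketched as a short loop: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. This is a toy Python sketch, not the course's reference implementation; the 2-D sample points and the `kmeans` helper are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign each point to the nearest
    centroid, then recompute centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        for c, pts in enumerate(clusters):
            if pts:  # keep the old centroid if a cluster goes empty
                centroids[c] = (sum(x for x, _ in pts) / len(pts),
                                sum(y for _, y in pts) / len(pts))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents, _ = kmeans(pts, k=2)
```

k-medians and k-medoids follow the same loop but replace the mean with the per-coordinate median or the best actual data point, which makes them more robust to outliers.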
-Density-based and grid-based
- density-based: explore the data space at a high level of granularity, then post-process to merge dense regions into clusters of arbitrary shape.
- grid-based: individual regions are formed into a grid-like structure
-Probabilistic and generative models
- Assume a specific form of the generative model (e.g., a mixture of Gaussians)
- Model parameters are estimated with EM algo.
- Then estimate the generative probability of the underlying data points.
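The three steps above can be sketched for a two-component 1-D Gaussian mixture: the E-step computes each point's posterior responsibility under the current parameters, and the M-step re-estimates weights, means, and variances from those responsibilities. A toy Python sketch under simplifying assumptions (fixed component count, median-split initialization); the helper name `em_gmm_1d` is mine.

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    xs = sorted(xs)
    half = len(xs) // 2
    # crude initialization: split the sorted data in half
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point
        r = []
        for x in xs:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: responsibility-weighted re-estimates
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            w[k] = nk / len(xs)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, xs)) / nk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var

data = [0.1, -0.2, 0.0, 0.3, 9.8, 10.1, 10.0, 9.9]
w, mu, var = em_gmm_1d(data)
```

After fitting, the same `pdf` terms give the generative probability of any point under the model, which is how new points are scored or assigned.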
-High-dimensional clustering
- subspace clustering (bottom-up, top-down, correlation-based method vs. δ -cluster method)
- dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)
Lecture 2:
Good clustering:
- High intra-class similarity (Cohesive)
- Low inter-class similarity (Distinctive between clusters)
proximity: similarity or dissimilarity
-Dissimilarity Matrix
- a triangular matrix (the full matrix is symmetric, so only one triangle needs to be stored)
- distance functions are usually different for different types of data
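A minimal Python sketch of building such a lower-triangular dissimilarity matrix; the Euclidean `dist` and the sample points are illustrative choices, not from the lecture.

```python
def dissimilarity_matrix(points, dist):
    """Lower-triangular dissimilarity matrix: row i holds
    dist(points[i], points[j]) for all j < i; symmetry makes
    the upper half redundant."""
    return [[dist(points[i], points[j]) for j in range(i)]
            for i in range(len(points))]

# Euclidean distance as one possible dissimilarity function
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

m = dissimilarity_matrix([(0, 0), (3, 4), (6, 8)], euclid)
```

Swapping in a different `dist` is how the same structure serves different data types.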
-Distance on numeric data: Minkowski Distance
A popular distance measure:
d(i,j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + ⋯ + |xil − xjl|^p)^(1/p),
where i = (xi1, xi2, …, xil) and j = (xj1, xj2, …, xjl) are l-dimensional data points and p is the order (this distance is also commonly called the L-p norm). Properties:
positivity; symmetry; triangle inequality.
- p=1 : Manhattan (or city block) distance
- p=2 : Euclidean distance
- p→∞ : supremum distance (L-max norm, also known as Chebyshev distance)
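The formula can be checked numerically for the special cases p = 1, p = 2, and the p → ∞ limit (which reduces to the maximum per-coordinate difference). A small Python sketch; the vectors are made up for illustration.

```python
def minkowski(x, y, p):
    """L-p (Minkowski) distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2, 3), (4, 6, 3)
d1 = minkowski(x, y, 1)    # Manhattan: 3 + 4 + 0 = 7
d2 = minkowski(x, y, 2)    # Euclidean: sqrt(9 + 16 + 0) = 5
dinf = max(abs(a - b) for a, b in zip(x, y))  # p → ∞ limit: 4
```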