Cluster Analysis in Data Mining

Notes on Cluster Analysis in Data Mining, a Coursera course taught by Professor Jiawei Han, a leading data mining researcher.

Week 1

-Considerations for Cluster Analysis

  • partitioning criteria (single level vs. hierarchical partitioning)
  • separation of clusters (exclusive vs. non-exclusive [e.g., one document may belong to more than one class])
  • similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
  • clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])

Four issues:
-Quality

  • deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
  • clusters with arbitrary shape
  • deal with noisy data

-Scalability

  • clustering all the data instead of only on samples
  • high dimensionality
  • incremental or stream clustering and insensitivity to input order

-Constraint-based clustering

  • user-given preferences or constraints

-Interpretability and usability

Cluster Analysis Categorization:
-Technique-centered

  • distance-based
  • density-based and grid-based methods
  • probabilistic and generative models
  • leveraging dimensionality reduction methods
  • high-dimensional clustering
  • scalable techniques for cluster analysis

-Data type-centered

  • clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.

-Additional insight-centered

  • visual insights, semi-supervised, ensemble-based, validation-based.

Typical Clustering Methods:
-Distance-based

  • partitioning algorithms: k-means, k-medians, k-medoids
  • hierarchical algorithms: agglomerative vs. divisive methods
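
To make the partitioning idea concrete, here is a minimal sketch of k-means (Lloyd's iteration) in plain Python. It is not taken from the course; the function name `kmeans`, the random initialization, and the parameters `iterations` and `seed` are assumptions made for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's iteration on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the iteration has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters. k-medians and k-medoids follow the same loop but replace the mean in the update step with a median or with an actual data point.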

-Density-based and grid-based

  • density-based: the data space is explored at a high level of granularity, then post-processing puts dense regions together into clusters of arbitrary shape.
  • grid-based: individual regions are formed into a grid-like structure
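
The two ideas above can be combined in a toy sketch: quantize 2-D points into grid cells, keep the cells that are dense enough, and merge adjacent dense cells into clusters. This is not any specific published algorithm; `grid_clusters`, `cell_size`, and `min_points` are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque

def grid_clusters(points, cell_size, min_points):
    """Group 2-D points by merging adjacent dense grid cells (illustrative sketch)."""
    # Quantize each point into an integer grid cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # A cell is "dense" if it holds at least min_points points.
    dense = {c for c, members in cells.items() if len(members) >= min_points}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over the 8 neighboring cells that are also dense.
        seen.add(start)
        queue, cluster = deque([start]), []
        while queue:
            cx, cy = queue.popleft()
            cluster.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters

data = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.4, 0.5),
        (5.2, 5.1), (5.5, 5.5), (9.0, 0.5)]
clusters = grid_clusters(data, cell_size=1.0, min_points=2)
```

Points in sparse cells, such as `(9.0, 0.5)` here, are left out of every cluster and can be treated as noise.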

-Probabilistic and generative models

  • Assume a specific form of the generative model (e.g., a mixture of Gaussians)
  • Model parameters are estimated with the EM (expectation-maximization) algorithm.
  • Then estimate the generative probability of the underlying data points.
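
The steps above can be sketched for the simplest case, a one-dimensional Gaussian mixture fitted with EM. This is a minimal illustration, not the course's implementation; the function names, the caller-supplied initial parameters, and the variance floor are assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; initial parameters are supplied by the caller."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, 1e-6)  # assumed floor to avoid a collapsed component
    return means, variances, weights

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

With the fitted parameters, the generative probability of a point `x` is the weighted sum of `gaussian_pdf(x, means[c], variances[c])` over the components, and a point can be assigned to the component with the largest responsibility.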

-High-dimensional clustering

  • subspace clustering (bottom-up, top-down, correlation-based method vs. δ-cluster method)
  • dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)

Lecture 2:
Good clustering:

  • High intra-class similarity (Cohesive)
  • Low inter-class similarity (Distinctive between clusters)

proximity: similarity or dissimilarity

-Dissimilarity Matrix

  • triangular matrix (the full matrix is symmetric, so only one triangle needs to be stored)
  • distance functions are usually different for different types of data
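
As a small illustration, the sketch below builds the lower triangle of a dissimilarity matrix. The distance function is passed in as a parameter because it depends on the data type; `dissimilarity_matrix` and `euclidean` are names chosen here for illustration, and Euclidean distance is only one possible choice for numeric data.

```python
def dissimilarity_matrix(objects, distance):
    """Return the lower-triangular dissimilarity matrix as a list of rows."""
    # Row i holds d(i, 0), ..., d(i, i); the upper triangle is omitted by symmetry.
    return [[distance(objects[i], objects[j]) for j in range(i + 1)]
            for i in range(len(objects))]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

matrix = dissimilarity_matrix([(0, 0), (3, 4), (6, 8)], euclidean)
# matrix == [[0.0], [5.0, 0.0], [10.0, 5.0, 0.0]]
```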

-Distance on numeric data: Minkowski Distance

  • A popular distance measure:
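    $d(i,j) = \sqrt[p]{|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{il} - x_{jl}|^p}$
    where $i = (x_{i1}, x_{i2}, \ldots, x_{il})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jl})$ are two $l$-dimensional data objects and $p$ is the order (this distance is also commonly called the $L_p$ norm).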

  • Property:
    positivity; symmetry; triangle inequality.

  • p=1 : Manhattan (or city block) distance

  • p=2 : Euclidean distance
  • p
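
The Minkowski distance defined above can be sketched in a few lines of Python. The function name `minkowski` and the input checks are illustrative assumptions; the formula is the one given in these notes, restricted to finite p ≥ 1.

```python
def minkowski(i, j, p):
    """L_p distance between two equal-length numeric sequences, for finite p >= 1."""
    if len(i) != len(j):
        raise ValueError("objects must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be at least 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1.0 / p)

x1, x2 = (1, 2), (4, 6)
d1 = minkowski(x1, x2, 1)  # Manhattan: |1-4| + |2-6| = 7
d2 = minkowski(x1, x2, 2)  # Euclidean: sqrt(9 + 16) = 5
```

The same call with the arguments swapped returns the same value, which matches the symmetry property listed above.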