[machine learning] clustering algorithms and PCA

Clustering Algorithms

HKmeans - hierarchical K-means algorithm
http://gecco.org.chemie.uni-frankfurt.de/hkmeans/H-k-means.pdf
- s1: with K=2, generate 2 big clusters over all the data with the k-means algorithm
- s2: for each of these clusters, perform s1 again, dividing each cluster into two clusters (divisive hierarchical clustering)
- s3: repeat recursively until the desired number of clusters is reached (see the sketch below)
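The recursive splitting above can be prototyped in a few lines. This is a minimal sketch, assuming NumPy and scikit-learn are available; bisecting_kmeans, depth, and the random test data are illustrative names, not taken from the paper.

```python
# Minimal sketch of hierarchical (bisecting) k-means: split every cluster
# into two with plain k-means, `depth` times, giving up to 2**depth clusters.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, depth):
    clusters = [np.arange(len(X))]            # start: one cluster holding all indices
    for _ in range(depth):
        next_level = []
        for members in clusters:
            if len(members) < 2:              # too small to split further
                next_level.append(members)
                continue
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
            next_level.append(members[labels == 0])   # s1/s2: each cluster -> two
            next_level.append(members[labels == 1])
        clusters = next_level
    return clusters

# usage: 2 levels of splitting -> 4 leaf clusters on random 2-D data
X = np.random.rand(200, 2)
print([len(c) for c in bisecting_kmeans(X, depth=2)])
```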


http://blog.sciencenet.cn/blog-517369-444270.html

  • Traditional clustering algorithms can be divided into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
    1. Partitioning: PAM / Partitioning Method - e.g., K-means
      • Create k partitions, where k is the number of partitions to build; then use an iterative relocation technique that improves partition quality by moving objects from one partition to another. Typical partitioning methods include K-means and PAM.
    2. Hierarchical: divided into top-down (divisive) and bottom-up (agglomerative); often combined with other clustering methods such as PAM?
      • bottom-up: agglomerative, from N clusters to K clusters
      • top-down: divisive hierarchical clustering, from 1 cluster to K clusters? (seldom used)

Hierarchical Clustering
  • algo :
  • Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

    1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
    2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
    3. Compute distances (similarities) between the new cluster and each of the old clusters.
      • Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
      • single-linkage: d[(k), (r,s)] = min{ d[(k),(r)], d[(k),(s)] }, where (r,s) is the merged cluster; complete-linkage uses the max and average-linkage uses the mean instead of the min (a code sketch follows this list)
    4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)

      • This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also a divisive hierarchical clustering which does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available, and rarely have been applied.
        • The main weaknesses of agglomerative clustering methods are:
        • they do not scale well: time complexity of at least O(n²), where n is the number of total objects;
        • they can never undo what was done previously.
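A minimal sketch of the agglomerative procedure above with the single-linkage update, assuming a precomputed N*N distance matrix and NumPy; single_linkage and the toy points are illustrative names.

```python
# Minimal sketch of agglomerative clustering (single linkage) from an N*N
# distance matrix, following steps 1-4 above.
import numpy as np

def single_linkage(dist, n_clusters):
    clusters = [[i] for i in range(len(dist))]       # step 1: one item per cluster
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)                      # never merge a cluster with itself
    while len(clusters) > n_clusters:
        r, s = np.unravel_index(np.argmin(d), d.shape)   # step 2: closest pair
        clusters[r] = clusters[r] + clusters[s]          # merge cluster s into cluster r
        # step 3 (single linkage): distance to the merged cluster is the minimum
        d[r, :] = np.minimum(d[r, :], d[s, :])
        d[:, r] = d[r, :]
        d[r, r] = np.inf
        d = np.delete(np.delete(d, s, axis=0), s, axis=1)  # drop row/column of s
        clusters.pop(s)
    return clusters

# usage: five 1-D points, merged down to 2 clusters
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(single_linkage(np.abs(pts - pts.T), n_clusters=2))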
K-means
  • input :
    • K :=number of clusters
    • Training set x(1), x(2), ..., x(m)
  • algo

    • s1: First randomly pick K points and call them the centers, say μ1, μ2, ..., μK; then compute the distance from every other point to each center. If, say, point x(n) is closest to μ3, we say x(n) belongs to cluster 3.
      All points are thus temporarily partitioned into K groups.
    • s2: Then compute the mean of the points in each cluster, move μ1, μ2, ..., μK to these means, and repeat s1 (a code sketch of both steps follows this list).
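The two alternating steps can be written directly. This is a minimal sketch, assuming NumPy; kmeans, n_iter, and the toy data are illustrative names.

```python
# Minimal sketch of the K-means loop: s1 assigns points to the nearest center,
# s2 moves each center to the mean of its assigned points.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # s1: distance from every point to every center, pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # s2: move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels

# usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, K=2)
print(centers)
```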
  • choosing the value of K:

    • choosing K by hand (based on the downstream purpose) is often best. Alternatively:
    • elbow method: run K-means for a range of K values and look at the cost function J to see where it stops decreasing quickly. As K becomes larger, J should become smaller, and there is usually an "elbow" point where the decrease flattens out. Note that if J does not decrease (e.g., due to a bad random initialization), that run should be repeated (see the sketch below).
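A minimal sketch of the elbow test, assuming scikit-learn; KMeans.inertia_ is the cost J described above, and the three-blob toy data is illustrative.

```python
# Minimal sketch of the elbow method: run K-means for several K and watch J.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(60, 2) + m for m in ([0, 0], [6, 0], [0, 6])])
for K in range(1, 8):
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    print(K, round(km.inertia_, 1))   # inertia_ = sum of squared distances to centers
# J drops sharply up to K = 3 (the true number of blobs) and then flattens;
# that bend is the "elbow". If J ever goes up, rerun with a different initialization.
```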
  • Algorithms for finding the cluster centers

  • K-Means has two major drawbacks, both related to initial values:

    • K is given in advance, and choosing this value is very hard to estimate. Often it is not known beforehand how many categories a given data set should best be split into. (The ISODATA algorithm obtains a more reasonable number of clusters K by automatically merging and splitting clusters.)
    • K-Means needs random initial seed points, and these seeds matter a great deal: different random seeds can produce completely different results. (The K-Means++ algorithm addresses this problem by choosing the initial points effectively.)
  • modification: steps of the K-Means++ algorithm:
  • http://blog.csdn.net/chlele0105/article/details/12997391

    • First, randomly pick a point from the data set as the first "seed point".
      For each point, compute its distance D(x) to the nearest seed point chosen so far, store these values in an array, and add them up to get Sum(D(x)).
    • Then draw a random value and use it, in a weighted fashion, to choose the next "seed point". In practice: take a random value Random that falls within [0, Sum(D(x))), then iterate over the points doing Random -= D(x) until Random <= 0; the point where this happens is the next "seed point".
    • Repeat steps (2) and (3) until all K seed points have been chosen.
    • Run the K-Means algorithm.
    • From Wikipedia, the exact algorithm is as follows (a code sketch follows these steps):

    • 1 Choose one center uniformly at random from among the data points.
      2 For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
      3 Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
      4 Repeat Steps 2 and 3 until k centers have been chosen.
      5 Now that the initial centers have been chosen, proceed using standard k-means clustering.
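A minimal sketch of the seeding steps above (D(x)² weighting), assuming NumPy; kmeans_pp_init is an illustrative name, and it only chooses the K initial centers, after which standard k-means runs.

```python
# Minimal sketch of k-means++ seeding: pick the first center uniformly, then
# each next center with probability proportional to D(x)^2.
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                     # step 1: uniform pick
    while len(centers) < K:
        # step 2: D(x) = distance to the nearest center chosen so far
        D = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                  axis=2), axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        probs = D ** 2 / np.sum(D ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])      # step 4: repeat until K centers
    return np.array(centers)                                # step 5: run standard k-means from these

# usage
X = np.random.rand(200, 2)
print(kmeans_pp_init(X, K=3))
```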

    • This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.

    • Additionally, the authors calculate an approximation ratio for their algorithm. The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation (over the randomness of the algorithm), where k is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum.[6]

Clustering as a Mixture of Gaussians
  • Introduction to Model-Based Clustering
    • There’s another way to deal with clustering problems: a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model.
    • In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
  • The algorithm used in practice to find the mixture of Gaussians that models the data set is called EM (Expectation-Maximization); a minimal usage sketch follows.
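A minimal sketch of model-based clustering with a mixture of Gaussians fitted by EM, assuming scikit-learn is available; the two-blob data and the parameters are illustrative.

```python
# Minimal sketch: fit a 2-component Gaussian mixture with EM and read out
# hard assignments, soft responsibilities, and the fitted component means.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)  # EM runs inside fit()
print(gmm.predict(X)[:5])        # hard cluster labels
print(gmm.predict_proba(X)[:5])  # per-component responsibilities (soft clustering)
print(gmm.means_)                # fitted Gaussian means, one per component
```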


Data visualization
- want to plot data that lives in 50 dimensions – summarize the 50 features into 2 features and plot an x1-x2 graph

Principal Component Analysis problem formulation
- reduce from n D to k D: find k vectors onto which to project the data, so as to minimize the projection error
- PCA is not regression: regression minimizes the (vertical) distance from y to y_predict, while PCA minimizes the orthogonal projection distance.
- PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the squared projection error, i.e., the squared distance between each point and the location where it gets projected.
- Reducing the dimension means, for example, fitting a straight line to the relationship between two features: then a single number (the position along that line) can represent what was originally a two-dimensional data point.

PCA ALGO:
- after mean normalization (and optionally feature scaling), compute the n*n "covariance matrix" Sigma = (1/m) * X^T X
- compute the eigenvectors of Sigma (e.g., via SVD); the first k columns of U form U_reduce
- project each example: z = U_reduce^T x (a code sketch follows)
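A minimal sketch of these steps, assuming NumPy and mean-normalized data; pca, U_reduce, and the 50-D toy data are illustrative names.

```python
# Minimal sketch of PCA: covariance matrix -> eigenvectors (via SVD) -> project
# from n dimensions down to k.
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)                 # mean normalization
    Sigma = (X.T @ X) / len(X)             # n*n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)         # columns of U are the principal directions
    U_reduce = U[:, :k]                    # keep the top k eigenvectors
    Z = X @ U_reduce                       # z = U_reduce^T x for every example
    return Z, U_reduce

# usage: compress 50-D data to 2-D for an x1-x2 scatter plot
X = np.random.rand(100, 50)
Z, U_reduce = pca(X, k=2)
print(Z.shape)                             # (100, 2)
```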


