[machine learning] clustering algorithms and PCA

Clustering Algorithms

HKmeans - hierarchical K-means algorithm
http://gecco.org.chemie.uni-frankfurt.de/hkmeans/H-k-means.pdf
- s1: with K=2, generate 2 big clusters over all the data with the k-means algorithm
- s2: for each of these clusters, perform s1 again, dividing each cluster into two clusters (divisive hierarchical clustering)
- s3: repeat recursively until the desired number of clusters is reached (see the sketch below)
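The recursive splitting above can be prototyped in a few lines. This is a minimal sketch, assuming NumPy and scikit-learn are available; bisecting_kmeans, depth, and the random test data are illustrative names, not taken from the paper.

```python
# Minimal sketch of hierarchical (bisecting) k-means: split every cluster
# into two with plain k-means, `depth` times, giving up to 2**depth clusters.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, depth):
    clusters = [np.arange(len(X))]            # start: one cluster holding all indices
    for _ in range(depth):
        next_level = []
        for members in clusters:
            if len(members) < 2:              # too small to split further
                next_level.append(members)
                continue
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
            next_level.append(members[labels == 0])   # s1/s2: each cluster -> two
            next_level.append(members[labels == 1])
        clusters = next_level
    return clusters

# usage: 2 levels of splitting -> 4 leaf clusters on random 2-D data
X = np.random.rand(200, 2)
print([len(c) for c in bisecting_kmeans(X, depth=2)])
```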


http://blog.sciencenet.cn/blog-517369-444270.html

  • Traditional clustering algorithms can be divided into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
    1. Partitioning: PAM / Partitioning Method - e.g., K-means
      • Create k partitions, where k is the number of partitions to build; then use an iterative relocation technique that improves partition quality by moving objects from one partition to another. Typical partitioning methods include K-means and PAM.
    2. Hierarchical: divided into top-down (divisive) and bottom-up (agglomerative); often combined with other clustering methods such as PAM?
      • bottom-up: agglomerative, from N clusters to K clusters
      • top-down: divisive hierarchical clustering, from 1 cluster to K clusters? (seldom used)

Hierarchical Clustering
  • algo :
  • Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

    1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
    2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
    3. Compute distances (similarities) between the new cluster and each of the old clusters.
      • Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
      • single-linkage: d[(k), (r,s)] = min{ d[(k),(r)], d[(k),(s)] }, where (r,s) is the merged cluster; complete-linkage uses the max and average-linkage uses the mean instead of the min (a code sketch follows this list)
    4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)

      • This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also a divisive hierarchical clustering which does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available, and rarely have been applied.
        • The main weaknesses of agglomerative clustering methods are:
        • they do not scale well: time complexity of at least O(n²), where n is the number of total objects;
        • they can never undo what was done previously.
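A minimal sketch of the agglomerative procedure above with the single-linkage update, assuming a precomputed N*N distance matrix and NumPy; single_linkage and the toy points are illustrative names.

```python
# Minimal sketch of agglomerative clustering (single linkage) from an N*N
# distance matrix, following steps 1-4 above.
import numpy as np

def single_linkage(dist, n_clusters):
    clusters = [[i] for i in range(len(dist))]       # step 1: one item per cluster
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)                      # never merge a cluster with itself
    while len(clusters) > n_clusters:
        r, s = np.unravel_index(np.argmin(d), d.shape)   # step 2: closest pair
        clusters[r] = clusters[r] + clusters[s]          # merge cluster s into cluster r
        # step 3 (single linkage): distance to the merged cluster is the minimum
        d[r, :] = np.minimum(d[r, :], d[s, :])
        d[:, r] = d[r, :]
        d[r, r] = np.inf
        d = np.delete(np.delete(d, s, axis=0), s, axis=1)  # drop row/column of s
        clusters.pop(s)
    return clusters

# usage: five 1-D points, merged down to 2 clusters
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(single_linkage(np.abs(pts - pts.T), n_clusters=2))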
K-means
  • input :
    • K :=number of clusters
    • Training set x(1), x(2), ..., x(m)
  • algo

    • s1: First randomly pick K points and call them the centers, say μ1, μ2, ..., μK; then compute the distance from every other point to each center. If, say, point x(n) is closest to μ3, we say x(n) belongs to cluster 3.
      All points are thus temporarily partitioned into K groups.
    • s2: Then compute the mean of the points in each cluster, move μ1, μ2, ..., μK to these means, and repeat s1 (a code sketch of both steps follows this list).
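The two alternating steps can be written directly. This is a minimal sketch, assuming NumPy; kmeans, n_iter, and the toy data are illustrative names.

```python
# Minimal sketch of the K-means loop: s1 assigns points to the nearest center,
# s2 moves each center to the mean of its assigned points.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # s1: distance from every point to every center, pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # s2: move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels

# usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, K=2)
print(centers)
```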
  • choosing the value of K:

    • choosing K by hand (based on the downstream purpose) is often best. Alternatively:
    • elbow method: run K-means for a range of K values and look at the cost function J to see where it stops decreasing quickly. As K becomes larger, J should become smaller, and there is usually an "elbow" point where the decrease flattens out. Note that if J does not decrease (e.g., due to a bad random initialization), that run should be repeated (see the sketch below).
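A minimal sketch of the elbow test, assuming scikit-learn; KMeans.inertia_ is the cost J described above, and the three-blob toy data is illustrative.

```python
# Minimal sketch of the elbow method: run K-means for several K and watch J.
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(60, 2) + m for m in ([0, 0], [6, 0], [0, 6])])
for K in range(1, 8):
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    print(K, round(km.inertia_, 1))   # inertia_ = sum of squared distances to centers
# J drops sharply up to K = 3 (the true number of blobs) and then flattens;
# that bend is the "elbow". If J ever goes up, rerun with a different initialization.
```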
  • Algorithms for finding the cluster centers

  • K-Means has two major drawbacks, both related to initial values:

    • K is given in advance, and choosing this value is very hard to estimate. Often it is not known beforehand how many categories a given data set should best be split into. (The ISODATA algorithm obtains a more reasonable number of clusters K by automatically merging and splitting clusters.)
    • K-Means needs random initial seed points, and these seeds matter a great deal: different random seeds can produce completely different results. (The K-Means++ algorithm addresses this problem by choosing the initial points effectively.)
  • modification: steps of the K-Means++ algorithm:
  • http://blog.csdn.net/chlele0105/article/details/12997391

    • First, randomly pick a point from the data set as the first "seed point".
      For each point, compute its distance D(x) to the nearest seed point chosen so far, store these values in an array, and add them up to get Sum(D(x)).
    • Then draw a random value and use it, in a weighted fashion, to choose the next "seed point". In practice: take a random value Random that falls within [0, Sum(D(x))), then iterate over the points doing Random -= D(x) until Random <= 0; the point where this happens is the next "seed point".
    • Repeat steps (2) and (3) until all K seed points have been chosen.
    • Run the K-Means algorithm.
    • From Wikipedia, the exact algorithm is as follows (a code sketch follows these steps):

    • 1 Choose one center uniformly at random from among the data points.
      2 For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
      3 Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
      4 Repeat Steps 2 and 3 until k centers have been chosen.
      5 Now that the initial centers have been chosen, proceed using standard k-means clustering.
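A minimal sketch of the seeding steps above (D(x)² weighting), assuming NumPy; kmeans_pp_init is an illustrative name, and it only chooses the K initial centers, after which standard k-means runs.

```python
# Minimal sketch of k-means++ seeding: pick the first center uniformly, then
# each next center with probability proportional to D(x)^2.
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                     # step 1: uniform pick
    while len(centers) < K:
        # step 2: D(x) = distance to the nearest center chosen so far
        D = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                  axis=2), axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        probs = D ** 2 / np.sum(D ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])      # step 4: repeat until K centers
    return np.array(centers)                                # step 5: run standard k-means from these

# usage
X = np.random.rand(200, 2)
print(kmeans_pp_init(X, K=3))
```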

    • This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.

    • Additionally, the authors calculate an approximation ratio for their algorithm. The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation (over the randomness of the algorithm), where k is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum.[6]

Clustering as a Mixture of Gaussians
  • Introduction to Model-Based Clustering
    • There’s another way to deal with clustering problems: a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model.
    • In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
  • The algorithm used in practice to find the mixture of Gaussians that models the data set is called EM (Expectation-Maximization); a minimal usage sketch follows.
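A minimal sketch of model-based clustering with a mixture of Gaussians fitted by EM, assuming scikit-learn is available; the two-blob data and the parameters are illustrative.

```python
# Minimal sketch: fit a 2-component Gaussian mixture with EM and read out
# hard assignments, soft responsibilities, and the fitted component means.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)  # EM runs inside fit()
print(gmm.predict(X)[:5])        # hard cluster labels
print(gmm.predict_proba(X)[:5])  # per-component responsibilities (soft clustering)
print(gmm.means_)                # fitted Gaussian means, one per component
```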


Data visualization
- want to plot data that lives in 50 dimensions – summarize the 50 features into 2 features and plot an x1-x2 graph

Principal Component Analysis problem formulation
- reduce from n D to k D: find k vectors onto which to project the data, so as to minimize the projection error
- PCA is not regression: regression minimizes the (vertical) distance from y to y_predict, while PCA minimizes the orthogonal projection distance.
- PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the squared projection error, i.e., the squared distance between each point and the location where it gets projected.
- Reducing the dimension means, for example, fitting a straight line to the relationship between two features: then a single number (the position along that line) can represent what was originally a two-dimensional data point.

PCA ALGO:
- after mean normalization (and optionally feature scaling), compute the n*n "covariance matrix" Sigma = (1/m) * X^T X
- compute the eigenvectors of Sigma (e.g., via SVD); the first k columns of U form U_reduce
- project each example: z = U_reduce^T x (a code sketch follows)
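A minimal sketch of these steps, assuming NumPy and mean-normalized data; pca, U_reduce, and the 50-D toy data are illustrative names.

```python
# Minimal sketch of PCA: covariance matrix -> eigenvectors (via SVD) -> project
# from n dimensions down to k.
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)                 # mean normalization
    Sigma = (X.T @ X) / len(X)             # n*n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)         # columns of U are the principal directions
    U_reduce = U[:, :k]                    # keep the top k eigenvectors
    Z = X @ U_reduce                       # z = U_reduce^T x for every example
    return Z, U_reduce

# usage: compress 50-D data to 2-D for an x1-x2 scatter plot
X = np.random.rand(100, 50)
Z, U_reduce = pca(X, k=2)
print(Z.shape)                             # (100, 2)
```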


