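1 Start by analyzing the data and answering the following questions:
1) What problem is clustering meant to solve? Do you want to explore the structure of the data, partition it into many classes, or something else?
2) How is the data distributed? Are you clustering samples or variables? What distribution might the samples follow, and what distribution might the variables follow?
2 Choose an appropriate clustering method
Choose a method that fits the purpose of the clustering and the distribution of the data. In general, hierarchical clustering is well suited to analyzing the structure of the data, so it can serve as a preliminary clustering that gives a first picture of that structure. k-means clustering requires the number of clusters to be specified in advance. Many other clustering methods exist. Depending on the data distribution, you also need a sensible way to compute the similarity between the objects being clustered; distance-based and correlation-based measures are the usual choices.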
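As a rough illustration of these choices, the sketch below runs hierarchical clustering on samples with a distance-based measure, hierarchical clustering on variables with a correlation-based measure, and k-means with a fixed number of clusters. The iris data, the average linkage, and k = 3 are assumptions made only for the example; they are not recommendations from this post.

```r
# Illustrative data only (an assumption): the four numeric columns of iris
x <- scale(iris[, 1:4])            # standardise so no variable dominates the distance

# Clustering samples (rows) with a distance-based dissimilarity
d_euclid <- dist(x, method = "euclidean")
hc <- hclust(d_euclid, method = "average")
# plot(hc) would draw the dendrogram for a first look at the structure

# Clustering variables (columns) with a correlation-based dissimilarity
d_cor <- as.dist(1 - cor(x))
hc_vars <- hclust(d_cor, method = "average")

# k-means needs the number of clusters up front (k = 3 is an assumption)
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)
table(km$cluster)                  # cluster sizes
```

Using `1 - cor(x)` treats highly correlated variables as close together; other transformations of the correlation are possible and may suit some data better.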
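3 Determine the number of clusters
This is hard to settle unless good prior knowledge is available. It is closely related to the next question.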
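When no prior knowledge is available, one common heuristic (not something this post prescribes) is to try several values of k and compare an internal quality index. The sketch below uses the average silhouette width from the cluster package; the iris data and the candidate range 2 to 6 are assumptions for illustration, and the index is only a guide, not a definitive answer.

```r
library(cluster)                   # provides silhouette()

x <- scale(iris[, 1:4])            # illustrative data (an assumption)
d <- dist(x)

set.seed(1)
ks <- 2:6                          # candidate cluster counts (an assumption)
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  # mean silhouette width over all samples for this k
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

avg_sil                            # one value per candidate k
best_k <- ks[which.max(avg_sil)]   # k with the highest average width
```

A higher average silhouette width suggests better-separated clusters, but the value of k it favors should still be checked against the purpose of the analysis.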
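4 Evaluate the quality of the clustering
If a gold standard exists, it is the best way to judge the clustering. If not, first consider the purpose of the clustering; the result should then be consistent with existing prior knowledge.
Beyond that, there are other ways to judge a clustering. One is whether the result is stable: if the method involves parameter choices, some results may be very sensitive to them. Another is how consistent different clustering methods are: if different methods produce very similar clusterings, the result is more trustworthy.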
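Both checks can be sketched with a simple agreement score between two partitions. The `rand_index()` helper below is a hypothetical, hand-written Rand index (the share of sample pairs on which two partitions agree), used only to keep the example self-contained; the data, the seeds, and k = 3 are likewise assumptions.

```r
# Hypothetical helper: Rand index between two cluster label vectors
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")      # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")      # TRUE where a pair shares a cluster in b
  agree  <- same_a == same_b
  mean(agree[upper.tri(agree)])    # count each unordered pair once
}

x <- scale(iris[, 1:4])            # illustrative data (an assumption)

# Stability: the same method started from different random seeds
set.seed(1)
km1 <- kmeans(x, centers = 3, nstart = 1)$cluster
set.seed(2)
km2 <- kmeans(x, centers = 3, nstart = 1)$cluster

# Consistency: a different method cut to the same number of clusters
hc <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

rand_index(km1, km2)               # close to 1 means stable across starts
rand_index(km1, hc)                # close to 1 means the methods agree
table(kmeans = km1, hclust = hc)   # cross-tabulation of the two partitions
```

The Rand index ignores how clusters are labelled, so it can compare partitions whose cluster numbers do not match up.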
Many R packages support cluster analysis. Below is the list of clustering packages summarized on CRAN:
See (http://cran.r-project.org/web/views/Cluster.html)
Hierarchical Clustering:
- Functions hclust() from package stats and agnes() from cluster are the primary functions for agglomerative hierarchical clustering; function diana() can be used for divisive hierarchical clustering. Faster alternatives to hclust() are provided by the packages fastcluster and flashClust.
- Function dendrogram() from stats and associated methods can be used for improved visualization for cluster dendrograms.
- Package dynamicTreeCut contains methods for detection of clusters in hierarchical clustering dendrograms.
- hybridHclust implements hybrid hierarchical clustering via mutual clusters.
- Package isopam uses an algorithm which is based on the classification of ordination scores from isometric feature mapping. The classification is performed either as a hierarchical, divisive method or as non-hierarchical partitioning.
- Package LLAhclust provides likelihood linkage analysis hierarchical clustering.
- The package protoclust implements a form of hierarchical clustering that associates a prototypical element with each interior node of the dendrogram. Using the package's plot() function, one can produce dendrograms that are prototype-labeled and are therefore easier to interpret.
- pvclust is a package for assessing the uncertainty in hierarchical cluster analysis. It provides approximately unbiased p-values as well as bootstrap p-values.
- Package sparcl provides clustering for a set of n observations when p variables are available, where p >> n. It adaptively chooses a set of variables to use in clustering the observations. Sparse K-means clustering and sparse hierarchical clustering are implemented.
Partitioning Clustering:
- Function kmeans() from package stats provides several algorithms for computing partitions with respect to Euclidean distance.
- Function pam() from package cluster implements partitioning around medoids and can work with arbitrary distances. Function clara() is a wrapper to pam() for larger data sets. Silhouette plots and spanning ellipses can be used for visualization.
- Package apcluster implements Frey's and Dueck's Affinity Propagation clustering. The algorithms in the package are analogous to the Matlab code published by Frey and Dueck.
- Package bayesclust allows testing and searching for clusters in a hierarchical Bayes model.
- Package clues provides a clustering method based on local shrinking.
- Package clusterSim allows searching for the optimal clustering procedure for a given dataset.
- Package flexclust provides k-centroid cluster algorithms for arbitrary distance measures, hard competitive learning, neural gas and QT clustering. Neighborhood graphs and image plots of partitions are available for visualization. Some of this functionality is also provided by package cclust.
- Package kernlab provides a weighted kernel version of the k-means algorithm by kkmeans and spectral clustering by specc.
- Packages kml and kml3d provide k-means clustering specifically for longitudinal (joint) data.
- Package optpart contains a set of algorithms for creating partitions and coverings of objects largely based on operations on similarity relations (or matrices).
- Package pdfCluster provides tools to perform cluster analysis via kernel density estimation. Clusters are associated to the maximally connected components with estimated density above a threshold.
- Package skmeans allows spherical k-Means Clustering, i.e. k-means clustering with cosine similarity. It features several methods, including a genetic and a simple fixed-point algorithm and an interface to the CLUTO vcluster program for clustering high-dimensional datasets.
- Package trimcluster provides trimmed k-means clustering. Package tclust also allows for trimmed k-means clustering. In addition, tclust allows other covariance structures to be specified for the clusters.
Model-based Clustering:
- ML estimation:
- Package mclust fits mixtures of Gaussians using the EM algorithm. It allows fine control of volume and shape of covariance matrices and agglomerative hierarchical clustering based on maximum likelihood. It provides comprehensive strategies using hierarchical clustering, EM and the Bayesian Information Criterion (BIC) for clustering, density estimation, and discriminant analysis. Please note the license under which this package is distributed. Except for strict academic use, use of mclust (by itself or through other packages) requires payment of an annual license fee and completion of a license agreement.
- Package HDclassif provides function hddc to fit Gaussian mixture model to high-dimensional data where it is assumed that the data lives in a lower dimension than the original space.
- mritc provides tools for classification using normal mixture models and (higher resolution) hidden Markov normal mixture models fitted by various methods.
- prabclus clusters a presence-absence matrix object by calculating an MDS from the distances, and applying maximum likelihood Gaussian mixtures clustering to the MDS points.
- Package MetabolAnalyze fits mixtures of probabilistic principal component analysis with the EM algorithm.
- Fitting finite mixtures of uni- and multivariate scale mixtures of skew-normal distributions with the EM algorithm is provided by package mixsmsn.
- Package movMF fits finite mixtures of von Mises-Fisher distributions with the EM algorithm.
- Package MFDA implements model-based functional data analysis.
- For grouped conditional data, package mixdist can be used.
- Package mixRasch estimates mixture Rasch models, including the dichotomous Rasch model, the rating scale model, and the partial credit model with joint maximum likelihood estimation.
- Package pmclust provides unsupervised model-based clustering for high-dimensional (ultra) large data. The package uses Rmpi to perform a parallel version of the EM algorithm for mixtures of Gaussians.
- Bayesian estimation:
- Bayesian estimation of finite mixtures of multivariate Gaussians is possible using package bayesm. The package provides functionality for sampling from such a mixture as well as estimating the model using Gibbs sampling. Additional functionality for analyzing the MCMC chains is available for averaging the moments over MCMC draws, for determining the marginal densities, for clustering observations and for plotting the uni- and bivariate marginal densities.
- Package bayesmix provides Bayesian estimation using JAGS.
- Package Bmix provides Bayesian Sampling for stick-breaking mixtures.
- Package bclust allows Bayesian clustering using a spike-and-slab hierarchical model and is suitable for clustering high-dimensional data.
- Package dpmixsim fits Dirichlet process mixture models using conjugate models with normal structure. Package profdpm determines the maximum posterior estimate for product partition models where the Dirichlet process mixture is a specific case in the class.
- Package mixAK contains a mixture of statistical methods including the MCMC methods to analyze normal mixtures with possibly censored data.
- Package GSM fits mixtures of gamma distributions.
- Package mcclust implements methods for processing a sample of (hard) clusterings, e.g. the MCMC output of a Bayesian clustering model. Among them are methods that find a single best clustering to represent the sample, which are based on the posterior similarity matrix or a relabelling algorithm.
- Package rjags provides an interface to the JAGS MCMC library which includes a module for mixture modelling.
- Other estimation methods:
- Package AdMit fits an adaptive mixture of Student-t distributions to approximate a target density through its kernel function.
- Circular and orthogonal regression clustering using redescending M-estimators is provided by package edci.
- Robust estimation using Weighted Likelihood can be done with package wle.
- Package pendensity estimates densities with a penalized mixture approach.
Other Cluster Algorithms:
- Package amap provides alternative implementations of k-means and agglomerative hierarchical clustering.
- Package biclust provides several algorithms to find biclusters in two-dimensional data. Package isa2 provides the Iterative Signature Algorithm (ISA) for biclustering.
- Package cba implements clustering techniques for business analytics like "rock" and "proximus".
- Package CHsharp clusters 3-dimensional data into their local modes based on a convergent form of Choi and Hall's (1999) data sharpening method.
- Package clue implements ensemble methods for both hierarchical and partitioning cluster methods.
- Package CoClust implements a cluster algorithm that is based on copula functions and therefore groups observations according to the multivariate dependence structure of the generating process without any assumptions on the margins.
- Fuzzy clustering and bagged clustering are available in package e1071.
- Package compHclust provides complementary hierarchical clustering which was especially designed for microarray data to uncover structures present in the data that arise from 'weak' genes.
- Package FactoClass performs a combination of factorial methods and cluster analysis.
- The hopach algorithm is a hybrid between hierarchical methods and PAM and builds a tree by recursively partitioning a data set.
- For graphs and networks model-based clustering approaches are implemented in packages latentnet and mixer.
- Package nnclust allows fast clustering of large data sets by constructing a minimum spanning tree for each cluster. For each cluster the procedure is stopped when the nearest-neighbour distance rises above a specified threshold. A set of clusters and a set of "outliers" not in any cluster is returned. The algorithm works best for well-separated clusters in up to 8 dimensions, and sample sizes up to hundreds of thousands.
- Package randomLCA provides the fitting of latent class models which optionally also include a random effect. Package poLCA allows for polytomous variable latent class analysis and regression.
- Package RPMM fits recursively partitioned mixture models for Beta and Gaussian Mixtures. This is a model-based clustering algorithm that returns a hierarchy of classes, similar to hierarchical clustering, but also similar to finite mixture models.
- Package segclust fits a segmentation/clustering model. A mixture of univariate gaussian distributions is used for the cluster structure and segments are assumed to arise because switching between clusters over time occurs.
- Self-organizing maps are available in package som.
- Several packages provide cluster algorithms which have been developed for bioinformatics applications. These packages include FunCluster for profiling microarray expression data, MMG for mixture models on graphs, and ORIClust for order-restricted information-based clustering.
Cluster-wise Regression:
- Package flexmix implements a user-extensible framework for EM-estimation of mixtures of regression models, including mixtures of (generalized) linear models.
- Package fpc provides fixed-point methods both for model-based clustering and linear regression. A collection of asymmetric projection methods can be used to plot various aspects of a clustering.
- Multigroup mixtures of latent Markov models on mixed categorical and continuous data (including time series) can be fitted using depmix or depmixS4. The parameters are optimized using a general purpose optimization routine given linear and nonlinear constraints on the parameters.
- Package mixreg fits mixtures of one-variable regressions and provides the bootstrap test for the number of components.
- Package lcmm fits a latent class linear mixed model which is also known as growth mixture model or heterogeneous linear mixed model using a maximum likelihood method.
- moc fits mixture models to multivariate mixed data using a Newton-type algorithm. The component specific distribution may have one, two or three parameters. Covariates and concomitant variables can be specified as well as constraints for the parameters.
- mixtools provides fitting with the EM algorithm for parametric and non-parametric (multivariate) mixtures. Parametric mixtures include mixtures of multinomials, multivariate normals, normals with repeated measures, Poisson regressions and Gaussian regressions (with random effects). Non-parametric mixtures include the univariate semi-parametric case where symmetry is imposed for identifiability and multivariate non-parametric mixtures with a conditional independence assumption. In addition, fitting mixtures of Gaussian regressions with the Metropolis-Hastings algorithm is available.
- mixPHM fits mixtures of proportional hazard models with the EM algorithm.
- Package gamlss.mx fits finite mixtures of gamlss family distributions.
Additional Functionality:
- Mixtures of univariate normal distributions can be printed and plotted using package nor1mix.
- Packages gcExplorer and clusterfly visualise the results of clustering algorithms.
- Package clusterGeneration contains functions for generating random clusters and random covariance/correlation matrices, calculating a separation index (data and population version) for pairs of clusters or cluster distributions, and 1-D and 2-D projection plots to visualize clusters. Alternatively MixSim generates a finite mixture model with Gaussian components for prespecified levels of maximum and/or average overlaps. This model can be used to simulate data for studying the performance of cluster algorithms.
- For cluster validation, package clusterRepro tests the reproducibility of a cluster. Package clv contains popular internal and external cluster validation methods ready to use for most of the outputs produced by functions from package cluster, and clValid calculates several stability measures.
- Package clustTool provides a GUI for clustering data with spatial information.
- Package clustvarsel provides variable selection for model-based clustering.
- Functionality to compare the similarity between two cluster solutions is provided by cluster.stats() in package fpc.
- clusterCons calculates the consensus clustering result from re-sampled clustering experiments with the option of using multiple algorithms and parameters.
- The stability of k-centroid clustering solutions fitted using functions from package flexclust can also be validated via bootFlexclust() using bootstrap methods.
- Package MOCCA provides methods to analyze cluster alternatives based on multi-objective optimization of cluster validation indices.
- Package SDisc provides an integrated methodology for the identification of homogeneous profiles in a data distribution by including methods for data treatment and pre-processing, repeated cluster analysis, model selection, model reliability and reproducibility assessment, profiles characterization and validation by visual and table summaries.
- Package sigclust provides a statistical method for testing the significance of clustering results.
Reposted from: http://blog.sciencenet.cn/blog-54276-549380.html