CRAN Task View: Multivariate Statistics

Maintainer: Paul Hewson
Contact: Paul.Hewson at plymouth.ac.uk
Version: 2012-04-11

Base R contains most of the functionality for classical multivariate analysis. There are a large number of packages on CRAN which extend this methodology; a brief overview is given below. Application-specific uses of multivariate statistics are described in the relevant task views: for example, whilst principal components are listed here, ordination is covered in the Environmetrics task view. Further information on supervised classification can be found in the MachineLearning task view, and on unsupervised classification in the Cluster task view.

The packages in this view can be roughly structured into the following topics. If you think that some package is missing from the list, please let me know.

Visualising multivariate data

  • Graphical Procedures: A range of base graphics (e.g. pairs() and coplot()) and lattice functions (e.g. xyplot() and splom()) are useful for visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities. scatterplotMatrix() in car provides usefully enhanced pairwise scatterplots. The cwhmisc package provides plotSplomT(), which displays correlation values and adds histograms on the diagonal of scatterplot matrices. Beyond this, scatterplot3d provides 3-dimensional scatterplots, and aplpack provides bagplots and spin3R(), a function for rotating 3d clouds. misc3d, dependent upon rgl, provides animated functions within R useful for visualising densities. YaleToolkit provides a range of useful visualisation techniques for multivariate data. More specialised multivariate plots include the following: faces() in aplpack provides Chernoff's faces; parcoord() from MASS provides parallel coordinate plots; stars() in graphics provides a choice of star, radar and cobweb plots. mstree() in ade4 and spantree() in vegan provide minimum spanning tree functionality. calibrate supports biplot and scatterplot axis labelling, and chplot provides convex hull plots. geometry, which provides an interface to the qhull library, gives indices to the relevant points via convexhulln(). ellipse draws ellipses for two parameters and provides plotcorr(), a visual display of a correlation matrix. denpro provides level set trees for multivariate visualisation. Mosaic plots are available via mosaicplot() in graphics and mosaic() in vcd, which also contains other visualisation techniques for multivariate categorical data. gclus provides a number of cluster-specific graphical enhancements for scatterplots and parallel coordinate plots. See the links for a reference to GGobi; rggobi interfaces with GGobi, while xgobi interfaces to the XGobi and XGvis programs, which allow linked, dynamic multivariate plots as well as projection pursuit. Finally, iplots allows particularly powerful dynamic interactive graphics, of which the interactive parallel coordinate plots and mosaic plots may be of particular interest. Seriation methods are provided by seriation, which can reorder matrices and dendrograms.
  • Data Preprocessing: summarize() and summary.formula() in Hmisc assist with descriptive functions; from the same package, varclus() offers variable clustering, while dataRep() and find.matches() assist in exploring a given dataset in terms of representativeness and finding matches. Whilst dist() in base and daisy() in cluster provide a wide range of distance measures, proxy provides a framework for further distance measures, including measures between matrices. simba provides functions for dealing with presence/absence data, including similarity matrices and reshaping.
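Much of the above can be previewed with nothing beyond base R. A minimal sketch using the built-in iris and mtcars data (base graphics only; package-specific functions such as faces() or parcoord() follow the same calling pattern):

```r
## Pairwise scatterplot matrix, coloured by species
pairs(iris[, 1:4], col = iris$Species, main = "Anderson's iris data")

## Star (radar) plots for a few cars from mtcars
stars(mtcars[1:8, 1:7])

## Euclidean distances between the first five iris observations,
## the kind of object consumed by many seriation and plotting methods
d <- dist(iris[1:5, 1:4])
print(d)
```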

Hypothesis testing

  • ICSNP provides Hotelling's T2 test as well as a range of non-parametric tests, including location tests based on marginal ranks, spatial median and spatial sign computation, and estimates of shape. Non-parametric two-sample tests are also available from cramer, and spatial sign and rank tests to investigate location, sphericity and independence are available in SpatialNP.

Multivariate distributions

  • Descriptive measures: cov() and cor() in stats provide estimates of the covariance and correlation matrices respectively. ICSNP offers several descriptive measures, such as spatial.median(), which provides an estimate of the spatial median, and further functions which provide estimates of scatter. Further robust methods are provided, such as cov.rob() in MASS, which gives robust estimates of the variance-covariance matrix by minimum volume ellipsoid, minimum covariance determinant or classical product-moment. covRobust provides robust covariance estimation via nearest-neighbour variance estimation. robustbase provides robust covariance estimation via the fast minimum covariance determinant with covMcd() and the orthogonalised pairwise estimate of Gnanadesikan-Kettenring via covOGK(). Scalable robust methods are provided within rrcov, also using the fast minimum covariance determinant with covMcd() as well as M-estimators with covMest(). corpcor provides shrinkage estimation of large-scale covariance and (partial) correlation matrices.
  • Densities (estimation and simulation): mvrnorm() in MASS simulates from the multivariate normal distribution. mvtnorm provides simulation as well as probability and quantile functions for both the multivariate t and multivariate normal distributions, as well as density functions for the multivariate normal distribution; mvtnormpcs provides functions based on Dunnett. mnormt provides multivariate normal and multivariate t density and distribution functions as well as random number simulation. sn provides density, distribution and random number generation for the multivariate skew normal and skew t distributions. delt provides a range of functions for estimating multivariate densities by CART and greedy methods. Comprehensive information on mixtures is given in the Cluster view; some density estimates and random numbers are provided by rmvnorm.mixt() and dmvnorm.mixt() in ks, and mixture fitting is also provided within bayesm. Functions to simulate from the Wishart distribution are provided in a number of places, such as rwishart() in bayesm and rwish() in MCMCpack (the latter also has a density function, dwish()). bkde2D() from KernSmooth and kde2d() from MASS provide binned and non-binned 2-dimensional kernel density estimation; ks also provides multivariate kernel smoothing, as do ash and GenKern. prim provides patient rule induction methods to attempt to find regions of high density in high-dimensional multivariate data, and feature provides methods for determining feature significance in multivariate data (such as in relation to local modes).
  • Assessing normality: mvnormtest provides a multivariate extension of the Shapiro-Wilk test, and mvoutlier provides multivariate outlier detection based on robust methods. ICS provides tests for multivariate normality. mvnorm.etest() in energy provides an assessment of normality based on E-statistics (energy); in the same package, k.sample() assesses a number of samples for equal distributions. Tests for Wishart-distributed covariance matrices are given by mauchly.test() in stats.
  • Copulas: copula provides routines for a range of elliptical and Archimedean copulas, including the normal, t, Clayton, Frank and Gumbel families; fgac provides generalised Archimedean copulas.
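As a minimal sketch of simulation from the multivariate normal using mvrnorm() from MASS (which ships with R), checking the sample moments against the targets:

```r
library(MASS)  # for mvrnorm()

set.seed(1)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)   # target covariance matrix
X <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)

## Sample moments should be close to mu and Sigma
colMeans(X)
round(cov(X), 2)
```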

Linear models

  • From stats, lm() (with a matrix specified as the dependent variable) offers multivariate linear models, anova.mlm() provides comparison of multivariate linear models, and manova() offers MANOVA. sn provides msn.mle() and mst.mle(), which fit multivariate skew normal and multivariate skew t models. pls provides partial least squares regression (PLSR) and principal component regression, and ppls provides penalized partial least squares. dr provides dimension reduction regression options such as "sir" (sliced inverse regression) and "save" (sliced average variance estimation). plsgenomics provides partial least squares analyses for genomics. relaimpo provides functions to investigate the relative importance of regression parameters.
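A minimal base-R illustration of a multivariate linear model (matrix response) and the equivalent one-way MANOVA:

```r
## Two responses modelled jointly from a grouping factor
fit <- lm(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)
anova(fit)                       # dispatches to anova.mlm()

## One-way MANOVA on the same data, with Pillai's trace
m <- manova(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)
summary(m, test = "Pillai")
```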

Projection methods

  • Principal components: these can be fitted with prcomp() (based on svd(), preferred) as well as princomp() (based on eigen() for compatibility with S-PLUS) from stats. sca provides simple components. pc1() in Hmisc provides the first principal component and gives coefficients for unscaled data. Additional support for assessing the scree plot can be found in nFactors, whereas paran provides routines for Horn's evaluation of the number of dimensions to retain. pcurve provides principal curve analysis and visualisation as well as a further principal component method. For wide matrices, gmodels provides fast.prcomp() and fast.svd(). kernlab uses kernel methods to provide a form of non-linear principal components with kpca(). pcaPP provides robust principal components by means of projection pursuit. amap provides further robust and parallelised methods, such as forms of generalised and robust principal component analysis via acpgen() and acprob() respectively. Further options for principal components in an ecological setting are available within ade4, and in a sensory setting in SensoMineR. psy provides a variety of routines useful in psychometry; in this context these include sphpca(), which maps onto a sphere, fpca(), where some variables may be considered as dependent, and scree.plot(), which has the option of adding simulation results to help assess the observed data. PTAk provides principal tensor analysis, analogous to both PCA and correspondence analysis. smatr provides standardised major axis estimation with specific application to allometry.
  • Canonical Correlation: cancor() in stats provides canonical correlation. kernlab uses kernel methods to provide robust canonical correlation with kcca(). concor provides a number of concordance methods.
  • Redundancy Analysis: calibrate provides rda() for redundancy analysis as well as further options for canonical correlation.
  • Independent Components: fastICA provides fastICA algorithms to perform independent component analysis (ICA) and Projection Pursuit, and PearsonICA uses score functions. ICS provides either an invariant co-ordinate system or independent components. JADE adds an interface to the JADE algorithm, as well as providing some diagnostics for ICA.
  • Procrustes analysis: procrustes() in vegan provides Procrustes analysis; this package also provides functions for ordination, and further information on that area is given in the Environmetrics task view. Generalised Procrustes analysis via GPA() is available from FactoMineR.
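A basic principal components fit with the preferred prcomp(), scaling the variables first:

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)  # SVD-based PCA on scaled data
summary(pc)                    # proportion of variance per component
head(pc$x[, 1:2])              # scores on the first two components
screeplot(pc, type = "lines")  # scree plot to help choose dimensions
```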

Principal coordinates / scaling methods

  • cmdscale() in stats provides classical multidimensional scaling (principal coordinates analysis), and sammon() and isoMDS() in MASS offer Sammon and Kruskal's non-metric multidimensional scaling. vegan provides wrappers and post-processing for non-metric MDS. indscal() is provided by SensoMineR.
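A short sketch of classical scaling with cmdscale() on the built-in USArrests data:

```r
d   <- dist(USArrests)           # Euclidean distances between states
mds <- cmdscale(d, k = 2)        # principal coordinates in 2 dimensions
plot(mds, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = rownames(USArrests), cex = 0.6)
```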

Unsupervised classification

  • Cluster analysis: A comprehensive overview of clustering methods available within R is provided by the Cluster task view. Standard techniques include hierarchical clustering by hclust() and k-means clustering by kmeans() in stats. A range of established clustering and visualisation techniques are also available in cluster, some cluster validation routines are available in clv, and the Rand index can be computed from classAgreement() in e1071. Trimmed cluster analysis is available from trimcluster, cluster ensembles are available from clue, methods to assist with choice of routines are available in clusterSim and hybrid methodology is provided by hybridHclust. Distance measures (edist()) and hierarchical clustering (hclust.energy()) based on E-statistics are available in energy. LLAhclust provides variable and object clustering based on a likelihood linkage method, and also provides indices for assessing the results. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from fpc. clustvarsel provides variable selection within model-based clustering. Fuzzy clustering is available within cluster as well as via the hopach (Hierarchical Ordered Partitioning and Collapsing Hybrid) algorithm. kohonen provides supervised and unsupervised SOMs for high-dimensional spectra or patterns. clusterGeneration helps simulate clusters. The Environmetrics task view also gives a topic-related overview of some clustering techniques. Model-based clustering is available in mclust, and model-based clustering for functional data is available in MFDA.
  • Tree methods: Full details on tree methods are given in the MachineLearning task view. Suffice it to say here that classification trees are sometimes considered within multivariate methods; rpart is most used for this purpose. TWIX provides trees with extra splits, and hier.part partitions the variance in a multivariate data set. mvpart extends regression trees to cover multivariate regression trees, party provides recursive partitioning and rrp provides random recursive partitioning. Classification and regression training is provided by caret. (Formerly, caretLSF provided additional parallel processing capacity, which has now been incorporated into caret.) kknn provides k-nearest-neighbour methods which can be used for regression as well as classification.
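The standard hclust() and kmeans() workflow mentioned above can be sketched as follows (built-in USArrests data, variables scaled first):

```r
## Hierarchical clustering on scaled data
hc <- hclust(dist(scale(USArrests)))   # default complete linkage
plot(hc, cex = 0.6)
groups <- cutree(hc, k = 4)            # cut the dendrogram into 4 groups

## k-means on the same data, with several random starts
set.seed(2)
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)
table(groups, km$cluster)              # compare the two solutions
```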

Supervised classification and discriminant analysis

  • lda() and qda() within MASS provide linear and quadratic discrimination respectively. mda provides mixture and flexible discriminant analysis with mda() and fda(), as well as multivariate adaptive regression splines with mars() and adaptive spline backfitting with the bruto() function. Multivariate adaptive regression splines can also be found in earth. rda provides classification for high-dimensional data by means of shrunken centroids regularized discriminant analysis. Package class provides k-nearest neighbours via knn(), and knncat provides k-nearest neighbours for categorical variables. SensoMineR provides FDA() for factorial discriminant analysis. A number of packages combine dimension reduction with classification: klaR includes variable selection and robustness against multicollinearity as well as a number of visualisation routines, superpc provides principal components for supervised classification, whereas gpls provides classification using generalised partial least squares. hddplot provides cross-validated linear discriminant calculations to determine the optimum number of features. ROCR provides a range of methods for assessing classifier performance. predbayescor provides naive Bayes classification. Further information on supervised classification can be found in the MachineLearning task view.
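A minimal linear discriminant analysis with lda() from MASS on the iris data; note that resubstitution accuracy is optimistic, and cross-validation (e.g. lda(..., CV = TRUE)) is preferable in practice:

```r
library(MASS)  # for lda()

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class
table(observed = iris$Species, predicted = pred)
mean(pred == iris$Species)     # resubstitution accuracy (optimistic)
```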

Correspondence analysis

  • corresp() and mca() in MASS provide simple and multiple correspondence analysis respectively. ca also provides single, multiple and joint correspondence analysis. ca() and mca() in ade4 provide correspondence and multiple correspondence analysis respectively, as well as adding homogeneous table analysis with hta(). Further functionality is also available within vegan, and co-correspondence is available from cocorresp. FactoMineR provides CA() and MCA(), which also enable simple and multiple correspondence analysis as well as associated graphical routines. homals provides homogeneity analysis.
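A short example of simple correspondence analysis with corresp() from MASS, using the caith hair and eye colour table that ships with that package:

```r
library(MASS)  # for corresp() and the caith data

caith                          # eye colour vs hair colour counts
ca <- corresp(caith, nf = 2)   # two-dimensional solution
biplot(ca)                     # joint display of rows and columns
```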

Missing data

  • mitools provides tools for multiple imputation, mice provides multivariate imputation by chained equations, mvnmle provides ML estimation for multivariate normal data with missing values, and mix provides multiple imputation for mixed categorical and continuous data. pan provides multiple imputation for missing panel data. VIM provides methods for the visualisation as well as imputation of missing data. aregImpute() and transcan() from Hmisc provide further imputation methods. monomvn deals with estimation models where the missing data pattern is monotone.

Latent variable approaches

  • factanal() in stats provides factor analysis by maximum likelihood, and Bayesian factor analysis is provided for Gaussian, ordinal and mixed variables in MCMCpack. GPArotation offers GPA (gradient projection algorithm) factor rotation. FAiR provides factor analysis solved using genetic algorithms. sem fits linear structural equation models, ltm provides latent trait models under item response theory, and a range of extensions to Rasch models can be found in eRm. FactoMineR provides a wide range of factor analysis methods, including MFA() and HMFA() for multiple and hierarchical multiple factor analysis as well as ADFM() for multiple factor analysis of quantitative and qualitative data. tsfa provides factor analysis for time series. poLCA provides latent class and latent class regression models for a variety of outcome variables.
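A minimal factanal() sketch; the variable subset from the built-in mtcars data is an arbitrary illustration, not a recommended psychometric analysis:

```r
## Maximum-likelihood factor analysis with varimax rotation
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
fa <- factanal(mtcars[, vars], factors = 2, rotation = "varimax")
print(fa$loadings, cutoff = 0.3)   # suppress small loadings
fa$PVAL                            # test that 2 factors are sufficient
```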

Modelling non-Gaussian data

  • mprobit provides a multivariate probit model for binary as well as ordinal responses, MNP provides Bayesian multinomial probit models, and polycor provides polychoric and tetrachoric correlation matrices. bayesm provides a range of models, such as seemingly unrelated regression, multinomial logit/probit, multivariate probit and instrumental variables. VGAM provides vector generalised linear and additive models, including reduced-rank regression.

Matrix manipulations

  • As a vector- and matrix-based language, base R ships with many powerful tools for matrix manipulation, which are complemented by the packages Matrix and SparseM. matrixcalc adds functions for matrix differential calculus. Some further sparse matrix functionality is also available from spam.
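A few of the base matrix tools, as a reminder of what is available before reaching for add-on packages:

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2)   # symmetric positive-definite
b <- c(1, 2)

x <- solve(A, b)     # solve A x = b; here x = c(0.2, 0.6)
crossprod(A)         # t(A) %*% A, computed efficiently
eigen(A)$values      # eigendecomposition
chol(A)              # Cholesky factor
```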

Miscellaneous utilities

  • abind generalises cbind() and rbind() for arrays, and mApply() in Hmisc generalises apply() for matrices and passes multiple functions. In addition to functions listed earlier, sn provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distributions. panel provides methods for modelling panel data, mAr provides vector auto-regression, and MSBVAR provides Bayesian vector autoregression models, along with impulse responses and forecasting. rm.boot() from Hmisc bootstraps repeated measures models. cramer provides a multivariate nonparametric Cramér test for the two-sample problem. psy also provides a range of statistics based on Cohen's kappa, including weighted measures and agreement among more than 2 raters. cwhmisc contains a number of useful support functions, such as ellipse(), normalise() and various rotation functions. desirability provides functions for multivariate optimisation. geozoo provides plotting of geometric objects in GGobi.

CRAN packages:

Related links:
