PLS (Partial Least Squares)

http://www.vcclab.org/lab/pls/

http://www.utdallas.edu/~herve/

http://www.vsni.co.uk/software/genstat/htmlhelp/server/PLS.htm

PLS procedure

Fits a partial least squares regression model (Ian Wakeling & Nick Bratchell).


Options

PRINT  = string tokens
Printed output required (data, xloadings, yloadings, ploadings, scores, leverages, xerrors, yerrors, scree, xpercent, ypercent, predictions, groups, estimates, fittedvalues); default esti, xper, yper, scor, xloa, yloa, ploa

NROOTS  = scalar
Number of PLS dimensions to be extracted

YSCALING  = string token
Whether to scale the Y variates to unit variance (yes, no); default no

XSCALING  = string token
Whether to scale the X variates to unit variance (yes, no); default no

NGROUPS  = scalar
Number of cross-validation groups into which to divide the data; default 1 (i.e. no cross-validation performed)

SEED  = scalar or factor
A scalar indicating the seed value to use when dividing the data randomly into  NGROUPS  groups for the cross-validation or a factor to indicate a specific set of groupings to use for the cross-validation; default takes the (scalar) value of  NGROUPS

LABELS  = text
Sample labels for  X  and  Y  that are to be used in the printed output; defaults to the integers 1...n where n is the length of the variates in  X  and  Y

PLABELS  = text
Sample labels for  XPREDICTIONS  that are to be used in the printed output; default uses the integers 1, 2 ...


Parameters

Y  = pointers
Pointer to variates containing the dependent variables

X  = pointers
Pointer to variates containing the independent variables

YLOADINGS  = pointers
Pointer to variates used to store the  Y component loadings for each dimension extracted

XLOADINGS  = pointers
Pointer to variates used to store the  X component loadings for each dimension extracted

PLOADINGS  = pointers
Pointer to variates used to store the loadings for the bilinear model for the  X block

YSCORES  = pointers
Pointer to variates used to store the  Y component scores for each dimension extracted

XSCORES  = pointers
Pointer to variates used to store the  X component scores for each dimension extracted

B  = matrices
A diagonal matrix containing the regression coefficients of  YSCORES  on  XSCORES  for each dimension

YPREDICTIONS  = pointers
A pointer to variates used to store predicted  Y values for samples in the prediction set

XPREDICTIONS  = pointers
A pointer to variates containing data for the independent variables in the prediction set

ESTIMATES  = matrices
An  nX+1 by  nY matrix (where  nX and  nY are the numbers of variates contained in  X  and  Y  respectively) used to store the PLS regression coefficients for a PLS model with  NROOTS  dimensions

FITTEDVALUES  = pointers
Pointer to variates used to store the fitted values for each  Y  variate

LEVERAGES  = variates
Variate used to store the leverage that each sample has on the PLS model

PRESS  = variates
Variate used to contain the Predictive Residual Error Sum of Squares for each dimension in the PLS model, available only if cross-validation has been selected

RSS  = variates
Variate used to store the Residual Sum of Squares for each dimension extracted

YRESIDUALS  = pointers
Pointer to variates used to store the residuals from the  Y block after  NROOTS  dimensions have been extracted, uncorrected for any scaling applied using  YSCALING

XRESIDUALS  = pointers
Pointer to variates used to store the residuals from the  X block after  NROOTS  dimensions have been extracted, uncorrected for any scaling applied using  XSCALING

XPRESIDUALS  = pointers
Pointer to variates used to store the residuals from the  XPREDICTIONS  block after  NROOTS  dimensions have been extracted


Description

The regression method of Partial Least Squares (PLS) was initially developed as a calibration method for use with chemical data. It was designed principally for use with overdetermined data sets and to be more efficient computationally than competing methods such as principal components regression. If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model of the form T = XW, X = TP′ + E and Y = TQ′ + F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors, contained in the columns of T, are selected both to minimise the residuals in E and to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.
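
   As an illustration of this bilinear structure (and not part of the Genstat procedure), the following Python sketch fits a two-dimension PLS model with scikit-learn's PLSRegression, whose fitted attributes x_scores_, x_loadings_ and y_loadings_ correspond to T, P and Q above. The data and variable names are invented for the example, and scikit-learn's NIPALS implementation need not reproduce the procedure's output exactly.

# Illustrative only: the bilinear PLS model T = XW, X = TP' + E, Y = TQ' + F,
# demonstrated with scikit-learn rather than Genstat.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(24, 6))                      # 24 samples, 6 X variables
Y = X[:, :2] @ rng.normal(size=(2, 1)) + 0.1 * rng.normal(size=(24, 1))

pls = PLSRegression(n_components=2, scale=False).fit(X, Y)
T, P, Q = pls.x_scores_, pls.x_loadings_, pls.y_loadings_

Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)   # PLS works on centred data
E = Xc - T @ P.T                                  # X-block residuals after 2 dimensions
F = Yc - T @ Q.T                                  # Y-block residuals after 2 dimensions

print(np.round(T.T @ T, 6))                       # columns of T are mutually orthogonal
print(np.linalg.norm(E), np.linalg.norm(F))       # size of the remaining residuals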

   The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in determining the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; the samples in each group are then modelled from the remaining samples only. The sum of squares of differences between these "leave-out" predictions and the observed values of Y is called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten's (1988) test of significance and may also be plotted as a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989, page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.
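
   For reference (again outside Genstat), a leave-one-out PRESS curve can be reproduced in a few lines of Python. The sketch below uses scikit-learn's PLSRegression and LeaveOneOut, so the absolute values need not match the procedure's cross-validation exactly; the shape of the PRESS curve against the number of dimensions is the quantity of interest. The function name is illustrative.

# Illustrative leave-one-out PRESS curve (one cross-validation group per sample).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def press_curve(X, Y, max_roots):
    press = []
    for k in range(1, max_roots + 1):
        sq_err = 0.0
        for train, test in LeaveOneOut().split(X):
            model = PLSRegression(n_components=k).fit(X[train], Y[train])
            sq_err += np.sum((Y[test] - model.predict(X[test])) ** 2)
        press.append(sq_err)                      # PRESS for a k-dimension model
    return press

Plotting the returned values against the number of dimensions gives the kind of scree diagram requested with PRINT=scree.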

   The procedure will fail if there are missing values present in either the X or Y variates.

   To use a PLS model to make predictions from new observations on the X variables, two methods are available. Either the user may do this manually, using the model as specified in the ESTIMATES matrix, or the new X data may be supplied beforehand as the pointer to variates XPREDICTIONS and the corresponding predictions obtained as YPREDICTIONS.
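
   For the manual route the arithmetic is a single matrix product. The Python sketch below assumes (this layout is an assumption, not stated above) that the first row of the (nX+1) by nY ESTIMATES matrix holds the constant term and the remaining nX rows hold the coefficients for the X variables on their original scale.

# Manual prediction from a saved ESTIMATES matrix; the row layout (constant
# first, then one row per X variable) is an illustrative assumption.
import numpy as np

def predict_from_estimates(estimates, x_new):
    estimates = np.asarray(estimates)             # shape (nX + 1, nY)
    x_new = np.atleast_2d(x_new)                  # shape (n_samples, nX)
    constant, coefficients = estimates[0], estimates[1:]
    return constant + x_new @ coefficients        # shape (n_samples, nY)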

   Output from the PLS procedure can be selected using the following settings of the PRINT option.

     data
the unscaled data values (with labels).

     xloadings
X-component loadings (columns of the matrix  W - see above).

     yloadings
variable loadings for the bilinear model of the matrix of dependent variables. Note that these are standardized to unit length and are not the same as the columns of the matrix Q above. To obtain Q, form the matrix C, whose columns are the standardized loadings, and post-multiply it by the diagonal matrix supplied as the output parameter B (see the sketch after this list).

     ploadings
variable loadings for the bilinear model of the matrix of independent variables (columns of the matrix P - see above).

     scores
X and  Y component scores. The  X component scores are the columns of the matrix  T and are mutually orthogonal. The  Y component scores, usually given the symbol  u, are not in fact needed in the calculation of the PLS model unless an iterative algorithm is used (see method section). They are provided here for completeness, as sometimes it is useful to plot the  Y component scores against the  X component scores to give a visual indication of the degree of fit for each PLS dimension.

     leverages
measure of leverage.

     xerrors
residual sum of squares and residual standard deviations for all the independent variables. When NGROUPS > 1 additional statistics are calculated from the cross-validated residuals, derived when each object is left out. The PRESS value is equal to the sum of squares of the cross-validated standard deviations for each X variable multiplied by N-1, where N is the total number of observations. The cross-validated standard deviations may therefore be used to measure the predictive ability of the model for each of the variables.

     yerrors
residual sum of squares and residual standard deviations for all the dependent variables (see xerrors above).

     scree
scree diagram of PRESS.

     xpercent
percentage variance explained for the  X variables.

     ypercent
percentage variance explained for the  Y variables.

     predictions
predicted values for any observations that were not included in the PLS model but were supplied using the  XPREDICTIONS  parameter.

     groups
details of groupings used for cross-validation.

     estimates
estimated PLS regression coefficients.

     fittedvalues
fitted values from the PLS regressions.

The default settings are estimates, xpercent, ypercent, scores, xloadings, yloadings and ploadings.
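
   The relationship between the printed Y loadings and the matrix Q noted under yloadings above is a single matrix product; a minimal Python sketch (names are illustrative):

# C holds the standardized (unit-length) Y loadings, one column per dimension;
# b holds the NROOTS diagonal entries of the matrix B.  Then Q = C B.
import numpy as np

def y_loadings_to_Q(C, b):
    return np.asarray(C) @ np.diag(np.asarray(b))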

   The data for PLS are supplied using the X and Y parameters, as pointers to variates containing the columns of the X and Y matrices. Other parameters allow output to be saved in appropriate data structures.


Options: PRINT, NROOTS, YSCALING, XSCALING, NGROUPS, SEED, LABELS, PLABELS.

Parameters: Y, X, YLOADINGS, XLOADINGS, PLOADINGS, YSCORES, XSCORES, B, YPREDICTIONS, XPREDICTIONS, ESTIMATES, FITTEDVALUES, LEVERAGES, PRESS, RSS, YRESIDUALS, XRESIDUALS, XPRESIDUALS.


Method

Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X-block loading vector for the first PLS dimension (w1) is simply the eigenvector of X′YY′X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t = Xw) and the eigenanalysis is repeated. This approach was adopted by Rogers (1987) in a Genstat 4 macro. Here we adopt a very similar approach, performing a singular value decomposition of the matrix X′Y, which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988; de Jong & ter Braak 1994).
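
   As a concrete (non-Genstat) illustration of that computation, the Python sketch below extracts successive dimensions from the singular value decomposition of X′Y, deflating both blocks after each dimension; the variable names are ours, and scaling, fitted values and the other quantities returned by the procedure are omitted.

# Illustrative SVD-based PLS fit: for each dimension the X-block weight vector w
# is the leading left singular vector of X'Y, the factor scores are t = Xw, and
# both blocks are deflated by their projections onto t, which keeps successive
# columns of T mutually orthogonal.
import numpy as np

def pls_fit(X, Y, nroots):
    X = X - X.mean(axis=0)                        # centre both blocks
    Y = Y - Y.mean(axis=0)
    W, P, Q, T = [], [], [], []
    for _ in range(nroots):
        u, s, vt = np.linalg.svd(X.T @ Y, full_matrices=False)
        w = u[:, 0]                               # X-block loading (weight) vector
        t = X @ w                                 # PLS factor scores
        p = X.T @ t / (t @ t)                     # bilinear loadings for the X block
        q = Y.T @ t / (t @ t)                     # loadings for the Y block
        X = X - np.outer(t, p)                    # deflate X
        Y = Y - np.outer(t, q)                    # deflate Y
        W.append(w); P.append(p); Q.append(q); T.append(t)
    return (np.column_stack(T), np.column_stack(W),
            np.column_stack(P), np.column_stack(Q))

In the procedure's terms, the columns of T, W and P returned here correspond to XSCORES, XLOADINGS and PLOADINGS respectively.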

   It is usual to centre all variables prior to a PLS analysis; the procedure does this automatically even if the XSCALING/YSCALING options are not set. On exit from the procedure the variates pointed to by X and Y are unchanged.


Action with RESTRICT

The procedure will work with restricted variates, fitting a PLS model to the subset of objects indicated by the restriction. If there are different restrictions on different data variates then these restrictions are combined and the analysis is performed on the subset of samples common to all the restrictions. Note that the unrestricted length of all of the data variates must be the same and the number of samples in the common subset must be at least three. Any restrictions on a text supplied for the LABELS option or a factor for the SEED option are ignored. On exit from the procedure all the data variates, and the SEED factor and LABELS text if supplied, are returned restricted to the common subset of samples. Output data structures that correspond to the samples (i.e. XSCORES, YSCORES, FITTEDVALUES, LEVERAGES, YRESIDUALS and XRESIDUALS) are also returned restricted to the common subset, with missing values used for those values that have been restricted out.

   When restricted data are supplied and LABELS are also given, the appropriate subset of labels will appear in the output; if LABELS are not defined then default labels reflecting the position of the restricted data in the unrestricted variate will be used instead.

   No restrictions are allowed on the variates supplied in the XPREDICTIONS parameter or on the text supplied for the PLABELS option.


References

Helland, I.S. (1988). On the structure of partial least squares regression. Communications in Statistics - Simulation and Computation, 17, 581-607.

Hoskuldsson, A. (1988). PLS regression methods. J. Chemometrics, 2, 211-228.

de Jong & ter Braak (1994). Comments on the PLS kernel algorithm. J. Chemometrics, 8, 169-174.

Manne, R. (1987). Analysis of two partial least squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems, 2, 187-197.

Martens, H. & Naes, T. (1989). Multivariate Calibration. John Wiley, Chichester.

Osten, D.W. (1988). Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39-48.

Rogers, C.A. (1987). A Genstat macro for partial least squares analysis with cross-validation assessment of model dimensionality. Genstat Newsletter, 18, 81-92.


See also

Procedures: CCA, RDA, RIDGE.

Commands for: Multivariate and cluster analysis; Regression analysis.

Example

CAPTION 'PLS example',!t('The data are 24 calibration',\
        'samples used to determine the protein content of wheat',\
        'from spectroscopic readings at six different wavelengths',\
        '(Fearn, T., 1983, Applied Statistics, 32, 73-79).'); STYLE=meta,plain
VARIATE [NVALUES=24] L[1...6],%Protein[1]
READ    L[1...6],%Protein[1]
468 123 246 374 386 -11  9.23   458 112 236 368 383 -15  8.01
457 118 240 359 353 -16 10.95   450 115 236 352 340 -15 11.67
464 119 243 366 371 -16 10.41   499 147 273 404 433   5  9.51
463 119 242 370 377 -12  8.67   462 115 238 370 353 -13  7.75
488 134 258 393 377  -5  8.05   483 141 264 384 398  -2 11.39
463 120 243 367 378 -13  9.95   456 111 233 365 365 -15  8.25
512 161 288 415 443  12 10.57   518 167 293 421 450  19 10.23
552 197 324 448 467  32 11.87   497 146 271 407 451  11  8.09
592 229 360 484 524  51 12.55   501 150 274 406 407  11  8.38
483 137 260 385 374  -3  9.64   491 147 269 389 391   1 11.35
463 121 242 366 353 -13  9.70   507 159 285 410 445  13 10.75
474 132 255 376 383  -7 10.75   496 152 276 396 404   6 11.47  :
" Fit a 3 dimensional PLS model to the standardized data using
  leave-one-out cross-validation. All three dimensions are
  significant using Osten's test"
PLS     [PRINT=estimates,xpercent,ypercent,xloadings,yloadings,ploadings; \
        NROOTS=3; NGROUPS=24; XSCALING=yes; YSCALING=yes] Y=%Protein; X=L
