crossval - Loss estimate using cross-validation
Syntax
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
Description
vals = crossval(fun,X) performs 10-fold
cross-validation for the function fun, applied to the data
in X.
fun is a function
handle to a function with two inputs, the training subset of
X, XTRAIN, and the test subset of X,
XTEST, as follows:
testval = fun(XTRAIN,XTEST)
Each time it is called, fun should use XTRAIN
to fit a model, then return some criterion testval
computed on XTEST using that fitted model.
X can be a column vector or a matrix. Rows of
X correspond to observations; columns correspond to
variables or features. Each row of vals contains the
result of applying fun to one test set. If
testval is nonscalar, crossval converts it to a
row vector using linear indexing and stores it in one row of
vals.
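As a minimal sketch of this calling form (the criterion here is purely illustrative — fun simply returns the mean of each test fold and ignores the training subset):

```matlab
% Hypothetical criterion: the mean of each test fold (XTRAIN unused)
fun = @(XTRAIN,XTEST) mean(XTEST);
X = randn(100,1);          % 100 observations of one variable
vals = crossval(fun,X);    % 10-by-1 vector: one fold mean per row
```

Because the criterion is scalar, vals has one row per fold; a nonscalar criterion would be linearized into each row instead.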
vals = crossval(fun,X,Y,...) is used when data are
stored in separate variables X, Y, ... . All
variables (column vectors, matrices, or arrays) must have the same
number of rows. fun is called with the training subsets of
X, Y, ... , followed by the test subsets of
X, Y, ... , as follows:
testvals = fun(XTRAIN,YTRAIN,...,XTEST,YTEST,...)
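A sketch of the multi-variable form, with an illustrative criterion (the correlation between the test-fold columns of X and Y; the training subsets are accepted but unused):

```matlab
% Hypothetical criterion over paired variables X and Y
fun = @(XTRAIN,YTRAIN,XTEST,YTEST) corr(XTEST,YTEST);
X = randn(100,1);
Y = 2*X + randn(100,1);
vals = crossval(fun,X,Y);  % one correlation value per fold
```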
mse = crossval('mse',X,y,'Predfun',predfun) returns
mse, a scalar containing a 10-fold cross-validation
estimate of mean-squared error for the function predfun.
X can be a column vector, matrix, or array of predictors.
y is a column vector of response values. X and
y must have the same number of rows.
predfun is a function
handle called with the training subset of X, the
training subset of y, and the test subset of X as
follows:
yfit = predfun(XTRAIN,ytrain,XTEST)
Each time it is called, predfun should use
XTRAIN and ytrain to fit a regression model and
then return fitted values in a column vector yfit. Each
row of yfit contains the predicted values for the
corresponding row of XTEST. crossval computes the
squared errors between yfit and the corresponding response
test set, and returns the overall mean across all test sets.
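For instance, a deliberately simple baseline predictor (the model choice is illustrative — it predicts the training-set mean of y for every test row):

```matlab
% Baseline regression model: constant prediction at the training mean
predfun = @(XTRAIN,ytrain,XTEST) repmat(mean(ytrain),size(XTEST,1),1);
X = randn(100,2);
y = X*[1;2] + randn(100,1);
mse = crossval('mse',X,y,'Predfun',predfun)
```

A predictor that actually uses XTRAIN, such as the regress-based handle in Example 1, normally yields a much smaller estimate.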
mcr = crossval('mcr',X,y,'Predfun',predfun) returns
mcr, a scalar containing a 10-fold cross-validation
estimate of misclassification rate (the proportion of misclassified
samples) for the function predfun. The matrix X
contains predictor values and the vector y contains class
labels. predfun should use XTRAIN and
YTRAIN to fit a classification model and return
yfit as the predicted class labels for XTEST.
crossval computes the number of misclassifications between
yfit and the corresponding response test set, and returns
the overall misclassification rate across all test sets.
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun),
where criterion is 'mse' or
'mcr', returns a cross-validation estimate of mean-squared
error (for a regression model) or misclassification rate (for a
classification model) with predictor values in X1,
X2, ... and, respectively, response values or class labels
in y. X1, X2, ... and y must
have the same number of rows. predfun is a function
handle called with the training subsets of X1,
X2, ..., the training subset of y, and the test
subsets of X1, X2, ..., as follows:
yfit=predfun(X1TRAIN,X2TRAIN,...,ytrain,X1TEST,X2TEST,...)
yfit should be a column vector containing the fitted
values.
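A sketch with two predictor matrices (names are illustrative; predfun concatenates the predictors, adds an intercept column, and fits a linear model with regress):

```matlab
X1 = randn(100,2);
X2 = randn(100,1);
y  = X1*[1;2] + 3*X2 + randn(100,1);
predfun = @(X1T,X2T,ytr,X1E,X2E) ...
    [ones(size(X1E,1),1),X1E,X2E] * ...
    regress(ytr,[ones(size(X1T,1),1),X1T,X2T]);
val = crossval('mse',X1,X2,y,'Predfun',predfun)
```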
vals =
crossval(...,'name',value)
specifies one or more optional parameter name/value pairs from the
following table. Specify name inside single
quotes.
Name
Value
holdout
A scalar p specifying the ratio or the number of observations
to hold out for holdout cross-validation. When 0
< p < 1, crossval randomly selects
approximately p*n observations for the test set. When
p is an integer, crossval randomly selects
p observations for the test set.
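For example, holding out roughly 20% of the rows as a single random test set (data and predictor are illustrative):

```matlab
X = randn(50,1);
y = 2*X + randn(50,1);
predfun = @(XT,yt,XE) XE*regress(yt,XT);
mseHoldout = crossval('mse',X,y,'Predfun',predfun,'holdout',0.2)
```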
kfold
A scalar specifying the number of folds k for
k-fold cross-validation.
leaveout
Specifies leave-one-out cross-validation. The value must be
1.
mcreps
A positive integer specifying the number of Monte-Carlo
repetitions for validation. If the first input of crossval
is 'mse' or 'mcr', crossval returns the
mean of mean-squared error or misclassification rate across all of
the Monte-Carlo repetitions. Otherwise, crossval
concatenates the values vals from all of the Monte-Carlo
repetitions along the first dimension.
partition
An object c of the cvpartition class, specifying the cross-validation type
and partition.
stratify
A column vector group specifying groups for
stratification. Both training and test sets have roughly the same
class proportions as in group. NaNs or empty
strings in group are treated as missing values, and the
corresponding rows of the data are ignored.
options
A structure that specifies whether to run in parallel, and
specifies the random stream or streams. Create the options
structure with statset.
Option fields:
UseParallel — Set to 'always' to compute in
parallel. Default is 'never'.
UseSubstreams — Set to 'always' to compute in
parallel in a reproducible fashion. Default is 'never'. To
compute reproducibly, set Streams to a type allowing
substreams: 'mlfg6331_64' or 'mrg32k3a'.
Streams — A RandStream
object or cell array consisting of one such object. If you do not
specify Streams, crossval uses the default
stream.
For more information on using parallel computing, see Parallel
Statistics.
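A sketch of the options parameter (running the folds in parallel assumes a Parallel Computing Toolbox worker pool is available; the criterion function is illustrative):

```matlab
% Run cross-validation in parallel via a statset options structure
opts = statset('UseParallel','always');
fun  = @(XTRAIN,XTEST) mean(XTEST);
X    = randn(100,1);
vals = crossval(fun,X,'options',opts);
```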
Only one of kfold, holdout, leaveout,
or partition can be specified, and partition
cannot be specified with stratify. If both
partition and mcreps are specified, the first
Monte-Carlo repetition uses the partition information in the
cvpartition object, and the repartition method is called to generate new
partitions for each of the remaining repetitions. If no
cross-validation type is specified, the default is 10-fold
cross-validation.
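Combining these parameters, a sketch of five Monte-Carlo repetitions of stratified 10-fold cross-validation on the Fisher iris data (predfun wraps classify, as in Example 2 below):

```matlab
load('fisheriris');
predfun = @(XT,yt,XE) classify(XE,XT,yt);
mcr = crossval('mcr',meas,species,'Predfun',predfun,...
    'stratify',species,'mcreps',5)
```

Because the first input is 'mcr', the result is the mean misclassification rate across the five repetitions.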
Note When using
cross-validation with classification algorithms, stratification is
preferred. Otherwise, some test sets may not include observations
from all classes.
Examples
Example 1
Compute mean-squared error for regression using 10-fold
cross-validation:
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf=@(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'predfun',regf)
cvMse =
0.1015
Example 2
Compute misclassification rate using stratified 10-fold
cross-validation:
load('fisheriris');
y = species;
X = meas;
cp = cvpartition(y,'k',10); % Stratified cross-validation
classf = @(XTRAIN, ytrain,XTEST)(classify(XTEST,XTRAIN,...
ytrain));
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
Example 3
Compute the confusion matrix using stratified 10-fold
cross-validation:
load('fisheriris');
y = species;
X = meas;
order = unique(y); % Order of the group labels
cp = cvpartition(y,'k',10); % Stratified cross-validation
f = @(xtr,ytr,xte,yte)confusionmat(yte,...
classify(xte,xtr,ytr),'order',order);
cfMat = crossval(f,X,y,'partition',cp);
cfMat = reshape(sum(cfMat),3,3)
cfMat =
50 0 0
0 48 2
0 1 49
cfMat is the summation of 10 confusion matrices from 10
test sets.
See Also