Cross Validation is a basic method for estimating the predictive performance of a supervised learning model on unseen data, i.e. for assessing a model's ability to generalize.
Why we need Cross Validation
The primary goal of Cross Validation is to avoid over-fitting. Empirical risk decreases as the complexity of the learning model rises, but the generalization power decreases too; that is, the true performance on unseen data may get worse. So empirical risk is not an effective way to evaluate a practical learning model. Instead, we need to estimate the actual risk on a large independent data set that has no overlap with the training set. Sometimes such test data is not available at hand, or is costly to collect. Cross Validation then provides a way to test the model during the training phase.
How to perform Cross Validation
The basic idea of Cross Validation is to split a sample of data into complementary subsets, perform the analysis on one subset (called the training set), and validate the analysis on the other subset (called the validation set). Different Cross Validation methods are defined by how the data is partitioned.
- Leave-p-out cross validation
- k-fold cross validation
In k-fold cross validation, the original sample is randomly partitioned into k equal-sized sub-samples. Of the k sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining k-1 sub-samples are used as training data. The cross validation process is then repeated k times, with each of the k sub-samples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate.
In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
- Repeated random sub-sampling validation
This method randomly splits the data set into training and validation sets. Unlike k-fold cross validation, the splits are drawn independently, so some observations may never be selected for validation while others may be selected more than once.
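The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the `fit` and `score` callables are placeholders (not from the original text) standing in for any learner and any evaluation metric:

```python
import random

def k_fold_split(n_samples, k, seed=0):
    """Randomly partition indices 0..n_samples-1 into k (near-)equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, fit, score):
    """Average the validation score over k folds.

    `fit(X_tr, y_tr)` returns a trained model;
    `score(model, X_va, y_va)` returns a number (e.g. an error rate).
    """
    folds = k_fold_split(len(X), k)
    scores = []
    for i in range(k):
        va = set(folds[i])
        tr = [j for j in range(len(X)) if j not in va]
        # each fold serves as validation data exactly once
        model = fit([X[j] for j in tr], [y[j] for j in tr])
        scores.append(score(model,
                            [X[j] for j in folds[i]],
                            [y[j] for j in folds[i]]))
    return sum(scores) / k
```

In practice a library routine (e.g. scikit-learn's `cross_val_score`) would replace this, but the sketch makes the partition-train-validate-average cycle explicit.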
Practical application of Cross Validation
Cross Validation is commonly used in model selection, parameter tuning, and feature selection.
- Model Selection
- Parameter Tuning
Usually there are one or more hyper-parameters left to set manually, such as K for KNN or the hyper-parameters of an SVM. We can tune these parameters by grid search: Cross Validation is performed for each parameter setting, and the setting with the best predictive performance is used to build the final model.
There are two common methods for hyper-parameter tuning:
a) nested cross validation
The inner cross validation performs hyper-parameter tuning via grid search, while the outer cross validation estimates predictive performance. Each inner cross validation procedure yields an optimal parameter value; if the model is stable, the optimal parameters from the different inner runs will be close to each other.
b) prior hyper-parameter setting by experience
Hyper-parameters are fixed based on experience, and we then perform cross validation to get an unbiased estimate of the performance of a possibly sub-optimal model.
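The nested scheme in (a) can be sketched as follows. This is a toy illustration under stated assumptions: `train(data, param)` and `error(model, data)` are hypothetical placeholders for any learner and loss, and the "grid" is just a list of candidate values:

```python
import random

def folds(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k (near-)equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(data, k, train, error, param):
    """Mean validation error of a single parameter setting over k folds."""
    errs = []
    for f in folds(len(data), k):
        tr = [d for j, d in enumerate(data) if j not in set(f)]
        va = [data[j] for j in f]
        errs.append(error(train(tr, param), va))
    return sum(errs) / k

def nested_cv(data, grid, k_outer, k_inner, train, error):
    """Outer CV estimates performance; inner CV (grid search) picks the parameter."""
    outer_errs, chosen = [], []
    for f in folds(len(data), k_outer):
        tr = [d for j, d in enumerate(data) if j not in set(f)]
        va = [data[j] for j in f]
        # inner loop: pick the grid value with the lowest inner CV error
        best = min(grid, key=lambda p: cv_error(tr, k_inner, train, error, p))
        chosen.append(best)
        outer_errs.append(error(train(tr, best), va))
    return sum(outer_errs) / k_outer, chosen
```

The list of chosen parameters (`chosen`) is worth inspecting: as noted above, if the model is stable the inner runs should agree on roughly the same value.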
For more details on parameter tuning, please refer to [4].
- Feature Selection
Split the data into k equal-sized folds
for i = 1:k
    select the i-th fold as the validation set, and the rest as the training set
    find the top N informative features (e.g. by correlation with the response label) on the training set
    train the model with the selected features on the training set
    evaluate the result on the validation set
end
calculate the average error rate over the folds, and treat this mean error rate as the estimate of the model's performance
Note: the Cross Validation iteration must be the outermost loop; any supervised feature selection (using correlation with class labels) performed outside of cross validation may result in over-fitting.
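The loop above can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: it ranks features by absolute Pearson correlation with the response, computed inside each fold on the training data only, and `fit`/`score` are hypothetical placeholders for any learner and metric:

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def top_n_features(X, y, n):
    """Indices of the n features most correlated (in absolute value) with y."""
    d = len(X[0])
    corr = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    return sorted(range(d), key=lambda j: -corr[j])[:n]

def cv_with_feature_selection(X, y, k, n_feats, fit, score, seed=0):
    """Cross validation with feature selection done INSIDE each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for f in folds:
        va = set(f)
        tr = [j for j in range(len(X)) if j not in va]
        # selection uses the training fold only -- never the validation data
        feats = top_n_features([X[j] for j in tr], [y[j] for j in tr], n_feats)
        proj = lambda rows: [[r[j] for j in feats] for r in rows]
        model = fit(proj([X[j] for j in tr]), [y[j] for j in tr])
        scores.append(score(model, proj([X[j] for j in f]), [y[j] for j in f]))
    return sum(scores) / k
```

Selecting features once on the full data set and then cross-validating would leak information from the validation folds into the selection step, which is exactly the over-fitting pitfall the note warns about.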
Over-fitting is a potential problem whenever you minimize any statistic based on a finite sample of data, and cross validation is no exception [5].
It is best to view cross validation as a procedure for assessing the performance of a model-fitting procedure, rather than of the model itself. To build the final model, you can apply the same procedure used in each fold of the cross validation to the entire data set.
For more details on how to perform feature selection using cross validation, please refer to [3].
Limitations
The problem with using Cross Validation is that the training and test sets are not independent samples (they share data), which means the estimates of the variance of the performance measure and of the hyper-parameters are likely to be biased (i.e. smaller than they would be for genuinely independent samples of data in each fold).
Bootstrapping can be used as an alternative to repeated Cross Validation.
References
[1] http://en.wikipedia.org/wiki/Cross-validation_(statistics)
[3] http://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation
[4] http://stats.stackexchange.com/questions/34652/grid-search-on-k-fold-cross-validation