Machine Learning in R: Improving Models with Cross-Validation

What is cross-validation?

Cross-validation means setting aside part of the sample set so that it is not used to train the model, and using it instead to test the model's predictions.

Methods of cross-validation

  1. 50% test set, 50% training set
    Drawback: training on only half the data may discard useful information, i.e., high bias.
  2. Leave-one-out cross-validation (LOOCV)
    2.1 Uses all data points for training, so it has low bias.
    2.2 Repeats the train/test cycle n times, once per data point, so execution time is high.
    2.3 Tends to produce high variance in the test estimates: if the single held-out point happens to be an outlier, that round's error estimate is thrown off badly.
  3. k-fold cross-validation
    k-fold cross-validation addresses both problems above (see the sketch after this list):
    3.1 Most of the data is used for training in every round.
    3.2 The test set keeps a reasonable, fixed proportion of the data.
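
To make the three schemes concrete, here is a minimal base-R sketch of each split on the built-in iris data; the variable names (train_half, loo_folds, kfolds, etc.) are my own, purely for illustration:

set.seed(42)
n <- nrow(iris)
# 1. 50/50 split: train on one random half, test on the other
half <- sample(n, n / 2)
train_half <- iris[half, ]
test_half  <- iris[-half, ]
# 2. leave-one-out: n rounds, each round holds out exactly one row
loo_folds <- lapply(1:n, function(i)
  list(train = iris[-i, ], test = iris[i, , drop = FALSE]))
# 3. k-fold (k = 5): every row serves as test data exactly once
k <- 5
fold_id <- sample(rep(1:k, length.out = n))
kfolds <- lapply(1:k, function(i)
  list(train = iris[fold_id != i, ], test = iris[fold_id == i, ]))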

The general procedure:

  1. Randomly split your entire dataset into k "folds".

  2. For each of the k folds, build your model on the other k - 1 folds of the dataset.

  3. Then, test the model on the k-th fold to check its effectiveness.

  4. Record the error you see on each of the predictions.

  5. Repeat this until each of the k folds has served as the test set.

  6. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.

But how do we choose k?
A small k tends to give high bias, while a large k tends to give high variance; in practice k = 10 is a common choice. The sketch below shows how the amount of training data per fold changes with k.
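
One way to see the bias side of this trade-off is to look at how much data each model trains on as k grows. A minimal sketch, assuming n = 150 rows as in iris:

n <- 150  # e.g. nrow(iris)
for (k in c(2, 5, 10, n)) {
  # with k folds, each model trains on n * (k - 1) / k rows
  cat("k =", k, "-> each model trains on", round(n * (k - 1) / k), "rows\n")
}

With k = 2 each model sees only half the data (the high-bias case above), while k = n is exactly leave-one-out.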

How do we measure the model's bias and variance?

k-fold cross-validation produces k error estimates from k different models; ideally they would sum to zero. Taking the average of these errors gives an estimate of the model's bias.

Similarly, to estimate the model's variance, we take the standard deviation of all the errors.

We need to strike a balance between bias and variance: the goal is to reduce variance while keeping bias under control. A small sketch of both estimates follows.
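
A minimal sketch of these two summary statistics; the fold_errors values below are made up purely for illustration:

fold_errors <- c(0.21, 0.18, 0.25, 0.20, 0.22)  # hypothetical errors from 5-fold CV
bias_estimate <- mean(fold_errors)      # average of the fold errors -> bias
variance_estimate <- sd(fold_errors)    # standard deviation of the fold errors -> variance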

Python code

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# 'train' and 'target' are assumed to be your feature matrix and label array
model = RandomForestClassifier(n_estimators=100)
# Simple K-Fold cross validation. 10 folds.
# (the old sklearn.cross_validation module has been removed; use model_selection)
cv = KFold(n_splits=10)
results = []
# "Error_function" can be replaced by the error function of your analysis
for traincv, testcv in cv.split(train):
    probas = model.fit(train[traincv], target[traincv]).predict_proba(train[testcv])
    results.append(Error_function)
print("Results: " + str(np.array(results).mean()))

R code

setwd('C:/Users/manish/desktop/RData')
library(plyr)
library(dplyr)
library(randomForest)
data <- iris
glimpse(data)

# cross validation, using rf to predict Sepal.Length
k <- 5
# assign each row to one of k folds at random
data$id <- sample(1:k, nrow(data), replace = TRUE)

# prediction and test set data frames that we add to with each iteration over
# the folds
prediction <- data.frame()
testsetCopy <- data.frame()

# creating a progress bar to know the status of CV
progress.bar <- create_progress_bar("text")
progress.bar$init(k)

# k-fold loop
for (i in 1:k) {
  # remove rows with id i to create the training set;
  # select rows with id i to create the test set
  # (drop the helper id column so it is not used as a predictor)
  trainingset <- subset(data, id != i, select = -id)
  testset <- subset(data, id == i, select = -id)

  # run a random forest model
  # (use Sepal.Length ~ . so the response does not also appear among the predictors)
  mymodel <- randomForest(Sepal.Length ~ ., data = trainingset, ntree = 100)

  # predict on the test set, with the response column 1 (Sepal.Length) removed
  temp <- as.data.frame(predict(mymodel, testset[, -1]))

  # append this iteration's predictions to the end of the prediction data frame
  prediction <- rbind(prediction, temp)

  # append this iteration's test set to the test set copy data frame;
  # keep only the Sepal.Length column
  testsetCopy <- rbind(testsetCopy, as.data.frame(testset[, 1]))

  progress.bar$step()
}

# put predictions and actual Sepal.Length values side by side
result <- cbind(prediction, testsetCopy[, 1])
names(result) <- c("Predicted", "Actual")
result$Difference <- abs(result$Actual - result$Predicted)

# as an example, use Mean Absolute Error as the evaluation metric
summary(result$Difference)
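
To reduce this to a single number, the mean of the absolute differences is exactly the Mean Absolute Error; one follow-up line on the result frame built above:

mean(result$Difference)  # mean absolute error across all k folds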