Machine Learning in R: Improving Models with Cross-Validation

What is cross-validation?

Cross-validation means setting aside part of the sample set so that it is not used to train the model, and using it instead to test the model's predictions.

Methods of cross-validation

  1. 50% test set, 50% training set
    Drawback: training on only half the data may discard useful information, i.e., high bias.
  2. Leave-one-out cross-validation (LOOCV)
    2.1 Uses all data points for training, so it has low bias.
    2.2 Repeats the train/test cycle n times, once per data point, so execution time is high.
    2.3 Tends to produce high variance in the test estimates: if the single held-out point happens to be an outlier, that round's error estimate is thrown off badly.
  3. k-fold cross-validation
    k-fold cross-validation addresses both problems above (see the sketch after this list):
    3.1 Most of the data is used for training in every round.
    3.2 The test set keeps a reasonable, fixed proportion of the data.
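
To make the three schemes concrete, here is a minimal base-R sketch of each split on the built-in iris data; the variable names (train_half, loo_folds, kfolds, etc.) are my own, purely for illustration:

set.seed(42)
n <- nrow(iris)
# 1. 50/50 split: train on one random half, test on the other
half <- sample(n, n / 2)
train_half <- iris[half, ]
test_half  <- iris[-half, ]
# 2. leave-one-out: n rounds, each round holds out exactly one row
loo_folds <- lapply(1:n, function(i)
  list(train = iris[-i, ], test = iris[i, , drop = FALSE]))
# 3. k-fold (k = 5): every row serves as test data exactly once
k <- 5
fold_id <- sample(rep(1:k, length.out = n))
kfolds <- lapply(1:k, function(i)
  list(train = iris[fold_id != i, ], test = iris[fold_id == i, ]))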

The general procedure:

  1. Randomly split your entire dataset into k "folds".

  2. For each of the k folds, build your model on the other k - 1 folds of the dataset.

  3. Then, test the model on the k-th fold to check its effectiveness.

  4. Record the error you see on each of the predictions.

  5. Repeat this until each of the k folds has served as the test set.

  6. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.

But how do we choose k?
A small k tends to give high bias, while a large k tends to give high variance; in practice k = 10 is a common choice. The sketch below shows how the amount of training data per fold changes with k.
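
One way to see the bias side of this trade-off is to look at how much data each model trains on as k grows. A minimal sketch, assuming n = 150 rows as in iris:

n <- 150  # e.g. nrow(iris)
for (k in c(2, 5, 10, n)) {
  # with k folds, each model trains on n * (k - 1) / k rows
  cat("k =", k, "-> each model trains on", round(n * (k - 1) / k), "rows\n")
}

With k = 2 each model sees only half the data (the high-bias case above), while k = n is exactly leave-one-out.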

How do we measure the model's bias and variance?

k-fold cross-validation produces k error estimates from k different models; ideally they would sum to zero. Taking the average of these errors gives an estimate of the model's bias.

Similarly, to estimate the model's variance, we take the standard deviation of all the errors.

We need to strike a balance between bias and variance: the goal is to reduce variance while keeping bias under control. A small sketch of both estimates follows.
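
A minimal sketch of these two summary statistics; the fold_errors values below are made up purely for illustration:

fold_errors <- c(0.21, 0.18, 0.25, 0.20, 0.22)  # hypothetical errors from 5-fold CV
bias_estimate <- mean(fold_errors)      # average of the fold errors -> bias
variance_estimate <- sd(fold_errors)    # standard deviation of the fold errors -> variance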

Python code

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# 'train' and 'target' are assumed to be your feature matrix and label array
model = RandomForestClassifier(n_estimators=100)
# Simple K-Fold cross validation. 10 folds.
# (the old sklearn.cross_validation module has been removed; use model_selection)
cv = KFold(n_splits=10)
results = []
# "Error_function" can be replaced by the error function of your analysis
for traincv, testcv in cv.split(train):
    probas = model.fit(train[traincv], target[traincv]).predict_proba(train[testcv])
    results.append(Error_function)
print("Results: " + str(np.array(results).mean()))

R code

setwd('C:/Users/manish/desktop/RData')
library(plyr)
library(dplyr)
library(randomForest)
data <- iris
glimpse(data)

# cross validation, using rf to predict Sepal.Length
k <- 5
# assign each row to one of k folds at random
data$id <- sample(1:k, nrow(data), replace = TRUE)

# prediction and test set data frames that we add to with each iteration over
# the folds
prediction <- data.frame()
testsetCopy <- data.frame()

# creating a progress bar to know the status of CV
progress.bar <- create_progress_bar("text")
progress.bar$init(k)

# k-fold loop
for (i in 1:k) {
  # remove rows with id i to create the training set;
  # select rows with id i to create the test set
  # (drop the helper id column so it is not used as a predictor)
  trainingset <- subset(data, id != i, select = -id)
  testset <- subset(data, id == i, select = -id)

  # run a random forest model
  # (use Sepal.Length ~ . so the response does not also appear among the predictors)
  mymodel <- randomForest(Sepal.Length ~ ., data = trainingset, ntree = 100)

  # predict on the test set, with the response column 1 (Sepal.Length) removed
  temp <- as.data.frame(predict(mymodel, testset[, -1]))

  # append this iteration's predictions to the end of the prediction data frame
  prediction <- rbind(prediction, temp)

  # append this iteration's test set to the test set copy data frame;
  # keep only the Sepal.Length column
  testsetCopy <- rbind(testsetCopy, as.data.frame(testset[, 1]))

  progress.bar$step()
}

# put predictions and actual Sepal.Length values side by side
result <- cbind(prediction, testsetCopy[, 1])
names(result) <- c("Predicted", "Actual")
result$Difference <- abs(result$Actual - result$Predicted)

# as an example, use Mean Absolute Error as the evaluation metric
summary(result$Difference)
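
To reduce this to a single number, the mean of the absolute differences is exactly the Mean Absolute Error; one follow-up line on the result frame built above:

mean(result$Difference)  # mean absolute error across all k folds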