Machine Learning Exercises (3): Cross-Validation

1. Choosing the Right Model: Basic Validation


from sklearn.datasets import load_iris  # iris dataset
from sklearn.model_selection import train_test_split  # train/test splitting utility
from sklearn.neighbors import KNeighborsClassifier  # k-nearest neighbors (kNN) classifier

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Build the model
knn = KNeighborsClassifier()

# Train the model
knn.fit(X_train, y_train)

# Print the test accuracy
print(knn.score(X_test, y_test))
# 0.973684210526     accuracy from the basic validation
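A single train/test split like this depends on how the data happens to be divided, which is the main motivation for cross-validation in the next section. As a small illustrative sketch (the seed values below are arbitrary assumptions, not part of the original exercise), repeating the split with different random_state values shows how much the score can move:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Repeat the split with a few different seeds and compare the scores
for seed in (0, 1, 2, 3, 4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)
    print(seed, knn.score(X_test, y_test))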


2. Choosing the Right Model: Cross-Validation

The basic idea of cross-validation is to partition the original dataset into groups: one part serves as the training set and the other as the validation set (or test set). The classifier is first trained on the training set, and the trained model is then evaluated on the validation set; this evaluation serves as the performance metric for the classifier.
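To make this concrete, here is a minimal sketch of the k-fold loop written out by hand with KFold (the shuffle and random_state settings are illustrative assumptions); cross_val_score, used below, wraps essentially this loop:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    knn = KNeighborsClassifier()
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))

print(np.mean(fold_scores))  # average accuracy over the 5 folds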


from sklearn.model_selection import cross_val_score  # k-fold cross-validation helper

# Run 5-fold cross-validation on the whole dataset
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')

# Print the accuracy from each of the 5 folds
print(scores)
# [ 0.96666667  1.          0.93333333  0.96666667  1.        ]

# Print the mean accuracy over the 5 folds
print(scores.mean())
# 0.973333333333
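The mean alone hides how much the folds disagree with each other; a small addition (not in the original exercise) is to also report the spread across folds:

# Standard deviation of the 5 fold accuracies; a large spread suggests
# the estimate of model quality is itself uncertain
print(scores.std())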

3. Accuracy and Mean Squared Error

In general, accuracy is used to judge how good a classification model is.

import matplotlib.pyplot as plt  # plotting

# Candidate values of k to evaluate
k_range = range(1, 31)

k_scores = []

# Iterate over the candidate k values and record the mean cross-validated accuracy for each
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Plot accuracy against k
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
As the plot shows, k values of roughly 12~18 give the best accuracy. Beyond 18 the accuracy starts to drop again; for kNN this is a sign that the model is becoming too smooth as k grows (underfitting).
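Reading the best k off the plot can also be done programmatically. GridSearchCV is a common alternative that runs the same cross-validation internally; the snippet below is a sketch of that approach, not part of the original exercise:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)  # the k with the highest mean cross-validated accuracy
print(grid.best_score_)   # that best mean accuracy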


In general, mean squared error (MSE) is used to judge how good a regression model is. For illustration, the loop below keeps the kNN classifier and simply switches the scoring metric; a regression-flavoured version is sketched after the plot.
import matplotlib.pyplot as plt
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    loss = -cross_val_score(knn, X, y, cv=10, scoring='neg_mean_squared_error')  # negate: sklearn returns negative MSE
    k_scores.append(loss.mean())

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated MSE')
plt.show()

As the plot shows, lower MSE is better, so k values around 13~18 are again the best choice.
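Since MSE is really a regression metric, a more natural demonstration would score an actual regressor. The sketch below uses the diabetes dataset and KNeighborsRegressor, which are assumptions for illustration and not part of the original exercise:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

diabetes = load_diabetes()
Xr, yr = diabetes.data, diabetes.target

reg = KNeighborsRegressor(n_neighbors=5)
# 'neg_mean_squared_error' returns negative values, so negate to get MSE
mse = -cross_val_score(reg, Xr, yr, cv=10, scoring='neg_mean_squared_error')
print(mse.mean())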



4. Using the Learning Curve to Check for Overfitting

from sklearn.model_selection import learning_curve  # learning-curve utility
from sklearn.datasets import load_digits  # digits dataset
from sklearn.svm import SVC  # Support Vector Classifier
import matplotlib.pyplot as plt  # plotting
import numpy as np
Load the digits dataset, which contains handwritten digits from 0 to 9. It has 1797 samples in total; each sample has 64 features, the pixels of its 8×8 handwritten image, and each feature takes values from 0 to 16.
digits = load_digits()
X = digits.data
y = digits.target
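A quick check of the shapes (optional, not in the original write-up) confirms the description above:

print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8) -- the same samples as 8x8 images
print(digits.data.min(), digits.data.max())  # 0.0 16.0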
Observe how the learning curve changes as the training-set size grows: use 10-fold cross-validation (cv=10), evaluate model performance with mean squared error (scoring='neg_mean_squared_error'), and sample the learning curve at 5 training-set sizes (10%, 25%, 50%, 75%, 100%):
train_sizes, train_loss, test_loss = learning_curve(
    SVC(gamma=0.001), X, y, cv=10, scoring='neg_mean_squared_error',
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1])

# Average the MSE over the 10 folds for each of the 5 training-set sizes (10%, 25%, 50%, 75%, 100%); negate because sklearn returns negative MSE
train_loss_mean = -np.mean(train_loss, axis=1)
test_loss_mean = -np.mean(test_loss, axis=1)
Visualize the curves:
plt.plot(train_sizes, train_loss_mean, 'o-', color="r",
         label="Training")
plt.plot(train_sizes, test_loss_mean, 'o-', color="g",
        label="Cross-validation")

plt.xlabel("Training examples")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
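A closely related tool is validation_curve, which varies a single hyperparameter instead of the training-set size. The sketch below (the gamma range is an illustrative assumption) plots training and cross-validation loss against gamma for the same SVC, which makes the transition into overfitting visible in a different way:

from sklearn.model_selection import validation_curve  # validation-curve utility

param_range = np.logspace(-6, -2.3, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=10, scoring='neg_mean_squared_error')

train_loss_mean = -np.mean(train_scores, axis=1)
test_loss_mean = -np.mean(test_scores, axis=1)

plt.plot(param_range, train_loss_mean, 'o-', color="r", label="Training")
plt.plot(param_range, test_loss_mean, 'o-', color="g", label="Cross-validation")
plt.xlabel("gamma")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()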




