Plotting Learning Curves
Learning curve: a plot with the number of training samples on the x-axis and, on the y-axis, the model's mean score (plus a score band) on both the training samples and the cross-validation samples.
Steps to plot a learning curve:
1. Split the data into a training set and a cross-validation set;
2. Take 20% of the training set as the current training sample and train a model on it;
3. Compute the model's scores on the current training sample and on the cross-validation set;
4. Plot the scores against the current training-sample size;
5. Increase the training sample by another 10% of the training set, retrain, and repeat steps 3-4 until 100% of the training set has been used.
First, generate training samples as points that fluctuate around √x:
import numpy as np

n_dots = 200
X = np.linspace(0, 1, n_dots)
# Points fluctuating around sqrt(x), with uniform noise in [-0.1, 0.1).
y = np.sqrt(X) + 0.2 * np.random.rand(n_dots) - 0.1
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
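The iterative procedure above can be sketched by hand (a minimal illustration with a plain linear model and hypothetical variable names; scikit-learn's learning_curve helper, used below, automates exactly this loop):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 1)
y_demo = np.sqrt(X_demo).ravel() + 0.2 * rng.rand(200) - 0.1

# Step 1: split into a training set and a cross-validation set.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

sizes, train_scores, cv_scores = [], [], []
# Steps 2-5: train on 20%, 30%, ..., 100% of the training set,
# scoring on both the samples seen so far and the held-out cv set.
for frac in np.linspace(0.2, 1.0, 9):
    n = int(round(frac * len(X_train)))
    model = LinearRegression().fit(X_train[:n], y_train[:n])
    sizes.append(n)
    train_scores.append(model.score(X_train[:n], y_train[:n]))
    cv_scores.append(model.score(X_cv, y_cv))
# sizes plus the two score lists are the raw material of the learning curve.
```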
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    return pipeline
The polynomial_model(degree) function builds a polynomial regression model, where degree is the order of the polynomial.
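As a quick check (a self-contained sketch reusing the same kind of synthetic data, with hypothetical names X_demo/y_demo), the returned pipeline behaves like any scikit-learn estimator with fit/score:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    return Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])

# Same kind of synthetic data as above: points fluctuating around sqrt(x).
rng = np.random.RandomState(0)
X_demo = np.linspace(0, 1, 200).reshape(-1, 1)
y_demo = np.sqrt(X_demo).ravel() + 0.2 * rng.rand(200) - 0.1

m = polynomial_model(degree=3)
m.fit(X_demo, y_demo)
r2 = m.score(X_demo, y_demo)  # R^2 on the training data; close to 1 for a good fit
```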
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(0.1, 1.0, 5)):
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # learning_curve() returns (1) the sequence of training-set sizes,
    # (2) the model's scores on each training subset, and
    # (3) the model's scores on the cross-validation sets.
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands: one standard deviation around each mean score.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
# The larger test_size is, the smaller the variance of the model's
# scores on the cross-validation data.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
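For reference, ShuffleSplit here draws 10 independent random 80/20 splits of the data; a small sketch (with hypothetical demo names) of what it produces:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
X_demo = np.arange(50).reshape(-1, 1)

# 10 (train_indices, test_indices) pairs; each split is 40 train / 10 test.
splits = list(cv_demo.split(X_demo))
n_train, n_test = len(splits[0][0]), len(splits[0][1])
```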
titles = ['Learning Curves (Under Fitting)', 'Learning Curves', 'Learning Curves (Over Fitting)']
degrees = [1, 3, 10]
plt.figure(figsize=(18, 4), dpi=200)
for i in range(len(degrees)):
    plt.subplot(1, 3, i + 1)
    plot_learning_curve(polynomial_model(degrees[i]), titles[i], X, y, ylim=(0.75, 1.01), cv=cv)
plt.show()
Output: three learning-curve plots, one per polynomial degree.
In the three plots:
1. The left plot shows the first-order polynomial fit. As the training set grows, the model's scores on the training set and on the cross-validation set (for regression, the goodness of fit, i.e. R²) converge, but both remain low (high bias): the model underfits. When the problem is high bias, adding more training data does not improve accuracy much.
2. The right plot shows the tenth-order polynomial fit. As the training set grows, the cross-validation score moves toward the training score, but a sizable gap remains and the cross-validation scores have high variance: the model overfits.
3. The middle plot shows the third-order polynomial fit. The model's training and cross-validation scores end up very close together.
Remedies for overfitting:
- Get more training data: the right plot above suggests that more training data helps close the gap;
- Reduce the number of input features: fewer features mean a less complex model.
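The second remedy can be illustrated on the same kind of synthetic data: dropping from degree 10 to degree 3 (fewer derived features) is expected to shrink the gap between training and cross-validation scores. A sketch, not part of the original experiment, with hypothetical demo names:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
X_demo = np.linspace(0, 1, 200).reshape(-1, 1)
y_demo = np.sqrt(X_demo).ravel() + 0.2 * rng.rand(200) - 0.1
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

def train_cv_gap(degree):
    # Gap between the training score and the mean cross-validation score.
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    train_score = model.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_score = cross_val_score(model, X_demo, y_demo, cv=cv_demo).mean()
    return train_score - cv_score

# Expectation: the degree-10 model overfits more, so its gap is wider.
gap_10, gap_3 = train_cv_gap(10), train_cv_gap(3)
```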
Remedies for underfitting:
- Add informative features;
- Add polynomial features.
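Likewise for underfitting: on the same kind of data, adding polynomial features lifts the training score of the first-order model (a sketch with hypothetical demo names; since the cubic model nests the linear one, its training R² can never be lower):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = np.linspace(0, 1, 200).reshape(-1, 1)
y_demo = np.sqrt(X_demo).ravel() + 0.2 * rng.rand(200) - 0.1

# A straight line underfits sqrt(x); cubic features track it more closely.
linear_r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
cubic = make_pipeline(PolynomialFeatures(3, include_bias=False), LinearRegression())
cubic_r2 = cubic.fit(X_demo, y_demo).score(X_demo, y_demo)
```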