第三章_机器学习理论基础
什么是过拟合,什么是欠拟合?怎么去诊断算法是否过拟合或欠拟合
过拟合是指模型能很好地拟合训练样本,但对新数据的预测准确性很差,欠拟合是指模不能很好地拟合训练样本,且对新数据的预测准确性也不好
分别用一阶多项式,三阶多项式,十阶多项式,来拟合这个数据集
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
n_dots = 20
x = np.linspace(0, 1, n_dots) # [0, 1] 之间创建 20 个点
y = np.sqrt(x) + 0.2*np.random.rand(n_dots) - 0.1;
def plot_polynomial_fit(x, y, order):
p = np.poly1d(np.polyfit(x, y, order))
# 画出拟合出来的多项式所表达的曲线以及原始的点
t = np.linspace(0, 1, 200)
plt.plot(x, y, 'ro', t, p(t), '-', t, np.sqrt(t), 'r--')
return p
plt.figure(figsize=(18, 4))
titles = ['Under Fitting', 'Fitting', 'Over Fitting']
models = [None, None, None]
for index, order in enumerate([1, 3, 10]):
plt.subplot(1, 3, index + 1)
models[index] = plot_polynomial_fit(x, y, order)
plt.title(titles[index], fontsize=20)
结果
模型的拟合成本是什么意思,它和模型的准确性有什么关系?
拟合成本:是衡量模型与训练样本符合程度的指标 Jtest():的值越小说明预测出来的值与实际值差异越小,对新数据的预测准确性就越好。
我们有那些指标来评价一个模型的好坏
准确性
为什么需要交叉验证数据集
我们用测试数据集来选择多项式的阶数d.相当于吧测试数据集提前让模型见过,这样选择出来的多项式式阶数d本身就是对训练集最友好的一个参数,这样模型的准确性测试就失去了意义。不需要横向去对比不同的模型,在工程上,大多数不是是选择模型,而是获取数据,分析数据,挖掘数据。
什么是学习区曲线,为什么要画学习曲线
当训练数据增加时,模型对训练数据集拟合的准确性,以及对交叉验证数据集预测的准确性的变化规律。可以直观地观察到模型的准确性与训练数据集大小的关系
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
n_dots = 200
X = np.linspace(0, 1, n_dots)
y = np.sqrt(X) + 0.2*np.random.rand(n_dots) - 0.1;
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
def polynomial_model(degree=1):
polynomial_features = PolynomialFeatures(degree=degree,
include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
("linear_regression", linear_regression)])
return pipeline
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
"""
Generate a simple plot of the test and training learning curve.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional
Defines minimum and maximum yvalues plotted.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 3-fold cross-validation,
- integer, to specify the number of folds.
- An object to be used as a cross-validation generator.
- An iterable yielding train/test splits.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : integer, optional
Number of jobs to run in parallel (default 1).
"""
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o--', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
return plt
# 为了让学习曲线更平滑,交叉验证数据集的得分计算 10 次,每次都重新选中 20% 的数据计算一遍
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
titles = ['Learning Curves (Under Fitting)',
'Learning Curves',
'Learning Curves (Over Fitting)']
degrees = [1, 3, 10]
plt.figure(figsize=(18, 4))
for i in range(len(degrees)):
plt.subplot(1, 3, i + 1)
plot_learning_curve(polynomial_model(degrees[i]), titles[i], X, y, ylim=(0.75, 1.01), cv=cv)
plt.show()
为什么要用查准率和召回率来评估模型的好坏?查准率和召回率适合那些问题领域
分类不均匀,某一类值过大或过小,只用准确率没有实际意义。准确率和召回率是广泛用于信息检索和统计学分类领域的两个度量值,用来评价结果的质量。其中精度是检索出相关文档数与检索出的文档总数的比率,衡量的是检索系统的查准率; 召回率是指检索出的相关文档数和文档库中所有的相关文档数的比率,衡量的是检索系统的查全率。