Statistical Learning: Model Evaluation and Model Selection, Fitting a Target Function with Polynomials (Python Implementation)

Latest recommended article published on 2023-09-20 18:46:17

NewSuNess

Category column: Statistical Learning Methods. Tags: python, machine learning, sklearn

Article link: https://blog.csdn.net/weixin_46131719/article/details/122009268

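The goal of statistical learning is for the learned model to predict well not only on known data but also on unseen data. Different learning methods yield different models. Once a loss function is fixed, the training error and the test error computed with that loss naturally become the criteria for evaluating a learning method.
If we single-mindedly pursue better predictions on the training data, the selected model tends to be more complex than the true model. **This phenomenon is called overfitting.** Put plainly, the model performs very well on the training set but poorly on the test set.
The code below uses the sklearn package to fit the models and matplotlib to draw the plots.
The approach is to assume a true model function and then sample from it at random: the inputs come from a random number generator, and the sampled function values are perturbed by noise that should follow a normal distribution. Polynomials of several degrees are then fitted to these samples.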
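For reference, the polynomial model and the two error measures discussed above can be written out as follows. This is a sketch under stated assumptions: the symbols $M$ (polynomial degree), $w_j$ (coefficients), $N$ and $N'$ (training and test sample counts), and the primed test points are notation introduced here, and squared loss is assumed because the code scores models by mean squared error.

```latex
f_M(x, w) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

R_{\mathrm{train}}(\hat f) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat f(x_i) \bigr)^2,
\qquad
R_{\mathrm{test}}(\hat f) = \frac{1}{N'} \sum_{i=1}^{N'} \bigl( y'_i - \hat f(x'_i) \bigr)^2
```

Raising $M$ can only lower the smallest achievable training error, since every lower-degree polynomial is also a member of the higher-degree family; the test error carries no such guarantee, which is what makes overfitting possible.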

import matplotlib.pyplot as plt
import numpy as np
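# cross_val_score runs k-fold cross-validation (10 folds are used below)
from sklearn.model_selection import cross_val_score
# Pipeline chains steps such as feature extraction, scaling, and an estimator into one workflow
from sklearn import pipeline
# Preprocessing step that generates polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Linear regression model
from sklearn.linear_model import LinearRegression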


def true_function(x):
    return 2*np.exp(-x)*np.sin(x)


def data_samples():
    np.random.seed(0)
    n_samples = 30
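    # rand draws uniform values in [0, 1); scale them to [0, 6) and sort
    X = np.sort(np.random.rand(n_samples)*6)
    # randn draws standard normal noise, scaled here to a standard deviation of 0.05
    y = true_function(X)+np.random.randn(n_samples)*0.05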
    return X, y


def train(X, y):
    degrees = [1, 3, 5, 15]
    plt.figure(figsize=(20, 4))
    for i, degree in enumerate(degrees):
        plt.subplot(1, len(degrees), i+1)
        poly = PolynomialFeatures(degree, include_bias=False)
        linear_regression = LinearRegression()  # linear regression model
        pipe = pipeline.Pipeline([("poly", poly), ("linear_regression", linear_regression)])
        pipe.fit(X[:, np.newaxis], y)
        # Estimate model performance with 10-fold cross-validation
        scores = cross_val_score(pipe, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)
        # Plot the fitted model, the true function, and the samples
        X_test = np.linspace(0, 6, 100)
        plt.plot(X_test, pipe.predict(X_test[:, np.newaxis]), label='model')
        plt.plot(X_test, true_function(X_test), label="True Function")
        plt.scatter(X, y, label='Samples')
        plt.legend(loc='best', fontsize=6)
        plt.xlabel("X", fontsize=6)
        plt.ylabel("y", fontsize=6)
        plt.grid()
        # The scores are negated MSE values, so flip the sign of the mean
        plt.title("Degree {}\nMSE = {:.2e}(+/-{:.2e})".format(degree, -scores.mean(), scores.std()))



X, y = data_samples()
train(X, y)
plt.show()
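
The plots report a cross-validated MSE, but the introduction frames model evaluation in terms of training error versus test error. One way to see the two side by side is to hold out part of the sample and compute both for each degree. The following is a minimal sketch, not part of the original code: the helper name `train_test_errors`, the single 20/10 split, the use of `make_pipeline`, and the Gaussian noise with standard deviation 0.05 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_function(x):
    # Same target function as in the article's code.
    return 2 * np.exp(-x) * np.sin(x)


def train_test_errors(degrees=(1, 3, 5, 15), n_samples=30, noise=0.05, seed=0):
    """Return {degree: (train_mse, test_mse)} for polynomial fits.

    Hypothetical helper: the name, defaults, and split are illustrative.
    """
    rng = np.random.default_rng(seed)
    X = np.sort(rng.uniform(0, 6, n_samples))
    # Gaussian noise is assumed here, following the article's description.
    y = true_function(X) + rng.normal(0.0, noise, n_samples)
    # Hold out 10 of the 30 samples; the split size is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, np.newaxis], y, test_size=10, random_state=seed
    )
    errors = {}
    for degree in degrees:
        model = make_pipeline(
            PolynomialFeatures(degree, include_bias=False), LinearRegression()
        )
        model.fit(X_train, y_train)
        train_mse = mean_squared_error(y_train, model.predict(X_train))
        test_mse = mean_squared_error(y_test, model.predict(X_test))
        errors[degree] = (train_mse, test_mse)
    return errors


if __name__ == "__main__":
    for degree, (train_mse, test_mse) in train_test_errors().items():
        print(f"degree {degree:2d}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```

A training MSE that keeps shrinking while the test MSE grows for the highest degree would be the signature of overfitting described at the start. A single split is noisier than the 10-fold cross-validation used in `train`, so the exact numbers depend on the seed.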