Data Analysis: Balancing Model and Data on the "Red Wine Taste" Dataset

Latest recommended article published on 2021-06-17 21:57:53

J_Xiong0117

Category columns: python, data analysis

Article link: https://blog.csdn.net/u013010473/article/details/106454592


I. Python Code

#!/usr/bin/env python3
# encoding: utf-8
'''
@file: fwdStepwiseWine.py
@time: 2020/5/31 0031 11:53
@author: Jack
@contact: jack18588951684@163.com
'''

import urllib.request
import numpy as np
from sklearn import datasets, linear_model
from math import sqrt
import matplotlib.pyplot as plt


def xattrSelect(x, idxSet):
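    """
    Return a column subset of the attribute matrix x.
    :param x: attribute matrix as a list of rows
    :param idxSet: indices of the columns to keep
    :return: list of rows containing only the selected columns
    """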
    xOut = []
    for row in x:
        xOut.append([row[i] for i in idxSet])
    return xOut


target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
data = urllib.request.urlopen(target_url)
xList = []
labels = []
names = []
firstLine = True
for line in data:
    if firstLine:
        names = str(line, encoding='utf-8').strip().split(";")
        firstLine = False
    else:
        row = str(line, encoding='utf-8').strip().split(";")
        labels.append(float(row[-1]))
        row.pop()
        floatRow = [float(num) for num in row]
        xList.append(floatRow)

indices = range(len(xList))
xListTest = [xList[i] for i in indices if i % 3 == 0]
xListTrain = [xList[i] for i in indices if i % 3 != 0]
labelsTest = [labels[i] for i in indices if i % 3 == 0]
labelsTrain = [labels[i] for i in indices if i % 3 != 0]

## Build the attribute list one attribute at a time, starting from empty
attributeList = []
index = range(len(xList[1]))
indexSet = set(index)
indexSeq = []
oosError = []

for i in index:
    attSet = set(attributeList)
    attTrySet = indexSet - attSet
    # form into list
    attTry = [ii for ii in attTrySet]
    errorList = []
    attTemp = []
    # try each attribute not in set to see which one gives least oos error
    for iTry in attTry:
        attTemp = [] + attributeList
        attTemp.append(iTry)
        # use attTemp to form training and testing sub matrices as list of lists
        xTrainTemp = xattrSelect(xListTrain, attTemp)
        xTestTemp = xattrSelect(xListTest, attTemp)
        # form into numpy arrays
        xTrain = np.array(xTrainTemp)
        yTrain = np.array(labelsTrain)
        xTest = np.array(xTestTemp)
        yTest = np.array(labelsTest)
        # use sci-kit learn linear regression
        wineQModel = linear_model.LinearRegression()
        wineQModel.fit(xTrain, yTrain)
        # use trained model to generate prediction and calculate rmsError
        rmsError = np.linalg.norm((yTest - wineQModel.predict(xTest)),
                                  2) / sqrt(len(yTest))
        errorList.append(rmsError)
        attTemp = []

    iBest = np.argmin(errorList)
    attributeList.append(attTry[iBest])
    oosError.append(errorList[iBest])

print("Out of sample error versus attribute set size")
print(oosError)
print("\n" + "Best attribute indices")
print(attributeList)
namesList = [names[i] for i in attributeList]
print("\n" + "Best attribute names")
print(namesList)
# Plot error versus number of attributes
# oosError[k] is the error with k + 1 attributes, so the x-axis starts at 1
x = range(1, len(oosError) + 1)
plt.plot(x, oosError, 'k')
plt.xlabel('Number of Attributes')
plt.ylabel('Error (RMS)')
plt.show()
# Plot histogram of out of sample errors for best number of attributes
# Identify index corresponding to min value,
# retrain with the corresponding attributes
# Use resulting model to predict against out of sample data.
# Plot errors (aka residuals)
indexBest = oosError.index(min(oosError))
# slice from 0 so that the first selected attribute is kept
attributesBest = attributeList[0:(indexBest + 1)]
# Define column-wise subsets of xListTrain and xListTest
# and convert to numpy
xTrainTemp = xattrSelect(xListTrain, attributesBest)
xTestTemp = xattrSelect(xListTest, attributesBest)
xTrain = np.array(xTrainTemp)
xTest = np.array(xTestTemp)
# train and plot error histogram
wineQModel = linear_model.LinearRegression()
wineQModel.fit(xTrain, yTrain)
errorVector = yTest - wineQModel.predict(xTest)
plt.hist(errorVector)
plt.xlabel("Bin Boundaries")
plt.ylabel("Counts")
plt.show()

# scatter plot of actual versus predicted
plt.scatter(wineQModel.predict(xTest), yTest, s=100, alpha=0.10)
plt.xlabel('Predicted Taste Score')
plt.ylabel('Actual Taste Score')
plt.show()

Out of sample error versus attribute set size
[0.7234259255116278, 0.6860993152837196, 0.6734365033420278, 0.6677033213897796, 0.6622558568522274, 0.6590004754154626, 0.6572717206143075, 0.6570905806207697, 0.6569993096446136, 0.6575818940043473, 0.6573909869011338]

Best attribute indices
[10, 1, 9, 4, 6, 8, 5, 3, 2, 7, 0]

Best attribute names
['"alcohol"', '"volatile acidity"', '"sulphates"', '"chlorides"', '"total sulfur dioxide"', '"pH"', '"free sulfur dioxide"', '"residual sugar"', '"citric acid"', '"density"', '"fixed acidity"']

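[Figures 1 to 3: RMS error versus number of attributes; histogram of out-of-sample prediction errors; scatter plot of actual versus predicted taste score]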

II. Balancing Model and Data

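Overfitting means that the error on the training data differs markedly from the error on the test data. On a real problem that is not a good outcome. Here the root of overfitting is that X (the feature matrix) has too many columns (attributes/features), so one remedy is to drop some of them. That in turn raises the question of how many columns to drop and which ones. Answering it by brute force, trying every candidate subset, is called best subset selection.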
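The gap between training and test error can be seen on a small example. The sketch below is not part of the original listing: it uses hypothetical synthetic data in which only two of thirty columns carry signal, and it fits ordinary least squares with `np.linalg.lstsq` instead of scikit-learn so that it only needs NumPy. The helper names `rmse` and `fit_predict` are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_predict(x_train, y_train, x_eval):
    """Ordinary least squares with an intercept, then predict on x_eval."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_eval = np.column_stack([np.ones(len(x_eval)), x_eval])
    return a_eval @ coef


# Hypothetical synthetic data: 40 rows, 30 columns, only columns 0 and 1 matter.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 30))
y = 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=1.0, size=40)

# Same split rule as the listing: every third row is held out for testing.
test_mask = np.arange(len(x)) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

train_errors, test_errors = [], []
for n_cols in range(1, 21):
    cols = list(range(n_cols))  # use the first n_cols columns
    train_pred = fit_predict(x_train[:, cols], y_train, x_train[:, cols])
    test_pred = fit_predict(x_train[:, cols], y_train, x_test[:, cols])
    train_errors.append(rmse(y_train, train_pred))
    test_errors.append(rmse(y_test, test_pred))

# Columns used, training RMSE, test RMSE
for n_cols in (2, 10, 20):
    print(n_cols, round(train_errors[n_cols - 1], 3), round(test_errors[n_cols - 1], 3))
```

Because each model contains all the columns of the previous one, the training error can only stay the same or fall as columns are added. The test error has no such guarantee, which is why the selection below is made on held-out rows.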
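1. Best subset selection
The basic idea of best subset selection is to put a constraint on the number of columns (say nCol), then build datasets from every combination of nCol columns of X and find the combination that performs best on the test set. Then increase nCol and repeat. This yields the best one-column subset, the best two-column subset, and so on up to the subset containing all columns (the full matrix X), each with its own performance figure. At deployment time you pick the version with the lowest error, which decides whether to use the one-column version, the two-column version, or another. The drawback of best subset selection is its computational cost, which is very large even with few attributes (the number of attributes is the number of columns of X). For example, 10 attributes give 2^10 = 1024 subsets (counting the empty one).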
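The enumeration described above can be sketched with `itertools.combinations`. This is an illustration under assumptions, not the article's code: the data is synthetic, the names `holdout_rmse` and `best_subsets` are hypothetical, and least squares is done with `np.linalg.lstsq` rather than scikit-learn.

```python
from itertools import combinations
from math import comb

import numpy as np


def holdout_rmse(x_train, y_train, x_test, y_test, cols):
    """Fit least squares (with intercept) on the given columns; return test RMSE."""
    cols = list(cols)
    a_train = np.column_stack([np.ones(len(x_train)), x_train[:, cols]])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([np.ones(len(x_test)), x_test[:, cols]])
    return float(np.sqrt(np.mean((y_test - a_test @ coef) ** 2)))


def best_subsets(x_train, y_train, x_test, y_test):
    """For each subset size n_col, return (best column tuple, its test RMSE)."""
    n_features = x_train.shape[1]
    results = {}
    for n_col in range(1, n_features + 1):
        scored = [
            (holdout_rmse(x_train, y_train, x_test, y_test, cols), cols)
            for cols in combinations(range(n_features), n_col)
        ]
        err, cols = min(scored)  # lowest held-out error for this size
        results[n_col] = (cols, err)
    return results


# Hypothetical synthetic data: 5 columns, only columns 0 and 3 carry signal.
rng = np.random.default_rng(1)
x = rng.normal(size=(90, 5))
y = 3.0 * x[:, 0] - 2.0 * x[:, 3] + rng.normal(scale=0.5, size=90)
test_mask = np.arange(90) % 3 == 0

results = best_subsets(x[~test_mask], y[~test_mask], x[test_mask], y[test_mask])
for n_col, (cols, err) in results.items():
    print(n_col, cols, round(err, 4))

# Models to fit for 10 columns: every non-empty subset.
n_models = sum(comb(10, k) for k in range(1, 11))
print(n_models)  # 1023, i.e. 2**10 minus the empty subset
```

The model count doubles with every extra column, which is what makes the approach impractical beyond a few dozen attributes.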
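2. Forward stepwise regression
Forward stepwise regression starts with one-column subsets and finds the single best attribute. It then looks for the second attribute that works best in combination with the first, rather than evaluating all two-column subsets, and continues in the same way. Otherwise the procedure is much like best subset selection. It produces a parameterized family of models (linear regressions, with the number of columns as the parameter). These models differ in complexity, and the final one is chosen by its error on held-out samples.
The code above implements forward stepwise regression on the red wine dataset. It first defines a function that extracts the selected columns from the X matrix (a Python list whose elements are themselves lists). The script then splits the X matrix and the label vector into training and test sets, with every third row going to the test set. After that comes the algorithm just described. Each pass starts from a subset of attributes: on the first pass the subset is empty, and on later passes it holds the attributes chosen so far. Every pass adds one new attribute. To choose it, each attribute not yet in the subset is tried in turn: it is added to the subset, a model is fitted by ordinary least squares, and performance is evaluated on the held-out samples. The attribute that gives the lowest root mean squared error (RMSE) joins the attribute set, and that RMSE is recorded.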
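The same greedy loop can be written more compactly with NumPy arrays. The sketch below is a restatement under assumptions, not the article's listing: the function names are hypothetical, the data is synthetic, and `np.linalg.lstsq` stands in for `linear_model.LinearRegression`.

```python
import numpy as np


def holdout_rmse(x_train, y_train, x_test, y_test, cols):
    """Fit least squares (with intercept) on the given columns; return test RMSE."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train[:, cols]])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([np.ones(len(x_test)), x_test[:, cols]])
    return float(np.sqrt(np.mean((y_test - a_test @ coef) ** 2)))


def forward_stepwise(x_train, y_train, x_test, y_test):
    """Greedily add the column that gives the lowest held-out RMSE at each pass."""
    remaining = list(range(x_train.shape[1]))
    selected, errors = [], []
    while remaining:
        trials = [
            (holdout_rmse(x_train, y_train, x_test, y_test, selected + [j]), j)
            for j in remaining
        ]
        best_err, best_j = min(trials)
        selected.append(best_j)
        remaining.remove(best_j)
        errors.append(best_err)
    return selected, errors


# Hypothetical synthetic data: 6 columns, signal in columns 2 and 4.
rng = np.random.default_rng(2)
x = rng.normal(size=(120, 6))
y = 1.5 * x[:, 2] - 2.5 * x[:, 4] + rng.normal(scale=0.5, size=120)
test_mask = np.arange(120) % 3 == 0

selected, errors = forward_stepwise(
    x[~test_mask], y[~test_mask], x[test_mask], y[test_mask]
)
print(selected)
print([round(e, 4) for e in errors])

# Fits needed for the 11 wine attributes: stepwise versus all non-empty subsets.
n = 11
print(n * (n + 1) // 2, 2 ** n - 1)  # 66 2047
```

With 11 attributes the greedy search fits 11 + 10 + ... + 1 = 66 models, against 2047 for an exhaustive search, at the cost of possibly missing the best subset of a given size.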

III. Analysis of the Evaluation Results

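Figure 1 above shows RMSE as a function of the number of attributes used in the regression. The error falls until 9 attributes are included and then rises. The first output list (a Python list object) gives the same RMSE values: the error keeps decreasing until the tenth attribute is added, at which point it goes up. The second and third output lists give the corresponding column indices and attribute names (column names).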
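The location of the minimum can be checked directly from the printed output. The snippet below copies the first and third output lists from the run above (with the stray double quotes stripped from the names); the variable names are chosen here for illustration.

```python
# First output list: out-of-sample RMSE after adding 1, 2, ..., 11 attributes.
oos_error = [
    0.7234259255116278, 0.6860993152837196, 0.6734365033420278,
    0.6677033213897796, 0.6622558568522274, 0.6590004754154626,
    0.6572717206143075, 0.6570905806207697, 0.6569993096446136,
    0.6575818940043473, 0.6573909869011338,
]

# Third output list: attribute names in the order they were selected.
attribute_names = [
    "alcohol", "volatile acidity", "sulphates", "chlorides",
    "total sulfur dioxide", "pH", "free sulfur dioxide",
    "residual sugar", "citric acid", "density", "fixed acidity",
]

best_index = min(range(len(oos_error)), key=oos_error.__getitem__)
best_count = best_index + 1  # list position k holds the error for k + 1 attributes
selected_names = attribute_names[:best_count]

print(best_count)  # 9
print(selected_names)
```

The lowest error belongs to the 9-attribute model, which leaves out "density" and "fixed acidity".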
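The other plots help in understanding how well a trained algorithm performs, and they point to ways of improving it. Figure 3 is a scatter plot of the actual label against the predicted label for each point in the test set. Ideally all points in Figure 3 would lie on the 45-degree line, where the actual and predicted labels are equal. Because the actual scores are integers, the points fall into horizontal rows. When the actual labels take only a few distinct values, it helps to draw each point semi-transparent, so that the depth of color in a region shows how many points are stacked there. Predictions for wines whose actual score is 5 or 6 are quite good. For more extreme values the predictions are poor. In general, machine learning algorithms predict less well near the edges of the data.
Figure 2 is a histogram of the prediction errors of the forward stepwise algorithm on the wine data. Sometimes an error histogram has two or more distinct peaks, for example a small peak at the far right or far left. In that case it is worth looking for an explanation of the different peaks and adding new attributes that can tell the groups apart, which lowers the prediction error.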
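The observation about scores 5 and 6 versus the extremes can also be put in numbers by grouping the absolute errors by actual score. The helper below is a hypothetical addition, and the two arrays are made-up toy values rather than results from the wine data. In the listing you would pass `yTest` and `wineQModel.predict(xTest)` instead.

```python
from collections import defaultdict

import numpy as np


def mean_abs_error_by_score(y_true, y_pred):
    """Mean absolute prediction error for each distinct actual score."""
    groups = defaultdict(list)
    for actual, predicted in zip(y_true, y_pred):
        groups[int(actual)].append(abs(actual - predicted))
    return {score: float(np.mean(errs)) for score, errs in sorted(groups.items())}


# Toy values for illustration only (not taken from the wine dataset).
y_true = np.array([3, 5, 5, 6, 6, 8], dtype=float)
y_pred = np.array([5.0, 5.2, 5.4, 5.8, 6.1, 6.5])

result = mean_abs_error_by_score(y_true, y_pred)
for score, err in result.items():
    print(score, round(err, 3))
```

A table like this complements the semi-transparent scatter plot: it shows which score levels contribute most to the overall error.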

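Note the following about the output above:
1) The process trains a family of models that is parameterized (here by the number of attributes in the linear model);
2) The model finally chosen is the one with the lowest out-of-sample (test set) error;
3) The number of attributes here is called the complexity parameter. A more complex model has more free parameters and overfits the data more easily than a less complex one;
4) In the second and third output lists the attributes are ordered by their importance for prediction. In the list of column numbers and the list of attribute names, the first element is the attribute chosen first, the second element is the attribute chosen second, and so on. Being able to rank attributes like this is an important and desirable feature in machine learning tasks. The early stage of a machine learning task largely consists of finding (or constructing) the best set of attributes for making predictions, and an algorithm that can rank attributes is a great help there;
5) The last point is model choice. A more complex model tends to generalize worse, so other things being equal the less complex model is preferred. A good rule of thumb is that if adding an attribute improves performance only in the fourth decimal place, it is safer to leave that attribute out.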
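The rule of thumb in point 5 can be applied to the printed error list. The sketch below reads "improves only in the fourth decimal place" as "reduces the RMSE by less than 0.001"; that threshold and the function name `smallest_adequate_size` are assumptions made here, not something the original code defines.

```python
def smallest_adequate_size(errors, min_gain=1e-3):
    """Number of attributes to keep.

    errors[k - 1] is the held-out error with k attributes. Stop before the
    first attribute whose addition lowers the error by less than min_gain.
    """
    for k in range(1, len(errors)):
        if errors[k - 1] - errors[k] < min_gain:
            return k
    return len(errors)


# Out-of-sample RMSE values copied from the first output list above.
oos_error = [
    0.7234259255116278, 0.6860993152837196, 0.6734365033420278,
    0.6677033213897796, 0.6622558568522274, 0.6590004754154626,
    0.6572717206143075, 0.6570905806207697, 0.6569993096446136,
    0.6575818940043473, 0.6573909869011338,
]

keep = smallest_adequate_size(oos_error)
print(keep)  # 7
```

Under this reading the conservative choice is the 7-attribute model: the eighth and ninth attributes lower the RMSE by only about 0.0002 and 0.0001, even though the raw minimum is at 9 attributes.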