Case Study: Red Wine Dataset Analysis

卖山楂啦prss

Last modified: 2022-10-21 18:13:58

First published: 2020-10-06 15:45:21

Article link: https://blog.csdn.net/qq_42374697/article/details/108073110

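This article analyzes the red wine dataset and explores how features such as acidity and sweetness relate to wine quality. Visualization shows that alcohol and citric acid are positively correlated with quality, while volatile acidity, density, and pH are negatively correlated. A sweetness category is derived, and linear regression, random forest, and GBDT models are fitted, with random forest giving the lowest test error. Hyperparameters are then tuned with grid search and randomized search.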

Data source: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

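The red wine dataset contains 1,599 samples and 12 columns. Eleven columns are physicochemical properties of the wine, and the quality column is the wine's quality score on a 10-point scale.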

First, import the required libraries and load the dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('D:\\Py_dataset\\winequality-red.csv',sep = ';')
df.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.997	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5
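The `sep = ';'` argument matters because `read_csv` splits on commas by default. The sketch below illustrates this with a small inline sample instead of the real file. The `sample` text is an assumption for illustration: it reuses three columns from the rows shown above and assumes the file is semicolon-delimited, as the loading code implies.

```python
import io
import pandas as pd

# Inline stand-in for the CSV file (three columns only, for brevity).
sample = (
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.70;5\n"
    "7.8;0.88;5\n"
    "11.2;0.28;6\n"
)

# With the default comma separator every row collapses into a single column.
wrong = pd.read_csv(io.StringIO(sample))
# With sep=';' the three columns are parsed as intended.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(right.columns.tolist())
```

If `df.head()` ever shows one wide column, the delimiter is the first thing to check.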

The 12 fields are described below:

No	Attribute	Data type	Description
1	fixed acidity	Numeric	Fixed (non-volatile) acidity
2	volatile acidity	Numeric	Volatile acidity
3	citric acid	Numeric	Citric acid
4	residual sugar	Numeric	Residual sugar
5	chlorides	Numeric	Chlorides
6	free sulfur dioxide	Numeric	Free sulfur dioxide
7	total sulfur dioxide	Numeric	Total sulfur dioxide
8	density	Numeric	Density
9	pH	Numeric	pH
10	sulphates	Numeric	Sulphates
11	alcohol	Numeric	Alcohol
12	quality (score between 0 and 10)	Numeric	Wine quality (score between 0 and 10)

Data Exploration and Visualization

df.shape #  (1599, 12)
df.info() # no missing values
'''
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
'''
df.describe().T

	count	mean	std	min	25%	50%	75%	max
fixed acidity	1599.0	8.320	1.741	4.600	7.100	7.900	9.200	15.900
volatile acidity	1599.0	0.528	0.179	0.120	0.390	0.520	0.640	1.580
citric acid	1599.0	0.271	0.195	0.000	0.090	0.260	0.420	1.000
residual sugar	1599.0	2.539	1.410	0.900	1.900	2.200	2.600	15.500
chlorides	1599.0	0.087	0.047	0.012	0.070	0.079	0.090	0.611
free sulfur dioxide	1599.0	15.875	10.460	1.000	7.000	14.000	21.000	72.000
total sulfur dioxide	1599.0	46.468	32.895	6.000	22.000	38.000	62.000	289.000
density	1599.0	0.997	0.002	0.990	0.996	0.997	0.998	1.004
pH	1599.0	3.311	0.154	2.740	3.210	3.310	3.400	4.010
sulphates	1599.0	0.658	0.170	0.330	0.550	0.620	0.730	2.000
alcohol	1599.0	10.423	1.066	8.400	9.500	10.200	11.100	14.900
quality	1599.0	5.636	0.808	3.000	5.000	6.000	6.000	8.000

Histograms of each variable's distribution:

# Set the color palette
color = sns.color_palette()
column= df.columns.tolist()
fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    df[column[i]].hist(bins = 100,color = color[3])
    plt.xlabel(column[i],fontsize = 12)
    plt.ylabel('Frequency',fontsize = 12)
plt.tight_layout()

These plots give a rough picture of how each feature is distributed.
[Figure: histograms of the 12 variables]

fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(y = df[column[i]],width = 0.5,color = color[4])
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()

[Figure: boxplots of the 12 variables]
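The points drawn beyond the whiskers in a boxplot follow the usual 1.5 × IQR rule. As a minimal sketch, the same rule can be applied numerically to count outliers per column. The helper name `iqr_outlier_count` and the `toy` values are hypothetical and only for illustration.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up residual-sugar-like values with one extreme reading.
toy = pd.Series([1.9, 2.6, 2.3, 1.9, 2.2, 2.0, 2.4, 15.5])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```

On the real data this could be applied column by column, for example with `df[column].apply(iqr_outlier_count)`.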

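Analysis of Acidity-Related Features

The acidity-related features in this dataset are 'fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', and 'pH'. The first six all influence pH. Because pH is on a logarithmic scale, we take the base-10 logarithm of those six features before plotting their histograms.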

acidityfeat = ['fixed acidity', 
				'volatile acidity', 
				'citric acid', 
				'chlorides', 
				'free sulfur dioxide', 
				'total sulfur dioxide',]

fig = plt.figure(figsize = (10,6))
for i in range(6):
    plt.subplot(2,3,i+1)
    v = np.log10(np.clip(df[acidityfeat[i]].values,a_min = 0.001,a_max = None))
    plt.hist(v,bins = 50,color = color[0])
    plt.xlabel('log('+ acidityfeat[i] +')',fontsize = 12)
    plt.ylabel('Frequency')    
plt.tight_layout()

[Figure: histograms of the log10-transformed acidity-related features]
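The `np.clip(..., a_min = 0.001)` call in the loop above guards against taking the logarithm of zero: citric acid has a minimum of 0.000 in the summary table, and `log10(0)` is negative infinity. A minimal sketch with a few made-up values:

```python
import numpy as np

# Illustrative values; the first one mimics a wine with no citric acid.
values = np.array([0.0, 0.04, 0.56, 1.0])

# Raise anything below 0.001 to 0.001 so the logarithm stays finite.
clipped = np.clip(values, a_min=0.001, a_max=None)
logged = np.log10(clipped)
print(logged)
```

The floor of 0.001 maps every zero to -3 on the log scale, so zeros show up as a separate spike at the left edge of the histogram.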

plt.figure(figsize = (6,3))

bins = 10**(np.linspace(-2,2))
plt.hist(df['fixed acidity'],bins = bins, edgecolor = 'k',label = 'fixed acidity')
plt.hist(df['volatile acidity'],bins = bins, edgecolor = 'k',label = 'volatile acidity')
plt.hist(df['citric acid'],bins = bins, alpha = 0.8,edgecolor = 'k',label = 'citric acid')

plt.xscale('log')
plt.xlabel('Acid concentration(g/dm^3)')
plt.ylabel('Frequency')
plt.title('Histogram of Acid Concentration')
plt.legend()
plt.tight_layout()

[Figure: histogram of acid concentration on a log scale]

Sweetness

Residual sugar largely determines how sweet a wine is: dry (<= 4 g/L), semi-dry (4-12 g/L), semi-sweet (12-45 g/L), and sweet (>= 45 g/L). This dataset contains no sweet wines.

df['sweetness'] = pd.cut(df['residual sugar'],bins = [0,4,12,45],labels = ['dry','semi-dry','semi-sweet'])
df.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	sweetness
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.997	3.20	0.68	9.8	5	dry
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	5	dry
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	6	dry
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry

plt.figure(figsize = (6,4))
df['sweetness'].value_counts().plot(kind = 'bar',color = color[0])
plt.xticks(rotation = 0)
plt.xlabel('sweetness')
plt.ylabel('frequency')

plt.tight_layout()
print('Figure 5')

[Figure 5: bar chart of sweetness category counts]
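`pd.cut` uses right-closed intervals by default, so the bins `[0,4,12,45]` mean (0, 4], (4, 12], and (12, 45]. The sketch below checks how boundary values are labeled. The `sugar` values are made up to hit the edges.

```python
import pandas as pd

# Made-up residual sugar values, including the exact bin edges 4.0 and 12.0.
sugar = pd.Series([0.9, 4.0, 4.1, 12.0, 15.5])
labels = pd.cut(sugar, bins=[0, 4, 12, 45],
                labels=['dry', 'semi-dry', 'semi-sweet'])
print(labels.tolist())
```

A value of exactly 4.0 is therefore labeled dry, which matches the "<= 4 g/L" definition above.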

# Create a new feature, total acid
df['total acid'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']

columns = df.columns.tolist()
columns.remove('sweetness')
columns

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'total acid']

sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.1)

column = columns[0:11] + ['total acid']
plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(x = 'quality',y = column[i], data = df,color = color[1],width = 0.6)
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()

print('Figure 7: Physicochemical Properties and Wine Quality by Boxplot')

[Figure 7: boxplots of each physicochemical property by wine quality]

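From the figure above:

Quality is positively correlated with citric acid, sulphates, and alcohol.
Quality is negatively correlated with volatile acidity, density, and pH.
Residual sugar, chlorides, and sulfur dioxide have little visible effect on quality.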
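The trends read off the boxplots can also be checked numerically by comparing per-quality medians, which is what the center line of each box shows. This is a minimal sketch on a made-up `toy` frame, not the real data; on the real frame the same call would be `df.groupby('quality')[column].median()`.

```python
import pandas as pd

# Made-up values that imitate the direction of the trends described above.
toy = pd.DataFrame({
    'quality': [3, 3, 5, 5, 7, 7],
    'alcohol': [9.0, 9.4, 9.8, 10.0, 11.5, 12.1],
    'volatile acidity': [0.90, 0.80, 0.60, 0.56, 0.40, 0.36],
})

# One row per quality level, one column per feature.
medians = toy.groupby('quality').median()
print(medians)
```

A median that rises steadily with quality suggests a positive relationship, and one that falls suggests a negative one.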

sns.set_style('dark')
plt.figure(figsize = (10,8))
mcorr = df[column].corr()
mask = np.zeros_like(mcorr,dtype = bool)  # np.bool was removed in recent NumPy releases
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')

print('Figure 8: Pairwise correlation plot')

[Figure 8: pairwise correlation heatmap]
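The heatmap hides the upper triangle and the diagonal, so each pair appears once. The same idea can be used to list the strongest pairs as a table. This is a sketch on synthetic columns `a`, `b`, and `c` (assumptions for illustration), not on the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': -a + 0.1 * rng.normal(size=200),   # strongly negatively related to a
    'c': rng.normal(size=200),              # unrelated noise
})

corr = toy.corr()
# Keep only the strict lower triangle, mirroring the heatmap mask.
keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(keep).stack().dropna()
# Rank pairs by the absolute size of the correlation.
strongest = pairs.abs().sort_values(ascending=False)
print(strongest)
```

On the real data, replacing `toy` with `df[column]` would give the pairs behind the heatmap in sorted order.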

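Density and Alcohol Content

Density and alcohol content are physically related, but the relationship is not linear. Density also depends on the other substances in the wine, although the author finds those correlations to be small.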

sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.4)

plt.figure(figsize = (6,4))
sns.regplot(x = 'density',y = 'alcohol',data = df,scatter_kws = {'s':10},color = color[1])
plt.xlabel('density',fontsize = 12)
plt.ylabel('alcohol',fontsize = 12)

plt.xlim(0.989,1.005)
plt.ylim(7,16)

print('Figure 9: Density vs Alcohol')

[Figure 9: density vs alcohol with a fitted regression line]
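One hedged way to see whether a relationship is monotonic but not linear is to compare the Pearson correlation with a rank-based (Spearman) correlation. The sketch below uses a synthetic exponential curve purely as an illustration; it is not a model of how density depends on alcohol.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(1.0, 5.0, 50))
y = np.exp(x)                        # monotonic but clearly nonlinear in x

pearson = x.corr(y)                  # measures linear association
spearman = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman
print(round(pearson, 3), round(spearman, 3))
```

A Spearman value well above the Pearson value hints that a straight line, as drawn by `regplot`, understates the relationship.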

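Acid Content and pH

pH and fixed acidity have a correlation of -0.68. Because fixed acidity makes up almost all of the summed acids, the total acid feature adds little information beyond fixed acidity itself.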
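The dominance of fixed acidity can be checked with simple arithmetic on the column means from the `df.describe()` table earlier in the article. This is only a rough check on the means, not a per-sample analysis.

```python
# Column means copied from the df.describe() table earlier in the article.
mean_fixed = 8.320
mean_volatile = 0.528
mean_citric = 0.271

# total acid is defined as the sum of the three acid columns.
mean_total = mean_fixed + mean_volatile + mean_citric
share_fixed = mean_fixed / mean_total
print(round(mean_total, 3), round(share_fixed, 3))
```

On average, fixed acidity accounts for roughly 91% of total acid, which is why the two features behave almost identically.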

column

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'total acid']

acidity_raleted = ['fixed acidity','volatile acidity','total sulfur dioxide','chlorides','total acid']

plt.figure(figsize = (10,6))

for i in range(5):
    plt.subplot(2,3,i+1)
    sns.regplot(x = 'pH',y = acidity_raleted[i],data = df,scatter_kws = {'s':10},color = color[1])
    plt.xlabel('PH',fontsize = 12)
    plt.ylabel(acidity_raleted[i],fontsize = 12)
    
plt.tight_layout()
print('Figure 10:The correlation between different acid and PH')

[Figure 10: regression plots of the acidity-related features against pH]

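Multivariate Analysis

The three features most strongly correlated with wine quality are alcohol, volatile acidity, and citric acid. Below we examine how these three features relate to quality.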

plt.style.use('ggplot')

plt.figure(figsize = (6,4))
sns.lmplot(x = 'alcohol',y = 'volatile acidity',hue = 'quality',data = df,fit_reg = False,scatter_kws = {'s':10},height = 5)  # 'size' was renamed to 'height' in seaborn 0.9
print('Figure 11-1:scatter plot between alcohol and volatile acidity and quality')

[Figure 11-1: scatter plot of alcohol vs volatile acidity, colored by quality]

sns.lmplot(x = 'alcohol', y = 'volatile acidity', col='quality', hue = 'quality', 
           data = df,fit_reg = False, height = 3,  aspect = 0.9, col_wrap=3,
           scatter_kws={'s':20})  # 'size' was renamed to 'height' in seaborn 0.9
print("Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality")

[Figure 11-2: scatter plots of alcohol vs volatile acidity, one panel per quality level]

pH, Fixed Acidity, and Citric Acid

pH is negatively correlated with both fixed acidity and citric acid.

sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)

plt.figure(figsize=(6,5))
cm = plt.get_cmap('RdBu')  # plt.cm.get_cmap was removed in recent matplotlib releases
sc = plt.scatter(df['fixed acidity'], df['citric acid'], c=df['pH'], vmin=2.6, vmax=4, s=15, cmap=cm)
bar = plt.colorbar(sc)
bar.set_label('pH', rotation = 0)
plt.xlabel('fixed acidity')
plt.ylabel('citric acid')
plt.xlim(4,18)
plt.ylim(0,1)
print('Figure 12: pH with Fixed Acidity and Citric Acid')

[Figure 12: fixed acidity vs citric acid, colored by pH]

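Summary

The three features that matter most for red wine quality are alcohol, volatile acidity, and citric acid. High-quality wines (quality above 7) and poor wines (quality below 4) look linearly separable by eye, whereas wines rated 5 and 6 are hard to separate linearly.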

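Modeling

Linear regression
Ensemble methods
Boosting methods
Model evaluation
Choosing model parameters

1. Splitting the dataset

1.1 Separating features and labels

1.2 Splitting into training and test sets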

df.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	sweetness	total acid
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry	8.10
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.997	3.20	0.68	9.8	5	dry	8.68
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	5	dry	8.60
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	6	dry	12.04
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry	8.10

# Data preprocessing

# Check the data for missing values
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
sweetness               0
total acid              0
dtype: int64

# One-hot encode the categorical sweetness column
sweetness = pd.get_dummies(df['sweetness'])
df = pd.concat([df,sweetness],axis = 1)
df.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	sweetness	total acid	dry	semi-dry	semi-sweet
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry	8.10	1	0	0
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.997	3.20	0.68	9.8	5	dry	8.68	1	0	0
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	5	dry	8.60	1	0	0
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	6	dry	12.04	1	0	0
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5	dry	8.10	1	0	0
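`pd.get_dummies` creates one indicator column per category, and those columns always sum to 1 in every row. For linear models with an intercept this makes the indicators redundant, and one common option is `drop_first=True`. The sketch below uses a made-up `sweet` series to show both variants; the article itself keeps all three columns.

```python
import pandas as pd

# Made-up sweetness labels covering all three categories.
sweet = pd.Series(['dry', 'semi-dry', 'dry', 'semi-sweet'], name='sweetness')

full = pd.get_dummies(sweet)                      # one column per category
reduced = pd.get_dummies(sweet, drop_first=True)  # drops the first category

full_cols = full.columns.tolist()
reduced_cols = reduced.columns.tolist()
row_sums = full.sum(axis=1).tolist()
print(full_cols, reduced_cols, row_sums)
```

Tree-based models such as random forest and GBDT are not affected by this redundancy, so keeping all columns is harmless for them.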

df = df.drop('sweetness',axis = 1)
labels = df['quality']
features = df.drop('quality',axis = 1)

# Split the original dataset
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size = 0.3,random_state = 0)

print('Training features shape:',train_features.shape)
print('Training labels shape:',train_labels.shape)
print('Test features shape:',test_features.shape)
print('Test labels shape:',test_labels.shape)

Training features shape: (1119, 15)
Training labels shape: (1119,)
Test features shape: (480, 15)
Test labels shape: (480,)
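The split sizes follow directly from `test_size = 0.3`: scikit-learn rounds the test share up, so 0.3 × 1599 = 479.7 becomes 480 test rows and the remaining 1119 go to training. The sketch below reproduces the shapes with placeholder arrays of zeros (an assumption standing in for the real features and labels).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((1599, 15))   # placeholder with the same shape as the feature table
y = np.zeros(1599)         # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```

The 15 columns are the 11 original properties, total acid, and the three sweetness indicator columns.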

from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(train_features,train_labels)

prediction = LR.predict(test_features)
prediction[:5]

array([5.75571751, 4.82871294, 6.59036909, 5.36644662, 5.89993476])

# Evaluate the model
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(test_labels,prediction))
print('Linear regression RMSE:',RMSE)

Linear regression RMSE: 0.6332278109768246
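The error reported here is the root mean squared error (RMSE): the square root of the average squared difference between true and predicted values. The sketch below computes it by hand on four made-up values and compares the result with the `mean_squared_error` route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up quality scores and predictions, for illustration only.
y_true = np.array([5, 5, 6, 7])
y_pred = np.array([5.5, 5.0, 6.0, 6.0])

# Errors are 0.5, 0, 0, -1; squared: 0.25, 0, 0, 1; mean: 0.3125.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the label, a value of about 0.63 means predictions are typically off by a bit more than half a quality point.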

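# Standardize the training and test features and compare the results
# Note: the test set is standardized below with its own mean and variance;
# the usual practice is to fit the scaler on the training set and reuse it.

from sklearn.preprocessing import StandardScaler
train_features_std = StandardScaler().fit_transform(train_features)
test_features_std = StandardScaler().fit_transform(test_features)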
LR = LinearRegression()
LR.fit(train_features_std,train_labels)
prediction = LR.predict(test_features_std)

# Check the prediction error
RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Linear regression RMSE (standardized features):',RMSE)

Linear regression RMSE (standardized features): 0.6351421172394885

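Comparing the raw data with the standardized data, the results are almost the same, so this dataset does not need standardization for linear regression. Ordinary least squares with an intercept gives the same predictions under any rescaling of the features; the small gap here comes from standardizing the test set with its own statistics rather than the training set's.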
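A common alternative, sketched below, is to put the scaler and the model in one pipeline so that the scaler is fitted on the training data only and then reused on the test data. The data here is synthetic (an assumption for illustration), and this is a swapped-in variant rather than the code the article ran.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales, loosely wine-like in magnitude.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 0.5, 3.3], scale=[1.0, 0.2, 0.15], size=(300, 3))
y = 0.3 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

raw_model = LinearRegression().fit(X_train, y_train)
# The pipeline fits StandardScaler on X_train and reuses it inside predict().
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

same = np.allclose(raw_model.predict(X_test), scaled_model.predict(X_test))
print(same)
```

With the scaler fitted once on the training set, the raw and standardized linear models agree up to floating-point error.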

Ensemble Method: Random Forest

from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
RF.fit(train_features,train_labels)
prediction = RF.predict(test_features)
RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Random forest RMSE:',RMSE)

Random forest RMSE: 0.6142407237123461

RF.get_params

<bound method BaseEstimator.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)>
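Writing `RF.get_params` without parentheses prints the bound method, as seen above. Calling it returns a plain dictionary, which is easier to inspect. The sketch below also sets `random_state`, an addition not in the original code, so that repeated runs give the same forest.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=0)

# Note the parentheses: this returns a dict of parameter names and values.
params = rf.get_params()
print(params['n_estimators'], params['max_depth'], params['random_state'])
```

The printed output above shows `n_estimators=10`, which was the default in older scikit-learn releases; since version 0.22 the default is 100, so rerunning the article's code today may give slightly different errors.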

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators':[100,200,300,400,500],
             'max_depth':[3,4,5,6],
             'min_samples_split':[2,3,4]}

RF = RandomForestRegressor()
grid = GridSearchCV(RF,param_grid = param_grid,scoring = 'neg_mean_squared_error',cv = 3,n_jobs = -1)
grid.fit(train_features,train_labels)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [3, 4, 5, 6], 'min_samples_split': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

grid.best_params_

{'max_depth': 6, 'min_samples_split': 2, 'n_estimators': 300}
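The cost of this grid search can be worked out before running it. The sketch below counts the candidate combinations with `ParameterGrid` and multiplies by the three cross-validation folds.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [3, 4, 5, 6],
              'min_samples_split': [2, 3, 4]}

n_candidates = len(ParameterGrid(param_grid))   # 5 * 4 * 3 combinations
n_fits = n_candidates * 3                       # cv = 3 folds per candidate
print(n_candidates, n_fits)
```

Because the search is scored with `neg_mean_squared_error`, `grid.best_score_` is a negative MSE; `np.sqrt(-grid.best_score_)` converts it to a cross-validated RMSE. Note also that the selected `max_depth` of 6 sits at the edge of the grid, so deeper trees were never tried.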

RF = RandomForestRegressor(n_estimators = 300,min_samples_split = 2,max_depth = 6)

RF.fit(train_features,train_labels)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=6,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

prediction = RF.predict(test_features)

RF_RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Random forest RMSE (tuned):',RF_RMSE)

Random forest RMSE (tuned): 0.6153424077044428

Ensemble Method: GBDT

from sklearn.ensemble import GradientBoostingRegressor

GBDT = GradientBoostingRegressor()
GBDT.fit(train_features,train_labels)
gbdt_prediction = GBDT.predict(test_features)
gbdt_RMSE = np.sqrt(mean_squared_error(gbdt_prediction,test_labels))

print('GBDT RMSE:',gbdt_RMSE)

GBDT RMSE: 0.6232190669430115
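To compare the models so far, the test-set errors printed in this article can be collected and sorted. The numbers below are copied from the outputs above; the dictionary keys are labels chosen here for readability.

```python
# Test-set RMSE values copied from the printed outputs in this article.
rmse = {
    'linear regression': 0.6332278109768246,
    'random forest (default)': 0.6142407237123461,
    'random forest (tuned)': 0.6153424077044428,
    'GBDT (default)': 0.6232190669430115,
}

# Print the models from lowest to highest error.
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f'{name:<26s}{value:.4f}')

best = min(rmse, key=rmse.get)
```

The differences are small and come from a single train/test split with unseeded models, so the ranking should be read as indicative rather than definitive. The tuned random forest is marginally worse than the default one on this split.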

GBDT.get_params

<bound method BaseEstimator.get_params of GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)>

Randomized Parameter Search with RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
GBDT = GradientBoostingRegressor()
# Candidate values for some of the GBDT parameters
learning_rate = [0.01,0.1,1,10]
max_depth = [3,4,5,6]
min_samples_leaf = [1,2,4]
min_samples_split = [2,5,10]
n_estimators = [int(x) for x in range(100,600,100)]

random_params_group = {'learning_rate':learning_rate,
                      'max_depth':max_depth,
                      'min_samples_leaf':min_samples_leaf,
                      'min_samples_split':min_samples_split,
                      'n_estimators':n_estimators}

random_model = RandomizedSearchCV(GBDT,param_distributions = random_params_group,n_iter = 100,
                                 scoring = 'neg_mean_squared_error',verbose = 2,n_jobs = -1,cv = 3,random_state = 0)
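The code above only constructs the search object. As a hedged sketch of what the next step might look like, the example below fits a much smaller randomized search on synthetic data and reads off the best parameters. The data, the reduced `small_grid`, and `n_iter=4` are assumptions chosen to keep the example fast; they are not the author's settings or results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=120)

# A deliberately small search space so the sketch runs quickly.
small_grid = {'learning_rate': [0.01, 0.1],
              'max_depth': [3, 4],
              'n_estimators': [20, 40]}

search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0),
                            param_distributions=small_grid, n_iter=4,
                            scoring='neg_mean_squared_error', cv=3,
                            random_state=0)
search.fit(X, y)

# best_score_ is a negative MSE, so flip the sign before taking the root.
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, cv_rmse)
```

With the article's own objects the equivalent calls would be `random_model.fit(train_features,train_labels)` followed by `random_model.best_params_`. The full search space there has 4 × 4 × 3 × 3 × 5 = 720 combinations, of which `n_iter = 100` are sampled.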