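Case Study: Red Wine Dataset Analysis

This article analyzes the red wine dataset and explores how features such as acidity and sweetness relate to wine quality. The visualizations show that alcohol and citric acid are positively correlated with quality, while volatile acidity, density, and pH are negatively correlated. A sweetness category is derived from residual sugar, and linear regression, random forest, and GBDT models are fitted, with random forest giving the lowest error. Hyperparameter tuning is then explored with grid search and randomized search.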

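Data source: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

The red wine dataset contains 1,599 samples and 12 columns. Eleven columns are physicochemical properties of the wine, and the quality column is the wine's quality score (on a 10-point scale).

First, import the required libraries and load the dataset.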

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('D:\\Py_dataset\\winequality-red.csv',sep = ';')
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0    0.997  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0    0.997  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0    0.998  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5
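
The `sep = ';'` argument only works if the file really is semicolon-delimited, and a copy of this file may use a different delimiter depending on where it was downloaded (an assumption worth checking on your own copy). A minimal sketch, using a hypothetical helper `guess_sep` that is not part of pandas, that inspects the header line before calling `pd.read_csv`:

```python
def guess_sep(header_line, candidates=(';', ',', '\t')):
    """Return the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)

# Illustrative header strings in two possible layouts (not read from disk)
semicolon_header = '"fixed acidity";"volatile acidity";"citric acid"'
comma_header = 'fixed acidity,volatile acidity,citric acid'

print(guess_sep(semicolon_header))  # ;
print(guess_sep(comma_header))      # ,
```

The result could then be passed on as `pd.read_csv(path, sep = guess_sep(first_line))`, where `first_line` is the first line of the file.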

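The 12 fields are described below:

No  Attribute                         Data type  Description
1   fixed acidity                     Numeric    Fixed (non-volatile) acidity
2   volatile acidity                  Numeric    Volatile acidity
3   citric acid                       Numeric    Citric acid
4   residual sugar                    Numeric    Residual sugar
5   chlorides                         Numeric    Chlorides
6   free sulfur dioxide               Numeric    Free sulfur dioxide
7   total sulfur dioxide              Numeric    Total sulfur dioxide
8   density                           Numeric    Density
9   pH                                Numeric    pH (acidity/alkalinity)
10  sulphates                         Numeric    Sulphates
11  alcohol                           Numeric    Alcohol
12  quality (score between 0 and 10)  Numeric    Wine quality score (0 to 10)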

Data Exploration and Visualization

df.shape #  (1599, 12)
df.info() # no missing values
'''
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
'''
df.describe().T
                       count    mean     std    min     25%     50%     75%      max
fixed acidity         1599.0   8.320   1.741  4.600   7.100   7.900   9.200   15.900
volatile acidity      1599.0   0.528   0.179  0.120   0.390   0.520   0.640    1.580
citric acid           1599.0   0.271   0.195  0.000   0.090   0.260   0.420    1.000
residual sugar        1599.0   2.539   1.410  0.900   1.900   2.200   2.600   15.500
chlorides             1599.0   0.087   0.047  0.012   0.070   0.079   0.090    0.611
free sulfur dioxide   1599.0  15.875  10.460  1.000   7.000  14.000  21.000   72.000
total sulfur dioxide  1599.0  46.468  32.895  6.000  22.000  38.000  62.000  289.000
density               1599.0   0.997   0.002  0.990   0.996   0.997   0.998    1.004
pH                    1599.0   3.311   0.154  2.740   3.210   3.310   3.400    4.010
sulphates             1599.0   0.658   0.170  0.330   0.550   0.620   0.730    2.000
alcohol               1599.0  10.423   1.066  8.400   9.500  10.200  11.100   14.900
quality               1599.0   5.636   0.808  3.000   5.000   6.000   6.000    8.000

Histograms of each variable's distribution:

# Set the color palette
color = sns.color_palette()
column= df.columns.tolist()
fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    df[column[i]].hist(bins = 100,color = color[3])
    plt.xlabel(column[i],fontsize = 12)
    plt.ylabel('Frequency',fontsize = 12)
plt.tight_layout()

The histograms give a rough picture of how each feature is distributed.

fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(y = df[column[i]],width = 0.5,color = color[4])
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()

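Analysis of Acidity-Related Features

The acidity-related features in this dataset are 'fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', and 'pH'. The first six all influence pH. Because pH is measured on a logarithmic scale, the histograms below are drawn on the base-10 logarithm of these six features.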

acidityfeat = ['fixed acidity',
               'volatile acidity',
               'citric acid',
               'chlorides',
               'free sulfur dioxide',
               'total sulfur dioxide']

fig = plt.figure(figsize = (10,6))
for i in range(6):
    plt.subplot(2,3,i+1)
    v = np.log10(np.clip(df[acidityfeat[i]].values,a_min = 0.001,a_max = None))
    plt.hist(v,bins = 50,color = color[0])
    plt.xlabel('log('+ acidityfeat[i] +')',fontsize = 12)
    plt.ylabel('Frequency')    
plt.tight_layout()
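
The `np.clip` call in the loop matters because `citric acid` has a minimum of 0 in this dataset (see the summary statistics), and the logarithm of 0 is undefined. A small sketch with made-up values showing the effect of the 0.001 floor:

```python
import numpy as np

# Made-up concentrations, including an exact zero like the minimum of 'citric acid'
values = np.array([0.0, 0.01, 0.26, 1.0])

# Raise anything below 0.001 to 0.001 so the logarithm stays finite
clipped = np.clip(values, a_min=0.001, a_max=None)
log_values = np.log10(clipped)

print(log_values)  # the zero maps to log10(0.001) = -3 instead of -inf
```

In the histogram, samples that were exactly zero therefore pile up in a separate bar at -3.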

plt.figure(figsize = (6,3))

bins = 10**(np.linspace(-2,2))
plt.hist(df['fixed acidity'],bins = bins, edgecolor = 'k',label = 'fixed acidity')
plt.hist(df['volatile acidity'],bins = bins, edgecolor = 'k',label = 'volatile acidity')
plt.hist(df['citric acid'],bins = bins, alpha = 0.8,edgecolor = 'k',label = 'citric acid')

plt.xscale('log')
plt.xlabel('Acid concentration(g/dm^3)')
plt.ylabel('Frequency')
plt.title('Histogram of Acid Concentration')
plt.legend()
plt.tight_layout()

Sweetness

Residual sugar largely determines how sweet a wine is: dry (<= 4 g/L), semi-dry (4-12 g/L), semi-sweet (12-45 g/L), and sweet (>= 45 g/L). This dataset contains no sweet wines.

df['sweetness'] = pd.cut(df['residual sugar'],bins = [0,4,12,45],labels = ['dry','semi-dry','semi-sweet'])
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  sweetness
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0    0.997  3.20       0.68      9.8        5        dry
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0    0.997  3.26       0.65      9.8        5        dry
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0    0.998  3.16       0.58      9.8        6        dry
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry
plt.figure(figsize = (6,4))
df['sweetness'].value_counts().plot(kind = 'bar',color = color[0])
plt.xticks(rotation = 0)
plt.xlabel('sweetness')
plt.ylabel('frequency')

plt.tight_layout()
print('Figure 5')
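
By default `pd.cut` builds right-closed intervals, so the bins used for `sweetness` are (0, 4], (4, 12], and (12, 45]. A short sketch with made-up residual sugar values placed on and around the edges shows where boundary values land:

```python
import pandas as pd

# Made-up residual sugar values (g/L) chosen to sit on and around the bin edges
sugar = pd.Series([1.9, 4.0, 4.1, 12.0, 15.5])

# Same edges and labels as in the article; intervals are (0, 4], (4, 12], (12, 45]
sweetness = pd.cut(sugar, bins=[0, 4, 12, 45],
                   labels=['dry', 'semi-dry', 'semi-sweet'])

print(sweetness.tolist())
# ['dry', 'dry', 'semi-dry', 'semi-dry', 'semi-sweet']
```

A value of exactly 0 would fall outside the first interval and become NaN; that does not occur here because the smallest residual sugar in the dataset is 0.9.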

# Create a new feature, total acid
df['total acid'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']

columns = df.columns.tolist()
columns.remove('sweetness')
columns

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'total acid']
sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.1)

column = columns[0:11] + ['total acid']
plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(x = 'quality',y = column[i], data = df,color = color[1],width = 0.6)
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()

print('Figure 7: Physicochemical Properties and Wine Quality by Boxplot')

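The boxplots above show that:

  • Wine quality is positively correlated with citric acid, sulphates, and alcohol.
  • Wine quality is negatively correlated with volatile acidity, density, and pH.
  • Residual sugar, chlorides, and sulfur dioxide have little visible effect on wine quality.
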
sns.set_style('dark')
plt.figure(figsize = (10,8))
mcorr = df[column].corr()
mask = np.zeros_like(mcorr,dtype = bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')

print('Figure 8: Pairwise correlation plot')
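
The heatmap is built from `column`, which does not include `quality`, so it shows correlations among the features only. A sketch, on a small made-up frame rather than the real data, of how each feature's correlation with `quality` could be ranked:

```python
import pandas as pd

# Tiny made-up frame standing in for df; the numbers are illustrative only
toy = pd.DataFrame({
    'alcohol':          [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    'volatile acidity': [0.70, 0.88, 0.60, 0.45, 0.40, 0.30],
    'quality':          [5, 5, 6, 6, 7, 8],
})

# Pearson correlation of every column with quality, strongest positive first
corr_with_quality = (toy.corr()['quality']
                     .drop('quality')
                     .sort_values(ascending=False))
print(corr_with_quality)
```

On the real data the same idea would read `df[column + ['quality']].corr()['quality']`, which keeps the non-numeric `sweetness` column out of the calculation.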

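Density and Alcohol Content

Density and alcohol content are physically related, but the relationship is not linear. Density also depends on the amounts of other substances in the wine, although those correlations are weak.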

sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.4)

plt.figure(figsize = (6,4))
sns.regplot(x = 'density',y = 'alcohol',data = df,scatter_kws = {'s':10},color = color[1])
plt.xlabel('density',fontsize = 12)
plt.ylabel('alcohol',fontsize = 12)

plt.xlim(0.989,1.005)
plt.ylim(7,16)

print('Figure 9: Density vs Alcohol')

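Acid Content and pH

pH and fixed acidity have a correlation of -0.68. Because fixed acidity makes up by far the largest share of the total, the total acid feature adds little information beyond fixed acidity itself.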
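
The weight of fixed acidity in `total acid` can be checked with simple arithmetic on the column means reported by `df.describe()` earlier. This is only a back-of-the-envelope sketch on the means, not a per-sample analysis:

```python
# Mean concentrations copied from the df.describe() output earlier
mean_fixed = 8.320
mean_volatile = 0.528
mean_citric = 0.271

mean_total = mean_fixed + mean_volatile + mean_citric
fixed_share = mean_fixed / mean_total

print(f'mean total acid: {mean_total:.3f}')          # 9.119
print(f'share of fixed acidity: {fixed_share:.1%}')  # 91.2%
```

With roughly nine tenths of the sum coming from one column, `total acid` behaves almost like a shifted copy of `fixed acidity`.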

column
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'total acid']
acidity_raleted = ['fixed acidity','volatile acidity','total sulfur dioxide','chlorides','total acid']

plt.figure(figsize = (10,6))

for i in range(5):
    plt.subplot(2,3,i+1)
    sns.regplot(x = 'pH',y = acidity_raleted[i],data = df,scatter_kws = {'s':10},color = color[1])
    plt.xlabel('pH',fontsize = 12)
    plt.ylabel(acidity_raleted[i],fontsize = 12)
    
plt.tight_layout()
print('Figure 10: Correlation between the acidity-related features and pH')

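Multivariate Analysis

The three features most strongly correlated with wine quality are alcohol content, volatile acidity, and citric acid. The plots below look at how these features relate to quality.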

plt.style.use('ggplot')

plt.figure(figsize = (6,4))
sns.lmplot(x = 'alcohol',y = 'volatile acidity',hue = 'quality',data = df,fit_reg = False,scatter_kws = {'s':10},height = 5)
print('Figure 11-1:scatter plot between alcohol and volatile acidity and quality')

sns.lmplot(x = 'alcohol', y = 'volatile acidity', col='quality', hue = 'quality', 
           data = df,fit_reg = False, height = 3,  aspect = 0.9, col_wrap=3,
           scatter_kws={'s':20})
print("Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality")

pH, Fixed Acidity, and Citric Acid

pH is negatively correlated with both fixed acidity and citric acid.

sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)

plt.figure(figsize=(6,5))
cm = plt.get_cmap('RdBu')
sc = plt.scatter(df['fixed acidity'], df['citric acid'], c=df['pH'], vmin=2.6, vmax=4, s=15, cmap=cm)
bar = plt.colorbar(sc)
bar.set_label('pH', rotation = 0)
plt.xlabel('fixed acidity')
plt.ylabel('citric acid')
plt.xlim(4,18)
plt.ylim(0,1)
print('Figure 12: pH with Fixed Acidity and Citric Acid')

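Summary

The three features that matter most for wine quality are alcohol content, volatile acidity, and citric acid. High-quality wines (quality above 7) and low-quality wines (quality below 4) look linearly separable by eye, whereas wines rated 5 and 6 are hard to separate linearly.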

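Data Modeling

  • Linear regression
  • Ensemble methods
  • Boosting
  • Model evaluation
  • Choosing model parameters

1. Splitting the dataset

1.1 Separate the features from the labels

1.2 Split into training and test sets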

df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  sweetness  total acid
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry        8.10
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0    0.997  3.20       0.68      9.8        5        dry        8.68
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0    0.997  3.26       0.65      9.8        5        dry        8.60
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0    0.998  3.16       0.58      9.8        6        dry       12.04
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry        8.10
# Data preprocessing

# Check for missing values
df.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
sweetness               0
total acid              0
dtype: int64
# Convert the categorical sweetness column into integer indicator (dummy) columns
sweetness = pd.get_dummies(df['sweetness'])
df = pd.concat([df,sweetness],axis = 1)
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  sweetness  total acid  dry  semi-dry  semi-sweet
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry        8.10    1         0           0
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0    0.997  3.20       0.68      9.8        5        dry        8.68    1         0           0
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0    0.997  3.26       0.65      9.8        5        dry        8.60    1         0           0
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0    0.998  3.16       0.58      9.8        6        dry       12.04    1         0           0
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0    0.998  3.51       0.56      9.4        5        dry        8.10    1         0           0
df = df.drop('sweetness',axis = 1)
labels = df['quality']
features = df.drop('quality',axis = 1)
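
Keeping all three indicator columns means they always sum to 1, and `total acid` is an exact sum of three other columns, so the feature matrix contains redundant columns. That is harmless for tree-based models, but it makes linear regression coefficients non-unique and hard to interpret. A sketch of one common alternative, `drop_first=True`, on made-up values (the analysis here keeps all columns):

```python
import pandas as pd

# Made-up residual sugar values, binned the same way as the sweetness feature
sugar = pd.Series([1.9, 2.6, 5.0, 13.0])
sweetness = pd.cut(sugar, bins=[0, 4, 12, 45],
                   labels=['dry', 'semi-dry', 'semi-sweet'])

# drop_first=True omits the first category ('dry'), which becomes the baseline
dummies = pd.get_dummies(sweetness, drop_first=True, dtype=int)
print(dummies)
```

A row of all zeros then means "dry", so no information is lost.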

# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size = 0.3,random_state = 0)

print('Training features shape:',train_features.shape)
print('Training labels shape:',train_labels.shape)
print('Test features shape:',test_features.shape)
print('Test labels shape:',test_labels.shape)
Training features shape: (1119, 15)
Training labels shape: (1119,)
Test features shape: (480, 15)
Test labels shape: (480,)
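
The 1119/480 split follows from `test_size = 0.3` applied to 1599 rows. A quick sketch with stand-in arrays of the same length (the values themselves are placeholders) reproduces the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the wine data
X = np.arange(1599).reshape(-1, 1)
y = np.zeros(1599)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 0.3 * 1599 = 479.7, which scikit-learn rounds up to 480 test rows
print(X_train.shape, X_test.shape)  # (1119, 1) (480, 1)
```
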
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(train_features,train_labels)

prediction = LR.predict(test_features)
prediction[:5]

array([5.75571751, 4.82871294, 6.59036909, 5.36644662, 5.89993476])

# Evaluate the model
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(test_labels,prediction))
print('Linear regression prediction error (RMSE):',RMSE)

Linear regression prediction error (RMSE): 0.6332278109768246
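
RMSE is the square root of the mean squared difference between labels and predictions. An error of about 0.63 is easier to judge against a naive baseline: a constant prediction at the mean has an RMSE equal to the standard deviation of the labels, and `quality` has a standard deviation of 0.808 over the full dataset. A sketch with made-up scores, using a hypothetical helper `rmse`:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up quality scores and predictions, for illustration only
y_true = [5, 6, 5, 7]
y_pred = [5.5, 6.0, 4.5, 6.0]

# A constant prediction at the mean of y_true serves as a naive baseline
mean_score = sum(y_true) / len(y_true)
baseline = rmse(y_true, [mean_score] * len(y_true))

print(round(rmse(y_true, y_pred), 4))  # 0.6124
print(round(baseline, 4))              # 0.8292
```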

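# Standardize the training and test features and compare the result.
# The scaler is fitted on the training set only and then reused for the test set.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_features_std = scaler.fit_transform(train_features)
test_features_std = scaler.transform(test_features)
LR = LinearRegression()
LR.fit(train_features_std,train_labels)
prediction = LR.predict(test_features_std)

# Check the prediction error
RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Linear regression prediction error (RMSE):',RMSE)

Linear regression prediction error (RMSE): 0.6351421172394885

The figure above was obtained with the test set standardized separately (a second StandardScaler().fit_transform(test_features)), which accounts for the small difference from the unscaled model. With the training-set scaler reused as shown, ordinary least squares predictions should be essentially unchanged by feature scaling. Either way the results differ very little, so this dataset does not need standardization for linear regression.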
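
Ordinary least squares predictions do not depend on feature scaling when one scaler, fitted on the training set, is applied to both sets. A sketch on synthetic data (not the wine dataset) that checks this directly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; not the wine dataset
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 0.5, 3.3], scale=[1.0, 0.2, 0.15], size=(200, 3))
y = 0.3 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Model on raw features
pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)

# Model on standardized features: fit the scaler on the training set only
scaler = StandardScaler().fit(X_train)
pred_std = (LinearRegression()
            .fit(scaler.transform(X_train), y_train)
            .predict(scaler.transform(X_test)))

print(np.allclose(pred_raw, pred_std))  # True
```

Scaling still matters for regularized or distance-based models; the invariance shown here is specific to plain least squares.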

Ensemble method: random forest
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
RF.fit(train_features,train_labels)
prediction = RF.predict(test_features)
RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Random forest prediction error (RMSE):',RMSE)

Random forest prediction error (RMSE): 0.6142407237123461
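
`RandomForestRegressor()` is created without a `random_state`, so the error will vary slightly from run to run and may not match the printed value exactly. A sketch on synthetic data (an assumption standing in for the wine features) showing that fixing the seed makes the fit repeatable:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the wine features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Two forests built with the same seed produce the same predictions
pred_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y).predict(X)
pred_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y).predict(X)

print(np.allclose(pred_a, pred_b))  # True
```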

RF.get_params
<bound method BaseEstimator.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)>
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators':[100,200,300,400,500],
             'max_depth':[3,4,5,6],
             'min_samples_split':[2,3,4]}

RF = RandomForestRegressor()
grid = GridSearchCV(RF,param_grid = param_grid,scoring = 'neg_mean_squared_error',cv = 3,n_jobs = -1)
grid.fit(train_features,train_labels)
GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [3, 4, 5, 6], 'min_samples_split': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)
grid.best_params_

{'max_depth': 6, 'min_samples_split': 2, 'n_estimators': 300}
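
Two details of the grid search are easy to miss: how many models it trains, and that its score is a negated MSE rather than an RMSE. The sketch below counts the candidates in the grid with `ParameterGrid` and converts a score to RMSE with a hypothetical helper `neg_mse_to_rmse`; the score value passed in is made up.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [3, 4, 5, 6],
              'min_samples_split': [2, 3, 4]}

# 5 * 4 * 3 = 60 candidate settings, each fitted once per fold
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # cv = 3
print(n_candidates, n_fits)  # 60 180

def neg_mse_to_rmse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    return float(np.sqrt(-neg_mse))

# Made-up score value, only to show the conversion
print(neg_mse_to_rmse(-0.25))  # 0.5
```

On the fitted search, `neg_mse_to_rmse(grid.best_score_)` would give the cross-validated RMSE of the best setting, which can be compared with the test-set RMSE values reported here.
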
RF = RandomForestRegressor(n_estimators = 300,min_samples_split = 2,max_depth = 6)

RF.fit(train_features,train_labels)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=6,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
prediction = RF.predict(test_features)

RF_RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Random forest prediction error (RMSE):',RF_RMSE)

Random forest prediction error (RMSE): 0.6153424077044428

Ensemble method: GBDT
from sklearn.ensemble import GradientBoostingRegressor

GBDT = GradientBoostingRegressor()
GBDT.fit(train_features,train_labels)
gbdt_prediction = GBDT.predict(test_features)
gbdt_RMSE = np.sqrt(mean_squared_error(gbdt_prediction,test_labels))

print('GBDT prediction error (RMSE):',gbdt_RMSE)

GBDT prediction error (RMSE): 0.6232190669430115

GBDT.get_params
<bound method BaseEstimator.get_params of GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)>
Randomized parameter search with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
GBDT = GradientBoostingRegressor()
# Candidate values for some of the GBDT hyperparameters
learning_rate = [0.01,0.1,1,10]
max_depth = [3,4,5,6]
min_samples_leaf = [1,2,4]
min_samples_split = [2,5,10]
n_estimators = [int(x) for x in range(100,600,100)]

random_params_group = {'learning_rate':learning_rate,
                      'max_depth':max_depth,
                      'min_samples_leaf':min_samples_leaf,
                      'min_samples_split':min_samples_split,
                      'n_estimators':n_estimators}

random_model = RandomizedSearchCV(GBDT,param_distributions = random_params_group,n_iter = 100,
                                 scoring = 'neg_mean_squared_error',verbose = 2,n_jobs = -1,cv = 3,random_state = 0)
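
The search object above is defined, but the listing stops before it is fitted. A hedged sketch of how such a search might be run and read out, on synthetic data with a much smaller budget so that it finishes quickly. The data, the reduced search space, and `n_iter=5` are assumptions for illustration, not the settings used above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for train_features / train_labels
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Reduced search space so the example runs in seconds
params = {'learning_rate': [0.01, 0.1],
          'max_depth': [3, 4],
          'n_estimators': [50, 100]}

search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0),
                            param_distributions=params, n_iter=5,
                            scoring='neg_mean_squared_error',
                            cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print('cross-validated RMSE:', np.sqrt(-search.best_score_))
```

With the full space above (4 * 4 * 3 * 3 * 5 = 720 combinations), `n_iter = 100` samples only a subset, which is the point of a randomized search. Learning rates of 1 and 10 are unusually large for gradient boosting and may be worth dropping from the candidates.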