White Wine Quality Prediction Project

Note: This article is sourced from the professional AI community Venus AI.

For more AI content, see the original site ([www.aideeplearning.cn])

Project Background

White wine is a popular beverage whose quality draws wide attention. Understanding and predicting white wine quality helps both wineries and consumers make better-informed choices. Quality assessment is a complex task because it is influenced by many chemical properties. We therefore set out to build a machine learning model that predicts white wine quality with high accuracy.

Project Goals

The main goal of this project is to build a model that predicts white wine quality with high accuracy. Using a wine's chemical properties, the model can quickly and accurately predict its quality rating, helping wineries refine their production process and ensure consistently high-quality wine.

Applications

The potential applications of this project are broad; some of them include:

  1. Wineries: wineries can use the model to assess the quality of the wines they produce and make improvements based on the predictions.
  2. Consumers: consumers can use the model to choose wines that suit their taste.
  3. Restaurants and bars: restaurants and bars can use the model when selecting wines for their menus.
  4. Wine judges: wine reviewers can use the model to support their evaluations.

Dataset Description

The dataset contains several chemical properties of white wines along with a quality rating for each wine. The main features are listed below; a loading sketch follows the list:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality (the quality rating, our target variable)
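
This is the classic white wine subset of the UCI Wine Quality data. A minimal loading sketch, assuming the standard UCI repository path (use a local copy of winequality-white.csv if you already have one):

import pandas as pd

# Assumption: the file is hosted at the standard UCI repository location
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")
white_wine_df = pd.read_csv(url, sep=";")  # values are semicolon-separated
print(white_wine_df.shape)                 # expected: (4898, 12)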

Model Selection and Dependencies

To predict white wine quality, we train and compare the following machine learning models:

  • K-Nearest Neighbors (KNN)
  • Decision Tree Classifier
  • Naive Bayes
  • Random Forest Classifier
  • Gradient Boosting Classifier

Dependencies

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • scikit-learn
  • imblearn

Evaluation Metrics

We use the following metrics to evaluate model performance; a helper wrapping them is sketched after the list:

  • Confusion matrix (confusion_matrix)
  • Accuracy (accuracy_score)
  • Classification report (classification_report)
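
Since the same three metrics are printed after every model below, a small helper can wrap them. This is just a sketch; the notebook inlines the equivalent code at each step instead:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def evaluate_model(y_true, y_pred):
    # Print the three metrics used throughout this project
    print(f"Confusion Matrix:\n{confusion_matrix(y_true, y_pred)}")
    print(f"Classification Report:\n{classification_report(y_true, y_pred)}")
    print(f"Accuracy: {round(accuracy_score(y_true, y_pred), 2)}")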

Code Implementation

Project: White Wine Quality Prediction

Problem statement:

We are given a dataset of white wine quality ratings together with the chemical components that determine quality. We need to build a model that predicts wine quality with the highest possible accuracy.

Importing Libraries and Loading the Dataset

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
# Load the dataset
white_wine_df = pd.read_csv("winequality-white.csv",sep=";")
white_wine_df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
white_wine_df.shape
(4898, 12)

white_wine_df.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality
count    4898.000000       4898.000000  4898.000000     4898.000000  4898.000000          4898.000000           4898.000000  4898.000000  4898.000000  4898.000000  4898.000000  4898.000000
mean        6.854788          0.278241     0.334192        6.391415     0.045772            35.308085            138.360657     0.994027     3.188267     0.489847    10.514267     5.877909
std         0.843868          0.100795     0.121020        5.072058     0.021848            17.007137             42.498065     0.002991     0.151001     0.114126     1.230621     0.885639
min         3.800000          0.080000     0.000000        0.600000     0.009000             2.000000              9.000000     0.987110     2.720000     0.220000     8.000000     3.000000
25%         6.300000          0.210000     0.270000        1.700000     0.036000            23.000000            108.000000     0.991723     3.090000     0.410000     9.500000     5.000000
50%         6.800000          0.260000     0.320000        5.200000     0.043000            34.000000            134.000000     0.993740     3.180000     0.470000    10.400000     6.000000
75%         7.300000          0.320000     0.390000        9.900000     0.050000            46.000000            167.000000     0.996100     3.280000     0.550000    11.400000     6.000000
max        14.200000          1.100000     1.660000       65.800000     0.346000           289.000000            440.000000     1.038980     3.820000     1.080000    14.200000     9.000000
# Summarize the dataset
white_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
# Check for missing values
white_wine_df.isna().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

There are no missing values.

Exploratory Data Analysis (EDA)

Distribution of White Wine Quality Ratings

white_wine_df['quality'].value_counts()
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64
  • Most of the white wines in our data are of medium quality, rated between 5 and 7; the countplot sketch below visualizes this imbalance.

  • Only a few wines have very high or very low quality.
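
A minimal visualization sketch, reusing the seaborn and matplotlib imports from the top of the notebook:

# Visualize the class imbalance in the quality ratings
sns.countplot(x='quality', data=white_wine_df, color='teal')
plt.title('Distribution of white wine quality ratings')
plt.show()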

Plotting the Distributions of the Individual Features


import math

# The last column (quality) is the target, so exclude it from the feature plots
num_features = len(white_wine_df.columns) - 1

# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(num_features / 3)
fig, axes = plt.subplots(num_rows, 3, figsize=(10, 8))

i = 0
for row in range(num_rows):
    for col in range(3):
        if i < num_features:
            sns.histplot(white_wine_df[white_wine_df.columns[i]], ax=axes[row][col], color='teal', kde=True)
            i += 1

# Adjust layout
fig.tight_layout()

Almost all of the feature distributions are approximately normal; the one-line skewness check below puts numbers on this impression.
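
A minimal sketch (values near 0 indicate an approximately symmetric distribution):

# Quantify how far each feature's distribution deviates from symmetry
print(white_wine_df.skew().round(2))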

Next, we label the wines' quality as low, medium, or high and observe how the concentration of each component affects white wine quality.

white_wine_df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
# Label encoding: group the numeric quality scores into low / medium / high
white_wine_df_new = white_wine_df.replace({'quality':{3:'low',4:'low',5:'medium',6:'medium',7:'medium',8:'high',9:'high'}})
white_wine_df_new.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8   medium
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5   medium
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1   medium
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9   medium
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9   medium

We use boxplots to observe the concentration of each wine component across the quality tiers.

# Get the columns to plot
columns_to_plot = white_wine_df_new.describe().columns

# Calculate the number of rows and columns for the subplot grid
n_cols = 3  # You can change this to fit your screen better
n_rows = (len(columns_to_plot) + n_cols - 1) // n_cols  # Calculate rows needed

# Create a figure and a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))

# Flatten the axes array for easy iteration
axes_flat = axes.flatten()

# Iterate over the columns and create a boxplot in each subplot
for i, col in enumerate(columns_to_plot):
    sns.boxplot(x='quality', y=col, data=white_wine_df_new, ax=axes_flat[i])

# Hide any unused subplots
for i in range(len(columns_to_plot), len(axes_flat)):
    axes_flat[i].set_visible(False)

# Adjust layout for better display
fig.tight_layout()

# Display the plot
plt.show()

Observations:

  • High-quality white wines have higher alcohol content and pH, and lower chloride content.
  • White wines with high fixed acidity and high volatile acidity tend to be of lower quality; the groupby sketch below cross-checks these readings numerically.
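
A small sketch averaging a few key components per quality tier (the column selection is illustrative):

# Mean concentration of selected components for each quality tier
print(white_wine_df_new.groupby('quality')[
    ['alcohol', 'pH', 'chlorides', 'fixed acidity', 'volatile acidity']
].mean())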

相关性

# Correlation between the variables
plt.figure(figsize=(15,8))
correlation = white_wine_df.corr()
sns.heatmap((correlation), annot=True, cmap='coolwarm')

Observations:

  • Two strongly correlated pairs stand out:
  1. density vs. residual sugar
  2. density vs. alcohol
  • There is a moderate correlation between free sulfur dioxide and total sulfur dioxide.
  • Density is positively correlated with residual sugar, and density is the more informative of the two for predicting quality (residual sugar correlates only very weakly with quality). We can therefore drop the residual sugar feature to remove this collinearity.

  • Density is negatively correlated with alcohol, but both correlate well with quality, so dropping either would be a poor choice. Instead, we combine the two into a new derived feature.

  • To remove the collinearity between free sulfur dioxide and total sulfur dioxide, we create a new feature, sulfur dioxide, by summing them.

# Eliminate multicollinearity
white_wine_df.drop(columns='residual sugar', inplace=True)
white_wine_df['density_per_unit_alcohol'] = white_wine_df['density']/white_wine_df['alcohol']
white_wine_df['sulfur dioxide'] = white_wine_df['free sulfur dioxide']+white_wine_df['total sulfur dioxide']
# Drop the original columns
white_wine_df.drop(columns=['free sulfur dioxide','total sulfur dioxide','density','alcohol'], inplace=True)
white_wine_df.head()
   fixed acidity  volatile acidity  citric acid  chlorides    pH  sulphates  quality  density_per_unit_alcohol  sulfur dioxide
0            7.0              0.27         0.36      0.045  3.00       0.45        6                  0.113750           215.0
1            6.3              0.30         0.34      0.049  3.30       0.49        6                  0.104632           146.0
2            8.1              0.28         0.40      0.050  3.26       0.44        6                  0.098525           127.0
3            7.2              0.23         0.32      0.058  3.19       0.40        6                  0.100566           233.0
4            7.2              0.23         0.32      0.058  3.19       0.40        6                  0.100566           233.0
# Correlation between the variables (after feature engineering)
plt.figure(figsize=(15,8))
correlation = white_wine_df.corr()
sns.heatmap((correlation), annot=True, cmap='coolwarm')

Our data now shows no significant correlations between the features.
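
As a quick sanity check, we can compute the largest remaining off-diagonal correlation; a short sketch reusing the numpy import from the top of the notebook:

# Largest absolute correlation between two distinct features
corr_abs = white_wine_df.corr().abs()
np.fill_diagonal(corr_abs.values, 0)  # ignore the trivial self-correlations
print(corr_abs.max().max())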

Data Preprocessing


# Separate features and target
X = white_wine_df[white_wine_df.columns.drop('quality')]
y = white_wine_df['quality']
y.value_counts()
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

Balancing the Data with SMOTE

Our data is highly imbalanced: a model cannot learn much from classes with very few samples, which ultimately lowers its accuracy. We therefore balance the data using the Synthetic Minority Oversampling Technique (SMOTE).

# Balance the classes with SMOTE
oversample = SMOTE(k_neighbors=4)
X,y = oversample.fit_resample(X,y)
y.value_counts()
6    2198
5    2198
7    2198
8    2198
4    2198
3    2198
9    2198
Name: quality, dtype: int64

The data is now balanced.

Train-Test Split

# Split into train and test sets, then scale the features
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
mmscaler = MinMaxScaler()
X_train = mmscaler.fit_transform(X_train)
X_test = mmscaler.transform(X_test)

Model Selection

Since the target variable is categorical with multiple classes, we need to apply models suited to multi-class classification.

Models suitable for this scenario:

  1. K-Nearest Neighbors
  2. Decision Tree Classifier
  3. Naive Bayes
  4. Random Forest Classifier
  5. Gradient Boosting Classifier

Note: to detect overfitting, i.e., a model that memorizes patterns instead of generalizing them, we use k-fold cross-validation for every model; the sketch below shows how to read the train/validation gap from a fitted grid search.
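
Because every grid search below is created with return_train_score=True, the gap between mean train and validation scores can be read straight from cv_results_. A minimal sketch (gscv here stands for any fitted search, e.g. knn_gscv):

import pandas as pd

def overfit_report(gscv):
    # A large train/validation gap for a candidate suggests overfitting
    res = pd.DataFrame(gscv.cv_results_)
    res['gap'] = res['mean_train_score'] - res['mean_test_score']
    return res[['params', 'mean_train_score', 'mean_test_score', 'gap']]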

1. K-Nearest Neighbors

# Apply k-fold cross-validation via GridSearchCV and fit the model
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = {'n_neighbors': np.arange(1, 20,2)}
knn_clf = KNeighborsClassifier()
knn_gscv = GridSearchCV(knn_clf, param_grid, cv=5, return_train_score=True, verbose=1, scoring='accuracy')
knn_gscv.fit(X_train,y_train)
y_pred=knn_gscv.predict(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[440   0   0   0   0   0   0]
 [  3 417  13   3   3   1   0]
 [  2  23 324  56  27   7   0]
 [  4  23  71 260  58  24   0]
 [  2   9  11  44 338  35   1]
 [  0   0   0   7   5 428   0]
 [  0   0   0   0   0   0 439]]
Classification Report:
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       440
           4       0.88      0.95      0.91       440
           5       0.77      0.74      0.76       439
           6       0.70      0.59      0.64       440
           7       0.78      0.77      0.78       440
           8       0.86      0.97      0.92       440
           9       1.00      1.00      1.00       439

    accuracy                           0.86      3078
   macro avg       0.85      0.86      0.86      3078
weighted avg       0.85      0.86      0.86      3078

Accuracy:0.86

2. Decision Tree Classifier

# Apply k-fold cross-validation via GridSearchCV and fit the model
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,100]}
dt_clf=DecisionTreeClassifier()
dt_gscv = GridSearchCV(dt_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
dt_gscv.fit(X_train,y_train)
y_pred=dt_gscv.predict(X_test)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
dt_gscv.best_params_
{'criterion': 'gini', 'max_depth': 100}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[428   0   6   2   4   0   0]
 [  6 366  43  18   6   1   0]
 [  9  41 285  75  24   5   0]
 [  3  19  81 251  60  26   0]
 [  2   9  22  55 320  31   1]
 [  1   7   7  24  28 372   1]
 [  0   0   0   0   2   0 437]]
Classification Report:
              precision    recall  f1-score   support

           3       0.95      0.97      0.96       440
           4       0.83      0.83      0.83       440
           5       0.64      0.65      0.65       439
           6       0.59      0.57      0.58       440
           7       0.72      0.73      0.72       440
           8       0.86      0.85      0.85       440
           9       1.00      1.00      1.00       439

    accuracy                           0.80      3078
   macro avg       0.80      0.80      0.80      3078
weighted avg       0.80      0.80      0.80      3078

Accuracy:0.8

3. Naive Bayes

# Apply k-fold cross-validation via GridSearchCV and fit the model
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
gnb_clf = GaussianNB()
gnb_gscv = GridSearchCV(gnb_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
gnb_gscv.fit(X_train,y_train)
y_pred=gnb_gscv.predict(X_test)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[237  16  67  13  70  10  27]
 [ 36 165 126  27  59  27   0]
 [ 26  44 234  37  85  12   1]
 [ 14  17 104  61 187  44  13]
 [  1   8  44  12 260  72  43]
 [  3   7  50  13 210 122  35]
 [  4   0   0   0   0  13 422]]
Classification Report:
              precision    recall  f1-score   support

           3       0.74      0.54      0.62       440
           4       0.64      0.38      0.47       440
           5       0.37      0.53      0.44       439
           6       0.37      0.14      0.20       440
           7       0.30      0.59      0.40       440
           8       0.41      0.28      0.33       440
           9       0.78      0.96      0.86       439

    accuracy                           0.49      3078
   macro avg       0.52      0.49      0.48      3078
weighted avg       0.52      0.49      0.48      3078

Accuracy:0.49

4. Random Forest Classifier

# Apply k-fold cross-validation via GridSearchCV and fit the model
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators':[100,200,500,1000], 'max_depth':[50,60,70,80,90,100]}
rf_clf=RandomForestClassifier()
rf_gscv = GridSearchCV(rf_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
rf_gscv.fit(X_train,y_train)
y_pred=rf_gscv.predict(X_test)
rf_gscv.best_params_
{'max_depth': 60, 'n_estimators': 500}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[440   0   0   0   0   0   0]
 [  1 426   6   3   3   1   0]
 [  2  31 340  54  12   0   0]
 [  3   9  80 275  62  11   0]
 [  1   2  15  41 368  12   1]
 [  0   0   1   0   8 431   0]
 [  0   0   0   0   0   0 439]]
Classification Report:
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       440
           4       0.91      0.97      0.94       440
           5       0.77      0.77      0.77       439
           6       0.74      0.62      0.68       440
           7       0.81      0.84      0.82       440
           8       0.95      0.98      0.96       440
           9       1.00      1.00      1.00       439

    accuracy                           0.88      3078
   macro avg       0.88      0.88      0.88      3078
weighted avg       0.88      0.88      0.88      3078

Accuracy:0.88

5. Gradient Boosting Classifier

# Apply k-fold cross-validation via GridSearchCV and fit the model
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {
    "learning_rate": [0.01,0.05,0.1],
    "n_estimators":[50,100,500]
    }
gb_clf=GradientBoostingClassifier()
gb_gscv = GridSearchCV(gb_clf, param_grid,cv=5,return_train_score=True, verbose=2,scoring='accuracy')
gb_gscv.fit(X_train,y_train)
y_pred=gb_gscv.predict(X_test)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  11.1s
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  12.0s
...
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
gb_gscv.best_params_
{'learning_rate': 0.1, 'n_estimators': 500}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[439   0   1   0   0   0   0]
 [  3 396  23  12   5   1   0]
 [  5  38 293  91  12   0   0]
 [  5   8  86 269  59  12   1]
 [  5  10  25  52 321  27   0]
 [  1   4   2   7  27 399   0]
 [  0   0   0   0   1   0 438]]
Classification Report:
              precision    recall  f1-score   support

           3       0.96      1.00      0.98       440
           4       0.87      0.90      0.88       440
           5       0.68      0.67      0.67       439
           6       0.62      0.61      0.62       440
           7       0.76      0.73      0.74       440
           8       0.91      0.91      0.91       440
           9       1.00      1.00      1.00       439

    accuracy                           0.83      3078
   macro avg       0.83      0.83      0.83      3078
weighted avg       0.83      0.83      0.83      3078

Accuracy:0.83

Conclusions:

The Random Forest model has the highest accuracy, along with the best precision and recall for quality labels 5, 6, and 7. All models show high precision and recall on quality labels 3, 4, 8, and 9, because those classes contain many near-duplicate rows synthesized by SMOTE to balance the data, and the models easily memorize these repeated patterns, inflating the scores.

The precision, recall, and F1-scores on quality labels 5, 6, and 7 are therefore the most meaningful, and they are highest for the Random Forest model.

We conclude that the Random Forest model fits our data best, with an accuracy of 0.88.
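
To wrap up, the accuracies reported above can be collected for a side-by-side comparison, and the winning model persisted for reuse; a closing sketch (the output file name is illustrative):

import joblib

# Test-set accuracies reported in the sections above
accuracies = {'KNN': 0.86, 'Decision Tree': 0.80, 'Naive Bayes': 0.49,
              'Random Forest': 0.88, 'Gradient Boosting': 0.83}
print(max(accuracies, key=accuracies.get))  # -> Random Forest

# Save the best estimator found by the random forest grid search
joblib.dump(rf_gscv.best_estimator_, 'white_wine_rf.joblib')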

Code and Dataset Download

For details, see the White Wine Quality Prediction project at VenusAI (aideeplearning.cn).

Conclusion

On our dataset, the Random Forest classifier performed best, reaching the highest accuracy of 0.88 along with the best precision and recall values. This means the model can predict white wine quality ratings with high accuracy, providing useful information to wineries and consumers alike. The Random Forest classifier is the best choice for this task.
