White Wine Quality Prediction Project

Note: this article is reproduced from the professional AI community Venus AI.

For more AI content, see the original site (www.aideeplearning.cn).

Project Background

White wine is a popular beverage, and its quality receives wide attention. Understanding and predicting white wine quality can help wineries and consumers make more informed choices. Quality assessment is a complex task because quality is influenced by many chemical properties. We therefore decided to build a machine learning model that predicts white wine quality with high accuracy.

Project Goals

The main goal of this project is to build a model that predicts white wine quality with high accuracy. Using the chemical properties of white wines, our model will be able to predict a wine's quality rating quickly and accurately. This can help wineries improve their production process and ensure they produce high-quality wine.

Applications

This project has a wide range of potential applications, including:

  1. Wineries: wineries can use the model to assess the quality of the wines they produce and improve their process based on the predictions.
  2. Consumers: consumers can use the model to choose wines that suit their taste.
  3. Restaurants and bars: restaurants and bars can use the model to select wines for their lists.
  4. Wine judges: wine judges can use the model to assist their evaluations.

Dataset Description

The dataset we use contains several chemical properties of white wines together with a quality rating for each wine. The main features are:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality (the rating to predict)

Model Selection and Dependencies

To predict white wine quality, we train and compare the following machine learning models:

  • K-Nearest Neighbors
  • Decision Tree Classifier
  • Naive Bayes
  • Random Forest Classifier
  • Gradient Boosting Classifier

Dependencies

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • scikit-learn
  • imblearn

Evaluation Metrics

We evaluate model performance with the following metrics:

  • confusion matrix (confusion_matrix)
  • accuracy (accuracy_score)
  • classification report (classification_report)
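These three calls recur after every model below; for convenience they could be wrapped in a small helper. The sketch below is an assumption (the notebook itself repeats the calls inline), and the name evaluate_model is hypothetical.

# Hedged sketch: bundle the three evaluation metrics used throughout.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def evaluate_model(y_true, y_pred):
    # Print confusion matrix, per-class report, and overall accuracy.
    print(f"Confusion Matrix:\n{confusion_matrix(y_true, y_pred)}")
    print(f"Classification Report:\n{classification_report(y_true, y_pred)}")
    print(f"Accuracy: {round(accuracy_score(y_true, y_pred), 2)}")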

Code Implementation

Project: White Wine Quality Prediction

Problem statement:

We are given a dataset containing white wine quality ratings and the constituents that determine quality. We need to build a model that predicts wine quality with the highest possible accuracy.

Import Libraries and Load the Dataset

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
# Load the dataset
white_wine_df = pd.read_csv("winequality-white.csv",sep=";")
white_wine_df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
white_wine_df.shape
(4898, 12)

white_wine_df.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000
# Summarize the dataset
white_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
# Check for null values
white_wine_df.isna().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

There are no null values.

Exploratory Data Analysis (EDA)

Distribution of White Wine Quality Ratings

white_wine_df['quality'].value_counts()
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64
  • Most of the white wines in our data are of medium quality, between 5 and 7.

  • Only a few wines are of very high or very low quality.
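The imbalance is easier to see visually; a minimal sketch (not in the original notebook), using the seaborn and matplotlib imports from above:

# Sketch: visualize the class imbalance as a count plot.
sns.countplot(x='quality', data=white_wine_df, color='teal')
plt.title('White wine quality distribution')
plt.show()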

Plotting the Distributions of the Individual Features


import math

# Assuming the last column is not needed for plotting
num_features = len(white_wine_df.columns) - 1

# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(num_features / 3)
fig, axes = plt.subplots(num_rows, 3, figsize=(10, 8))

i = 0
for row in range(num_rows):
    for col in range(3):
        if i < num_features:
            sns.histplot(white_wine_df[white_wine_df.columns[i]], ax=axes[row][col], color='teal', kde=True)
            i += 1

# Adjust layout
fig.tight_layout()

Almost all of the distribution plots follow an approximately normal distribution.
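This visual impression can be checked numerically. A small sketch (assumed, not in the original): DataFrame.skew() returns per-column skewness, where values near zero indicate rough symmetry; given the summary statistics above, residual sugar and chlorides can be expected to show clear right skew.

# Sketch: quantify the skewness of each feature (0 ≈ symmetric).
print(white_wine_df.skew())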

Next we label wine quality as low, medium, or high and observe how the concentration of each constituent affects white wine quality.

white_wine_df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
# Label encoding
white_wine_df_new = white_wine_df.replace({'quality':{3:'low',4:'low',5:'medium',6:'medium',7:'medium',8:'high',9:'high'}})
white_wine_df_new.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 medium
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 medium
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 medium
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 medium
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 medium

We use box plots to examine the concentration of each wine constituent by quality level.

# Get the columns to plot
columns_to_plot = white_wine_df_new.describe().columns

# Calculate the number of rows and columns for the subplot grid
n_cols = 3  # You can change this to fit your screen better
n_rows = (len(columns_to_plot) + n_cols - 1) // n_cols  # Calculate rows needed

# Create a figure and a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))

# Flatten the axes array for easy iteration
axes_flat = axes.flatten()

# Iterate over the columns and create a boxplot in each subplot
for i, col in enumerate(columns_to_plot):
    sns.boxplot(x='quality', y=col, data=white_wine_df_new, ax=axes_flat[i])

# Hide any unused subplots
for i in range(len(columns_to_plot), len(axes_flat)):
    axes_flat[i].set_visible(False)

# Adjust layout for better display
fig.tight_layout()

# Display the plot
plt.show()

Observations:

  • High-quality white wines have higher alcohol content and pH, and lower chlorides.
  • White wines with high fixed acidity and volatile acidity tend to be of lower quality.

Correlation

## Correlation between variables
plt.figure(figsize=(15,8))
correlation = white_wine_df.corr()
sns.heatmap((correlation), annot=True, cmap='coolwarm')
<Axes: >

Observations:

  • Two strong correlations are observed:
    1. density vs. residual sugar
    2. density vs. alcohol
  • There is a moderate correlation between free sulfur dioxide and total sulfur dioxide.
  • Density is positively correlated with residual sugar, and density correlates with quality much more strongly than residual sugar does. This means we can drop the residual sugar feature to remove the collinearity.

  • Density is negatively correlated with alcohol, but both correlate well with quality, so dropping either one is not a good option. Instead, we combine the two to derive a new feature.

  • To remove the collinearity between free sulfur dioxide and total sulfur dioxide, we create a new feature, sulfur dioxide, by summing them.

# Remove multicollinearity
white_wine_df.drop(columns='residual sugar',axis=1,inplace=True)
white_wine_df['density_per_unit_alcohol'] = white_wine_df['density']/white_wine_df['alcohol']
white_wine_df['sulfur dioxide'] = white_wine_df['free sulfur dioxide']+white_wine_df['total sulfur dioxide']
# Drop the original columns
white_wine_df.drop(columns=['free sulfur dioxide','total sulfur dioxide','density','alcohol'],axis=1,inplace=True)
white_wine_df.head()
fixed acidity volatile acidity citric acid chlorides pH sulphates quality density_per_unit_alcohol sulfur dioxide
0 7.0 0.27 0.36 0.045 3.00 0.45 6 0.113750 215.0
1 6.3 0.30 0.34 0.049 3.30 0.49 6 0.104632 146.0
2 8.1 0.28 0.40 0.050 3.26 0.44 6 0.098525 127.0
3 7.2 0.23 0.32 0.058 3.19 0.40 6 0.100566 233.0
4 7.2 0.23 0.32 0.058 3.19 0.40 6 0.100566 233.0
## Correlation between variables
plt.figure(figsize=(15,8))
correlation = white_wine_df.corr()
sns.heatmap((correlation), annot=True, cmap='coolwarm')
<Axes: >

Our data no longer shows any significant correlation.
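This can be verified programmatically; a quick sketch (assumed, not in the original):

# Sketch: largest absolute off-diagonal correlation among the predictors.
corr = white_wine_df.drop(columns='quality').corr().abs().to_numpy()
np.fill_diagonal(corr, 0)
print(corr.max())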

Data Preprocessing


# Split into features and target
X = white_wine_df[white_wine_df.columns.drop('quality')]
y = white_wine_df['quality']
y.value_counts()
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

Balancing the Data with SMOTE

Because our data is heavily imbalanced, the model cannot learn effectively from classes with very few samples, which ultimately lowers its accuracy. We therefore balance the data with the Synthetic Minority Over-sampling Technique (SMOTE).

# Balance with SMOTE
oversample = SMOTE(k_neighbors=4)
X,y = oversample.fit_resample(X,y)
y.value_counts()
6    2198
5    2198
7    2198
8    2198
4    2198
3    2198
9    2198
Name: quality, dtype: int64

The data is now balanced.
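One caveat worth flagging: this notebook oversamples before splitting, so synthetic points built from neighbors that later land in the test set can leak information and inflate test scores. A common alternative, sketched here as an assumption (variable names are hypothetical), is to split first and oversample the training portion only:

# Hedged sketch: split first, then apply SMOTE to the training set only.
# Here X, y would be the original, unbalanced data.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
X_tr, y_tr = SMOTE(k_neighbors=4).fit_resample(X_tr, y_tr)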

Train-Test Split

# Train-test split and feature scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
mmscaler = MinMaxScaler()
X_train = mmscaler.fit_transform(X_train)
X_test = mmscaler.transform(X_test)

Model Selection

Since the target variable is categorical with multiple classes, we need a model suited to multi-class classification.

Models suitable for this scenario:

  1. K-Nearest Neighbors
  2. Decision Tree Classifier
  3. Naive Bayes
  4. Random Forest Classifier
  5. Gradient Boosting

Note: to detect overfitting, i.e., a model failing to generalize the patterns it has learned, we use k-fold cross-validation for every model.
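Every grid search below is created with return_train_score=True, so the train/validation gap can be read directly from cv_results_. A sketch, where search stands for any fitted GridSearchCV object from the sections that follow:

# Sketch: compare mean train vs. validation accuracy per candidate;
# a large gap signals overfitting.
cv_results = pd.DataFrame(search.cv_results_)
print(cv_results[['params', 'mean_train_score', 'mean_test_score']])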

1. K-Nearest Neighbors

# Apply k-fold cross-validation with GridSearchCV and fit the model.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = {'n_neighbors': np.arange(1, 20,2)}
knn_clf = KNeighborsClassifier()
knn_gscv = GridSearchCV(knn_clf, param_grid, cv=5, return_train_score=True, verbose=1, scoring='accuracy')
knn_gscv.fit(X_train,y_train)
y_pred=knn_gscv.predict(X_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[440   0   0   0   0   0   0]
 [  3 417  13   3   3   1   0]
 [  2  23 324  56  27   7   0]
 [  4  23  71 260  58  24   0]
 [  2   9  11  44 338  35   1]
 [  0   0   0   7   5 428   0]
 [  0   0   0   0   0   0 439]]
Classification Report:
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       440
           4       0.88      0.95      0.91       440
           5       0.77      0.74      0.76       439
           6       0.70      0.59      0.64       440
           7       0.78      0.77      0.78       440
           8       0.86      0.97      0.92       440
           9       1.00      1.00      1.00       439

    accuracy                           0.86      3078
   macro avg       0.85      0.86      0.86      3078
weighted avg       0.85      0.86      0.86      3078

Accuracy:0.86

2. Decision Tree Classifier

# Apply k-fold cross-validation with GridSearchCV and fit the model.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,100]}
dt_clf=DecisionTreeClassifier()
dt_gscv = GridSearchCV(dt_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
dt_gscv.fit(X_train,y_train)
y_pred=dt_gscv.predict(X_test)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
dt_gscv.best_params_
{'criterion': 'gini', 'max_depth': 100}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[428   0   6   2   4   0   0]
 [  6 366  43  18   6   1   0]
 [  9  41 285  75  24   5   0]
 [  3  19  81 251  60  26   0]
 [  2   9  22  55 320  31   1]
 [  1   7   7  24  28 372   1]
 [  0   0   0   0   2   0 437]]
Classification Report:
              precision    recall  f1-score   support

           3       0.95      0.97      0.96       440
           4       0.83      0.83      0.83       440
           5       0.64      0.65      0.65       439
           6       0.59      0.57      0.58       440
           7       0.72      0.73      0.72       440
           8       0.86      0.85      0.85       440
           9       1.00      1.00      1.00       439

    accuracy                           0.80      3078
   macro avg       0.80      0.80      0.80      3078
weighted avg       0.80      0.80      0.80      3078

Accuracy:0.8

3. Naive Bayes

# Apply k-fold cross-validation with GridSearchCV and fit the model.
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
gnb_clf = GaussianNB()
gnb_gscv = GridSearchCV(gnb_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
gnb_gscv.fit(X_train,y_train)
y_pred=gnb_gscv.predict(X_test)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[237  16  67  13  70  10  27]
 [ 36 165 126  27  59  27   0]
 [ 26  44 234  37  85  12   1]
 [ 14  17 104  61 187  44  13]
 [  1   8  44  12 260  72  43]
 [  3   7  50  13 210 122  35]
 [  4   0   0   0   0  13 422]]
Classification Report:
              precision    recall  f1-score   support

           3       0.74      0.54      0.62       440
           4       0.64      0.38      0.47       440
           5       0.37      0.53      0.44       439
           6       0.37      0.14      0.20       440
           7       0.30      0.59      0.40       440
           8       0.41      0.28      0.33       440
           9       0.78      0.96      0.86       439

    accuracy                           0.49      3078
   macro avg       0.52      0.49      0.48      3078
weighted avg       0.52      0.49      0.48      3078

Accuracy:0.49

4. Random Forest Classifier

# Apply k-fold cross-validation with GridSearchCV and fit the model.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators':[100,200,500,1000], 'max_depth':[50,60,70,80,90,100]}
rf_clf=RandomForestClassifier()
rf_gscv = GridSearchCV(rf_clf, param_grid,cv=5,return_train_score=True, verbose=1,scoring='accuracy')
rf_gscv.fit(X_train,y_train)
y_pred=rf_gscv.predict(X_test)
rf_gscv.best_params_
{'max_depth': 60, 'n_estimators': 500}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[440   0   0   0   0   0   0]
 [  1 426   6   3   3   1   0]
 [  2  31 340  54  12   0   0]
 [  3   9  80 275  62  11   0]
 [  1   2  15  41 368  12   1]
 [  0   0   1   0   8 431   0]
 [  0   0   0   0   0   0 439]]
Classification Report:
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       440
           4       0.91      0.97      0.94       440
           5       0.77      0.77      0.77       439
           6       0.74      0.62      0.68       440
           7       0.81      0.84      0.82       440
           8       0.95      0.98      0.96       440
           9       1.00      1.00      1.00       439

    accuracy                           0.88      3078
   macro avg       0.88      0.88      0.88      3078
weighted avg       0.88      0.88      0.88      3078

Accuracy:0.88

5. Gradient Boosting

# Apply k-fold cross-validation with GridSearchCV and fit the model.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {
    "learning_rate": [0.01,0.05,0.1],
    "n_estimators":[50,100,500]
    }
gb_clf=GradientBoostingClassifier()
gb_gscv = GridSearchCV(gb_clf, param_grid,cv=5,return_train_score=True, verbose=2,scoring='accuracy')
gb_gscv.fit(X_train,y_train)
y_pred=gb_gscv.predict(X_test)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  11.1s
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  12.0s
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  11.0s
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  11.0s
[CV] END ................learning_rate=0.01, n_estimators=50; total time=  11.0s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=  21.8s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=  21.8s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=  21.7s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=  21.9s
[CV] END ...............learning_rate=0.01, n_estimators=100; total time=  21.8s
[CV] END ...............learning_rate=0.01, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.01, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.01, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.01, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.01, n_estimators=500; total time= 1.8min
[CV] END ................learning_rate=0.05, n_estimators=50; total time=  10.9s
[CV] END ................learning_rate=0.05, n_estimators=50; total time=  10.9s
[CV] END ................learning_rate=0.05, n_estimators=50; total time=  10.9s
[CV] END ................learning_rate=0.05, n_estimators=50; total time=  11.0s
[CV] END ................learning_rate=0.05, n_estimators=50; total time=  10.9s
[CV] END ...............learning_rate=0.05, n_estimators=100; total time=  21.9s
[CV] END ...............learning_rate=0.05, n_estimators=100; total time=  21.9s
[CV] END ...............learning_rate=0.05, n_estimators=100; total time=  21.8s
[CV] END ...............learning_rate=0.05, n_estimators=100; total time=  22.7s
[CV] END ...............learning_rate=0.05, n_estimators=100; total time=  21.8s
[CV] END ...............learning_rate=0.05, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.05, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.05, n_estimators=500; total time= 1.8min
[CV] END ...............learning_rate=0.05, n_estimators=500; total time= 1.7min
[CV] END ...............learning_rate=0.05, n_estimators=500; total time= 1.7min
[CV] END .................learning_rate=0.1, n_estimators=50; total time=  10.3s
[CV] END .................learning_rate=0.1, n_estimators=50; total time=  10.3s
[CV] END .................learning_rate=0.1, n_estimators=50; total time=  10.2s
[CV] END .................learning_rate=0.1, n_estimators=50; total time=  10.8s
[CV] END .................learning_rate=0.1, n_estimators=50; total time=  10.4s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=  20.5s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=  20.4s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=  20.3s
[CV] END ................learning_rate=0.1, n_estimators=100; total time=  20.4s
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
[CV] END ................learning_rate=0.1, n_estimators=500; total time= 1.7min
gb_gscv.best_params_
{'learning_rate': 0.1, 'n_estimators': 500}
# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{result}")
result1 = classification_report(y_test, y_pred)
print(f"Classification Report:\n{result1}")
result2 = accuracy_score(y_test,y_pred)
print(f"Accuracy:{round(result2,2)}\n\n")
Confusion Matrix:
[[439   0   1   0   0   0   0]
 [  3 396  23  12   5   1   0]
 [  5  38 293  91  12   0   0]
 [  5   8  86 269  59  12   1]
 [  5  10  25  52 321  27   0]
 [  1   4   2   7  27 399   0]
 [  0   0   0   0   1   0 438]]
Classification Report:
              precision    recall  f1-score   support

           3       0.96      1.00      0.98       440
           4       0.87      0.90      0.88       440
           5       0.68      0.67      0.67       439
           6       0.62      0.61      0.62       440
           7       0.76      0.73      0.74       440
           8       0.91      0.91      0.91       440
           9       1.00      1.00      1.00       439

    accuracy                           0.83      3078
   macro avg       0.83      0.83      0.83      3078
weighted avg       0.83      0.83      0.83      3078

Accuracy:0.83
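Summarizing the test-set accuracies reported above:

  • K-Nearest Neighbors: 0.86
  • Decision Tree Classifier: 0.80
  • Naive Bayes: 0.49
  • Random Forest Classifier: 0.88
  • Gradient Boosting Classifier: 0.83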

Conclusions:

The random forest model has the highest accuracy, and the highest precision and recall for quality labels 5, 6, and 7. All models show high precision and recall for quality labels 3, 4, 8, and 9, because those classes contain many near-duplicate rows synthesized by SMOTE to balance the data, and the models readily memorize the duplicated patterns, which inflates those scores.

The precision, recall, and F1-scores for quality labels 5, 6, and 7 are therefore the ones that matter most, and they are highest for the random forest model.

We conclude that the random forest model fits our data best, with an accuracy of 0.88.
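As a possible final step (not part of the original notebook), the winning model can be persisted for reuse; the file name is an assumption:

# Hedged sketch: save the best random forest found by the grid search.
import joblib
joblib.dump(rf_gscv.best_estimator_, 'white_wine_rf.joblib')  # hypothetical path
# Later: model = joblib.load('white_wine_rf.joblib')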

Code and Dataset Download

For details, see the White Wine Quality Prediction project on VenusAI (aideeplearning.cn).

Conclusion

On our dataset, the random forest classifier performed best, achieving the highest accuracy of 0.88 together with the best precision and recall. This means our model can predict white wine quality ratings with high accuracy, providing useful information for wineries and consumers alike. The random forest classifier is the best choice for this task.
