机器学习案例--淡水资源检测

小卡尼

已于 2024-02-27 02:08:27 修改

阅读量1.1k

点赞数 26

文章标签：机器学习人工智能

于 2024-02-22 20:16:21 首次发布

本文链接：https://blog.csdn.net/m0_62412733/article/details/136239439

版权

本文介绍了如何使用InteloneAPIAI分析工具包中的相关库，如Modin、daal4py和sklearnex，对淡水质量数据进行预处理、特征工程和模型训练，最终通过XGBoost实现预测，并展示了随机网格搜索优化模型的过程，包括数据清洗、特征选择和性能评估。

摘要由CSDN通过智能技术生成

1.预测淡水质量

1.1问题描述：

淡水是我们最重要的和最稀缺的自然资源之一，仅占地球总水量的3%.它几乎触及我们日常生活中的方方面面，从饮用，游泳和沐浴到生产食品，电力和我们每天使用的产品。获得安全卫生的供水不仅对人类生活至关重要，而且对正在遭受干旱，污染和气温升高影响的周边生态系统的生存也至关重要。

1.2预期解决方案：

参考英特尔类似的处理方案，并提出自己的想法，基于提供的淡水资源相关数据，训练一个或多个机器学习模型,有效预测淡水质量————这里使用推理时间与F1分数作为评分的主要依据。

1.3要求：

需要使用英特尔® oneAPI AI分析工具包中数据分析与机器学习相关的优化库，基于参考资料的方案，进行淡水质量检测的实现。

1.4数据集：

https://filerepo.idzcn.com/hack2023/datasetab75fb3.zip
数据集特征中包含：
Index：编号
pH：水资源的pH值
Iron：含铁量
Nitrate：硝酸盐含量
Chloride：氯化物含量
Lead：铅含量
Zinc：锌含量
Color：淡水颜色
Turbidity：浑浊度
Fluoride：氟化物含量
Copper：铜含量
Odor：气味
Sulfate Conductivity：硫酸导电率
Chlorine Manganese ：氯锰含量
Total Dissolved Solids ：总溶解固体
Source Water ：来源
Temperature ：温度
Air Temperature ：空气温度
Month ：取水的月份
Day ：取水的日期
Time of Day ：取水的时间
Target：是否为可使用淡水

数据集中正负样本数据分布较为一致。所以最后采用了混淆矩阵来测量准确性。

2.项目实施简介：

对数据集首先进行数据探索，数据预处理，然后划分数据集，利用机器学习建立模型，并进行淡水质量的预测。

2.1实施方案使用的相关技术：

modin

Modin 使用Ray、Dask或Unidist提供一种轻松的方式来加速pandas 、脚本和库。与其他分布式 DataFrame 库不同，Modin 提供与现有 pandas 代码的无缝集成和兼容性。即使使用 DataFrame 构造函数也是相同的。（加速读取数据时间）

daal4py

可将训练好的xgboost转换为 daal4py 模型，以便进一步改进预测时间性能, 利用底层的英特尔® 高级矢量扩展指令集（英特尔® AVX-512）硬件，最大限度地提高英特尔® 至强® 处理器上的梯度提升性能。（未使用转换前，直接使用predict最优情况为21秒）

sklearnex

借助面向 Scikit-learn* 的英特尔® 扩展，加速 Scikit-learn ，并且仍然完全符合所有 Scikit-Learn API 和算法。英特尔® Extension for Scikit-learn是一款免费软件 AI 加速器，可为各种应用程序带来超过 10-100 倍的加速。

面向 Scikit-learn* 的英特尔® 扩展为您提供了一种加速现有 scikit-learn 代码的方法。加速是通过打补丁实现的：用扩展提供的优化版本替换现有的 scikit-learn 算法。

2.2数据探索与处理

数据读取与分析：

df = pd.read_csv('data/waterdata.csv' ) 
print("Data shape: {}\n".format(df.shape))
display(df.head())

并且查看前部分的数据：
数据集数据

df.info()

因为数据中存在object类型数据，这对于模型训练不利，所以这里使用了相应方法将其因子化转换为了数值型数据，这样所有的数据都可以进行相关的分析处理了

display(df.head())
factor = pd.factorize(df['Color'])
print(factor)
df.Color = factor[0]
factor = pd.factorize(df['Source'])
print(factor)
df.Source = factor[0]
factor = pd.factorize(df['Month'])
print(factor)
df.Month = factor[0]

而粗略的查看数据就能看出数据集中存在空值，重复值，需要进行数据处理，再来查看一些数学分析，其中包括数据总数量，平均值，最小值等

df.describe()

在这里插入图片描述
正负样本数据比例查看：
可以看出正负样本数据比例较为一致，所以这里就不采用相关取样的方法来使得样本平衡。

在这里插入图片描述

特征相关性分析：

利用以下代码实现相关性的展示：

bar = df.corr()['Target'].abs().sort_values(ascending=False)[1:]

plt.bar(bar.index, bar, width=0.5)
# 设置figsize的大小
pos_list = np.arange(len(df.columns))
params = {
    'figure.figsize': '20, 18'
}
plt.rcParams.update(params)
plt.xticks(bar.index, bar.index, rotation=-60, fontsize=10)
plt.show()

在这里插入图片描述
可以看见后续的特征数据的相关性都比较差，所以这里采取的做法为删除相关特征以及对应的列数据

删除代码为：

df = df.drop(
    columns=['Index', 'Day', 'Time of Day', 'Month', 'Water Temperature', 'Source', 'Conductivity', 'Air Temperature'])

缺失值，偏差值，重复值处理：

前置中我们可以很明显的看见数据集中存在缺失值，如下利用代码进行查看：

display(df.isna().sum())
missing = df.isna().sum().sum()
duplicates = df.duplicated().sum()
print("\n 共{:,.0f} 缺失值 in the data.".format(missing))
print("共{:,.0f} 重复值 in the data.".format(duplicates))

在这里插入图片描述
可以看出缺失值与重复值确实是挺多的，然后本项目采用填充上下值的平均值的方法进行处理缺失值，并删除重复值，相关代码如下：

df = df.fillna(df.interpolate())
df = df.drop_duplicates()

这是在来查看数据集：
这时就没有缺失值与重复值了。
在这里插入图片描述
偏差值处理：
通过以下代码对偏差值进行处理，通过检查皮尔逊相关系数与缺失值比例是否大于20%，数值型特征方差是否小于0.1来进行删除特征操作。

from scipy.stats import pearsonr

variables = df.columns
df = df

var = df.var()
numeric = df.columns
df = df.fillna(df.interpolate())
for i in range(0, len(var) - 1):
    if var[i] <= 0.1:  # 方差大于10％
        print(variables[i])
        df = df.drop(numeric[i],axis=1)
variables = df.columns

for i in range(0, len(variables)):
    x = df[variables[i]]
    y = df[variables[-1]]
    if pearsonr(x, y)[1] > 0.05:
        print(variables[i])
        df = df.drop(variables[i],axis=1)

variables = df.columns
print(variables)
print(len(variables))

结果如下：
在这里插入图片描述
然后删除Lead所对应的列数据。

数据转换：

通过直方图的形式查看连续性数据的分布，只有类似于正态分布的数据分布才是符合现实规律的数据，并且这有利于模型的训练。
通过以下代码查看直方图：

df[float_cols].hist(bins=50,figsize=(16,12))

结果如下：
在这里插入图片描述
既有符合正态分布的，也有比较不符合的，然后对不符合正态分布的特征数据进行数据转换：

# 针对不规则分布的变量进行非线性变换，一般进行log
log_col = ['Iron', 'Zinc', 'Turbidity', 'Copper', 'Manganese']
show_col = []
for col in log_col:
    df[col + '_log'] = np.log(df[col])
    show_col.append(col + '_log')
    
df[show_col].hist(bins=50,figsize=(16,12))

并将转换结果显示：
在这里插入图片描述
好了，到这里时，基本的数据探索与处理已经结束了。

3.模型训练

模型采用XGBoost，并使用随机网格搜索（RandomizedSearchCV）进行优化。

3.1数据切分

代码如下：

这里使用train_test_split()函数将X和y分割成训练集和测试集，其中测试集占总数据的20%。同时，使用StandardScaler()函数对训练集和测试集进行标准化处理，使得每个特征的均值为0，标准差为1，以提高模型的性能。

X = df.drop( ['Target','Iron', 'Zinc', 'Turbidity', 'Copper', 'Manganese'], axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Train Shape: {}".format(X_train_scaled.shape))
print("Test Shape: {}".format(X_test_scaled.shape))

X_train, X_test = X_train_scaled, X_test_scaled

3.2参数设置：

网格搜索模型参数如下：

param_grid = {
    'max_depth': [10, 15, 20],
    "gamma": [0, 1, 2], 
    "subsample": [0.5, 1], 
    "colsample_bytree": [0.3, 0.5, 1], 
    'min_child_weight': [4, 6, 8], 
    "n_estimators": [10,50, 80, 100],
    "alpha": [3, 4, 5]
}

其中网络参数包括

树的最大深度(max_depth),其取值范围是10，15，20;

评估器数量参数(n _estimators),可选取值为 10、50、80、100;

gamma参数（min_split_loss），可选值为0，1，2，默认是0，分裂节点时，损失函数减小值只有大于等于gamma节点才分裂，gamma值越大，算法越保守，越不容易过拟合，但性能就不一定能保证，需要平衡。

colsample_bytree：可选值为0.3, 0.5, 1，默认= 1，列采样率，也就是特征采样率。范围为（0，1]

subsample：可选值为0.9, 1，默认值= 1，构建每棵树对样本的采样率，如果设置成0.5，XGBoost会随机选择一半的样本作为训练集。范围：（0,1]

min_child_weight：可选值为4, 6, 8，默认值= 1，如果新分裂的节点的样本权重和小于min_child_weight则停止分裂。这个可以用来减少过拟合，但是也不能太高，会导致欠拟合。范围：[0，∞]

alpha（reg_alpha）：可选值为3, 4, 5，默认= 0，权重的L1正则化项。增加此值将使模型更加保守。

3.3 模型训练：

定义XGBoost分类器。

xgb = XGBClassifier(
    learning_rate=0.1,
    n_estimators=15,
    max_depth=12,
    min_child_weight=6,
    gamma=0,
    subsample=1,
    colsample_bytree=1,
    objective='binary:logistic', 
    nthread=4,
    alpha=4,
    scale_pos_weight=1,
    seed=27)

在网格搜索过程中,模型采用 f1_score 作为评价指标，基于 StratifiedKFold 将数据分为3份,尝试之前定义的参数列表,使用 fit 方法对模型进行训练。训练完成后,使用测试集(X_test)进行验证，并将结果输出，代码如下:：

refit_score = "f1_score"

start_time = datetime.datetime.now()
print(start_time)
rd_search = RandomizedSearchCV(xgb, param_grid, n_iter=10, cv=3, refit=refit_score, scoring=sklearn.metrics.make_scorer(f1_score), verbose=10, return_train_score=True)
rd_search.fit(X_train, y_train)
print(rd_search.best_params_)
print(rd_search.best_score_)
print(rd_search.best_estimator_)
print(datetime.datetime.now() - start_time)

通过搜索之后得到的相关最优参数如下：

在这里插入图片描述

{'subsample': 0.5, 'n_estimators': 10, 'min_child_weight': 8, 'max_depth': 15, 'gamma': 0, 'colsample_bytree': 1, 'alpha': 4}

然后进行训练：

start = time.time()
    
y_pred = rd_search.best_estimator_.predict(X_test)
end = time.time()
print('模型拟合时间:{:.2f}'.format(end-start))

# confusion matrix on the test data.
print('\nConfusion matrix of Random Forest optimized for {} on the test data:'.format(refit_score))
print(pd.DataFrame(sklearn.metrics.confusion_matrix(y_test, y_pred),
                   columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))

将测试集的预测结果与实际结果合起来，应用 confusion_matrix 构建混淆矩阵，并输出模型拟合时间，结果如下:
在这里插入图片描述

最终输出f1分数：
在这里插入图片描述

3.4 结果可视化：

使用Scikit-learn库：

from scipy import interp

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (12, 9),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)



tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

skf = StratifiedKFold(n_splits=5)
linetypes = ['--',':','-.','-','-','O']

i = 0
cv_data =skf.split(X_test, y_test)

for train, test in cv_data:
    probas_ = rd_search.predict_proba(X_test[test])
    # Compute ROC curve and area the curve
    fpr, tpr, thresholds = roc_curve(y_test[test], probas_[:, 1])
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1.5,linestyle = linetypes[i], alpha=0.8,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=1, color='r',
         label='Chance', alpha=.6)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.02, 1.02])
plt.ylim([-0.02, 1.02])
plt.xlabel('FPR',fontsize=20)
plt.ylabel('TPR',fontsize=20)
# plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

结果展示：
在这里插入图片描述

4.学习总结：

通过本项目，使我了解到了intel下的oneAPI在机器学习与数据挖掘领域中有很大的作用。
而在实际中体验到了intel AI Analytics Toolkit工具对于问题解决的提升。甚至在几百万条数据中也能很快的进行模型的拟合，这种感觉在平时的机器学习课程中是没有的，曾经在一次机器学习的项目中，数据量还没有这么大，足足跑了一上午，而结果还不尽人意，相比较下来就显得本项目中使用的工具的强大了，相信在未来的工作或者其他方面也能使用到这些工具。