Predicting Freshwater Quality

This article shows how to model the prediction of freshwater potability and ecosystem usability with the XGBoost algorithm. Through data preprocessing, feature engineering, and randomized grid search for parameter tuning, the final model reaches an F1 score of about 0.91, and daal4py acceleration cuts inference time to about 0.22 seconds.

I. Problem and Dataset

1. Problem description:

Freshwater is one of our most important and scarcest natural resources, making up only 3% of the Earth's total water. It touches nearly every aspect of our daily lives, from drinking, swimming, and bathing to producing food, electricity, and the products we use every day. Access to a safe and sanitary water supply is critical not only for human life but also for the survival of surrounding ecosystems already strained by drought, pollution, and rising temperatures.

2. Expected solution:

Following Intel's reference implementation, the goal is to predict whether freshwater is safe to drink and usable by the ecosystems that depend on it, contributing to global water security and environmental sustainability. Classification accuracy and inference time are the main scoring criteria.

3. Dataset:

You can download the dataset here.

II. Solution

1. Data preprocessing

1.1 Import the required packages

import os
import time
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.graph_objects as go

import sklearn
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score, f1_score
from sklearn.metrics import precision_recall_curve, average_precision_score

import xgboost
from xgboost import XGBClassifier

1.2 Set environment variables

Set the NLS_LANG environment variable to 'SIMPLIFIED CHINESE_CHINA.UTF8' and allow pandas to display up to 100 columns

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'
pd.set_option('display.max_columns', 100)

1.3 Read the CSV file

Read the prepared CSV file and load it into a Pandas DataFrame

df = pd.read_csv('./dataset.csv')
print("Data shape: {}\n".format(df.shape))
display(df.head())

df.info()

Output:

1.4 Compute mean, standard deviation, and other statistics

Run descriptive statistics on the numeric columns of the DataFrame, store skewness and kurtosis in their own DataFrames, then concatenate everything for output

numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
desc = df.describe()
skewness = df[numeric_cols].skew()
kurtosis = df[numeric_cols].kurtosis()

skewness_df = pd.DataFrame(skewness, columns=['skewness']).transpose()
kurtosis_df = pd.DataFrame(kurtosis, columns=['kur']).transpose()
result = pd.concat([desc, skewness_df, kurtosis_df])

result

Output:

1.5 Re-read the CSV dataset and compute descriptive statistics and distribution-shape metrics

import pandas as pd
df = pd.read_csv('./dataset.csv')
desc = df.describe()
skewness = df.select_dtypes(include=['float64', 'int64']).skew().to_frame(name='skewness').T
kurtosis = df.select_dtypes(include=['float64', 'int64']).kurtosis().to_frame(name='kurtosis').T
desc = pd.concat([desc, skewness, kurtosis])

idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
desc.index = idx + ['skewness', 'kurtosis']

desc = desc.round(3)
display(desc.style.background_gradient(subset=pd.IndexSlice['skewness':, :],
                                       cmap='OrRd'))

Output:

1.6 Factorize the categorical columns and display the result

display(df.head())

# Convert the categorical columns (Color, Source, Month) to integer codes in place
factor = pd.factorize(df['Color'])
print(factor)
df.Color = factor[0]
factor = pd.factorize(df['Source'])
print(factor)
df.Source = factor[0]
factor = pd.factorize(df['Month'])
print(factor)
df.Month = factor[0]
df

Output:
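
pd.factorize returns a tuple of (codes, uniques); keeping the second element lets you map the integer codes back to the original category labels later. A minimal illustrative sketch (run it before the in-place replacement above, since Color is already numeric afterwards):

codes, uniques = pd.factorize(df['Color'])
code_to_label = dict(enumerate(uniques))  # integer code -> original label
print(code_to_label)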

2. Correlation analysis

2.1 Analyze feature correlations with the target

# Correlation of each feature with Target, sorted by absolute value
bar = df.corr()['Target'].abs().sort_values(ascending=False)[1:]

plt.rcParams.update({'figure.figsize': (20, 10)})
plt.bar(bar.index, bar, width=0.5)
plt.xticks(bar.index, bar.index, rotation=-60, fontsize=10)
plt.show()

Output:
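
Beyond the bar chart of per-feature correlation with Target, a full correlation heatmap makes multicollinearity between features visible. A minimal sketch using the seaborn import from section 1.1:

plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlation matrix')
plt.show()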

2.2 Drop weakly correlated columns

df = df.drop(
    columns=['Index', 'Day', 'Time of Day', 'Month', 'Water Temperature', 'Source', 'Conductivity', 'Air Temperature'])

3. Handling missing values, duplicates, and low-information features

3.1 Inspect missing values and duplicates

display(df.isna().sum())
missing = df.isna().sum().sum()
duplicates = df.duplicated().sum()
print("\nThere are {:,.0f} missing values in the data.".format(missing))
print("There are {:,.0f} duplicate records in the data.".format(duplicates))

Output:
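
Section 3.2 below fills the remaining gaps with fillna(0); depending on the feature distributions, dropping duplicate rows and imputing with column medians can be gentler on skewed columns. A minimal alternative sketch, not part of the original pipeline:

# Drop exact duplicate rows, then impute remaining gaps with column medians
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))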

3.2 Filter low-variance and statistically insignificant features

from scipy.stats import pearsonr

# Fill remaining gaps so that variance and correlation can be computed
df = df.fillna(0)

# Drop near-constant features (variance at or below 0.1)
variables = df.columns
var = df.var()
for i in range(0, len(var) - 1):
    if var[i] <= 0.1:
        print(variables[i])
        df = df.drop(columns=variables[i])

# Drop features whose Pearson correlation with Target is not significant (p > 0.05)
variables = df.columns
for i in range(0, len(variables)):
    x = df[variables[i]]
    y = df[variables[-1]]
    if pearsonr(x, y)[1] > 0.05:
        print(variables[i])
        df = df.drop(columns=variables[i])

variables = df.columns
print(variables)
print(len(variables))

Output:
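
The variance filter above can also be written with scikit-learn's VarianceThreshold selector, which keeps only the features whose variance exceeds the threshold. A minimal equivalent sketch (0.1 mirrors the cutoff used above):

from sklearn.feature_selection import VarianceThreshold

num = df.select_dtypes(include=['number']).fillna(0)
selector = VarianceThreshold(threshold=0.1)
selector.fit(num)
print(num.columns[selector.get_support()].tolist())  # features that survive the filter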

4. Handling class imbalance

print(df.Target.value_counts())
target = df.Target.value_counts()
target.rename(index={1: 'state 1', 0: 'state 0'}, inplace=True)
plt.pie(target, explode=[0, 0.05], labels=target.index, autopct='%1.1f%%')
plt.show()

Output:

from imblearn.under_sampling import RandomUnderSampler
import datetime

X = df.drop('Target', axis=1).values
y = df['Target'].values

# Under-sample the majority class so both classes are equally represented
under_sampler = RandomUnderSampler(random_state=21)
X, y = under_sampler.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Train Shape: {}".format(X_train_scaled.shape))
print("Test Shape: {}".format(X_test_scaled.shape))

X_train, X_test = X_train_scaled, X_test_scaled

Output:
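
After under-sampling, it is worth confirming that the two classes are in fact balanced; and if discarding majority-class rows loses too much data, SMOTE over-sampling from the same imblearn package is a common alternative. A minimal sketch:

from imblearn.over_sampling import SMOTE

print(np.bincount(y))  # class counts after RandomUnderSampler; should be equal

# Alternative: synthesize minority-class samples instead of discarding data
# X_res, y_res = SMOTE(random_state=21).fit_resample(X, y)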

5. The XGBoost model

from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

param_grid = {
    'max_depth': [10, 15, 20],
    "gamma": [0, 1, 2],                  # best found: 0
    "subsample": [0.9, 1],               # best found: 1
    "colsample_bytree": [0.3, 0.5, 1],   # best found: 1
    'min_child_weight': [4, 6, 8],       # best found: 6
    "n_estimators": [10, 50, 80, 100],   # best found: 80
    "alpha": [3, 4, 5]                   # best found: 4
}


scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score),
    'roc_auc_score': make_scorer(roc_auc_score),
}

xgb = XGBClassifier(
    learning_rate=0.1,
    n_estimators=15,
    max_depth=12,
    min_child_weight=6,
    gamma=0,
    subsample=1,
    colsample_bytree=1,
    objective='binary:logistic',  # logistic regression for binary classification, outputs probabilities
    nthread=4,
    alpha=4,
    scale_pos_weight=1,
    seed=27)
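
Fitting this baseline configuration once before the search gives a reference F1 score to compare the tuned model against. A minimal sketch:

xgb.fit(X_train, y_train)
print("Baseline F1: {:.3f}".format(f1_score(y_test, xgb.predict(X_test))))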

6. Model training

refit_score = "f1_score"
 
start_time = datetime.datetime.now()
print(start_time)
rd_search = RandomizedSearchCV(xgb, param_grid, n_iter=10, cv=3, refit=refit_score, scoring=scorers, verbose=10, return_train_score=True)
rd_search.fit(X_train, y_train)
print(rd_search.best_params_)
print(rd_search.best_score_)
print(rd_search.best_estimator_)
print(datetime.datetime.now() - start_time)

Output:
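
RandomizedSearchCV records every sampled configuration in cv_results_, with one column per scorer defined above; loading it into a DataFrame makes the trade-offs easy to inspect. A minimal sketch:

cv_results = pd.DataFrame(rd_search.cv_results_)
cols = ['params', 'mean_test_f1_score', 'mean_test_roc_auc_score', 'rank_test_f1_score']
print(cv_results[cols].sort_values('rank_test_f1_score').head())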

7. Inference

inference_start_time = datetime.datetime.now()
y_pred = rd_search.best_estimator_.predict(X_test)
inference_time = datetime.datetime.now() - inference_start_time
print("Inference time on the test set:", inference_time)

Output:

8. Build a confusion matrix with confusion_matrix

Use the parameter-tuned XGBoost classifier to predict on the test set X_test and compute the confusion matrix to evaluate the model's performance on the test data

from sklearn.metrics import confusion_matrix

y_pred = rd_search.best_estimator_.predict(X_test)

# confusion matrix on the test data.
print('\nConfusion matrix of XGBoost optimized for {} on the test data:'.format(refit_score))
print(pd.DataFrame(confusion_matrix(y_test, y_pred),
                   columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))

Output:
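
classification_report complements the confusion matrix with per-class precision, recall, and F1 in a single call. A minimal sketch:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))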

9. Visualizing the results

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (12, 9),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)
 
 
 
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
 
skf = StratifiedKFold(n_splits=5)
linetypes = ['--', ':', '-.', '-', '-']

i = 0
cv_data = skf.split(X_test, y_test)

for train, test in cv_data:
    probas_ = rd_search.predict_proba(X_test[test])
    # Compute the ROC curve and area under the curve for this fold
    fpr, tpr, thresholds = roc_curve(y_test[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1.5,linestyle = linetypes[i], alpha=0.8,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
 
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=1, color='r',
         label='Chance', alpha=.6)
 
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
 
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')
 
plt.xlim([-0.02, 1.02])
plt.ylim([-0.02, 1.02])
plt.xlabel('FPR',fontsize=20)
plt.ylabel('TPR',fontsize=20)
# plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

Output:
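
The imports in section 1.1 also bring in precision_recall_curve and average_precision_score, which are never used above; on imbalanced problems the precision-recall curve is often more informative than ROC. A minimal sketch on the held-out test set:

probas = rd_search.best_estimator_.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, probas)
ap = average_precision_score(y_test, probas)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, lw=2, label='AP = {:.2f}'.format(ap))
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc='lower left')
plt.show()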

III. Summary

This project used XGBoost, tuned with a randomized grid search. After 10 search iterations the model reached an F1 score of about 0.91, a strong result.

With the daal4py acceleration from Intel's oneAPI components, model inference takes only about 0.22 seconds.
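
The daal4py speedup quoted above is not shown in the code. The typical pattern from Intel's oneAPI samples converts the trained booster into a daal4py gradient-boosted-tree model and runs prediction through oneDAL; a minimal sketch, assuming daal4py is installed:

import daal4py as d4p

# Convert the tuned XGBoost booster into a daal4py GBT model
daal_model = d4p.get_gbt_model_from_xgboost(rd_search.best_estimator_.get_booster())

# Accelerated inference through oneDAL
algo = d4p.gbt_classification_prediction(nClasses=2)
daal_pred = algo.compute(X_test, daal_model).prediction.ravel()
print("F1 with daal4py inference: {:.3f}".format(f1_score(y_test, daal_pred)))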
