Intel School-Enterprise Collaboration: Freshwater Quality Prediction

Table of Contents

1: Introduction to Freshwater Quality Prediction

1.1 Problem Description

2: Data Preprocessing

2.1 Setup

2.1.1 Import packages

2.1.2 Configure the environment for Chinese display

2.1.3 Use Intel acceleration components to speed up training and inference

2.2 Data Exploration

2.2.1 Download the dataset

2.2.2 Read the relevant features, convert them to a DataFrame, and display them

2.2.3 Factorize the English fields into numeric variables

2.3 Data Preprocessing

2.3.1 Correlation analysis

2.3.2 Check and handle missing values, duplicates, and outliers

2.3.3 Check and handle class imbalance

3: Training with XGBoost

3.1 Data Modeling

3.1.1 Model parameter definition

3.1.2 Classifier definition

3.2 Model Training and Inference

3.2.1 Model training

3.2.2 Model inference

3.2.3 Build a confusion matrix with confusion_matrix

4: Results Analysis

4.1 Plot the ROC curve

5: Summary


1: Introduction to Freshwater Quality Prediction

1.1 Problem Description

Freshwater is one of our most important and scarcest natural resources, making up only 3% of the Earth's total water. It touches nearly every aspect of daily life, from drinking, swimming, and bathing to producing food, electricity, and the products we use every day. Access to a safe and sanitary water supply is essential not only for human life but also for the survival of surrounding ecosystems, which are already under stress from drought, pollution, and rising temperatures. Predicting freshwater quality with machine learning is therefore a critical task, bearing on both human safety and ecosystem health.

2: Data Preprocessing

2.1 Setup

2.1.1 Import packages

import os
import time
import warnings

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.graph_objects as go

import xgboost
from xgboost import XGBClassifier

from sklearn.utils import resample
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score, f1_score
from sklearn.metrics import precision_recall_curve, average_precision_score

2.1.2 Configure the environment for Chinese display

import pandas as pd

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'
pd.set_option('display.max_columns', 100)

# Select the dask engine for Modin before modin.pandas is imported below.
os.environ['MODIN_ENGINE'] = 'dask'

2.1.3 Use Intel acceleration components to speed up training and inference

import daal4py as d4p

# Modin parallelizes pandas operations across CPU cores; use the dask engine.
import modin.pandas as pd
from modin.config import Engine
Engine.put("dask")

# Patch scikit-learn with the Intel Extension for Scikit-learn (sklearnex).
# Note: for the patch to take effect, sklearn modules imported earlier must be
# re-imported after calling patch_sklearn().
from sklearnex import patch_sklearn

patch_sklearn()
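
patch_sklearn() replaces supported scikit-learn estimators with accelerated oneDAL implementations. daal4py can additionally speed up inference for a trained XGBoost model by converting the booster into oneDAL's gradient-boosted-tree format; a minimal sketch of that conversion (the toy data and model below are illustrative stand-ins, not the article's classifier):

import numpy as np
import daal4py as d4p
from xgboost import XGBClassifier

# Toy data and model, standing in for the classifier trained in section 3.
X_demo = np.random.rand(200, 5)
y_demo = np.random.randint(0, 2, 200)
clf = XGBClassifier(n_estimators=10, max_depth=3).fit(X_demo, y_demo)

# Convert the fitted booster into oneDAL's gradient-boosted-tree model.
daal_model = d4p.get_gbt_model_from_xgboost(clf.get_booster())

# Predict with oneDAL; result.prediction is an (n, 1) array of class labels.
result = d4p.gbt_classification_prediction(nClasses=2).compute(X_demo, daal_model)
print(result.prediction[:5].ravel())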

2.2 Data Exploration

2.2.1 Download the dataset

!wget https://filerepo.idzcn.com/hack2023/datasetab75fb3.zip
# Unpack the archive (assumed to contain dataset.csv) into the working directory.
!unzip -o datasetab75fb3.zip

2.2.2 Read the relevant features, convert them to a DataFrame, and display them

df = pd.read_csv('./dataset.csv')
print("Data shape: {}\n".format(df.shape))
display(df.head())

df.info()

2.2.3 Factorize the English fields into numeric variables

display(df.head())
factor = pd.factorize(df['Color'])
print(factor)
df.Color = factor[0]
factor = pd.factorize(df['Source'])
print(factor)
df.Source = factor[0]
factor = pd.factorize(df['Month'])
print(factor)
df.Month = factor[0]

df
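
One caveat worth noting: pd.factorize assigns integer codes in order of first appearance, so factorizing another file (such as the test set in section 3.2.2) independently can yield a different code for the same category. A hedged sketch of reusing the training mapping, with illustrative category values:

import pandas as pd

train_colors = pd.Series(['Colorless', 'Yellow', 'Faint Yellow'])  # illustrative values
test_colors = pd.Series(['Yellow', 'Colorless', 'Light Yellow'])   # 'Light Yellow' unseen

# factorize returns (codes, uniques); uniques records the training category order.
codes, uniques = pd.factorize(train_colors)

# Encode the test column with the same mapping; unseen categories become -1.
test_codes = pd.Categorical(test_colors, categories=uniques).codes
print(test_codes)  # [1, 0, -1]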

2.3 Data Preprocessing

2.3.1 Correlation analysis

bar = df.corr()['Target'].abs().sort_values(ascending=False)[1:]

# Set the figure size before drawing.
params = {
    'figure.figsize': '40, 10'
}
plt.rcParams.update(params)

plt.bar(bar.index, bar, width=0.5)
plt.xticks(bar.index, bar.index, rotation=-60, fontsize=10)
plt.show()

df = df.drop(
    columns=['Index', 'Day', 'Time of Day', 'Month', 'Water Temperature', 'Source', 'Conductivity', 'Air Temperature'])

2.3.2 Check and handle missing values, duplicates, and outliers

display(df.isna().sum())
missing = df.isna().sum().sum()
duplicates = df.duplicated().sum()
print("\nThere are {:,.0f} missing values in the data.".format(missing))
print("There are {:,.0f} duplicate records in the data.".format(duplicates))

 

df = df.fillna(0)
df = df.drop_duplicates()
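
The heading above also mentions outliers; the notebook imports RobustScaler but never applies it. Two hedged options for taming extreme values, shown on illustrative data (a sketch, not part of the original pipeline):

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Illustrative numeric column with one extreme value.
demo = pd.DataFrame({'pH': [6.8, 7.0, 7.2, 7.1, 14.0]})

# Option 1: clip values to the 1.5 * IQR fence.
q1, q3 = demo['pH'].quantile([0.25, 0.75])
iqr = q3 - q1
demo['pH_clipped'] = demo['pH'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Option 2: scale with RobustScaler, which centers on the median and divides
# by the IQR, so outliers pull the transform far less than with StandardScaler.
demo['pH_scaled'] = RobustScaler().fit_transform(demo[['pH']])
print(demo)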

from scipy.stats import pearsonr

variables = df.columns
var = df.var()
numeric = df.columns
df = df.fillna(0)

# Drop near-constant features (variance <= 0.1).
for i in range(0, len(var) - 1):
    if var[i] <= 0.1:
        print(variables[i])
        df = df.drop(columns=numeric[i])
variables = df.columns

# Drop features whose Pearson correlation with the target is not significant (p > 0.05).
for i in range(0, len(variables)):
    x = df[variables[i]]
    y = df[variables[-1]]
    if pearsonr(x, y)[1] > 0.05:
        print(variables[i])
        df = df.drop(columns=variables[i])

variables = df.columns
print(variables)
print(len(variables))

After processing, check the missing values and duplicates again:

display(df.isna().sum())
missing = df.isna().sum().sum()
duplicates = df.duplicated().sum()
print("\nThere are {:,.0f} missing values in the data.".format(missing))
print("There are {:,.0f} duplicate records in the data.".format(duplicates))

All clean now.

2.3.3 Check and handle class imbalance

print(df.Target.value_counts())
target = df.Target.value_counts()
target.rename(index={1: 'state 1', 0: 'state 0'}, inplace=True)
plt.pie(target, explode=[0, 0.05], labels=target.index, autopct='%1.1f%%')
plt.show()

from imblearn.under_sampling import RandomUnderSampler
import datetime

X = df.drop('Target', axis=1).values
y = df['Target'].values

# Random under-sampling to balance the two classes.
under_sampler = RandomUnderSampler(random_state=21)
X, y = under_sampler.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Train Shape: {}".format(X_train_scaled.shape))
print("Test Shape: {}".format(X_test_scaled.shape))

X_train, X_test = X_train_scaled, X_test_scaled

3: Training with XGBoost

3.1 Data Modeling

3.1.1 Model parameter definition

from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

param_grid = {
    'max_depth': [10, 15, 20],
    "gamma": [0, 1, 2], # -> 0
    "subsample": [0.9, 1], # -> 1
    "colsample_bytree": [0.3, 0.5, 1], # -> 1
    'min_child_weight': [4, 6, 8], # -> 6
    "n_estimators": [10, 50, 80, 100], # -> 80
    "alpha": [3, 4, 5] # -> 4
}

scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score),
    'roc_auc_score': make_scorer(roc_auc_score),
}

3.1.2 Classifier definition

xgb = XGBClassifier(
    learning_rate=0.1,
    n_estimators=15,
    max_depth=12,
    min_child_weight=6,
    gamma=0,
    subsample=1,
    colsample_bytree=1,
    objective='binary:logistic',  # logistic regression for binary classification; outputs probabilities
    nthread=4,
    alpha=4,
    scale_pos_weight=1,
    seed=27)

3.2 Model Training and Inference

3.2.1 Model training

refit_score = "f1_score"

start_time = datetime.datetime.now()
print(start_time)
rd_search = RandomizedSearchCV(xgb, param_grid, n_iter=10, cv=3, refit=refit_score, scoring=scorers, verbose=10, return_train_score=True)
rd_search.fit(X_train, y_train)
print(rd_search.best_params_)
print(rd_search.best_score_)
print(rd_search.best_estimator_)
print(datetime.datetime.now() - start_time)
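
Because several scorers are registered, the search records each metric for every sampled configuration; refit='f1_score' only decides which configuration becomes best_estimator_. A small sketch of comparing the other metrics after fitting (assumes rd_search has been fitted as above):

# cv_results_ keys follow the pattern 'mean_test_<scorer name>'.
cv_results = pd.DataFrame(rd_search.cv_results_)
metric_cols = ['mean_test_precision_score', 'mean_test_recall_score',
               'mean_test_accuracy_score', 'mean_test_f1_score',
               'mean_test_roc_auc_score']
print(cv_results.sort_values('mean_test_f1_score', ascending=False)[metric_cols].head())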

3.2.2 Model inference

(Self-built test set, split from the training data)

from datetime import datetime

# Record the start time
inference_start_time = datetime.now()

# Model inference
y_pred = rd_search.best_estimator_.predict(X_test)

# Compute the inference time
inference_time = datetime.now() - inference_start_time
print("Inference time:", inference_time)

 

(Unified test set)

test_df = pd.read_csv('./test_data.csv')
print("Data shape: {}\n".format(test_df.shape))
display(test_df.head())

test_df.info()

display(test_df.head())
# Note: factorizing the test set on its own assumes its categories appear in the
# same order as in training; reusing the training mapping (see 2.2.3) is safer.
factor = pd.factorize(test_df['Color'])
print(factor)
test_df.Color = factor[0]
factor = pd.factorize(test_df['Source'])
print(factor)
test_df.Source = factor[0]
factor = pd.factorize(test_df['Month'])
print(factor)
test_df.Month = factor[0]

test_df


test_df = test_df.drop(
    columns=['Index', 'Day', 'Time of Day', 'Month', 'Water Temperature', 'Source', 'Conductivity', 'Air Temperature'])

# Fill missing values only; dropping rows here would misalign the predictions
# with the labels read back from test_data.csv below.
test_df = test_df.fillna(0)

test_df = test_df.drop(columns=['Lead'])
test_df = test_df.drop(columns=['Target'])

X_test = scaler.transform(test_df)

import pandas as pd

# Read test_data.csv
test_df = pd.read_csv('test_data.csv')

# Extract the Target column as the ground truth
y_true = test_df['Target']

from datetime import datetime

# Record the start time
inference_start_time = datetime.now()

# Model inference
y_pred = rd_search.best_estimator_.predict(X_test)

# Compute the inference time
inference_time = datetime.now() - inference_start_time
print("Inference time:", inference_time)

# Compute and print the F1 score on the test set
f1 = f1_score(y_true, y_pred)
print("F1 score on the test set:", f1)

 

3.2.3 Build a confusion matrix with confusion_matrix

from sklearn.metrics import confusion_matrix
y_pred = rd_search.best_estimator_.predict(X_test)

# Confusion matrix on the unified test data.
print('\nConfusion matrix of XGBoost optimized for {} on the test data:'.format(refit_score))
print(pd.DataFrame(confusion_matrix(y_true, y_pred),
                   columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))

4: Results Analysis

4.1 Plot the ROC curve


params = {'legend.fontsize': 'x-large',
          'figure.figsize': (12, 9),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)



tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

skf = StratifiedKFold(n_splits=5)
linetypes = ['--', ':', '-.', '-', '-']

i = 0
cv_data = skf.split(X_test, y_true)

for train, test in cv_data:
    probas_ = rd_search.predict_proba(X_test[test])
    # Compute the ROC curve and the area under it for this fold
    # (np.interp replaces the deprecated scipy.interp).
    fpr, tpr, thresholds = roc_curve(y_true.values[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1.5, linestyle=linetypes[i], alpha=0.8,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=1, color='r',
         label='Chance', alpha=.6)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.02, 1.02])
plt.ylim([-0.02, 1.02])
plt.xlabel('FPR',fontsize=20)
plt.ylabel('TPR',fontsize=20)
# plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

5: Summary

This model uses XGBoost tuned with a randomized grid search; after 10 search iterations the F1 score is around 0.83, a fairly high accuracy. Accelerating with daal4py from the Intel oneAPI components brings the inference time down to roughly 0.26 s, so the pipeline is both fast and accurate.
