Diabetes Prediction

Latest recommended article published on 2023-12-12 09:02:32

Mayese

Category column: Kaggle. Article tags: python

Article link: https://blog.csdn.net/weixin_45081871/article/details/124283280

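Dataset Overview

Age: age
BMI: body mass index, a measure of body weight relative to height, used to judge whether a person is underweight, overweight, or in a healthy range
Glucose: glucose
Insulin: insulin
HOMA: the homeostatic model assessment index, developed through clinical and basic research and clinically validated; it is used to assess islet function, including insulin resistance and pancreatic β-cell function
Leptin: leptin, a hormone secreted by fat tissue
Adiponectin: adiponectin, a protein hormone secreted by fat tissue
Resistin: resistin
MCP.1: monocyte chemoattractant protein-1, a chemokine (a type of cytokine)
Classification: 1 means no diabetes; 2 means diabetes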
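
To make the HOMA column more concrete, here is a minimal sketch of the widely used HOMA-IR approximation: fasting glucose (mg/dL) times fasting insulin (µU/mL), divided by 405. The function name `homa_ir` and the units are assumptions made for illustration; the source does not say how the dataset's HOMA column was computed.

```python
def homa_ir(glucose_mg_dl: float, insulin_uU_ml: float) -> float:
    """Approximate HOMA-IR from fasting glucose (mg/dL) and fasting insulin (µU/mL).

    Hypothetical helper: the units are an assumption, not taken from the dataset.
    """
    if glucose_mg_dl < 0 or insulin_uU_ml < 0:
        raise ValueError("glucose and insulin must be non-negative")
    return glucose_mg_dl * insulin_uU_ml / 405.0


# Illustrative values only: 90 * 10 / 405
print(round(homa_ir(90.0, 10.0), 3))  # 2.222
```

A higher value suggests stronger insulin resistance, which is why this index is a plausible feature for a diabetes classifier.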

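EDA (Exploratory Data Analysis)

Preparing and generating the EDA report (pandas_profiling)

Installing the packages

Install pandas_profiling and pdpbox (for example with pip install pandas-profiling pdpbox).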

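import pandas as pd
import pandas_profiling

# df is assumed to be the dataset already loaded as a pandas DataFrame (for example with pd.read_csv)
profile = pandas_profiling.ProfileReport(df)
# Save the EDA report
profile.to_file('profile.html')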

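1. Interface

The Variables section gives descriptive statistics for each variable.

The Interactions section shows the relationships between pairs of variables as scatter plots.

Correlations: correlation plots

Missing values: an overview of missing data

Sample: sample rows from the dataset

Feature Analysis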

# Summary statistics, column types, and missing values
df.describe()
df.info()
df.isnull().sum()
df.corr()  # correlation matrix (Pearson by default), not covariance
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
## 1. Correlation heatmap
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot=True,fmt=".1f",square=True)
plt.show()  # in a Jupyter notebook, keep %matplotlib inline so the figure renders
# A nicer-looking version
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),vmax=0.3,center=0,square=True,
            linewidths=0.5,cbar_kws={"shrink":0.5},annot=True,fmt=".1f")
plt.tight_layout()
plt.show()  # in a Jupyter notebook, keep %matplotlib inline so the figure renders
## 2. Scatter plots
sns.pairplot(df)
plt.show()  # pairwise scatter plots of all columns
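
Since `df.corr()` is easy to confuse with a covariance matrix, the following sketch shows how the two relate. It runs on synthetic stand-in data (the column names mirror the dataset, the values are made up), so treat it as an illustration, not as results from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names mirror the dataset but the values are made up
rng = np.random.default_rng(0)
glucose = rng.normal(100, 15, size=200)
demo = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.1 * glucose + rng.normal(0, 2, size=200),
    "Age": rng.integers(25, 85, size=200),
})

corr = demo.corr()  # Pearson correlation: unitless, values in [-1, 1]
cov = demo.cov()    # covariance: depends on the units of each column

# Correlation is covariance rescaled by the two columns' standard deviations
std = demo.std()
rescaled = cov / np.outer(std, std)

print(corr.round(2))
```

Because correlation is unitless, it is the better input for a heatmap that compares columns measured on very different scales.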

1. Correlation heatmap
A nicer-looking version

2. Pairwise scatter plots

Single-Feature Analysis

sns.displot(df["Age"])
plt.show()

sns.countplot(x="Classification",data = df,palette = "bwr")

sns.countplot(x="Age",data=df,palette = "bwr")

Relationship Between a Single Feature and the Label

# Relationship between age and diabetes
pd.crosstab(df.Age,df.Classification).plot(kind="bar",figsize=(20,6))

# Add a title and axis labels
pd.crosstab(df.Age,df.Classification).plot(kind="bar",figsize=(20,6))
plt.title("Diabetes Frequency by Age")
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
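
With one bar group per distinct age, the crosstab chart can get crowded. One possible variation, sketched here on a tiny made-up sample, is to bin ages into bands first and look at proportions per band. The `AgeBand` column and the bin edges are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Tiny made-up sample; Classification follows the article's coding (1 or 2)
demo = pd.DataFrame({
    "Age": [34, 45, 45, 52, 61, 67, 70, 38, 49, 58],
    "Classification": [1, 1, 2, 2, 2, 1, 2, 1, 2, 2],
})

# Group ages into bands; the bin edges here are an arbitrary illustrative choice
demo["AgeBand"] = pd.cut(demo["Age"], bins=[30, 45, 60, 75])

counts = pd.crosstab(demo["AgeBand"], demo["Classification"])
# normalize="index" turns each row into proportions that sum to 1
shares = pd.crosstab(demo["AgeBand"], demo["Classification"], normalize="index")

print(counts)
print(shares.round(2))
```

Calling `counts.plot(kind="bar")` on the result draws the same kind of grouped bar chart as above, with one group per age band.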

# Box plot and violin plot
sns.boxplot(x=df['Classification'],y=df['Age'])
plt.show()

sns.violinplot(x=df['Classification'],y=df['Age'])
plt.show()
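
The numbers behind a box plot can also be read off directly. This sketch computes per-class quartiles of Age on a small made-up sample; the data are illustrative only.

```python
import pandas as pd

# Made-up sample; Classification follows the article's coding (1 or 2)
demo = pd.DataFrame({
    "Age": [34, 45, 45, 52, 61, 67, 70, 38, 49, 58],
    "Classification": [1, 1, 2, 2, 2, 1, 2, 1, 2, 2],
})

# Quartiles per class: roughly what a box plot draws as the box edges and the median line
quartiles = (
    demo.groupby("Classification")["Age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Comparing the median column between the two classes gives a quick numeric check of what the box plot shows visually.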

## Scatter plot (Classification is categorical while every other column is numeric with no categories of its own, so its relationship with Classification is hard to judge directly)
plt.scatter(x=df.Age[df.Classification==1],y=df.Glucose[df.Classification==1],c='red')
plt.scatter(x=df.Age[df.Classification==2],y=df.Glucose[df.Classification==2],c='blue')
plt.xlabel('Age')
plt.ylabel('Glucose')
plt.legend(['Not Disease','Disease'])  # order matches the scatter calls: 1 = no diabetes, 2 = diabetes
plt.show()

SHAP Values (Interpretability)

X = df.drop('Classification',axis=1)
y = df['Classification']
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)

# Build the random forest model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=5,n_estimators=100)
model.fit(X_train,y_train)

y_pred = model.predict(X_test)  # predicted class labels
y_pred_proba = model.predict_proba(X_test)  # predicted class probabilities
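
The predictions above are not scored in the text. As a rough sketch of how they could be checked, the example below trains the same kind of model on synthetic data from `make_classification` (a stand-in, not the real dataset) and reports accuracy and a confusion matrix. The names `X_demo`, `y_demo`, and `clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (9 features, like the dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, n_informative=4,
                                     random_state=0)
y_demo = y_demo + 1  # recode 0/1 to the article's 1/2 labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

clf = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)         # class labels (1 or 2)
proba = clf.predict_proba(X_te)  # one probability column per class, in clf.classes_ order

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred, labels=[1, 2])  # rows: true class, columns: predicted class
print(f"accuracy = {acc:.2f}")
print(cm)
```

Setting `random_state` on the classifier makes the run repeatable, which helps when comparing SHAP plots between runs.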

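import shap
# Assumed reconstruction: the step that computes shap_values is missing from the text
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # assumes a shap version that returns one array per class

# Feature importance (index 1 selects the second class, Classification == 2)
shap.summary_plot(shap_values[1],X_test,plot_type = "bar")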

# Relationship between each feature's values and its SHAP values
shap.summary_plot(shap_values[1],X_test)
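
The bar version of the summary plot ranks features by their mean absolute SHAP value. The sketch below reproduces that ranking by hand on a small placeholder matrix; the numbers are made up and simply stand in for one class's slice of SHAP values, such as `shap_values[1]`.

```python
import numpy as np
import pandas as pd

feature_names = ["Age", "BMI", "Glucose"]

# Placeholder SHAP matrix (rows: samples, columns: features); made-up numbers
shap_matrix = np.array([
    [0.02, -0.01, 0.10],
    [-0.04, 0.03, -0.20],
    [0.06, -0.02, 0.30],
])

# Mean absolute SHAP value per feature, sorted from most to least important
importance = (
    pd.Series(np.abs(shap_matrix).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)
print(importance)
```

Taking the absolute value first matters: positive and negative contributions would otherwise cancel out and hide a feature that pushes predictions strongly in both directions.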
