Predicting Diabetes on the Kaggle Pima Indians Dataset with Python Machine Learning Libraries

Gerard_mok

Article link: https://blog.csdn.net/Gerard_mok/article/details/99778829

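This article uses Python's scikit-learn to analyze the Kaggle Pima Indians diabetes dataset and builds a logistic regression prediction model that reaches a prediction accuracy of 80.5%. The key features are glucose, insulin, BMI and skin_thick.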

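Project Summary

This project uses Python to visualize and describe the relationship between a set of medical measurements and diabetes. The scikit-learn machine learning toolkit is then used for inferential analysis: the data is standardized, a logistic regression model predicts the outcomes for the test set, and the model is evaluated with a confusion matrix and its accuracy.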
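
The workflow described above (standardize, fit a logistic regression, evaluate with a confusion matrix and accuracy) can be sketched roughly as follows. This is only an illustration under assumptions: the data is a synthetic stand-in generated with `make_classification` rather than the Pima dataset, and the 80/20 split, the `random_state` values and the default `LogisticRegression` settings are chosen for the example and may differ from what the original analysis used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 768 rows, 8 numeric features, binary target.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           random_state=0)

# Assumed 80/20 split; with 768 rows this leaves 154 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Logistic regression with default settings (assumption).
model = LogisticRegression().fit(X_train_std, y_train)
y_pred = model.predict(X_test_std)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy: {acc:.3f}")
```

Accuracy is the sum of the diagonal of the confusion matrix divided by the number of test rows. On this synthetic data the value will not match the 80.5% reported for the real dataset.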

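Main conclusions:

Of the 768 people in the dataset, 268 have diabetes and 500 do not, a prevalence of 34.90%;
On average, diabetic patients have higher glucose concentration, diastolic blood pressure, skin fold thickness, serum insulin, body mass index and diabetes pedigree function than non-diabetic participants. Patients are typically between 27 and 47 years old and have had between 1 and 8 pregnancies;
The parameters most strongly correlated with diabetes are glucose, insulin, BMI and skin_thick;
With the logistic regression model, 124 of the 154 Pima Indian women in the test set were predicted correctly, a prediction accuracy of 80.5%.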
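
The two percentages above follow directly from the counts given in the conclusions. A short check of the arithmetic, using only the numbers stated in the text (the variable names are made up for this example):

```python
# Counts taken from the conclusions above.
total, positive = 768, 268
tested, correct = 154, 124

prevalence = positive / total   # share of the dataset with diabetes
accuracy = correct / tested     # share of test-set predictions that were right

print(f"non-diabetic: {total - positive}")  # 500
print(f"prevalence: {prevalence:.2%}")      # 34.90%
print(f"accuracy: {accuracy:.1%}")          # 80.5%
```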

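I. Dataset Introduction

The dataset comes from the US National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to predict whether a patient has diabetes from existing diagnostic measurements. The data has some limitations, most notably that every patient in it is a Pima Indian woman aged 21 or older.

Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database

The dataset consists of several medical predictor variables and one target variable, Outcome, for a total of nine fields. The predictors include the patient's number of pregnancies, BMI, insulin level, age and so on.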
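
To make the "eight predictors plus one target" layout concrete, here is a minimal sketch that separates the predictor columns from Outcome. The helper `split_features_target` is a hypothetical name introduced for this example and is not part of the original analysis; the column names and the two sample rows are the ones that appear in the `df.head()` output of this article.

```python
import pandas as pd

# The nine fields of the dataset, in file order.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def split_features_target(df, target="Outcome"):
    """Hypothetical helper: return (predictors, target) after a schema check."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    feature_cols = [c for c in EXPECTED_COLUMNS if c != target]
    return df[feature_cols], df[target]


# Two rows copied from the head of the dataset, for illustration only.
sample = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=EXPECTED_COLUMNS,
)
X, y = split_features_target(sample)
print(X.shape, list(y))
```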

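II. Project Analysis

Analysis steps

1. Define the question

Can the available data be used to accurately predict whether a person has diabetes?

2. Understand the data

Start with a quick look at the table. The column names are then changed to ones that are easier to read and use.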

# Import the data-processing and plotting packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the data file
df = pd.read_csv('diabete.csv')

# Look at the first rows of the data
df.head()

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

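The new column names are as follows (original name in parentheses):

preg_times (Pregnancies): number of pregnancies
glucose (Glucose): glucose concentration
blood_pressure (BloodPressure): diastolic blood pressure
skin_thick (SkinThickness): skin fold thickness
insulin (Insulin): serum insulin
BMI (BMI): body mass index
DPF (DiabetesPedigreeFunction): diabetes pedigree function
age (Age): age
outcome (Outcome): target variable, whether the patient has diabetes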

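2.1 Review the data overview and data types

The minimum of glucose, blood_pressure, skin_thick, insulin and BMI should not be 0, so a 0 in these columns can be read as a value that was never recorded, that is, missing data. Approach: 1. convert the 0 values in these columns to NaN; 2. compute the mean of each column grouped by outcome; 3. fill the missing values with those means.

All columns are numeric.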
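
Before treating zeros as missing, it helps to see how many there are in each affected column. A minimal sketch, assuming the renamed column names used in this article; the five rows are the ones shown by `df.head()`, so the counts below describe only this small sample, not the full 768-row dataset.

```python
import pandas as pd

columns = ["preg_times", "glucose", "blood_pressure", "skin_thick",
           "insulin", "BMI", "DPF", "age", "outcome"]
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Columns where a 0 is treated as "not recorded" rather than a real value.
zero_as_missing = ["glucose", "blood_pressure", "skin_thick", "insulin", "BMI"]

# Count the zeros per column; preg_times and outcome are left out on purpose
# because 0 is a valid value there.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

Running the same two lines on the full `df` gives the number of values each column will lose when the zeros are converted to NaN.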

# Rename the columns so they are easier to read and use
df.rename(columns={
    'Pregnancies': 'preg_times',
    'Glucose': 'glucose',
    'BloodPressure': 'blood_pressure',
    'SkinThickness': 'skin_thick',
    'Insulin': 'insulin',
    'DiabetesPedigreeFunction': 'DPF',
    'Age': 'age',
    'Outcome': 'outcome'}, inplace=True)
df.head()

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

print('Number of columns: %d, number of rows: %d' % (df.shape[1], df.shape[0]))

Number of columns: 9, number of rows: 768

df.describe()

	preg_times	glucose	blood_pressure	skin_thick	insulin	BMI	DPF	age	outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
preg_times        768 non-null int64
glucose           768 non-null int64
blood_pressure    768 non-null int64
skin_thick        768 non-null int64
insulin           768 non-null int64
BMI               768 non-null float64
DPF               768 non-null float64
age               768 non-null int64
outcome           768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

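3. Data cleaning

3.1 Data preprocessing

Replace the 0 values in glucose, blood_pressure, skin_thick, insulin and BMI with NaN.

# Data cleaning: glucose, blood_pressure, skin_thick, insulin and BMI
# contain invalid 0 values; replace all of those 0 values with NaN
df_copy = df.copy(deep=True)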
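
The original code is cut off after the copy is made. What follows is not the author's code but a minimal sketch of the three steps described in section 2.1 (zeros to NaN, means grouped by outcome, fill with those means). It assumes the renamed columns and uses a tiny made-up frame with only two of the five affected columns so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Made-up sample (assumption): zeros stand for values that were not recorded.
sample = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137, 0],
    "insulin": [0, 0, 100, 94, 168, 60],
    "outcome": [1, 0, 1, 0, 1, 0],
})
cols = ["glucose", "insulin"]

# Work on a deep copy so the raw data stays untouched.
cleaned = sample.copy(deep=True)

for col in cols:
    # Step 1: turn the invalid zeros into NaN.
    cleaned[col] = cleaned[col].replace(0, np.nan)
    # Step 2: mean of the column within each outcome group (NaN is skipped),
    # broadcast back to one value per row.
    group_mean = cleaned.groupby("outcome")[col].transform("mean")
    # Step 3: fill each missing value with the mean of its own outcome group.
    cleaned[col] = cleaned[col].fillna(group_mean)

print(cleaned)
```

For example, the missing glucose value in row 2 (outcome 1) is filled with the mean of 148 and 137, which is 142.5. Because the fill value depends on outcome, this approach uses the target label during preprocessing, which is worth keeping in mind when reading the accuracy figures.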
