Kaggle Heart Disease Classification: A Data Analysis Case Study (Logistic Regression, KNN, Decision Tree, Random Forest...)

Article link: https://blog.csdn.net/noob_sufan/article/details/91980521

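This post walks through a heart disease classification analysis on a Kaggle dataset using logistic regression, KNN, decision tree, and random forest models. It compares the models by F1 score, confusion matrix, and ROC curve, and discusses how dataset size affects the benefit of model ensembling.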


This post is a small demo that analyzes the "heart disease classification" dataset on Kaggle.

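The overall workflow: explore the data; preprocess it; build logistic regression, KNN, and decision tree models; examine the F1 score, confusion matrix, and precision-recall curve for each; plot every model's ROC curve for comparison; and finally combine models in an ensemble, using a random forest.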
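
The modeling code itself is not part of this section, so the following is only a minimal sketch of the workflow just described. It assumes scikit-learn and runs on synthetic stand-in data from `make_classification`, not on heart.csv. The hyperparameters (`n_neighbors=5`, `max_depth=4`, `n_estimators=100`) and the 75/25 split are illustrative assumptions, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (303 rows, 13 features); NOT the heart.csv dataset.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Distance- and gradient-based models get a scaler; tree models do not need one.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard labels for F1 / confusion matrix
    y_score = model.predict_proba(X_test)[:, 1]  # class-1 probability for ROC AUC
    results[name] = {
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

for name, r in results.items():
    print(f"{name}: F1={r['f1']:.3f}, AUC={r['roc_auc']:.3f}")
```

Keeping every model behind the same `fit` / `predict` / `predict_proba` interface makes it easy to compare all of them on the same held-out split.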

Dataset: https://www.kaggle.com/ronitf/heart-disease-uci

Data exploration

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
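# Font setup for Chinese text in plots (only needed if labels contain Chinese characters)
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']  # default font that includes CJK glyphs
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box

# Load the data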
df = pd.read_csv('heart_disease_data/heart.csv')

A quick look at the data overall

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB

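Feature meanings

age age
sex sex (1 = male, 0 = female)
cp chest pain type (4 kinds): 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
trestbps resting blood pressure
chol serum cholesterol
fbs fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg resting electrocardiogram result (values 0, 1, 2)
thalach maximum heart rate achieved
exang exercise-induced angina (1 = yes, 0 = no)
oldpeak ST depression induced by exercise relative to rest (the ST segment is a portion of the ECG trace)
slope slope of the peak exercise ST segment: 1 = upsloping, 2 = flat, 3 = downsloping
ca number of major vessels (0-3)
thal a blood disorder called thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
target whether the patient has heart disease (0 = no, 1 = yes)

Note: the value codes listed for cp, slope, thal, and ca do not match this CSV. In the df.describe() output below, cp ranges over 0-3, slope over 0-2, thal over 0-3, and ca over 0-4, so the file appears to use a different encoding from the one in this list.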
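
One way to check documented category codes against what a file actually contains is to compare the documented sets with the observed unique values. The sketch below assumes a hypothetical helper `undocumented_values` and runs on a toy frame with made-up values rather than on heart.csv. The documented sets are copied from the feature list above.

```python
import pandas as pd

# Documented code sets, copied from the feature list above.
documented = {
    "cp": {1, 2, 3, 4},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
    "ca": {0, 1, 2, 3},
}

def undocumented_values(frame, documented):
    """Return, per column, the observed values that the documentation does not list."""
    report = {}
    for col, allowed in documented.items():
        extra = set(frame[col].unique()) - allowed
        if extra:
            report[col] = sorted(int(v) for v in extra)
    return report

# Toy rows for illustration only; these are NOT rows from heart.csv.
toy = pd.DataFrame({
    "cp": [0, 1, 3],
    "slope": [0, 1, 2],
    "thal": [1, 2, 3],
    "ca": [0, 2, 4],
})

report = undocumented_values(toy, documented)
print(report)
```

Calling `undocumented_values(df, documented)` on the loaded frame would list any codes that need a second look before they are interpreted.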

df.describe()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
count	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000	303.000000
mean	54.366337	0.683168	0.966997	131.623762	246.264026	0.148515	0.528053	149.646865	0.326733	1.039604	1.399340	0.729373	2.313531	0.544554
std	9.082101	0.466011	1.032052	17.538143	51.830751	0.356198	0.525860	22.905161	0.469794	1.161075	0.616226	1.022606	0.612277	0.498835
min	29.000000	0.000000	0.000000	94.000000	126.000000	0.000000	0.000000	71.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	47.500000	0.000000	0.000000	120.000000	211.000000	0.000000	0.000000	133.500000	0.000000	0.000000	1.000000	0.000000	2.000000	0.000000
50%	55.000000	1.000000	1.000000	130.000000	240.000000	0.000000	1.000000	153.000000	0.000000	0.800000	1.000000	0.000000	2.000000	1.000000
75%	61.000000	1.000000	2.000000	140.000000	274.500000	0.000000	1.000000	166.000000	1.000000	1.600000	2.000000	1.000000	3.000000	1.000000
max	77.000000	1.000000	3.000000	200.000000	564.000000	1.000000	2.000000	202.000000	1.000000	6.200000	2.000000	4.000000	3.000000	1.000000

A few quick plots to see how the features are distributed and related

df.target.value_counts()

1    165
0    138
Name: target, dtype: int64
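
The counts above can also be read as proportions, which gives the accuracy any model must beat to be better than always predicting the majority class. This is a minimal sketch that rebuilds the target column from the two printed counts instead of reading heart.csv; the variable names are illustrative.

```python
import pandas as pd

# Rebuild a target column from the counts printed above (165 ones, 138 zeros).
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns fractions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = proportions.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the real frame the same numbers come from `df.target.value_counts(normalize=True)`.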

sns.countplot(x='target',data=df,palette="muted")
plt.xlabel("Heart disease (0 = no, 1 = yes)")

Text(0.5,0,'Heart disease (0 = no, 1 = yes)')

[Figure: count plot of target]

df.sex.value_counts()

1    207
0     96
Name: sex, dtype: int64

sns.countplot(x='sex',data=df,palette="Set3")
plt.xlabel("Sex (0 = female, 1 = male)")

Text(0.5,0,'Sex (0 = female, 1 = male)')

[Figure: count plot of sex]

plt.figure(figsize=(18,7))
sns.countplot(x='age',data = df, hue = 'target',palette='PuBuGn',saturation=0.8)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.show()

[Figure: count plot of age, split by target]
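
Plotting one pair of bars per distinct age produces many narrow bars. A possible alternative, sketched here on the assumption that 10-year bins are acceptable, groups ages with `pd.cut` and computes the share of target == 1 per group. The rows and the bin edges are made up for illustration and are not taken from heart.csv.

```python
import pandas as pd

# Toy rows for illustration only; NOT taken from heart.csv.
toy = pd.DataFrame({
    "age":    [29, 41, 45, 52, 58, 63, 67, 77],
    "target": [1,  1,  0,  1,  0,  0,  1,  0],
})

# Hypothetical 10-year bins; right=False makes each bin [lower, upper).
bins = [20, 30, 40, 50, 60, 70, 80]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Share of rows with target == 1 in each age group (empty groups are dropped).
rate_by_group = toy.groupby("age_group", observed=True)["target"].mean()
print(rate_by_group)
```

The same grouping column can be passed to `sns.countplot(x='age_group', hue='target', data=...)` to get a coarser version of the chart.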

Understanding the data is an important step, but this post focuses on modeling, so the exploration stops here.

Data processing

One-hot encode the non-continuous (categorical) features cp, slope, and thal.
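
To make the effect concrete before applying it to the real frame, here is a minimal sketch on a toy frame (made-up values, not rows from heart.csv). It also shows that `pd.get_dummies(frame, columns=[...])` yields the same columns as encoding each column separately, concatenating, and dropping the originals.

```python
import pandas as pd

# Toy frame with made-up values; NOT rows from heart.csv.
toy = pd.DataFrame({
    "age": [50, 60, 70],
    "cp": [0, 2, 1],
    "slope": [1, 1, 2],
})

# Step by step: encode each column, append the dummies, drop the originals.
cp_dummies = pd.get_dummies(toy["cp"], prefix="cp")
slope_dummies = pd.get_dummies(toy["slope"], prefix="slope")
manual = pd.concat([toy, cp_dummies, slope_dummies], axis=1).drop(columns=["cp", "slope"])

# One call: the listed columns are replaced by their dummy columns.
compact = pd.get_dummies(toy, columns=["cp", "slope"])

print(list(manual.columns))
print(manual.astype(int).equals(compact.astype(int)))
```

Each category value becomes its own indicator column, so a column with k distinct values expands to k columns.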

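# One-hot encode each categorical column; prefix gives names like cp_0, cp_1, ...
first = pd.get_dummies(df['cp'], prefix = "cp")
second = pd.get_dummies(df['slope'], prefix = "slope")
third = pd.get_dummies(df['thal'], prefix = "thal")

# Append the dummy columns, then drop the original categorical columns
df = pd.concat([df,first,second,third], axis = 1)
df = df.drop(columns = ['cp', 'slope', 'thal'])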
df.head(3)

	age	sex	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	ca	...	cp_1	cp_2	cp_3	slope_0	slope_1	slope_2	thal_0	thal_1	thal_2	thal_3
0	63	1	145	233	1	0	150	0	2.3	0<