Data Analysis and Classification on the Kaggle Heart Disease Dataset: Statistical Learning Lab Report

中科豆

Tags: python, machine learning, data analysis, data visualization

Article link: https://blog.csdn.net/qq_40605313/article/details/120061220

1. Experiment Setup

The data comes from Kaggle and contains 14 variables and 303 samples. The variables are described in the table below.

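Variable	Description	Values
target	Heart disease present (categorical)	0 = no, 1 = yes
age	Age (continuous)	[29, 77]
sex	Sex (categorical)	1 = male, 0 = female
cp	Chest pain type (categorical)	1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
trestbps	Resting blood pressure (continuous, mm Hg)	[94, 200]
chol	Cholesterol (continuous, mg/dl)	[126, 564]
fbs	Fasting blood sugar > 120 mg/dl (categorical)	1 = true, 0 = false
restecg	Resting ECG result (categorical)	0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria
thalach	Maximum heart rate (continuous)	[71, 202]
exang	Exercise-induced angina (categorical)	1 = yes, 0 = no
oldpeak	ST depression induced by exercise relative to rest (continuous)	[0, 6.2]
slope	Slope of the peak exercise ST segment (categorical)	1 = upsloping, 2 = flat, 3 = downsloping
ca	Number of major vessels (continuous)	[0, 3]
thal	Thalassemia, a blood disorder (categorical)	1 = normal, 2 = fixed defect, 3 = reversible defect

Note: in the heart.csv file loaded in section 2, the describe() output shows cp ranging from 0 to 3, ca reaching 4, and thal ranging from 0 to 3, so the coding in the file does not match this table exactly.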
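
Because the coding in a given copy of the file may differ from a variable table like the one above, it can help to list the values that actually occur in each column. The following is a minimal sketch, not part of the original report: `summarize_codes` is a hypothetical helper and the toy DataFrame holds made-up values rather than rows from heart.csv.

```python
import pandas as pd

def summarize_codes(df, max_levels=10):
    """Return observed levels for low-cardinality columns and (min, max) for the rest."""
    summary = {}
    for col in df.columns:
        values = df[col].dropna().unique()
        if len(values) <= max_levels:
            # few distinct values: treat as categorical and list the codes
            summary[col] = sorted(values.tolist())
        else:
            # many distinct values: treat as continuous and report the range
            summary[col] = (df[col].min(), df[col].max())
    return summary

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    "sex": [1, 0, 1, 1],
    "cp": [3, 2, 1, 0],
    "age": [63, 37, 41, 56],
})
codes = summarize_codes(toy, max_levels=3)
print(codes)
```

On the real data, calling `summarize_codes(heart_df)` would show at a glance whether cp starts at 0 or 1.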

'''
    -*- coding: utf-8 -*-
    @Author     : DouGang
    @E-mail     : dorza@qq.com
    @Software   : PyCharm, Python3.6
    @Time       : 2021-07-24
'''

Import the required libraries

# Libraries for exploring the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries for preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# K-nearest neighbors and evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.metrics import precision_recall_curve,roc_curve,average_precision_score,auc
# Decision tree
from sklearn.tree import DecisionTreeClassifier
# Random forest
from sklearn.ensemble import RandomForestClassifier
# Logistic regression
from sklearn.linear_model import LogisticRegression
# SGD classifier
from sklearn.linear_model import SGDClassifier

2. Data Overview

plt.rcParams['font.sans-serif'] = ['SimHei']    # SimHei font, needed only when plot labels contain Chinese text

heart_df = pd.read_csv("./dataSet/heart.csv")
print(heart_df.shape)   # dimensions of the data
print(heart_df.head())  # first 5 rows
heart_df.info()         # column types and non-null counts (info() prints by itself)
print(heart_df.describe())      # summary statistics
print(heart_df.isnull().sum())  # missing-value check
sns.heatmap(heart_df.isnull())
plt.show()
sns.pairplot(heart_df,hue='target')
plt.show()

(303, 14)
   age  sex  cp  trestbps  chol  fbs  ...  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1  ...      0      2.3      0   0     1       1
1   37    1   2       130   250    0  ...      0      3.5      0   0     2       1
2   41    0   1       130   204    0  ...      0      1.4      2   0     2       1
3   56    1   1       120   236    0  ...      0      0.8      2   0     2       1
4   57    0   0       120   354    0  ...      1      0.6      2   0     2       1
[5 rows x 14 columns]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

              age         sex          cp  ...          ca        thal      target
count  303.000000  303.000000  303.000000  ...  303.000000  303.000000  303.000000
mean    54.366337    0.683168    0.966997  ...    0.729373    2.313531    0.544554
std      9.082101    0.466011    1.032052  ...    1.022606    0.612277    0.498835
min     29.000000    0.000000    0.000000  ...    0.000000    0.000000    0.000000
25%     47.500000    0.000000    0.000000  ...    0.000000    2.000000    0.000000
50%     55.000000    1.000000    1.000000  ...    0.000000    2.000000    1.000000
75%     61.000000    1.000000    2.000000  ...    1.000000    3.000000    1.000000
max     77.000000    1.000000    3.000000  ...    4.000000    3.000000    1.000000

[8 rows x 14 columns]
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

3. Descriptive Analysis

# Plot the correlation matrix of the variables
plt.figure(figsize=(10,10))
sns.heatmap(heart_df.corr(),annot=True,fmt='.1f')
plt.show()
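
A heatmap shows every pairwise correlation, but for classification the column of correlations with `target` is usually the first thing to read. The sketch below ranks features by absolute correlation with the target. It runs on synthetic data with hypothetical column names (`strong`, `weak`, `noise`), so the numbers say nothing about the heart dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),  # closely tied to target
    "weak": target + rng.normal(0, 3.0, size=n),    # loosely tied to target
    "noise": rng.normal(0, 1.0, size=n),            # unrelated to target
    "target": target,
})

# Take the target column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so that negative correlations rank fairly.
corr_with_target = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

With the real data the same chain would be applied to `heart_df.corr()`.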

# Age distribution of the samples
heart_df['age'].value_counts()
sns.barplot(x=heart_df.age.value_counts().index,y=heart_df.age.value_counts().values)
plt.xlabel('Age')
plt.ylabel('Age Counter')
plt.title('Age Analysis System')
plt.show()

# Minimum, maximum and mean of the age column
minage = min(heart_df.age)
maxage = max(heart_df.age)
meanage = round(heart_df.age.mean(),2)
print('Minimum age:',minage)
print('Maximum age:',maxage)
print('Mean age:',meanage)
# Convert the continuous age variable into a categorical age group.
# .loc is used because chained indexing (df[col][mask] = value) triggers
# SettingWithCopyWarning and may not modify heart_df.
heart_df['age_states'] = ''
heart_df.loc[(heart_df['age']>=29)&(heart_df['age']<40),'age_states']='young ages'
heart_df.loc[(heart_df['age']>=40)&(heart_df['age']<55),'age_states']='middle ages'
heart_df.loc[(heart_df['age']>=55)&(heart_df['age']<=77),'age_states']='old ages'
# Number of samples in each age group
print(heart_df['age_states'].value_counts())
'''
    x: column plotted along the x-axis; y: column plotted along the y-axis
    order: order of the categories along the axis
    hue: grouping variable; hue_order: order of the hue levels
'''
sns.countplot(x='age_states',data=heart_df,order=['young ages','middle ages','old ages'])
plt.xlabel('Age Range')
plt.ylabel('Age Counts')
plt.title('Age State in Dataset')
plt.show()

Minimum age: 29
Maximum age: 77
Mean age: 54.37
old ages       159
middle ages    128
young ages      16
Name: age_states, dtype: int64
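
The same three age groups can also be produced in one call with `pd.cut`. This is an alternative sketch rather than the code used in the report. The bin edges 29, 40, 55 and 78 are chosen here to reproduce the intervals [29, 40), [40, 55) and [55, 77], and the six ages are boundary values picked for illustration.

```python
import pandas as pd

ages = pd.Series([29, 39, 40, 54, 55, 77], name="age")

# right=False makes every bin closed on the left and open on the right:
# [29, 40), [40, 55), [55, 78)
age_states = pd.cut(
    ages,
    bins=[29, 40, 55, 78],
    right=False,
    labels=["young ages", "middle ages", "old ages"],
)
print(age_states.tolist())
```

Values outside the outer edges become NaN, which makes out-of-range ages easy to spot.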

'''
    The plot above shows that the number of samples grows with the age group:
    16 young, 128 middle-aged and 159 old.
'''
# Sample counts by sex: 0 = female, 1 = male
print(heart_df['sex'].value_counts())
sns.countplot(y='sex',data=heart_df)
plt.title('Sex Count in Dataset')
plt.show()

1    207
0     96
Name: sex, dtype: int64

# Columns: heart disease status; rows: sex
pd.crosstab(heart_df['sex'],heart_df['target'])
# Relationship between sex and heart disease: 0 = female, 1 = male
pd.crosstab(heart_df['sex'],heart_df['target']).plot(kind="bar",figsize=(12,8),color=['#1CA53B','#AA1111'])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('sex(0=female, 1=male)')
plt.xticks(rotation=0)
plt.legend(["No Disease","Have Disease"])
plt.ylabel('Frequency')
plt.show()
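
Raw counts are hard to compare when the groups differ in size, as they do here (207 men and 96 women). Passing `normalize="index"` to `pd.crosstab` turns each row into proportions. The toy data below is made up for illustration and does not reflect the real rates.

```python
import pandas as pd

# Hypothetical toy sample: 4 women (sex=0) and 6 men (sex=1)
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="index" divides every row by its row total,
# so each row sums to 1 and rows are directly comparable
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

On the real data the call would be `pd.crosstab(heart_df['sex'], heart_df['target'], normalize='index')`.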

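# Heart disease prediction: sex and disease status
# Distribution of the target
fig,axes = plt.subplots(1,2,figsize=(10,5))
ax = heart_df.target.value_counts().plot(kind="bar",ax=axes[0])
ax.set_title("Disease distribution")
ax.set_xlabel("1: disease, 0: no disease")

# value_counts() sorts by count, so these labels assume target=1 is the larger class
heart_df.target.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['Disease','No disease'],ax=axes[1])
plt.show()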

# Distribution of sex and disease status
ax1 = plt.subplot(121)
ax = sns.countplot(x="sex",hue='target',data=heart_df,ax=ax1)
ax.set_xlabel("0: female, 1: male")

# value_counts() sorts by count, so the pie labels assume men are the larger group in both subsets
ax2 = plt.subplot(222)
heart_df[heart_df['target'] == 0].sex.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['Male','Female'],ax=ax2)
ax2.set_title("Sex ratio, no disease")

ax2 = plt.subplot(224)
heart_df[heart_df['target'] == 1].sex.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['Male','Female'],ax=ax2)
ax2.set_title("Sex ratio, disease")
plt.show()

fig,axes = plt.subplots(2,1,figsize=(20,10))
sns.countplot(x="age",hue="target",data=heart_df,ax=axes[0])

# [0, 45): young, [45, 60): middle-aged, [60, 100): old (right=False makes the bins left-closed)
age_type = pd.cut(heart_df.age,bins=[0,45,60,100],include_lowest=True,right=False,labels=['young','middle-aged','old'])
age_target_df = pd.concat([age_type,heart_df.target],axis=1)
# Draw the grouped counts on the second axes explicitly
sns.countplot(x="age",hue='target',data=age_target_df,ax=axes[1])
plt.show()

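# Distribution of every feature (the first 14 columns, i.e. the original variables)
fig,axes = plt.subplots(7,2,figsize=(10,20))
for x,ax in enumerate(axes.flat):
    # sns.distplot is deprecated; histplot with kde=True (seaborn >= 0.11) draws a comparable plot
    sns.histplot(heart_df.iloc[:,x],kde=True,ax=ax)
plt.tight_layout()
plt.show()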

plt.figure(figsize=(8,5))
# numeric_only=True (pandas >= 1.5) skips the string column age_states added earlier
sns.heatmap(heart_df.corr(numeric_only=True),cmap="Blues",annot=True)
plt.show()

4. Feature Preprocessing

# Preprocessing
# Note: age_states, derived from age in section 3, is still in heart_df and is one-hot encoded with the other columns
features = heart_df.drop(columns=['target'])
targets = heart_df['target']
# Turn the integer-coded categorical columns into string labels.
# Cast them to object first so that strings can be assigned into them.
# The mappings follow the variable table; codes outside the table
# (for example 0 for cp, slope and thal in this file) are left unchanged.
cat_cols = ['sex','cp','fbs','exang','slope','thal','restecg','ca']
features = features.astype({col: 'object' for col in cat_cols})

# sex
features.loc[features['sex']==0,'sex'] = 'female'
features.loc[features['sex']==1,'sex'] = 'male'

# cp
features.loc[features['cp'] == 1,'cp'] = 'typical'
features.loc[features['cp'] == 2,'cp'] = 'atypical'
features.loc[features['cp'] == 3,'cp'] = 'non-anginal'
features.loc[features['cp'] == 4,'cp'] = 'asymptomatic'

# fbs
features.loc[features['fbs'] == 1,'fbs'] = 'true'
features.loc[features['fbs'] == 0,'fbs'] = 'false'

# exang
features.loc[features['exang'] == 1,'exang'] = 'true'
features.loc[features['exang'] == 0,'exang'] = 'false'

# slope: one label per code
features.loc[features['slope'] == 1,'slope'] = 'upsloping'
features.loc[features['slope'] == 2,'slope'] = 'flat'
features.loc[features['slope'] == 3,'slope'] = 'downsloping'

# thal: one code per label
features.loc[features['thal'] == 1,'thal'] = 'normal'
features.loc[features['thal'] == 2,'thal'] = 'fixed'
features.loc[features['thal'] == 3,'thal'] = 'reversable'

# restecg
# 0: normal, 1: ST-T wave abnormality, 2: possible left ventricular hypertrophy
features.loc[features['restecg'] == 0,'restecg'] = 'normal'
features.loc[features['restecg'] == 1,'restecg'] = 'ST-T abnormal'
features.loc[features['restecg'] == 2,'restecg'] = 'Left ventricular hypertrophy'

# ca and thal are already object columns through the cast above
# (astype returns a new Series, so calling it without assignment has no effect)

features.head()

# dtype=float keeps the dummy columns numeric for StandardScaler
features = pd.get_dummies(features,dtype=float)
features_temp = StandardScaler().fit_transform(features)
# features_temp = StandardScaler().fit_transform(pd.get_dummies(features))

X_train,X_test,y_train,y_test = train_test_split(features_temp,targets,test_size=0.25)
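
The preprocessing above fits the scaler on all 303 rows before splitting, so statistics from the test rows influence the training features. One common way to avoid this is to put the encoding and scaling inside a scikit-learn `Pipeline`, so that they are refit on the training fold only. The sketch below is an alternative arrangement, not the report's method. It uses synthetic data, only three columns, and an arbitrary made-up target rule.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the heart data (values are random, not real)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "chol": rng.integers(126, 565, size=n),
    "cp": rng.integers(0, 4, size=n),
})
y = (toy["cp"] > 0).astype(int)  # arbitrary rule, for illustration only

# Scale the continuous columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "chol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# cross_val_score refits the whole pipeline on each training fold,
# so the scaler never sees the held-out fold
scores = cross_val_score(model, toy, y, cv=5)
print(scores.mean())
```

Passing `random_state` and `stratify=targets` to `train_test_split` would additionally make a single split reproducible and class-balanced.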

5. Classification and Evaluation with Several Methods

5.1 K-Nearest Neighbors

def plotting(estimator,y_test):
    # Plot the precision-recall curve and the ROC curve on the test set (reads the global X_test)
    fig,axes = plt.subplots(1,2,figsize=(10,5))
    y_predict_proba = estimator.predict_proba(X_test)
    precisions,recalls,thresholds = precision_recall_curve(y_test,y_predict_proba[:,1])
    # recall on the x-axis, precision on the y-axis, matching the axis labels
    axes[0].plot(recalls,precisions)
    axes[0].set_title("Average precision: %.2f"%average_precision_score(y_test,y_predict_proba[:,1]))
    axes[0].set_xlabel("Recall")
    axes[0].set_ylabel("Precision")

    fpr,tpr,thresholds = roc_curve(y_test,y_predict_proba[:,1])
    axes[1].plot(fpr,tpr)
    axes[1].set_title("AUC: %.2f"%auc(fpr,tpr))
    axes[1].set_xlabel("FPR")
    axes[1].set_ylabel("TPR")

# K-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn,features_temp,targets,cv=5)
print("Accuracy:",scores.mean())   # mean 5-fold cross-validated accuracy

knn.fit(X_train,y_train)

y_predict = knn.predict(X_test)
# Precision
print("Precision:",precision_score(y_test,y_predict))
# Recall
print("Recall:",recall_score(y_test,y_predict))
# F1 score
print("F1 score:",f1_score(y_test,y_predict))

plotting(knn,y_test)
plt.show()

Accuracy: 0.7985245901639344
Precision: 0.8
Recall: 0.8421052631578947
F1 score: 0.8205128205128205
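
The F1 score is the harmonic mean of precision and recall, so the three printed values can be checked against each other. The snippet below redoes that arithmetic. The counts TP=16, FP=4, FN=3 are an inference: they are one set of counts consistent with the printed values and are not reported in the output above.

```python
precision = 0.8
recall = 0.8421052631578947

# F1 = 2 * P * R / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8205

# Inferred counts (assumption): true positives, false positives, false negatives
tp, fp, fn = 16, 4, 3
precision_from_counts = tp / (tp + fp)      # 16 / 20
recall_from_counts = tp / (tp + fn)         # 16 / 19
f1_from_counts = 2 * tp / (2 * tp + fp + fn)  # 32 / 39
print(precision_from_counts, recall_from_counts, f1_from_counts)
```

Note that the accuracy line comes from 5-fold cross-validation on all samples, while precision, recall and F1 come from the single 25% test split, so they are not computed on the same rows.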

5.2 Decision Tree

tree = DecisionTreeClassifier(max_depth=10)
tree.fit(X_train,y_train)
plotting(tree,y_test)
plt.show()

5.3 Random Forest

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
plotting(rf,y_test)
plt.show()

5.4 Logistic Regression

logic = LogisticRegression(tol=1e-10)
logic.fit(X_train,y_train)
plotting(logic,y_test)
plt.show()

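5.5 SGD Classifier

# loss="log_loss" selects logistic regression in scikit-learn >= 1.1; earlier versions name it "log"
sgd = SGDClassifier(loss="log_loss")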
sgd.fit(X_train,y_train)
plotting(sgd,y_test)
plt.show()
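
Sections 5.1 to 5.5 evaluate each model on one random split, which makes the curves hard to compare across runs. A loop over `cross_val_score` with a shared metric gives a side-by-side comparison. The sketch below uses synthetic data from `make_classification` and fixed seeds, both of which are assumptions added for reproducibility. The SGD classifier is left out to keep the example independent of the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, roughly the size of the heart dataset
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}

# Mean 5-fold ROC AUC for every model
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With the real data, `X` and `y` would be replaced by `features_temp` and `targets`.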

5.6 Feature Importance Analysis

# 5.6 Heart disease prediction: feature importance analysis
importances = pd.Series(data=rf.feature_importances_,index=features.columns).sort_values(ascending=False)
sns.barplot(y=importances.index,x=importances.values,orient='h')
plt.show()
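
After one-hot encoding, the importance of a categorical variable is spread over several dummy columns such as `cp_typical` and `cp_atypical`. Summing the dummies back to their source variable makes the ranking easier to read. The sketch below shows one way to do that. The importance values are made up, and `base_name` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical importances, in the style of rf.feature_importances_
importances = pd.Series({
    "age": 0.30,
    "cp_typical": 0.10,
    "cp_atypical": 0.15,
    "sex_male": 0.25,
    "sex_female": 0.20,
})
categorical = ["cp", "sex"]  # variables that were one-hot encoded

def base_name(col):
    """Map a dummy column such as 'cp_typical' back to its source variable 'cp'."""
    for var in categorical:
        if col.startswith(var + "_"):
            return var
    return col

# groupby with a function applies it to every index label
grouped = importances.groupby(base_name).sum().sort_values(ascending=False)
print(grouped)
```

The prefix match relies on the `variable_level` naming that `pd.get_dummies` uses by default.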
