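Data Analysis and Classification Prediction on the Kaggle Heart Disease Dataset: Statistical Learning Lab Report
1. Experiment Preparation
The data comes from Kaggle and contains 14 variables and 303 samples. The table below describes each variable.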
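Variable | Description | Values |
---|---|---|
target | Presence of heart disease (categorical) | 0=no, 1=yes |
age | Age (continuous) | [29,77] |
sex | Sex (categorical) | 1=male, 0=female |
cp | Chest pain type (categorical) | 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic |
trestbps | Resting blood pressure (continuous, mm Hg) | [94,200] |
chol | Cholesterol (continuous, mg/dl) | [126,564] |
fbs | Fasting blood sugar > 120 mg/dl (categorical) | 1=true, 0=false |
restecg | Resting electrocardiogram result (categorical) | 0=normal, 1=ST-T wave abnormality, 2=probable or definite left ventricular hypertrophy by Estes' criteria |
thalach | Maximum heart rate (continuous) | [71,202] |
exang | Exercise-induced angina (categorical) | 1=yes, 0=no |
oldpeak | ST depression induced by exercise relative to rest (continuous) | [0,6.2] |
slope | Slope of the peak exercise ST segment (categorical) | 1=upsloping, 2=flat, 3=downsloping |
ca | Number of major vessels (continuous) | [0,3] |
thal | Thalassemia, a blood disorder (categorical) | 1=normal, 2=fixed defect, 3=reversible defect |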
'''
-*- coding: utf-8 -*-
@Author : DouGang
@E-mail : dorza@qq.com
@Software : PyCharm, Python3.6
@Time : 2021-07-24
'''
Import the required libraries
# Libraries for exploring the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries for preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# K-nearest neighbors and evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.metrics import precision_recall_curve,roc_curve,average_precision_score,auc
# Decision tree
from sklearn.tree import DecisionTreeClassifier
# Random forest
from sklearn.ensemble import RandomForestClassifier
# Logistic regression
from sklearn.linear_model import LogisticRegression
# SGD classifier
from sklearn.linear_model import SGDClassifier
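2. Data Overview
plt.rcParams['font.sans-serif'] = ['SimHei'] # font with CJK glyphs, so Chinese text in plots renders correctly
heart_df = pd.read_csv("./dataSet/heart.csv")
print(heart_df.shape) # dimensions of the data
print(heart_df.head()) # first 5 rows
heart_df.info() # column types and non-null counts (info() prints directly and returns None)
print(heart_df.describe()) # summary statistics
print(heart_df.isnull().sum()) # number of missing values per column
sns.heatmap(heart_df.isnull()) # visual check for missing values
plt.show()
sns.pairplot(heart_df,hue='target') # pairwise relationships, colored by target
plt.show()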
(303, 14)
age sex cp trestbps chol fbs ... exang oldpeak slope ca thal target
0 63 1 3 145 233 1 ... 0 2.3 0 0 1 1
1 37 1 2 130 250 0 ... 0 3.5 0 0 2 1
2 41 0 1 130 204 0 ... 0 1.4 2 0 2 1
3 56 1 1 120 236 0 ... 0 0.8 2 0 2 1
4 57 0 0 120 354 0 ... 1 0.6 2 0 2 1
[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
age sex cp ... ca thal target
count 303.000000 303.000000 303.000000 ... 303.000000 303.000000 303.000000
mean 54.366337 0.683168 0.966997 ... 0.729373 2.313531 0.544554
std 9.082101 0.466011 1.032052 ... 1.022606 0.612277 0.498835
min 29.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000
25% 47.500000 0.000000 0.000000 ... 0.000000 2.000000 0.000000
50% 55.000000 1.000000 1.000000 ... 0.000000 2.000000 1.000000
75% 61.000000 1.000000 2.000000 ... 1.000000 3.000000 1.000000
max 77.000000 1.000000 3.000000 ... 4.000000 3.000000 1.000000
[8 rows x 14 columns]
age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 0
thal 0
target 0
dtype: int64
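The printed rows and summary statistics above show cp and slope values of 0, while the variable table documents codes that start at 1. A small check like the following can list every code that appears in the data but not in the table. This is only a sketch: `documented_codes`, `find_undocumented_codes` and the toy frame `sample` are illustrative names, and `sample` copies the cp, slope and thal values of the five rows printed by `head()` rather than loading the real file.

```python
import pandas as pd

# Documented category codes, copied from the variable table in section 1.
documented_codes = {
    "cp": {1, 2, 3, 4},
    "slope": {1, 2, 3},
    "thal": {1, 2, 3},
}

def find_undocumented_codes(df, documented):
    """Return {column: sorted codes present in df but absent from the table}."""
    report = {}
    for column, allowed in documented.items():
        extra = set(df[column].unique()) - allowed
        if extra:
            report[column] = sorted(int(v) for v in extra)
    return report

# Toy frame standing in for heart_df (values of the five rows printed by head()).
sample = pd.DataFrame({
    "cp": [3, 2, 1, 1, 0],
    "slope": [0, 0, 2, 2, 2],
    "thal": [1, 2, 2, 2, 2],
})
undocumented = find_undocumented_codes(sample, documented_codes)
print(undocumented)
```

Running the same function on the full `heart_df` before any recoding would show whether the string mappings used later cover every code.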
3. Descriptive Analysis
# Plot the correlation coefficients between variables
plt.figure(figsize=(10,10))
sns.heatmap(heart_df.corr(),annot=True,fmt='.1f')
plt.show()
# Age distribution of the samples
age_counts = heart_df['age'].value_counts()
sns.barplot(x=age_counts.index,y=age_counts.values)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution')
plt.show()
# Minimum, maximum and mean of the age column
minage = heart_df.age.min()
maxage = heart_df.age.max()
meanage = round(heart_df.age.mean(),2)
print('Minimum age:',minage)
print('Maximum age:',maxage)
print('Mean age:',meanage)
# Convert the continuous age variable into a categorical age group
heart_df['age_states'] = ''
# Assign through .loc with a boolean mask; chained indexing (df['col'][mask] = ...) may write to a copy
heart_df.loc[(heart_df['age']>=29)&(heart_df['age']<40),'age_states']='young ages'
heart_df.loc[(heart_df['age']>=40)&(heart_df['age']<55),'age_states']='middle ages'
heart_df.loc[(heart_df['age']>=55)&(heart_df['age']<=77),'age_states']='old ages'
# Number of samples in each age group
print(heart_df['age_states'].value_counts())
'''
x, y: column of data (or a Series) to count along that axis
order: order in which the categories appear on the axis
hue: column used to split each bar by category; hue_order: order of the hue categories
'''
sns.countplot(x='age_states',data=heart_df,order=['young ages','middle ages','old ages'])
plt.xlabel('Age Range')
plt.ylabel('Age Counts')
plt.title('Age State in Dataset')
plt.show()
Minimum age: 29
Maximum age: 77
Mean age: 54.37
old ages 159
middle ages 128
young ages 16
Name: age_states, dtype: int64
'''
The plot shows that the number of samples grows with the age group:
16 young, 128 middle-aged and 159 old.
'''
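The three age groups used here (29 to 39, 40 to 54, 55 to 77) can also be produced in one call with `pd.cut`. The sketch below runs on a few hand-picked ages instead of `heart_df`. The upper bin edge of 78 is an assumption, chosen so that the maximum age of 77 falls inside the last half-open interval.

```python
import pandas as pd

# Hand-picked ages on the group boundaries (illustrative values, not the dataset).
ages = pd.Series([29, 39, 40, 54, 55, 77], name="age")

# right=False makes each bin half-open: [29,40), [40,55), [55,78).
age_states = pd.cut(
    ages,
    bins=[29, 40, 55, 78],
    right=False,
    labels=["young ages", "middle ages", "old ages"],
)
print(age_states.value_counts(sort=False))
```

Because the result is an ordered categorical, plots and `value_counts(sort=False)` keep the young, middle, old order without an explicit `order` argument.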
# Sample counts by sex: 0 = female, 1 = male
print(heart_df['sex'].value_counts())
sns.countplot(y='sex',data=heart_df)
plt.title('Sex Count in Dataset')
plt.show()
1 207
0 96
Name: sex, dtype: int64
# Cross-tabulation: rows are sex, columns are heart disease status
pd.crosstab(heart_df['sex'],heart_df['target'])
# Relationship between sex and heart disease: 0 = female, 1 = male
pd.crosstab(heart_df['sex'],heart_df['target']).plot(kind="bar",figsize=(12,8),color=['#1CA53B','#AA1111'])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('sex(0=female, 1=male)')
plt.xticks(rotation=0)
plt.legend(["No Disease","Have Disease"])
plt.ylabel('Frequency')
plt.show()
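Raw counts are hard to compare across sexes because the dataset has 207 male and 96 female samples. One option is to normalize each row of the cross-tabulation so that it shows the share of each outcome within a sex. The sketch below uses a small hypothetical frame `toy`, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT taken from heart.csv): 4 samples with sex 0, 6 with sex 1.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="index" divides each row by its row total,
# so every row sums to 1 and gives the outcome rate within that sex.
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

On the real data the same call would be `pd.crosstab(heart_df['sex'], heart_df['target'], normalize='index')`.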
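# Heart disease prediction: sex and disease status
# Distribution of the target
fig,axes = plt.subplots(1,2,figsize=(10,5))
ax = heart_df.target.value_counts().plot(kind="bar",ax=axes[0])
ax.set_title("Disease distribution")
ax.set_xlabel("1: disease, 0: no disease")
# value_counts() sorts by frequency, so the labels assume class 1 is the larger class (mean of target is 0.54)
heart_df.target.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['disease','no disease'],ax=axes[1])
plt.show()
# Distribution of sex and disease status
ax1 = plt.subplot(121)
ax = sns.countplot(x="sex",hue='target',data=heart_df,ax=ax1)
ax.set_xlabel("0: female, 1: male")
ax2 = plt.subplot(222)
# The pie labels assume males are the larger group in each subset, because value_counts() sorts by frequency
heart_df[heart_df['target'] == 0].sex.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['male','female'],ax=ax2)
ax2.set_title("Sex ratio, no disease")
ax2 = plt.subplot(224)
heart_df[heart_df['target'] == 1].sex.value_counts().plot(kind="pie",autopct="%.2f%%",labels=['male','female'],ax=ax2)
ax2.set_title("Sex ratio, disease")
plt.show()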
fig,axes = plt.subplots(2,1,figsize=(20,10))
sns.countplot(x="age",hue="target",data=heart_df,ax=axes[0])
# Age groups: [0,45) young, [45,60) middle-aged, [60,100) old
age_type = pd.cut(heart_df.age,bins=[0,45,60,100],include_lowest=True,right=False,labels=['young','middle-aged','old'])
age_target_df = pd.concat([age_type,heart_df.target],axis=1)
# Draw the grouped counts on the second subplot explicitly
sns.countplot(x="age",hue='target',data=age_target_df,ax=axes[1])
plt.show()
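# Distribution of every feature at a glance
fig,axes = plt.subplots(7,2,figsize=(10,20))
for x in range(0,14):
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(heart_df.iloc[:,x],kde=True,stat="density",ax=axes.flat[x])
plt.tight_layout()
plt.show()
plt.figure(figsize=(8,5))
# numeric_only=True (pandas >= 1.5) skips the string column age_states
sns.heatmap(heart_df.corr(numeric_only=True),cmap="Blues",annot=True)
plt.show()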
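4. Feature Preprocessing
# Data preprocessing
# Note: heart_df still contains the derived age_states column from section 3, so it is encoded as a feature too
features = heart_df.drop(columns=['target'])
targets = heart_df['target']
# Convert the discrete columns from plain integer codes (0, 1, 2, ...) to descriptive strings.
# Cast them to object first so that string labels can be assigned into integer columns.
categorical_cols = ['sex','cp','fbs','restecg','exang','slope','ca','thal']
features = features.astype({col: object for col in categorical_cols})
# The mappings follow the variable table in section 1. The printed data shows cp and slope codes
# starting at 0, so check the codes in your copy of heart.csv; unmapped codes stay as integers.
# sex
features.loc[features['sex']==0,'sex'] = 'female'
features.loc[features['sex']==1,'sex'] = 'male'
# cp
features.loc[features['cp'] == 1,'cp'] = 'typical'
features.loc[features['cp'] == 2,'cp'] = 'atypical'
features.loc[features['cp'] == 3,'cp'] = 'non-anginal'
features.loc[features['cp'] == 4,'cp'] = 'asymptomatic'
# fbs
features.loc[features['fbs'] == 1,'fbs'] = 'true'
features.loc[features['fbs'] == 0,'fbs'] = 'false'
# exang
features.loc[features['exang'] == 1,'exang'] = 'true'
features.loc[features['exang'] == 0,'exang'] = 'false'
# slope: one label per code, so the three categories stay distinct
features.loc[features['slope'] == 1,'slope'] = 'upsloping'
features.loc[features['slope'] == 2,'slope'] = 'flat'
features.loc[features['slope'] == 3,'slope'] = 'downsloping'
# thal: one label per code (1, 2, 3)
features.loc[features['thal'] == 1,'thal'] = 'normal'
features.loc[features['thal'] == 2,'thal'] = 'fixed'
features.loc[features['thal'] == 3,'thal'] = 'reversable'
# restecg
# 0: normal, 1: ST-T wave abnormality, 2: possible left ventricular hypertrophy
features.loc[features['restecg'] == 0,'restecg'] = 'normal'
features.loc[features['restecg'] == 1,'restecg'] = 'ST-T abnormal'
features.loc[features['restecg'] == 2,'restecg'] = 'Left ventricular hypertrophy'
# ca and thal are already object columns after the cast above, so get_dummies encodes them as categories
features.head()
# One-hot encode the categorical columns
features = pd.get_dummies(features)
# Standardize every column to zero mean and unit variance
features_temp = StandardScaler().fit_transform(features)
# features_temp = StandardScaler().fit_transform(pd.get_dummies(features))
# Hold out 25% of the samples as the test set
X_train,X_test,y_train,y_test = train_test_split(features_temp,targets,test_size=0.25)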
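The preprocessing above fits `StandardScaler` on all 303 samples before splitting, so the test rows influence the scaling statistics. A common alternative, shown here only as a sketch and not as the method used in this report, is to split first and put the encoder and scaler in a scikit-learn `Pipeline` so that they are fitted on the training rows alone. The frame `toy` and its target rule are synthetic assumptions made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for heart_df (hypothetical values, NOT the real data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "thalach": rng.integers(71, 203, size=n),
    "cp": rng.integers(0, 4, size=n),
    "sex": rng.integers(0, 2, size=n),
})
# Made-up rule so that the target depends on the features.
toy["target"] = (toy["thalach"] + 10 * toy["cp"] > 150).astype(int)

X = toy.drop(columns=["target"])
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the continuous columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "thalach"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp", "sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler and encoder statistics from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy: %.2f" % test_accuracy)
```

The same pipeline object can be passed to `cross_val_score`, which then refits the preprocessing inside every fold.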
5. Classification and Model Evaluation
5.1 K-Nearest Neighbors
def plotting(estimator,y_test):
    # Plot the precision-recall curve and the ROC curve; reads the global X_test
    fig,axes = plt.subplots(1,2,figsize=(10,5))
    y_predict_proba = estimator.predict_proba(X_test)
    precisions,recalls,thresholds = precision_recall_curve(y_test,y_predict_proba[:,1])
    axes[0].plot(recalls,precisions) # recall on x, precision on y, matching the axis labels
    axes[0].set_title("Average precision: %.2f"%average_precision_score(y_test,y_predict_proba[:,1]))
    axes[0].set_xlabel("Recall")
    axes[0].set_ylabel("Precision")
    fpr,tpr,thresholds = roc_curve(y_test,y_predict_proba[:,1])
    axes[1].plot(fpr,tpr)
    axes[1].set_title("AUC: %.2f"%auc(fpr,tpr))
    axes[1].set_xlabel("FPR")
    axes[1].set_ylabel("TPR")
# K-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
# Mean accuracy over 5-fold cross-validation on the full dataset
scores = cross_val_score(knn,features_temp,targets,cv=5)
print("Accuracy:",scores.mean())
knn.fit(X_train,y_train)
y_predict = knn.predict(X_test)
# Precision
print("Precision:",precision_score(y_test,y_predict))
# Recall
print("Recall:",recall_score(y_test,y_predict))
# F1 score
print("F1 score:",f1_score(y_test,y_predict))
plotting(knn,y_test)
plt.show()
Accuracy: 0.7985245901639344
Precision: 0.8
Recall: 0.8421052631578947
F1 score: 0.8205128205128205
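The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the snippet below recomputes it from the precision and recall printed above.

```python
# Precision and recall as printed in the K-nearest neighbors output above.
precision = 0.8
recall = 0.8421052631578947

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8205
```

The result agrees with the value reported by `f1_score`.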
5.2 Decision Tree
tree = DecisionTreeClassifier(max_depth=10)
tree.fit(X_train,y_train)
plotting(tree,y_test)
plt.show()
5.3 Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
plotting(rf,y_test)
plt.show()
5.4 Logistic Regression
logic = LogisticRegression(tol=1e-10)
logic.fit(X_train,y_train)
plotting(logic,y_test)
plt.show()
5.5 SGD Classifier
# Logistic loss is required for predict_proba; it is named "log" in scikit-learn < 1.1
sgd = SGDClassifier(loss="log_loss")
sgd.fit(X_train,y_train)
plotting(sgd,y_test)
plt.show()
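Each model above is judged on a single random split, so the curves change from run to run. A possible way to put the models side by side is to score them with the same cross-validation folds. The sketch below does this on synthetic data from `make_classification`, which stands in for `(features_temp, targets)`. The fixed `random_state` values and the choice of ROC AUC as the metric are assumptions, and the SGD classifier is left out for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (features_temp, targets); hypothetical data only.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}

# Mean ROC AUC over 5 folds for every model.
auc_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

# Print the models from best to worst mean AUC.
for name, score in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%-10s mean AUC = %.3f" % (name, score))
```

With the real data, `X_demo` and `y_demo` would be replaced by `features_temp` and `targets`.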
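5.6 Feature Importance Analysis
# 5.6 Heart disease prediction: feature importance analysis
# Importances come from the fitted random forest, indexed by the one-hot encoded column names
importances = pd.Series(data=rf.feature_importances_,index=features.columns).sort_values(ascending=False)
sns.barplot(y=importances.index,x=importances.values,orient='h')
plt.show()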
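Because the features were one-hot encoded, the importance of a categorical variable is spread over several dummy columns such as `cp_typical` and `cp_atypical`. One way to read the chart per original variable is to sum the importances of the dummies that share a source column. The sketch below uses made-up importance values, and `source_column` and `original_columns` are hypothetical helper names.

```python
import pandas as pd

# Hypothetical importances for illustration only (not results from this report).
importances = pd.Series({
    "age": 0.20,
    "thalach": 0.25,
    "cp_typical": 0.10,
    "cp_atypical": 0.15,
    "sex_male": 0.18,
    "sex_female": 0.12,
})
original_columns = ["age", "thalach", "cp", "sex"]

def source_column(dummy_name, columns):
    """Map a get_dummies column name back to the column it was derived from."""
    # Try longer names first so that e.g. "age_states" would win over "age".
    for column in sorted(columns, key=len, reverse=True):
        if dummy_name == column or dummy_name.startswith(column + "_"):
            return column
    return dummy_name

# Group the index labels by their source column and add up the importances.
grouped = importances.groupby(lambda name: source_column(name, original_columns)).sum()
grouped = grouped.sort_values(ascending=False)
print(grouped)
```

Summed importances favor variables with many categories less than the individual bars suggest, so both views are worth checking.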