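This post is a small demo that analyzes the "heart disease classification" dataset from Kaggle.
The overall workflow: inspect the data, preprocess it, build logistic regression, KNN and decision tree models, look at the F1 score, the confusion matrix and the precision-recall curve, plot each model's ROC curve for comparison, and finally combine models, which is where a random forest comes in.
Data exploration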
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# SimHei lets matplotlib render Chinese characters in labels
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign displayable when a non-default font is in use
mpl.rcParams['axes.unicode_minus'] = False
df = pd.read_csv('heart_disease_data/heart.csv')
A quick look at the overall shape of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age 303 non-null int64
sex 303 non-null int64
cp 303 non-null int64
trestbps 303 non-null int64
chol 303 non-null int64
fbs 303 non-null int64
restecg 303 non-null int64
thalach 303 non-null int64
exang 303 non-null int64
oldpeak 303 non-null float64
slope 303 non-null int64
ca 303 non-null int64
thal 303 non-null int64
target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
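The `info()` output above shows 303 non-null entries in every column, so nothing is missing here. For other files it can help to get the same facts as a table. The following is a minimal sketch, not part of the original notebook: `quick_overview` is a hypothetical helper name, and `toy` is a made-up frame standing in for `heart.csv`.

```python
import pandas as pd

def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical toy frame (illustrative values, one deliberate gap in oldpeak)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "oldpeak": [2.3, 3.5, None, 0.8],
    "target": [1, 1, 1, 0],
})

overview = quick_overview(toy)
print(overview)
```

Calling `quick_overview(df)` on the real frame should report zero missing values in every row, in line with the `info()` output.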
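Meaning of the features
age: age
sex: sex (1 = male, 0 = female)
cp: chest pain type (4 kinds); 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
trestbps: resting blood pressure
chol: serum cholesterol
fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg: resting electrocardiogram result (values 0, 1, 2)
thalach: maximum heart rate achieved
exang: exercise-induced angina (1 = yes, 0 = no)
oldpeak: ST depression induced by exercise relative to rest (ST refers to a segment of the ECG trace)
slope: slope of the peak exercise ST segment; 1: upsloping, 2: flat, 3: downsloping
ca: number of major vessels (0 - 3)
thal: a blood disorder called thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
target: whether the patient has heart disease (0 = no, 1 = yes)
Note that the codes listed here do not line up one-to-one with this CSV: the describe() output reports cp in 0 - 3, slope in 0 - 2, thal in 0 - 3 and ca in 0 - 4, so check the actual values before reading meaning into a specific code.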
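Since the documented codes and the values actually stored in the file can differ (the `describe()` output in this post has a minimum of 0 for cp, slope and thal), a quick consistency check is worth a few lines. This is only a sketch: `undocumented_values` is a hypothetical helper, and `toy` is an illustrative frame rather than the real data.

```python
import pandas as pd

# Codes as given in the feature description
documented_codes = {
    "cp": {1, 2, 3, 4},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
}

def undocumented_values(frame, codebook):
    """Map each column to the sorted observed values that its codebook entry lacks."""
    report = {}
    for column, allowed in codebook.items():
        observed = set(frame[column].dropna().unique().tolist())
        extra = sorted(observed - allowed)
        if extra:
            report[column] = extra
    return report

# Hypothetical toy frame using 0-based codes, as the summary statistics suggest
toy = pd.DataFrame({
    "cp": [0, 1, 2, 3],
    "slope": [0, 1, 2, 2],
    "thal": [1, 2, 3, 0],
})

mismatches = undocumented_values(toy, documented_codes)
print(mismatches)
```

An empty dictionary would mean every observed value is covered by the description; anything else points at columns whose coding needs a second look.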
df.describe()
              age         sex          cp    trestbps        chol         fbs     restecg
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000
mean    54.366337    0.683168    0.966997  131.623762  246.264026    0.148515    0.528053
std      9.082101    0.466011    1.032052   17.538143   51.830751    0.356198    0.525860
min     29.000000    0.000000    0.000000   94.000000  126.000000    0.000000    0.000000
25%     47.500000    0.000000    0.000000  120.000000  211.000000    0.000000    0.000000
50%     55.000000    1.000000    1.000000  130.000000  240.000000    0.000000    1.000000
75%     61.000000    1.000000    2.000000  140.000000  274.500000    0.000000    1.000000
max     77.000000    1.000000    3.000000  200.000000  564.000000    1.000000    2.000000

          thalach       exang     oldpeak       slope          ca        thal      target
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000
mean   149.646865    0.326733    1.039604    1.399340    0.729373    2.313531    0.544554
std     22.905161    0.469794    1.161075    0.616226    1.022606    0.612277    0.498835
min     71.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%    133.500000    0.000000    0.000000    1.000000    0.000000    2.000000    0.000000
50%    153.000000    0.000000    0.800000    1.000000    0.000000    2.000000    1.000000
75%    166.000000    1.000000    1.600000    2.000000    1.000000    3.000000    1.000000
max    202.000000    1.000000    6.200000    2.000000    4.000000    3.000000    1.000000
A few quick plots to see how the features relate to each other
df.target.value_counts()
1 165
0 138
Name: target, dtype: int64
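Raw counts are easier to judge as shares, especially because class balance affects how metrics such as F1 should be read later on. The sketch below rebuilds the two counts shown above by hand so that it runs without the CSV; on the real frame, `df.target.value_counts(normalize=True)` gives the same shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1: 165, 0: 138}, name="target")

# Share of each class among all 303 rows
shares = (counts / counts.sum()).round(4)
print(shares)
```

With roughly 54% positive and 46% negative cases, the two classes are fairly balanced.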
sns.countplot(x='target', data=df, palette="muted")
plt.xlabel("Disease / no disease ratio")
Text(0.5,0,'Disease / no disease ratio')
df.sex.value_counts()
1 207
0 96
Name: sex, dtype: int64
sns.countplot(x='sex', data=df, palette="Set3")
plt.xlabel("Sex (0 = female, 1 = male)")
Text(0.5,0,'Sex (0 = female, 1 = male)')
plt.figure(figsize=(18, 7))
sns.countplot(x='age', data=df, hue='target', palette='PuBuGn', saturation=0.8)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.show()
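A count plot with one bar pair per individual age can be hard to read, so one option is to bin ages and compare the disease rate per bin. This is a sketch under assumptions: the bin edges are arbitrary choices of mine (they merely cover the 29 to 77 range reported by `describe()`), and `toy` holds invented rows, not values from `heart.csv`.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "age":    [34, 39, 45, 48, 52, 57, 58, 63, 67, 71],
    "target": [1,  1,  1,  0,  1,  0,  0,  1,  0,  0],
})

# Right-inclusive bins: (28, 40], (40, 50], (50, 60], (60, 80]
bins = [28, 40, 50, 60, 80]
toy["age_group"] = pd.cut(toy["age"], bins=bins)

# Group size and share of target == 1 within each age group
rate = toy.groupby("age_group", observed=True)["target"].agg(["size", "mean"])
print(rate)
```

Applied to the real frame, the `mean` column would give the fraction of patients with `target == 1` in each age band.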
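Getting to know the data is an important step, but this post focuses on modeling, so the exploration stops here.
Data processing
Handle the features whose values are categorical codes rather than continuous numbers (cp, slope, thal)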
# One-hot encode each categorical feature into prefixed indicator columns
first = pd.get_dummies(df['cp'], prefix="cp")
second = pd.get_dummies(df['slope'], prefix="slope")
third = pd.get_dummies(df['thal'], prefix="thal")
# Append the indicator columns, then drop the original coded columns
df = pd.concat([df, first, second, third], axis=1)
df = df.drop(columns=['cp', 'slope', 'thal'])
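The same encoding can also be done in one call by passing `columns=` to `pd.get_dummies`, which replaces the listed columns with their indicator columns. The sketch below is an alternative to the steps above, not the author's code; `toy` is a made-up three-row frame, and `dtype=int` is my choice to get 0/1 values regardless of the pandas version.

```python
import pandas as pd

# Hypothetical miniature frame with two coded columns
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "slope": [0, 0, 2],
})

# Encode both columns at once; the originals are dropped automatically
encoded = pd.get_dummies(toy, columns=["cp", "slope"], dtype=int)
print(encoded.columns.tolist())

# drop_first=True omits the first level of each feature, which avoids
# perfectly collinear indicator columns in linear models
reduced = pd.get_dummies(toy, columns=["cp", "slope"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Whether to drop one level per feature is a modeling choice: it matters mostly for unregularized linear models, while tree-based models and KNN are usually fed the full set of indicators.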
df.head(3)
   age  sex  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  ca  ...  cp_1  cp_2  cp_3  slope_0  slope_1  slope_2  thal_0  thal_1  thal_2  thal_3
0   63    1       145   233    1        0      150      0      2.3   0