Machine Learning--Heart Disease Prediction 1

Source Information:

( a ) Creators:
– 1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
– 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
– 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
– 4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
Robert Detrano, M.D., Ph.D.
( b ) Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779
( c ) Date: July, 1988

Data Introduction:

According to the documentation of the heart disease database, the databases contain 76 raw attributes, but only 14 of them are actually used in published experiments. So, I would like to use these 14 features to build the model first, and deal with the raw 76-feature data later (if I have time).

The following is the information on the 14 features:
Attribute Information: ( Only 14 used )

  -- 1. #3  (age)       
  -- 2. #4  (sex)       
  -- 3. #9  (cp)        
  -- 4. #10 (trestbps)  
  -- 5. #12 (chol)      
  -- 6. #16 (fbs)       
  -- 7. #19 (restecg)   
  -- 8. #32 (thalach)   
  -- 9. #38 (exang)     
  -- 10. #40 (oldpeak)   
  -- 11. #41 (slope)     
  -- 12. #44 (ca)        
  -- 13. #51 (thal)      
  -- 14. #58 (num)       (the predicted attribute)


--> 3 age:  age in years. 
--> 4 sex: sex (1 = male; 0 = female) 
--> 9 cp: chest pain type
    -- Value 1: typical angina
    -- Value 2: atypical angina
    -- Value 3: non-anginal pain
    -- Value 4: asymptomatic
--> 10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
--> 12 chol: serum cholesterol in mg/dl
--> 16 fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
--> 19 restecg: resting electrocardiographic results
    -- Value 0: normal
    -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                elevation or depression of > 0.05 mV)
    -- Value 2: showing probable or definite left ventricular hypertrophy
                by Estes' criteria
--> 32 thalach: maximum heart rate achieved
--> 38 exang: exercise induced angina (1 = yes; 0 = no)
--> 40 oldpeak: ST depression induced by exercise relative to rest
--> 41 slope: the slope of the peak exercise ST segment
    -- Value 1: upsloping
    -- Value 2: flat
    -- Value 3: downsloping
--> 44 ca: number of major vessels (0-3) colored by fluoroscopy
--> 51 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
--> 58 num: diagnosis of heart disease (angiographic disease status)
    -- Value 0: < 50% diameter narrowing
    -- Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
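
Since most of these attributes are integer codes, a quick check that each column only contains its documented values can catch loading problems early. Below is a minimal sketch, assuming the CSV columns use the short attribute names above; the VALID_VALUES table and check_coded_columns helper are my own, not part of the original dataset tooling.

import pandas as pd

# Documented value sets for the coded (categorical) attributes.
VALID_VALUES = {
    "sex":     {0, 1},
    "cp":      {1, 2, 3, 4},
    "fbs":     {0, 1},
    "restecg": {0, 1, 2},
    "exang":   {0, 1},
    "slope":   {1, 2, 3},
    "ca":      {0, 1, 2, 3},
    "thal":    {3, 6, 7},
}

def check_coded_columns(df: pd.DataFrame) -> None:
    """Report any values outside the documented code sets."""
    for col, allowed in VALID_VALUES.items():
        if col not in df.columns:
            print(f"{col}: column not found")
            continue
        bad = set(df[col].dropna().unique()) - allowed
        if bad:
            print(f"{col}: unexpected values {sorted(bad)}")

# Example: check_coded_columns(pd.read_csv("clevelandtrain.csv"))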

Data Preprocessing

Before dealing with the data, we need to load it.


import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn import svm, tree
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier



## load data
trainSet = pd.read_csv("clevelandtrain.csv")
testSet = pd.read_csv("clevelandtest.csv")

xtrain = trainSet.drop(["heartdisease::category|0|1"], axis=1).values  # (152, 13)
ytrain = trainSet["heartdisease::category|0|1"].values                 # (152,)

xtest = testSet.drop(["heartdisease::category|0|1"], axis=1).values    # (145, 13)
ytest = testSet["heartdisease::category|0|1"].values                   # (145,)
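
Before encoding, it is worth confirming the shapes and the class balance, since a skewed label distribution would change how we read the accuracy scores later. A small check using the variables defined above:

# Quick sanity checks on the loaded arrays.
print("train:", xtrain.shape, ytrain.shape)   # expect (152, 13) (152,)
print("test: ", xtest.shape,  ytest.shape)    # expect (145, 13) (145,)

# Class balance: fraction of positive (heart disease) labels in each split.
print("train positive rate:", ytrain.mean())
print("test positive rate: ", ytest.mean())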


From the above description, we can see that #9 (cp), #19 (restecg), #41 (slope), and #51 (thal) are all categorical integer features, so we will encode them as one-hot numeric arrays.

# one-hot-encoder: #9 (cp), #19 (restecg),  #41 (slope), #51 (thal)

xtrain_pre = trainSet.drop(["cp", "restecg", "slope", "thal", "heartdisease::category|0|1"], axis=1).values  # (152, 9)
xtrain_cp = trainSet["cp"].values
xtrain_restecg = trainSet["restecg"].values
xtrain_slope = trainSet["slope"].values
xtrain_thal = trainSet["thal"].values

ohe1 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe2 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe3 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe4 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')

xtrain_cp = ohe1.fit_transform(xtrain_cp.reshape(-1,1))                    # (152, 4)
xtrain_restecg = ohe2.fit_transform(xtrain_restecg.reshape(-1,1))          # (152, 3)
xtrain_slope = ohe3.fit_transform(xtrain_slope.reshape(-1,1))              # (152, 3)
xtrain_thal = ohe4.fit_transform(xtrain_thal.reshape(-1,1))                # (152, 3)


xTrain = np.hstack((xtrain_pre, xtrain_cp, xtrain_restecg, xtrain_slope, xtrain_thal))   # (152, 22)
yTrain = ytrain                                                                          # (152,)



xtest_pre = testSet.drop(["cp", "restecg", "slope", "thal", "heartdisease::category|0|1"], axis=1).values   # (145, 9)
xtest_cp = testSet["cp"].values
xtest_restecg = testSet["restecg"].values
xtest_slope = testSet["slope"].values
xtest_thal = testSet["thal"].values

xtest_cp = ohe1.transform(xtest_cp.reshape(-1,1))                 # (145, 4)
xtest_restecg = ohe2.transform(xtest_restecg.reshape(-1,1))       # (145, 3)
xtest_slope = ohe3.transform(xtest_slope.reshape(-1,1))           # (145, 3)
xtest_thal = ohe4.transform(xtest_thal.reshape(-1,1))             # (145, 3)

xTest = np.hstack((xtest_pre, xtest_cp, xtest_restecg, xtest_slope, xtest_thal))   # (145, 22)
yTest = ytest       
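
As a side note, the four separate encoders and the manual np.hstack can be replaced by a single ColumnTransformer, which one-hot encodes the categorical columns and passes the rest through in one step. A sketch of the equivalent preprocessing (the output column order differs from the manual version, with the one-hot columns first; also note that in newer scikit-learn versions the sparse argument is named sparse_output):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical = ["cp", "restecg", "slope", "thal"]

# One-hot encode the categorical columns; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(sparse=False, handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

xTrain_alt = preprocess.fit_transform(trainSet.drop(["heartdisease::category|0|1"], axis=1))
xTest_alt = preprocess.transform(testSet.drop(["heartdisease::category|0|1"], axis=1))
print(xTrain_alt.shape, xTest_alt.shape)  # (152, 22) (145, 22)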

Build Model

First of all, we build an SVM model (with the RBF kernel) and use cross-validation to find the best parameters.
After testing, I found that the RBF kernel gives the best performance.

svc = svm.SVC()
parameters_kernel = ['rbf']
parameters_C = np.linspace(100, 1000, num=10)       # 100, 200, ..., 1000
parameters_gamma = np.linspace(1e-3, 1e-4, num=10)  # 0.001 down to 0.0001

parameters = {'kernel': parameters_kernel, 'C': parameters_C, 'gamma': parameters_gamma}

# parameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
#               {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [3]}
#              ]

clf = GridSearchCV(estimator=svc, param_grid=parameters, cv=5)
clf.fit(xTrain, yTrain)

print("Best Parameters:", clf.best_params_)
# print("Best Estimators:\n", clf.best_estimator_)
print("Best Scores:", clf.best_score_)

svcBest = clf.best_estimator_
svcScore = svcBest.score(xTest, yTest)

print("Test Scores:", svcScore)

The following is the output of the SVM (SVC):

Best Parameters: {'C': 300.0, 'gamma': 0.0001, 'kernel': 'rbf'}
Best Scores: 0.7236842105263158
Test Scores: 0.7862068965517242
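
Beyond the single best setting, GridSearchCV stores the mean cross-validation score for every (C, gamma) pair in cv_results_, which is useful for checking whether the optimum is a stable plateau or a lucky spike. A short way to view the top settings:

import pandas as pd

# Collect the grid-search results and show the five best parameter settings.
results = pd.DataFrame(clf.cv_results_)
cols = ["param_C", "param_gamma", "mean_test_score", "std_test_score"]
print(results[cols].sort_values("mean_test_score", ascending=False).head())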

From the above analysis, we have obtained the best parameters for the SVM with the RBF kernel.

Now I build four models: SVM with an RBF kernel, SVM with a polynomial kernel, bagged decision trees, and AdaBoost. (The ensemble here is a BaggingClassifier over decision trees, a close relative of a random forest.) To compare their results, I print their scores together.


svcRBF = SVC(C=300.0, gamma=0.0001, kernel='rbf', probability=True)
svcRBF.fit(xTrain, yTrain)
svcRBFScore = svcRBF.score(xTest, yTest)  # test accuracy
print("the test score of svcRBFScore: " + str(svcRBFScore))

# degree must be an integer; the original non-integer value (8.666666, presumably
# from a parameter sweep) would be truncated to 8 anyway.
svcPoly = SVC(C=1.0, degree=8, coef0=1.0, gamma='scale', max_iter=-1, kernel='poly', probability=True)
svcPoly.fit(xTrain, yTrain)
svcPolyScore = svcPoly.score(xTest, yTest)  # test accuracy
print("the test score of svcPolyScore: " + str(svcPolyScore))


decisionTree = tree.DecisionTreeClassifier()
decisionTreeBagging = BaggingClassifier(decisionTree, max_samples=0.7, max_features=1.0)
decisionTreeAda = AdaBoostClassifier(decisionTree, n_estimators=10, random_state=np.random.RandomState(1))

decisionTreeBagging.fit(xTrain, yTrain)
decisionTreeAda.fit(xTrain, yTrain)
Bagging_score = decisionTreeBagging.score(xTest, yTest)
AdaBoost_score = decisionTreeAda.score(xTest, yTest)

print("the test score of Bagging:", Bagging_score)
print("the test score of Adaboost:", AdaBoost_score)


the test score of svcRBFScore: 0.7862068965517242
the test score of svcPolyScore: 0.696551724137931
the test score of Bagging: 0.8
the test score of Adaboost: 0.7103448275862069

For now, bagging seems to have the best performance. Later, I will try some statistical testing methods to give a more detailed comparison; a sketch of one such test follows.
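
One suitable choice is McNemar's test, which compares two classifiers on the same test set using only the cases where they disagree. Below is a minimal sketch; the mcnemar_test helper is my own, using the continuity-corrected chi-squared version of the test, which is only a rough approximation when the models disagree on few samples.

import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (with continuity correction) for two classifiers
    evaluated on the same test set. Returns the p-value."""
    a_right = (pred_a == y_true)
    b_right = (pred_b == y_true)
    n01 = np.sum(a_right & ~b_right)   # A correct, B wrong
    n10 = np.sum(~a_right & b_right)   # A wrong, B correct
    if n01 + n10 == 0:
        return 1.0  # the models never disagree
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2.sf(stat, df=1)  # chi-squared with 1 degree of freedom

# Example: is bagging significantly better than the RBF SVM on this test set?
p = mcnemar_test(yTest, decisionTreeBagging.predict(xTest), svcRBF.predict(xTest))
print("McNemar p-value:", p)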
