Introduction

The purpose of this project is to combine the principles of data science and medicine to develop a model that can predict heart disease. The advantage of such a model is that it is easy to interpret and consistent with the medical literature, unlike many machine learning models whose results are hard to interpret. This approach helped me build a model that, by screening just 34% of the population, can predict the occurrence of heart disease with 84% accuracy.

According to the WHO, heart diseases (broadly known as cardiovascular diseases or CVDs) claim an estimated 17.9 million lives each year, which is about 31% of all deaths worldwide. This makes CVDs the number one cause of death globally. [1]

Now, what if we could build a meaningful model that could predict the likelihood of heart disease in a patient, just based on a few parameters? The word ‘meaningful’ here is very important. We don’t necessarily want a model that will give us the highest accuracy rate, but rather one which incorporates significant features and can be explained from a medical point of view. For this project, I used Google Colab to develop my models.

Dataset

I worked with the ‘Heart Disease Cleveland UCI’ dataset from Kaggle (originally posted by UCI in its ML repository under the title ‘Heart Disease Data Set’). The Kaggle dataset contains data for 297 patients, with 13 features and 1 binary target variable called ‘condition’ (0 = heart disease absent, 1 = heart disease present). A detailed description of all 14 attributes is included here.
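
A quick sanity check of these numbers might look like the sketch below. It is only an illustration, not the code used in this project: the helper `summarize` is hypothetical, and the inline sample (its values and every column name except `condition`) is a made-up stand-in for the real Kaggle CSV.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the Kaggle CSV. Assumption: every value below and every
# column name except `condition` is invented purely for illustration.
SAMPLE_CSV = """age,sex,chol,condition
63,1,233,0
67,1,286,1
41,0,204,0
"""


def summarize(csv_file, target="condition"):
    """Return (n_rows, n_features, class_counts) for a CSV with a binary target."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("empty dataset")
    # Every column other than the target counts as a feature.
    feature_names = [name for name in rows[0] if name != target]
    class_counts = Counter(int(row[target]) for row in rows)
    # The target is described as binary: 0 = absent, 1 = present.
    unexpected = set(class_counts) - {0, 1}
    if unexpected:
        raise ValueError(f"non-binary target values: {sorted(unexpected)}")
    return len(rows), len(feature_names), dict(class_counts)


n_rows, n_features, class_counts = summarize(io.StringIO(SAMPLE_CSV))
print(n_rows, n_features, class_counts)
```

Run against the real file (opened with `open(path, newline="")`), the same check should report 297 rows and 13 features if the description above holds.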

Step 1: What does Medical Literature have to say?

Medical research identifies five factors as the most influential in predicting heart disease.


  • Age: Increasing age adds to the risk of developing heart disease. [2]

  • Sex: Males are at a higher risk of heart disease than pre-menopausal females. The risk is comparable between males and post-menopausal females. [3]

  • Serum cholesterol: Elevated serum cholesterol levels contribute to the development of heart disease. [4]

  • Blood pressure: Hypertension (high blood pressure) is a major risk factor for the development of heart disease. [5]

  • Chest pain: Approximately 25–50% of patients with heart disease suffer from silent myocardial ischemia (SMI), which means that they do not feel any chest discomfort. Hence, heart disease can be present even in the absence of chest pain. [6]

Luckily, all 5 factors above are





