Survival Analysis with Python Tutorial: When and Where

Machine Learning, Programming, Statistics

Author(s): Pratik Shukla

This article gives a step-by-step review, with explanations and code, of statistical survival analysis: a set of methods for investigating the time it takes for some event to occur, such as patient survival during the COVID-19 pandemic, the time to failure of engineering products, or even the time to closing a sale after an initial customer contact.

This tutorial’s code is available on GitHub, and its full implementation is available on Google Colab.

Table of Contents:

  1. Survival Analysis Basics
  2. Kaplan-Meier Fitter Theory with an Example
  3. Nelson-Aalen Fitter Theory with an Example
  4. Kaplan-Meier Fitter Based on Different Groups
  5. Log-Rank Test with an Example
  6. Cox Regression with an Example
  7. Resources

1. Survival Analysis Basics:

Survival analysis is a set of statistical approaches for studying the time until an event of interest occurs. Time can be measured in years, months, weeks, days, or any other unit. The event can be anything we care about: a death, a birth, a retirement, and so on.

For instance, how can survival analysis help us analyze data from the ongoing COVID-19 pandemic?

  1. We can find the number of days until patients showed COVID-19 symptoms.
  2. We can find which age group the disease is deadliest for.
  3. We can find which treatment has the highest survival probability.
  4. We can find whether a person’s sex has a significant effect on their survival time.
  5. We can find the median number of days of survival for patients.
  6. We can find which factor has the most impact on patients’ survival.

In this tutorial, we are going to perform a thorough analysis of patients with lung cancer. Do not worry if it seems complicated. Once we go through the logic behind it, we will be able to perform survival analysis on any data set. Exciting, isn’t it?

Survival analysis is used in a variety of fields, such as:

  • Cancer studies, for analyzing patients’ survival time.
  • Sociology, for “event-history analysis.”
  • Engineering, for “failure-time analysis.”
  • Time until product failure.
  • Time until a warranty claim.
  • Time until a process reaches a critical level.
  • Time from initial sales contact to a sale.
  • Time from employee hire to either termination or resignation.
  • Time from a salesperson’s hire to their first sale.

In cancer studies, typical research questions are:

  1. What is the impact of specific clinical characteristics on a patient’s survival? For example, is there any difference in survival between people who have high blood sugar and those who do not?
  2. What is the probability that an individual survives a specific time (years, months, days)? For example, given a set of cancer patients, we can say that the probability of a patient still being alive 300 days after a cancer diagnosis is 0.7.
  3. Are there differences in survival between groups of patients? For example, let’s say there are two groups of people diagnosed with cancer, and each group was given a different kind of treatment. Our goal is to find out whether there is a significant difference in survival time between the two groups based on the treatment they were given.

Objectives

In this tutorial, we will see the following methods of survival analysis in detail:

1) Kaplan-Meier plots to visualize survival curves.

2) Nelson-Aalen plots to visualize the cumulative hazard.

3) Log-rank test to compare the survival curves of two or more groups.

4) Cox proportional hazards regression to find out the effect of different variables like age, sex, and weight on survival.

Fundamental concepts

We will start this tutorial by understanding some basic definitions and concepts related to survival analysis.

Survival time and type of events in cancer studies

Survival time: The amount of time during which a subject is alive or actively participating in the study.

There are three main types of events in survival analysis:

1) Relapse: A deterioration in the subject’s state of health after a temporary improvement.

2) Progression: The disease develops or moves gradually towards a more advanced state. It means that the disease of the subject under observation is getting worse, not that their health is improving.

3) Death: The permanent end of life. In our case, death will be our event of interest.

Censoring

As discussed above, survival analysis focuses on the occurrence of an event of interest, such as a birth, a death, or a retirement. However, for some subjects the event may not occur while we are observing them. Such observations are known as censored observations.

There are three types of censoring:

  1. Right censoring: The event has not happened by the time observation ends; for example, the subject is still alive at the end of the study. We only know that the survival time is longer than the time observed, not when the event of interest (death) occurs.
  2. Left censoring: The event happened before observation began, so we only know that it occurred before a certain time. For example, if we study the number of days from birth until a child starts walking, a child who was already walking when the study started is left-censored.
  3. Interval censoring: We only know that the event occurred within some interval, for example between two follow-up visits, and not the exact time.
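
To make right censoring concrete, here is a minimal sketch of how follow-up records are commonly turned into the two values survival methods work with: an observed duration and an event flag. The dates, the `STUDY_END` cutoff, and the helper name `to_duration_and_event` are hypothetical and used only for illustration.

```python
from datetime import date

# Hypothetical study cutoff and follow-up records:
# (diagnosis date, death date or None if the subject is still alive).
STUDY_END = date(2020, 12, 31)
records = [
    (date(2020, 1, 10), date(2020, 6, 1)),   # death observed during the study
    (date(2020, 3, 5), None),                # still alive at study end: right-censored
    (date(2020, 7, 20), date(2020, 9, 15)),  # death observed during the study
]


def to_duration_and_event(start, end, study_end=STUDY_END):
    """Return (days observed, event flag) for one subject.

    The flag is 1 when the death was observed and 0 when the
    observation is right-censored at the end of the study.
    """
    if end is None or end > study_end:
        return (study_end - start).days, 0
    return (end - start).days, 1


observations = [to_duration_and_event(start, end) for start, end in records]
print(observations)
```

A censored subject still contributes information: we know they survived at least as long as the recorded duration.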

Censoring may occur in the following instances:

  1. A patient has not (yet) experienced the event of interest (death or relapse in our case) within the study period.
  2. A patient is lost to follow-up.
  3. A patient moves to another city, so follow-up by the hospital staff is no longer possible.
  4. We only have the data for a specific interval.

Survival and hazard functions

We generally use two related probabilities to analyze survival data for a subject.

1) Survival function, S(t)

2) Hazard function, h(t)

To find the survival probability of a subject, we use the survival function S(t), which we will estimate with the Kaplan-Meier estimator. The survival function is the probability that an individual (subject) survives from the time origin (for example, the diagnosis of a disease) to a specified future time t. Time can be measured in minutes, days, weeks, months, or years. For example, S(200) = 0.7 means that the probability of a subject surviving beyond 200 days is 0.7. The survival probability can only decrease or stay flat as time increases, and in many deadly diseases it falls quickly. If the subject is still alive at the end of the experiment, that observation is censored.

The hazard probability, denoted by h(t), is the probability that an individual (subject) who is still under observation at time t has the event (death) at that time. For example, h(200) = 0.7 means that a subject who is still alive at day 200 has a probability of 0.7 of dying on that day. One thing to keep in mind is that the Nelson-Aalen estimator covered later in this tutorial works with the cumulative hazard, written H(t), which adds up the hazard over time and is therefore not itself a probability. We will discuss this in detail later in this tutorial.

Notice that, in contrast to the survival function, which focuses on a subject surviving, the hazard function focuses on the subject dying at a given time. A higher survival probability and a lower hazard probability are both good for the subject.
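
For readers who prefer symbols, the relationship between these quantities can be sketched as follows. This assumes time is measured in discrete steps t_1 < t_2 < ..., and writes T for the (random) time of the event; both are notational assumptions added here, not taken from the original text.

```latex
\begin{align*}
S(t)   &= P(T > t)                                  && \text{survival function} \\
h(t_i) &= P(T = t_i \mid T \ge t_i)                 && \text{hazard at time } t_i \\
H(t)   &= \sum_{t_i \le t} h(t_i)                   && \text{cumulative hazard} \\
S(t)   &= \prod_{t_i \le t} \bigl(1 - h(t_i)\bigr)  && \text{surviving every step up to } t
\end{align*}
```

The last line says that surviving beyond t means not having the event at any earlier step, which is why a low hazard at every step keeps the survival probability high.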

Let’s move forward to the cool coding part!

Download the public dataset from the UPC.

Data Description:

Figure 1: Data description values.

2. Kaplan-Meier Estimator Theory and Example

The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function (the probability of a person surviving) from lifetime data. In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis. For example, we can use how long (in years, months, or days) each patient lived after being diagnosed with cancer or after starting treatment. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association.

The probability of survival at time ti, denoted by S(ti), is calculated as follows:

Figure 2: Formula to calculate the probability of survival at time ti.
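
The figure itself is not reproduced in this text. As a sketch, assuming it shows the usual Kaplan-Meier recursion, the formula can be written as below. The symbols n_i (the number of subjects still at risk just before t_i) and d_i (the number of events at t_i) are introduced here for illustration and may differ from the notation in the original figure.

```latex
S(t_i) = S(t_{i-1}) \left( 1 - \frac{d_i}{n_i} \right), \qquad S(t_0) = 1
```

In words: the probability of surviving past t_i is the probability of having survived past t_{i-1}, multiplied by the fraction of at-risk subjects who do not have the event at t_i.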

We can also write the equation above in a simple form as follows:

Figure 3: Probability of survival time in a simple form S(ti).

For example:

1) Survival probability at time t=1:

Figure 4: Survival probability at time t=1 formula

2) Survival probability at time t=2:

Figure 5: Survival probability at time t=2 formula

3) Survival probability at time t=3:

Figure 6: Survival probability at time t=3 formula
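
Since the worked figures are not reproduced here, the following minimal sketch computes the survival probability at t=1, t=2, and t=3 by hand for a small made-up sample. The data and the helper name `kaplan_meier` are hypothetical; the function follows the standard Kaplan-Meier product and is not the lifelines implementation.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    # d_i: number of observed events (deaths) at each time.
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # n_i: subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Multiply by the fraction of at-risk subjects who survive time t.
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: six subjects; event flag 1 = death observed, 0 = censored.
durations = [1, 2, 2, 3, 3, 3]
events = [1, 1, 0, 1, 0, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.4f}")
```

With this sample, 6 subjects are at risk at t=1, 5 at t=2, and 3 at t=3, with one death at each time, so the survival probability is 5/6, then 5/6 × 4/5, then 5/6 × 4/5 × 2/3. Censored subjects leave the at-risk count without pulling the curve down.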

In a more generalized way, the probability of survival for a particular time is given by:

Figure 7: Generalized formula for the probability of survival for a particular time.
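
The generalized formula is also an image that is not reproduced here. Assuming it is the standard Kaplan-Meier product-limit form, it can be sketched as below, where n_i is the number of subjects at risk just before t_i and d_i is the number of events at t_i (symbols chosen here for illustration).

```latex
\hat{S}(t) = \prod_{i \,:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

Each factor is the conditional probability of surviving one event time, so the product over all event times up to t gives the probability of surviving beyond t.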

From the above equations, we can confidently say that:

Figure 8: Expressing survival generalization.

Kaplan-Meier Estimator (Without any groups)

1) Import required libraries:

Figure 9: Importing pandas, NumPy, matplotlib.pyplot, and lifelines.

2) Read the data set:

Figure 10: Reading the dataset.

3) Print the columns in our data set:

Figure 11: Printing the data columns.
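
The screenshots for steps 2 and 3 are not reproduced here, so the following is a minimal sketch of what reading the data and printing its columns could look like with pandas. To keep it runnable without the downloaded file, it reads three illustrative rows from an in-memory string; the column names follow the layout commonly used for the lung cancer dataset and are an assumption, as is the variable name `data`.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: three illustrative rows in the assumed
# column layout. With the real file, pass its path to pd.read_csv instead.
csv_text = """inst,time,status,age,sex,ph.ecog,ph.karno,pat.karno,meal.cal,wt.loss
3,306,2,74,1,1,90,100,1175,
3,455,2,68,1,0,90,90,1225,15
3,1010,1,56,1,0,90,90,,15
"""

data = pd.read_csv(io.StringIO(csv_text))  # empty fields become NaN

print(data.head())    # first rows of the table
print(data.columns)   # column names
```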

4) Get additional information about the dataset:

This shows the data type of each column, along with a count that reveals how many values are missing (null). For some of the survival analysis methods, we need to remove the rows that contain null values.

Figure 12: Additional info about our dataset.

5) Get statistical information about the dataset:

This gives us summary statistics for each column: the number of values, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum.

Figure 13: Obtaining statistical info about our dataset.
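
As a sketch of steps 4 and 5, the pandas calls below print the column types, the missing-value counts, and the summary statistics, and then drop incomplete rows. The toy values and the variable names `data` and `complete` are hypothetical; whether the original screenshots use exactly these calls is an assumption.

```python
import pandas as pd

# Hypothetical toy frame with one missing value (None becomes NaN).
data = pd.DataFrame({
    "time": [306, 455, 1010, 210],
    "status": [2, 2, 1, 2],
    "age": [74, 68, 56, 57],
    "wt.loss": [None, 15, 15, 11],
})

data.info()                  # dtype and non-null count for each column
print(data.isnull().sum())   # number of missing values per column
print(data.describe())       # count, mean, std, min, quartiles, max

# Keep only the rows without missing values, for methods that require them.
complete = data.dropna()
print(len(complete))
```

Dropping rows is the simplest option; it is worth checking how many rows are lost before relying on the reduced data.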

6) Find out sex distribution using histogram:

This gives us a general idea of how our data is distributed. In the following graph, around 139 rows have a sex value of 1 and approximately 90 rows have a sex value of 2, which means that there are around 139 males and 90 females in our dataset.

Figure 14: Plotting histogram for the sex of the patient.
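
The plot is not reproduced here; a minimal sketch of how such a histogram could be drawn with matplotlib follows. The eight toy values are hypothetical, and the coding 1 = male, 2 = female is inferred from the description above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sex column: 1 = male, 2 = female (coding inferred from the text).
data = pd.DataFrame({"sex": [1, 1, 2, 1, 2, 1, 1, 2]})

# Exact counts per category, as a check on what the bars should show.
counts = data["sex"].value_counts().sort_index()
print(counts)

fig, ax = plt.subplots()
ax.hist(data["sex"], bins=[0.5, 1.5, 2.5], rwidth=0.8)  # one bar per category
ax.set_xticks([1, 2])
ax.set_xlabel("sex")
ax.set_ylabel("number of patients")
# In a notebook the figure renders inline; in a script, save it with fig.savefig(...).
```

Printing `value_counts()` alongside the plot avoids having to read approximate bar heights off the chart.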

7) Create an object for Kaplan-Meier-Fitter:

Figure 15: Creating an object for the Kaplan-Meier-Fitter.