Author(s): Pratik Shukla


This article is an extensive review, with step-by-step explanations and code, of statistical survival analysis: a set of methods used to investigate the time it takes for some event to occur, such as patient survival during the COVID-19 pandemic, the time to failure of engineering products, or the time to close a sale after an initial customer contact.


This tutorial’s code is available on GitHub, and its full implementation is available on Google Colab.


Table of Contents:

  1. Survival Analysis Basics

  2. Kaplan-Meier fitter Theory with an Example.

  3. Nelson-Aalen fitter Theory with an Example.

  4. Kaplan-Meier fitter Based on Different Groups.

  5. Log-Rank-Test with an Example.

  6. Cox-Regression with an Example.

  7. Resources.


1. Survival Analysis Basics:

Survival analysis is a set of statistical approaches used to study the time until an event of interest occurs. Time is usually measured in years, months, weeks, or days. The event of interest could be anything: an actual death, a birth, a retirement, and so on.


For instance, how can Survival Analysis be useful to analyze the ongoing COVID-19 pandemic data?


  1. We can find the number of days until patients showed COVID-19 symptoms.

  2. We can find for which age group it is deadlier.

  3. We can find which treatment has the highest survival probability.

  4. We can find whether a person’s sex has a significant effect on their survival time.

  5. We can find the median number of days of survival for patients.

  6. We can find which factor has more impact on patients’ survival.


In this tutorial, we are going to perform a thorough analysis of patients with lung cancer. Do not worry if it seems complicated. Once we go through the logic behind it, we will have the ability to perform survival analysis on any data set. Exciting! Isn’t it?


Survival analysis is used in a variety of fields, such as:


  • Cancer studies for patients’ survival time analysis.

  • Sociology for “event-history analysis.”


  • Engineering for “failure-time analysis.”

  • Time until product failure.

  • Time until a warranty claim.

  • Time until a process reaches a critical level.

  • Time from initial sales contact to a sale.

  • Time from employee hire to either termination or quit.

  • Time from a salesperson’s hire to their first sale.


In cancer studies, typical research questions are:


  1. What is the impact of specific clinical characteristics on a patient’s survival? For example, is there any difference in survival between people who have higher blood sugar and those who do not?

  2. What is the probability that an individual survives a specific time (years, months, days)? For example, given a set of cancer patients, we can say that the probability of a patient being alive 300 days after a cancer diagnosis is 0.7.

  3. Are there differences in survival between groups of patients? For example, let’s say there are two groups of people diagnosed with cancer, and each group was given a different kind of treatment. Our goal here will be to find out if there is a significant difference between the survival times of those two groups based on the treatment they were given.

Objectives

In this tutorial, we will see the following methods of survival analysis in detail:


1) Kaplan-Meier plots to visualize survival curves.


2) Nelson-Aalen plots to visualize the cumulative hazard.


3) Log-Rank test to compare the survival curves of two or more groups.


4) Cox proportional hazards regression to find out the effect of different variables like age, sex, and weight on survival.


Fundamental concepts

We will start this tutorial by understanding some basic definitions and concepts related to survival analysis.


Survival time and type of events in cancer studies

Survival Time: It usually refers to the amount of time that a subject remains alive or keeps actively participating in a study.


There are three main types of events in survival analysis:


1) Relapse: Relapse is defined as a deterioration in the subject’s state of health after a temporary improvement.


2) Progression: Progression is defined as the process of developing or moving gradually towards a more advanced state. In cancer studies, it means that the disease of the subject under observation is getting worse.

3) Death: Death is defined as the destruction or permanent end of something. In our case, death will be our event of interest.

Censoring

As we discussed above, survival analysis focuses on the occurrence of an event of interest. The event of interest can be anything like birth, death, or retirement. However, it is possible that the event we are interested in does not occur while the subject is under observation. Such observations are known as censored observations.


There are three types of censoring:


  1. Right Censoring: The subject under observation is still alive when observation ends. In this case, we do not know when the event of interest (death) occurs; we only know that it happens after the last observed time.

  2. Left Censoring: The event occurred before observation started, so we only know that it happened before a certain time. An example is the number of days from birth when a kid started walking, if the kid could already walk when the experiment started.

  3. Interval Censoring: We only know that the event of interest occurred within a specific interval, for example between two follow-up visits, but not the exact time.


Censoring may occur in the following instances:


  1. A patient has not (yet) experienced the event of interest (death or relapse in our case) within a period.

  2. A patient is not followed anymore.

  3. If a patient moves to another city, then follow-up might not be possible for the hospital staff.

  4. We only have the data for a specific interval.
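
One common way to carry censoring through an analysis is to store, for every subject, a duration together with a flag that says whether the event was actually seen. The sketch below illustrates that idea for right censoring; the `Observation` class and the five records are made up for illustration and are not part of the tutorial's dataset.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    duration: int         # days from study entry to death or to last contact
    event_observed: bool  # True = death was seen, False = right-censored


# Hypothetical follow-up records (illustrative values only).
observations = [
    Observation(duration=120, event_observed=True),   # died on day 120
    Observation(duration=300, event_observed=False),  # alive when the study ended
    Observation(duration=45, event_observed=True),    # died on day 45
    Observation(duration=210, event_observed=False),  # moved away, lost to follow-up
    Observation(duration=300, event_observed=False),  # alive when the study ended
]

n_events = sum(o.event_observed for o in observations)
n_censored = len(observations) - n_events
print(f"events: {n_events}, censored: {n_censored}")
```

A censored record still tells us that the subject survived at least `duration` days, which is why it is kept rather than dropped.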


Survival and hazard functions

We generally use two related probabilities to analyze survival data for a subject.


1) Survival Function (S)


2) Hazard Function (H)


To estimate the survival probability of a subject, we will use the Kaplan-Meier estimator of the survival function S(t). The survival function is defined as the probability that an individual (subject) survives from the time origin (for example, diagnosis of a disease) to a specified future time t. Please note that the time can be measured in various units like minutes, days, weeks, months, or years. For example, S(200) = 0.7 means that the probability of a subject surviving beyond 200 days is 0.7. In many deadly diseases, the survival probability decreases as time passes. If the subject is alive at the end of an experiment, then that observation will be censored.


The hazard probability, denoted by h(t), is the probability that an individual (subject) who is under observation at time t has an event (death) at that time. For example, h(200) = 0.7 means that a subject who is still alive on the 200th day has a probability of 0.7 of dying on that day. One thing to keep in mind here is that the Nelson-Aalen estimator we use later in this tutorial gives us the cumulative hazard, H(t), which adds up the hazard over time and is therefore not itself a probability. We will discuss this in detail later in this tutorial.


Notice that, in contrast to the survival function, which focuses on the survival of a subject, the hazard function gives us the probability of a subject dying at a given time. A higher survival probability and a lower hazard probability are good for the subject’s health.
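
In discrete time, the survival and hazard quantities can be tied together as sketched below. The notation is an assumption and does not come from the original text: the t_i are the observed event times, d_i is the number of events at t_i, and n_i is the number of subjects still at risk just before t_i.

```latex
h(t_i) = \frac{d_i}{n_i}, \qquad
S(t) = \prod_{t_i \le t} \bigl( 1 - h(t_i) \bigr), \qquad
H(t) = \sum_{t_i \le t} h(t_i)
```

Here h is the hazard at a single time point, S is the probability of surviving past t, and H is the cumulative hazard up to t.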


Let’s move forward to the cool coding part!

Download the public dataset from the UPC.


Data Description:

Figure 1: Data description values.

2. Kaplan-Meier Estimator Theory and Example

The Kaplan–Meier estimator is a non-parametric statistic used to estimate the survival function (probability of a person surviving) from lifetime data. In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis. For example, it can estimate the fraction of patients still alive a given amount of time (years, months, days) after they were diagnosed with cancer or after their treatment started. The estimator is named after Edward L. Kaplan and Paul Meier, who submitted similar manuscripts to the Journal of the American Statistical Association.


The probability of survival at time ti, which is denoted by S(ti), is calculated as follows:


Figure 2: Formula to calculate the probability of survival at time ti.

We can also write the equation above in a simple form as follows:


Figure 3: Probability of survival time in a simple form S(ti).
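
The formula images are not reproduced in this text. As a point of reference, the Kaplan-Meier estimator is usually written as the recursion below. The symbols n_i (subjects at risk just before t_i) and d_i (events at t_i) follow the common textbook convention and are an assumption about the notation used in the figures.

```latex
S(t_i) = S(t_{i-1}) \left( 1 - \frac{d_i}{n_i} \right), \qquad S(t_0) = 1
% the same step written in a simpler form:
S(t_i) = S(t_{i-1}) \cdot \frac{n_i - d_i}{n_i}
```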

For Example:


1) Survival probability at time t=1:

Figure 4: Survival probability at time t=1 formula

2) Survival probability at time t=2:

Figure 5: Survival probability at time t=2 formula

3) Survival probability at time t=3:

Figure 6: Survival probability at time t=3 formula

In a more generalized way, the probability of survival for a particular time is given by:


Figure 7: Generalized formula for the probability of survival for a particular time.

From the above equations, we can say that:

Figure 8: Expressing survival generalization.
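
To make the step-by-step calculation concrete, here is a minimal sketch that applies the Kaplan-Meier product formula by hand. The function name `kaplan_meier` and the six toy observations are made up for illustration; they are not the tutorial's dataset, and the code is not the lifelines implementation.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for each distinct time with at least one event."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Subjects at risk: everyone whose duration is at least t.
        n_at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - deaths[t] / n_at_risk
        curve.append((t, survival))
    return curve


# Toy data: durations in days, and whether death was observed (False = censored).
durations = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.4f}")
# S(1) = 0.8333
# S(2) = 0.6667
# S(3) = 0.4444
# S(5) = 0.0000
```

Note that the censored subjects (at days 2 and 4) never trigger a drop in the curve; they only shrink the number at risk for later times.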

Kaplan-Meier Estimator (Without any groups)

1) Import required libraries:


Figure 9: Importing pandas, NumPy, matplotlib.pyplot, and lifelines.

2) Read the data set:


Figure 10: Reading the dataset.

3) Print the columns in our data set:


Figure 11: Printing the data columns.
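
Since the screenshots are not reproduced here, steps 2 and 3 might look roughly like the sketch below. The file name `lung.csv` and the column names `time`, `status`, `age`, and `sex` are assumptions, and the four rows are made-up stand-ins so that the snippet runs without the downloaded file.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; the rows are invented for illustration.
csv_text = """time,status,age,sex
100,2,70,1
250,1,65,2
400,2,58,1
90,2,61,2
"""

# With the real file this would be something like: pd.read_csv("lung.csv")
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())     # first rows of the dataset
print(data.columns)    # column names
```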

4) Get additional information about the dataset:


It gives us information about the data type of each column along with its count of non-null values. We need to remove the rows with a null value for some of the survival analysis methods.

Figure 12: Additional info about our dataset.

5) Get statistical information about the dataset:


It gives us some statistical information like the total number of rows, mean, standard deviation, minimum value, 25th percentile, 50th percentile, 75th percentile, and maximum value for each column.


Figure 13: Obtaining statistical info about our dataset.
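
Steps 4 and 5 can be sketched with the standard pandas calls `DataFrame.info()` and `DataFrame.describe()`. The column names (including `wt.loss`) and the rows below are assumptions made up for illustration; one value is left empty on purpose to show how a null appears.

```python
import io

import pandas as pd

# Invented rows; the second row has a missing wt.loss value.
csv_text = """time,status,age,sex,wt.loss
100,2,70,1,5
250,1,65,2,
400,2,58,1,12
90,2,61,2,3
"""
data = pd.read_csv(io.StringIO(csv_text))

data.info()                   # dtypes and non-null counts per column
null_counts = data.isnull().sum()
print(null_counts)            # number of nulls per column

summary = data.describe()     # count, mean, std, min, quartiles, max
print(summary)

cleaned = data.dropna()       # drop rows with nulls when a method requires it
print(len(cleaned), "rows remain after dropping nulls")
```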

6) Find out the sex distribution using a histogram:


This gives us a general idea about how our data is distributed. In the following graph, we can see that around 139 rows have a sex value of 1, and approximately 90 rows have a sex value of 2, which means that there are around 139 males and around 90 females in our dataset.

Figure 14: Plotting histogram for the sex of the patient.
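
A histogram like the one described could be drawn as in the sketch below. The eight `sex` values are made up for illustration (1 = male, 2 = female, as in the text), and the non-interactive `Agg` backend is chosen only so the snippet runs without a display; in a notebook you would normally call `plt.show()` instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration only.
data = pd.DataFrame({"sex": [1, 1, 1, 2, 2, 1, 2, 1]})

counts = data["sex"].value_counts().sort_index()
print(counts)  # how many rows have each sex value

fig, ax = plt.subplots()
# One bin centered on each code: 1 and 2.
heights, _, _ = ax.hist(data["sex"], bins=[0.5, 1.5, 2.5], rwidth=0.8)
ax.set_xticks([1, 2])
ax.set_xlabel("sex (1 = male, 2 = female)")
ax.set_ylabel("number of patients")
```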

7) Create an object for Kaplan-Meier-Fitter:


Figure 15: Creating an object for the Kaplan-Meier-Fitter.
