Introduction to Survival Analysis

Latest recommended article published on 2021-03-13 04:08:50

weixin_26752765

Tags: python linux java

Original article: https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96

In our extremely competitive times, all businesses face the problem of customer churn/retention. To quickly give some context, churn happens when the customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active highly depends on the business model).

Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers is longer and thus potentially more profitable. What is more, in most cases the company’s cost of retaining a customer is much lower than that of acquiring a new customer, for example, via performance marketing. For businesses, the concept of retention is closely connected to customer lifetime value (CLV), which the businesses want to maximize. But that is a topic for another article.

With this article, I want to start a short series focusing on survival analysis, an often underestimated yet very interesting branch of statistical learning. In this article, I provide a general introduction to survival analysis and its building blocks. First, I explain the required concepts, and then I describe different approaches to analyzing time-to-event data. Let’s start!

Introduction to Survival Analysis

Survival analysis is a field of statistics that focuses on analyzing the expected time until a certain event happens. Originally, this branch of statistics developed around measuring the effects of medical treatment on patients’ survival in clinical trials. For example, imagine a group of cancer patients who are administered a certain new form of treatment. Survival analysis can be used for analyzing the results of that treatment in terms of the patients’ life expectancy.

However, survival analysis is not restricted to investigating deaths. It can just as well be used to determine the time until a machine fails or, what may at first sound a bit counterintuitive, the time until a user of a certain platform converts to a premium service. That is possible because survival analysis focuses on the time until an event happens, without requiring the event to be a negative one. The conditions that apply to the most popular methods of survival analysis are:

the event of interest is clearly defined and well-specified, so there is no ambiguity about whether it happened or not,
the event can occur only once for each subject. This is clear in the case of death, but applying the analysis to churn can be more complicated, as a churned user might be reactivated and churn again.

We have already established that survival analysis is used for modeling time-to-event data, in other words, lifetimes (hence the name of lifelines, the Python library that is the go-to tool for this kind of analysis). Generally speaking, we can use survival analysis to try to answer questions like:

what percentage of the population will survive past a certain time?
of the survivors, what will be their death/failure rate?
how do particular characteristics (for example, age, gender, or geographical location) affect the probability of survival?

Having briefly described the general idea of survival analysis, it is time to introduce a few concepts that are crucial for a thorough understanding of the subject.

Censoring

Censoring can be described as the missing data problem in the domain of survival analysis. Observations are censored when the information about their survival time is incomplete. There are different kinds of censoring, such as:

right-censoring,
interval-censoring,
left-censoring.

To keep this section short, we discuss only the type that is encountered most frequently: right-censoring. Let’s come back to the cancer treatment example. Imagine that the study of the effects of the new medicine lasts 5 years (this is an arbitrary number, not actually based on anything). It can happen that after 5 years, some of the patients are still alive and thus have not experienced the death event. At the same time, the authors of the study may have lost contact with some patients: they might have relocated to another country, or they might have actually died without any confirmation ever being received. Those cases are affected by right-censoring, that is, their true survival time is equal to or greater than the observed survival time (the 5 years of the study for patients still alive at its end, or the time of last contact for patients lost to follow-up). The following image illustrates right-censoring.

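To make the idea concrete, here is a minimal sketch of how right-censored observations are commonly stored: as a duration plus a flag saying whether the event was actually observed. The `Observation` class, the `encode` helper, and the three example patients are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

STUDY_LENGTH_YEARS = 5.0  # assumed follow-up window, as in the example above


@dataclass
class Observation:
    duration: float       # time under observation, in years
    event_observed: bool  # True if the death was recorded, False if right-censored


def encode(death_time: Optional[float], last_contact: float) -> Observation:
    """Turn raw follow-up information into a (duration, event_observed) pair.

    death_time   -- time of a confirmed death, or None if none was recorded
    last_contact -- last time the patient was known to be alive
    """
    if death_time is not None and death_time <= STUDY_LENGTH_YEARS:
        return Observation(death_time, True)
    # No confirmed death inside the study window: the true survival time is
    # at least the time we last saw the patient, capped at the study length.
    return Observation(min(last_contact, STUDY_LENGTH_YEARS), False)


patients = [
    encode(death_time=2.5, last_contact=2.5),   # died during the study
    encode(death_time=None, last_contact=5.0),  # alive at the end of the study
    encode(death_time=None, last_contact=3.0),  # lost to follow-up after 3 years
]
```

The last two patients are both right-censored: all we know is that their survival time is at least 5 and at least 3 years, respectively.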

The existence of censoring is also the reason why we cannot use simple ordinary least squares (OLS) regression for problems in survival analysis. OLS fits the regression line that minimizes the sum of squared errors, but for censored observations the true survival times, and therefore the error terms, are unknown, so the quantity to minimize cannot be computed. Simple workarounds, such as using the censoring date as the date of the death event or dropping the censored observations, can severely bias the results.

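A small simulation can illustrate the bias. This is only a sketch under assumed conditions: lifetimes are drawn from an exponential distribution with a mean of 10 time units, and everyone still alive after 5 units is censored. Neither number comes from the article.

```python
import random

random.seed(0)
TRUE_MEAN = 10.0    # assumed mean lifetime of the simulated population
STUDY_LENGTH = 5.0  # follow-up ends here, so longer lifetimes are censored

lifetimes = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(10_000)]
durations = [min(t, STUDY_LENGTH) for t in lifetimes]
observed = [t <= STUDY_LENGTH for t in lifetimes]

# Naive fix 1: pretend every censored subject died at the censoring time.
mean_as_events = sum(durations) / len(durations)

# Naive fix 2: drop the censored subjects entirely.
uncensored = [d for d, e in zip(durations, observed) if e]
mean_dropped = sum(uncensored) / len(uncensored)

print(f"true mean lifetime:        {TRUE_MEAN:.2f}")
print(f"censoring treated as event: {mean_as_events:.2f}")
print(f"censored rows dropped:      {mean_dropped:.2f}")
```

Both naive estimates can never exceed the study length, so they necessarily underestimate a true mean lifetime of 10.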

For information regarding different kinds of censoring, please go here.

The Survival Function

The survival function is a function of time (t) and can be represented as

$$S(t) = \Pr(T > t)$$

where Pr() stands for the probability and T for the time of the event of interest for a random observation from the sample. We can interpret the survival function as the probability of the event of interest (for example, the death event) not occurring by the time t.

The survival function takes values in the range between 0 and 1 (inclusive) and is a non-increasing function of t.

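When no observation is censored, the survival function can be estimated directly as the fraction of subjects whose event time exceeds t. The sketch below uses made-up lifetimes, and the helper name `empirical_survival` is hypothetical.

```python
def empirical_survival(event_times, t):
    """Fraction of subjects whose event time exceeds t, an estimate of Pr(T > t)."""
    return sum(1 for x in event_times if x > t) / len(event_times)


event_times = [2, 3, 3, 5, 8]  # made-up, fully observed lifetimes
curve = [empirical_survival(event_times, t) for t in range(0, 9)]
print(curve)
```

The resulting values start at 1, never increase, and reach 0 once the last subject has experienced the event, which matches the properties stated above.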

The Hazard Function

We can think of the hazard function (or hazard rate) as the probability of the subject experiencing the event of interest within a small (or, to be more precise, infinitesimal) interval of time, assuming that the subject has survived up until the beginning of the said interval. The hazard function can be represented as:

$$h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}$$

where the expression in the numerator is the conditional probability of the event of interest occurring in the given time interval, provided it has not happened before, and dt in the denominator is the width of that interval. When we divide the former by the latter, we obtain the rate of the event’s occurrence per unit of time. Lastly, by taking the limit as the width of the interval goes to zero, we end up with the instantaneous rate of occurrence, that is, the risk of the event happening at a particular point in time.

You might wonder why the hazard rate is defined using this small interval of time. The reason for that lies in the fact that the probability of a continuous random variable being equal to a particular value is zero. That is why we need to consider the probability of the event happening in a very small interval of time.

Technical note: to be theoretically correct, it is important to mention that the hazard function is not actually a probability, and the name hazard rate is the more fitting one. Even though the expression in the numerator is a probability, dividing by dt in the denominator can result in a hazard rate greater than 1 (it is still bounded below by 0).

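A quick numerical check shows a hazard rate above 1. The sketch assumes an exponential lifetime distribution with rate 2, for which the survival function is exp(-2t) and the hazard is constant and equal to 2. The distribution and the function names are illustrative assumptions, not part of the article.

```python
import math

RATE = 2.0  # assumed constant hazard of an exponential lifetime distribution


def survival(t):
    """Survival function of the assumed exponential distribution."""
    return math.exp(-RATE * t)


def approx_hazard(t, dt=1e-6):
    """Finite-dt approximation of the hazard: conditional probability / dt."""
    # Probability of the event in [t, t + dt), given survival up to t.
    conditional_prob = (survival(t) - survival(t + dt)) / survival(t)
    return conditional_prob / dt


hazard_at_half = approx_hazard(0.5)
print(hazard_at_half)
```

The conditional probability in the numerator is tiny, but dividing it by an equally tiny dt gives a rate close to 2 events per unit of time.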

Lastly, the survival and hazard functions are related to each other as specified by the following formula:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$

To give the equation a bit of context, the integral in the brackets is called the cumulative hazard and can be interpreted as the sum of the risks the subject faces going from time-point 0 to t.

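The relationship between the survival function and the cumulative hazard can be checked numerically. The sketch below assumes a Weibull distribution with arbitrarily chosen shape and scale, integrates its hazard with the trapezoidal rule, and compares the result with the closed-form Weibull survival function.

```python
import math

SHAPE, SCALE = 1.5, 3.0  # assumed Weibull parameters, chosen only for illustration


def hazard(t):
    """Weibull hazard rate at time t."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def cumulative_hazard(t, steps=100_000):
    """Trapezoidal approximation of the integral of the hazard from 0 to t."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * width)
    return total * width


t = 2.0
survival_from_hazard = math.exp(-cumulative_hazard(t))
survival_closed_form = math.exp(-((t / SCALE) ** SHAPE))
print(survival_from_hazard, survival_closed_form)
```

The two numbers agree up to the error of the numerical integration.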

Different Approaches to Survival Analysis

As survival analysis is an entire domain of different statistical methods for working with time-to-event series, there are naturally many different approaches we could follow. On a high level, we could split them into three main groups:

Non-parametric: with these approaches, we make no assumptions about the underlying distribution of the data. Perhaps the most popular example from this group is the Kaplan-Meier curve, which, in short, is a method of estimating and plotting the survival probability as a function of time.
Semi-parametric: as you could have guessed, this group sits between the two extremes and makes very few assumptions. Most importantly, there are no assumptions about the shape of the hazard function/rate. The most popular method from this group is Cox regression, which we can use to identify the relationship between the hazard function and a set of explanatory variables (predictors).
Parametric: you might have encountered this approach during your studies. The idea is to assume a statistical distribution for the survival times (popular choices include exponential, log, Weibull, or Lomax) and use it to estimate how long a subject will survive. Often, we use maximum likelihood estimation (MLE) to fit the distribution’s parameters to the data.
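
To give a feel for the first and last group, here is a from-scratch sketch of the Kaplan-Meier estimator and of the maximum likelihood estimate for an exponential model with right-censored data. The function names and the toy data are hypothetical, and in practice a dedicated library would normally be used instead of hand-written code.

```python
def kaplan_meier(durations, event_observed):
    """Return [(t, S_hat(t))] at each distinct event time, in increasing order."""
    event_times = sorted({d for d, e in zip(durations, event_observed) if e})
    survival, curve = 1.0, []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects whose event was observed exactly at time t.
        deaths = sum(1 for d, e in zip(durations, event_observed) if e and d == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def exponential_rate_mle(durations, event_observed):
    """MLE of a constant hazard: observed events divided by total time at risk."""
    return sum(event_observed) / sum(durations)


# Toy data: False marks a right-censored observation.
durations = [2, 3, 3, 5, 5, 8]
event_observed = [True, True, False, True, False, True]

km_curve = kaplan_meier(durations, event_observed)
rate = exponential_rate_mle(durations, event_observed)
print(km_curve)
print(rate)
```

Note that the censored subjects still count as "at risk" until the moment they drop out, which is how both estimators use the partial information instead of discarding it.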

The methods mentioned in this short list are by no means exhaustive and there are many more interesting approaches to analyzing time-to-event data using machine- or deep-learning-based techniques. I will try to cover the most interesting ones in the following posts, so stay tuned :)

Conclusions

In this article, I tried to provide a brief yet thorough introduction to the domain of survival analysis. I believe that this area is often overlooked when talking about different data science solutions. However, by using some simple (or not so simple at all!) methods, we can provide valuable insights for the company or stakeholders and generate real added value.

This article is only the beginning of a short series, and I will keep on adding the following parts below. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.

Translated from: https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96
