MCMC Prediction: Predicting the Retention Curve with the MCMC Method

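This article shows how to use the Markov Chain Monte Carlo (MCMC) method to predict user retention curves, combining statistical modeling with machine learning to analyze the data in depth.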

Hey guys,

Today I will go through a solution I recently developed for predicting the retention curve of a given cohort.

Definition

In general, retention measures how sticky the users you bring to the app are. A high retention rate after 7 days, for example, shows that the users you bring to the app stay for a long period of time, which indicates that they are quality users.
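As a minimal sketch of this definition, the snippet below turns daily active-user counts for one cohort into a retention curve. The function name, the cohort size, and the counts are all hypothetical values made up for illustration; they are not data from the article.

```python
def retention_curve_from_counts(cohort_size, active_by_day):
    """Return the fraction of the cohort still active on each day."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return [active / cohort_size for active in active_by_day]


# Hypothetical cohort: 1,000 new users, active users counted on days 1..7.
active = [420, 310, 260, 230, 210, 195, 185]
curve = retention_curve_from_counts(1000, active)

# Day-7 retention is the last point of this 7-day curve.
day7_retention = curve[6]
print(f"Day-7 retention: {day7_retention:.1%}")
```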

The retention curve looks something like this:

[Figure: Example of a retention curve]

The curve starts high, as we would expect: in the early days of a cohort, users tend to stick around more. It then decays roughly exponentially, a shape that strongly characterizes how users behave in the app.

Our aim in this article is to predict this curve for a given cohort, in order to estimate how well its users will stick to the app, how many active users we can expect from the cohort, and so forth.

3 Methods to Predict Retention

Now that we understand what retention means, we can think about possible solutions to the problem at hand.

  1. Average Retention Curve. This solution is probably the most intuitive. We take many cohorts, look at their behavior, and average it to form an average retention curve. We then estimate new cohorts by assuming they will stay close to this average behavior.

  2. Fit a curve based only on the existing data. With this approach, we gather the information we already have on a given cohort, namely the points of its retention curve observed so far, and fit a curve to that observed data. For example, for a cohort at age 5 we can already use the first 4 days of its retention curve to estimate its full 30-day retention curve.

  3. MCMC method that takes both 1 and 2 into account. Why use only one approach when we can benefit from both? In this Bayesian method, we combine the prior knowledge about the cohort's curve with the actual data we have gathered so far.

We will go through these 3 approaches throughout this article and compare them to find the best method for estimating the retention curve.

Formulate the Retention Curve

The next step in our journey is to express the retention curve in parametric form.

As we've already covered, the retention curve resembles an exponential decay. For that reason, we can formulate the curve as follows:

[Equation image: Retention curve formulation]

This equation has two parameters: b (intercept) and c (slope). This form describes the retention curve we saw earlier quite well:

[Figure: Actual behavior vs. formulated behavior]

Thus, in each of the 3 methods suggested above, our aim is to estimate the parameters b and c so that the formulated curve is as close as possible to the actual one.
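The exact formula was an image that did not survive in this copy, so the sketch below assumes one plausible exponential-decay form with an intercept and a slope: log r(t) = b + c·t, i.e. r(t) = exp(b + c·t). Both this form and the parameter values are assumptions for illustration, not necessarily the author's formulation.

```python
import math


def retention(t, b, c):
    """Assumed parametric form: r(t) = exp(b + c * t).

    b is the intercept and c the slope on the log scale;
    c < 0 gives a decaying curve.
    """
    return math.exp(b + c * t)


# Hypothetical parameters, chosen only to produce a plausible-looking curve.
b, c = -1.0, -0.06
curve = [retention(t, b, c) for t in range(1, 31)]

for day in (1, 7, 30):
    print(f"day {day:2d}: {curve[day - 1]:.3f}")
```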

Methods Evaluation and Comparison

Method 1: Average Retention Curve

In this section, we look for the average behavior of our retention curve. Since the curve can be defined parametrically, as we saw in the last section, we can simply find the average b and c over the distribution of our cohorts. These averages define the average retention curve.

[Figure: Average retention curve performance]

As we can see from the plot above, the average curve shows decent results: the error is close to 0 on average, with error bars of around ±2.5% relative to the actual retention curve.
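A rough sketch of Method 1, assuming the log-linear form log r(t) = b + c·t (an assumption, since the article's formula image is not available): fit b and c for each historical cohort with ordinary least squares on the log scale, then average the fitted parameters. The helper names and the synthetic cohorts are hypothetical.

```python
import math


def fit_log_linear(days, rates):
    """Least-squares fit of log(rate) = b + c * day; returns (b, c)."""
    ys = [math.log(r) for r in rates]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    c = sxy / sxx
    b = mean_y - c * mean_x
    return b, c


def average_parameters(cohorts):
    """Fit every cohort separately and average the fitted b and c."""
    fits = [fit_log_linear(days, rates) for days, rates in cohorts]
    b_avg = sum(b for b, _ in fits) / len(fits)
    c_avg = sum(c for _, c in fits) / len(fits)
    return b_avg, c_avg


# Synthetic, noise-free cohorts generated from known (made-up) parameters.
days = list(range(1, 31))
true_params = [(-0.9, -0.05), (-1.1, -0.07), (-1.0, -0.06)]
cohorts = [(days, [math.exp(b + c * t) for t in days]) for b, c in true_params]

b_avg, c_avg = average_parameters(cohorts)
average_curve = [math.exp(b_avg + c_avg * t) for t in days]
```

A new cohort would then be predicted with `average_curve`, regardless of what has been observed for it so far.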

Method 2: Fit a Curve Based Only on the Existing Data

For this method, the performance really depends on how much data we actually have. Logically, the more data we have, the more accurate we will be. For example, based on 5 days of data, we get pretty bad results:

[Figure: Method 2 results based on 5 days of data]

Compared to Method 1, we are no longer centered on the 0% error line; the errors are shifted down. The standard deviation, however, did decrease. That is due to the individual treatment we give each cohort.

On the other hand, if we wait a bit for more data, we can achieve much better results than Method 1. Here, for example, are the results based on 18 days of data instead of just 5 days:

[Figure: Method 2 results based on 18 days of data]

Wow! That's a big improvement. We achieve a very low error with far lower variability than Method 1.
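Method 2 can be sketched as follows, again assuming the form log r(t) = b + c·t (an assumption, not the article's confirmed formula): fit the curve to the first k observed days of one cohort and extrapolate to day 30. The noisy cohort, the noise level, and the function names are made up for illustration.

```python
import math
import random


def fit_log_linear(days, rates):
    """Least-squares fit of log(rate) = b + c * day; returns (b, c)."""
    ys = [math.log(r) for r in rates]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    c = sxy / sxx
    return mean_y - c * mean_x, c


def predict_curve(observed_rates, horizon=30):
    """Fit on the observed days 1..k and extrapolate to `horizon` days."""
    days = list(range(1, len(observed_rates) + 1))
    b, c = fit_log_linear(days, observed_rates)
    return [math.exp(b + c * t) for t in range(1, horizon + 1)]


# Synthetic cohort: assumed true parameters plus noise on the log scale.
random.seed(0)
true_b, true_c = -1.0, -0.06
full = [math.exp(true_b + true_c * t + random.gauss(0, 0.05))
        for t in range(1, 31)]

pred_5 = predict_curve(full[:5])    # only 5 days observed
pred_18 = predict_curve(full[:18])  # 18 days observed

# Absolute error of each prediction at day 30.
err_5 = abs(pred_5[-1] - full[-1])
err_18 = abs(pred_18[-1] - full[-1])
print(f"day-30 error with 5 days: {err_5:.4f}, with 18 days: {err_18:.4f}")
```

With few observed points, the fitted slope is sensitive to noise, which is one way to read the poor 5-day results described above.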

These last two graphs can serve as a sort of intro to the third method. We've seen that each method has the advantage in a different situation. When data is scarce, we would probably prefer to stick to the average curve, but once we have a sufficient amount of data, we no longer need the average behavior, since we can rely on what we see from the specific cohort.

This way of thinking fits the Bayesian approach of Method 3 exactly. In the Bayesian approach, we have a prior (represented here by the average retention curve parameters) and a likelihood function (represented by the actual data we've seen so far). From these two components we form a posterior, which combines the information from both into a unified estimate.
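In symbols, this is Bayes' rule applied to the curve parameters. The notation is an assumption introduced here: D stands for the retention points observed so far for the cohort, and b, c are the curve parameters.

```latex
p(b, c \mid D) \;=\; \frac{p(D \mid b, c)\, p(b, c)}{p(D)}
\;\propto\; \underbrace{p(D \mid b, c)}_{\text{likelihood}} \; \underbrace{p(b, c)}_{\text{prior}}
```

The denominator p(D) does not depend on b or c; it only rescales the posterior so that it integrates to 1.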


Since we do not have the normalization constant of the posterior, we use the MCMC algorithm to overcome this hurdle: it draws samples from the posterior using only its unnormalized form.
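The article does not say which MCMC variant or model was used, so the following is only a minimal sketch using a random-walk Metropolis sampler. It assumes the form log r(t) = b + c·t, independent Gaussian priors on b and c centered on the average-curve parameters, and Gaussian noise on the log retention. All numbers, step sizes, and function names are hypothetical.

```python
import math
import random


def log_posterior(b, c, days, log_rates, prior, sigma):
    """Unnormalized log posterior: Gaussian prior plus Gaussian likelihood."""
    (b0, sb), (c0, sc) = prior
    log_prior = -0.5 * ((b - b0) / sb) ** 2 - 0.5 * ((c - c0) / sc) ** 2
    log_lik = sum(-0.5 * ((y - (b + c * t)) / sigma) ** 2
                  for t, y in zip(days, log_rates))
    return log_prior + log_lik


def metropolis(days, log_rates, prior, sigma=0.05,
               steps=20000, burn_in=5000, seed=0):
    """Random-walk Metropolis sampler over (b, c)."""
    rng = random.Random(seed)
    b, c = prior[0][0], prior[1][0]  # start at the prior mean
    current = log_posterior(b, c, days, log_rates, prior, sigma)
    samples = []
    for step in range(steps):
        # Symmetric Gaussian proposal around the current point.
        b_new = b + rng.gauss(0, 0.02)
        c_new = c + rng.gauss(0, 0.005)
        proposed = log_posterior(b_new, c_new, days, log_rates, prior, sigma)
        # Accept with probability min(1, posterior ratio); the unknown
        # normalization constant cancels in this ratio.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            b, c, current = b_new, c_new, proposed
        if step >= burn_in:
            samples.append((b, c))
    return samples


# Prior as (mean, std) for b and c, e.g. taken from historical cohorts.
prior = ((-1.0, 0.2), (-0.06, 0.02))

# Five observed days from a made-up cohort that decays faster than average.
days = [1, 2, 3, 4, 5]
log_rates = [-0.8 - 0.09 * t for t in days]

samples = metropolis(days, log_rates, prior)
b_hat = sum(b for b, _ in samples) / len(samples)
c_hat = sum(c for _, c in samples) / len(samples)
predicted = [math.exp(b_hat + c_hat * t) for t in range(1, 31)]
```

Under these assumptions the posterior mean is a compromise between the prior and the observed points; passing more observed days makes the likelihood term dominate, although the proposal step sizes may then need retuning.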

Method 3: MCMC-Based Retention Curve

Using the method explained above, we get these results after 5 days of data:

[Figure: Method 3 results based on 5 days of data]

We can see that the new (green) method stays close to the prior when the evidence is scarce. The actual data carries some weight, but it doesn't pull the posterior-based model very much.

Now, based on 18 days of data:

[Figure: Method 3 results based on 18 days of data]

We now see how the posterior shifts towards the actual data, leaving the prior knowledge behind, as it should.

Summary

We went through 3 methods to estimate the retention curve. One is based on the average behavior, one is based on the actual data, and the one that performed best, the MCMC method, takes both the actual data and the prior knowledge into account.

I hope you found this information useful and that it helps you model your company's retention curve as well!

Translated from: https://towardsdatascience.com/predicting-retention-curve-with-mcmc-method-311b36a3cf5b
