Use a Negative Binomial for Count Data

Latest recommended article published 2022-12-29 12:14:09

weixin_26752765

Original article: https://towardsdatascience.com/use-a-negative-binomial-for-count-data-c68c062de203

The Negative Binomial distribution is a discrete probability distribution that you should have in your toolkit for count data. For example, you might have data on the number of pages someone visited before making a purchase or the number of complaints or escalations associated with each customer service representative. Given this data, you might want to model the process and, later, see if some covariates affect the parameters. And in many contexts, you might find that a negative binomial distribution is a good fit.

In this article we’ll introduce the distribution and compute its probability mass function (PMF). We’ll derive its basic properties (mean and variance) using the binomial theorem. This is in contrast to the usual treatments, which either just give you a formula or use more complicated tools to derive the results. Finally, we’ll turn to the distribution’s interpretations.

The Negative Binomial Distribution

Suppose you are going to flip a biased coin that has probability p of coming up heads, which we will call a “success.” Furthermore, you are going to keep flipping the coin until r successes occur. Let k be the number of failures along the way (so k+r coin flips happen in total).
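
The coin-flipping process described above can be simulated directly. The following is a minimal sketch, assuming an integer r; the function name `sample_failures` and the parameter values are illustrative choices, not part of the original article.

```python
import random

def sample_failures(r, p, rng):
    """Flip a coin with success probability p until r successes occur.

    Returns k, the number of failures seen along the way.
    """
    successes = 0
    failures = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_failures(r=3, p=0.3, rng=rng) for _ in range(10)]
print(draws)  # ten simulated failure counts
```

Each draw is one realization of k; the total number of flips in that realization is k + r.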

In the context of our examples, we could imagine:

A user might browse your website. On each page they have a probability of p=1% of seeing an item they want to buy. We imagine that once they have put r=3 items in their basket, they are ready to check out. k is the number of pages they browse without buying anything. Of course, we will want to fit the model to find the true values of r and p, and whether or how they vary between users.
A customer service representative might in general receive complaints. After each complaint, there is a probability p that they will be reprimanded. After being reprimanded r times, they change their behavior and stop getting complaints. k is the number of complaints for which they are not reprimanded before they change their behavior.

Whether you actually think this is true is, as always, up to your prior beliefs and how well the model fits the data. Also, note that the number of failures is closely related to the number of events (k versus k plus r).

It is relatively straightforward to write down the probability mass function using some combinatorics. The probability that the r-th success happens on the (k+r)-th coin flip is:

The probability that there are r–1 successes on the first k+r–1 flips, times
The probability of success on the (k+r)-th flip.

There are (k+r–1) choose k orderings of (r–1) successes and k failures on the first k+r–1 flips (the number of ways to arrange k A’s and (r–1) B’s in a line). Each ordering has the same probability of occurring, p^(r–1)·(1–p)^k. This gives the PMF:
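
Multiplying the two factors above gives the following form. It is written out here from the text's own definitions of k, r and p; the label f is just a name chosen for this sketch.

```latex
\[
\begin{aligned}
f(k; r, p) &= \binom{k+r-1}{k}\, p^{\,r-1} (1-p)^{k} \cdot p \\
           &= \binom{k+r-1}{k}\, p^{\,r} (1-p)^{k}, \qquad k = 0, 1, 2, \ldots
\end{aligned}
\]
```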

Hopefully you remember some basic facts about combinations and permutations. If not, here is a brief review that you can use to convince yourself. Suppose there are 3 A’s and 2 B’s and you want to arrange them into a string like “AAABB” or “ABABA”. The number of ways to do this is 5 choose 2 (there are 5 total things and 2 B’s), which is the same as 5 choose 3 (there are 3 A’s). To see this, pretend that each letter is actually a distinct symbol (so the 5 symbols are A1, A2, A3, B1, B2). Then there are 5!=120 ways to arrange the distinct symbols. But there are 3!=6 ways to rearrange A1, A2, A3 without changing the placements of the A’s, and 2!=2 ways to rearrange the B’s. So the total number is 5!/(2!·3!) = 10.
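
The counting argument above is small enough to check by brute force. This sketch enumerates every ordering of the five letters and compares the count with the binomial coefficient; the variable names are illustrative.

```python
from itertools import permutations
from math import comb, factorial

# Treat the letters as distinct, list all 5! orderings, then collapse duplicates.
distinct = set(permutations("AAABB"))
by_formula = factorial(5) // (factorial(2) * factorial(3))

print(len(distinct))           # 10
print(comb(5, 2), comb(5, 3))  # 10 10
print(by_formula)              # 10
```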

Now, the trick is that binomial coefficients also work with negative numbers on top, or with non-integers. For example, if we expand what we have above, we can pull a minus sign out of each of the k terms in the numerator:
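
Written out, the sign manipulation looks as follows. This is a reconstruction from the description in the text, using the usual definition of a binomial coefficient with a general top argument.

```latex
\[
\binom{k+r-1}{k}
  = \frac{(k+r-1)(k+r-2)\cdots(r+1)\,r}{k!}
  = (-1)^{k}\,\frac{(-r)(-r-1)\cdots(-r-k+1)}{k!}
  = (-1)^{k}\binom{-r}{k}
\]
```

The numerator has exactly k factors, so negating each of them contributes the overall factor of (-1)^k.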

Hence the name “negative binomial.”

The other trick to keep in mind is that we can define binomial coefficients with non-integer numbers. Using the fact that the Γ function (Gamma function) satisfies, for positive integers n,

Γ(n) = (n–1)!

we can write our binomial coefficients in the form

(k+r–1 choose k) = Γ(k+r) / (Γ(k+1)·Γ(r))

This lets the parameter r of the negative binomial distribution be any positive real number, not just an integer. That is useful because when we estimate our models, we generally don’t have a way to constrain r to be an integer, so a non-integer estimate for r won’t be a problem. (We will still require r to be positive.) We’ll come back to how to interpret a non-integer value of r.
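
To make the Gamma-function form concrete, here is a minimal sketch of the PMF that accepts non-integer r. The function name `nb_pmf` is an illustrative choice, and the code assumes 0 < p < 1 and r > 0. It works with log-Gamma values to avoid overflow.

```python
from math import comb, exp, lgamma, log

def nb_pmf(k, r, p):
    """P(k failures before the r-th success), valid for any real r > 0 and 0 < p < 1."""
    # log of Gamma(k + r) / (Gamma(k + 1) * Gamma(r))
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

# For integer r the Gamma form agrees with the ordinary binomial coefficient.
k, r, p = 4, 3, 0.3
gamma_form = nb_pmf(k, r, p)
binomial_form = comb(k + r - 1, k) * p**r * (1 - p) ** k
print(gamma_form, binomial_form)

# A non-integer r still gives a well-defined probability.
print(nb_pmf(4, 2.5, 0.3))
```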

Properties of the Negative Binomial Distribution

We would like to compute the expectation and variance. As a warmup, let’s check that the negative binomial distribution is in fact a probability distribution. For convenience, let q=1–p.
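
The normalization check can be written out as follows. This is a reconstruction from the surrounding text, using the negative-binomial-coefficient form and q = 1 - p; the label f for the PMF is an assumption of this sketch.

```latex
\[
\begin{aligned}
\sum_{k=0}^{\infty} f(k; r, p)
  &= \sum_{k=0}^{\infty} \binom{k+r-1}{k}\, p^{r} q^{k} \\
  &= p^{r} \sum_{k=0}^{\infty} \binom{-r}{k} (-q)^{k} \\
  &= p^{r} (1-q)^{-r} \\
  &= p^{r} p^{-r} = 1
\end{aligned}
\]
```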

The crucial point is the third line, where we used the binomial theorem (yes, it works with negative exponents, as long as |q| < 1).

Now let’s compute the expectation:
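
One way to write the computation is sketched below. It is a reconstruction consistent with the line references in the text, with q = 1 - p and the sum starting at k = 1 once the k = 0 term has dropped out.

```latex
\[
\begin{aligned}
E[k] &= \sum_{k=0}^{\infty} k \binom{k+r-1}{k}\, p^{r} q^{k} \\
     &= p^{r} \sum_{k=0}^{\infty} k\, (-1)^{k} \binom{-r}{k}\, q^{k} \\
     &= p^{r} \sum_{k=1}^{\infty} (-r)\, (-1)^{k} \binom{-r-1}{k-1}\, q^{k} \\
     &= r q\, p^{r} \sum_{k=1}^{\infty} \binom{-r-1}{k-1} (-q)^{k-1} \\
     &= r q\, p^{r} (1-q)^{-r-1} \\
     &= r q\, p^{r} p^{-r-1} \\
     &= \frac{r q}{p} = \frac{r(1-p)}{p}
\end{aligned}
\]
```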

To get the third line, we used the identity
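
The identity in question is presumably the standard absorption identity for binomial coefficients, which also holds for a negative top argument:

```latex
\[
k \binom{-r}{k} = -r \binom{-r-1}{k-1},
\qquad\text{a special case of}\qquad
k \binom{n}{k} = n \binom{n-1}{k-1}
\]
```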

We used the binomial theorem again to get the third-to-last line.

Warning: this is the opposite of what you will find on Wikipedia as of this writing, but it matches what you will find from Wolfram (the makers of Mathematica). This is because Wikipedia counts the number of successes before r failures, whereas we count failures before r successes. In general, there are several similar ways to parameterize and interpret the distribution, so make sure you have the conventions straight when comparing formulas from different sources.

Next, we can compute the variance in two steps. First, we repeat the trick from above, using the identity twice this time to get the third line. We again use the binomial theorem to compute the sum and obtain the third-to-last line.
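
A sketch of the first step, computing the second factorial moment E[k(k-1)], is given below. It is reconstructed to match the line references in the text, applies the identity k·C(n, k) = n·C(n-1, k-1) twice, and again uses q = 1 - p.

```latex
\[
\begin{aligned}
E[k(k-1)] &= \sum_{k=0}^{\infty} k(k-1) \binom{k+r-1}{k}\, p^{r} q^{k} \\
          &= p^{r} \sum_{k=0}^{\infty} k(k-1)\, (-1)^{k} \binom{-r}{k}\, q^{k} \\
          &= p^{r} \sum_{k=2}^{\infty} (-r)(-r-1)\, (-1)^{k} \binom{-r-2}{k-2}\, q^{k} \\
          &= r(r+1)\, q^{2} p^{r} \sum_{k=2}^{\infty} \binom{-r-2}{k-2} (-q)^{k-2} \\
          &= r(r+1)\, q^{2} p^{r} (1-q)^{-r-2} \\
          &= r(r+1)\, q^{2} p^{r} p^{-r-2} \\
          &= \frac{r(r+1)\, q^{2}}{p^{2}}
\end{aligned}
\]
```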

Now we can compute:
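
Combining the second factorial moment E[k(k-1)] = r(r+1)q²/p² with the mean E[k] = rq/p, the variance works out as follows (a reconstruction, with q = 1 - p):

```latex
\[
\begin{aligned}
\operatorname{Var}(k) &= E[k(k-1)] + E[k] - E[k]^{2} \\
  &= \frac{r(r+1)\, q^{2}}{p^{2}} + \frac{r q}{p} - \frac{r^{2} q^{2}}{p^{2}} \\
  &= \frac{r q^{2}}{p^{2}} + \frac{r q}{p} \\
  &= \frac{r q\,(q + p)}{p^{2}} \\
  &= \frac{r q}{p^{2}} = \frac{r(1-p)}{p^{2}}
\end{aligned}
\]
```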

Again, this is the opposite of what is on Wikipedia.
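
Because conventions differ between sources, it can be reassuring to check the formulas numerically. The sketch below sums the PMF directly for a non-integer r and compares the results with r(1–p)/p and r(1–p)/p², the mean and variance in the failures-before-r-successes convention used here. The helper `nb_pmf`, the parameter values and the truncation point of 2000 terms are assumptions made for illustration.

```python
from math import exp, lgamma, log

def nb_pmf(k, r, p):
    """P(k failures before the r-th success), for real r > 0 and 0 < p < 1."""
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

r, p = 2.5, 0.3
q = 1 - p
ks = range(2000)  # the tail beyond this point is negligible for these parameters
probs = [nb_pmf(k, r, p) for k in ks]

total = sum(probs)
mean = sum(k * f for k, f in zip(ks, probs))
var = sum(k * k * f for k, f in zip(ks, probs)) - mean**2

print(total)               # close to 1
print(mean, r * q / p)     # the two values should agree
print(var, r * q / p**2)   # the two values should agree
```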

Interpretation of the Negative Binomial Distribution

We have covered the “defining interpretation” of the Negative Binomial Distribution: it is the number of failures before r successes occur, with the probability of success at each step being p. But there are a few other ways to look at the distribution that can be illuminating and also help interpret the case where r is not an integer.

Over-Dispersed Poisson Distribution

The Poisson distribution is a very simple model for count data, which assumes that events happen randomly at a certain rate. Then it models the distribution of how many events will occur in a given time interval. In the context of our examples, it would say that:

Customer service representatives get complaints at a constant rate. The variation in counts is due purely to random variation. (Compare this with the model where their behavior eventually changes.) Again, in modeling this, we could model a difference in rate between representatives based on exogenous covariates.

One big problem with the Poisson distribution is that the variance is equal to the mean. This may not fit our data. Let’s say we parameterize our Negative Binomial distribution with a mean λ and stopping parameter r. Then we have
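
Taking the mean of the failure count to be r(1–p)/p and the variance to be r(1–p)/p², the reparameterization can be sketched as follows (a reconstruction; the symbol σ² for the variance is an assumption of this sketch):

```latex
\[
\lambda = \frac{r(1-p)}{p}
\;\Longrightarrow\;
p = \frac{r}{r+\lambda}, \qquad 1-p = \frac{\lambda}{r+\lambda},
\qquad
\sigma^{2} = \frac{r(1-p)}{p^{2}} = \frac{\lambda}{p}
           = \lambda\left(1 + \frac{\lambda}{r}\right)
           = \lambda + \frac{\lambda^{2}}{r}
\]
```

The extra term λ²/r is what allows the variance to exceed the mean.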

Our probability mass function becomes
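
Substituting p = r/(r+λ) and 1–p = λ/(r+λ) into the PMF gives the following form (a reconstruction; f is a label chosen for this sketch):

```latex
\[
f(k; r, \lambda)
  = (-1)^{k} \binom{-r}{k}
    \left(\frac{r}{r+\lambda}\right)^{r}
    \left(\frac{\lambda}{r+\lambda}\right)^{k}
  = \frac{\Gamma(k+r)}{\Gamma(k+1)\,\Gamma(r)}
    \left(\frac{r}{r+\lambda}\right)^{r}
    \left(\frac{\lambda}{r+\lambda}\right)^{k}
\]
```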

Now let’s consider what happens if we take the limit r → ∞ while holding λ fixed. (This means that the probability of success goes to 1 as well, in the way defined by p = r/[λ+r].) In this limit, the binomial term approaches (–r)^k / k! and r + λ approaches r.
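
The limit can be sketched as follows, starting from the PMF written in terms of λ and r (a reconstruction from the description in the text; f is a label chosen for this sketch):

```latex
\[
\begin{aligned}
f(k; r, \lambda)
  &= (-1)^{k} \binom{-r}{k}
     \left(1 + \frac{\lambda}{r}\right)^{-r}
     \left(\frac{\lambda}{r+\lambda}\right)^{k} \\
  &\approx (-1)^{k}\, \frac{(-r)^{k}}{k!}
     \left(1 + \frac{\lambda}{r}\right)^{-r}
     \left(\frac{\lambda}{r}\right)^{k} \\
  &\xrightarrow[r \to \infty]{} \frac{\lambda^{k}\, e^{-\lambda}}{k!}
\end{aligned}
\]
```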

In the last line, the factors of r^k cancel and we have used the definition of the exponential. The result is that we recover the Poisson distribution.

Therefore, we can interpret the Negative Binomial Distribution as a generalization of the Poisson distribution. If the distribution is in fact Poisson, we will see a large r and p close to 1. This makes sense because as p approaches 1, the variance approaches the mean. When p is smaller than 1, the variance is higher than that of a Poisson distribution with the same mean, so we can see that the Negative Binomial distribution generalizes Poisson by increasing the variance.
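
The convergence to Poisson can be seen numerically. This sketch fixes the mean λ, sets p = r/(r+λ), and measures the largest pointwise gap between the two PMFs as r grows. The function names, the choice λ = 4 and the range of k are illustrative assumptions.

```python
from math import exp, lgamma, log

def nb_pmf_mean(k, lam, r):
    """Negative binomial PMF parameterized by mean lam and stopping parameter r."""
    p = r / (r + lam)
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

def poisson_pmf(k, lam):
    """Poisson PMF with mean lam, computed in log space."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

lam = 4.0
gaps = {}
for r in (1, 10, 100, 10_000):
    gaps[r] = max(abs(nb_pmf_mean(k, lam, r) - poisson_pmf(k, lam)) for k in range(60))
    variance = lam + lam**2 / r  # approaches the Poisson variance lam as r grows
    print(r, gaps[r], variance)
```

The gap shrinks and the variance approaches λ as r increases, which is the over-dispersion interpretation in numerical form.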

Mixture of Poisson Distributions

The Negative Binomial Distribution also arises as a mixture of Poisson random variables. For example, suppose that our customer service representatives each receive complaints at a given rate (they never change their behavior), but that rate varies between representatives. If that rate is randomly distributed according to a Gamma distribution, we get a Negative Binomial Distribution for the ensemble.

The intuition behind this is as follows. We initially said the Negative Binomial Distribution was the count of failures before r successes when we do coin flips. Instead, replace the coin flip with two Poisson processes. Process one (the “success” process) has rate p and process two (the “failure” process) has rate (1–p). This means that instead of thinking of the Negative Binomial Distribution as counting coin flips, we think of two independent processes generating “successes” and “failures,” and we just count how many failures occur before a certain number of successes.

Now, the Gamma Distribution is the distribution of waiting times for Poisson processes. Let T be the waiting time for r successes from the “success” process. T is Gamma distributed. Then, given T, the number of failures is Poisson distributed with a mean of (1–p)T.
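
The mixture construction can be simulated with the standard library alone. In this sketch, T is drawn from a Gamma distribution with shape r and scale 1/p, and the failure count is then drawn as Poisson with mean (1–p)T. The helper names, the seed and the parameter values are illustrative assumptions; the Poisson sampler counts exponential inter-arrival gaps, which is simple but only efficient for modest means.

```python
import random

def poisson_draw(mean, rng):
    """Count arrivals of a unit-rate Poisson process in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def mixture_draw(r, p, rng):
    """One draw from the Gamma-Poisson mixture described in the text."""
    # Waiting time for r successes at rate p: Gamma with shape r and scale 1/p.
    waiting_time = rng.gammavariate(r, 1.0 / p)
    # Failures arrive at rate (1 - p), so given T the count is Poisson((1 - p) * T).
    return poisson_draw((1 - p) * waiting_time, rng)

rng = random.Random(42)
r, p, n = 2.5, 0.3, 20_000
draws = [mixture_draw(r, p, rng) for _ in range(n)]

mean = sum(draws) / n
var = sum((d - mean) ** 2 for d in draws) / (n - 1)
print(mean, r * (1 - p) / p)     # sample mean vs. negative binomial mean
print(var, r * (1 - p) / p**2)   # sample variance vs. negative binomial variance
```

Note that r is not an integer here, which is one way to read a non-integer r: it is the shape of the Gamma distribution governing the rate.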

Conclusion

A few last points are worth making. First of all, there is no closed-form way to fit the Negative Binomial Distribution to data. Instead, use the Maximum Likelihood Estimator with numerical optimization. You can use the statsmodels package to do this in Python.
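
To show what numerical maximum likelihood involves, here is a hand-rolled sketch using only the standard library. It is not the statsmodels implementation. It relies on the fact that, for a fixed r, the likelihood is maximized at p = r/(r + sample mean), and then runs a golden-section search over log(r). The function names, the search bounds and the synthetic data are all assumptions made for illustration, and the search presumes the profile likelihood has a single peak inside the bounds.

```python
import random
from math import exp, lgamma, log

def profile_loglik(r, data):
    """Log-likelihood at r, with p set to its optimum r / (r + mean) for that r."""
    n = len(data)
    total = sum(data)
    p = r / (r + total / n)
    ll = sum(lgamma(k + r) - lgamma(k + 1) for k in data)
    return ll - n * lgamma(r) + n * r * log(p) + total * log(1 - p)

def fit_negative_binomial(data, lo=1e-2, hi=1e3, iters=80):
    """Golden-section search over log(r); returns (r_hat, p_hat)."""
    inv_phi = (5**0.5 - 1) / 2
    a, b = log(lo), log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = profile_loglik(exp(c), data), profile_loglik(exp(d), data)
    for _ in range(iters):
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = profile_loglik(exp(c), data)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = profile_loglik(exp(d), data)
    r_hat = exp((a + b) / 2)
    return r_hat, r_hat / (r_hat + sum(data) / len(data))

def sample_failures(r, p, rng):
    """Failures before the r-th success in repeated coin flips (integer r)."""
    successes = failures = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

rng = random.Random(7)
data = [sample_failures(3, 0.3, rng) for _ in range(5000)]  # synthetic data, true r=3, p=0.3
r_hat, p_hat = fit_negative_binomial(data)
print(r_hat, p_hat)  # estimates should land near the true values
```

In practice a library routine is preferable, since it also reports standard errors and handles edge cases such as data with variance below the mean.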

Also, it is possible to do Negative Binomial regression, modeling the effects of covariates. We’ll save that for a future article.
