Meet the Engine of A/B Testing: Chi-Square Test

Original article: https://medium.com/bukalapak-data/meet-the-engine-of-a-b-testing-chi-square-test-30e8a8ab44c5

Understand the concept and perform one from scratch

A/B testing is a user experience research methodology for establishing causal relationships. The name is shorthand for a simple controlled experiment in which users are randomly served one of two variants: variation A (control) and variation B (Young, 2014). The goal is then to figure out which variant performs better on some predefined metrics.

Here at Bukalapak, A/B tests are highly prevalent. We run them regularly to optimize almost every aspect of our app. Our use cases range from revamping a certain page, to testing different recommendation algorithms, to tweaking a specific user shopping journey; essentially, anything that helps us provide the best possible experience for our users.

A/B testing is all the more valuable because it demonstrates causality rather than mere correlation. It is therefore perfectly in line with the Bukalapak Data Team's principle, “Turning insight to action”: starting from associations we see in the data, we establish causality via an A/B test and then act on the result to drive growth.

A/B Testing in practice

One example of A/B testing at Bukalapak is a test we recently ran on address filling by our Mitra (our agent partners). For context, we encourage our Mitra to fill in their address, since it allows us to provide them with a better user experience (e.g. notifying them of relevant promos happening close to their location). We divided them into segments and sent each segment a differently worded push notification.

[Image: A/B testing our notification strategy with our Mitra.]

As it turned out, the push notification with a community-empowerment message outperformed the one offering rewards (voucher, cashback) in return for filling in the address. From this A/B test, we could conclude that the content of the message mattered more than potential carrots, which also saved us some unnecessary costs.

Realizing the benefits that A/B tests have to offer, we at Bukalapak developed Splitter, our in-house A/B testing platform that allows us to run experiments quickly and at scale. All we need to do is set up the variant treatments and define the success metrics of interest. Splitter handles the rest, from splitting the traffic (so that users are served different treatments) to analyzing the experiment result.

But what goes on under the hood of Splitter? What scientific tool actually powers A/B tests like the one mentioned above? From a statistical point of view, an A/B test is a form of hypothesis testing, so we need a statistical test from which to draw the conclusion. As it turns out, the chi-square test is precisely the method we are looking for.

In this post, we will walk through the theoretical concept of the chi-square test. Next, we will go through a working example, analyzing an A/B test from scratch, so that we deeply understand how things work.

Chi-Square Test

The chi-square test (for independence) is a statistical test that evaluates whether two or more categorical variables, each with two or more possible values, are independent or homogeneous (i.e. whether the values are distributed in roughly the same way across the variables).

As the definition suggests, the data used in a chi-square test is typically arranged in a contingency table such as the one in Figure 1.

In the contingency table in Figure 1, we have two variables, the learning methods (visual and auditory), each of which has two possible values (pass or fail the exam).

A data set like this is often called an “R×C table,” where R is the number of rows and C is the number of columns. This is a 2×2 table (McDonald, 2014).

Hypotheses to be tested

In the standard form of the chi-square test, the null and alternative hypotheses to be tested are as follows:

Null: The variables are independent, meaning that the values are distributed in roughly the same way for each variable.
Alternative: The variables are not independent, meaning that how the values are distributed depends on the variable.

Test statistic

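Given an R×C contingency table, the test statistic of the chi-square test is formulated as follows (Equation 1).

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{1}$$

where $O_{ij}$ is the observed count in row $i$ and column $j$, and $E_{ij}$ is the corresponding expected count under the null hypothesis.

Moreover, each expected count is computed from the row and column totals (Equation 2):

$$E_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{N} \tag{2}$$

where $N$ is the grand total over all cells.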

The test statistic in Equation 1 is known to approximately follow a chi-square distribution with (R-1)×(C-1) degrees of freedom (Frost, 2020). For example, a 2×2 contingency table like the one in Figure 1 has (2-1)×(2-1) = 1 degree of freedom.
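
To make the two formulas concrete, here is a minimal from-scratch sketch in Python. The function name `chi_square_statistic` and the counts in `learning_table` are made up for illustration; they are not the numbers from Figure 1.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom, expected) for an R x C table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Equation 2: expected count = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Equation 1: sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts: rows = (visual, auditory), columns = (pass, fail)
learning_table = [[40, 10], [30, 20]]
statistic, dof, expected = chi_square_statistic(learning_table)
print(round(statistic, 4), dof)  # 4.7619 1
```

The function works for any R×C table of counts, as long as no row or column total is zero.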

Compare the test statistic to the table value

After we compute the test statistic, we compare it with the table value. Precisely, our table value is

$$\chi^2_{1-\alpha,\,k}$$

that is, the $(1-\alpha)$ quantile of the chi-square distribution with $k$ degrees of freedom, where $k$ and $\alpha$ are the degrees of freedom and our predefined significance level, respectively.

If the test statistic is greater than the table value, we reject the null hypothesis at the given significance level. That is, we conclude that the variables are not independent: how the values are distributed depends on the variable.
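
The table value does not have to come from a printed table. The sketch below computes it with the Python standard library for the two simplest cases, using two known facts: a chi-square variable with 1 degree of freedom is a squared standard normal, and with 2 degrees of freedom its upper tail is exp(-x/2). The helper names are illustrative; for other degrees of freedom a statistics library would be the practical choice.

```python
import math
from statistics import NormalDist


def chi_square_critical_value(alpha, dof):
    """(1 - alpha) quantile of the chi-square distribution, for dof 1 or 2 only."""
    if dof == 1:
        # A chi-square variable with 1 degree of freedom is a squared standard normal.
        return NormalDist().inv_cdf(1 - alpha / 2) ** 2
    if dof == 2:
        # With 2 degrees of freedom the upper tail probability is exp(-x / 2).
        return -2 * math.log(alpha)
    raise ValueError("closed form only implemented for dof 1 or 2")


def reject_null(statistic, alpha, dof):
    """Decision rule: reject when the statistic exceeds the table value."""
    return statistic > chi_square_critical_value(alpha, dof)


print(round(chi_square_critical_value(0.05, 1), 2))  # 3.84
print(round(chi_square_critical_value(0.05, 2), 2))  # 5.99
```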

Working example

It’s now time to get our hands dirty. Let’s have a concrete example!

Suppose a digital company wants to improve the redemption rate of its promo vouchers by revamping its current MyVoucher page design. It has the following two competing designs:

Control: the existing design
Variant: the revamped design

They run the experiment by randomly serving each user one of the two designs and recording the user's action, namely whether or not the user redeems the voucher.
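
One common way to serve designs at random while keeping each user on the same design across visits is to hash the user ID. The sketch below is only an illustration of that idea: the function `assign_design` and the experiment name are hypothetical, and nothing here describes how Splitter actually splits traffic.

```python
import hashlib


def assign_design(user_id, experiment="myvoucher-revamp"):
    """Deterministically map a user to 'control' or 'variant' with roughly 50/50 odds."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"


# Simulate 10,000 hypothetical user IDs and count how the traffic is split.
counts = {"control": 0, "variant": 0}
for user_id in range(10_000):
    counts[assign_design(user_id)] += 1
print(counts)
```

Including the experiment name in the hashed string means a user's bucket in one experiment tells you nothing about their bucket in another.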

Suppose we have the following result.

Note that from the table in Figure 3 we can derive an equivalent table, shown in Figure 4, which might be more familiar to business users.
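
As a small illustration of that derivation, the sketch below turns a table of counts into the number of users and the redemption rate per design. The counts are hypothetical and are not the figures from the article's tables.

```python
# Hypothetical counts, not the figures from the article.
counts = {
    "control": {"redeemed": 200, "not_redeemed": 800},
    "variant": {"redeemed": 260, "not_redeemed": 740},
}

rates = {}
for design, cells in counts.items():
    users = cells["redeemed"] + cells["not_redeemed"]
    rates[design] = cells["redeemed"] / users
    print(f"{design}: {users} users, redemption rate {rates[design]:.1%}")
```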

We see from Figure 4 that the redemption rate of the revamped design is higher than that of the existing design. Nevertheless, the difference might be due to random noise alone (i.e. not statistically significant). Our task is to use the chi-square test to check whether or not the difference is significant.

Hypotheses

In other words, we want to test the two competing hypotheses below. Notice that the wording differs from the hypotheses in the concept section; nevertheless, they are equivalent.

Null: There is no significant difference in redemption rates obtained by the two designs
Alternative: There is a significant difference in redemption rates obtained by the two designs

Before we carry out any computation, we fix our significance level alpha at 0.05 (5%) for the whole analysis, so that the threshold cannot be adjusted after seeing the data.

Computing expected cell value

We first compute the expected value for each cell of the contingency table in Figure 3. To this end, we use Equation 2.

For convenience, let's put these values in a table (Figure 5).
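
The expected-value step can be sketched in a few lines of Python. The counts below are hypothetical stand-ins chosen for illustration, so the resulting numbers are not those of Figure 5.

```python
observed = [[200, 800],   # control: redeemed, not redeemed (hypothetical)
            [260, 740]]   # variant: redeemed, not redeemed (hypothetical)

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Equation 2 applied to every cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
for label, row in zip(("control", "variant"), expected):
    print(label, row)
# control [230.0, 770.0]
# variant [230.0, 770.0]
```

Both rows get the same expected counts here only because the two hypothetical groups are the same size; under the null hypothesis each group is expected to redeem at the pooled rate of 460/2000 = 23%.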

Computing test statistics

Next, we compute the test statistic, whose formula is given in Equation 1. This is quite straightforward since we already have the two ingredients, namely the actual result table (Figure 3) and the expected value table (Figure 5).
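
Here is a sketch of this step on hypothetical counts (again, not the article's data, so the result is not 67). It computes each cell's contribution to the statistic and cross-checks the total against the well-known shortcut formula for 2×2 tables.

```python
observed = [[200, 800], [260, 740]]          # hypothetical counts
expected = [[230.0, 770.0], [230.0, 770.0]]  # expected counts for these totals

# Per-cell contribution (observed - expected)^2 / expected, as in Equation 1.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
statistic = sum(sum(row) for row in contributions)

# Cross-check with the 2x2 shortcut: N * (ad - bc)^2 / (product of the four margins).
(a, b), (c, d) = observed
n = a + b + c + d
shortcut = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(statistic, 3), round(shortcut, 3))  # 10.164 10.164
```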

Gather the conclusion

Once we have the test statistic (67), we compare it with the table value, that is, the value below which a chi-square distribution with (2-1)×(2-1) = 1 degree of freedom places a probability of (1-0.05) = 0.95. That value is 3.84.

Our test statistic (67) is greater than the table value (3.84), so we reject the null hypothesis. There is enough evidence to state that the redemption rates obtained by the two designs differ significantly.

Moreover, since the redemption rate of the revamped design is higher than that of the control (see Figure 4), we can conclude at the 5% significance level that the revamped design is the winner of this experiment: it is a better design for the MyVoucher page than the control.
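
An equivalent way to state the same decision is through a p-value: reject the null hypothesis when the p-value is below alpha. The sketch below uses the fact that, for 1 degree of freedom, the upper tail of the chi-square distribution is erfc(sqrt(x/2)). The function name is illustrative, and the article itself only uses the table-value comparison.

```python
import math


def chi_square_p_value_dof1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))


alpha = 0.05
p_at_table_value = chi_square_p_value_dof1(3.84)  # should be close to alpha
p_observed = chi_square_p_value_dof1(67)          # statistic reported in the article

print(round(p_at_table_value, 3))  # 0.05
print(p_observed < alpha)          # True
```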

Closing remarks

In this article, we saw that an A/B test can be viewed as a statistical hypothesis testing problem. We learned the concept of the chi-square test, the statistical tool that powers such A/B tests. We then applied the test to analyze a working example of an A/B test from scratch. Hopefully, you now have a better understanding of the methodology.

Since this article is by no means intended to be a comprehensive treatment of either A/B testing or the chi-square test, here are some points to be aware of:

First, regarding the chi-square test. The test is reliable only if the sample size is relatively large, i.e. greater than 1000 (McDonald, 2014). If this threshold is not met, the test result might not be reliable. In such cases, one can use Fisher's exact test instead.
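
For small samples, Fisher's exact test on a 2×2 table can also be written from scratch. The sketch below uses one common two-sided convention (summing the probabilities of all tables that are no more likely than the observed one, with the margins held fixed); the function name is illustrative, and the example table is a tiny made-up one.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        p
        for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


print(round(fisher_exact_two_sided([[3, 1], [1, 3]]), 4))  # 0.4857
```

The small tolerance in the comparison guards against floating-point ties between equally likely tables.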

Second, there is another type of chi-square test besides the independence/homogeneity test discussed in this article: the chi-square test for goodness of fit. Briefly, it is used when we want to test whether or not the distribution of a categorical variable follows a specific (given or assumed) distribution.
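
A goodness-of-fit test uses the same statistic, except that the expected counts come from the assumed distribution rather than from row and column totals, and the degrees of freedom are the number of categories minus one. The sketch below is a minimal illustration with made-up data (payment methods and their assumed shares are hypothetical).

```python
import math


def goodness_of_fit(observed, assumed_proportions):
    """Chi-square goodness-of-fit statistic and degrees of freedom."""
    total = sum(observed)
    expected = [total * p for p in assumed_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical: 100 orders split across three payment methods,
# tested against an assumed 40% / 40% / 20% split.
statistic, dof = goodness_of_fit([50, 30, 20], [0.4, 0.4, 0.2])

# With 2 degrees of freedom the upper tail is exp(-x / 2), so the
# table value at alpha = 0.05 is -2 * ln(0.05), about 5.99.
critical = -2 * math.log(0.05)
print(round(statistic, 2), dof, statistic > critical)  # 5.0 2 False
```

In this made-up case the statistic (5.0) stays below the table value, so the assumed split is not rejected at the 5% level.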

Third, regarding the metrics of A/B testing. We are not limited to proportion-based metrics such as the redemption rate in this article. Depending on the problem at hand, we might want to evaluate a continuous numeric metric (for instance, average transaction amount) through A/B testing. In that case, we need a different statistical technique to analyze the experiment.
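
The article does not name the technique for continuous metrics. One widely used option, shown here purely as an assumption, is Welch's t-test for comparing two means. The sketch computes Welch's t statistic and, to stay within the standard library, approximates the two-sided p-value with the normal distribution, which is reasonable only for large samples. The data are synthetic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test_large_sample(sample_a, sample_b):
    """Welch's t statistic with a normal-approximation two-sided p-value."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    # For large samples the t distribution is close to the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value


# Synthetic transaction amounts for control and variant (hypothetical data).
control = [100 + (i % 10) for i in range(200)]
variant = [103 + (i % 10) for i in range(200)]
t, p_value = welch_test_large_sample(control, variant)
print(round(t, 2), p_value < 0.05)
```

For small samples, the p-value should come from the t distribution with Welch's degrees of freedom instead of the normal approximation.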

Finally, we can generalize A/B testing to include two or more non-control variants (a.k.a. multivariate A/B testing). Again, this calls for a (slightly) different statistical method to draw conclusions.
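
One possible approach for several variants, not necessarily the one the authors have in mind, is to keep the chi-square test and simply add rows to the contingency table, which raises the degrees of freedom. The sketch below does this for a hypothetical three-design experiment; the counts are made up.

```python
import math

# Hypothetical counts: rows = (control, variant B, variant C),
# columns = (redeemed, not redeemed).
observed = [[100, 900], [120, 880], [150, 850]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

statistic = 0.0
for row, row_total in zip(observed, row_totals):
    for count, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        statistic += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (3-1) x (2-1) = 2
critical = -2 * math.log(0.05)  # 0.95 quantile for 2 degrees of freedom

print(round(statistic, 2), dof, statistic > critical)  # 11.72 2 True
```

A significant result here only says that at least one design differs from the others. Finding out which one requires follow-up pairwise comparisons, typically with a correction for multiple testing.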

Thanks for reading, and happy experimenting!
