A/B test confidence: Why your A/B test needs confidence intervals

Latest recommended article published on 2024-06-21 10:31:34

张_伟_杰

Original article: https://medium.com/criteo-labs/why-your-ab-test-needs-confidence-intervals-bec9fe18db41

by Aloïs Bissuel, Vincent Grosbois and Benjamin Heymann

The recent media debate on COVID-19 drugs is a unique occasion to discuss why decision making in an uncertain environment is a complicated but fundamental topic in our technology-based, data-fueled societies.

Aside from systematic biases (which are also an important topic), any scientific experiment is subject to many unfathomable, noisy, or random phenomena. For example, when one wants to know the effect of a drug on a group of patients, one may always wonder: “If I were to repeat the test on a similar group, would I get the same result?” Such an experiment would at best end up with similar results. Hence the right question is probably something like “Would my conclusions still hold?” Several dangers await the incautious decision-maker.

A first danger comes from the design and context of the experiment. Suppose you are given a billion dice, and that you roll each die 10 times. If one die gives you a 6 ten times in a row, does it mean it is biased? This is possible, but on the other hand, the design of this experiment made it extremely likely to observe such an outlier, even if all the dice are fair. In this case, the eleventh throw of this die will likely not be a 6 again.
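To make the dice argument concrete, here is a minimal sketch that computes how likely such an outlier is when every die is fair. The constant and variable names are illustrative, and the only assumptions are that the dice are fair and the rolls independent.

```python
import math

N_DICE = 1_000_000_000   # number of fair dice, as in the text
N_ROLLS = 10             # rolls per die

# Probability that one fair die shows a 6 on all ten rolls.
p_all_sixes = (1 / 6) ** N_ROLLS

# Expected number of such dice among the billion.
expected_count = N_DICE * p_all_sixes

# Probability that at least one die shows ten 6s: 1 - (1 - p)^N.
# log1p/expm1 keep the computation numerically stable for tiny p.
p_at_least_one = -math.expm1(N_DICE * math.log1p(-p_all_sixes))

print(f"P(ten 6s for one die)     = {p_all_sixes:.3e}")
print(f"Expected dice with ten 6s = {expected_count:.1f}")
print(f"P(at least one such die)  = {p_at_least_one:.8f}")
```

Under these assumptions, about 16 to 17 fair dice are expected to show ten 6s, so seeing one says very little about bias. For a fair die, the eleventh throw still shows a 6 with probability 1/6 only.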

A second danger comes from human factors, in particular incentives and cognitive bias. Cognitive bias, because when someone is convinced of something, they are more inclined to listen to, and report, a positive rather than a negative signal. Incentives, because society, be it in the workplace, the media or scientific communities, is more inclined to praise statistically positive results than negative ones. For instance, suppose an R&D team is working on improving a module by running A/B tests. Suppose also that positive outcomes are very unlikely, negative ones very likely, and that the result of each experiment is noisy. An incautious decision-maker is then likely to roll out more changes that are truly negative than changes that are truly positive, hence the overall effort of the R&D team will end up deteriorating the module.
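The R&D scenario can be illustrated with a small simulation. This is only a sketch under assumed, hypothetical values: 10% of changes are truly positive, every true effect is +1 or -1, the measurement noise is Gaussian with standard deviation 5, and the decision rule is "roll out whenever the observed uplift is positive". None of these numbers come from the article.

```python
import random

def simulate_rollouts(n_experiments=10_000, p_positive=0.1,
                      effect_size=1.0, noise_std=5.0, seed=0):
    """Count truly positive and truly negative changes that get rolled out
    when the decision rule is simply 'observed uplift > 0'."""
    rng = random.Random(seed)
    rolled_positive = rolled_negative = 0
    for _ in range(n_experiments):
        # True effect: rarely positive, usually negative (hypothetical values).
        true_effect = effect_size if rng.random() < p_positive else -effect_size
        # The A/B test only reveals a noisy measurement of the true effect.
        observed = true_effect + rng.gauss(0.0, noise_std)
        if observed > 0:  # naive rule: ship whatever looks positive
            if true_effect > 0:
                rolled_positive += 1
            else:
                rolled_negative += 1
    return rolled_positive, rolled_negative

good, bad = simulate_rollouts()
print(f"Rolled out: {good} truly positive, {bad} truly negative changes")
```

With these settings, the naive rule ships far more harmful changes than beneficial ones, because the noise is large compared with the effect and harmful changes are much more common to begin with.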

At Criteo, we use A/B testing to make decisions while coping with uncertainty. It is something we understand well because it is at the heart of our activities. Still, A/B testing raises many technically involved questions that require some math and statistical analysis. We propose to answer some of them in this blog post.

Figure: When can you conclude on this A/B-test?

We will introduce several statistical tools to determine if an A/B-test is significant, depending on the type of metric we are looking at. We will focus first on additive metrics, where simple statistical tools give direct results, and then we will introduce the bootstrap method which can be used in more general settings.

How to conclude on the significance of an A/B-test

We will present good practices based on the support of the distribution (binary / non-binary) and on the type of metric (additive / non-additive). In the following sections, we will propose different techniques that allow us to assess whether an A/B-test change is significant, or whether we cannot confidently conclude that the A/B-test had any effect.

We measure the impact of the change made in the A/B-test by looking at metrics. In the case of Criteo, such a metric could for instance be “the number of sales made by a user over a given period”.

We measure the same metric on two populations: the reference population and the test population. Each population has its own distribution, from which we gather data points (and eventually compute the metric). We also assume that the A/B-test has split the population randomly and that the measurements are independent of each other.
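As an illustration of this setup, the following sketch splits users at random into a reference and a test population and computes the same metric on both. The helper names (`split_population`, `mean_metric`) and the synthetic sales data are hypothetical and do not describe Criteo's actual pipeline.

```python
import random

def split_population(user_ids, test_fraction=0.5, seed=42):
    """Assign each user independently and at random to reference or test."""
    rng = random.Random(seed)
    reference, test = [], []
    for user_id in user_ids:
        if rng.random() < test_fraction:
            test.append(user_id)
        else:
            reference.append(user_id)
    return reference, test

def mean_metric(user_ids, metric_by_user):
    """Average of the per-user metric over one population."""
    return sum(metric_by_user[u] for u in user_ids) / len(user_ids)

# Hypothetical data: number of sales per user over the period.
data_rng = random.Random(0)
users = list(range(10_000))
sales_by_user = {u: data_rng.choice([0, 0, 0, 1, 2]) for u in users}

reference, test = split_population(users)
print("reference:", mean_metric(reference, sales_by_user))
print("test:     ", mean_metric(test, sales_by_user))
```

No change is applied to the test population here, so any gap between the two printed values is pure sampling noise. Quantifying that noise is what separates a real effect from a random fluctuation.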

Some special cases for additive metrics

Additive metrics can be computed at small granularity (for instance at the display or user level), and then summed up to form the final metric. Examples might include the total cost of an advertiser campaign (which is the sum of the cost of each display), or number of buyers for a specific partner.
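A short sketch of what additivity means in practice: the total is the same whether it is summed directly over displays or first aggregated per user and then summed. The display log below is made up for illustration, with costs expressed as integers in an arbitrary unit.

```python
from collections import defaultdict

# Hypothetical display log: (user_id, cost of the display).
displays = [("u1", 120), ("u2", 80), ("u1", 200), ("u3", 50), ("u2", 70)]

# Summing at the finest granularity (one term per display).
total_from_displays = sum(cost for _, cost in displays)

# Aggregating per user first, then summing the per-user subtotals.
cost_by_user = defaultdict(int)
for user_id, cost in displays:
    cost_by_user[user_id] += cost
total_from_users = sum(cost_by_user.values())

# Both routes give the same campaign total.
print(total_from_displays, total_from_users)
```

This property lets an additive metric be treated as a sum of per-unit contributions, which is what makes simple statistical tools applicable to it.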

The metric contains only zeros or ones

At Criteo, we
