How to Conduct A/B Testing?

The idea of A/B testing is to present different content to different user groups (variants), gather their reactions and behaviour, and use the results to shape product or marketing strategies in the future.

A/B testing is a methodology for comparing multiple versions of a feature, page, button, headline, page structure, form, landing page, navigation, pricing, etc., by showing the different versions to customers or prospective customers and assessing the quality of interaction with some metric (click-through rate, purchases, following a call to action, and so on).

This is becoming increasingly important in a data-driven world where business decisions need to be backed by facts and numbers.

How to conduct a standard A/B test

  1. Formulate your hypothesis
  2. Decide on splitting and evaluation metrics
  3. Create your control group and test group
  4. Length of the A/B test
  5. Conduct the test
  6. Draw conclusions

1. Formulate your hypothesis

Before conducting an A/B test, you want to state your null hypothesis and alternative hypothesis:

The null hypothesis states that there is no difference between the control and variant groups. The alternative hypothesis states that there is a difference between the control and variant groups.

Imagine a software company that is looking for ways to increase the number of people who pay for its software. As the software is currently set up, users can download and use it free of charge for a 7-day trial. The company wants to change the layout of the homepage, replacing the blue logo with a red one, to emphasise that a 7-day trial of the software is available.

Here is an example of a hypothesis test:
Default action: approve the blue logo.
Alternative action: approve the red logo.
Null hypothesis: the red logo does not cause at least 10% more license purchases than the blue logo.
Alternative hypothesis: the red logo does cause at least 10% more license purchases than the blue logo.

It’s important to note that all other variables need to be held constant when performing an A/B test.

2. Decide on splitting and evaluation metrics

We should consider two things: where and how we should split users into experiment groups when entering the website, and what metrics we will use to track the success or failure of the experimental manipulation. The choice of unit of diversion (the point at which we divide observations into groups) may affect what evaluation metrics we can use.

The control, or ‘A’ group, will see the old homepage, while the experimental, or ‘B’ group, will see the new homepage that emphasises the 7-day trial.

Three different splitting (diversion) techniques:

a) Event-based diversion
b) Cookie-based diversion
c) Account-based diversion

An event-based diversion (like a pageview) can provide many observations to draw conclusions from, but if the condition changes on each pageview, then a visitor might get a different experience on each homepage visit. Event-based diversion is much better when the changes aren’t as easily visible to users, to avoid disruption of experience.

In addition, event-based diversion would let us know how many times the download page was accessed from each condition, but can’t go any further in tracking how many actual downloads were generated from each condition.

Account-based diversion can be stable, but it is not suitable in this case. Since visitors only register after getting to the download page, it is too late to introduce the new homepage to people who should be assigned to the experimental condition.

So this leaves cookie-based diversion, which feels like the right choice. Cookies also allow tracking of each visitor hitting each page. The downside of cookie-based diversion is that counts become somewhat inconsistent if users enter the site via an incognito window or a different browser, or if cookies expire or get deleted before a download is made. As a simplification, however, we’ll assume that this kind of assignment dilution will be small, and ignore its potential effects.
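
As a minimal sketch of how cookie-based diversion could work in practice, the snippet below hashes each cookie ID to assign visitors deterministically to one of the two conditions. The experiment name, cookie IDs and 50/50 split are illustrative assumptions, not details from this experiment:

```python
import hashlib

def assign_group(cookie_id: str, experiment: str = "homepage_red_logo") -> str:
    """Deterministically assign a cookie to 'control' or 'experiment'.

    Hashing the cookie ID (salted with the experiment name) keeps the
    assignment stable across visits, so a returning visitor keeps seeing
    the same homepage variant.
    """
    digest = hashlib.sha256(f"{experiment}:{cookie_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # map the hash onto 0-99
    return "experiment" if bucket < 50 else "control"   # assumed 50/50 split

# Illustrative usage with made-up cookie IDs
for cookie in ["a1b2c3", "d4e5f6", "0f9e8d"]:
    print(cookie, "->", assign_group(cookie))
```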

In terms of evaluation metrics, we should use the download rate (# downloads / # cookies) and the purchase rate (# licenses / # cookies), both measured relative to the number of cookies.

Product usage statistics like the average time the software was used in the trial period are potentially interesting features, but aren’t directly related to our experiment. Certainly, these statistics might help us dig deeper into the reasons for observed effects after an experiment is complete. But in terms of experiment success, product usage shouldn’t be considered as an evaluation metric.

3. Create your control group and test group

Once you determine your null and alternative hypotheses, the next step is to create your control and test (variant) groups. There are two important concepts to consider in this step: sampling and sample size.

Sampling
Random sampling is one of the most common sampling techniques. Each sample in a population has an equal chance of being chosen. Random sampling is important in hypothesis testing because it eliminates sampling bias, and it’s important to eliminate bias because you want the results of your A/B test to be representative of the entire population rather than the sample itself.

A problem with A/B tests is that if you haven’t defined your target group properly, or you’re in the early stages of your product, you may not know a lot about your customers. If you’re not sure who they are (try creating some user personas to get started!), then you might end up with misleading results. It’s important to understand which sampling method suits your use case.

Sample size
It’s essential to determine the minimum sample size for your A/B test before conducting it, so that you can eliminate undercoverage bias: the bias that comes from sampling too few observations.

4. Length of the A/B test

A calculator like this one can help you determine the length of time you need to get any real significance from your A/B tests.

Historical data shows that there are about 3250 unique visitors per day, with about 520 software downloads per day (a .16 rate) and about 65 licenses purchased each day (a .02 rate). In an ideal case, both the download rate and the license purchase rate would increase with the new homepage; a statistically significant negative change should be a sign not to deploy the homepage change. However, if only one of our metrics shows a statistically significant positive change, we should be happy enough to deploy the new homepage.

For an overall 5% Type I error rate with Bonferroni correction and 80% power, we should require 6 days to reliably detect an increase of 50 downloads per day and 21 days to detect an increase of 10 license purchases per day. Performing both individual tests at a .05 error rate carries the risk of making too many Type I errors, so we’ll apply the Bonferroni correction and run each test at a .025 error rate to protect against making too many errors.
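
A rough version of this power calculation can also be sketched in Python with statsmodels. The baseline rates come from the historical data above (520/3250 downloads and 65/3250 licenses per day); the 50/50 traffic split, the one-sided test and the Cohen's-h effect size are assumptions of the sketch, so the day counts it prints will not exactly match the 6 and 21 days reported by the online calculator:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

daily_visitors = 3250
visitors_per_group_per_day = daily_visitors / 2   # assumed 50/50 split

def days_needed(p_baseline, p_target, alpha=0.025, power=0.8):
    """Days of traffic needed per group to detect p_baseline -> p_target."""
    effect = proportion_effectsize(p_target, p_baseline)        # Cohen's h
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="larger")
    return n_per_group / visitors_per_group_per_day

# Download rate: 520/3250 = .16 baseline, +50 downloads/day as the target lift
print("download test:", days_needed(520 / 3250, 570 / 3250), "days")

# License purchase rate: 65/3250 = .02 baseline, +10 licenses/day as the target
print("license test: ", days_needed(65 / 3250, 75 / 3250), "days")
```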

Use the link above for the test-day calculations:
Estimated existing conversion rate (%): 16%
Minimum improvement in conversion rate you want to detect (%): 50/520 * 100 %
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent visitors included in test: 100%
Total number of days to run the test: 6 days

Estimated existing conversion rate (%): 2%
Minimum improvement in conversion rate you want to detect (%): 10/65 * 100 %
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent visitors included in test: 100%
Total number of days to run the test: 21 days

One thing that isn’t accounted for in the base experiment length calculations is that there is going to be a delay between when users download the software and when they actually purchase a license. That is, when we start the experiment, there could be about seven days before a user account associated with a cookie actually comes back to make their purchase. Any purchases observed within the first week might not be attributable to either experimental condition. As a way of accounting for this, we’ll run the experiment for about one week longer to allow those users who come in during the third week a chance to come back and be counted in the license purchases tally.

As for biases, we don’t expect users to come back to the homepage regularly. Downloading and license purchasing are actions we expect to only occur once per user, so there’s no real ‘return rate’ to worry about. One possibility, however, is that if more people download the software under the new homepage, the expanded user base is qualitatively different from the people who came to the page under the original homepage. This might cause more homepage hits from people looking for the support pages on the site, causing the number of unique cookies under each condition to differ. If we do see something wrong or out of place in the invariant metric (number of cookies), then this might be an area to explore in further investigations.
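
One simple way to monitor that invariant metric is a binomial test on how the unique cookies split between the two conditions. The counts below are placeholders for illustration, not numbers from this experiment:

```python
from scipy.stats import binomtest

# Placeholder counts of unique cookies observed in each condition
cookies_control = 46_700
cookies_experiment = 46_900

# Under 50/50 cookie-based diversion, the experiment count should look like a
# Binomial(n, 0.5) draw. A small p-value flags a sample-ratio mismatch that
# should be investigated before trusting the evaluation metrics.
result = binomtest(cookies_experiment,
                   n=cookies_control + cookies_experiment,
                   p=0.5)
print(f"p-value for a 50/50 split: {result.pvalue:.3f}")
```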

5. Conduct the test

Once you conduct your experiment and collect your data, you want to determine if the difference between your control group and variant group is statistically significant. There are a few steps in determining this:

  • First, you want to set your alpha, the probability of making a Type I error. Typically the alpha is set at 5%, or 0.05.

  • Second, you want to determine the probability value (p-value) by calculating the t-statistic or z-score for the difference between the two groups (see the sketch after this list).

  • Lastly, compare the p-value to the alpha. If the p-value is greater than the alpha, do not reject the null!
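
A minimal sketch of those steps using a two-proportion z-test from statsmodels; the download and cookie counts are hypothetical placeholders, not the experiment’s actual data:

```python
from statsmodels.stats.proportion import proportions_ztest

alpha = 0.05  # Type I error rate chosen before looking at the data

# Hypothetical counts: downloads out of unique cookies in each condition
downloads = [1_612, 1_805]     # control, experiment
cookies   = [10_000, 10_000]   # control, experiment

# One-sided test: is the experiment's download rate higher than control's?
# 'smaller' means the alternative is prop[0] < prop[1] (control < experiment).
z_stat, p_value = proportions_ztest(downloads, cookies, alternative="smaller")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

if p_value < alpha:
    print("Reject the null: the new homepage has a higher download rate.")
else:
    print("Do not reject the null.")
```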

5.1 Use actual statistics to compare the results

Do not rely on simple one-to-one comparisons to dictate what works and what does not. “Version A yields a 20 percent conversion rate and Version B yields a 22 percent conversion rate, therefore we should switch to Version B!” Please do not do this. Use actual confidence intervals, z-scores, and statistically significant data.
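
For example, here is a quick sketch of a 95% confidence interval for the difference between the two conversion rates in the quote above; the 20% and 22% figures come from the quote, while the 1,000 visitors per variant is an assumed sample size:

```python
import math

# Hypothetical example from the quote above: 20% vs 22% conversion
n_a, n_b = 1_000, 1_000          # assumed visitors per variant
p_a, p_b = 0.20, 0.22

diff = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = 1.96                     # two-sided 95% confidence

lower, upper = diff - z_crit * se, diff + z_crit * se
print(f"difference = {diff:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
# If the interval contains 0, the 2-point "improvement" could easily be noise.
```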

5.2 Product growth

Changing colours and layout may have a marginal impact on your key performance metrics. However, these results seem to be very short-lived. Product growth does not result from changing a button from red to blue; it comes from building a product that people want to use.

Instead of choosing the feature that you think might work, you can use an A/B test to find out what actually works.

5.3 Analyse data

For the first evaluation metric, the download rate, there was an extremely convincing effect. An absolute increase from 0.1612 to 0.1805 results in a z-score of 7.87 (z ≈ (0.1805 - 0.1612) / 0.0025) and a p-value < .00001, well beyond any standard significance bound. However, the second evaluation metric, the license purchase rate, only shows a small increase from 0.0210 to 0.0213 (under the assumption that only the first 21 days of cookies account for all purchases). This results in a p-value of 0.398 (z = 0.26).
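
As a small check, the p-values follow from these z-scores via the normal distribution. The 0.0025 standard error is the rounded value quoted above, so recomputing z from it gives roughly 7.7 rather than the 7.87 obtained from the raw counts:

```python
from scipy.stats import norm

# Download rate: 0.1612 (control) vs 0.1805 (experiment)
z_download = (0.1805 - 0.1612) / 0.0025   # rounded standard error quoted above
p_download = norm.sf(z_download)           # one-sided p-value

# License purchase rate: 0.0210 vs 0.0213
z_license = 0.26                           # z-score reported above
p_license = norm.sf(z_license)

print(f"downloads: z = {z_download:.2f}, p = {p_download:.2e}")
print(f"licenses:  z = {z_license:.2f}, p = {p_license:.3f}")   # ~0.398
```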

6. Draw conclusions

Despite the fact that statistical significance wasn’t obtained for the number of licenses purchased, the new homepage appeared to have a strong effect on the number of downloads made. Based on our goals, this seems enough to suggest replacing the old homepage with the new homepage. Establishing whether there was a significant increase in the number of license purchases, either through the rate or the increase in the number of homepage visits, will need to wait for further experiments or data collection.

One inference we might like to make is that the new homepage attracted new users who would not normally try out the program, but that these new users didn’t convert to purchases at the same rate as the existing user base. This is a nice story to tell, but we can’t actually say that with the data as given. In order to make this inference, we would need more detailed information about individual visitors that isn’t available. However, if the software did have the capability of reporting usage statistics, that might be a way of seeing if certain profiles are more likely to purchase a license. This might then open additional ideas for improving revenue.

Translated from: https://towardsdatascience.com/how-to-conduct-a-b-testing-3076074a8458
