A recent report by Martin Goodson of Qubit (a conversion optimization startup), titled Most Winning A/B Test Results Are Illusory, asserted that most winning results are false, largely because the tests are badly executed. According to the author, this can lead not only to the ‘needless modification of websites’ but, in some cases, to damage to a company’s profitability.

So why is this the case? And how can designers and businesses ensure that if A/B testing (also known as split testing) is carried out, it’s done properly and effectively?

What is A/B Testing?

To start with a quick primer, A/B testing is a way to compare two versions of a web page (a landing page, for example) to see which of the two performs better. To carry out a test, two groups of people each see a different version of the page, and results are measured by how each group interacts with the version it saw.

For example, a page that contains a strong call-to-action (CTA) in a certain area of the page may be pitted against another that is similar, but has the CTA in a different place and might use different wording or color.


Other page elements that are commonly varied in A/B testing include:

  • Headlines and product descriptions

  • Forms

  • Page layouts

  • Special offers

  • Images

  • Text (long form, short form)

  • Buttons


However, according to Goodson, it’s often the case that the tests carried out on these pages return results that are false and the expected ‘uplift’ (increase in conversions) is never realized.


The Math of A/B Testing

Simply put, each visit in an A/B test either converts or it doesn’t. The math starts from counting the visits to each version and calculating what percentage of them converted. In his report, Goodson points to two statistical concepts that are often neglected: statistical power and multiple testing.
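
As a rough illustration of the arithmetic, the sketch below computes a conversion rate for each version and the relative uplift between them. The function names and the visitor counts are made up for the example; they are not taken from Goodson's report.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors who converted."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return conversions / visitors


def relative_uplift(rate_a: float, rate_b: float) -> float:
    """Relative change of variant B over control A (0.10 means +10%)."""
    return (rate_b - rate_a) / rate_a


# Hypothetical counts, for illustration only.
rate_a = conversion_rate(200, 10_000)  # control page
rate_b = conversion_rate(230, 10_000)  # variant page
uplift = relative_uplift(rate_a, rate_b)
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  uplift: {uplift:+.1%}")
```

Note that an observed uplift like this is only a starting point: whether it reflects a real difference is exactly what the rest of the article is about.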

On statistical power, Goodson explains:


Statistical power is simply the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. It is normally expressed as a percentage.

However, power depends on the size of the sample: if too few people take part, the test is unlikely to detect a real difference, and any winner it does report is more likely to be a fluke. Getting true results also depends on how long the test runs.
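
To see how strongly power depends on sample size, here is a minimal sketch that approximates the power of a two-sided test comparing two conversion rates. It assumes a normal approximation and a 5% significance level, and the conversion rates are hypothetical; it is not the calculation used in Goodson's report.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    shift = abs(p_b - p_a) / se
    # Probability that the test statistic lands beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical rates: 2.0% for the control, 2.3% for the variant.
low_power = power_two_proportions(0.02, 0.023, n_per_group=1_000)
high_power = power_two_proportions(0.02, 0.023, n_per_group=50_000)
print(f"1,000 per group:  power = {low_power:.0%}")
print(f"50,000 per group: power = {high_power:.0%}")
```

With these assumed rates, a thousand visitors per group gives the test very little chance of spotting the difference, while fifty thousand per group makes detection likely.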

To work out how long a test needs to run, calculate the required sample size before you implement and run it. Get this wrong and the test is likely to return false results, so the uplift in sales it predicted is unlikely to materialize.
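
One common way to do this calculation is the textbook normal-approximation formula for comparing two proportions, sketched below. The 5% significance level and 80% power are conventional choices assumed here, the conversion rates are hypothetical, and the 5,000 daily visitors simply borrows the AppSumo traffic figure quoted later in this article.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed in each group to detect p_a vs p_b (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_b - p_a) ** 2)


def days_to_run(n_per_group, daily_visitors, groups=2):
    """Days of traffic needed if every visitor enters the test."""
    return ceil(groups * n_per_group / daily_visitors)


# Hypothetical rates: 2.0% baseline, hoping to detect a lift to 2.3%.
n = sample_size_per_group(0.02, 0.023)
days = days_to_run(n, daily_visitors=5_000)
print(f"{n:,} visitors per group, roughly {days} days at 5,000 visitors/day")
```

The smaller the difference you hope to detect, the faster the required sample grows, because the difference is squared in the denominator.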

Multiple testing, by contrast, refers to making many comparisons, whether by running many separate tests or by checking the same test repeatedly. A/B testing software such as Optimizely typically uses classical p-values to test statistical significance, and every extra comparison is another chance for a false positive.

Dangers of Using P-Values

Relying on p-values can, and often does, produce false results, for two well-known reasons:


  • Carrying out many tests

  • Stopping a test when positive results are seen
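
The second factor can be demonstrated with a small simulation. The sketch below runs A/A tests, where both groups see the same page and any "winner" is therefore false, and compares two policies: stopping as soon as any interim check looks significant versus judging only the final result. The conversion rate, batch size and number of checks are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa(rng, rate=0.10, batch=100, looks=20, alpha=0.05):
    """One A/A test. Returns (significant at any look, significant at the end)."""
    conv_a = conv_b = n = 0
    any_look_significant = False
    for _ in range(looks):
        conv_a += sum(rng.random() < rate for _ in range(batch))
        conv_b += sum(rng.random() < rate for _ in range(batch))
        n += batch
        p = p_value(conv_a, n, conv_b, n)
        if p < alpha:
            any_look_significant = True  # a "stop when it wins" policy ends here
    return any_look_significant, p < alpha


rng = random.Random(42)
runs = [simulate_aa(rng) for _ in range(400)]
peek_rate = sum(peeked for peeked, _ in runs) / len(runs)
final_rate = sum(final for _, final in runs) / len(runs)
print(f"false winners when stopping at the first significant look: {peek_rate:.0%}")
print(f"false winners when judging only the final result: {final_rate:.0%}")
```

Because there is no real difference between the groups, every winner here is a false positive, and checking repeatedly produces noticeably more of them than a single planned look.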


Bearing this in mind, it’s worth researching the software you plan to use: check how it tests variations and how intuitive it is.


Optimizely, for example, recommends that you set up variations before running tests using its software. If you don’t, you’re essentially just running an A/A test, as the variation is identical to the original page. The company also points out that if the variations aren’t set up properly, you’re likely to get winning results that are false.

Common Mistakes in A/B Testing

Firstly, A/B testing only returns true results if the sample size is large enough. According to Goodson, statistical power increases with larger samples. You may still see the odd random fluctuation with a large sample; this is unavoidable, and it won’t necessarily produce a false result.

However, not all sites have a large amount of traffic, so the sample size is, to some extent, beyond your control. Be aware of this: with few visitors, random fluctuations are more likely to dominate your results, and testing could be a waste of your time.

The second important factor in successful A/B testing is how long the test runs. Again, if you cut the test short, you reduce its statistical power and you’re likely to get false positives which, while appearing to show uplift, change nothing when it comes to the bottom line: revenue.

If you cut short a test when you think you’re seeing winning results, Goodson says:


“Almost two-thirds of winning tests will be completely bogus.”


In other words, it’s vital that you let the test run long enough, even if you’re already seeing a good number of conversions, in order to build statistical power and get real results. You could start out with a solid testing model and a good sample size, yet still let impatience lead you to false results.

Running Simultaneous Tests

Another recent and damaging trend has been to perform a lot of tests all at the same time. This is a bad idea because, at the usual 5% significance level, each test has a 5% chance of producing a winner even when there is no real difference. Perform 20 such tests and on average you’ll see one false winner; perform 40 and you’ll see two. Goodson says:

“Rather than a scattergun approach, it’s best to perform a small number of focused and well-grounded tests, all of which will have adequate statistical power.”
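
The arithmetic behind the 20-test and 40-test figures can be sketched in a few lines. This assumes the tests are independent, that none of the variations has any real effect, and that each test is judged at a 5% significance level.

```python
def expected_false_wins(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of false winners across n_tests no-effect tests."""
    return n_tests * alpha


def chance_of_any_false_win(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_tests no-effect tests 'wins'."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 20, 40):
    print(
        f"{n_tests:>2} tests: {expected_false_wins(n_tests):.1f} false wins expected, "
        f"{chance_of_any_false_win(n_tests):.0%} chance of at least one"
    )
```

Under these assumptions, 20 tests carry roughly a 64% chance of at least one false winner, and 40 tests roughly 87%.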


According to AppSumo, in their own testing for their product:

“Only 1 out of 8 A/B tests have driven significant change.”

AppSumo has around 5,000 visitors per day, and while they maintain that they have seen some excellent results, such as email conversions increasing more than fivefold and purchase rates doubling, the site has also seen some “harsh realities” when it comes to the testing process.


Even tests that they were sure would work simply failed for various reasons, which included:


  • People not reading the text

  • Using a % as an incentive rather than $

  • Pop-up/light boxes irritating the visitor


In order to actually get away with the above, it’s necessary to have a very strong brand that people trust, and that’s still the Holy Grail for many of us.


[Image: AppSumo failed test]

In the example above, the folks at AppSumo believe that the test failed due to the need to enter an email address — something that is a precious commodity to all sites. However, it’s equally precious to the “sophisticated” user and they don’t part with their email address lightly, which is why it’s a better idea to offer a cash incentive, rather than a percentage.

Running Successful A/B Testing

Before implementing a test, plan it well and determine how the test is likely to improve conversions. You should of course set up goals in Google Analytics for measurable results and use appropriate software. You can carry out testing manually, using your own calculations, or use a template such as the one prepared by Visual Website Optimizer. Alternatively (or additionally), there’s a free online A/B test significance tool that you can use, which was created by the same people.
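
If you do want to run the numbers yourself, one standard calculation is a pooled two-proportion z-test, sketched below. This is a generic textbook test, not necessarily what the Visual Website Optimizer template or tool computes, and the visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z statistic, two-sided p-value)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical results: 200 of 10,000 convert on A, 230 of 10,000 on B.
z, p = z_test(200, 10_000, 230, 10_000)
verdict = "significant" if p < 0.05 else "not significant"
print(f"z = {z:.2f}, p = {p:.3f} ({verdict} at the 5% level)")
```

In this made-up case B shows a 15% higher conversion rate than A, yet the difference is not statistically significant, which is a reminder that a promising-looking uplift is not the same as a reliable one.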

Note: You can also use Google Analytics Content Experiments to perform testing.

[Image: Google Analytics Content Experiments]

You should also realize that there’s no rush and you will have to exercise patience if you want to gain winning results — it’s likely that the tests will take weeks or even months to complete.

Additionally, you should:


  • Test only one page at a time, or even one element on a page.

  • Select pages that have a high bounce/exit rate.

  • Expect a bare minimum of 1000 visitors before seeing any results.

  • Be prepared for failure; very few tests are successful the first time around.

  • Understand that A/B testing has a learning curve.

  • Be patient.

  • Understand your customer.


Says the Miva Merchant blog:


“Going into it knowing that 7 out of 8 of your tests will produce insignificant improvements will likely prevent you from un-real expectations. Stick with it, and don’t give up after multiple insignificant results.”

If your testing is going to be a success, it’s important to know your audience, too. Creating a buyer persona is something that should always be carried out before the design and development phase, but many businesses fail to understand the importance of this. If you don’t know who you’re addressing, then how can you possibly even attempt to give them what they want?

It’s all in the planning, as it is for pretty much every aspect of business. In order to gain conversions, it’s always necessary to do your research, no exceptions.

Final Thoughts

Martin Goodson recommends that if you perform A/B tests and see no real uplift, or the uplift isn’t maintained, you carry out the test again to check whether it was carried out effectively in the first instance. He also points out that the uplift estimated from a test is often overstated (the ‘winner’s curse’), and that this is especially true of tests with a small sample size.
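
The winner's curse can be illustrated with a simulation. In the sketch below the variant really is better, but only slightly, and each test uses a deliberately small sample. Looking only at the tests that declare the variant a winner, the uplift they report is far larger than the true one. The conversion rates, sample size and number of simulated tests are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

TRUE_A, TRUE_B = 0.10, 0.105  # assumed true rates: a real 5% relative uplift
N_PER_GROUP = 1_000           # deliberately small sample
Z_CRIT = NormalDist().inv_cdf(0.975)


def observed_uplift_if_winner(rng):
    """Simulate one test; return B's observed uplift if B 'wins', else None."""
    conv_a = sum(rng.random() < TRUE_A for _ in range(N_PER_GROUP))
    conv_b = sum(rng.random() < TRUE_B for _ in range(N_PER_GROUP))
    if conv_a == 0:
        return None
    rate_a, rate_b = conv_a / N_PER_GROUP, conv_b / N_PER_GROUP
    pooled = (conv_a + conv_b) / (2 * N_PER_GROUP)
    se = sqrt(pooled * (1 - pooled) * 2 / N_PER_GROUP)
    if se == 0 or (rate_b - rate_a) / se <= Z_CRIT:
        return None  # B did not reach significance in this test
    return (rate_b - rate_a) / rate_a


rng = random.Random(7)
results = (observed_uplift_if_winner(rng) for _ in range(2_000))
winners = [u for u in results if u is not None]
true_uplift = (TRUE_B - TRUE_A) / TRUE_A
if winners:
    print(f"true uplift: {true_uplift:.0%}")
    print(f"winning tests: {len(winners)} of 2,000")
    print(f"mean uplift reported by winners: {mean(winners):.0%}")
```

With a small sample, a test can only reach significance when chance pushes the observed difference well above the true one, so the winners systematically overstate the uplift.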

If you do have a small sample, then ask yourself if it’s worth carrying out testing at this stage, as the results you’ll gain may not be very accurate at all. If this is the case, then you run the risk of making changes that will alienate future visitors.

A/B testing does have value, but if it isn’t carried out correctly, it will return false results. Even if it is done right, there’s no guarantee that it will be successful, so be prepared for this before you begin. Do your research, understand your goals, set up the test properly and exercise patience throughout, and you will get there.

Translated from: https://www.sitepoint.com/winning-ab-test-results-misleading/






