AB Testing Review

Alex Tech Bolg

Published 2021-11-04 10:00:01

Category: ABtest. Tags: AB-Testing, Statistics

Article link: https://blog.csdn.net/qq_41103204/article/details/121134494

Cracking A/B Testing Problems in DS interview
How to Estimate Sample Size in A/B Tests
A Summary of Udacity A/B Testing Course

Cracking A/B Testing Problems in DS interview

https://towardsdatascience.com/7-a-b-testing-questions-and-answers-in-data-science-interviews-eee6428a8b63

What is A/B Testing

power = 1 - P(Type II error) = 1 - β

Designing an A/B Test

More samples are needed if the sample variance is larger.
Fewer samples are needed if the difference between treatment and control is larger.
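
Both rules follow from the standard sample-size approximation for comparing two means. Below is a minimal sketch, assuming a two-sided z-test, equal group sizes and a known, equal variance in both groups. The function name and the example numbers are illustrative and do not come from the source article.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided z-test on means.

    sigma: standard deviation of the metric (assumed equal in both groups)
    delta: difference between treatment and control we want to detect
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


base = sample_size_per_group(sigma=1.0, delta=0.1)
more_variance = sample_size_per_group(sigma=2.0, delta=0.1)   # larger variance
bigger_effect = sample_size_per_group(sigma=1.0, delta=0.2)   # larger difference
print(base, more_variance, bigger_effect)
```

With alpha = 0.05 and power = 0.8 the constant 2(z_alpha + z_beta)^2 is about 15.7, which is where the common rule of thumb n ≈ 16σ²/δ² comes from.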

This value (the minimum detectable effect) is decided by multiple stakeholders.

Obtain the number of days to run the experiment by dividing the required sample size by the number of users entering each group per day.
If the result is less than 14 days, we would typically still run for 14 days to capture weekly patterns.
https://www.invespcro.com/blog/how-long-should-you-run-an-ab-test-for/
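
The duration rule can be written as a short calculation. This is a rough sketch that assumes steady daily traffic; the function name and the traffic figures are hypothetical.

```python
from math import ceil


def days_to_run(sample_size_per_group, daily_users_per_group, min_days=14):
    """Days needed to reach the sample size, never shorter than min_days."""
    days_needed = ceil(sample_size_per_group / daily_users_per_group)
    return max(days_needed, min_days)


print(days_to_run(1570, 500))    # 4 days would be enough, so run the 14-day minimum
print(days_to_run(30000, 1000))  # 30 days are needed
```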

Multiple Testing Problem

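Correction: about 10 false positives, and as an expectation rather than a guaranteed minimum.
Controlling the false discovery rate (FDR) only makes sense if you have a huge number of metrics.
Suppose we have 200 metrics and cap FDR at 0.05. This means we accept that, on average, 5% of the metrics we declare significant are false positives. If all 200 metrics were declared significant, that would be about 10 false positives (200 × 0.05) in expectation, not a fixed count in every experiment.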
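
One common way to cap the FDR is the Benjamini-Hochberg procedure. The source does not say which method it has in mind, so treat the choice of procedure as an assumption; the p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for 10 metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

Note that five of these p-values are below 0.05, but only the two smallest survive the correction.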

Novelty and Primacy Effect (Change Aversion)

If a test is already running and we want to check for a novelty effect, we can compare first-time users' results with existing users' results in the treatment group to estimate its actual impact. The same approach works for the primacy effect.
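
A minimal sketch of that comparison, using made-up counts and assuming the two segments are otherwise comparable. First-time users have no old experience to compare against, so the gap between the segments is a rough estimate of how much novelty (or primacy) moves existing users.

```python
def conversion_rate(conversions, users):
    """Share of users who converted."""
    return conversions / users


# Hypothetical treatment-group counts, split by user tenure.
treatment = {
    "first_time": {"users": 4000, "conversions": 480},
    "existing": {"users": 6000, "conversions": 900},
}

rates = {
    segment: conversion_rate(counts["conversions"], counts["users"])
    for segment, counts in treatment.items()
}
# Positive gap: existing users react more strongly (possible novelty effect).
# Negative gap: existing users react less (possible primacy effect).
gap = rates["existing"] - rates["first_time"]
print(rates, round(gap, 4))
```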

Interference between variants

Interference between control and treatment groups can also lead to unreliable results.

Typically we split users into control and treatment groups by random selection. In the ideal scenario each user is independent, and we expect no interference between the control and treatment groups.
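
One way to implement a random user-level split is to hash the user ID, so that the same user always lands in the same group. This is only a sketch: the hashing scheme, the function name and the experiment name are assumptions, not something the source describes.

```python
import hashlib


def assign_group(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16 ** 8
    return "treatment" if bucket < treatment_share else "control"


groups = [assign_group(uid, "new_checkout") for uid in range(10000)]
print(groups.count("treatment") / len(groups))  # close to 0.5
```

Including the experiment name in the hash key keeps assignments in different experiments from lining up with each other. A split like this does nothing to prevent interference; it only covers the ideal, independent-user case.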

Dealing with interference

Surge price: dynamic pricing.
Long time to take effect: a referral program. It can take some time for users to refer their friends.

https://arxiv.org/pdf/1903.08755.pdf

How to Estimate Sample Size in A/B Tests

https://www.youtube.com/watch?v=JEAsoUrX6KQ

Type I error is a false positive conclusion.
Type II error is a false negative conclusion.
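
Both error rates can be computed for a given design. The sketch below assumes a two-sided z-test on means with equal group sizes and a known standard deviation; the function name and the numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def rejection_probability(delta, sigma, n_per_group, alpha=0.05):
    """Probability that a two-sided z-test rejects H0 when the true difference is delta."""
    z = NormalDist()
    se = sigma * sqrt(2 / n_per_group)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# No true effect: any rejection is a false positive (Type I error).
type_1 = rejection_probability(delta=0.0, sigma=1.0, n_per_group=1570)
# True effect of 0.1: a rejection is correct, a non-rejection is a Type II error.
power = rejection_probability(delta=0.1, sigma=1.0, n_per_group=1570)
type_2 = 1 - power
print(round(type_1, 3), round(power, 3), round(type_2, 3))
```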

These two situations give the same result.

A Summary of Udacity A/B Testing Course

https://towardsdatascience.com/a-summary-of-udacity-a-b-testing-course-9ecc32dedbb1

Can we test everything?

Change aversion and the novelty effect raise two issues:
(1) What is the baseline for your comparison?
(2) How much time do your users need to adapt to the new experience, so that you can measure the plateaued experience and make a robust decision?

How to do an A/B test?

Choose and characterize metrics to evaluate your experiments, i.e. what do you care about, how do you want to measure the effect
Choose the significance level (alpha), the statistical power (1-beta) and the practical significance level, i.e. the minimum effect at which you would really want to launch the change if the test is statistically significant
Calculate required sample size
Take sample for control/treatment groups and run the test
Analyze the results and draw valid conclusions
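
For the analysis step, a conversion-rate metric is often checked with a two-proportion z-test. The notes do not prescribe a specific test, so this is a sketch under that assumption, with made-up counts.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value


# Hypothetical counts: 10.0% conversion in control, 11.0% in treatment.
diff, z, p_value = two_proportion_ztest(conv_c=1000, n_c=10000, conv_t=1100, n_t=10000)
alpha = 0.05
print(round(diff, 4), round(z, 2), p_value < alpha)
```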

Step 1: Choose and characterize metrics for both sanity check and evaluation

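Check each metric for sensitivity and robustness: a good metric picks up the changes you care about and does not move when nothing of interest changes.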

Step 2: Choose significance level, statistical power and practical significance level

You may not want to launch a change even if the test is statistically significant, because you also need to consider its business impact: whether it is worthwhile to launch given the engineering cost, customer support or sales issues, and opportunity costs.
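
One way to make this concrete is to compare the confidence interval of the observed difference with the practical significance level. The three-way rule below is a simplification and an assumption, not the course's exact decision procedure; `d_min` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def launch_decision(conv_c, n_c, conv_t, n_t, d_min, alpha=0.05):
    """Compare the CI of the rate difference with the practical significance level d_min."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error, used for the confidence interval.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    low, high = diff - margin, diff + margin
    if low >= d_min:
        return "launch", (low, high)          # whole CI is above d_min
    if high < d_min:
        return "do not launch", (low, high)   # whole CI is below d_min
    return "needs judgment", (low, high)      # CI straddles d_min


# Statistically significant (CI excludes 0) but not clearly above d_min.
print(launch_decision(1000, 10000, 1100, 10000, d_min=0.005))
# Same rates with 10x the traffic: the CI now sits above d_min.
print(launch_decision(10000, 100000, 11000, 100000, d_min=0.005))
```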

Step 3: Calculate required sample size

Step 4: Take sample for control/treatment groups and run the test