An introduction to the statistics behind A/B testing, with hands-on ANOVA test code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
from math import ceil
sns.set_theme(style='dark')
sns.set(rc={'figure.figsize':(20,15)})
data_ab = pd.read_csv('data/ab_test.csv')
data_ab.head()
 | id | time | con_treat | page | converted |
---|---|---|---|---|---|
0 | 851104 | 11:48.6 | control | old_page | 0 |
1 | 804228 | 01:45.2 | control | old_page | 0 |
2 | 661590 | 55:06.2 | treatment | new_page | 0 |
3 | 853541 | 28:03.1 | treatment | new_page | 0 |
4 | 864975 | 52:26.2 | control | old_page | 1 |
EDA
Check that users in the treatment group only received the new page, and check whether duplicate records exist.
data_ab.columns = ["user_id", "timestamp", "group", "landing_page", "converted"]
# num of rows and unique user
print('row number:{0}'.format(data_ab.shape[0]))
print('unique user:{0}'.format(data_ab['user_id'].nunique()))
row number:294478
unique user:290584
data_ab.head()
 | user_id | timestamp | group | landing_page | converted |
---|---|---|---|---|---|
0 | 851104 | 11:48.6 | control | old_page | 0 |
1 | 804228 | 01:45.2 | control | old_page | 0 |
2 | 661590 | 55:06.2 | treatment | new_page | 0 |
3 | 853541 | 28:03.1 | treatment | new_page | 0 |
4 | 864975 | 52:26.2 | control | old_page | 1 |
# check whether mismatch exists
data_ab.groupby(by=['group'],as_index=False).agg({'landing_page':pd.Series.nunique})
 | group | landing_page |
---|---|---|
0 | control | 2 |
1 | treatment | 2 |
pd.crosstab(data_ab['group'],data_ab['landing_page'],margins=True)
landing_page | new_page | old_page | All |
---|---|---|---|
group | |||
control | 1928 | 145274 | 147202 |
treatment | 145311 | 1965 | 147276 |
All | 147239 | 147239 | 294478 |
# check for mismatch
n_treat = data_ab[data_ab['group']=='treatment'].shape[0]
n_newpage = data_ab[data_ab['landing_page']=='new_page'].shape[0]
diff = n_treat - n_newpage
pd.DataFrame({'n treatment':[n_treat],
'n newPage':[n_newpage],
'diff':diff})
 | n treatment | n newPage | diff |
---|---|---|---|
0 | 147276 | 147239 | 37 |
page_diff = pd.crosstab(index=data_ab['group'],columns=data_ab['landing_page'])
page_diff.plot.bar()
# clean up the dataset, so that group=treatment -> page= new
df = data_ab[((data_ab['group']=='treatment')&(data_ab['landing_page']=='new_page'))
             |((data_ab['group']=='control')&(data_ab['landing_page']=='old_page'))].copy()  # .copy() avoids chained-assignment warnings later
df.shape[0]
290585
df[df.duplicated(subset=['user_id'])]
 | user_id | timestamp | group | landing_page | converted |
---|---|---|---|---|---|
2893 | 773192 | 55:59.6 | treatment | new_page | 0 |
df = df.drop_duplicates('user_id',keep='first')
AB TEST
- post-campaign review
- calculate the minimum sample size before launching the campaign
df['user_id'] = df['user_id'].astype(str)
control_cvt = df[df['group']=='control']['converted'].mean()
print('control_cvt:%.4f'%control_cvt)
control_cvt:0.1204
treat_cvt= df[df['group']=='treatment']['converted'].mean()
print('treat_cvt:%.4f'%treat_cvt)
treat_cvt:0.1188
n_treat = df[df['group']=='treatment']['user_id'].count()
n_control = df[df['group']=='control']['user_id'].count()
n_treat/n_control
1.0002478075911725
n_control
145274
Test on minimum sample size

effect\_size = \frac{\delta}{\text{std}}
After running a hypothesis test: if p < 5%, also compute the effect size; if p > 5%, compute the power.

Effect size: a quantitative measure of the magnitude of the difference between samples (or the strength of a relationship). The p-value tells you whether a result is statistically significant; the effect size tells you how large the difference actually is, i.e. its practical significance. A result can be statistically significant yet have a tiny effect size. Rule of thumb: ES < 0.2 is a small effect, 0.2–0.5 a medium effect, > 0.5 a large effect.
effect_size = sms.proportion_effectsize(control_cvt,treat_cvt)
print('effect_size:%.4f'%effect_size)
# proportion_effectsize computes the effect size as Cohen's h
effect_size:0.0049
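`proportion_effectsize` implements Cohen's h, the difference of arcsine-transformed proportions. As a sanity check, it can be computed by hand; this sketch uses the rounded rates printed above, so the result is close to (but not exactly) the 0.0049 from the exact rates:

```python
import numpy as np

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# rounded rates from above; the exact notebook rates differ slightly
h = cohens_h(0.1204, 0.1188)
print(h)
```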
import statsmodels.stats.power as smp
power_analysis = smp.TTestIndPower()
# given the sample size, power, and significance level, solve for the effect size
effect_size2 = power_analysis.solve_power(effect_size=None,power=0.8,alpha=0.05,nobs1=n_treat)
effect_size and effect_size2 mean different things: effect_size is the effect implied by the baseline and the observed improvement, with no sample size fixed (implicitly infinite); effect_size2 is the minimum detectable effect given the sample size.
required_n = sms.NormalIndPower().solve_power(effect_size,power=0.8,alpha=0.05,ratio=n_treat/n_control)
required_n = ceil(required_n)
print('required_n:%d'%required_n)
required_n:663492
import statsmodels.stats.power as smp
power_analysis = smp.TTestIndPower()
n_required = power_analysis.solve_power(effect_size = effect_size,power=0.8,alpha=0.05,ratio=n_treat/n_control)
print('n_required:%.4f'%n_required)
n_required:663492.3208
So with a control conversion rate of 0.1204 and a treatment rate of 0.1188, at least 663,492 samples per group are needed, but the experiment only has about 145,274 per group, so an effect this small cannot be detected.
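The `NormalIndPower` result can be cross-checked against the closed-form sample size for a two-sample z-test with equal allocation, n per group = 2(z₁₋α/₂ + z₁₋β)² / h². A sketch using the rounded rates above; the small gap from 663,492 comes from rounding the conversion rates:

```python
import numpy as np
from scipy import stats as st

# rounded conversion rates as printed above
p_control, p_treat = 0.1204, 0.1188
h = abs(2 * np.arcsin(np.sqrt(p_treat)) - 2 * np.arcsin(np.sqrt(p_control)))  # Cohen's h

alpha, power = 0.05, 0.8
z_alpha = st.norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
z_beta = st.norm.ppf(power)           # 0.84 for 80% power
n_per_group = 2 * (z_alpha + z_beta) ** 2 / h ** 2
print(int(np.ceil(n_per_group)))
```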
Test whether the experiment achieved its goal: is the difference significant?
Run the hypothesis test with both a z-test and a t-test.
convert_old = df[(df["converted"] == 1) & (df["landing_page"] == "old_page")]['user_id'].nunique()
convert_new = df[(df["converted"] == 1) & (df["landing_page"] == "new_page")]['user_id'].nunique()
n_old = df[df["landing_page"] == "old_page"]['user_id'].nunique()
n_new = df[df["landing_page"] == "new_page"]['user_id'].nunique()
n_old
145274
convert_old
17489
Method 1: hypothesis test via z-test
z_score,p_value = sm.stats.proportions_ztest(np.array([convert_new,convert_old]),np.array([n_new,n_old]), alternative = 'larger')
p_value
0.9050583127590245
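`proportions_ztest` pools the conversion rate across both groups under H0. The same statistic can be computed by hand; note that `convert_new` is not printed above, so 17264 below is an illustrative count consistent with the ~0.1188 treatment rate, not a value from the notebook:

```python
import numpy as np
from scipy import stats as st

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sample z-test for proportions, pooling the rate under H0."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, st.norm.sf(z)  # one-sided p-value, alternative='larger'

# convert_new = 17264 is assumed; convert_old and the group sizes are from above
z, p = two_prop_ztest(17264, 145310, 17489, 145274)
print(z, p)
```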
Method 2: hypothesis test via t-test
When the sample size is > 30, a z-test works just as well.
from scipy import stats as st
new_gr = df.loc[df['landing_page']=="new_page",'converted'].to_numpy()
old_gr = df.loc[df['landing_page']=="old_page",'converted'].to_numpy()
st.ttest_ind(new_gr,old_gr)
Ttest_indResult(statistic=-1.3109235634981506, pvalue=0.18988462498742617)
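Note that `st.ttest_ind` defaults to `equal_var=True` (pooled variance), while the manual calculation in method 3 uses Welch's unequal-variance form, available via `equal_var=False`. With group sizes as close as here the two essentially coincide. A minimal sketch on synthetic 0/1 data (not the notebook data):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
a = rng.binomial(1, 0.1188, size=10_000)  # synthetic "new page" conversions
b = rng.binomial(1, 0.1204, size=10_000)  # synthetic "old page" conversions

pooled = st.ttest_ind(a, b)                  # default equal_var=True
welch = st.ttest_ind(a, b, equal_var=False)  # Welch's unequal-variance form
print(pooled.pvalue, welch.pvalue)
```

With equal group sizes the two statistics are mathematically identical; only the degrees of freedom (and hence the p-values, very slightly) differ.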
Method 3: t-test by hand
var_old = control_cvt*(1-control_cvt)
var_new = treat_cvt*(1-treat_cvt)
p_delta = treat_cvt - control_cvt
print(p_delta)
-0.0015782389853555567
pooled_se = np.sqrt(var_new/n_new + var_old/n_old)  # unpooled (Welch) standard error
t_sts = p_delta/pooled_se
# Welch-Satterthwaite degrees of freedom
dof = (var_new/n_new + var_old/n_old)**2/((var_new/n_new)**2/(n_new-1) + (var_old/n_old)**2/(n_old-1))
dof
290571.7142957336
pvalue = 2*st.t.cdf(-abs(t_sts),dof)
print(pvalue)
0.18988341360095998
The hypothesis tests also show that the null hypothesis, that the control and treatment groups do not differ, cannot be rejected.
Compute confidence intervals
(low_treat,up_treat) = st.norm.interval(0.95,loc=treat_cvt,scale=np.sqrt(var_new/n_new))
(low_control,up_control) = st.norm.interval(0.95,loc=control_cvt,scale=np.sqrt(var_old/n_old))
(lower_con, lower_treat), (upper_con, upper_treat) = sm.stats.proportion_confint(np.array([convert_new,convert_old]),np.array([n_new,n_old]),alpha=0.05)
Computing the intervals with both the scipy and statsmodels methods, the control and treatment intervals overlap, again indicating that the null hypothesis cannot be rejected.
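Checking per-group intervals for overlap is a conservative criterion; the more direct check is a confidence interval on the difference in proportions, which should straddle zero here. A sketch using the rounded rates from above, with n_new inferred as the cleaned row count (290,584) minus n_control:

```python
import numpy as np
from scipy import stats as st

# rates as printed above; n_new inferred as 290584 - 145274
p_new, n_new = 0.1188, 145310
p_old, n_old = 0.1204, 145274

se_diff = np.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
z = st.norm.ppf(0.975)               # 95% two-sided critical value
diff = p_new - p_old
low, high = diff - z * se_diff, diff + z * se_diff
print(low, high)  # the interval contains 0
```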
ANOVA TEST
Test whether the time range affects conversion, splitting the first timestamp field into buckets 0-20, 20-40, and 40-60.
df['hour'] = [int(i.split(':')[0]) for i in df['timestamp']]  # first field of the timestamp
df['time_range'] = pd.cut(df['hour'],bins=[0,20,40,60],include_lowest=True,labels=['0-20','20-40','40-60'])
df.head()
 | user_id | timestamp | group | landing_page | converted | hour | time_range |
---|---|---|---|---|---|---|---|
0 | 851104 | 11:48.6 | control | old_page | 0 | 11 | 0-20 |
1 | 804228 | 01:45.2 | control | old_page | 0 | 1 | 0-20 |
2 | 661590 | 55:06.2 | treatment | new_page | 0 | 55 | 40-60 |
3 | 853541 | 28:03.1 | treatment | new_page | 0 | 28 | 20-40 |
4 | 864975 | 52:26.2 | control | old_page | 1 | 52 | 40-60 |
Method 1
fvalue,pvl = stats.f_oneway(df[df['time_range']=='0-20']['converted'],df[df['time_range']=='20-40']['converted'],df[df['time_range']=='40-60']['converted'])
print('fvalue:{0} pvalue:{1}'.format(fvalue,pvl))
fvalue:0.6913612944665806 pvalue:0.5008945648250425
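The F statistic from `f_oneway` is the ratio of the between-group mean square to the within-group mean square. A minimal sketch verifying this by hand on small synthetic 0/1 groups (not the notebook data):

```python
import numpy as np
from scipy import stats as st

groups = [np.array([0, 1, 0, 0, 1]),
          np.array([0, 0, 1, 0]),
          np.array([1, 1, 0, 0, 0, 1])]
grand = np.concatenate(groups).mean()
k, n = len(groups), sum(len(g) for g in groups)

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, _ = st.f_oneway(*groups)
print(f_manual, f_scipy)
```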
from statsmodels.formula.api import ols
df_melt = pd.melt(df,id_vars=['time_range'],value_vars=['converted'])  # long-format view (the ols call below uses df directly)
df_melt.head()
 | time_range | variable | value |
---|---|---|---|
0 | 0-20 | converted | 0 |
1 | 0-20 | converted | 0 |
2 | 40-60 | converted | 0 |
3 | 20-40 | converted | 0 |
4 | 40-60 | converted | 1 |
Method 2
model = ols('converted~C(time_range)',data=df[['time_range','converted']]).fit()
anova_table = sm.stats.anova_lm(model,typ=2)
anova_table
 | sum_sq | df | F | PR(>F) |
---|---|---|---|---|
C(time_range) | 0.145593 | 2.0 | 0.691361 | 0.500895 |
Residual | 30596.496834 | 290581.0 | NaN | NaN |
Since p > 0.05, under this bucketing the time range has no significant effect on conversion.
Demo: two-way ANOVA test
model_twoway = ols('converted~C(time_range) + C(landing_page) + C(time_range):C(landing_page)',data=df[['time_range','converted','landing_page']]).fit()
anova_table_twoway = sm.stats.anova_lm(model_twoway,typ=2)
anova_table_twoway
 | sum_sq | df | F | PR(>F) |
---|---|---|---|---|
C(time_range) | 0.145309 | 2.0 | 0.690017 | 0.501568 |
C(landing_page) | 0.180666 | 1.0 | 1.715825 | 0.190232 |
C(time_range):C(landing_page) | 0.184478 | 2.0 | 0.876015 | 0.416440 |
Residual | 30596.131690 | 290578.0 | NaN | NaN |
The results show that neither the time range, the landing page, nor their interaction has a significant effect on conversion.