Web Page Click-Through Rate A/B Test

NeverthelessEnd

Tags: python, data analysis

Article link: https://blog.csdn.net/sinat_41177607/article/details/106563758

	user_id	timestamp	group	landing_page	converted
0	851104	2017-01-21 22:11:48.556739	control	old_page	0
1	804228	2017-01-12 08:01:45.159739	control	old_page	0
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0
4	864975	2017-01-21 01:52:26.210827	control	old_page	1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB

c. Count the users in the dataset (the number of distinct user_id values).

df['user_id'].nunique()

d. Proportion of users who converted.

df.query('converted == 1').user_id.nunique()/df['user_id'].nunique()

0.12104245244060237

e. Count the rows where new_page and treatment do not line up. The control group should see old_page, and the treatment group should see new_page.

df.groupby(['group', 'landing_page']).count()

		user_id	timestamp	converted
group	landing_page
control	new_page	1928	1928	1928
control	old_page	145274	145274	145274
treatment	new_page	145311	145311	145311
treatment	old_page	1965	1965	1965

df[((df.group == 'treatment') == (df.landing_page == 'new_page')) == False].shape[0]
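The expression above works because a row is consistent exactly when the two boolean conditions agree. A minimal sketch of the same idea on a made-up four-row frame (the `toy` data is hypothetical and is not taken from ab_data.csv):

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 are mismatched on purpose.
toy = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control"],
    "landing_page": ["new_page", "old_page", "old_page", "new_page"],
})

# A row is consistent when "is treatment" and "is new_page" agree,
# so a mismatch is simply the rows where the two flags differ.
is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"
mismatch = is_treatment != is_new_page

n_mismatch = int(mismatch.sum())
print(n_mismatch)
```

On the real data the same count should equal the sum of the two off-diagonal cells in the groupby table above, 1928 + 1965 = 3893.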

f. Do any rows have missing values?

df.isnull().any()

user_id         False
timestamp       False
group           False
landing_page    False
converted       False
dtype: bool

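2. For rows where treatment is not paired with new_page, or control is not paired with old_page, we cannot tell whether the user actually received the new page or the old page. How should these rows be handled?

a. Drop the rows where treatment is not paired with new_page or control is not paired with old_page.

# Drop rows where treatment is not paired with new_page
df1 = df.drop(df[(df.group == 'treatment') & (df.landing_page != 'new_page')].index)
# Drop rows where control is not paired with old_page
df2 = df1.drop(df1[(df1.group == 'control') & (df1.landing_page != 'old_page')].index)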

df2.groupby(['group', 'landing_page']).count()

		user_id	timestamp	converted
group	landing_page
control	old_page	145274	145274	145274
treatment	new_page	145311	145311	145311

# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

3. Remove duplicates

a. How many users (distinct user_id values) are in df2?

df2['user_id'].nunique()

b. df2 contains one repeated user_id. Which one is it?

df_dup = df2.user_id.duplicated()
df2[df_dup]

	user_id	timestamp	group	landing_page	converted
2893	773192	2017-01-14 02:55:59.590927	treatment	new_page	0

c. What are the rows for this repeated user_id?

df2[df2['user_id'] == 773192]

	user_id	timestamp	group	landing_page	converted
1899	773192	2017-01-09 05:37:58.781806	treatment	new_page	0
2893	773192	2017-01-14 02:55:59.590927	treatment	new_page	0

d. Drop one of the duplicate rows, still storing the dataframe as df2.

df2 = df2.drop(1899, axis = 0)

# Check that the row was dropped
df2[df2['user_id'] == 773192]

	user_id	timestamp	group	landing_page	converted
2893	773192	2017-01-14 02:55:59.590927	treatment	new_page	0
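Dropping by a hard-coded index label works here because there is exactly one known duplicate. A more general sketch, assuming a hypothetical `toy` frame, uses `drop_duplicates`; `keep="last"` mirrors the choice above of keeping the later row (2893) and discarding the earlier one (1899):

```python
import pandas as pd

# Hypothetical toy data with one repeated user_id (rows 1 and 2).
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "converted": [0, 0, 0, 1],
})

# Keep only the last row seen for each user_id.
deduped = toy.drop_duplicates(subset="user_id", keep="last")
print(deduped)
```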

4. Look at the conversion rates in the dataset.

a. What is the overall probability that a user converts, regardless of page?

total_p = df2.converted.mean()
total_p

0.11959708724499628

b. What is the conversion rate for users in the control group?

df_con = df2.query('group == "control"')
con_p = df_con.converted.mean()
con_p

0.1203863045004612

c. What is the conversion rate for users in the treatment group?

df_tre = df2.query('group == "treatment"')
tre_p = df_tre.converted.mean()
tre_p

0.11880806551510564

obs_diff = tre_p - con_p
obs_diff

-0.0015782389853555567

d. What is the probability that a user received the new page?

new_page = df2.query('landing_page == "new_page"').user_id.nunique()/df2['user_id'].nunique()
new_page

0.5000619442226688

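e. Is there enough evidence so far to show that either the old page or the new page leads to a higher conversion rate?

The gap between the two conversion rates is very small (about 0.16 percentage points), so these descriptive figures alone give no significant evidence that either page converts better.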

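II - A/B Test

Because every event carries a timestamp, it is technically possible to rerun the hypothesis test continuously as each observation arrives.

The hard part is deciding when to stop the experiment: as soon as one group looks good enough, or only after that result has persisted for some time? And how long must the test run before you can conclude that the two pages produce no significant difference in conversion?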
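One way to see why the stopping rule matters is to simulate an experiment in which both pages truly convert at the same rate, and compare "test once at the end" with "test at every look and stop at the first significant result". The sketch below is illustrative only: the sample sizes, the look schedule, the seed, and the 0.12 rate are arbitrary assumptions, not values from the article's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Assumed settings: both groups share the same true conversion rate.
n_sims, n_per_group, p = 500, 2000, 0.12
looks = np.arange(200, n_per_group + 1, 200)  # peek every 200 users per group
z_crit = 1.645                                # one-sided 5% critical value

peek_rejects = 0
final_rejects = 0
for _ in range(n_sims):
    old = rng.random(n_per_group) < p
    new = rng.random(n_per_group) < p
    rejected_at = []
    for n in looks:
        # Pooled two-proportion z statistic on the first n users of each group.
        pooled = (old[:n].sum() + new[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        z = (new[:n].mean() - old[:n].mean()) / se
        rejected_at.append(bool(z > z_crit))
    peek_rejects += any(rejected_at)   # stopped early at any look
    final_rejects += rejected_at[-1]   # tested only once, at the end

peek_rate = peek_rejects / n_sims
final_rate = final_rejects / n_sims
print(peek_rate, final_rate)
```

Since the final look is one of the looks, the peeking strategy can never reject less often than the single final test; any excess is false positives introduced purely by checking repeatedly.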

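1. Make the decision from the data: assume the old page is at least as good, unless the new page is shown to be better at a Type I error rate of 5%.

Null hypothesis: $p_{new} - p_{old} \le 0$

Alternative hypothesis: $p_{new} - p_{old} > 0$

2. Under the null hypothesis, assume $p_{new}$ and $p_{old}$ are equal, and that both equal the overall conversion rate (converted) in ab_data.csv.

Simulate the sampling distribution of the difference in conversion rates between the new and old pages, using the same sample size for each page as in ab_data.csv, and repeat the simulation 10,000 times under the null hypothesis.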

a. Under the null hypothesis, what is the conversion rate $p_{new}$?

p_new = df2.converted.mean()
p_new

0.11959708724499628

b. Under the null hypothesis, what is the conversion rate $p_{old}$?

p_old = df2.converted.mean()
p_old

0.11959708724499628

c. What is $n_{new}$?

n_new = df2.query('landing_page=="new_page"').shape[0]
n_new

d. What is $n_{old}$?

n_old = df2.query('landing_page=="old_page"').shape[0]
n_old

e. Under the null hypothesis, use $p_{new}$ (the new-page conversion rate) to simulate $n_{new}$ new-page conversions, and store the resulting $n_{new}$ ones and zeros in new_page_converted.

new_page_converted = np.random.choice(2, size = n_new, p=[1-p_new,p_new])
new_page_converted

array([0, 0, 0, ..., 0, 0, 0])

f. Under the null hypothesis, use $p_{old}$ (the old-page conversion rate) to simulate $n_{old}$ old-page conversions, and store the resulting $n_{old}$ ones and zeros in old_page_converted.

old_page_converted=np.random.choice(2,size=n_old,p=[1-p_old,p_old])
old_page_converted

array([0, 1, 0, ..., 0, 0, 0])

g. From e. and f., compute the difference $p_{new} - p_{old}$.

diff = new_page_converted.mean() - old_page_converted.mean()
diff

-0.00028411903292373253

h. A single value cannot form a distribution, so repeat steps a. through g. to simulate 10,000 values of $p_{new} - p_{old}$ and store them in p_diffs.

p_diffs = []

for i in range(10000):
    p_new_diff = np.random.choice(2,size=n_new,p=[1-p_new,p_new]).mean()
    p_old_diff = np.random.choice(2,size=n_old,p=[1-p_old,p_old]).mean()
    p_diffs.append(p_new_diff - p_old_diff)
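The loop above draws every individual 0/1 outcome, which is slow for roughly 145,000 users per page. A faster sketch, swapping in a binomial draw for the per-user sampling, is shown below; it assumes the null conversion rate and sample sizes reported in this article, and the seed is an arbitrary choice.

```python
import numpy as np

# Assumed inputs: values reported in this article for df2.
p_null = 0.11959708724499628
n_new, n_old = 145310, 145274
obs_diff = -0.0015782389853555567

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Each binomial draw is the number of conversions among n visitors,
# so dividing by n gives one simulated conversion rate.
new_rates = rng.binomial(n_new, p_null, size=10_000) / n_new
old_rates = rng.binomial(n_old, p_null, size=10_000) / n_old
p_diffs = new_rates - old_rates

# Share of simulated differences above the observed difference.
p_value = (p_diffs > obs_diff).mean()
print(p_value)
```

The sum of n independent 0/1 draws with success probability p follows a Binomial(n, p) distribution, so this samples the same null distribution as the loop without materialising each user.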

i. Plot the distribution of p_diffs.

plt.hist(p_diffs)
plt.axvline(x = obs_diff, color = 'r')

[Figure: histogram of p_diffs, with the observed difference obs_diff marked by a red vertical line]

j. What proportion of the values in p_diffs is greater than the actual difference in conversion rates observed in ab_data.csv?

p_diffs = np.array(p_diffs)
(p_diffs > obs_diff).mean()

0.90290000000000004

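k. Explain in words what you just computed in j. What is this value called in statistics? Based on it, decide whether the new and old pages differ significantly in conversion rate.

The value is the p-value: assuming the null hypothesis is true, it is the probability of observing a difference in conversion rates at least as large as obs_diff, that is, at least as extreme in the direction of the alternative hypothesis.

The p-value is about 0.90, far above the 5% Type I error threshold, so we fail to reject the null hypothesis: the data give no evidence that the new page converts better than the old page.

l. A built-in routine can produce a similar result. The built-in is often quicker, but the simulation above still matters because it builds sound statistical thinking.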

import statsmodels.api as sm
# Number of conversions on the old page
convert_old = df2[(df2.landing_page=='old_page') & (df2.converted == 1)].shape[0]
# Number of conversions on the new page
convert_new = df2[(df2.landing_page=='new_page') & (df2.converted == 1)].shape[0]
n_old = df2.query('landing_page=="old_page"').shape[0]  # users who saw the old page
n_new = df2.query('landing_page=="new_page"').shape[0]  # users who saw the new page

convert_old,convert_new,n_old ,n_new

(17489, 17264, 145274, 145310)

m. Use stats.proportions_ztest to compute the z-score and p-value.

z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
z_score, p_value

(-1.3109241984234394, 0.90505831275902449)

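n. The p-value is about 0.91, well above 0.05, so we fail to reject the null hypothesis; there is no evidence that the new page converts better than the old page. This agrees with the results in j. and k.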
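The z-score and p-value can also be checked by hand with the pooled two-proportion z-test, using only the four counts reported above. This is a sketch of the textbook formula rather than a copy of the statsmodels internals:

```python
import math

# Counts reported above: conversions and users per page.
convert_new, n_new = 17264, 145310
convert_old, n_old = 17489, 145274

p_new_hat = convert_new / n_new
p_old_hat = convert_old / n_old

# Pooled rate under the null hypothesis p_new = p_old.
p_pool = (convert_new + convert_old) / (n_new + n_old)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))

z = (p_new_hat - p_old_hat) / se

# One-sided upper tail P(Z > z) for the standard normal distribution.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 4), round(p_value, 4))
```

Both numbers agree with the output of proportions_ztest above to the precision shown, which confirms that the built-in call used the pooled variance and the upper-tail alternative.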

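III - A Regression Approach

1. The result obtained from the A/B test can also be obtained by fitting a regression.

a. Because each row is either a conversion or a non-conversion, logistic regression is the appropriate model.

b. The goal is to use statsmodels to fit the regression model specified in a. and check whether the page a user receives makes a significant difference to conversion.
First create a column for the intercept and a dummy-variable column for the page each user received: add an intercept column and an ab_page column that is 1 when the user received the treatment and 0 for control.

# Add an intercept column and an ab_page column
df2['intercept'] = 1
# Build the dummies from df2 itself (not the unfiltered df); dtype=int gives 0/1 instead of booleans
df2[['control', 'ab_page']] = pd.get_dummies(df2['group'], dtype=int)
df2 = df2.drop('control', axis = 1)
df2.head()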

	user_id	timestamp	group	landing_page	converted	intercept	ab_page
0	851104	2017-01-21 22:11:48.556739	control	old_page	0	1	0
1	804228	2017-01-12 08:01:45.159739	control	old_page	0	1	0
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1
4	864975	2017-01-21 01:52:26.210827	control	old_page	1	1	0

c. Import the regression model from statsmodels, instantiate it, and fit it with the two columns created in b. to predict whether a user converts.

logit_mod = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
result = logit_mod.fit()

Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6

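d. Provide the model summary below and use it as needed to answer the following questions.

from scipy import stats
# Compatibility shim: older statsmodels releases call stats.chisqprob, which newer SciPy versions no longer provide
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)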
result.summary()

Logit Regression Results
Dep. Variable:	converted	No. Observations:	290584
Model:	Logit	Df Residuals:	290582
Method:	MLE	Df Model:	1
Date:	Fri, 15 May 2020	Pseudo R-squ.:	8.077e-06
Time:	14:08:58	Log-Likelihood:	-1.0639e+05
converged:	True	LL-Null:	-1.0639e+05
		LLR p-value:	0.1899

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-1.9888	0.008	-246.669	0.000	-2.005	-1.973
ab_page	-0.0150	0.011	-1.311	0.190	-0.037	0.007

1/np.exp(-0.015)

1.0151130646157189

The odds of conversion on the old page are about 1.0151 times the odds on the new page.
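With an intercept and a single 0/1 predictor, the fitted logit coefficients can be recovered directly from the counts: the intercept is the log-odds of conversion on the old page, and the ab_page coefficient is the log of the odds ratio. A short check, assuming the conversion counts reported in Part II:

```python
import math

# Counts reported in Part II: conversions and users per page.
convert_new, n_new = 17264, 145310
convert_old, n_old = 17489, 145274

# Odds = converted / not converted.
odds_new = convert_new / (n_new - convert_new)
odds_old = convert_old / (n_old - convert_old)

intercept = math.log(odds_old)                # log-odds for control (ab_page = 0)
ab_page_coef = math.log(odds_new / odds_old)  # log odds ratio, new vs. old
old_vs_new = odds_old / odds_new              # same quantity as 1/np.exp(coef)

print(round(intercept, 4), round(ab_page_coef, 4), round(old_vs_new, 4))
```

These values match the summary table (-1.9888 and -0.0150) and the 1.0151 ratio, which makes clear that the ratio compares odds of conversion, not raw click counts.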

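e. What is the p-value associated with ab_page? Why does it differ from the value found in Part II?

Hint: what are the null and alternative hypotheses of the regression model, and how do they compare with those in Part II?

The p-value associated with ab_page is 0.19. Since it exceeds 0.05, the ab_page coefficient is not statistically significant: the page a user received does not help predict conversion.

Regression null hypothesis: $p_{new} - p_{old} = 0$
Regression alternative hypothesis: $p_{new} - p_{old} \ne 0$

The two p-values differ because the tests differ. The regression tests the two-sided hypothesis that the ab_page coefficient is zero, that is, whether the page is associated with conversion in either direction; it is not the probability that $p_{new} = p_{old}$.
The hypothesis test in Part II is one-sided: its p-value is the probability, when $p_{new} - p_{old} \le 0$ holds, of a difference at least as large as the one observed. Because the observed z is negative, the two values are linked by $1 - 0.190/2 \approx 0.905$.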
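The link between the two p-values can be verified from the z statistic alone. The sketch below takes z = -1.311 from the ab_page row of the summary table and uses the standard library's normal distribution:

```python
from statistics import NormalDist

z = -1.311  # z statistic reported for ab_page in the summary table
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided test (regression): extreme in either direction.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

# One-sided test (Part II, alternative "larger"): upper tail only.
p_one_sided_larger = 1 - std_normal.cdf(z)

print(round(p_two_sided, 3), round(p_one_sided_larger, 3))
```

The two-sided value reproduces the regression's 0.190 and the one-sided value reproduces the 0.905 from Part II, so the same data and the same statistic underlie both results.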

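f. What are the benefits and drawbacks of adding more factors to the regression model?

In practice the response variable may depend on several factors, so adding more of them can improve both the analysis and the prediction of the response.
Drawbacks of adding more variables: 1. the relationship between the response and the predictors may not be linear; 2. the errors may be correlated; 3. the variance may be non-constant; 4. outliers or high-leverage points may distort the model; 5. multicollinearity may arise. Each of these can reduce the accuracy of the predictions.

g. Now, besides testing whether conversion differs between pages, add an effect for the country or region where the user lives. Import the countries.csv dataset and merge it onto the matching rows.

Create dummy variables for the country column; three countries need only two dummy columns.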

country_df = pd.read_csv('countries.csv') 
country_df.head()

	user_id	country
0	834778	UK
1	928468	US
2	822059	UK
3	711597	UK
4	710616	UK

# Join df2 with country_df
df3 = pd.merge(df2, country_df, on = 'user_id')
df3.head()

	user_id	timestamp	group	landing_page	converted	intercept	ab_page	country
0	851104	2017-01-21 22:11:48.556739	control	old_page	0	1	0	US
1	804228	2017-01-12 08:01:45.159739	control	old_page	0	1	0	US
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1	US
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1	US
4	864975	2017-01-21 01:52:26.210827	control	old_page	1	1	0	US
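A merge like this silently duplicates rows if user_id repeats in either table. A defensive sketch, using hypothetical toy frames in place of df2 and country_df, asks pandas to check the key relationship with the `validate` argument:

```python
import pandas as pd

# Hypothetical toy frames standing in for df2 and country_df.
visits = pd.DataFrame({"user_id": [1, 2, 3], "converted": [0, 1, 0]})
countries = pd.DataFrame({"user_id": [3, 1, 2], "country": ["UK", "US", "CA"]})

# validate="one_to_one" raises an error if user_id repeats on either side;
# how="inner" keeps only users present in both frames.
merged = pd.merge(visits, countries, on="user_id", how="inner", validate="one_to_one")
print(merged)
```

Comparing `len(merged)` with the length of the left frame afterwards is a quick way to confirm that no users were lost or duplicated by the join.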

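# Create the dummy variables (columns come out in alphabetical order: CA, UK, US)
df3[['CA', 'UK', 'US']] = pd.get_dummies(df3['country'], dtype=int)
# Drop US so that it serves as the baseline category
df3 = df3.drop('US', axis = 1)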
df3.head(10)

	user_id	timestamp	group	landing_page	converted	intercept	ab_page	country	CA	UK
0	851104	2017-01-21 22:11:48.556739	control	old_page	0	1	0	US	0	0
1	804228	2017-01-12 08:01:45.159739	control	old_page	0	1	0	US	0	0
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1	US	0	0
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1	US	0	0
4	864975	2017-01-21 01:52:26.210827	control	old_page	1	1	0	US	0	0
5	936923	2017-01-10 15:20:49.083499	control	old_page	0	1	0	US	0	0
6	679687	2017-01-19 03:26:46.940749	treatment	new_page	1	1	1	CA	1	0
7	719014	2017-01-17 01:48:29.539573	control	old_page	0	1	0	US	0	0
8	817355	2017-01-04 17:58:08.979471	treatment	new_page	1	1	1	UK	0	1
9	839785	2017-01-15 18:11:06.610965	treatment	new_page	1	1	1	CA	1	0
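Only two of the three country columns are kept because the third is fully determined by the other two; leaving all three in alongside an intercept would make the design matrix collinear. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical toy data with the three countries used in the article.
toy = pd.DataFrame({"country": ["US", "CA", "UK", "US"]})

# get_dummies sorts the categories, so the columns are CA, UK, US.
dummies = pd.get_dummies(toy["country"], dtype=int)

# Drop US so it becomes the baseline: a row with CA = 0 and UK = 0 is a US user.
toy = toy.join(dummies.drop(columns="US"))
print(toy)
```

The CA and UK coefficients in the fitted model are therefore read as differences relative to US users.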

# Fit the model
logit_mod1=sm.Logit(df3.converted,df3[['intercept','ab_page','CA','UK']])
result1=logit_mod1.fit()
result1.summary()

Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6

Logit Regression Results
Dep. Variable:	converted	No. Observations:	290584
Model:	Logit	Df Residuals:	290580
Method:	MLE	Df Model:	3
Date:	Fri, 15 May 2020	Pseudo R-squ.:	2.323e-05
Time:	14:55:38	Log-Likelihood:	-1.0639e+05
converged:	True	LL-Null:	-1.0639e+05
		LLR p-value:	0.1760

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-1.9893	0.009	-223.763	0.000	-2.007	-1.972
ab_page	-0.0149	0.011	-1.307	0.191	-0.037	0.007
CA	-0.0408	0.027	-1.516	0.130	-0.093	0.012
UK	0.0099	0.013	0.743	0.457	-0.016	0.036

1/np.exp(-0.0408),np.exp(0.0099)

(1.0416437559600236, 1.0099491671175422)

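Judging by the p-values (0.130 for CA and 0.457 for UK), the country terms are not statistically significant; country has little effect on conversion.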

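h. Having looked at the individual effects of country and page on conversion, now examine the interaction between page and country/region to test whether it has a significant effect on conversion. Create the necessary additional columns and fit a new model.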

Hint: the interaction between page and country/region

# Add the interaction terms (new_page must be created before it is used)
df3['new_page'] = df3['ab_page']
df3['new_CA'] = df3['new_page'] * df3['CA']
df3['new_UK'] = df3['new_page'] * df3['UK']
df3.head()

	user_id	timestamp	group	landing_page	converted	intercept	ab_page	country	new_page
0	851104	2017-01-21 22:11:48.556739	control	old_page	0	1	0	US	0
1	804228	2017-01-12 08:01:45.159739	control	old_page	0	1	0	US	0
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1	US	1
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1	US	1
4	864975	2017-01-21 01:52:26.210827	control	old_page	1	1	0	US	0

# Fit the model
logit_mod2=sm.Logit(df3.converted,df3[['intercept','ab_page','CA','UK', 'new_CA', 'new_UK']])
result2=logit_mod2.fit()
result2.summary()

Optimization terminated successfully.
         Current function value: 0.366109
         Iterations 6

Logit Regression Results
Dep. Variable:	converted	No. Observations:	290584
Model:	Logit	Df Residuals:	290578
Method:	MLE	Df Model:	5
Date:	Fri, 15 May 2020	Pseudo R-squ.:	3.482e-05
Time:	15:05:30	Log-Likelihood:	-1.0639e+05
converged:	True	LL-Null:	-1.0639e+05
		LLR p-value:	0.1920

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-1.9865	0.010	-206.344	0.000	-2.005	-1.968
ab_page	-0.0206	0.014	-1.505	0.132	-0.047	0.006
CA	-0.0175	0.038	-0.465	0.642	-0.091	0.056
UK	-0.0057	0.019	-0.306	0.760	-0.043	0.031
new_CA	-0.0469	0.054	-0.872	0.383	-0.152	0.059
new_UK	0.0314	0.027	1.181	0.238	-0.021	0.084

1/np.exp(-0.0057),np.exp(0.0314)

(1.0057162759095335, 1.0318981806179213)

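Every p-value other than the intercept's is above 0.05 (the interaction terms are 0.383 for new_CA and 0.238 for new_UK, with US as the baseline), so none of these effects is significant: the interaction between page and country/region does not have a significant effect on conversion.

Summary

The A/B test gives no evidence that the new page converts better than the old page.

The logistic regression likewise shows no significant change in conversion for the new page.

Adding country to the model shows that the country terms are not statistically significant and have little effect on conversion.

A regression model with page-by-country interaction terms shows that the interaction does not have a significant effect on conversion.

Overall, the new page has little effect on conversion; its observed rate is even slightly below the old page's, although not significantly so. The new page needs further optimization.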
