Context

I have been working in Advertising, specifically in Digital Media and Performance, for nearly 3 years, and customer behaviour analysis is one of the core parts of my day-to-day job. Analytics platforms such as Google Analytics and Adobe Analytics have made this work much easier, since they come with built-in segmentation that analyses user behaviour across dimensions and metrics.

However, despite that convenience, I wanted to use Machine Learning for customer segmentation in a way that scales and can feed into other Data Science work (e.g. A/B Testing). I then came across the dataset provided by Google Analytics for a Kaggle competition and decided to use it for this project.

Feel free to check out the dataset here if you’re keen! Beware that the dataset has several sub-datasets and each has more than 900k rows!

A. Exploratory Data Analysis (EDA)

This always remains an essential step in every Data Science project: it ensures the dataset is clean and properly pre-processed before it is used for modelling.

First of all, let’s import all the necessary libraries and read the csv file:


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df_raw = pd.read_csv("google-analytics.csv")

1. Flatten JSON Fields

The raw dataset is a bit “messy” and hard to digest, since some variables are stored as JSON fields that pack the values of several sub-variables into a single field. For example, the geoNetwork variable groups together sub-variables such as continent, subContinent, etc.

Thanks to the help of a Kaggler, I was able to convert these variables into more digestible ones by flattening those JSON fields:


import os
import json
from pandas import json_normalize

def load_df(csv_path="google-analytics.csv", nrows=None):
    # Columns stored as JSON strings in the raw CSV
    json_columns = ['device', 'geoNetwork', 'totals', 'trafficSource']
    # Parse the JSON columns on read; keep the visitor ID as a string
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in json_columns},
                     dtype={'fullVisitorId': 'str'},
                     nrows=nrows)
    for column in json_columns:
        # Expand each JSON column into one column per sub-variable
        column_converted = json_normalize(df[column])
        column_converted.columns = [f"{column}_{subcolumn}" for subcolumn in column_converted.columns]
        df = df.drop(column, axis=1).merge(column_converted, right_index=True, left_index=True)
    return df

# Load the flattened dataset used in the following steps
df = load_df()

After flattening those JSON fields, we get a much cleaner dataset, with each JSON variable split into sub-variables (e.g. device split into device_browser, device_browserVersion, etc.).
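
To make the flattening step concrete, here is a minimal sketch on a toy table. The two rows and the keys inside the JSON strings are made up for illustration; only the column-prefix convention (device_...) follows the function above.

```python
import json
import pandas as pd

# Toy rows mimicking the raw export: each "device" cell holds a JSON string.
raw = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": [
        '{"browser": "Chrome", "isMobile": false}',
        '{"browser": "Safari", "isMobile": true}',
    ],
})

# Parse the JSON strings into dicts, then expand each key into its own column.
parsed = raw["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())
flat.columns = [f"device_{name}" for name in flat.columns]

# Swap the original JSON column for the flattened columns (rows align by index).
result = raw.drop(columns="device").join(flat)
print(result)
```

Each key of the JSON object becomes its own prefixed column, which is what makes the later column selection possible.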


2. Data Re-formatting & Grouping

For this project, I have chosen the variables that I believe are most relevant to user behaviour:


df = df.loc[:, ['channelGrouping', 'date', 'fullVisitorId', 'sessionId', 'visitId', 'visitNumber',
                'device_browser', 'device_operatingSystem', 'device_isMobile', 'geoNetwork_country',
                'trafficSource_source', 'totals_visits', 'totals_hits', 'totals_pageviews',
                'totals_bounces', 'totals_transactionRevenue']]

df = df.fillna(value=0)

Moving on: the new dataset has fewer variables, but they vary in data type, so I took some time to analyze each variable and make sure the data is “clean enough” before modelling. Below are some quick examples of unclean data to be cleaned:

# Format the values
df.channelGrouping = df.channelGrouping.replace("(Other)", "Others")

# Convert boolean type to string
df.device_isMobile = df.device_isMobile.astype(str)
df.loc[df.device_isMobile == "False", "device"] = "Desktop"
df.loc[df.device_isMobile == "True", "device"] = "Mobile"

# Categorize similar values
df['traffic_source'] = df.trafficSource_source
main_traffic_source = [
    "google", "baidu", "bing", "yahoo",
    # .... (remaining sources elided in the original)
    "pinterest", "yandex",
]

# Use .loc rather than chained indexing so the assignment hits df itself
df.loc[df.traffic_source.str.contains("google"), "traffic_source"] = "google"
df.loc[df.traffic_source.str.contains("baidu"), "traffic_source"] = "baidu"
df.loc[df.traffic_source.str.contains("bing"), "traffic_source"] = "bing"
df.loc[df.traffic_source.str.contains("yahoo"), "traffic_source"] = "yahoo"
df.loc[~df.traffic_source.isin(main_traffic_source), "traffic_source"] = "Others"
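
The same categorisation can also be written as one small mapping function. This is only a sketch: the source values, the shortened list of main sources and the helper name normalise_source are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw source values; real data has many more variants.
sources = pd.Series(["google.com", "mail.google.com", "bing", "uk.yahoo.com", "example.org"])

main_sources = ["google", "bing", "yahoo"]  # illustrative subset only

def normalise_source(value: str) -> str:
    """Return the first main source contained in `value`, else "Others"."""
    for name in main_sources:
        if name in value:
            return name
    return "Others"

traffic_source = sources.map(normalise_source)
print(traffic_source.tolist())
# ['google', 'google', 'bing', 'yahoo', 'Others']
```

A single function keeps the matching rule in one place, so adding a new main source is a one-line change to the list.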

After re-formatting, I found that fullVisitorId has fewer unique values than the dataset has rows, meaning that some visitors were recorded more than once. Hence, I grouped the variables by fullVisitorId and sorted by Revenue:

df_groupby = (
    df.groupby(['fullVisitorId', 'channelGrouping', 'geoNetwork_country', 'traffic_source',
                'device', 'device_browser', 'device_operatingSystem'])
    .agg({'totals_hits': 'sum', 'totals_pageviews': 'sum',
          'totals_bounces': 'sum', 'totals_transactionRevenue': 'sum'})
    .reset_index()
)

df_groupby = df_groupby.sort_values(by='totals_transactionRevenue', ascending=False).reset_index(drop=True)
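
As a small illustration of the grouping step, the sketch below uses three made-up session rows (visitor "A" appears twice) and only two numeric columns. It shows the uniqueness check and the group-then-sort pattern described above.

```python
import pandas as pd

# Toy session-level data: visitor "A" appears in two sessions.
sessions = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "totals_pageviews": [3, 5, 2],
    "totals_transactionRevenue": [0.0, 120.0, 40.0],
})

# Fewer unique IDs than rows means some visitors have several sessions.
print(sessions["fullVisitorId"].nunique(), len(sessions))  # 2 3

# Collapse to one row per visitor, then rank visitors by revenue.
per_visitor = (
    sessions.groupby("fullVisitorId", as_index=False)
    .agg({"totals_pageviews": "sum", "totals_transactionRevenue": "sum"})
    .sort_values(by="totals_transactionRevenue", ascending=False)
    .reset_index(drop=True)
)
print(per_visitor)
```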

3. Outlier Handling

The last step of the EDA process, and one that cannot be overlooked, is detecting and handling outliers in the dataset. The reason is that outliers, especially extreme ones, affect the performance of a machine learning model, mostly negatively. We therefore need to either remove those outliers from the dataset or replace them (with the mean or mode) so that they fall within the range where the majority of the data points lie:

# Seaborn boxplot to see how far the outliers lie from the rest
sns.boxplot(x=df_groupby.totals_transactionRevenue)

In the boxplot, most of the data points in Revenue lie below USD200,000 and there’s only one extreme outlier, at nearly USD600,000. If we don’t remove this outlier, the model will take it into account and produce a less representative result.

So let’s go ahead and remove it, and do the same for the other variables. As a quick note, there are several methods for dealing with outliers (such as the interquartile range). However, in my case there’s only one, so I simply defined the range that I believe fits well:

df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue < 200000]
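
For comparison, the interquartile-range approach mentioned above could look like the sketch below. It is not what was used in this project; the revenue values are made up, and the 1.5 × IQR whisker rule is the common boxplot convention, assumed here.

```python
import pandas as pd

# Hypothetical revenue values with one extreme point.
revenue = pd.Series([10, 12, 11, 13, 12, 14, 11, 600])

# Interquartile range: spread of the middle 50% of the data.
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the points inside the whiskers.
filtered = revenue[(revenue >= lower) & (revenue <= upper)]
print(filtered.tolist())
```

The advantage over a hand-picked cut-off is that the same rule can be applied to every numeric variable without inspecting each plot.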

B. K-Means Clustering

What is K-Means Clustering and how does it help with customer segmentation?


Clustering is one of the best-known unsupervised learning techniques: it finds structure in unlabeled data by identifying similar groups (clusters), and K-Means is one of the most common algorithms for it.


K-Means addresses two things: (1) K: the number of clusters (groups) we expect to find in the dataset, and (2) Means: the average distance of the data points to each cluster center (centroid), which we try to minimize.

Also, note that K-Means comes with several initialization options, typically:


  1. init = ‘random’: randomly selects the initial centroid of each cluster

  2. init = ‘k-means++’: selects only the 1st centroid uniformly at random, then picks each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen, so the initial centroids tend to be spread far apart
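
To show what the second option does, here is a minimal NumPy sketch of the basic k-means++ seeding rule. It is an illustration under stated assumptions, not scikit-learn's actual implementation; the function name kmeans_pp_init and the toy blobs are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    # First centroid: chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Three well-separated toy blobs in 2D.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 10, 20)])
print(kmeans_pp_init(X, 3, rng))
```

Because far-away points get a higher sampling probability, the starting centroids usually land in different regions of the data, which tends to help K-Means converge to a better solution than purely random starts.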

In this project, I’ll use the second option to ensure that each cluster is well-distinguished from one another:


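from sklearn.cluster import KMeans

# Numeric columns only: hits, pageviews, bounces, revenue
data = df_groupby.iloc[:, 7:]

kmeans = KMeans(n_clusters=3, init="k-means++")
kmeans.fit(data)

labels = kmeans.predict(data)
labels = pd.DataFrame(data=labels, index=df_groupby.index, columns=["labels"])

# Attach the labels to the grouped data (df_kmeans is used in the plots below)
df_kmeans = pd.concat([df_groupby, labels], axis=1)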

Before applying the algorithm, we need to define “n_clusters”, which is the number of groups we expect to get out of the modelling. In this case, I arbitrarily set n_clusters = 3. Then, I visualized how the dataset is grouped using 2 variables, Revenue and PageViews:

plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 0], df_kmeans.totals_pageviews[df_kmeans.labels == 0], c='blue')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 1], df_kmeans.totals_pageviews[df_kmeans.labels == 1], c='green')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 2], df_kmeans.totals_pageviews[df_kmeans.labels == 2], c='orange')
plt.show()

As you can see, the x-axis shows Revenue and the y-axis shows PageViews. After modelling, we can see a certain degree of difference between the 3 clusters. However, I was not sure whether 3 is the “right” number of clusters. To check, we can rely on an attribute of the fitted K-Means estimator, inertia_, which is the sum of squared distances from each sample to its closest centroid. In particular, we will compare the inertia for each number of clusters from 1 to 9, in my case, and see where it stops dropping sharply:

# Find the best number of clusters
num_clusters = [x for x in range(1, 10)]
inertia = []

for i in num_clusters:
    model = KMeans(n_clusters=i, init="k-means++")
    model.fit(data)
    inertia.append(model.inertia_)

plt.plot(num_clusters, inertia)
plt.show()

From the chart above, inertia starts to fall much more slowly from the 4th or 5th cluster onwards. Inertia keeps decreasing as clusters are added, so this “elbow” marks the point where extra clusters bring little further improvement, and I decided to go with “n_clusters=4”:
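
To make the meaning of inertia_ tangible, the sketch below recomputes it by hand and compares it with scikit-learn's value. The data is synthetic (make_blobs with 4 centers is an assumption for illustration), so the numbers say nothing about the Google Analytics dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia for k = 1..9, the same range as in the elbow loop above.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# inertia_ is the sum of squared distances of samples to their closest centroid.
model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
nearest = model.cluster_centers_[model.labels_]
manual_inertia = ((X - nearest) ** 2).sum()
print(manual_inertia, model.inertia_)
```

Since every extra centroid can only bring points closer to some center, inertia alone always favours more clusters. That is why the elbow, rather than the minimum, is what we read off the chart.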

# Assumed step, not shown in the original: refit with 4 clusters and attach the labels
kmeans_n4 = KMeans(n_clusters=4, init="k-means++").fit(data)
df_kmeans_n4 = df_groupby.assign(labels=kmeans_n4.labels_)

# This time PageViews goes on the x-axis and Revenue on the y-axis
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 0], df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 0], c='blue')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 1], df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 1], c='green')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 2], df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 2], c='orange')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 3], df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 3], c='red')
plt.xlabel("Page Views")
plt.ylabel("Revenue")
plt.show()

The clusters now look a lot more distinguishable from one another:


  1. Cluster 0 (Blue): high PageViews yet little to no Revenue

  2. Cluster 1 (Red): medium PageViews, low Revenue

  3. Cluster 2 (Orange): medium PageViews, medium Revenue

  4. Cluster 4 (Green): unclear trend of PageViews, high Revenue


Leaving aside Clusters 0 and 4 (unclear patterns), which are beyond our control, Clusters 1 and 2 can tell a story here, as they seem to share some similarities.
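
Descriptions like the ones in the list above can be backed by a per-cluster summary table. The sketch below uses made-up labelled visitors; the column names follow the earlier snippets, while the output names visitors, mean_pageviews and mean_revenue are hypothetical.

```python
import pandas as pd

# Hypothetical labelled visitors; values are made up for illustration.
clustered = pd.DataFrame({
    "labels": [0, 0, 1, 1, 2, 2],
    "totals_pageviews": [90, 110, 40, 60, 45, 55],
    "totals_transactionRevenue": [0, 0, 100, 300, 900, 1100],
})

# One row per cluster: size, average PageViews and average Revenue.
profile = clustered.groupby("labels").agg(
    visitors=("totals_pageviews", "size"),
    mean_pageviews=("totals_pageviews", "mean"),
    mean_revenue=("totals_transactionRevenue", "mean"),
)
print(profile)
```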


To understand which factors might drive each cluster, I segmented each cluster by Channel, Device and Operating System:


In this breakdown, the Referral channel contributed the highest Revenue in Cluster 1, followed by Direct and Organic Search. In contrast, it’s Direct that made the highest contribution in Cluster 2. Similarly, while Macintosh is the dominant operating system in Cluster 1, it’s Windows that achieved higher revenue in Cluster 2. The only similarity between the 2 clusters is the Device Browser: Chrome is the most widely used in both.
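
A breakdown of this kind can be produced with a two-level groupby. The sketch below uses made-up rows for two clusters, so the figures are illustrative only and do not reproduce the results described above.

```python
import pandas as pd

# Hypothetical labelled visitors for two clusters.
clustered = pd.DataFrame({
    "labels": [1, 1, 1, 2, 2, 2],
    "channelGrouping": ["Referral", "Direct", "Referral", "Direct", "Direct", "Organic Search"],
    "totals_transactionRevenue": [500, 200, 300, 700, 100, 250],
})

# Revenue per cluster and channel, with channels spread across columns.
by_channel = (
    clustered.groupby(["labels", "channelGrouping"])["totals_transactionRevenue"]
    .sum()
    .unstack(fill_value=0)
)
print(by_channel)

# Top-earning channel for each cluster.
print(by_channel.idxmax(axis=1))
```

Swapping channelGrouping for device_browser or device_operatingSystem gives the other two segmentations.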

Voila! This further segmentation helps us tell which factor (in this case Channel, Device Browser, Operating System) works better for each cluster, so we can better evaluate our investment moving forward!

C. A/B Testing through Hypothesis Testing

What is A/B Testing and how can Hypothesis Testing come into play to complement the process?

A/B Testing is no stranger to those who work in Advertising and Media, since it’s one of the powerful techniques that help improve performance more cost-efficiently. In particular, A/B Testing divides the audience into 2 groups: Test and Control. We then expose the ads, or show a different design, to the Test group only, to see if there’s any significant difference between the 2 groups: exposed vs. un-exposed.

Image credit: https://productcoalition.com/are-you-segmenting-your-a-b-test-results-c5512c6def65?gi=7b445e5ef457
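
The Test/Control split described above can be sketched as a simple random assignment. The audience of 1,000 user IDs and the 50/50 split are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical audience of 1,000 user IDs.
user_ids = np.arange(1000)

# Shuffle once, then cut in half so each user lands in exactly one group.
shuffled = rng.permutation(user_ids)
test_group, control_group = shuffled[:500], shuffled[500:]

print(len(test_group), len(control_group))  # 500 500
```

Random assignment is what lets us attribute a later difference in Revenue to the ad exposure rather than to how the groups were picked.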

In Advertising, there are a number of automated tools on the market that make A/B Testing possible in one click. However, I still wanted to try a different method from Data Science that can do the same: Hypothesis Testing. The methodology is much the same: Hypothesis Testing weighs the Null Hypothesis (H0) against the Alternative Hypothesis (H1) to see if there’s any significant difference between the two groups!

Assume that I run a promotion campaign that exposes an ad to the Test group. Here’s a quick summary of the steps needed to test the result with Hypothesis Testing:

  1. Sample Size Determination

  2. Pre-requisite Requirements: Normality and Correlation Tests

  3. Hypothesis Testing


For the 1st step, we can rely on Power Analysis, which helps determine the sample size to draw from a population. Power Analysis requires 3 parameters: (1) effect size, (2) power and (3) alpha. If you are looking for details on how Power Analysis works, please refer to an in-depth article here that I wrote some time ago.

Below is a quick note on each parameter:


# Effect size: (expected mean - actual mean) / actual_std
effect_size = (280000 - df_group1_ab.revenue.mean()) / df_group1_ab.revenue.std()  # expected mean set to 280,000

# Power
power = 0.9  # the probability of rejecting the null hypothesis when it is false

# Alpha
alpha = 0.05  # the significance level (Type I error rate)

After having 3 parameters ready, we use TTestPower() to determine the sample size:


import statsmodels.stats.power as sms

n = sms.TTestPower().solve_power(effect_size=effect_size, power=power, alpha=alpha)
print(n)

The result is 279, meaning we need to draw 279 data points from each group: Test and Control. As I don’t have real data, I used np.random.normal to generate the revenue data, with a sample size of 279 for each group:

# Take the samples out of each group: control vs test
control_sample = np.random.normal(control_rev.mean(), control_rev.std(), size=279)
test_sample = np.random.normal(test_rev.mean(), test_rev.std(), size=279)
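
The sample-size and sampling steps above can be sketched end to end with made-up inputs. All numbers here (baseline mean and standard deviation, expected mean) are assumptions for illustration. Note that statsmodels' TTestPower models a one-sample or paired t-test, while TTestIndPower models two independent groups; the sketch computes both so the difference is visible, and uses the latter for the per-group size.

```python
import math
import numpy as np
from statsmodels.stats.power import TTestPower, TTestIndPower

# Assumed inputs for illustration only.
baseline_mean, baseline_std = 100.0, 50.0
expected_mean = 110.0

effect_size = (expected_mean - baseline_mean) / baseline_std  # Cohen's d = 0.2
alpha, power = 0.05, 0.9

# One-sample / paired design, as TTestPower assumes.
n_one = TTestPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
# Two independent groups of equal size (ratio=1.0); solves for nobs1, the size of each group.
n_ind = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1.0)

# Round up and simulate one sample per group with a seeded generator.
n = math.ceil(n_ind)
rng = np.random.default_rng(0)
control_sample = rng.normal(baseline_mean, baseline_std, size=n)
test_sample = rng.normal(expected_mean, baseline_std, size=n)
print(math.ceil(n_one), n)
```

A smaller expected uplift means a smaller effect size, which in turn pushes the required sample size up quickly, so it is worth fixing these three parameters before collecting data.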

Moving to the 2nd step, we need to ensure the samples are (1) normally distributed and (2) independent (not correlated). Again, if you want a refresher on the tests used in this step, refer to my article mentioned above. In short, we are going to use (1) Shapiro-Wilk as the normality test and (2) Pearson as the correlation test.

# Step 2. Pre-requisite: Normality, Correlation
from scipy.stats import shapiro, pearsonr

stat1, p1 = shapiro(control_sample)
stat2, p2 = shapiro(test_sample)
print(p1, p2)

stat3, p3 = pearsonr(control_sample, test_sample)
print(p3)

The Shapiro p-values are 0.129 and 0.539 for the Control and Test groups respectively, both > 0.05. Hence, we fail to reject the null hypothesis of normality and can treat the 2 groups as normally distributed.

The Pearson p-value is 0.98, which is > 0.05, so there is no evidence of a linear correlation between the 2 groups and we treat them as independent of each other.
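
To see how these two checks behave, the sketch below runs them on synthetic samples: one normal, one deliberately skewed (exponential), and one drawn independently. The distributions and their parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import shapiro, pearsonr

rng = np.random.default_rng(1)
normal_sample = rng.normal(100, 15, size=279)
skewed_sample = rng.exponential(scale=100, size=279)  # clearly non-normal
other_sample = rng.normal(100, 15, size=279)          # drawn independently

# Shapiro-Wilk: a small p-value is evidence against normality.
_, p_normal = shapiro(normal_sample)
_, p_skewed = shapiro(skewed_sample)

# Pearson: r measures linear correlation; p tests H0 "no linear correlation".
r, p_corr = pearsonr(normal_sample, other_sample)

print(p_normal, p_skewed)
print(r, p_corr)
```

The skewed sample should be rejected by Shapiro-Wilk with a very small p-value. Keep in mind that a high Pearson p-value only means no linear relationship was detected, which is weaker than proving independence.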

The final step is here! As there are 2 groups to be tested against each other (Test vs Control), we use a T-Test to see if there’s any significant difference in Revenue after running the A/B Test:

# Step 3. Hypothesis Testing
from scipy.stats import ttest_ind

tstat, p4 = ttest_ind(control_sample, test_sample)
print(p4)

The resulting p-value is 0.35, which is > 0.05. Hence, the A/B Test does not show a statistically significant difference in Revenue between the Test group exposed to the ads and the Control group with no ad exposure.
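
For contrast, the sketch below shows what the same test looks like when a real uplift exists. The groups are synthetic, with assumed means of 100 and 110 and a standard deviation of 10; Welch's variant (equal_var=False) is included as an option when the two groups may have different variances.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
control = rng.normal(100, 10, size=279)
test_same = rng.normal(100, 10, size=279)    # no true uplift
test_uplift = rng.normal(110, 10, size=279)  # true uplift of 10

# Student's t-test (assumes equal variances in both groups).
_, p_same = ttest_ind(control, test_same)
_, p_uplift = ttest_ind(control, test_uplift)
# Welch's variant drops the equal-variance assumption.
_, p_welch = ttest_ind(control, test_uplift, equal_var=False)

print(p_same, p_uplift, p_welch)
```

With an uplift this large relative to the noise, both variants should return a p-value far below 0.05, which is the outcome we would look for in a successful campaign.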

Voila! That’s the end of this project: Customer Segmentation & A/B Testing! I hope you find this article useful and easy to follow.

Do look out for my upcoming projects in Data Science and Machine Learning in the near future! In the meantime, feel free to check out my Github here for the complete repository:

Github: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07



Translated from: https://towardsdatascience.com/customer-segmentation-k-means-clustering-a-b-testing-bd26a94462dd






