Data-Driven Analysis in Practice, Part 2
Customer Segmentation
In the previous post we took a first look at an online retail dataset and explored its North Star Metrics. In this post we go a step further: using the data to understand our customers and segment them.
Customer Segmentation
First, a question worth asking: why segment customers at all?
The main reason is that we cannot treat every customer with the same content, the same channels, and the same priority. We need to understand our customers better: they have different needs and different characteristics, and those differences are exactly what different decisions hinge on.
Depending on your goal, you can segment customers with different strategies. For example, if you want to improve retention, you can segment customers by churn probability and act accordingly. Here we implement a popular segmentation method: RFM.
RFM stands for Recency, Frequency, and Monetary value. In theory we can segment customers as follows:
- Low Value: customers who purchase infrequently and generate low (or even negative) revenue.
- Mid Value: customers who bring in a moderate amount of revenue.
- High Value: the high-value customers we can least afford to lose.
RFM requires computing Recency, Frequency, and Monetary value for each customer. Here we will use unsupervised clustering to build the RFM segments.
Recency
To compute Recency, we find each customer's most recent purchase date and count the days of inactivity since then, then apply the K-means clustering algorithm to turn those values into a Recency score.
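As a toy illustration of that computation (the customer IDs and dates below are made up):

```python
import pandas as pd

# hypothetical transactions: three customers, five invoices
tx = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3, 3],
    'InvoiceDate': pd.to_datetime([
        '2011-01-05', '2011-11-20', '2011-06-01', '2011-12-01', '2011-12-09'])
})

# last purchase per customer, then days of inactivity
# relative to the latest invoice in the whole dataset
last = tx.groupby('CustomerID').InvoiceDate.max().reset_index()
last['Recency'] = (last.InvoiceDate.max() - last.InvoiceDate).dt.days
print(last[['CustomerID', 'Recency']])
# CustomerID 1 -> 19 days, 2 -> 191 days, 3 -> 0 days
```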
We continue with the retail dataset from the previous post.
Import the libraries
# import libraries
from __future__ import division
from datetime import datetime, timedelta
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go
#initiate Plotly
pyoff.init_notebook_mode()
Load the data
#load our data from the Excel file
tx_data = pd.read_excel('Online Retail.xlsx')
#convert the string date field to datetime
tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])
#we will be using only UK data
tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)
Compute Recency
#create a generic user dataframe to keep CustomerID and the new segmentation scores
#(built from the UK subset, since all later metrics come from tx_uk)
tx_user = pd.DataFrame(tx_uk['CustomerID'].unique())
tx_user.columns = ['CustomerID']
#get the max purchase date for each customer and create a dataframe with it
tx_max_purchase = tx_uk.groupby('CustomerID').InvoiceDate.max().reset_index()
tx_max_purchase.columns = ['CustomerID','MaxPurchaseDate']
#we take our observation point as the max invoice date in our dataset
tx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days
#merge this dataframe to our new user dataframe
tx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID','Recency']], on='CustomerID')
tx_user.head()
#plot a recency histogram
plot_data = [
go.Histogram(
x=tx_user['Recency']
)
]
plot_layout = go.Layout(
title='Recency'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
tx_user.Recency.describe()
count 3950.000000
mean 90.778481
std 100.230349
min 0.000000
25% 16.000000
50% 49.000000
75% 142.000000
max 373.000000
Name: Recency, dtype: float64
While the mean is about 91 days, the median is only 49: the distribution is right-skewed.
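A right-skewed distribution like this pulls the mean well above the median; a quick illustration with made-up numbers:

```python
import numpy as np

vals = np.array([1, 2, 3, 4, 500])  # one extreme value dominates the mean
print(vals.mean())      # 102.0
print(np.median(vals))  # 3.0
```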
Next we run K-means clustering, using the elbow method to find the best k.
from sklearn.cluster import KMeans

#compute the within-cluster sum of squares (inertia) for k = 1..9
sse = {}
tx_recency = tx_user[['Recency']].copy()
for k in range(1, 10):
    #fit on the Recency column only, so the added 'clusters' column is not fed back in
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency[['Recency']])
    tx_recency['clusters'] = kmeans.labels_
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel('Number of clusters')
plt.show()
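As a cross-check on the elbow method, the silhouette score is another common way to choose k; a self-contained sketch on synthetic one-dimensional data (the four group centres are made up, not taken from this dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# four well-separated synthetic "recency" groups
X = np.concatenate(
    [rng.normal(loc, 5, 200) for loc in (10, 100, 200, 350)]
).reshape(-1, 1)

# the best k is the one that maximizes the silhouette score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```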
Here we choose K=4.
#build 4 clusters for recency and add it to dataframe
kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Recency']])
tx_user['RecencyCluster'] = kmeans.predict(tx_user[['Recency']])
#function for ordering cluster numbers by the mean of a target field
def order_cluster(cluster_field_name, target_field_name, df, ascending):
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={'index': cluster_field_name})
    return df_final
tx_user = order_cluster('RecencyCluster', 'Recency',tx_user,False)
tx_user.groupby('RecencyCluster')['Recency'].describe()
The recency clusters do indeed have distinct profiles: the higher the cluster number, the more recent its customers (e.g. cluster 2 is clearly more recent than cluster 1).
We now apply the same approach to Frequency and to Revenue (the monetary value).
Frequency
#get the purchase count for each user (counts invoice lines; use InvoiceNo.nunique() for distinct orders)
tx_frequency = tx_uk.groupby('CustomerID').InvoiceDate.count().reset_index()
tx_frequency.columns = ['CustomerID','Frequency']
#add this data to our main dataframe
tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')
#plot the histogram
plot_data = [
go.Histogram(
x=tx_user.query('Frequency < 1000')['Frequency']
)
]
plot_layout = go.Layout(
title='Frequency'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
#k-means
kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Frequency']])
tx_user['FrequencyCluster'] = kmeans.predict(tx_user[['Frequency']])
#order the frequency cluster
tx_user = order_cluster('FrequencyCluster', 'Frequency',tx_user,True)
#see details of each cluster
tx_user.groupby('FrequencyCluster')['Frequency'].describe()
Revenue
#calculate revenue for each customer
tx_uk['Revenue'] = tx_uk['UnitPrice'] * tx_uk['Quantity']
tx_revenue = tx_uk.groupby('CustomerID').Revenue.sum().reset_index()
#merge it with our main dataframe
tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')
#plot the histogram
plot_data = [
go.Histogram(
x=tx_user.query('Revenue < 10000')['Revenue']
)
]
plot_layout = go.Layout(
title='Monetary Value'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
#apply clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(tx_user[['Revenue']])
tx_user['RevenueCluster'] = kmeans.predict(tx_user[['Revenue']])
#order the cluster numbers
tx_user = order_cluster('RevenueCluster', 'Revenue',tx_user,True)
#show details of the dataframe
tx_user.groupby('RevenueCluster')['Revenue'].describe()
Overall score
We now have Recency, Frequency, and Revenue scores; the next step is to combine the three into a single overall score.
#calculate overall score and use mean() to see details
tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']
tx_user.groupby('OverallScore')[['Recency','Frequency','Revenue']].mean()
Clearly, a score of 8 marks our best customers and 0 our worst.
A simple split:
- 0 to 2: Low Value
- 3 to 4: Mid Value
- 5+: High Value
tx_user['Segment'] = 'Low-Value'
tx_user.loc[tx_user['OverallScore']>2,'Segment'] = 'Mid-Value'
tx_user.loc[tx_user['OverallScore']>4,'Segment'] = 'High-Value'
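The same three-way split can also be written with pd.cut; a sketch on hypothetical score values (an alternative to the .loc approach above, not the article's code):

```python
import pandas as pd

scores = pd.Series([0, 2, 3, 4, 5, 8])  # hypothetical OverallScore values
# bins are right-inclusive: (-1, 2] -> Low, (2, 4] -> Mid, (4, 8] -> High
segments = pd.cut(scores, bins=[-1, 2, 4, 8],
                  labels=['Low-Value', 'Mid-Value', 'High-Value'])
print(list(segments))
# ['Low-Value', 'Low-Value', 'Mid-Value', 'Mid-Value', 'High-Value', 'High-Value']
```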
#Revenue vs Frequency
tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
y=tx_graph.query("Segment == 'Low-Value'")['Revenue'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
)
),
go.Scatter(
x=tx_graph.query("Segment == 'Mid-Value'")['Frequency'],
y=tx_graph.query("Segment == 'Mid-Value'")['Revenue'],
mode='markers',
name='Mid',
marker= dict(size= 9,
line= dict(width=1),
color= 'green',
opacity= 0.5
)
),
go.Scatter(
x=tx_graph.query("Segment == 'High-Value'")['Frequency'],
y=tx_graph.query("Segment == 'High-Value'")['Revenue'],
mode='markers',
name='High',
marker= dict(size= 11,
line= dict(width=1),
color= 'red',
opacity= 0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Revenue"},
xaxis= {'title': "Frequency"},
title='Segments'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
# Revenue vs Recency
tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
y=tx_graph.query("Segment == 'Low-Value'")['Revenue'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
)
),
go.Scatter(
x=tx_graph.query("Segment == 'Mid-Value'")['Recency'],
y=tx_graph.query("Segment == 'Mid-Value'")['Revenue'],
mode='markers',
name='Mid',
marker= dict(size= 9,
line= dict(width=1),
color= 'green',
opacity= 0.5
)
),
go.Scatter(
x=tx_graph.query("Segment == 'High-Value'")['Recency'],
y=tx_graph.query("Segment == 'High-Value'")['Revenue'],
mode='markers',
name='High',
marker= dict(size= 11,
line= dict(width=1),
color= 'red',
opacity= 0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Revenue"},
xaxis= {'title': "Recency"},
title='Segments'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
# Frequency vs Recency
tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
y=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
)
),
go.Scatter(
x=tx_graph.query("Segment == 'Mid-Value'")['Recency'],
y=tx_graph.query("Segment == 'Mid-Value'")['Frequency'],
mode='markers',
name='Mid',
marker= dict(size= 9,
line= dict(width=1),
color= 'green',
opacity= 0.5
)
),
go.Scatter(
x=tx_graph.query("Segment == 'High-Value'")['Recency'],
y=tx_graph.query("Segment == 'High-Value'")['Frequency'],
mode='markers',
name='High',
marker= dict(size= 11,
line= dict(width=1),
color= 'red',
opacity= 0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Frequency"},
xaxis= {'title': "Recency"},
title='Segments'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
We can finally segment our customers on this basis and take the necessary actions:
- High Value: improve retention
- Mid Value: improve retention + increase frequency
- Low Value: increase frequency
To be continued…