Hands-on Project 3: Taobao User Behavior Analysis and Visualization

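Background

  • This analysis covers Taobao user behavior from November 18 to December 18, 2014. The dataset contains more than 12 million rows and has the following fields:

    • user_id: user ID
    • item_id: item ID
    • behavior_type: user action. 1 = click, 2 = collect (add to favorites), 3 = add to cart, 4 = pay
    • user_geohash: user location (anonymized)
    • item_category: category ID of the item
    • time: time of the action
  • Data source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=46

  • The goal is to analyze the behavior of e-commerce users.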

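Questions to Answer

  1. Taobao's daily page views (PV) and daily unique visitors (UV)
  2. Purchase and repurchase behavior of Taobao users
  3. Conversion rates between the different user actions on the platform
  4. Retention analysis
  5. The value of Taobao's main products, assessed with the 80/20 (Pareto) principle
  6. User segmentation with an RFM model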
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyecharts.charts import Bar, Funnel
from pyecharts import options as opts

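# Work around the suptitle warning
# import matplotlib
# matplotlib.use("TkAgg")

# Set the plot style
plt.style.use('ggplot')

# Use a font that can render Chinese characters, and keep the minus sign displayable
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False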

Loading and Understanding the Data

data = pd.read_csv('tianchi_mobile_recommend_train_user.csv', dtype=str, encoding='utf-8')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12256906 entries, 0 to 12256905
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   user_id        object
 1   item_id        object
 2   behavior_type  object
 3   user_geohash   object
 4   item_category  object
 5   time           object
dtypes: object(6)
memory usage: 561.1+ MB
data.head()
    user_id    item_id behavior_type user_geohash item_category           time
0  98047837  232431562             1          NaN          4245  2014-12-06 02
1  97726136  383583590             1          NaN          5894  2014-12-09 20
2  98607707   64749712             1          NaN          2883  2014-12-18 11
3  98662432  320593836             1      96nn52n          6562  2014-12-06 10
4  98145908  290208520             1          NaN         13926  2014-12-16 21

Data Preprocessing

# Count missing values per column
data.apply(lambda x: sum(x.isnull()))
user_id                0
item_id                0
behavior_type          0
user_geohash     8334824
item_category          0
time                   0
dtype: int64
# Share of missing values per column
data.apply(lambda x: sum(x.isnull()) / len(x))
user_id          0.00000
item_id          0.00000
behavior_type    0.00000
user_geohash     0.68001
item_category    0.00000
time             0.00000
dtype: float64
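
The same counts and rates can also be computed without a lambda. The following is a minimal sketch on a made-up two-column frame (the `toy` data is an assumption for illustration, not the real dataset), using `isna().sum()` and `isna().mean()`:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "user_id": ["1", "2", "3", "4"],
    "user_geohash": [None, "96nn52n", None, None],
})

# isna() gives a boolean frame; sum() counts the True values, mean() gives their share.
missing_count = toy.isna().sum()
missing_rate = toy.isna().mean()
print(missing_count)
print(missing_rate)
```

Because `True` counts as 1, the mean of the boolean mask is exactly the missing rate, so no separate division by the row count is needed.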
# Split the time string into date and hour, then convert the types
data['date'] = data['time'].str[:-3]
data['hour'] = data['time'].str[-2:].astype(int)
data['date'] = pd.to_datetime(data['date'])
data['time'] = pd.to_datetime(data['time'])
data.head()
    user_id    item_id behavior_type user_geohash item_category                time       date  hour
0  98047837  232431562             1          NaN          4245 2014-12-06 02:00:00 2014-12-06     2
1  97726136  383583590             1          NaN          5894 2014-12-09 20:00:00 2014-12-09    20
2  98607707   64749712             1          NaN          2883 2014-12-18 11:00:00 2014-12-18    11
3  98662432  320593836             1      96nn52n          6562 2014-12-06 10:00:00 2014-12-06    10
4  98145908  290208520             1          NaN         13926 2014-12-16 21:00:00 2014-12-16    21
data.dtypes
user_id                  object
item_id                  object
behavior_type            object
user_geohash             object
item_category            object
time             datetime64[ns]
date             datetime64[ns]
hour                      int32
dtype: object
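
The string slicing above depends on the fixed "YYYY-MM-DD HH" layout of the time column. As a hedged alternative, the sketch below parses the strings once with an explicit format and derives the date and hour from the parsed values. The `toy_time` series is made up for illustration.

```python
import pandas as pd

# Made-up timestamps in the same "YYYY-MM-DD HH" layout as the time column.
toy_time = pd.Series(["2014-12-06 02", "2014-12-09 20", "2014-11-18 00"])

# Parse once with an explicit format, then derive the date and hour from the result.
parsed = pd.to_datetime(toy_time, format="%Y-%m-%d %H")
toy = pd.DataFrame({
    "time": parsed,
    "date": parsed.dt.normalize(),  # midnight of the same day, still datetime64
    "hour": parsed.dt.hour,
})
print(toy)
```

Passing `format` makes the expected layout explicit, so a malformed value raises an error instead of being silently misread.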
data.sort_values(by='time', ascending=True, inplace=True)
data.reset_index(drop=True, inplace=True)
data.head()
user_iditem_idbehavior_typeuser_geohashitem_categorytimedatehour
0734627153784852331NaN91302014-11-182014-11-180
1360901372367481151NaN105232014-11-182014-11-180
2404597331552181771NaN85612014-11-182014-11-180
38141991498085241NaN90532014-11-182014-11-180
411330998257308611NaN37832014-11-182014-11-180
# Summarize the string (object) columns; the include parameter of describe() selects which dtypes are reported
data.describe(include=['object'])
         user_id    item_id behavior_type user_geohash item_category
count   12256906   12256906      12256906      3922082      12256906
unique     10000    2876947             4       575458          8916
top     36233277  112921337             1      94ek6ke          1863
freq       31030       1445      11550581         1052        393247

Data Analysis and Visualization

User Behavior Analysis

Daily PV and Daily UV
pv_daily = data.groupby('date').count()[['user_id']].rename(columns={'user_id':'pv'})
pv_daily.head()
                pv
date
2014-11-18  366701
2014-11-19  358823
2014-11-20  353429
2014-11-21  333104
2014-11-22  361355
# Daily unique visitors
uv_daily = data.groupby('date')[['user_id']].apply(lambda x: x.drop_duplicates().count()).rename(columns={'user_id':'uv'})
uv_daily.head()
              uv
date
2014-11-18  6343
2014-11-19  6420
2014-11-20  6333
2014-11-21  6276
2014-11-22  6187
# Combine PV and UV
pv_uv_daily = pd.concat([pv_daily, uv_daily], axis=1)
pv_uv_daily.head()
                pv    uv
date
2014-11-18  366701  6343
2014-11-19  358823  6420
2014-11-20  353429  6333
2014-11-21  333104  6276
2014-11-22  361355  6187
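
PV and UV can also be produced in a single groupby pass. This is a minimal sketch on a made-up event log (the `toy` frame and its values are assumptions for illustration), using pandas named aggregation:

```python
import pandas as pd

# Made-up event log: one row per user action (illustrative values only).
toy = pd.DataFrame({
    "date": ["2014-11-18", "2014-11-18", "2014-11-18", "2014-11-19", "2014-11-19"],
    "user_id": ["u1", "u1", "u2", "u1", "u3"],
})

# Named aggregation: pv counts rows, uv counts distinct users, both per date.
pv_uv = toy.groupby("date").agg(pv=("user_id", "size"), uv=("user_id", "nunique"))
print(pv_uv)
```

`nunique` avoids the `drop_duplicates().count()` step and returns both metrics already aligned on the same index, so no `concat` is required.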
Correlation between PV and UV
# Correlation between pv and uv; method can be 'pearson' (default), 'spearman', or 'kendall'
pv_uv_daily.corr(method='pearson')
          pv        uv
pv  1.000000  0.920602
uv  0.920602  1.000000
Visualization
plt.figure(figsize=(9, 9), dpi=70)
plt.subplot(211)
plt.plot(pv_daily, color='red')
plt.title('Daily page views', pad=10)
plt.xticks(rotation=45)
plt.grid(False)
plt.subplot(212)
plt.plot(uv_daily, color='green')
plt.title('Daily unique visitors', pad=10)
plt.xticks(rotation=45)
plt.suptitle('PV and UV trends', fontsize=20)
plt.subplots_adjust(hspace=0.5)
plt.grid(False)
plt.show()

Hourly PV and Hourly UV

pv_hour = data.groupby('hour').count()[['user_id']].rename(columns={'user_id':'pv'})
uv_hour = data.groupby('hour')[['user_id']].apply(lambda x: x.drop_duplicates().count()).rename(columns={'user_id':'uv'})
pv_uv_hour = pd.concat([pv_hour, uv_hour], axis=1)
pv_uv_hour.head()
          pv    uv
hour
0     517404  5786
1     267682  3780
2     147090  2532
3      98516  1937
4      80487  1765
Correlation
# Spearman rank correlation between hourly pv and uv
pv_uv_hour.corr(method='spearman')
          pv        uv
pv  1.000000  0.903478
uv  0.903478  1.000000
Visualization
fig = plt.figure(figsize=(9, 7), dpi=70)
fig.suptitle('Hourly PV and UV trends', y=0.93, fontsize=18)
ax1 = fig.add_subplot(111)
ax1.plot(pv_hour, color='blue', label='Page views per hour')
ax1.set_xticks(list(np.arange(0,24)))
ax1.legend(loc='upper center', fontsize=12)
ax1.set_ylabel('Page views')
ax1.set_xlabel('Hour')
ax1.grid(False)
# Second y-axis for the visitor counts, which are on a much smaller scale
ax2 = ax1.twinx()
ax2.plot(uv_hour, color='red', label='Unique visitors per hour')
ax2.legend(loc='upper left', fontsize=12)
ax2.set_ylabel('Unique visitors')
ax2.grid(False)
plt.show()

PV by Behavior Type

# data.groupby(['date', 'behavior_type'])[['user_id']].count().reset_index().rename(columns={'user_id':'pv'})
diff_behavior_pv = data.pivot_table(columns='behavior_type', index='date', values='user_id', aggfunc='count').rename(columns={'1':'click', '2':'collect', '3':'addToCart', '4':'pay'}).reset_index()
diff_behavior_pv.describe()
behavior_type          click       collect     addToCart           pay
count              31.000000     31.000000     31.000000     31.000000
mean           372599.387097   7824.387097  11082.709677   3877.580645
std             56714.877753    805.827222   2773.952718   2121.877671
min            314572.000000   6484.000000   8679.000000   3021.000000
25%            344991.000000   7285.500000  10058.500000   3333.000000
50%            364097.000000   7702.000000  10256.000000   3483.000000
75%            378031.500000   8279.500000  11277.500000   3678.000000
max            641507.000000  10446.000000  24508.000000  15251.000000
diff_behavior_pv.head()
behavior_type       date   click  collect  addToCart   pay
0             2014-11-18  345855     6904      10212  3730
1             2014-11-19  337870     7152      10115  3686
2             2014-11-20  332792     7167      10008  3462
3             2014-11-21  314572     6832       8679  3021
4             2014-11-22  340563     7252       9970  3570
bar_width = 0.2
# One label per day in the data: Nov 18-30 and Dec 1-18 (31 days)
xticklabels = ['11-%d' % i for i in range(18, 31)] + ['12-%d' % i for i in range(1, 19)]

plt.figure(figsize=(20, 9))
# Four bars per day, shifted left and right of the day's position
plt.bar(diff_behavior_pv.index-2*bar_width, diff_behavior_pv.click, width=bar_width, label='click')
plt.bar(diff_behavior_pv.index-bar_width, diff_behavior_pv.collect, bottom=0, width=bar_width, color='green', alpha=0.5, label='collect')
plt.bar(diff_behavior_pv.index, diff_behavior_pv.addToCart, bottom=0, width=bar_width, color='black', label='addToCart')
plt.bar(diff_behavior_pv.index+bar_width, diff_behavior_pv.pay, bottom=0, width=bar_width, color='blue',  label='pay')
# Log scale, because clicks are far more frequent than the other actions
plt.yscale('log')
plt.yticks(fontsize=20)
# Label every third day; ticks and labels must have the same length
plt.xticks(ticks=list(np.arange(0, 31, 3)), labels=xticklabels[::3], rotation=45, fontsize=20)
plt.xlabel('Date', fontsize=22)
plt.ylabel('Page views', fontsize=22)
plt.title('Daily PV by behavior type', fontsize=36)
plt.legend(loc='best', fontsize=18)
plt.grid(False)
plt.savefig('daily_pv_by_behavior_type.png', dpi=70)
plt.show()

User Action Analysis

Actions by hour
pv_detatil = data.pivot_table(columns='behavior_type', index='hour', values='user_id', aggfunc=np.size)
pv_detatil.rename(columns={'1':'click', '2':'collect', '3':'addToCart', '4':'pay'}, inplace=True)
pv_detatil.head()
behavior_type   click  collect  addToCart   pay
hour
0              487341    11062      14156  4845
1              252991     6276       6712  1703
2              139139     3311       3834   806
3               93250     2282       2480   504
4               75832     2010       2248   397
Visualizing actions by hour
# [1:] skips the click column, which is far larger than the other three
for i in pv_detatil.columns.tolist()[1:]:
    plt.plot(pv_detatil[i], label=i)
plt.legend(loc='best', fontsize=12)
plt.title('User actions by hour', fontsize=18)
plt.xticks(list(np.arange(0, 24)))
plt.xlabel('Hour')
plt.ylabel('Count')
plt.grid()
plt.show()

data_user_buy = data[data.behavior_type == '4'].groupby('user_id').size()
data_user_buy.head()
user_id
100001878    36
100011562     3
100012968    15
100014060    24
100024529    26
dtype: int64
click_times = data[data.behavior_type == '1'].groupby('user_id').size()
collect_times = data[data.behavior_type == '2'].groupby('user_id').size()
addToCart_times = data[data.behavior_type == '3'].groupby('user_id').size()
pay_times = data[data.behavior_type == '4'].groupby('user_id').size()
user_behavior = pd.concat([click_times, collect_times, addToCart_times, pay_times], axis=1)
user_behavior.columns= ['click', 'collect', 'addToCart','pay']
user_behavior.fillna(0, inplace=True)
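# Clicks per purchase (the notebook names this column pay_per_click); users with no purchases get inf
user_behavior['pay_per_click'] = round(user_behavior['click'] / user_behavior['pay'], 1)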
user_behavior.head()
           click  collect  addToCart   pay  pay_per_click
100001878   2532      0.0      200.0  36.0           70.3
100011562    423      2.0        9.0   3.0          141.0
100012968    367      0.0        6.0  15.0           24.5
100014060    979      2.0       50.0  24.0           40.8
100024529   1121      1.0       81.0  26.0           43.1
user_behavior.describe()
              click       collect     addToCart           pay  pay_per_click
count  10000.000000  10000.000000  10000.000000  10000.000000      10000.000
mean    1155.058100     24.255600     34.356400     12.020500            inf
std     1430.052774     73.900635     63.889429     19.050621            NaN
min        1.000000      0.000000      0.000000      0.000000          2.000
25%      297.000000      0.000000      2.000000      2.000000         51.200
50%      703.000000      2.000000     12.000000      7.000000        101.800
75%     1461.000000     18.000000     39.000000     15.000000        247.125
max    27720.000000   2935.000000   1810.000000    809.000000            inf
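
The inf mean and NaN standard deviation in this summary come from users with zero purchases, where the ratio divides by zero. One way to keep the summary finite, sketched here on made-up counts (the `toy` frame and the column name `clicks_per_pay` are assumptions, not part of the original notebook), is to turn zero denominators into NaN before dividing:

```python
import numpy as np
import pandas as pd

# Made-up per-user counts; user "c" never purchased.
toy = pd.DataFrame({"click": [100, 30, 50], "pay": [4, 3, 0]}, index=["a", "b", "c"])

# Turn zero purchases into NaN before dividing, so the ratio is NaN rather than inf.
toy["clicks_per_pay"] = (toy["click"] / toy["pay"].replace(0, np.nan)).round(1)

# mean() skips NaN by default, so the summary stays finite.
print(toy)
print(toy["clicks_per_pay"].mean())
```

With NaN in place of inf, `describe()` reports statistics over the users who actually purchased, and no explicit range filter is needed before plotting a histogram.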
plt.hist(user_behavior[(user_behavior.pay_per_click < 800) & (user_behavior.pay_per_click >=0)].pay_per_click, bins=30)
plt.show()

# Spearman correlation of pay with the other behaviors
user_behavior.corr(method='spearman').iloc[3:4, :3]
        click   collect  addToCart
pay  0.624926  0.347776   0.659073

User Spending Analysis

Daily ARPU and Daily ARPPU
# Daily active users
active_user_daily = data.groupby('date')[['user_id']].apply(lambda x: x.drop_duplicates().count())
active_user_daily.head()
            user_id
date
2014-11-18     6343
2014-11-19     6420
2014-11-20     6333
2014-11-21     6276
2014-11-22     6187
# Daily paying users
pay_user_daily = data[data.behavior_type == '4'].groupby('date')[['user_id']].apply(lambda x: x.drop_duplicates().count())
pay_user_daily.head()
            user_id
date
2014-11-18     1539
2014-11-19     1511
2014-11-20     1492
2014-11-21     1330
2014-11-22     1411
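# Combine active and paying users
consume_daily = pd.concat([active_user_daily, pay_user_daily], axis=1)
# Rename the columns
consume_daily.columns= ['activeUserDaily', 'payUserDaily']
# The data has no purchase amounts, so assume every paying user spends 500 per day
# (the column keeps the notebook's name totalIncome although it holds this per-user figure)
consume_daily['totalIncome'] = 500
# ARPU: assumed revenue divided by daily active users
consume_daily['ARPU'] = round(consume_daily['totalIncome'] * consume_daily['payUserDaily'] / consume_daily['activeUserDaily'], 3)
# ARPPU: assumed revenue divided by daily paying users (always 500 under this assumption)
consume_daily['ARPPU'] = round(consume_daily['totalIncome'] * consume_daily['payUserDaily'] / consume_daily['payUserDaily'])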
consume_daily.head()
            activeUserDaily  payUserDaily  totalIncome     ARPU  ARPPU
date
2014-11-18             6343          1539          500  121.315  500.0
2014-11-19             6420          1511          500  117.679  500.0
2014-11-20             6333          1492          500  117.796  500.0
2014-11-21             6276          1330          500  105.959  500.0
2014-11-22             6187          1411          500  114.029  500.0
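
The arithmetic behind these two columns can be written as a small function. This is a sketch under the same flat-spend assumption as the notebook; the function name `arpu_arppu` and its parameters are hypothetical, introduced only for illustration.

```python
def arpu_arppu(active_users, paying_users, spend_per_payer):
    """Return (ARPU, ARPPU) for one day under a flat spend-per-payer assumption."""
    revenue = spend_per_payer * paying_users
    arpu = revenue / active_users    # revenue spread over all active users
    arppu = revenue / paying_users   # revenue spread over paying users only
    return arpu, arppu

# Figures for 2014-11-18: 6343 active users, 1539 paying users, assumed spend 500.
arpu, arppu = arpu_arppu(6343, 1539, 500)
print(round(arpu, 3), arppu)
```

Since paying users are a subset of active users, ARPU is ARPPU scaled by the paying rate, which is why ARPU tracks the share of paying users while ARPPU stays fixed at the assumed spend.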
fig = plt.figure(figsize=(12, 8), dpi=100)
fig.suptitle('Daily user spending', fontsize=20)
ax1 = fig.add_subplot(111)
ax1.plot(consume_daily['ARPU'], 'ro-', label='Daily ARPU')
ax1.grid()
# Keep the real tick values and only change the label size
ax1.tick_params(axis='y', labelsize=14)
ax1.set_ylabel('ARPU', fontsize=16)
ax1.legend(fontsize=14)
ax1.set_xlabel('Date', fontsize=16)
# Second y-axis for ARPPU
ax2 = ax1.twinx()
ax2.plot(consume_daily['ARPPU'], 'b-', label='Daily ARPPU')
ax2.legend(loc='upper left', fontsize=14)
ax2.tick_params(axis='y', labelsize=14)
ax2.set_ylabel('ARPPU', fontsize=16)
fig.savefig('daily_user_spending.png', dpi=70)
plt.show()

Purchase Count Analysis

data['operation'] = 1
customer_operation = data.groupby(['date', 'user_id', 'behavior_type'])[['operation']].count()
customer_operation.reset_index(level=['date', 'user_id', 'behavior_type'], inplace=True)
customer_operation.head()
        date    user_id behavior_type  operation
0 2014-11-18  100001878             1        127
1 2014-11-18  100001878             3          8
2 2014-11-18  100001878             4          1
3 2014-11-18  100014060             1         23
4 2014-11-18  100014060             3          2
customer_operation[customer_operation.behavior_type == '4']['operation'].describe()
count    49201.000000
mean         2.443141
std          3.307288
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max        185.000000
Name: operation, dtype: float64
# Number of user-days with more than 50 purchases (each row is one user on one day)
customer_operation[(customer_operation.behavior_type == '4') & (customer_operation.operation > 50)].count()['user_id']
18
plt.hist(customer_operation[(customer_operation.behavior_type == '4') & (customer_operation.operation < 50)].operation, bins=10)
plt.show()

Average Purchases per Active User per Day

customer_operation.groupby('date').apply(lambda x: x[x.behavior_type == '4'].operation.sum() / len(x.user_id.unique())).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9e8b888>

Paying Rate

  • Formula

    $$\text{paying rate of active users} = \frac{\text{APA (active paying accounts)}}{\text{UV (unique visitors)}}$$

customer_operation.groupby('date').apply(lambda x: x[x.behavior_type == '4'].operation.count() / len(x.user_id.unique())).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9e77f48>

Distribution of Purchases per User within the Same Hour

customer_hour_operation  = data[data.behavior_type == '4'].groupby(['user_id', 'date', 'hour',])[['operation']].sum()
customer_hour_operation.reset_index(level=['user_id', 'date', 'hour'], inplace=True)
customer_hour_operation.head()
     user_id       date  hour  operation
0  100001878 2014-11-18    20          1
1  100001878 2014-11-24    20          3
2  100001878 2014-11-25    13          2
3  100001878 2014-11-26    16          2
4  100001878 2014-11-26    21          1
customer_hour_operation.operation.max()
97
plt.scatter(customer_hour_operation.hour, customer_hour_operation.operation)
plt.xlabel('hour', fontsize=14)
plt.ylabel('buy times', fontsize=14)
plt.show()

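Repurchase Analysis

Monthly repurchase rate
  • By order count: more than one purchase on the same day counts as a repurchase
  • By period: multiple purchases on the same day count as one
# By period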
data_rebuy = data[data.behavior_type == '4'].groupby('user_id')['date'].apply(lambda x: len(x.unique()))
data_rebuy[:5]
user_id
100001878    15
100011562     3
100012968    11
100014060    12
100024529     9
Name: date, dtype: int64
# Repurchase rate: share of buyers who purchased on at least two different days
data_rebuy[data_rebuy >= 2].count() / data_rebuy.count()
0.8717083051991897
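
The by-period definition can be checked on a small example. The sketch below uses a made-up purchase log (the `buys` frame is an assumption for illustration) and `nunique` to count distinct purchase days per user:

```python
import pandas as pd

# Made-up purchase log: u1 buys on two days, u2 buys twice on one day, u3 buys once.
buys = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "date": ["2014-11-18", "2014-11-20", "2014-11-18", "2014-11-18", "2014-11-19"],
})

# By period: count distinct purchase days per user, so same-day purchases collapse to one.
days_per_user = buys.groupby("user_id")["date"].nunique()

# Share of buyers with at least two distinct purchase days.
rebuy_rate = (days_per_user >= 2).mean()
print(days_per_user)
print(rebuy_rate)
```

Here only u1 counts as a repeat buyer: u2 bought twice, but on the same day, which the by-period definition treats as a single purchase.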
# Days between consecutive purchases, computed within each user so that gaps never span two users
data_day_buy = data[data.behavior_type == '4'].groupby('user_id').date.apply(lambda x: x.sort_values().diff(1).dropna()).map(lambda x: x.days)
data_day_buy.head()
user_id           
100001878  2439076    6
           2439090    0
           2440428    0
           2660355    1
           2672617    0
Name: date, dtype: int64

Retention Rate

from datetime import datetime 
day_user = {}
for dt in set(data.date.dt.strftime('%Y%m%d').values.tolist()):
    user = list(set(data[data.date == datetime(int(dt[:4]),int(dt[4:6]),int(dt[6:]))]['user_id'].values.tolist()))
    day_user.update({dt:user})
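# The dates come from a set, so their order is arbitrary; sort by date
day_user = sorted(day_user.items(), key=lambda x:x[0], reverse=False)
# New users per day: users not seen on any earlier day
a = {}
t = set(day_user[0][1])
a.update({'20141118':t})
for i in day_user[1:]:
    j = (set(i[1]) - t)
    a.update({i[0]:j})
    t = t | set(i[1])
# Convert to a sorted list of (date, users) pairs, like day_user
a = sorted(a.items(), key=lambda x:x[0], reverse=False)
# Compute retention: cohort size first, then the overlap with each later day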
retention = {}
ls = []
for i, k in enumerate(a):
    ls.append(len(k[1]))
    for j in day_user[i+1:]:
        li = len(set(k[1]) & set(j[1]))
        ls.append(li)
    retention.update({k[0]: ls})
    ls = []
        
# Convert to a sorted list of (date, counts) pairs, like day_user
retention = sorted(retention.items(), key=lambda x:x[0], reverse=False)
# Keep the first 16 cohorts and the first 15 values (cohort size plus 14 days) of each
re = {}
for i in retention[:16]:    
    re.update({i[0]: i[1][:15]})
retention = pd.DataFrame(re)
retention =  retention.T
# Drop days 8, 9, 11, 12 and 13, keeping days 1-7, 10 and 14
retention.drop([8,9,11,12,13], axis=1, inplace=True)
retention.columns = ['new_users', 'day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day10', 'day14']
# Retention rate = retained users / new users of the cohort
for dived in retention.columns.tolist()[1:]:
    retention['{}_rate'.format(dived)] = round(retention[dived] / retention['new_users'], 3)
cols=['new_users','day1','day1_rate','day2','day2_rate','day3','day3_rate','day4','day4_rate','day5','day5_rate','day6','day6_rate','day7','day7_rate','day10','day10_rate','day14','day14_rate']
retention = retention[cols]
retention.sort_index(inplace=True)
retention.head()
          new_users  day1  day1_rate  day2  day2_rate  day3  day3_rate  day4  day4_rate  day5  day5_rate  day6  day6_rate  day7  day7_rate  day10  day10_rate  day14  day14_rate
20141118       6343  5137      0.810  5000      0.788  4861      0.766  4763      0.751  4810      0.758  4916      0.775  4792      0.755   4627       0.729   4806       0.758
20141119       1283   783      0.610   770      0.600   736      0.574   769      0.599   777      0.606   758      0.591   746      0.581    709       0.553    757       0.590
20141120        550   305      0.555   274      0.498   298      0.542   295      0.536   286      0.520   299      0.544   306      0.556    310       0.564    290       0.527
20141121        340   164      0.482   176      0.518   178      0.524   158      0.465   172      0.506   174      0.512   160      0.471    182       0.535    170       0.500
20141122        250   122      0.488   110      0.440    95      0.380    93      0.372   105      0.420   105      0.420   106      0.424    113       0.452    123       0.492
retention.index.str[-4:]
Index(['1118', '1119', '1120', '1121', '1122', '1123', '1124', '1125', '1126',
       '1127', '1128', '1129', '1130', '1201', '1202', '1203'],
      dtype='object')
plt.figure(figsize=(16, 9), dpi=90)
x = [i[:2] + '-' + i[2:] for i in retention.index.str[-4:].tolist()]
y1 = retention['day1_rate']
y2 = retention['day3_rate']
y3 = retention['day7_rate']
y4 = retention['day10_rate']
y5 = retention['day14_rate']

# Each label matches the series it is plotted with
plt.plot(x, y1, 'ro-', label='Day-1 retention')
plt.plot(x, y2, 'bo-', label='Day-3 retention')
plt.plot(x, y3, 'yo--', label='Day-7 retention')
plt.plot(x, y4, 'gd-', label='Day-10 retention')
plt.plot(x, y5, 'cd--', label='Day-14 retention')

plt.legend(loc='best')
plt.title('User retention within 14 days', fontsize=30)
plt.xlabel('Cohort date', fontsize=20)
plt.ylabel('Retention rate', fontsize=20)
plt.show()
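
The cohort logic used above (new users per day, then their overlap with later days) can be seen more easily on a tiny example. This is a sketch on a made-up activity log; the `active` data and the helper `retention_rate` are assumptions for illustration, and the offset counts positions in the sorted date list, which equals days only when the dates are consecutive.

```python
# Made-up activity log: date -> set of active users (illustrative only).
active = {
    "20141118": {"a", "b", "c", "d"},
    "20141119": {"a", "b", "e"},
    "20141120": {"a", "e", "f"},
}

dates = sorted(active)
seen = set()
cohorts = {}  # date -> users first seen on that date
for d in dates:
    cohorts[d] = active[d] - seen
    seen |= active[d]

def retention_rate(cohort_date, offset):
    """Share of a cohort that is active again `offset` positions later in `dates`."""
    later = dates[dates.index(cohort_date) + offset]
    cohort = cohorts[cohort_date]
    return len(cohort & active[later]) / len(cohort)

print(retention_rate("20141118", 1))
print(retention_rate("20141118", 2))
print(retention_rate("20141119", 1))
```

Note that the first date's cohort contains every user active that day, because no earlier history exists; its retention figures are therefore not directly comparable with those of later cohorts.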

Funnel Drop-off Analysis

data_user_count = data.groupby('behavior_type').size()
data_user_count
behavior_type
1    11550581
2      242556
3      343564
4      120205
dtype: int64
pv_all = data.user_id.count()
pv_all
12256906
# behavior_type codes: '1' click, '2' collect, '3' add to cart, '4' pay
# Each rate is the count of the later stage divided by the count of the earlier stage
pv_click = data_user_count['1'] / pv_all
click_cart = data_user_count['3'] / data_user_count['1']
cart_collect = data_user_count['2'] / data_user_count['3']
collect_pay = data_user_count['4'] / data_user_count['2']
cart_pay = data_user_count['4'] / data_user_count['3']
change_rate = pd.DataFrame({'count': [pv_all, data_user_count['1'], data_user_count['3'], data_user_count['4']],
                            'step_conversion': [1, pv_click, click_cart, cart_pay]}, index=['all actions (PV)', 'click', 'add to cart', 'pay'])
change_rate['overall_conversion'] = change_rate['count'] / pv_all
change_rate
                     count  step_conversion  overall_conversion
all actions (PV)  12256906         1.000000            1.000000
click             11550581         0.942373            0.942373
add to cart         343564         0.029744            0.028030
pay                 120205         0.349877            0.009807
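
The step-by-step conversion can be reproduced from the behavior_type totals alone. The sketch below is one possible way to organize it; the `counts` dictionary copies the click, add-to-cart, and pay totals from the output above, and the overall rate here is taken relative to clicks (an assumption that differs from the table, which divides by total PV).

```python
# Action totals taken from the behavior_type counts above.
counts = {"click": 11550581, "add to cart": 343564, "pay": 120205}

steps = list(counts.items())
first_total = steps[0][1]
funnel = []
for (prev_name, prev_n), (name, n) in zip(steps, steps[1:]):
    funnel.append({
        "step": f"{prev_name} -> {name}",
        "step_rate": n / prev_n,          # conversion from the previous stage
        "overall_rate": n / first_total,  # conversion relative to the first stage
    })

for row in funnel:
    print(row["step"], round(row["step_rate"], 6), round(row["overall_rate"], 6))
```

Keeping the stages in an ordered mapping makes it easy to insert another stage, such as collect, without rewriting each ratio by hand.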

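Pareto (80/20) Analysis of Taobao Products

# Purchases per item category, largest first
goods_category = data[data.behavior_type == '4'].groupby('item_category')[['user_id']].count().rename(columns={'user_id':'purchases'}).sort_values(by='purchases', ascending=False)
# Cumulative purchases and cumulative share of all purchases
goods_category['cum_purchases'] = goods_category['purchases'].cumsum()
goods_category['cum_share'] = goods_category['cum_purchases'] / goods_category['purchases'].sum()
# Categories within the first 80% of purchases form the head, the rest the tail
goods_category['group'] = np.where(goods_category['cum_share'] <= 0.80, 'head 80%', 'tail 20%') 
goods_pareto = goods_category.groupby('group')[['purchases']].count().rename(columns={'purchases':'categories'})
goods_pareto['category_share'] = round(goods_pareto['categories'] / goods_pareto['categories'].sum(), 3)
goods_pareto
          categories  category_share
group
head 80%         726           0.156
tail 20%        3939           0.844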
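
The cumulative-share split behind this table can be traced by hand on a few numbers. This is a minimal sketch with made-up purchase counts per category (the `purchases` dictionary is an assumption for illustration):

```python
from itertools import accumulate

# Made-up purchase counts per category (illustrative only).
purchases = {"c1": 50, "c2": 30, "c3": 10, "c4": 5, "c5": 5}

# Rank categories by purchases, then track the cumulative share of all purchases.
ranked = sorted(purchases.items(), key=lambda kv: kv[1], reverse=True)
total = sum(purchases.values())
cum_share = [c / total for c in accumulate(n for _, n in ranked)]

# Categories whose cumulative share stays within 80% form the head.
head = [name for (name, _), share in zip(ranked, cum_share) if share <= 0.80]
head_fraction = len(head) / len(ranked)
print(head, head_fraction)
```

In this toy data two of five categories supply 80% of purchases. The real data is more concentrated, with roughly 16% of categories supplying 80%.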

User Segmentation (RFM)

Computing R

from datetime import datetime
recent_user_buy = data[data.behavior_type == '4'].groupby('user_id')['date'].apply(lambda x: datetime(2014, 12, 20)-x.sort_values().iloc[-1])
recent_user_buy = recent_user_buy.reset_index()
recent_user_buy.columns = ['user_id', 'recent']
recent_user_buy.recent = recent_user_buy.recent.map(lambda x: x.days)
recent_user_buy.head()
     user_id  recent
0  100001878       2
1  100011562       4
2  100012968       2
3  100014060       2
4  100024529       4

Computing F

buy_freq = data[data.behavior_type == '4'].groupby('user_id').date.count()
buy_freq = buy_freq.reset_index().rename(columns={'date': 'freq'})
buy_freq.head()
     user_id  freq
0  100001878    36
1  100011562     3
2  100012968    15
3  100014060    24
4  100024529    26
rfm = pd.merge(recent_user_buy, buy_freq, right_on='user_id', left_on='user_id')
rfm.head()
     user_id  recent  freq
0  100001878       2    36
1  100011562       4     3
2  100012968       2    15
3  100014060       2    24
4  100024529       4    26

Assigning Scores

# R: a smaller recency means a more recent purchase, so the lower half is labeled 'high'
rfm['R_value'] = pd.qcut(rfm.recent, 2, labels=['high', 'low'])
# F: a larger purchase count is better, so the upper half is labeled 'high'
rfm['F_value'] = pd.qcut(rfm.freq, 2, labels=['low', 'high'])
rfm['rf'] = rfm['R_value'].astype(str) + '-' + rfm['F_value'].astype(str)
rfm.head()
     user_id  recent  freq R_value F_value         rf
0  100001878       2    36    high    high  high-high
1  100011562       4     3    high     low   high-low
2  100012968       2    15    high    high  high-high
3  100014060       2    24    high    high  high-high
4  100024529       4    26    high    high  high-high

Classifying Users

# Map each R-F combination to a segment name
def trans_value(x):
    if x == 'high-high': return 'valuable user'
    elif x == 'high-low': return 'developing user'
    elif x == 'low-high': return 'user to win back'
    else: return 'potential user'
rfm['rank'] = rfm.rf.apply(trans_value)
rfm.head()
     user_id  recent  freq R_value F_value         rf             rank
0  100001878       2    36    high    high  high-high    valuable user
1  100011562       4     3    high     low   high-low  developing user
2  100012968       2    15    high    high  high-high    valuable user
3  100014060       2    24    high    high  high-high    valuable user
4  100024529       4    26    high    high  high-high    valuable user

Counting and Visualizing the Segments

rfm.groupby('rank')[['user_id']].count()
                  user_id
rank
developing user      1721
potential user       2767
user to win back     1219
valuable user        3179
# Pass a 1D array of counts to pie()
rank_counts = rfm.groupby('rank')['user_id'].count()
plt.pie(rank_counts.values, labels=rank_counts.index.tolist(), shadow=True, autopct='%.1f%%', radius=1.5, textprops=dict(fontsize=12))
plt.title('User segments', fontsize=30, pad=45, color='blue')
plt.show()
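
The two-bin split used for R and F is essentially a median split. The sketch below illustrates the idea in plain Python on made-up users; it is a simplification under stated assumptions (the `users` data, the `segment` helper, and the short segment names are hypothetical, and `pd.qcut` may place boundary values differently).

```python
from statistics import median

# Made-up users: (days since last purchase, number of purchases).
users = {"u1": (2, 36), "u2": (4, 3), "u3": (10, 15), "u4": (12, 1)}

r_median = median(r for r, _ in users.values())
f_median = median(f for _, f in users.values())

def segment(recent, freq):
    """Median split: recent buyers score high on R, frequent buyers score high on F."""
    r_high = recent <= r_median
    f_high = freq > f_median
    if r_high and f_high:
        return "valuable"
    if r_high:
        return "developing"
    if f_high:
        return "win back"
    return "potential"

segments = {uid: segment(r, f) for uid, (r, f) in users.items()}
print(segments)
```

Because the dataset has no purchase amounts, the M (monetary) dimension is left out, so the four segments come from R and F alone.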

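Conclusions and Recommendations

  1. Over the month, daily page views and daily unique visitors follow the same trend. Daily page views mostly fluctuate between 350,000 and 400,000 and daily unique visitors between 6,200 and 6,600, with a sharp increase around the Double 12 (December 12) shopping festival.
  2. The hourly figures show that users mainly visit Taobao after 10:00, and the number of visitors peaks at around 21:00.
  3. Because the data has no purchase amounts, the actual daily ARPU and ARPPU cannot be determined. Daily ARPPU is nonetheless certain to be higher than daily ARPU, since paying users are only a subset of active users.
  4. The average number of purchases per active user per day fluctuates around 0.5.
  5. The paying rate fluctuates between 20% and 25% and reaches about 50% during Double 12.
  6. Within a single hour, a Taobao user mostly makes fewer than 30 purchases.
  7. The repurchase rate is 87%.
  8. The conversion rate from clicking an item to adding it to the cart is about 3%, while the rate from cart to payment is about 35%. Taobao should therefore improve its product pages and related product presentation to raise the click-to-cart conversion rate.
  9. About 16% of product categories account for 80% of purchases, while the remaining 84% account for only 20%. The lower 84% should therefore be optimized or delisted to raise their purchase volume.
  10. Retention drops from about 80% to about 40% within a short period but then stabilizes at around 40%, which is fairly good. The 3-, 5-, 7-, and 10-day retention rates differ little, and the 14-day retention rate stays above 40%.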