Taobao User Behavior Analysis - 1 - Data Processing (Python)

Dataset: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
Goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavior data.
This post records how I processed the data and did the feature engineering.

1. Read the data and take a first look

# data-processing libraries
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"])
# nrows limits how many rows are read
# df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"], nrows=500)
# sample randomly draws n rows
# df = df.sample(n=1000000, replace=False, random_state=1, axis=0)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type   Timestamp
0        1  2268318      2520377            pv  1511544070
1        1  2333346      2520771            pv  1511561733
2        1  2576651       149192            pv  1511572885
3        1  3830808      4181361            pv  1511593493
4        1  4365585      2520377            pv  1511596146
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   User_ID        int64 
 1   Item_ID        int64 
 2   Category_ID    int64 
 3   Behavior_type  object
 4   Timestamp      int64 
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
df.describe()
            User_ID       Item_ID   Category_ID     Timestamp
count  3.835330e+06  3.835330e+06  3.835330e+06  3.835330e+06
mean   2.411328e+05  2.578345e+06  2.711667e+06  1.511963e+09
std    2.548230e+05  1.487859e+06  1.464903e+06  8.665418e+05
min    1.000000e+00  3.000000e+00  2.171000e+03  4.401600e+04
25%    1.299830e+05  1.294202e+06  1.349561e+06  1.511763e+09
50%    1.786180e+05  2.576338e+06  2.725426e+06  1.511966e+09
75%    2.255740e+05  3.860809e+06  4.145813e+06  1.512181e+09
max    1.018011e+06  5.163067e+06  5.161669e+06  2.122867e+09
# check whether df has null values and how many per column
df.isnull().sum()
User_ID          0
Item_ID          0
Category_ID      0
Behavior_type    0
Timestamp        0
dtype: int64

First pass over the data: there are 3,835,330 rows and 5 columns: User_ID, Item_ID, Category_ID, Behavior_type, and Timestamp. User_ID, Item_ID, Category_ID, and Timestamp are integer columns, Behavior_type is a string (object) column, and Timestamp will need to be converted to a datetime later. The data is fairly clean, with no null values.
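As an aside, Behavior_type has only a handful of distinct values (pv, buy, cart, and fav in this dataset), so casting it to pandas' category dtype is a cheap way to shrink the frame. A minimal sketch:

# store the low-cardinality behavior column as category instead of object;
# the exact memory savings depend on the pandas version and the data
df["Behavior_type"] = df["Behavior_type"].astype("category")
df.info()  # memory usage should drop noticeably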

# Read the CSV as an iterator instead of all at once
# merge_dt = pd.read_csv( cwd  + '\\' + "UserBehavior.csv", encoding="utf_8", engine='python' ,iterator = True ) 
# # merge_dt  
# # output: <pandas.io.parsers.TextFileReader at 0x1cc1f0a7470>
# # pull a block of rows from the iterator
# df = merge_dt.get_chunk(10000)
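The full Tianchi file has on the order of 100 million rows, so streaming it in chunks is safer than one giant read_csv call. A minimal sketch of a chunked pass that keeps only buy events (the chunk size and the filter are illustrative, not part of the original workflow):

cols = ["User_ID", "Item_ID", "Category_ID", "Behavior_type", "Timestamp"]
buys = []
for chunk in pd.read_csv("UserBehavior.csv", names=cols, chunksize=1_000_000):
    # keep only purchase rows from each 1M-row chunk
    buys.append(chunk[chunk["Behavior_type"] == "buy"])
df_buy = pd.concat(buys, ignore_index=True)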

2. Clean the data

# drop duplicate rows
print(df.shape)
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
import datetime
import time
# convert the Unix timestamp to datetime, then format as date / hour strings
# df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d'))
df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
# df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).time())
del df['Timestamp']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time
0        1  2268318      2520377            pv  2017-11-25   01
1        1  2333346      2520771            pv  2017-11-25   06
2        1  2576651       149192            pv  2017-11-25   09
3        1  3830808      4181361            pv  2017-11-25   15
4        1  4365585      2520377            pv  2017-11-25   15
# lazy alternative: split a datetime string column directly into 2 columns
# df[["date","time"]] = df["datetime"].str.split(' ', 1, expand=True)
# df.head()
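apply calls datetime.fromtimestamp once per row, which is slow on 3.8 million rows. A vectorized sketch that could replace the two apply lines above (assumption: the timestamps are Unix seconds in UTC+8, which fromtimestamp applied implicitly via the local timezone):

# one vectorized conversion instead of millions of Python-level calls
ts = pd.to_datetime(df["Timestamp"], unit="s") + pd.Timedelta(hours=8)
df["date"] = ts.dt.date
df["time"] = ts.dt.strftime("%H")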
df['date'].value_counts()
2017-12-02    532774
2017-12-03    528928
2017-12-01    417976
2017-11-26    406792
2017-11-30    400366
2017-11-25    395034
2017-11-29    390884
2017-11-27    382304
2017-11-28    378326
2017-11-24      1507
2017-11-23       181
2017-11-22        53
2017-11-19        33
2017-11-21        25
2017-11-20        20
2017-11-18        19
2017-11-17        18
2018-08-28        16
2017-11-03        14
2017-11-16        10
2017-11-14         6
2017-11-13         6
2017-12-04         6
2017-11-11         5
2017-11-15         4
2017-12-06         3
2017-11-10         2
2017-09-16         2
2017-07-03         2
2017-11-12         2
1970-01-01         1
2015-02-06         1
2017-10-31         1
2017-09-11         1
2017-11-02         1
2017-11-05         1
2017-10-10         1
2017-11-06         1
2037-04-09         1
2017-11-04         1
Name: date, dtype: int64
# Outside 2017-11-25 through 2017-12-03 the row counts are negligible; drop those rows from the analysis
print(df.shape)
# df['date'] = df['date'][df['date'].between('2017-11-25','2017-12-03', inclusive=True)]
df['date'] = df['date'][(df['date']>=datetime.date(2017,11,25)) & (df['date']<=datetime.date(2017,12,3))]
# drop rows containing NaN
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
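The same filtering can be done in one step with a boolean mask, skipping the assign-NaN-then-dropna round trip. A sketch with the same bounds:

# keep only rows whose date falls inside the 9-day window
mask = df["date"].between(datetime.date(2017, 11, 25), datetime.date(2017, 12, 3))
df = df[mask].copy()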
# With a longer time span, the weekday dimension might reveal patterns, but this project covers only 9 days (one week plus 2 days),
# so analyzing by weekday is barely different from analyzing by date; this feature is optional.
df['week'] = df['date'].apply(lambda x: x.weekday()+1)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week
0        1  2268318      2520377            pv  2017-11-25   01     6
1        1  2333346      2520771            pv  2017-11-25   06     6
2        1  2576651       149192            pv  2017-11-25   09     6
3        1  3830808      4181361            pv  2017-11-25   15     6
4        1  4365585      2520377            pv  2017-11-25   15     6

3. RFM

RFM scores customers on Recency, Frequency, and Monetary value. This dataset has no order amounts, so only the R and F dimensions are computed below.

df_rfm = df[df['Behavior_type']=='buy']
df_rfm.shape
(76705, 7)
# how many distinct users
# len(df_rfm['User_ID'].unique())
# how many orders (treating each buy record as one order; in reality one order can contain several products)
# len(df_rfm['Behavior_type'])
# add a day-difference column, used to compute recency
df_rfm['datediff'] = (datetime.date(2017,12,4) - df['date']).dt.days
df_rfm.head()
D:\Anconda\envs\pytorch\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
     User_ID  Item_ID  Category_ID Behavior_type        date time  week  datediff
71       100  1603476      2951233           buy  2017-11-25   11     6         9
73       100  2971043      4869428           buy  2017-11-25   21     6         9
100      100   598929      2429887           buy  2017-11-27   13     1         7
119      100  1046201      3002561           buy  2017-11-27   15     1         7
125      100  1606258      4098232           buy  2017-11-27   21     1         7
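The SettingWithCopyWarning appears because df_rfm is a slice of df. Taking an explicit copy, and computing the difference from df_rfm itself, avoids it. A sketch (pd.to_datetime is used so the subtraction reliably yields a timedelta column):

df_rfm = df[df["Behavior_type"] == "buy"].copy()
# days between each purchase and the reference date 2017-12-04
ref_date = pd.Timestamp(2017, 12, 4)
df_rfm["datediff"] = (ref_date - pd.to_datetime(df_rfm["date"])).dt.days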
# days since the most recent purchase
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
# number of purchases
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
# join the two frames
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
         r   f
User_ID
100      6   8
117      6  10
119      5   3
121      9   1
122      2   3
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   r       25400 non-null  int64
 1   f       25400 non-null  int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
# Recency runs 1-9 (days before the reference date over the 9-day window); split into 4 segments at the cut points 1, 3, 5, 7, 9.
# The more recent the purchase, the higher the score.
# bins_r = df_rfm_2['r'].quantile(q=np.linspace(0, 1, num=5), interpolation='nearest')
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
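The commented-out quantile line above hints at an alternative: pd.qcut derives the cut points from the data instead of hand-picking them. A sketch (r_score_q is a hypothetical column name; with heavily tied values, duplicates='drop' can yield fewer than 4 bins):

# quartile-based recency codes 0-3; invert afterwards if more recent should score higher
df_rfm_2["r_score_q"] = pd.qcut(df_rfm_2["r"], q=4, labels=False, duplicates="drop")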
df_rfm_2['f'].value_counts()
1     8688
2     5899
3     3849
4     2321
5     1545
6      955
7      639
8      421
9      275
10     188
11     148
12     102
13      63
14      59
15      58
16      34
17      24
18      22
20      16
19      14
21      12
23      12
22       9
29       6
25       5
31       4
30       4
26       3
36       3
28       3
32       3
27       2
51       1
33       1
65       1
47       1
84       1
69       1
24       1
72       1
57       1
42       1
43       1
60       1
61       1
39       1
Name: f, dtype: int64
# purchase frequency runs up to 84; split into 4 bands: 1-21, 22-43, 44-65, 66-84
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
df_rfm_2.head()
         r   f r_score f_score
User_ID
100      6   8       2       1
117      6  10       2       1
119      5   3       3       1
121      9   1       1       1
122      2   3       4       1
# df_rfm_2['total_score'] = df_rfm_2.apply(lambda x: x['r_score'] + x['f_score'], axis=1)
# df_rfm_2['total_score'].value_counts()
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
# df_rfm_2['r_score'] = df_rfm_2['r_score'].apply(lambda x: int(str(x)))
# df_rfm_2['f_score'] = df_rfm_2['f_score'].apply(lambda x: int(str(x)))
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   r        25400 non-null  int64   
 1   f        25400 non-null  int64   
 2   r_score  25400 non-null  category
 3   f_score  25400 non-null  category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
# convert the category dtype columns to int
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   r        25400 non-null  int64
 1   f        25400 non-null  int64
 2   r_score  25400 non-null  int32
 3   f_score  25400 non-null  int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
                  r             f       r_score       f_score
count  25400.000000  25400.000000  25400.000000  25400.000000
mean       3.535906      3.019882      3.257441      1.003150
std        2.405743      3.039492      1.000799      0.065734
min        1.000000      1.000000      1.000000      1.000000
25%        1.000000      1.000000      3.000000      1.000000
50%        3.000000      2.000000      4.000000      1.000000
75%        5.000000      4.000000      4.000000      1.000000
max        9.000000     84.000000      4.000000      4.000000
# from describe above, the means of r_score and f_score are 3.257441 and 1.003150, so round down to 3 and 1 as thresholds
r_avg = 3
f_avg = 1
# bins = df_rfm_2['total_score'].quantile(q=np.linspace(0,1,num=5), interpolation='nearest')
# labels = ['一般维持客户','潜力客户','重要深耕客户','重要价值客户']
# df_rfm_2['label_customer'] = pd.cut(df_rfm_2['total_score'], bins=bins, labels=labels, include_lowest=True)
# split users into 4 segments against the score means
# (重要价值客户 = high-value, 重要维护客户 = retain, 重要发展客户 = develop, 一般客户 = general)
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']>f_avg), '重要价值客户',
                                     np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']<=f_avg), '重要维护客户',
                                             np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']>f_avg), '重要发展客户',
                                                     np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']<=f_avg), '一般客户', 
                                                              np.nan))))
df_rfm_2.head()
         r   f r_score f_score label_customer
User_ID
100      6   8       2       1         一般客户
117      6  10       2       1         一般客户
119      5   3       3       1         一般客户
121      9   1       1       1         一般客户
122      2   3       4       1       重要维护客户
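Nesting np.where four levels deep works, but np.select expresses the same four-way split more readably. A sketch with the same thresholds:

conditions = [
    (df_rfm_2["r_score"] > r_avg) & (df_rfm_2["f_score"] > f_avg),   # recent and frequent
    (df_rfm_2["r_score"] > r_avg) & (df_rfm_2["f_score"] <= f_avg),  # recent but infrequent
    (df_rfm_2["r_score"] <= r_avg) & (df_rfm_2["f_score"] > f_avg),  # lapsed but frequent
]
choices = ["重要价值客户", "重要维护客户", "重要发展客户"]
df_rfm_2["label_customer"] = np.select(conditions, choices, default="一般客户")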
# turn the index back into a regular column
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
   User_ID  r   f r_score f_score label_customer
0      100  6   8       2       1         一般客户
1      117  6  10       2       1         一般客户
2      119  5   3       3       1         一般客户
3      121  9   1       1       1         一般客户
4      122  2   3       4       1       重要维护客户
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID','label_customer']]
df_rfm_2.shape
(25400, 2)
# join the labels back onto the main frame
df = pd.merge(df, df_rfm_2, how='left', left_on='User_ID', right_on='User_ID')
df.shape
(3833384, 8)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN
2        1  2576651       149192            pv  2017-11-25   09     6            NaN
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN

4. Product classification

In real e-commerce, products are classified using sales volume, margin, and other features, typically into states such as hot seller (爆款), strong seller (旺款), on sale (在售), and clearance (清仓). Here the number of buy events per product stands in for sales volume, and products are classified accordingly.
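As an aside, the per-item buy count can also be attached without the value_counts/merge steps below, using groupby + transform. A sketch (buy_num_alt is a hypothetical column name; items never bought get 0 here, whereas the merge route leaves NaN):

# count 'buy' rows per Item_ID and broadcast that count back onto every row
df["buy_num_alt"] = (df["Behavior_type"] == "buy").groupby(df["Item_ID"]).transform("sum")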

df_pro = df[df['Behavior_type']=='buy']
print(df_pro.shape)
# sales volume (buy count) per Item_ID
print(df_pro['Item_ID'].value_counts().describe())
df_4 = df_pro['Item_ID'].value_counts().reset_index()
# too many Item_IDs; look only at those bought more than once, to help pick the classification cut points
df_5 = df_4[df_4['Item_ID']>1]
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns=['Item_ID','buy_num']
df_pro = pd.merge(df_pro, df_4, how='left', left_on='Item_ID', right_on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count    56726.000000
mean         1.352202
std          1.109932
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         58.000000
Name: Item_ID, dtype: float64
2     6990
3     1940
4      819
5      411
6      197
7      138
8       95
9       68
10      38
11      31
13      18
12      17
15      13
14      12
18       6
17       5
16       3
23       3
21       2
22       2
19       2
26       1
58       1
27       1
25       1
20       1
32       1
Name: Item_ID, dtype: int64
count    10816.000000
mean         2.847171
std          1.923511
min          2.000000
25%          2.000000
50%          2.000000
75%          3.000000
max         58.000000
Name: Item_ID, dtype: float64
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer  buy_num
0      100  1603476      2951233           buy  2017-11-25   11     6           一般客户        3
1      100  2971043      4869428           buy  2017-11-25   21     6           一般客户        1
2      100   598929      2429887           buy  2017-11-27   13     1           一般客户        1
3      100  1046201      3002561           buy  2017-11-27   15     1           一般客户        2
4      100  1606258      4098232           buy  2017-11-27   21     1           一般客户        1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID','buy_num']]
df_pro.shape
(76705, 2)
# deduplicate first, otherwise the merge with df is one-to-many and blows up the row count
df_pro.drop_duplicates(['Item_ID'],keep='first',inplace=True)
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
# classify products by buy-count cut-offs
# 爆款 = hot seller, 旺款 = strong seller, 在售 = on sale, 清仓 = clearance
df['pro_type'] = np.where(df['buy_num']>=9, '爆款',
                 np.where(df['buy_num']>=3, '旺款',
                 np.where(df['buy_num']>0, '在售', '清仓')))
del df['buy_num']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer pro_type
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN       在售
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN       在售
2        1  2576651       149192            pv  2017-11-25   09     6            NaN       清仓
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN       清仓
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN       清仓
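Since the tiers are contiguous ranges over buy_num, pd.cut is an equivalent, flatter formulation of the nested np.where above. A sketch with the same cut-offs (items never bought have NaN buy_num, fall outside every bin, and get filled in as 清仓):

bins = [0, 2, 8, np.inf]          # (0,2] 在售, (2,8] 旺款, (8,inf) 爆款
tiers = ["在售", "旺款", "爆款"]
df["pro_type"] = pd.cut(df["buy_num"], bins=bins, labels=tiers)
df["pro_type"] = df["pro_type"].cat.add_categories("清仓").fillna("清仓")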
df.to_csv("bi_tmall_users.csv", sep=",", index=False)