Data source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1 Goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavioral logs. This post records how I processed the data and did the feature engineering.
1. Read the data and take a first look
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",",
                 names=["User_ID", "Item_ID", "Category_ID", "Behavior_type", "Timestamp"])
df.head()
   User_ID  Item_ID  Category_ID Behavior_type   Timestamp
0        1  2268318      2520377            pv  1511544070
1        1  2333346      2520771            pv  1511561733
2        1  2576651       149192            pv  1511572885
3        1  3830808      4181361            pv  1511593493
4        1  4365585      2520377            pv  1511596146
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 User_ID int64
1 Item_ID int64
2 Category_ID int64
3 Behavior_type object
4 Timestamp int64
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
df.describe()
             User_ID       Item_ID   Category_ID     Timestamp
count   3.835330e+06  3.835330e+06  3.835330e+06  3.835330e+06
mean    2.411328e+05  2.578345e+06  2.711667e+06  1.511963e+09
std     2.548230e+05  1.487859e+06  1.464903e+06  8.665418e+05
min     1.000000e+00  3.000000e+00  2.171000e+03  4.401600e+04
25%     1.299830e+05  1.294202e+06  1.349561e+06  1.511763e+09
50%     1.786180e+05  2.576338e+06  2.725426e+06  1.511966e+09
75%     2.255740e+05  3.860809e+06  4.145813e+06  1.512181e+09
max     1.018011e+06  5.163067e+06  5.161669e+06  2.122867e+09
df.isnull().sum()
User_ID 0
Item_ID 0
Category_ID 0
Behavior_type 0
Timestamp 0
dtype: int64
From this first pass: the data has 3,835,330 rows and 5 columns (User_ID, Item_ID, Category_ID, Behavior_type, Timestamp). The 4 columns User_ID, Item_ID, Category_ID and Timestamp are integers, Behavior_type is a string, and Timestamp will be converted to a date/time later. The data is fairly clean, with no null values.
2. Clean the data
print(df.shape)
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
import datetime

df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
del df['Timestamp']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time
0        1  2268318      2520377            pv  2017-11-25   01
1        1  2333346      2520771            pv  2017-11-25   06
2        1  2576651       149192            pv  2017-11-25   09
3        1  3830808      4181361            pv  2017-11-25   15
4        1  4365585      2520377            pv  2017-11-25   15
df['date'].value_counts()
2017-12-02 532774
2017-12-03 528928
2017-12-01 417976
2017-11-26 406792
2017-11-30 400366
2017-11-25 395034
2017-11-29 390884
2017-11-27 382304
2017-11-28 378326
2017-11-24 1507
2017-11-23 181
2017-11-22 53
2017-11-19 33
2017-11-21 25
2017-11-20 20
2017-11-18 19
2017-11-17 18
2018-08-28 16
2017-11-03 14
2017-11-16 10
2017-11-14 6
2017-11-13 6
2017-12-04 6
2017-11-11 5
2017-11-15 4
2017-12-06 3
2017-11-10 2
2017-09-16 2
2017-07-03 2
2017-11-12 2
1970-01-01 1
2015-02-06 1
2017-10-31 1
2017-09-11 1
2017-11-02 1
2017-11-05 1
2017-10-10 1
2017-11-06 1
2037-04-09 1
2017-11-04 1
Name: date, dtype: int64
print(df.shape)
df['date'] = df['date'][(df['date'] >= datetime.date(2017, 11, 25)) & (df['date'] <= datetime.date(2017, 12, 3))]
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
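Assigning NaN outside the 2017-11-25 to 2017-12-03 window and then calling `dropna` works, but a single boolean mask expresses the same filter directly, without the intermediate NaNs. A minimal sketch on toy dates (toy data, not the real frame):

```python
import datetime
import pandas as pd

toy = pd.DataFrame({"date": [datetime.date(2017, 11, 24),
                             datetime.date(2017, 11, 25),
                             datetime.date(2017, 12, 3),
                             datetime.date(2017, 12, 4)]})

start, end = datetime.date(2017, 11, 25), datetime.date(2017, 12, 3)
mask = (toy["date"] >= start) & (toy["date"] <= end)
toy = toy[mask]  # keeps only rows inside the 9-day window, boundaries included
```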
df['week'] = df['date'].apply(lambda x: x.weekday() + 1)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week
0        1  2268318      2520377            pv  2017-11-25   01     6
1        1  2333346      2520771            pv  2017-11-25   06     6
2        1  2576651       149192            pv  2017-11-25   09     6
3        1  3830808      4181361            pv  2017-11-25   15     6
4        1  4365585      2520377            pv  2017-11-25   15     6
3. RFM
df_rfm = df[df['Behavior_type'] == 'buy'].copy()  # .copy() avoids the SettingWithCopyWarning a plain slice triggers
df_rfm.shape
(76705, 7)
df_rfm['datediff'] = (datetime.date(2017, 12, 4) - df_rfm['date']).apply(lambda d: d.days)
df_rfm.head()
     User_ID  Item_ID  Category_ID Behavior_type        date time  week  datediff
71       100  1603476      2951233           buy  2017-11-25   11     6         9
73       100  2971043      4869428           buy  2017-11-25   21     6         9
100      100   598929      2429887           buy  2017-11-27   13     1         7
119      100  1046201      3002561           buy  2017-11-27   15     1         7
125      100  1606258      4098232           buy  2017-11-27   21     1         7
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
         r   f
User_ID
100      6   8
117      6  10
119      5   3
121      9   1
122      2   3
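The list-of-tuples renaming inside `agg` was deprecated in later pandas versions; named aggregation produces the same r/f table in a single groupby call. A sketch on toy data (column names match the original):

```python
import pandas as pd

toy = pd.DataFrame({"User_ID": [100, 100, 117],
                    "datediff": [9, 7, 6],
                    "Behavior_type": ["buy", "buy", "buy"]})

# One groupby producing both columns at once:
# r = days since the most recent purchase, f = purchase count.
rf = toy.groupby("User_ID").agg(r=("datediff", "min"),
                                f=("Behavior_type", "count"))
```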
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
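How `pd.cut` assigns the r bins above can be checked on a few values. The bins form the intervals [1, 3], (3, 5], (5, 7], (7, 9], and `include_lowest=True` is what pulls r = 1 into the top-score bin (toy values; bins and labels copied from the code):

```python
import numpy as np
import pandas as pd

bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)  # 4 = most recent buyers, 1 = least recent

r = pd.Series([1, 3, 4, 9])
score = pd.cut(r, bins=bins_r, labels=labels_r, include_lowest=True)
# r=1 and r=3 fall in the first bin -> 4; r=4 in (3, 5] -> 3; r=9 -> 1
```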
df_rfm_2['f'].value_counts()
1 8688
2 5899
3 3849
4 2321
5 1545
6 955
7 639
8 421
9 275
10 188
11 148
12 102
13 63
14 59
15 58
16 34
17 24
18 22
20 16
19 14
21 12
23 12
22 9
29 6
25 5
31 4
30 4
26 3
36 3
28 3
32 3
27 2
51 1
33 1
65 1
47 1
84 1
69 1
24 1
72 1
57 1
42 1
43 1
60 1
61 1
39 1
Name: f, dtype: int64
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
df_rfm_2.head()
         r   f r_score f_score
User_ID
100      6   8       2       1
117      6  10       2       1
119      5   3       3       1
121      9   1       1       1
122      2   3       4       1
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null category
3 f_score 25400 non-null category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null int32
3 f_score 25400 non-null int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
                  r             f       r_score       f_score
count  25400.000000  25400.000000  25400.000000  25400.000000
mean       3.535906      3.019882      3.257441      1.003150
std        2.405743      3.039492      1.000799      0.065734
min        1.000000      1.000000      1.000000      1.000000
25%        1.000000      1.000000      3.000000      1.000000
50%        3.000000      2.000000      4.000000      1.000000
75%        5.000000      4.000000      4.000000      1.000000
max        9.000000     84.000000      4.000000      4.000000
r_avg = 3
f_avg = 1
# Labels: 重要价值客户 = high-value, 重要维护客户 = maintain, 重要发展客户 = develop, 一般客户 = general
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] > f_avg), '重要价值客户',
                             np.where((df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] <= f_avg), '重要维护客户',
                             np.where((df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] > f_avg), '重要发展客户',
                             np.where((df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] <= f_avg), '一般客户',
                             np.nan))))
df_rfm_2.head()
         r   f  r_score  f_score label_customer
User_ID
100      6   8        2        1         一般客户
117      6  10        2        1         一般客户
119      5   3        3        1         一般客户
121      9   1        1        1         一般客户
122      2   3        4        1       重要维护客户
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
   User_ID  r   f  r_score  f_score label_customer
0      100  6   8        2        1         一般客户
1      117  6  10        2        1         一般客户
2      119  5   3        3        1         一般客户
3      121  9   1        1        1         一般客户
4      122  2   3        4        1       重要维护客户
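The nested `np.where` works, but `np.select` reads flatter once there are four segments. A sketch using the same thresholds and labels as the code above (toy scores, one per segment):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"r_score": [4, 4, 1, 1], "f_score": [2, 1, 2, 1]})
r_avg, f_avg = 3, 1

# One condition per customer segment; everything else falls to 一般客户.
conds = [(toy["r_score"] > r_avg) & (toy["f_score"] > f_avg),
         (toy["r_score"] > r_avg) & (toy["f_score"] <= f_avg),
         (toy["r_score"] <= r_avg) & (toy["f_score"] > f_avg)]
labels = ['重要价值客户', '重要维护客户', '重要发展客户']
toy["label_customer"] = np.select(conds, labels, default='一般客户')
```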
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID', 'label_customer']]
df_rfm_2.shape
(25400, 2)
df = pd.merge(df, df_rfm_2, how='left', on='User_ID')
df.shape
(3833384, 8)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN
2        1  2576651       149192            pv  2017-11-25   09     6            NaN
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN
4. Product classification
In real e-commerce, products are classified by sales volume, profit, and other features, typically into states such as hot seller (爆款), strong seller (旺款), on sale (在售), and clearance (清仓). Here the number of buy events per item stands in for sales volume.
df_pro = df[df['Behavior_type'] == 'buy']
print(df_pro.shape)
print(df_pro['Item_ID'].value_counts().describe())
df_4 = df_pro['Item_ID'].value_counts().reset_index()  # in pandas < 2.0 this yields columns ['index', 'Item_ID'], with 'Item_ID' holding the counts
df_5 = df_4[df_4['Item_ID'] > 1]
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns = ['Item_ID', 'buy_num']
df_pro = pd.merge(df_pro, df_4, how='left', on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count 56726.000000
mean 1.352202
std 1.109932
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 58.000000
Name: Item_ID, dtype: float64
2 6990
3 1940
4 819
5 411
6 197
7 138
8 95
9 68
10 38
11 31
13 18
12 17
15 13
14 12
18 6
17 5
16 3
23 3
21 2
22 2
19 2
26 1
58 1
27 1
25 1
20 1
32 1
Name: Item_ID, dtype: int64
count 10816.000000
mean 2.847171
std 1.923511
min 2.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 58.000000
Name: Item_ID, dtype: float64
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer  buy_num
0      100  1603476      2951233           buy  2017-11-25   11     6           一般客户        3
1      100  2971043      4869428           buy  2017-11-25   21     6           一般客户        1
2      100   598929      2429887           buy  2017-11-27   13     1           一般客户        1
3      100  1046201      3002561           buy  2017-11-27   15     1           一般客户        2
4      100  1606258      4098232           buy  2017-11-27   21     1           一般客户        1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID', 'buy_num']]
df_pro.shape
(76705, 2)
df_pro.drop_duplicates(['Item_ID'], keep='first', inplace=True)
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
df['pro_type'] = np.where(df['buy_num'] >= 9, '爆款',
                 np.where(df['buy_num'] >= 3, '旺款',
                 np.where(df['buy_num'] > 0, '在售', '清仓')))  # NaN buy_num (never-bought items) fails every comparison and falls through to 清仓
del df['buy_num']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer pro_type
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN       在售
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN       在售
2        1  2576651       149192            pv  2017-11-25   09     6            NaN       清仓
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN       清仓
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN       清仓
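The three-way nested `np.where` for pro_type can also be written with `pd.cut` over the buy counts, since the thresholds (>= 9, >= 3, > 0) are just integer bins. A sketch on toy counts (toy data; NaN counts, i.e. never-bought items, fall outside every bin and map to 清仓 via `fillna`):

```python
import numpy as np
import pandas as pd

buy_num = pd.Series([12, 5, 1, np.nan])  # toy purchase counts
# (0, 2] -> 在售, (2, 8] -> 旺款, (8, inf) -> 爆款; same cutoffs as the np.where chain
pro_type = pd.cut(buy_num, bins=[0, 2, 8, np.inf],
                  labels=['在售', '旺款', '爆款'])
pro_type = pro_type.cat.add_categories(['清仓']).fillna('清仓')
```

This keeps pro_type as an ordered categorical, which also compresses well when the frame is written out.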
df.to_csv("bi_tmall_users.csv", sep=",", index=False)