Data source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1 Goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavioral logs. This post records how I processed the data and did the feature engineering.
1. Read the data and take a first look
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",",
                 names=["User_ID", "Item_ID", "Category_ID", "Behavior_type", "Timestamp"])
df.head()
   User_ID  Item_ID  Category_ID Behavior_type   Timestamp
0        1  2268318      2520377            pv  1511544070
1        1  2333346      2520771            pv  1511561733
2        1  2576651       149192            pv  1511572885
3        1  3830808      4181361            pv  1511593493
4        1  4365585      2520377            pv  1511596146
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 User_ID int64
1 Item_ID int64
2 Category_ID int64
3 Behavior_type object
4 Timestamp int64
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
df.describe()
             User_ID       Item_ID   Category_ID     Timestamp
count   3.835330e+06  3.835330e+06  3.835330e+06  3.835330e+06
mean    2.411328e+05  2.578345e+06  2.711667e+06  1.511963e+09
std     2.548230e+05  1.487859e+06  1.464903e+06  8.665418e+05
min     1.000000e+00  3.000000e+00  2.171000e+03  4.401600e+04
25%     1.299830e+05  1.294202e+06  1.349561e+06  1.511763e+09
50%     1.786180e+05  2.576338e+06  2.725426e+06  1.511966e+09
75%     2.255740e+05  3.860809e+06  4.145813e+06  1.512181e+09
max     1.018011e+06  5.163067e+06  5.161669e+06  2.122867e+09
df.isnull().sum()
User_ID 0
Item_ID 0
Category_ID 0
Behavior_type 0
Timestamp 0
dtype: int64
From this first pass: the data has 3,835,330 rows and 5 columns (User_ID, Item_ID, Category_ID, Behavior_type, Timestamp). The 4 columns User_ID, Item_ID, Category_ID and Timestamp are integers, Behavior_type is a string, and Timestamp will be converted to a date/time later. The data is fairly clean, with no null values.
2. Clean the data
print(df.shape)
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
import datetime

df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
del df['Timestamp']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time
0        1  2268318      2520377            pv  2017-11-25   01
1        1  2333346      2520771            pv  2017-11-25   06
2        1  2576651       149192            pv  2017-11-25   09
3        1  3830808      4181361            pv  2017-11-25   15
4        1  4365585      2520377            pv  2017-11-25   15
df['date'].value_counts()
2017-12-02 532774
2017-12-03 528928
2017-12-01 417976
2017-11-26 406792
2017-11-30 400366
2017-11-25 395034
2017-11-29 390884
2017-11-27 382304
2017-11-28 378326
2017-11-24 1507
2017-11-23 181
2017-11-22 53
2017-11-19 33
2017-11-21 25
2017-11-20 20
2017-11-18 19
2017-11-17 18
2018-08-28 16
2017-11-03 14
2017-11-16 10
2017-11-14 6
2017-11-13 6
2017-12-04 6
2017-11-11 5
2017-11-15 4
2017-12-06 3
2017-11-10 2
2017-09-16 2
2017-07-03 2
2017-11-12 2
1970-01-01 1
2015-02-06 1
2017-10-31 1
2017-09-11 1
2017-11-02 1
2017-11-05 1
2017-10-10 1
2017-11-06 1
2037-04-09 1
2017-11-04 1
Name: date, dtype: int64
print(df.shape)
df['date'] = df['date'][(df['date'] >= datetime.date(2017, 11, 25)) & (df['date'] <= datetime.date(2017, 12, 3))]
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
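Assigning NaN outside the 2017-11-25 to 2017-12-03 window and then calling `dropna` works, but a single boolean mask expresses the same filter directly, without the intermediate NaNs. A minimal sketch on toy dates (toy data, not the real frame):

```python
import datetime
import pandas as pd

toy = pd.DataFrame({"date": [datetime.date(2017, 11, 24),
                             datetime.date(2017, 11, 25),
                             datetime.date(2017, 12, 3),
                             datetime.date(2017, 12, 4)]})

start, end = datetime.date(2017, 11, 25), datetime.date(2017, 12, 3)
mask = (toy["date"] >= start) & (toy["date"] <= end)
toy = toy[mask]  # keeps only rows inside the 9-day window, boundaries included
```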
df['week'] = df['date'].apply(lambda x: x.weekday() + 1)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week
0        1  2268318      2520377            pv  2017-11-25   01     6
1        1  2333346      2520771            pv  2017-11-25   06     6
2        1  2576651       149192            pv  2017-11-25   09     6
3        1  3830808      4181361            pv  2017-11-25   15     6
4        1  4365585      2520377            pv  2017-11-25   15     6
3. RFM
df_rfm = df[df['Behavior_type'] == 'buy'].copy()  # .copy() avoids the SettingWithCopyWarning a plain slice triggers
df_rfm.shape
(76705, 7)
df_rfm['datediff'] = (datetime.date(2017, 12, 4) - df_rfm['date']).apply(lambda d: d.days)
df_rfm.head()
     User_ID  Item_ID  Category_ID Behavior_type        date time  week  datediff
71       100  1603476      2951233           buy  2017-11-25   11     6         9
73       100  2971043      4869428           buy  2017-11-25   21     6         9
100      100   598929      2429887           buy  2017-11-27   13     1         7
119      100  1046201      3002561           buy  2017-11-27   15     1         7
125      100  1606258      4098232           buy  2017-11-27   21     1         7
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
         r   f
User_ID
100      6   8
117      6  10
119      5   3
121      9   1
122      2   3
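The list-of-tuples renaming inside `agg` was deprecated in later pandas versions; named aggregation produces the same r/f table in a single groupby call. A sketch on toy data (column names match the original):

```python
import pandas as pd

toy = pd.DataFrame({"User_ID": [100, 100, 117],
                    "datediff": [9, 7, 6],
                    "Behavior_type": ["buy", "buy", "buy"]})

# One groupby producing both columns at once:
# r = days since the most recent purchase, f = purchase count.
rf = toy.groupby("User_ID").agg(r=("datediff", "min"),
                                f=("Behavior_type", "count"))
```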
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
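How `pd.cut` assigns the r bins above can be checked on a few values. The bins form the intervals [1, 3], (3, 5], (5, 7], (7, 9], and `include_lowest=True` is what pulls r = 1 into the top-score bin (toy values; bins and labels copied from the code):

```python
import numpy as np
import pandas as pd

bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)  # 4 = most recent buyers, 1 = least recent

r = pd.Series([1, 3, 4, 9])
score = pd.cut(r, bins=bins_r, labels=labels_r, include_lowest=True)
# r=1 and r=3 fall in the first bin -> 4; r=4 in (3, 5] -> 3; r=9 -> 1
```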
df_rfm_2['f'].value_counts()
1 8688
2 5899
3 3849
4 2321
5 1545
6 955
7 639
8 421
9 275
10 188
11 148
12 102
13 63
14 59
15 58
16 34
17 24
18 22
20 16
19 14
21 12
23 12
22 9
29 6
25 5
31 4
30 4
26 3
36 3
28 3
32 3
27 2
51 1
33 1
65 1
47 1
84 1
69 1
24 1
72 1
57 1
42 1
43 1
60 1
61 1
39 1
Name: f, dtype: int64
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
df_rfm_2.head()
         r   f r_score f_score
User_ID
100      6   8       2       1
117      6  10       2       1
119      5   3       3       1
121      9   1       1       1
122      2   3       4       1
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null category
3 f_score 25400 non-null category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null int32
3 f_score 25400 non-null int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
                  r             f       r_score       f_score
count  25400.000000  25400.000000  25400.000000  25400.000000
mean       3.535906      3.019882      3.257441      1.003150
std        2.405743      3.039492      1.000799      0.065734
min        1.000000      1.000000      1.000000      1.000000
25%        1.000000      1.000000      3.000000      1.000000
50%        3.000000      2.000000      4.000000      1.000000
75%        5.000000      4.000000      4.000000      1.000000
max        9.000000     84.000000      4.000000      4.000000
r_avg = 3
f_avg = 1
# Labels: 重要价值客户 = high-value, 重要维护客户 = maintain, 重要发展客户 = develop, 一般客户 = general
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] > f_avg), '重要价值客户',
                             np.where((df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] <= f_avg), '重要维护客户',
                             np.where((df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] > f_avg), '重要发展客户',
                             np.where((df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] <= f_avg), '一般客户',
                             np.nan))))
df_rfm_2.head()
         r   f  r_score  f_score label_customer
User_ID
100      6   8        2        1         一般客户
117      6  10        2        1         一般客户
119      5   3        3        1         一般客户
121      9   1        1        1         一般客户
122      2   3        4        1       重要维护客户
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
   User_ID  r   f  r_score  f_score label_customer
0      100  6   8        2        1         一般客户
1      117  6  10        2        1         一般客户
2      119  5   3        3        1         一般客户
3      121  9   1        1        1         一般客户
4      122  2   3        4        1       重要维护客户
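The nested `np.where` works, but `np.select` reads flatter once there are four segments. A sketch using the same thresholds and labels as the code above (toy scores, one per segment):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"r_score": [4, 4, 1, 1], "f_score": [2, 1, 2, 1]})
r_avg, f_avg = 3, 1

# One condition per customer segment; everything else falls to 一般客户.
conds = [(toy["r_score"] > r_avg) & (toy["f_score"] > f_avg),
         (toy["r_score"] > r_avg) & (toy["f_score"] <= f_avg),
         (toy["r_score"] <= r_avg) & (toy["f_score"] > f_avg)]
labels = ['重要价值客户', '重要维护客户', '重要发展客户']
toy["label_customer"] = np.select(conds, labels, default='一般客户')
```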
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID', 'label_customer']]
df_rfm_2.shape
(25400, 2)
df = pd.merge(df, df_rfm_2, how='left', on='User_ID')
df.shape
(3833384, 8)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN
2        1  2576651       149192            pv  2017-11-25   09     6            NaN
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN
4. Product classification
In real e-commerce, products are classified by sales volume, profit, and other features, typically into states such as hot seller (爆款), strong seller (旺款), on sale (在售), and clearance (清仓). Here the number of buy events per item stands in for sales volume.
df_pro = df[df['Behavior_type'] == 'buy']
print(df_pro.shape)
print(df_pro['Item_ID'].value_counts().describe())
df_4 = df_pro['Item_ID'].value_counts().reset_index()  # in pandas < 2.0 this yields columns ['index', 'Item_ID'], with 'Item_ID' holding the counts
df_5 = df_4[df_4['Item_ID'] > 1]
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns = ['Item_ID', 'buy_num']
df_pro = pd.merge(df_pro, df_4, how='left', on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count 56726.000000
mean 1.352202
std 1.109932
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 58.000000
Name: Item_ID, dtype: float64
2 6990
3 1940
4 819
5 411
6 197
7 138
8 95
9 68
10 38
11 31
13 18
12 17
15 13
14 12
18 6
17 5
16 3
23 3
21 2
22 2
19 2
26 1
58 1
27 1
25 1
20 1
32 1
Name: Item_ID, dtype: int64
count 10816.000000
mean 2.847171
std 1.923511
min 2.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 58.000000
Name: Item_ID, dtype: float64
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer  buy_num
0      100  1603476      2951233           buy  2017-11-25   11     6           一般客户        3
1      100  2971043      4869428           buy  2017-11-25   21     6           一般客户        1
2      100   598929      2429887           buy  2017-11-27   13     1           一般客户        1
3      100  1046201      3002561           buy  2017-11-27   15     1           一般客户        2
4      100  1606258      4098232           buy  2017-11-27   21     1           一般客户        1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID', 'buy_num']]
df_pro.shape
(76705, 2)
df_pro.drop_duplicates(['Item_ID'], keep='first', inplace=True)
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
df['pro_type'] = np.where(df['buy_num'] >= 9, '爆款',
                 np.where(df['buy_num'] >= 3, '旺款',
                 np.where(df['buy_num'] > 0, '在售', '清仓')))  # NaN buy_num (never-bought items) fails every comparison and falls through to 清仓
del df['buy_num']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer pro_type
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN       在售
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN       在售
2        1  2576651       149192            pv  2017-11-25   09     6            NaN       清仓
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN       清仓
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN       清仓
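The three-way nested `np.where` for pro_type can also be written with `pd.cut` over the buy counts, since the thresholds (>= 9, >= 3, > 0) are just integer bins. A sketch on toy counts (toy data; NaN counts, i.e. never-bought items, fall outside every bin and map to 清仓 via `fillna`):

```python
import numpy as np
import pandas as pd

buy_num = pd.Series([12, 5, 1, np.nan])  # toy purchase counts
# (0, 2] -> 在售, (2, 8] -> 旺款, (8, inf) -> 爆款; same cutoffs as the np.where chain
pro_type = pd.cut(buy_num, bins=[0, 2, 8, np.inf],
                  labels=['在售', '旺款', '爆款'])
pro_type = pro_type.cat.add_categories(['清仓']).fillna('清仓')
```

This keeps pro_type as an ordered categorical, which also compresses well when the frame is written out.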
df.to_csv("bi_tmall_users.csv", sep=",", index=False)