电商案例分析
分析思路
明确需求和目标:通过电商数据对客户行为进行分析
数据收集:
内部数据、购买数据、爬取数据、调查问卷、其他收集
本案例数据为某平台的电商数据已收集完成;
数据预处理
数据整合:横向整合、纵向整合
数据清洗:缺失值、空值、异常值
数据转换:如日期
数据分析:
数据加载
利用pandas读取数据
% matplotlib inline
import pandas as pd
import matplotlib. pyplot as plt
data = pd. read_excel( './order.xlsx' , index_col= 'id' )
data. head( )
orderID
userID
goodsID
orderAmount
payment
chanelID
platformType
orderTime
payTime
chargeback
id
1
sys-2018-254118088
user-157213
PR000064
272.51
272.51
渠道-0396
APP
2018-02-14 12:20:36
2019-02-28 13:38:41
否
2
sys-2018-263312190
user-191121
PR000583
337.93
337.93
渠道-0765
Wech atMP
2018-08-14 09:40:34
2019-01-01 14:47:14
是
3
sys-2018-188208169
user-211918
PR000082
905.68
891.23
渠道-0530
We c hatMP
2018-11-02 20:17:25
2019-01-19 20:06:35
否
4
sys-2018-203314910
user-201322
PR000302
786.27
688.88
渠道-0530
WEB
2018-11-19 10:36:39
2019-08-07 12:24:35
否
5
sys-2018-283989279
user-120872
PR000290
550.77
542.51
渠道-9527
APP
2018-12-26 11:19:16
2019-10-01 07:42:43
否
查看数据整体信息和统计信息
字段说明:
orderID:订单ID
userID:用户ID
goodsID:商品ID
orderAmount:订单金额
payment:支付金额
chanelID:各渠道ID
platformType:渠道分类
orderTime:下单时间
payTime:下单支付时间
chargeback:是否退货
data. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 104557 entries, 1 to 104557
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 orderID 104557 non-null object
1 userID 104557 non-null object
2 goodsID 104557 non-null object
3 orderAmount 104557 non-null float64
4 payment 104557 non-null float64
5 chanelID 104549 non-null object
6 platformType 104557 non-null object
7 orderTime 104557 non-null datetime64[ns]
8 payTime 104557 non-null datetime64[ns]
9 chargeback 104557 non-null object
dtypes: datetime64[ns](2), float64(2), object(6)
memory usage: 8.8+ MB
data. describe( )
orderAmount
payment
count
104557.000000
104557.000000
mean
1049.681521
1167.494225
std
1054.409968
2174.024855
min
6.100000
-12.470000
25%
432.040000
383.660000
50%
679.320000
641.230000
75%
1248.280000
1252.630000
max
28465.250000
83270.053829
数据预处理
增加字段,便于后续分析
异常值处理
使用pandas的describe方法检测异常值
删除异常值、视为缺失值处理、对数转换、使用临界值填充等具体根据业务选择
缺失值处理:
删除缺失值,如果样本容量足够大,缺失部分数量很少的情况下可以选择此方法
填充缺失值:
数值类型:均值、中值、插值填充
类别类型:众数填充
重复值处理:重复值通常对数据分析没有作用,一般选择直接删除
from datetime import datetime
data[ 'interval' ] = ( data[ 'payTime' ] - data[ 'orderTime' ] ) . dt. total_seconds( )
starttime= datetime( 2019 , 1 , 1 )
endtime= datetime( 2019 , 12 , 31 , 23 , 59 , 59 )
data[ 'payTime' ] = pd. to_datetime( data[ 'payTime' ] )
data[ 'orderTime' ] = pd. to_datetime( data[ 'orderTime' ] )
data= data[ data[ 'orderTime' ] > starttime]
data= data[ data[ 'payTime' ] < endtime]
data. head( )
orderID
userID
goodsID
orderAmount
payment
chanelID
platformType
orderTime
payTime
chargeback
interval
id
6
sys-2019-279103297
user-146548
PR000564
425.20
425.20
渠道-0765
Wech atMP
2019-01-01 00:12:23
2019-01-01 00:13:37
否
74.0
7
sys-2019-316686066
user-104210
PR000709
1764.37
1707.04
渠道-0396
We c hatMP
2019-01-01 00:23:06
2019-01-01 00:23:32
否
26.0
8
sys-2019-306447069
user-104863
PR000499
499.41
480.42
渠道-0007
Wech atMP
2019-01-01 01:05:50
2019-01-01 01:06:17
否
27.0
9
sys-2019-290267674
user-206155
PR000253
1103.00
1050.95
渠道-0330
APP
2019-01-01 01:16:12
2019-01-01 01:16:25
否
13.0
10
sys-2019-337079027
user-137939
PR000768
465.41
465.41
渠道-9527
AL i MP
2019-01-01 01:31:00
2019-01-01 01:31:36
否
36.0
data[ data[ 'interval' ] < 0 ]
orderID
userID
goodsID
orderAmount
payment
chanelID
platformType
orderTime
payTime
chargeback
interval
id
14852
sys-2019-303932121
user-185935
PR000331
433.99
310.79
渠道-0283
APP
2019-03-19 13:34:13
2019-03-07 10:19:28
否
-1048485.0
15638
sys-2019-349926661
user-104304
PR000190
591.18
570.88
渠道-0765
APP
2019-03-23 13:53:31
2019-02-09 13:24:49
否
-3630522.0
16162
sys-2019-324041709
user-230753
PR000074
372.75
372.75
渠道-0007
WEB
2019-03-25 17:09:07
2019-02-06 09:57:19
是
-4086708.0
16383
sys-2019-314295685
user-182087
PR000082
1733.77
1694.03
渠道-0007
Wech atMP
2019-03-26 22:47:12
2019-02-07 19:58:25
否
-4070927.0
17003
sys-2019-348295045
user-109824
PR000134
713.46
707.36
渠道-0283
Wech atMP
2019-03-30 14:18:54
2019-02-07 15:14:04
否
-4403090.0
print ( data[ data[ 'interval' ] < 0 ] . index)
data. drop( data[ data[ 'interval' ] < 0 ] . index, inplace= True )
Int64Index([], dtype='int64', name='id')
data[ data[ 'interval' ] > 1800 ]
data. drop( data[ data[ 'interval' ] > 1800 ] . index, inplace= True )
data[ data[ 'orderAmount' ] < 0 ]
data[ data[ 'payment' ] < 0 ]
data. drop( data[ data[ 'payment' ] < 0 ] . index, inplace= True )
print ( data[ 'platformType' ] . unique( ) )
data[ 'platformType' ] . str . replace( ' ' , '' )
data[ 'platformType' ] = data[ 'platformType' ] . str . replace( ' ' , '' )
print ( '************************' )
print ( data[ 'platformType' ] . unique( ) )
['WechatMP' 'APP' 'ALiMP' 'WEB' 'WechatShop' 'Wap']
************************
['WechatMP' 'APP' 'ALiMP' 'WEB' 'WechatShop' 'Wap']
data[ 'orderID' ] . unique( ) . size
data. duplicated( subset= 'orderID' )
data[ data. duplicated( subset= 'orderID' ) ]
data. drop_duplicates( subset= 'orderID' , inplace= True )
print ( data[ data[ 'goodsID' ] == 'PR000000' ] . shape)
data[ data[ 'goodsID' ] == 'PR000000' ]
data. drop( data[ data[ 'goodsID' ] == 'PR000000' ] . index, inplace= True )
(0, 11)
data[ 'chanelID' ] . isnull( )
data[ 'chanelID' ] . fillna( data[ 'chanelID' ] . mode( ) [ 0 ] , inplace= True )
data. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103146 entries, 6 to 104301
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 orderID 103146 non-null object
1 userID 103146 non-null object
2 goodsID 103146 non-null object
3 orderAmount 103146 non-null float64
4 payment 103146 non-null float64
5 chanelID 103146 non-null object
6 platformType 103146 non-null object
7 orderTime 103146 non-null datetime64[ns]
8 payTime 103146 non-null datetime64[ns]
9 chargeback 103146 non-null object
10 interval 103146 non-null float64
dtypes: datetime64[ns](2), float64(3), object(6)
memory usage: 9.4+ MB
data[ 'discount' ] = data[ 'payment' ] / data[ 'orderAmount' ]
data_discount = data[ data[ 'discount' ] <= 1 ]
discount_mean = data_discount[ 'discount' ] . mean( )
mean = data[ data[ 'discount' ] <= 1 ] . discount. sum ( ) /