电商案例分析
分析思路
- 明确需求和目标:通过电商数据对客户行为进行分析
- 数据收集:
- 内部数据、购买数据、爬取数据、调查问卷、其他收集
- 本案例数据为某平台的电商数据已收集完成;
- 数据预处理
- 数据整合:横向整合、纵向整合
- 数据清洗:缺失值、空值、异常值
- 数据转换:如日期
- 数据分析:
数据加载
利用pandas读取数据
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('./order.xlsx',index_col='id')
data.head()
|
orderID |
userID |
goodsID |
orderAmount |
payment |
chanelID |
platformType |
orderTime |
payTime |
chargeback |
id |
|
|
|
|
|
|
|
|
|
|
1 |
sys-2018-254118088 |
user-157213 |
PR000064 |
272.51 |
272.51 |
渠道-0396 |
APP |
2018-02-14 12:20:36 |
2019-02-28 13:38:41 |
否 |
2 |
sys-2018-263312190 |
user-191121 |
PR000583 |
337.93 |
337.93 |
渠道-0765 |
Wech atMP |
2018-08-14 09:40:34 |
2019-01-01 14:47:14 |
是 |
3 |
sys-2018-188208169 |
user-211918 |
PR000082 |
905.68 |
891.23 |
渠道-0530 |
We c hatMP |
2018-11-02 20:17:25 |
2019-01-19 20:06:35 |
否 |
4 |
sys-2018-203314910 |
user-201322 |
PR000302 |
786.27 |
688.88 |
渠道-0530 |
WEB |
2018-11-19 10:36:39 |
2019-08-07 12:24:35 |
否 |
5 |
sys-2018-283989279 |
user-120872 |
PR000290 |
550.77 |
542.51 |
渠道-9527 |
APP |
2018-12-26 11:19:16 |
2019-10-01 07:42:43 |
否 |
查看数据整体信息和统计信息
字段说明:
- orderID:订单ID
- userID:用户ID
- goodsID:商品ID
- orderAmount:订单金额
- payment:支付金额
- chanelID:各渠道ID
- platformType:渠道分类
- orderTime:下单时间
- payTime:下单支付时间
- chargeback:是否退货
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 104557 entries, 1 to 104557
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 orderID 104557 non-null object
1 userID 104557 non-null object
2 goodsID 104557 non-null object
3 orderAmount 104557 non-null float64
4 payment 104557 non-null float64
5 chanelID 104549 non-null object
6 platformType 104557 non-null object
7 orderTime 104557 non-null datetime64[ns]
8 payTime 104557 non-null datetime64[ns]
9 chargeback 104557 non-null object
dtypes: datetime64[ns](2), float64(2), object(6)
memory usage: 8.8+ MB
data.describe()
|
orderAmount |
payment |
count |
104557.000000 |
104557.000000 |
mean |
1049.681521 |
1167.494225 |
std |
1054.409968 |
2174.024855 |
min |
6.100000 |
-12.470000 |
25% |
432.040000 |
383.660000 |
50% |
679.320000 |
641.230000 |
75% |
1248.280000 |
1252.630000 |
max |
28465.250000 |
83270.053829 |
数据预处理
- 增加字段,便于后续分析
- 异常值处理
- 使用pandas的describe方法检测异常值
- 删除异常值、视为缺失值处理、对数转换、使用临界值填充等具体根据业务选择
- 缺失值处理:
- 删除缺失值,如果样本容量足够大,缺失部分数量很少的情况下可以选择此方法
- 填充缺失值:
- 数值类型:均值、中值、插值填充
- 类别类型:众数填充
- 重复值处理:重复值通常对数据分析没有作用,一般选择直接删除
from datetime import datetime
data['interval']=(data['payTime']-data['orderTime']).dt.total_seconds()
starttime=datetime(2019,1,1)
endtime=datetime(2019,12,31,23,59,59)
data['payTime']=pd.to_datetime(data['payTime'])
data['orderTime']=pd.to_datetime(data['orderTime'])
data=data[data['orderTime']>starttime]
data=data[data['payTime']<endtime]
data.head()
|
orderID |
userID |
goodsID |
orderAmount |
payment |
chanelID |
platformType |
orderTime |
payTime |
chargeback |
interval |
id |
|
|
|
|
|
|
|
|
|
|
|
6 |
sys-2019-279103297 |
user-146548 |
PR000564 |
425.20 |
425.20 |
渠道-0765 |
Wech atMP |
2019-01-01 00:12:23 |
2019-01-01 00:13:37 |
否 |
74.0 |
7 |
sys-2019-316686066 |
user-104210 |
PR000709 |
1764.37 |
1707.04 |
渠道-0396 |
We c hatMP |
2019-01-01 00:23:06 |
2019-01-01 00:23:32 |
否 |
26.0 |
8 |
sys-2019-306447069 |
user-104863 |
PR000499 |
499.41 |
480.42 |
渠道-0007 |
Wech atMP |
2019-01-01 01:05:50 |
2019-01-01 01:06:17 |
否 |
27.0 |
9 |
sys-2019-290267674 |
user-206155 |
PR000253 |
1103.00 |
1050.95 |
渠道-0330 |
APP |
2019-01-01 01:16:12 |
2019-01-01 01:16:25 |
否 |
13.0 |
10 |
sys-2019-337079027 |
user-137939 |
PR000768 |
465.41 |
465.41 |
渠道-9527 |
AL i MP |
2019-01-01 01:31:00 |
2019-01-01 01:31:36 |
否 |
36.0 |
data[data['interval']<0]
|
orderID |
userID |
goodsID |
orderAmount |
payment |
chanelID |
platformType |
orderTime |
payTime |
chargeback |
interval |
id |
|
|
|
|
|
|
|
|
|
|
|
14852 |
sys-2019-303932121 |
user-185935 |
PR000331 |
433.99 |
310.79 |
渠道-0283 |
APP |
2019-03-19 13:34:13 |
2019-03-07 10:19:28 |
否 |
-1048485.0 |
15638 |
sys-2019-349926661 |
user-104304 |
PR000190 |
591.18 |
570.88 |
渠道-0765 |
APP |
2019-03-23 13:53:31 |
2019-02-09 13:24:49 |
否 |
-3630522.0 |
16162 |
sys-2019-324041709 |
user-230753 |
PR000074 |
372.75 |
372.75 |
渠道-0007 |
WEB |
2019-03-25 17:09:07 |
2019-02-06 09:57:19 |
是 |
-4086708.0 |
16383 |
sys-2019-314295685 |
user-182087 |
PR000082 |
1733.77 |
1694.03 |
渠道-0007 |
Wech atMP |
2019-03-26 22:47:12 |
2019-02-07 19:58:25 |
否 |
-4070927.0 |
17003 |
sys-2019-348295045 |
user-109824 |
PR000134 |
713.46 |
707.36 |
渠道-0283 |
Wech atMP |
2019-03-30 14:18:54 |
2019-02-07 15:14:04 |
否 |
-4403090.0 |
print(data[data['interval']<0].index)
data.drop(data[data['interval']<0].index,inplace=True)
Int64Index([], dtype='int64', name='id')
data[data['interval']>1800]
data.drop(data[data['interval']>1800].index,inplace=True)
data[data['orderAmount']<0]
data[data['payment']<0]
data.drop(data[data['payment']<0].index,inplace=True)
print(data['platformType'].unique())
data['platformType'].str.replace(' ','')
data['platformType']=data['platformType'].str.replace(' ','')
print('************************')
print(data['platformType'].unique())
['WechatMP' 'APP' 'ALiMP' 'WEB' 'WechatShop' 'Wap']
************************
['WechatMP' 'APP' 'ALiMP' 'WEB' 'WechatShop' 'Wap']
data['orderID'].unique().size
data.duplicated(subset='orderID')
data[data.duplicated(subset='orderID')]
data.drop_duplicates(subset='orderID',inplace=True)
print(data[data['goodsID']=='PR000000'].shape)
data[data['goodsID']=='PR000000']
data.drop(data[data['goodsID']=='PR000000'].index,inplace=True)
(0, 11)
data['chanelID'].isnull()
data['chanelID'].fillna(data['chanelID'].mode()[0],inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103146 entries, 6 to 104301
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 orderID 103146 non-null object
1 userID 103146 non-null object
2 goodsID 103146 non-null object
3 orderAmount 103146 non-null float64
4 payment 103146 non-null float64
5 chanelID 103146 non-null object
6 platformType 103146 non-null object
7 orderTime 103146 non-null datetime64[ns]
8 payTime 103146 non-null datetime64[ns]
9 chargeback 103146 non-null object
10 interval 103146 non-null float64
dtypes: datetime64[ns](2), float64(3), object(6)
memory usage: 9.4+ MB
data['discount'] = data['payment'] / data['orderAmount']
data_discount = data[data['discount'] <= 1]
discount_mean = data_discount['discount'].mean()
mean = data[data['discount'] <= 1].discount.sum() / data