电商数据分析笔记

最新推荐文章于 2023-03-22 09:30:59 发布

KuroSiroMurasaki

最新推荐文章于 2023-03-22 09:30:59 发布

阅读量1.9k

点赞数 2

本文链接：https://blog.csdn.net/weixin_44714634/article/details/111179049

版权

电商数据分析

本笔记宗旨在于多维度分析英国某电商平台的销售数据，了解平台的日常业务状况，辅助业务运营及业务策划提出相关建议。此数据列出了一年内（2010/12/01~2011/12/09）约4000名客户的购买内容。

数据来源：https://www.kaggle.com/carrie1/ecommerce-data

1.准备数据

1.1读取数据

首先加载数据并查看了数据结构以及字段类型。

导包

import warnings
warnings.filterwarnings('ignore') # 有时代码正常运行，但弹出警告。可用警告过滤器进行控制。
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
#import seaborn as sns
from pylab import mpl
# 设置中文字体
mpl.rcParams["font.sans-serif"] = ["SimHei"]
mpl.rcParams["axes.unicode_minus"] = False
import plotly.graph_objs as go
from plotly.offline import iplot
from matplotlib import font_manager as fm
from  matplotlib import cm

导入数据

os.getcwd() # 查看当前工作目录
os.chdir('C:/Users/Desktop/电商数据') # 修改当前工作目录
df = pd.read_csv('data.csv',encoding='ISO-8859-1',dtype={
   'CustomerID':str}) # ISO-8859-1 单字节编码，表示字符系列0-255.

查看数据结构

print("数据结构：",df.shape)

数据结构： (541909, 8)

查看字段结构

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  object 
 7   Country      541909 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 33.1+ MB

查看前5行数据

df.head()

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	12/1/2010 8:26	2.55	17850	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	12/1/2010 8:26	3.39	17850	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	12/1/2010 8:26	2.75	17850	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	12/1/2010 8:26	3.39	17850	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	12/1/2010 8:26	3.39	17850	United Kingdom

数据大小：541909行数据，8个字段。

字段说明：

InvoiceNo :订单编号。一般为6位数，C开头表示退货订单。
StockCode ：产品代码。一般由5位整数组成。
Description：产品描述。
Quantity ：下单数量。
InvoiceDate：下单时间。
UnitPrice：产品单价（单位：英镑）。
CustomerID：客户编号。
Country：客户所在国家或地区。

1.2数据清洗

清洗数据，包括对缺失值，异常值以及重复值的处理。

缺失值处理

查看每个字段中空值所占比。

df.apply(lambda x: sum(x.isnull())/len(x),axis=0)

InvoiceNo      0.000000
StockCode      0.000000
Description    0.002683
Quantity       0.000000
InvoiceDate    0.000000
UnitPrice      0.000000
CustomerID     0.249267
Country        0.000000
dtype: float64

两个字段含有空值,分别为"CustomerID"和"Description"。将俩字段的空值用"NULL"字符串替换。

df['CustomerID'].fillna("NULL",inplace=True)
df['Description'].fillna("NULL",inplace=True)

异常值处理

1.时间异常值检查

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
print('min:{},max:{}'.format(df['InvoiceDate'].min(),df['InvoiceDate'].max()))

min:2010-12-01 08:26:00,max:2011-12-09 12:50:00

时间字段均在在2010年12月到2011年12月之间，故无异常值。

2.下单数量与产品单价异常值检查

查看"Quantity"字段是否有小于等于0的数据。

print('等于0:{} , 小于0:{}'.format(df['Quantity'].loc[df['Quantity']==0].count(),df['Quantity'].loc[df['Quantity']<0].count()))

等于0:0 , 小于0:10624

没有下单数量为0的数据，下单数量小于0的数据有10624条。在这里下单数量为负的订单，视为退货订单。

查看"UnitPrice"字段是否有小于等于0的数据。

print('等于0:{} , 小于0:{}'.format(df['UnitPrice'].loc[df['UnitPrice']==0].count(),df['UnitPrice'].loc[df['UnitPrice']<0].count()))

等于0:2515 , 小于0:2

产品单价小于0的数据有2条，等于0的数据有2515条。产品单价小于0视为异常值予以删除处理，产品单价为0视为赠品。

df.drop(df[df['UnitPrice']<0].index,inplace=True)

3.产品编号异常值检查

产品编号一般为5位整数或5位整数加一个字符构成。

unique_stock = df.drop_duplicates(['StockCode'])
normal_stock = unique_stock[unique_stock['StockCode'].str.contains('^[0-9]{5}[a-zA-Z]*',regex=True)]['StockCode'].unique()
unique_stock_list = unique_stock['StockCode'].unique().tolist()
normal_stock_list = normal_stock.tolist()
special_codes = []
for x in unique_stock_list:
    if x not in normal_stock_list:
        special_codes.append(x)

print('总产品编号：{}种，正常产品编号:{}种，特殊产品编号:{}种'.format(len(unique_stock_list),len(normal_stock_list),len(special_codes)))

总产品编号：4070种，正常产品编号:4037种，特殊产品编号:33种

可见本数据涉及产品编号共有4070种，其中4037种符合规范，33种为特殊编号。具体查看特殊编号。

print(special_codes)

['POST', 'D', 'C2', 'DOT', 'M', 'BANK CHARGES', 'S', 'AMAZONFEE', 'DCGS0076', 'DCGS0003', 'gift_0001_40', 'DCGS0070', 'm', 'gift_0001_50', 'gift_0001_30', 'gift_0001_20', 'DCGS0055', 'DCGS0072', 'DCGS0074', 'DCGS0069', 'DCGS0057', 'DCGSSBOY', 'DCGSSGIRL', 'gift_0001_10', 'PADS', 'DCGS0004', 'DCGS0073', 'DCGS0071', 'DCGS0068', 'DCGS0067', 'DCGS0066P', 'B', 'CRUK']

查看特殊产品编号所对应的产品描述信息。

for code in special_codes:
    print('{} -> {}'.format(code,unique_stock[unique_stock['StockCode']==code]['Description'].values))

产品编号	产品描述
POST，C2，DOT	邮费
D	折扣
M，m	说明书
BANK CHARGES	银行收费
S	样品
AMAZONFEE	亚马逊收费
PADS	护垫类商品
gift_…	礼品券
DCGS0069,DCGSSBOY,DCG…	宠物项圈，男士包，可推测其他产品编号对应的也是某商品。
B	坏账调整
CRUK	向英国癌症研究所（CRUK）支付部分销售额的举措。（推测）

除了描述某具体产品的产品编号“DCG…”，“PADS”，以及折扣字段“D”以外其他类别不直接影响销售。因此可删除这些内容。

import re
for code in special_codes:
    temp =re.match(r'^DCG(.*)',code)
    if not temp and code!='PADS' and code!='D':
        df.drop(df[df['StockCode']==code].index,inplace=True)
df.shape

(539040, 8)

去重

删除重复数据。

df.drop_duplicates(inplace=True)
df.shape

(533778, 8)

1.3了解字段

统计字段唯一值

df[['InvoiceNo','StockCode','CustomerID','Country']].nunique()

InvoiceNo     25361
StockCode      4055
CustomerID     4364
Country          38
dtype: int64

可知此数据涉及25361份订单，4055种产品，4363（除去客户编号为NULL的数据）名客户以及他们分布在38个不同国家。

增加订单总额字段

df['TotalPrice'] =df['Quantity']*df['UnitPrice']

增加时间字段

df['Date'] = df['InvoiceDate'].astype("str").str[0:10]
df['Year'] = [x.split('-')[0] for x in df['Date']]
df['Month'] = [x.split('-')[1] for x in df['Date']] 
df['Hour'] = df['InvoiceDate'].astype("str").str[11:13]
df.drop("Date",axis=1,inplace=True)

增加订单状态字段

订单状态分为四种：销售订单，退货订单，赠品，退赠品。

def test(quantity,unitprice):
    if quantity>0 and unitprice>0:
        return "sales"
    elif quantity<0 and unitprice>0:
        return "returns"
    elif quantity>0 and unitprice==0:
        return "gift_sales"
    else:
        return "gift_returns"
    
df['State'] = df.apply(lambda row : test(row['Quantity'],row['UnitPrice']),axis=1)

2.时间维度分析

此阶段从销售金额和销售数量两个角度分析平台每月的销售情况。其次查看每月以及每天不同时间段需要处理的订单数据数量，了解平台的繁忙程度，有助于安排人力。

2.1月销售情况

构造销售数据和退货数据

data_sales = df[df['State']=='sales']
data_returns = df[df['State']=='returns']

月销售金额

total_price_sales = pd.pivot_table(data_sales,index='Year',columns='Month',values='TotalPrice',aggfunc={
   'TotalPrice':np.sum})
total_price_sales.sort_index(axis=1,inplace=True)
total_price_sales

Month	01	02	03	04	05	06	07	08	09	10	11	12
Year
2010	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	775638.36
2011	670380.24	507799.87	689833.51	515377.991	739953.0	737558.98	688219.341	724216.5	1028320.381	1103314.25	1452115.98	614490.92

月退款金额

total_price_returns = pd.pivot_table(data_returns,index='Year',columns='Month',values='TotalPrice',aggfunc={
   'TotalPrice':np.sum})
total_price_returns.sort_index(axis=1,inplace=True)
total_price_returns

Month	01	02	03	04	05	06	07	08	09	10	11	12
Year
2010	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-18241.84
2011	-91548.02	-8620.51	-10728.51	-35315.68	-9042.65	-14026.32	-11571.29	-23939.61	-17307.51	-41669.11	-25466.29	-174120.04

月销售收入 = 月销售金额 - 月退款金额（退款金额本身为负，因此实际操作时用了加号。）

total_price = total_price_sales + total_price_returns
total_price

Month	01	02	03	04	05	06	07	08	09	10	11	12
Year
2010	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	757396.52
2011	578832.

最低0.47元/天解锁文章

KuroSiroMurasaki

关注

2
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
电商数据分析笔记

电商数据分析本笔记宗旨在于多维度分析英国某电商平台的销售数据，了解平台的日常业务状况，辅助业务运营及业务策划提出相关建议。此数据列出了一年内（2010/12/01~2011/12/09）约4000名客户的购买内容。数据来源：https://www.kaggle.com/carrie1/ecommerce-data1.准备数据1.1读取数据首先加载数据并查看了数据结构以及字段类型。导包import warningswarnings.filterwarnings('ignore') # 有时代码
复制链接

扫一扫