题目:电商数据分析案例
数据集来源
本数据集来自kaggle,数据集链接如下:
https://www.kaggle.com/carrie1/ecommerce-data
分析目的:了解该电商平台的销售情况(该网站是一个以批发为主的电商平台)
分析思路框架:
1、产品维度
2、顾客维度
3、时间维度
查看数据集信息
import pandas as pd
import os
import matplotlib.pyplot as plt
os.chdir("F:\jupyter\E-commerce-data")
data = pd.read_csv("data.csv",encoding="ISO-8859-1")
data.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
数据集共有8个变量,分别为订单编号、产品编号、产品描述、购买数量、购买时间、产品单价、顾客ID、顾客所在地区
数据预处理
数据缺失值处理
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
更改ID的数据类型
data["CustomerID"] = data["CustomerID"].astype("str")
产品描述和顾客ID有缺失变量,由于本次分析不涉及产品描述,故将其删掉。缺失的顾客ID用unknown表示
data = data.drop(["Description"],axis = 1)
data["CustomerID"] = data["CustomerID"].fillna("unknown")
处理重复值
data.duplicated().sum()
5270
data = data.drop_duplicates()
处理异常值
data.describe()
Quantity | UnitPrice | |
---|---|---|
count | 536639.000000 | 536639.000000 |
mean | 9.619500 | 4.632660 |
std | 219.130206 | 97.233299 |
min | -80995.000000 | -11062.060000 |
25% | 1.000000 | 1.250000 |
50% | 3.000000 | 2.080000 |
75% | 10.000000 | 4.130000 |
max | 80995.000000 | 38970.000000 |
产品数量为负表示退货,产品单价为负没有意义,为异常值
data[data["UnitPrice"]<0].count()
InvoiceNo 0
StockCode 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0
dtype: int64
data = data[data["UnitPrice"]>=0]
data[data["UnitPrice"]==0].count()
InvoiceNo 2515
StockCode 2515
Quantity 2515
InvoiceDate 2515
UnitPrice 2515
CustomerID 2515
Country 2515
dtype: int64
单价为0的产品可能是赠品
变量处理
data["Total"]=data["Quantity"]*data["UnitPrice"]
data[["Date","Time"]]=data["InvoiceDate"].str.split(" ",expand=True)
data[["Month","Day","Year"]] = data["Date"].str.split("/",expand=True)
data["Hour"]=data["Time"].str.split(":",expand=True)[0].astype("int")
将数据集分为两部分,一部分是产品购买信息,一部分是退货信息
data_buy = data[data["Quantity"] >= 0]
data_return = data[data["Quantity"] < 0]