Basic E-commerce Data Analysis

Latest recommended article published 2024-05-11 22:04:49

mokunoko

Tags: sklearn, machine learning, python

Article link: https://blog.csdn.net/weixin_55359329/article/details/121589693

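First we load the dataset. The stockcode column can be dropped up front, since the analysis below does not use it. Whenever you get a new dataset, the first step is to clean it according to your needs, and only then move on to visualization and analysis.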

I analyze the data from two angles: products and customers.

I. Data cleaning

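Our DataFrame is named df_bussiness. Start with df_bussiness.describe() and df_bussiness.info() to look for anomalies and anything that needs a closer look.

Looking at the describe() output, a few things stand out:

1. The min values are negative. Price and quantity should not normally be negative, so one guess is that these rows are cancelled orders or something similar.

2. The max values are far larger than the 75% quantile. Values this extreme would distort the whole analysis.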

df_bussiness.isna().sum()

3. The missing-value counts show that customerid has many missing values, and description has a few.
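
The inspection step can be tried on a small made-up table before running it on the real data. The sketch below assumes the column names used in this post (InvoiceNo, Description, Quantity, UnitPrice, CustomerID). The rows are invented for illustration, and `toy` is a hypothetical stand-in for df_bussiness.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the problems described above (not real data)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536367", "536368", "536369"],
    "Description": ["MUG", None, "LAMP", "CLOCK", "BOWL"],
    "Quantity": [6, 2, -1, 5000, 3],
    "UnitPrice": [2.5, 1.0, 4.0, 0.5, 3.0],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan, 12583.0],
})

# Summary statistics for the numeric columns of interest
summary = toy[["Quantity", "UnitPrice"]].describe()

# Number of missing values in each column
missing = toy.isna().sum()

print(summary.loc[["min", "75%", "max"]])
print(missing)
```

In this toy table the Quantity minimum is negative and the maximum sits far above the 75% quantile, which is the same pattern the list above describes.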

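We handle these issues as follows:

1. df_bussiness = df_bussiness.dropna() drops the rows with missing values.

2. From the histogram, Quantity values below 50 cover the bulk of the data, so rows with Quantity of 50 or more are dropped. A scatter plot would also work for spotting outliers; it is a matter of preference.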

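# Imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Handle extreme max values: inspect the Quantity distribution to pick a cutoff
fig = px.histogram(df_bussiness, x="Quantity")
fig.show()

# Keep only the rows below the chosen cutoffs
df_bussiness = df_bussiness[df_bussiness.Quantity < 50]
df_bussiness = df_bussiness[df_bussiness.UnitPrice < 20]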

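3. Most unitprice values are below 20, so only rows with UnitPrice below 20 are kept for the analysis.

4. The code below shows that tail(20) contains several invoice numbers starting with C, which mark cancelled orders (cancel).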

df_bussiness.InvoiceNo.value_counts().tail(20)

# Drop cancelled orders; astype(str) guards against invoice numbers stored as integers
df_goods = df_bussiness[~df_bussiness["InvoiceNo"].astype(str).str.startswith("C")].copy()

With the orders starting with C removed, the analysis can begin.
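
The four cleaning steps can also be collected into one function, which makes the cutoffs easy to change. This is a minimal sketch, not the post's original code: `clean_orders`, its parameter names, and the toy rows are all hypothetical, and the default cutoffs simply repeat the values 50 and 20 chosen above.

```python
import pandas as pd

def clean_orders(df, max_quantity=50, max_unit_price=20):
    """Apply the four cleaning steps and return a new DataFrame."""
    out = df.dropna()                                # step 1: drop rows with missing values
    out = out[out["Quantity"] < max_quantity]        # step 2: quantity cutoff
    out = out[out["UnitPrice"] < max_unit_price]     # step 3: unit price cutoff
    cancelled = out["InvoiceNo"].astype(str).str.startswith("C")
    return out[~cancelled].copy()                    # step 4: drop cancelled orders

# Invented rows, one for each kind of problem (not real data)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536367", "536368", "536369"],
    "Description": ["MUG", None, "LAMP", "CLOCK", "BOWL"],
    "Quantity": [6, 2, -1, 5000, 3],
    "UnitPrice": [2.5, 1.0, 4.0, 0.5, 35.0],
    "CustomerID": [17850.0, 13047.0, 13047.0, 12583.0, 12583.0],
})

df_goods_toy = clean_orders(toy)
print(df_goods_toy)
```

Both cutoffs are upper bounds only. In this toy table the negative quantity disappears because it belongs to a cancelled invoice; whether every negative row in the real data is a cancellation is not checked here.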

II. Data analysis (on the order DataFrame df_goods)

1. Product dimension: find the best-selling products (product name, quantity ordered)

# Total quantity sold per product, top 20
df_popu_pro = df_goods[["Description", "Quantity"]]
df_popu_pro = df_popu_pro.groupby("Description").sum().sort_values(by="Quantity", ascending=False)[:20]
se_popu_Quantity = df_popu_pro.Quantity
sns.set()
plt.title("Top 20 products by quantity sold")
plt.xticks(rotation=90)
# Pass x and y as keywords; newer seaborn versions no longer accept them positionally
sns.barplot(x=se_popu_Quantity.index, y=se_popu_Quantity.values)
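
The ranking itself can be checked without any plotting. The sketch below uses invented rows and a hypothetical helper `top_products`, and it swaps the sort-then-slice step for `Series.nlargest`, which gives the same top-n result.

```python
import pandas as pd

# Invented order lines (not real data)
toy_goods = pd.DataFrame({
    "Description": ["MUG", "LAMP", "MUG", "CLOCK", "LAMP", "MUG"],
    "Quantity": [6, 2, 4, 1, 5, 3],
})

def top_products(df, n=20):
    """Return the n products with the largest total quantity sold."""
    totals = df.groupby("Description")["Quantity"].sum()
    return totals.nlargest(n)

top2 = top_products(toy_goods, n=2)
print(top2)
```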

2. Customer dimension: take the 20 customers with the most orders and visualize them

# Find the best customers, i.e. those with the most orders
# One row per (customer, invoice, country) combination
df_order = df_goods[["CustomerID", "InvoiceNo", "Country"]].drop_duplicates().reset_index(drop=True)

# value_counts() already sorts in descending order
se_order = df_order.CustomerID.value_counts()

fig = plt.figure(figsize=(10, 5))
top20 = se_order[:20]
sns.barplot(x=np.arange(len(top20)), y=top20.values)
# Label the bars after plotting so the customer IDs are not overwritten
plt.xticks(np.arange(len(top20)), top20.index, rotation=90)
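
Counting orders per customer comes down to counting distinct invoice numbers, because one invoice usually spans several product lines. A minimal sketch on invented rows, using `nunique` instead of the de-duplicate-then-count route above:

```python
import pandas as pd

# Invented order lines: invoice 1001 has two product lines (not real data)
toy_goods = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["1001", "1001", "1002", "2001", "3001", "3002", "3003"],
})

# Distinct invoices per customer, largest first
orders_per_customer = (
    toy_goods.groupby("CustomerID")["InvoiceNo"]
    .nunique()
    .sort_values(ascending=False)
)
print(orders_per_customer)
```

Customer 1 appears on three rows but has only two distinct invoices, which is why the rows are not counted directly.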

3. How order volume changes over time

df_goods["InvoiceDate"] =pd.to_datetime(df_goods.InvoiceDate)
df_order_time = df_goods.set_index("InvoiceDate")

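Convert InvoiceDate to datetime and set it as the index, in preparation for the time-series plots that follow.

First, monthly total sales over time. We can see a steady rise from late 2010 through 2011.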

df_order_time["date"] = df_order_time.index.date
df_order_time["Totalsell"] = df_order_time.Quantity * df_order_time.UnitPrice
plt.figure(figsize=(10, 4), dpi=80)
# Resample only the numeric column; summing the object-typed "date" column can fail
# "M" means month-end (pandas 2.2+ prefers the alias "ME")
se_total_data = df_order_time["Totalsell"].resample("M").sum()

# Label each point with its year and month
plt.xticks(np.arange(len(se_total_data)), se_total_data.index.strftime("%Y-%m"), rotation=45)

plt.plot(np.arange(len(se_total_data)), se_total_data.values)
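
The monthly totals can be verified by hand on a few rows. The sketch below uses invented data and groups by a monthly Period rather than calling resample; that is a substitution made here so the example does not depend on the month-end alias, not the method used in the post.

```python
import pandas as pd

# Invented order lines spanning two months (not real data)
toy_goods = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-15 10:00",
                    "2011-01-04 09:30", "2011-01-20 14:45"],
    "Quantity": [6, 2, 4, 10],
    "UnitPrice": [2.5, 1.0, 3.0, 1.0],
})
toy_goods["InvoiceDate"] = pd.to_datetime(toy_goods["InvoiceDate"])
toy_goods["Totalsell"] = toy_goods["Quantity"] * toy_goods["UnitPrice"]

# Sum sales within each calendar month
month = toy_goods["InvoiceDate"].dt.to_period("M")
monthly = toy_goods.groupby(month)["Totalsell"].sum()
print(monthly)
```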

Next, look at total sales per day.

# Total sales per calendar day
df_total_data = df_order_time[["Totalsell", "date"]]
se_total_data2 = df_total_data.groupby("date").sum()
plt.figure(figsize=(10, 6))
plt.xticks(rotation=45)
plt.plot(se_total_data2.index, se_total_data2.Totalsell)

4. The same data also shows which countries the orders come from and where sales are strongest

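# Number of orders per country, top 20 (each df_order row is one invoice)
countries = df_order.groupby("Country").count().sort_values(by="CustomerID", ascending=False)[:20]
# drop() returns a new DataFrame, so assign the result back
countries = countries.drop(columns="InvoiceNo")
countries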
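
Because each row of df_order is one invoice, the count above measures orders per country, not customers per country. Where both figures are wanted, they can be computed side by side. A minimal sketch on invented rows; the output column names `orders` and `customers` are hypothetical:

```python
import pandas as pd

# Invented invoices, one row per invoice (not real data)
toy_order = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3, 3],
    "InvoiceNo": ["1001", "1002", "2001", "2002", "3001", "3002", "3003"],
    "Country": ["United Kingdom"] * 4 + ["France"] * 3,
})

# Distinct invoices and distinct customers for each country
by_country = (
    toy_order.groupby("Country")
    .agg(orders=("InvoiceNo", "nunique"), customers=("CustomerID", "nunique"))
    .sort_values("orders", ascending=False)
)
print(by_country)
```

In this toy table France has three orders but only one customer, so the two rankings can differ.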

That concludes this simple analysis of df_bussiness!
