Data Visualization: E-commerce Orders + Basic Charts

Latest recommended article published 2022-10-18 11:53:02

niki__

Column: python+人工智能学习 (python + AI learning); tags: python

Article link: https://blog.csdn.net/niki__/article/details/121624672


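This article covers data visualization with Matplotlib and Seaborn, including line charts, scatter plots, bar charts, pie charts, and histograms. It also shows how to process order data from an e-commerce website (extraction, cleaning, and analysis), touching on key fields such as payment duration and order amount. Processing and analyzing the orders yields the total transaction amount, return rate, and average spend per customer, and the results are drawn as a monthly sales trend chart, a channel distribution pie chart, and other charts that help explain how the business is doing.

(Summary generated automatically by CSDN.)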


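Data Visualization

Matplotlib -> plots for your own use, for data exploration
- Canvas -> figure() -> Figure
- Coordinate system -> subplot() -> one canvas can hold several coordinate systems -> Axes
- Plotting functions (drawn on an Axes) -> plot() / scatter() / bar() / pie() / hist() / boxplot() …
  - Trend -> line chart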
```
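import matplotlib.pyplot as plt
# Set the plot style (Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-darkgrid')
plt.rc('font', size=6)  # set the font size used in the figure
plt.rc('figure', figsize=(4, 3), dpi=150)  # set the figure size and resolution
# `data` is assumed to be a DataFrame of stock quotes; '收盘价(元)' is its closing-price (CNY) column
data['收盘价(元)'].plot()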
```
  - Relationship -> scatter plot
  - Difference -> bar chart
  - Proportion -> pie chart
  - Distribution -> histogram
```
data['涨跌幅(%)'].hist(bins=30)  # '涨跌幅(%)' is the price-change (%) column
```
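  - Descriptive statistics -> box plot (box-and-whisker plot)
Seaborn -> wraps Matplotlib and relies on default settings to cut down on plotting parameters
ECharts / D3.js -> business dashboards / large data screens -> front-end JavaScript charting libraries
- Back end: serves the data the charts need (API endpoints) -> data as a service -> Java / PHP / Python
- Front end: fetches the data over HTTP and renders it on the web page with JavaScript
-> PyECharts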
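
The canvas / coordinate system / plotting function hierarchy in the outline above can be sketched with Matplotlib's object-oriented interface. This is a minimal illustration, not code from the original post: the data is made up, and the `Agg` backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# One Figure (canvas) holding two Axes (coordinate systems)
fig = plt.figure(figsize=(6, 3), dpi=100)
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# Plotting methods are called on each Axes
ax1.plot(x, np.sin(x))           # trend -> line chart
ax2.scatter(x, np.cos(x), s=8)   # relationship -> scatter plot

n_axes = len(fig.axes)  # number of coordinate systems on the canvas
```

The `plt.subplot(...)` and `plt.plot(...)` calls used in the rest of this post do the same thing implicitly by acting on the "current" Figure and Axes.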

# Line chart and scatter plot
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2 * np.pi, 2 * np.pi, 60)
y1, y2 = np.sin(x), np.cos(x)
y3, y4 = np.sin(x), np.cos(x)

# Create the canvas ---> figure() returns a Figure object
# fig = plt.figure(...)
plt.figure(figsize=(8, 6), dpi=120)

# Create a coordinate system ---> subplot() returns an Axes object
# ax = fig.add_subplot(2, 2, 1)
plt.subplot(2, 2, 1)
# Call plot() to draw a line chart; if no canvas exists yet, plot() creates one with default settings
# ax.plot(...)
plt.plot(x, y1, marker='x', color='#ff00ff', linestyle=':', linewidth=1)
# Labels for the horizontal and vertical axes
plt.xlabel(r'$ \alpha $')
plt.ylabel(r'$ y = \sin(\alpha) $')
# Chart title
plt.title('Sine curve')

ax = plt.subplot(2, 2, 2)
# Change where the axes sit and which spines are shown
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_position(('data', 0.0))
ax.spines['bottom'].set_position(('data', 0.0))
# Customize the tick marks on the x and y axes
plt.xticks(
    np.arange(-2 * np.pi, 2 * np.pi + 1, np.pi / 2),
    labels=[r'$-2\pi$', r'$-\frac{3}{2}\pi$', r'$-\pi$', r'$-\frac{\pi}{2}$', '0', r'$\frac{\pi}{2}$', r'$\pi$', r'$\frac{3}{2}\pi$', r'$2\pi$']
)
plt.yticks(np.arange(-1, 1.1, 0.5))
plt.plot(x, y2, marker='.', color='#00ffff', linestyle='-', linewidth=2)

plt.subplot(2, 1, 2)
# Call scatter() to draw the scatter plots
plt.scatter(x, y3, s=x ** 2, c=y3, cmap='Reds', label='sine')
plt.scatter(x, y4, s=(y4 * 15 + 20), c=y4, cmap='Greens', label='cosine')
# Add an annotation to the chart
plt.annotate('Sine curve', xy=(0.1, 0.2), xytext=(1, -0.8), arrowprops={
    'facecolor': 'blue', 'arrowstyle': '->', 'edgecolor': 'blue',
    'connectionstyle': 'angle3,angleA=0,angleB=90'
})

# Add the legend
plt.legend(loc='lower right')

# Save the image
plt.savefig('result.png')
# Show the image
plt.show()

Pie chart / donut chart

# Pie chart / donut chart
plt.figure(figsize=(4, 4), dpi=120)
data = np.random.randint(100, 500, 5)
print(data)
labels = ['apple', 'banana', 'peach', 'lychee', 'pomegranate']
plt.pie(
    data,
    autopct='%.1f%%',
    radius=1,
    pctdistance=0.8,
    # explode=[0.1, 0, 0.2, 0, 0],
    # shadow=True,
    # Text properties
    # textprops=dict(fontsize=8, color='black'),
    textprops={'fontsize': 8, 'color': 'black'},
    # Wedge properties (a width smaller than the radius turns the pie into a donut)
    wedgeprops=dict(linewidth=1, width=0.35, edgecolor='white'),
    labels=labels
)
plt.show()

Bar chart
# Stacked bar chart comparing three sales teams
labels = np.array(['Q1', 'Q2', 'Q3', 'Q4'])
group1 = np.random.randint(20, 50, 4)
print(group1)
group2 = np.random.randint(10, 60, 4)
print(group2)
group3 = np.random.randint(30, 40, 4)
print(group3)
plt.bar(labels, group1, 0.6, label='Sales team A')
# Use the bottom parameter to stack the bars
plt.bar(labels, group2, 0.6, bottom=group1, label='Sales team B')
plt.bar(labels, group3, 0.6, bottom=group1 + group2, label='Sales team C')
plt.legend()
plt.show()
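
The stacking above passes `group1` and then `group1 + group2` as `bottom` by hand. For more layers, the same offsets can be derived with a cumulative sum. This is a small sketch with hypothetical numbers, not part of the original post:

```python
import numpy as np

# Hypothetical quarterly values for three groups (rows = groups, columns = Q1..Q4)
groups = np.array([
    [30, 25, 40, 35],
    [20, 45, 15, 50],
    [32, 38, 31, 36],
])

# Each layer starts where the layers beneath it end:
# bottoms[0] is all zeros, bottoms[k] is the sum of rows 0..k-1.
tops = np.cumsum(groups, axis=0)
bottoms = tops - groups
```

Each `bottoms[k]` can then be passed as the `bottom` argument of the k-th `plt.bar` call.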

Area chart

# Area chart
plt.figure(figsize=(6, 3))
days = np.arange(7)
sleeping = [7, 8, 6, 6, 7, 8, 10]
eating = [2, 3, 2, 1, 2, 3, 2]
working = [7, 8, 7, 8, 6, 2, 3]
playing = [8, 5, 9, 9, 9, 11, 9]
plt.stackplot(days, sleeping, eating, working, playing)
plt.legend(['Sleeping', 'Eating', 'Working', 'Playing'], fontsize=10)
plt.show()

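Radar chart / polar line chart

# Radar chart (polar line chart)
labels = np.array(['Technical skills', 'Work experience', 'Teamwork', 'Communication', 'Learning ability'])
values = np.array([78, 85, 95, 72, 88])
angles = np.linspace(0, 2 * np.pi, labels.size, endpoint=False)
# Append the first point again so that the shape closes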
values = np.concatenate((values, [values[0]]))
angles = np.concatenate((angles, [angles[0]]))
plt.figure(figsize=(4, 4), dpi=120)
ax = plt.subplot(projection='polar')
# Plot and fill
plt.plot(angles, values, marker='o', linestyle='--', linewidth=2)
plt.fill(angles, values, alpha=0.25)
# Set the labels and grid lines
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontsize=10)
ax.set_rgrids([20, 40, 60, 80], fontsize=10)
plt.show()
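
The two preparation steps in the radar chart (evenly spaced angles, then repeating the first point) can be checked in isolation. The snippet below is only an illustration that reuses the values from the example above:

```python
import numpy as np

values = np.array([78, 85, 95, 72, 88])
# endpoint=False spreads the 5 points over the circle without repeating 0 at 2*pi
angles = np.linspace(0, 2 * np.pi, values.size, endpoint=False)

# Repeat the first point at the end so the polygon closes when plotted
closed_values = np.append(values, values[0])
closed_angles = np.append(angles, angles[0])

# Angular distance between neighbouring axes, in degrees
step_deg = np.degrees(angles[1] - angles[0])
```

With five dimensions the axes end up 72 degrees apart, which is why `set_thetagrids` receives `angles[:-1]` converted to degrees.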

Rose chart

# Rose chart (circular bar chart); reuses group1 and group2 from the bar chart example
x = np.array([f'A-Q{i}' for i in range(1, 5)] + [f'B-Q{i}' for i in range(1, 5)])
y = np.array(group1.tolist() + group2.tolist())
print(y)
theta = np.linspace(0, 2 * np.pi, x.size, endpoint=False)
width = 2 * np.pi / x.size
colors = np.random.rand(8, 3)
# Project the bar chart onto polar coordinates
ax = plt.subplot(projection='polar')
plt.bar(theta, y, width=width, color=colors, bottom=0)
ax.set_thetagrids(theta * 180 / np.pi, x, fontsize=10)
plt.show()

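Density curve

# Plot the density curve of salary levels (df is assumed to be a DataFrame with a salary column)
df.salary.plot(kind='kde', xlim=(0, 80000))
# kind='kde': kernel density estimate plot
# xlim: range of the x axis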
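
To show what a density curve represents, here is a hand-rolled Gaussian kernel density estimate in NumPy. It is a sketch under stated assumptions: the salary sample is synthetic (the post does not show `df`), the helper `kde_gaussian` is a hypothetical name, and the bandwidth uses Scott's rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic salary sample standing in for df.salary
salary = rng.normal(loc=30000, scale=8000, size=500)

def kde_gaussian(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

grid = np.linspace(0, 80000, 801)
# Scott's rule of thumb for one-dimensional data: std * n^(-1/5)
bandwidth = salary.std(ddof=1) * salary.size ** (-1 / 5)
density = kde_gaussian(salary, grid, bandwidth)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
```

The area under the curve comes out close to 1, which is what distinguishes a density curve from a histogram of raw counts.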

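Order data from an e-commerce website

Data extraction

Read the order data from the Excel file.
Extract the orders placed in 2019.
Handle records that violate the business process (payment time earlier than order time, payment taking longer than 30 minutes, order amount below 0, payment amount below 0).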

import pandas as pd

# The file name '某电商网站订单数据.xlsx' means "order data from an e-commerce website"
order_df = pd.read_excel('data/某电商网站订单数据.xlsx', index_col='id')
order_df.info()

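Check for duplicate orders and count the distinct order IDs

# Check for duplicate order IDs: a distinct count below the row count means there are duplicates
order_df.orderID.nunique()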

Remove duplicate order IDs

# Drop orders whose order ID is duplicated, keeping the last occurrence
order_df.drop_duplicates('orderID', keep='last', inplace=True)
order_df.shape

Handle payments that took longer than 30 minutes

# Handle records whose payment took longer than 30 minutes
delta = (order_df.payTime - order_df.orderTime)
# order_df.drop(order_df[delta.dt.total_seconds() > 1800].index, inplace=True)
order_df.drop(order_df[(delta.dt.days > 0) | (delta.dt.seconds > 1800)].index, inplace=True)
order_df.shape
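
The filter above combines `.dt.days` and `.dt.seconds`. These are components of the timedelta, not its total length, which matters for payments recorded before the order time. The toy data below is hypothetical and only meant to show the difference from `.dt.total_seconds()`:

```python
import pandas as pd

# Hypothetical orders: paid after 10 min, paid after 2 days, paid 5 min *before* ordering
toy = pd.DataFrame({
    "orderTime": pd.to_datetime(["2019-01-01 10:00", "2019-01-01 10:00", "2019-01-01 10:00"]),
    "payTime": pd.to_datetime(["2019-01-01 10:10", "2019-01-03 10:05", "2019-01-01 09:55"]),
})
delta = toy.payTime - toy.orderTime

days = delta.dt.days.tolist()              # whole-day component
seconds = delta.dt.seconds.tolist()        # seconds component, always 0..86399
total = delta.dt.total_seconds().tolist()  # signed total length in seconds

# Keep only rows paid within 0..1800 seconds of ordering
valid = toy[(delta.dt.total_seconds() >= 0) & (delta.dt.total_seconds() <= 1800)]
```

A delta of minus five minutes is stored as `-1 day + 86100 seconds`, so a rule based on `total_seconds()` states both business constraints (not before the order, not more than 30 minutes after) more directly.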

Handle order or payment amounts below 0

# Handle records whose order amount or payment amount is below 0
order_df.drop(order_df[(order_df.orderAmount < 0) | (order_df.payment < 0)].index, inplace=True)
order_df.shape
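
The extraction steps listed at the start of this section also call for keeping only the 2019 orders, but the post shows no code for that step. One possible way to do it, sketched on hypothetical rows and assuming `orderTime` is a datetime column:

```python
import pandas as pd

# Hypothetical orders straddling the 2019 boundaries
toy = pd.DataFrame({
    "orderID": ["a", "b", "c"],
    "orderTime": pd.to_datetime(["2018-12-31 23:59", "2019-06-01 12:00", "2020-01-01 00:00"]),
})

# Keep rows whose order year is 2019
orders_2019 = toy[toy.orderTime.dt.year == 2019]
```

On the real data the equivalent would be `order_df[order_df.orderTime.dt.year == 2019]`.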

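Data cleaning

Fill in missing channel values (with the mode)
Normalize the platform type field (strip stray whitespace so the values are consistent)
Add a discount field and handle discounts greater than 1 (set the payment amount to "order amount * average discount")

Inspect the order data

# Fix the misspelled column names, then inspect the data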
order_df.rename(
    columns={'chanelID': 'channelID', 'platfromType': 'platformType'}, 
    inplace=True
)
order_df.info()

Handle missing channels

# Fill in missing channel values (with the mode)
common_channel = order_df.channelID.mode()[0]
# order_df.fillna(common_channel, inplace=True)
order_df['channelID'] = order_df.channelID.fillna(common_channel)
order_df.info()
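
A toy version of the mode fill may make the `mode()[0]` idiom clearer. The channel values here are hypothetical:

```python
import numpy as np
import pandas as pd

channel = pd.Series(["c-01", "c-02", "c-01", np.nan, "c-03"], name="channelID")

# mode() ignores NaN and returns a Series (there can be ties); [0] takes its first entry
most_common = channel.mode()[0]
filled = channel.fillna(most_common)
```

Assigning the result back to the single column, as the post does, avoids filling unrelated columns the way a DataFrame-wide `fillna` would.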

Handle inconsistent platform values

# Normalize the platform type field (strip stray whitespace so the values are consistent)
order_df['platformType'] = order_df.platformType.replace(r'\s', '', regex=True).str.upper()
order_df.head(10)
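
The third cleaning step (adding a discount field and repairing discounts greater than 1) has no code in this post. The sketch below is one possible reading of it on hypothetical rows; in particular, computing the average discount from the valid rows only is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "orderAmount": [100.0, 200.0, 50.0, 80.0],
    "payment": [90.0, 160.0, 60.0, 80.0],   # third row pays more than the order amount
})
toy["discount"] = toy.payment / toy.orderAmount

# Average discount computed from the rows that look valid (discount <= 1)
valid = toy.discount <= 1
mean_discount = toy.loc[valid, "discount"].mean()

# Overwrite the suspicious payments, then recompute the discount column
toy.loc[~valid, "payment"] = (toy.loc[~valid, "orderAmount"] * mean_discount).round(2)
toy["discount"] = (toy.payment / toy.orderAmount).round(4)
```

After the repair no row has a discount above 1, and the repaired payment equals the order amount times the average discount.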

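Data analysis

Gross merchandise volume (GMV), total sales, actual sales, return rate, average spend per customer (ARPPU)
Monthly GMV and trend analysis (line chart)
GMV broken down by traffic channel (pie chart)
Which day of the week and which time slot of the day see the most orders (bar chart)
User repurchase rate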

# Amounts are divided by 10,000, so they print in units of 10k CNY
# chargeback holds '是' (yes, returned) or '否' (no, not returned)
print(f'GMV: {order_df.orderAmount.sum() / 10000:.4f} (10k CNY)')
print(f'Total sales: {order_df.payment.sum() / 10000:.4f} (10k CNY)')
real_total = order_df[order_df.chargeback == "否"].payment.sum()
print(f'Actual sales: {real_total / 10000:.4f} (10k CNY)')
back_rate = order_df[order_df.chargeback == '是'].orderID.size / order_df.orderID.size
print(f'Return rate: {back_rate * 100:.2f}%')
print(f'ARPPU: {real_total / order_df.userID.nunique():.2f} CNY')
# Number of returned orders
print(order_df[order_df.chargeback == '是'].orderID.size)

# Output:
# GMV: 10850.3931 (10k CNY)
# Total sales: 10260.9015 (10k CNY)
# Actual sales: 8892.2733 (10k CNY)
# Return rate: 13.18%
# ARPPU: 1130.87 CNY
# 13614
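
The same indicators can be traced by hand on a tiny table. The rows below are hypothetical; the column names and the '是' / '否' chargeback values follow the dataset used in this post.

```python
import pandas as pd

toy = pd.DataFrame({
    "orderID": ["o1", "o2", "o3", "o4"],
    "userID": ["u1", "u1", "u2", "u3"],
    "orderAmount": [100.0, 200.0, 300.0, 400.0],
    "payment": [90.0, 180.0, 300.0, 380.0],
    "chargeback": ["否", "是", "否", "否"],   # '是' = returned, '否' = not returned
})

gmv = toy.orderAmount.sum()                       # all order amounts
sales = toy.payment.sum()                         # all payments
real_sales = toy[toy.chargeback == "否"].payment.sum()   # payments that were not returned
return_rate = (toy.chargeback == "是").mean()     # share of returned orders
arppu = real_sales / toy.userID.nunique()         # actual sales per distinct user
```

Here GMV is 1000, total sales 950, actual sales 770, the return rate 25%, and ARPPU about 256.67.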

# Monthly trend chart of GMV, sales, and actual sales
order_df['month'] = order_df.orderTime.dt.month
ser1 = np.round(order_df.groupby(by='month').orderAmount.sum() / 10000, 2)
ax = ser1.plot(
    figsize=(8, 4), kind='line', 
    linestyle='--', color='tomato', marker='^',
    label='GMV'
)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ser2 = np.round(order_df.groupby(by='month').payment.sum() / 10000, 2)
ser2.plot(
    figsize=(8, 4), kind='line', 
    linestyle='--', color='darkorange', marker='^',
    label='Sales'
)
ser3 = np.round(order_df[order_df.chargeback == '否'].groupby(by='month').payment.sum() / 10000, 2)
ser3.plot(
    figsize=(8, 4), kind='line', 
    linestyle='--', color='cyan', marker='^',
    label='Actual sales'
)
plt.xticks(ser1.index, labels=[str(x) for x in ser1.index])
plt.yticks(np.arange(400, 1201, 100))
plt.xlabel('Month')
plt.ylabel('Amount (10k CNY)')
plt.title('Monthly GMV trend, 2019')
for i in ser1.index:
    plt.text(i - 0.3, ser1[i] + 20, ser1[i])
for i in ser3.index:
    plt.text(i - 0.3, ser3[i] - 50, ser3[i])
plt.legend(loc='lower right')
plt.savefig('monthchange.svg')
plt.show()

Channel analysis chart

ser = order_df.groupby('channelID').orderAmount.sum()
ser.plot(
    figsize=(5, 5), kind='pie',
    autopct='%.2f%%', pctdistance=0.8,
    counterclock=False, startangle=180,
    wedgeprops={
        'width': 0.4,
        'edgecolor': 'white'
    }
)
plt.ylabel('')
plt.savefig('channel.svg')
plt.show()

Orders by day of the week

order_df['weekday'] = order_df.orderTime.dt.weekday
ser = order_df.groupby('weekday').orderID.count()
ser.plot(kind='bar', color=np.random.rand(7, 3))
plt.xticks(ser.index, labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=0)
plt.savefig('week.svg')
plt.show()

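Orders by half-hour slot

# '30min' is the half-hour frequency alias (the older alias '30T' is deprecated in recent pandas)
order_df['time'] = order_df.orderTime.dt.floor('30min').dt.time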
ser = order_df.groupby('time').orderID.count()
ser.plot(figsize=(9.5, 4), kind='bar', color=np.random.rand(12, 3))
plt.show()
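
How `dt.floor` buckets timestamps into half-hour slots can be seen on a few hypothetical order times:

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2019-03-01 09:05:10", "2019-03-02 09:29:59",
    "2019-03-02 09:30:00", "2019-03-05 21:47:30",
]))

# Round each timestamp down to the start of its half hour, then drop the date part
slots = times.dt.floor("30min").dt.time
counts = slots.value_counts().sort_index()
```

Dropping the date with `.dt.time` is what lets orders from different days fall into the same slot.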

Monthly repurchase rate

def handle_data(x):
    if x == 1:
        return 0
    elif x > 1:
        return 1
    return np.nan


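# Rows are users, columns are months, values are the number of orders
temp = order_df.pivot_table(index='userID', columns='month', values='orderID', aggfunc='count')
# Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map
temp = temp.applymap(handle_data)
# Repurchase rate per month = users with more than one order / users with at least one order
se = temp.sum() / temp.count()
se.plot(figsize=(8, 4), kind='bar', color=['teal', 'b', 'gold', 'firebrick', 'cornflowerblue'])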
plt.savefig('resale.svg')
plt.show()
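
The repurchase-rate calculation can be followed on a small hypothetical table. This sketch replaces the element-wise `handle_data` function with a vectorized expression that yields the same 1 / 0 / NaN flags:

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": ["u1", "u1", "u2", "u3", "u3", "u3", "u1"],
    "month":  [1,    1,    1,    1,    2,    2,    2],
    "orderID": ["o1", "o2", "o3", "o4", "o5", "o6", "o7"],
})
counts = toy.pivot_table(index="userID", columns="month", values="orderID", aggfunc="count")

# 1 -> more than one order that month, 0 -> exactly one order, NaN -> no order
flags = (counts > 1).astype(float).where(counts.notna())

# sum() counts the repeat buyers; count() counts the buyers (NaN is skipped)
repurchase_rate = flags.sum() / flags.count()
```

In month 1 one of three buyers ordered twice (rate 1/3); in month 2 one of two buyers did (rate 1/2).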

RFM pivot table

# Compute the raw RFM values --> pivot table
temp = order_df[order_df.chargeback == '否'].pivot_table(
    index='userID', 
    values=['orderTime', 'orderID', 'payment'],
    aggfunc={
        'orderTime': 'max',
        'orderID': 'count',
        'payment': 'sum'
    }
)
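
The pivot table holds each user's latest order time, order count, and total payment. Turning these into R (recency), F (frequency), and M (monetary) columns might look like the sketch below. The rows are hypothetical, and the reference date of 2019-12-31 is an assumption, not something stated in the post.

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": ["u1", "u1", "u2"],
    "orderID": ["o1", "o2", "o3"],
    "orderTime": pd.to_datetime(["2019-11-01", "2019-12-21", "2019-12-30"]),
    "payment": [100.0, 50.0, 300.0],
})
rfm = toy.pivot_table(
    index="userID",
    values=["orderTime", "orderID", "payment"],
    aggfunc={"orderTime": "max", "orderID": "count", "payment": "sum"},
)

# Assumed reference date: the last day of 2019
ref = pd.Timestamp("2019-12-31")
rfm["R"] = (ref - rfm.orderTime).dt.days          # days since the latest order
rfm = rfm.rename(columns={"orderID": "F", "payment": "M"})[["R", "F", "M"]]
```

A smaller R means a more recent purchase, while larger F and M mean more orders and more spending.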
