Data Visualization in Practice: New Retail (Complete Code)

大象代码

Last modified 2024-01-10 20:43:11

Category: Data Visualization. Tags: information visualization, retail, python

First published 2023-04-07 20:23:01

Article link: https://blog.csdn.net/mjoin2022/article/details/130019827

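This code uses Python's pandas library to read, merge, and clean several CSV order files and one Excel goods file. It first concatenates the order data for April through September 2018 and drops every row that contains a missing value, then cleans the goods table the same way. Next it extracts the city from the 省市区 (province/city/district) column into a new column, and loops over a list of characters to strip them from the 商品详情 (goods detail) column. Finally it drops the columns it no longer needs, derives a new 下单时间段 (order time period) column, and saves the processed data.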

Summary generated automatically by CSDN.
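The merge step described in the summary can also be written as a loop over the months, which avoids keeping one variable per file. The sketch below is illustrative and is not the author's code: the helper name `load_monthly_orders` and its `directory` and `months` parameters are assumptions, while the file-name pattern `订单表2018-<month>.csv` and the `gbk` encoding are taken from the article's script.

```python
from pathlib import Path

import pandas as pd


def load_monthly_orders(directory, months, encoding="gbk"):
    """Read one order CSV per month and stack them into a single DataFrame.

    Hypothetical helper: assumes files named 订单表2018-<month>.csv
    that all share the same columns.
    """
    frames = []
    for month in months:
        path = Path(directory) / f"订单表2018-{month}.csv"
        frames.append(pd.read_csv(path, encoding=encoding))
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# Example call for April through September, with the files in the working directory:
# data = load_monthly_orders(".", range(4, 10))
```

Collecting the frames in a list and calling `pd.concat` once is generally cheaper than concatenating inside the loop.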

# Code 7-1
import pandas as pd
data4 = pd.read_csv('订单表2018-4.csv', encoding='gbk')
data5 = pd.read_csv('订单表2018-5.csv', encoding='gbk')
data6 = pd.read_csv('订单表2018-6.csv', encoding='gbk')
data7 = pd.read_csv('订单表2018-7.csv', encoding='gbk')
data8 = pd.read_csv('订单表2018-8.csv', encoding='gbk')
data9 = pd.read_csv('订单表2018-9.csv', encoding='gbk')
goods_info = pd.read_excel('商品表.xlsx')
print(data4.shape, data5.shape, data6.shape, data7.shape,
      data8.shape, data9.shape, goods_info.shape)
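# Code 7-2: concatenate the six monthly order tables
data = pd.concat([data4, data5, data6, data7, data8, data9], ignore_index=True)
print('订单表合并后的数据为', data.shape)
# Code 7-3: count the missing values in each column of the order table
print('订单表各列的缺失值数目为：\n', data.isnull().sum())
# Code 7-4
# Drop missing values, printing the shape before and after
print('未做删除缺失值前订单表行列数目为：', data.shape)
data = data.dropna(how='any')  # drop every row with at least one missing value
print('删除完缺失值后订单表行列数目为：', data.shape)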
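# Code 7-5
# Clean the goods table: count the missing values in each column
print('商品表各列的缺失值数目为：\n', goods_info.isnull().sum())
# Drop missing values, printing the shape before and after
print('未做删除缺失值前商品表行列数目为：', goods_info.shape)
goods_info = goods_info.dropna(how='any')
print('删除完缺失值后商品表行列数目为：', goods_info.shape)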
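# Code 7-6
# Extract the city from 省市区 (province/city/district) into a new column, 市.
# The slice takes characters 3-5, so it relies on a 3-character province
# prefix followed by a 3-character city name.
data['市'] = data['省市区'].str[3: 6]
print('经过处理后前5行为：\n', data.head())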
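# Code 7-7
# Characters to strip from the 商品详情 (goods detail) column
error_str = [' ', '(', ')', '（', '）', '0', '1', '2', '3', '4', '5', '6',
             '7', '8', '9', 'g', 'l', 'm', 'M', 'L', '听', '特', '饮', '罐',
             '瓶', '只', '装', '欧', '式', '&', '%', 'X', 'x', ';']
# Strip each character in turn; regex=False treats every item as a literal
# string, which matters because '(' and ')' are regex metacharacters
for i in error_str:
    data['商品详情'] = data['商品详情'].str.replace(i, '', regex=False)
# Store the cleaned text in a new column, 商品名称 (goods name)
data['商品名称'] = data['商品详情']
print(data['商品名称'])
# .iloc makes the slice explicitly positional
print(data['商品名称'].iloc[0: 5])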
# Code 7-8
# Shape before removing the low-amount orders
print(data.shape)
# Keep only orders with a total of at least 0.5 yuan, then print the new shape
data = data[data['总金额(元)'] >= 0.5]
print(data.shape)
# Code 7-9
# Unify the names of some goods in the goods table
goods_info['商品名称'] = goods_info['商品名称'].str.replace('可口可乐', '可乐')
goods_info['商品名称'] = goods_info['商品名称'].str.replace(' ', '')
goods_info['商品名称'] = goods_info['商品名称'].str.replace('可比克薯片烧烤味',
                                                    '可比克烧烤味')
goods_info['商品名称'] = goods_info['商品名称'].str.replace('可比克薯片牛肉味',
                                                    '可比克牛肉味')
goods_info['商品名称'] = goods_info['商品名称'].str.replace('可比克薯片番茄味',
                                                    '可比克番茄味')
goods_info['商品名称'] = goods_info['商品名称'].str.replace('阿沙姆奶茶',
                                                    '阿萨姆奶茶')
goods_info['商品名称'] = goods_info['商品名称'].str.replace('罐装百威',
                                                    '罐装百威啤酒')
print(goods_info['商品名称'])
goods_info.to_csv('goods_info.csv', index=False, encoding='gbk')
# Code 7-10
# Reduce the order table's dimensionality by dropping columns
data = data.drop(['手续费(元)', '收款方', '软件版本', '省市区',
                  '商品详情', '退款金额(元)'], axis=1)
print('降维后，数据列为：\n', data.columns.values)
# Code 7-11
# Reduce the order data fields
# Convert the time strings to standard datetime values
data['下单时间'] = pd.to_datetime(data['下单时间'])
data['小时'] = data['下单时间'].dt.hour  # extract the hour into a new column, 小时
data['月份'] = data['下单时间'].dt.month
data['下单时间段'] = 'time'  # new column 下单时间段 (order time period), initialized to 'time'
exp1 = data['小时'] <= 5  # hour <= 5
# Matching rows are labeled 凌晨 (before dawn)
data.loc[exp1, '下单时间段'] = '凌晨'
# 5 < hour <= 8
exp2 = (5 < data['小时']) & (data['小时'] <= 8)
# Matching rows are labeled 早晨 (early morning)
data.loc[exp2, '下单时间段'] = '早晨'
# 8 < hour <= 11
exp3 = (8 < data['小时']) & (data['小时'] <= 11)
# Matching rows are labeled 上午 (morning)
data.loc[exp3, '下单时间段'] = '上午'
# 11 < hour <= 13
exp4 = (11 < data['小时']) & (data['小时'] <= 13)
# Matching rows are labeled 中午 (noon)
data.loc[exp4, '下单时间段'] = '中午'
# 13 < hour <= 16
exp5 = (13 < data['小时']) & (data['小时'] <= 16)
# Matching rows are labeled 下午 (afternoon)
data.loc[exp5, '下单时间段'] = '下午'
# 16 < hour <= 19
exp6 = (16 < data['小时']) & (data['小时'] <= 19)
# Matching rows are labeled 傍晚 (dusk)
data.loc[exp6, '下单时间段'] = '傍晚'
# 19 < hour <= 24 (dt.hour itself never exceeds 23)
exp7 = (19 < data['小时']) & (data['小时'] <= 24)
# Matching rows are labeled 晚上 (evening)
data.loc[exp7, '下单时间段'] = '晚上'
print('处理完成后的订单表前5行为：\n', data.head())
data.to_csv('order.csv', index=False, encoding='gbk')
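
The seven boolean masks in Code 7-11 amount to binning the hour into right-closed intervals, which `pd.cut` can express in one call. The following is a minimal sketch under stated assumptions and is not part of the original script: the function name `add_order_period` and the two constants are hypothetical, the column names are the ones used in the article, and the lowest edge of -1 is chosen only so that hour 0 falls into the first bin.

```python
import pandas as pd

# Right-closed bins matching the masks in Code 7-11:
# (-1, 5] 凌晨, (5, 8] 早晨, (8, 11] 上午, (11, 13] 中午,
# (13, 16] 下午, (16, 19] 傍晚, (19, 24] 晚上
HOUR_EDGES = [-1, 5, 8, 11, 13, 16, 19, 24]
PERIOD_LABELS = ['凌晨', '早晨', '上午', '中午', '下午', '傍晚', '晚上']


def add_order_period(df, time_col='下单时间', out_col='下单时间段'):
    """Return a copy of df with a time-period label derived from time_col."""
    out = df.copy()
    hours = pd.to_datetime(out[time_col]).dt.hour
    # pd.cut uses right-closed intervals by default, so hour 5 is 凌晨 and hour 6 is 早晨
    out[out_col] = pd.cut(hours, bins=HOUR_EDGES, labels=PERIOD_LABELS)
    return out
```

Unlike the original loop of masks, this yields a categorical column rather than plain strings, and rows whose time cannot be parsed end up as missing values instead of keeping the placeholder 'time'.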

D:\Python\python.exe D:/pycharm/2读取与处理无人售货机数据.py
(2077, 14) (46068, 14) (51925, 14) (77644, 14) (86459, 14) (86723, 14) (3626, 8)
订单表合并后的数据为 (350896, 14)
订单表各列的缺失值数目为：
设备编号 0
下单时间 0
订单编号 0
购买数量(个) 0
手续费(元) 0
总金额(元) 0
支付状态 0
出货状态 3
收款方 276
退款金额(元) 0
购买用户 0
商品详情 0
省市区 0
软件版本 0
dtype: int64
未做删除缺失值前订单表行列数目为： (350896, 14)
删除完缺失值后订单表行列数目为： (350617, 14)
商品表各列的缺失值数目为：
商品名称 392
销售数量 0
销售金额 0
利润 0
库存数量 0
进货数量 0
存货周转天数 0
月份 0
dtype: int64
未做删除缺失值前商品表行列数目为： (3626, 8)
删除完缺失值后商品表行列数目为： (3234, 8)
经过处理后前5行为：
设备编号 下单时间 ... 软件版本 市
0 112531 2018/4/30 22:55 ... V2.1.55/1.2;rk3288 中山市
1 112673 2018/4/30 22:50 ... V3.0.37;rk3288;(900x1440) 佛山市
2 112636 2018/4/30 22:35 ... V2.1.55/1.2;rk3288 广州市
3 112636 2018/4/30 22:33 ... V2.1.55/1.2;rk3288 广州市
4 112636 2018/4/30 21:33 ... V2.1.55/1.2;rk3288 广州市

[5 rows x 15 columns]
D:\pycharm\2读取与处理无人售货机数据.py:40: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
data['商品详情'] = data['商品详情'].str.replace(i, '')
0 可口可乐
1 旺仔牛奶
2 雪碧
3 阿萨姆奶茶
4 王老吉
...
350891 王老吉
350892 王老吉
350893 挑战者
350894 伊利麦香味早餐奶
350895 伊利麦香味早餐奶
Name: 商品名称, Length: 350617, dtype: object
0 可口可乐
1 旺仔牛奶
2 雪碧
3 阿萨姆奶茶
4 王老吉
Name: 商品名称, dtype: object
(350617, 16)
D:\pycharm\2读取与处理无人售货机数据.py:43: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
data['商品名称'][0: 5]
D:\pycharm\2读取与处理无人售货机数据.py:45: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
print(data['商品名称'][0: 5])
(350450, 16)
0 黑派黑水
1 黑派黑水
2 黑派黑水
3 黑派黑水
4 黑派黑水
...
3229 18g旺仔小馒头
3230 18g旺仔小馒头
3231 18g旺仔小馒头
3232 18g旺仔小馒头
3233 18g旺仔小馒头
Name: 商品名称, Length: 3234, dtype: object
降维后，数据列为：
['设备编号' '下单时间' '订单编号' '购买数量(个)' '总金额(元)' '支付状态' '出货状态' '购买用户' '市' '商品名称']
处理完成后的订单表前5行为：
设备编号 下单时间 订单编号 ... 小时 月份 下单时间段
0 112531 2018-04-30 22:55:00 112531qr15251001151105 ... 22 4 晚上
1 112673 2018-04-30 22:50:00 112673qr15250998551741 ... 22 4 晚上
2 112636 2018-04-30 22:35:00 112636qr15250989343846 ... 22 4 晚上
3 112636 2018-04-30 22:33:00 112636qr15250988245087 ... 22 4 晚上
4 112636 2018-04-30 21:33:00 112636qr15250952296930 ... 21 4 晚上

[5 rows x 13 columns]

Process finished with exit code 0
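
The first FutureWarning in the run output above comes from calling `str.replace` without an explicit `regex` argument on patterns such as '(' and ')'. One way to make the intent explicit, and to strip all the characters in a single pass instead of one pass per character, is to build a regex character class. This is a sketch under stated assumptions, not the author's method: `ERROR_CHARS` copies the article's `error_str` list, while the names `PATTERN` and `strip_chars` are hypothetical.

```python
import re

import pandas as pd

# Same characters as error_str in Code 7-7
ERROR_CHARS = [' ', '(', ')', '（', '）', '0', '1', '2', '3', '4', '5', '6',
               '7', '8', '9', 'g', 'l', 'm', 'M', 'L', '听', '特', '饮', '罐',
               '瓶', '只', '装', '欧', '式', '&', '%', 'X', 'x', ';']

# re.escape keeps metacharacters such as '(' literal inside the class
PATTERN = '[' + ''.join(re.escape(c) for c in ERROR_CHARS) + ']'


def strip_chars(series):
    """Remove every character listed in ERROR_CHARS from a string Series."""
    return series.str.replace(PATTERN, '', regex=True)
```

On a column such as 商品详情 this should give the same result as the original loop with `regex=False`, since both delete exactly the listed characters.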