[Data Analysis Project Practice] Loading and Storing Shop Data

carroll18

Category: Machine Learning. Tags: data analysis, python, project practice

Article link: https://blog.csdn.net/qq_40722827/article/details/108152171

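'''
[Project] Loading and storing shop data

Requirements:
1. Successfully read the file "商铺数据.csv"
2. Parse the data into a list of dicts: [{'var1': value1, 'var2': value2, 'var3': value3, ...}, ..., {}]
3. Data cleaning:
   (1) Clean the comment and price fields into numbers
   (2) Remove records with missing fields
   (3) Split commentlist into three fields and clean them into numbers
4. Save the result as a .pkl file

'''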

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("商铺数据.csv")
df

	classify	name	comment	star	price	address	commentlist
0	美食	望蓉城老坛酸菜鱼(合生汇店)	我要点评	该商户暂无星级	人均￥125	翔殷路1099号合生汇5F	口味8.3 环境8.4 ...
1	美食	泰国街边料理	74 条点评	准四星商户	人均￥48	黄兴路合生汇B2美食集市内	口味7.4 环境7.6 ...
2	美食	壹面如故(苏宁生活广场店)	265 条点评	准四星商户	人均￥21	邯郸路585号苏宁生活广场B1层	口味7.0 环境7.2 ...
3	美食	鮨谷•Aburiya(合生汇店)	2748 条点评	准五星商户	人均￥142	翔殷路1099号合生广场5楼23、28铺位	口味8.9 环境8.5 ...
4	美食	我们的烤肉我们的馕	5 条点评	准四星商户	人均 -	邯郸路399-D3号	口味7.5 环境6.8 ...
...	...	...	...	...	...	...	...
1260	购物	obdear	1 条点评	准四星商户	人均 -	五角场合生汇广场B1	质量7.1 环境6.9 ...
1261	购物	KAKO(百联又一城店)	4 条点评	准四星商户	人均 -	淞沪路8号百联又一城3层3F-14	质量7.1 环境6.9 ...
1262	购物	思乐得生活馆(合生汇店)	1 条点评	准四星商户	人均 -	翔殷路1099号合生汇B2层28B	质量7.1 环境6.9 ...
1263	购物	sefon臣枫(巴黎春天店)	1 条点评	准四星商户	人均 -	淞沪路1号巴黎春天2层	质量7.1 环境6.9 ...
1264	购物	诗篇(百联又一城店)	1 条点评	准四星商户	人均 -	淞沪路8号百联又一城3层3F-17	质量7.1 环境6.9 ...

1265 rows × 7 columns

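# Parse the data into a list of dicts: [{'var1': value1, 'var2': value2, ...}, ..., {}]
lst = []

for row in df.values:
    # Build a fresh dict for every row. Reusing one dict created outside the
    # loop would make every list element point to the same object, so all
    # entries would show the last row.
    dct = dict(zip(df.columns, row))
    lst.append(dct)

# Show the last record
print(lst[-1])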

{'classify': '购物', 'name': '诗篇(百联又一城店)', 'comment': '1                    条点评', 'star': '准四星商户', 'price': '人均                                    -', 'address': '淞沪路8号百联又一城3层3F-17', 'commentlist': '质量7.1                                环境6.9                                服务7.0'}
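
pandas can also build the same list-of-dicts structure directly. The following is a minimal sketch on a small made-up frame rather than the full CSV; `sample` and `records` are hypothetical names, and the rows only imitate the format of the shop data.

```python
import pandas as pd

# Hypothetical two-row sample that mimics some columns of the shop data
sample = pd.DataFrame({
    "classify": ["美食", "购物"],
    "name": ["泰国街边料理", "obdear"],
    "price": ["人均￥48", "人均 -"],
})

# orient="records" returns one dict per row, keyed by column name
records = sample.to_dict(orient="records")
print(records[1])
```

Each element of `records` is a separate dict, so changing one record does not affect the others.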

df['comment'].head()

0                           我要点评
1      74                    条点评
2     265                    条点评
3    2748                    条点评
4       5                    条点评
Name: comment, dtype: object

loc: selects rows by their label in the row index (for example, the row whose index label is "A").

iloc: selects rows by integer position (for example, the second row).
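
The difference between label-based and position-based selection is easiest to see on a frame whose index is not 0, 1, 2, .... This is a small sketch with a hypothetical frame named `demo`:

```python
import pandas as pd

# Hypothetical frame with string index labels
demo = pd.DataFrame({"price": [48, 21, 142]}, index=["A", "B", "C"])

by_label = demo.loc["A"]      # the row whose index label is "A"
by_position = demo.iloc[1]    # the second row, whatever its label is

print(by_label["price"], by_position["price"])
```

With the default integer index produced by `read_csv`, `df.loc[i]` and `df.iloc[i]` return the same row. After rows are filtered out they differ, because the labels are kept while the positions are renumbered.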

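# 3. Data cleaning:
# (1) Clean the comment and price fields into numbers
# (2) Remove records with missing fields
# (3) Split commentlist into three fields and clean them into numbers

When cleaning data, look at what the values have in common and where they differ, then work out the conditions that separate the cases according to each requirement.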

ls_comment = []
price_ls = []
for comment, price in zip(df['comment'], df['price']):
    # '我要点评' ("write a review") means the shop has no reviews yet
    count = comment.split(" ")[0]
    if count == '我要点评':
        ls_comment.append(0)
    else:
        ls_comment.append(int(count))

    # Prices look like '人均￥125'; '人均 -' means no price is listed
    if '￥' in price:
        price_ls.append(int(price.split('￥')[-1]))
    else:
        price_ls.append(0)

When adding a new column to a DataFrame, it is better to collect the column's values in a list first and then assign the whole column in one step.

df['comment_new'] = ls_comment
df['price_new'] = price_ls
df.head()

	classify	name	comment	star	price	address	commentlist	comment_new	price_new
0	美食	望蓉城老坛酸菜鱼(合生汇店)	我要点评	该商户暂无星级	人均￥125	翔殷路1099号合生汇5F	口味8.3 环境8.4 ...	0	125
1	美食	泰国街边料理	74 条点评	准四星商户	人均￥48	黄兴路合生汇B2美食集市内	口味7.4 环境7.6 ...	74	48
2	美食	壹面如故(苏宁生活广场店)	265 条点评	准四星商户	人均￥21	邯郸路585号苏宁生活广场B1层	口味7.0 环境7.2 ...	265	21
3	美食	鮨谷•Aburiya(合生汇店)	2748 条点评	准五星商户	人均￥142	翔殷路1099号合生广场5楼23、28铺位	口味8.9 环境8.5 ...	2748	142
4	美食	我们的烤肉我们的馕	5 条点评	准四星商户	人均 -	邯郸路399-D3号	口味7.5 环境6.8 ...	5	0
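
The same cleaning can also be written without an explicit loop. The sketch below is one possible alternative, not the method used in this article; it assumes every review count and price is a plain run of digits, and `sample` is a hypothetical frame in the same format as the real columns.

```python
import pandas as pd

# Hypothetical sample in the same format as the comment and price columns
sample = pd.DataFrame({
    "comment": ["我要点评", "74 条点评", "2748 条点评"],
    "price": ["人均￥125", "人均￥48", "人均 -"],
})

# Pull out the first run of digits; rows without digits become NaN
comment_digits = sample["comment"].str.extract(r"(\d+)", expand=False)
price_digits = sample["price"].str.extract(r"￥(\d+)", expand=False)

# Convert to numbers and use 0 as the placeholder for missing values
sample["comment_new"] = pd.to_numeric(comment_digits).fillna(0).astype(int)
sample["price_new"] = pd.to_numeric(price_digits).fillna(0).astype(int)

print(sample[["comment_new", "price_new"]])
```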

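# Remove records with missing fields (0 marks a missing price or review count)
df1 = df[df['price_new'] > 0]

# .copy() makes df_new an independent DataFrame, so columns added to it later
# are not written to a slice of df
df_new = df1[df1['comment_new'] > 0].copy()
df_new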

	classify	name	comment	star	price	address	commentlist	comment_new	price_new
1	美食	泰国街边料理	74 条点评	准四星商户	人均￥48	黄兴路合生汇B2美食集市内	口味7.4 环境7.6 ...	74	48
2	美食	壹面如故(苏宁生活广场店)	265 条点评	准四星商户	人均￥21	邯郸路585号苏宁生活广场B1层	口味7.0 环境7.2 ...	265	21
3	美食	鮨谷•Aburiya(合生汇店)	2748 条点评	准五星商户	人均￥142	翔殷路1099号合生广场5楼23、28铺位	口味8.9 环境8.5 ...	2748	142
5	美食	麦当劳(万达店)	785 条点评	准四星商户	人均￥24	邯郸路600号万达商业广场B1楼A05号铺	口味7.4 环境7.2 ...	785	24
6	美食	蒸年青STEAMYOUNG(百联又一城购物中心店)	3779 条点评	准五星商户	人均￥70	淞沪路8号百联又一城购物中心7层	口味8.6 环境8.6 ...	3779	70
...	...	...	...	...	...	...	...	...	...
1217	购物	屈臣氏(苏宁电器广场店)	22 条点评	三星商户	人均￥75	邯郸路585号苏宁电器广场内	质量7.5 环境7.1 ...	22	75
1232	购物	澳人坊(万达店)	12 条点评	准四星商户	人均￥37	淞沪路77号万达广场1层	质量7.5 环境7.1 ...	12	37
1237	购物	奥卡索(东方商厦店)	6 条点评	准四星商户	人均￥333	四平路2500号东方商厦B1层	质量7.1 环境7.1 ...	6	333
1240	购物	TISSOT(巴黎春天店)	17 条点评	三星商户	人均￥4671	淞沪路1号巴黎春天1层	质量7.1 环境7.1 ...	17	4671
1259	购物	as女鞋店(巴黎春天店)	11 条点评	准四星商户	人均￥921	淞沪路1号巴黎春天1层	质量7.1 环境6.9 ...	11	921

560 rows × 9 columns
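
The two filtering steps can also be combined into a single boolean mask. This is a minimal sketch on a hypothetical frame named `sample`, where 0 marks a missing value as in the cleaning step:

```python
import pandas as pd

# Hypothetical cleaned sample; 0 marks a missing value
sample = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "comment_new": [0, 74, 5, 12],
    "price_new": [125, 48, 0, 37],
})

# Combine both conditions with & (each side needs its own parentheses)
mask = (sample["price_new"] > 0) & (sample["comment_new"] > 0)

# .copy() makes the result an independent frame, so adding columns later
# does not trigger SettingWithCopyWarning
kept = sample[mask].copy()
print(kept.index.tolist())
```

The original index labels are kept after filtering, which is why position-based access with `iloc` is used on the filtered frame.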

print(df_new.iloc[0]['commentlist'].split('                                ')[2][2:])

7.4

quality = []
environment = []
service = []
for text in df_new['commentlist']:
    # split() without arguments splits on any run of whitespace
    parts = text.split()
    # Each part looks like '口味7.4'; drop the two-character label
    quality.append(float(parts[0][2:]))
    environment.append(float(parts[1][2:]))
    service.append(float(parts[2][2:]))

df_new['quality'] = quality
df_new['environment'] = environment
df_new['service'] = service

df_new.head()

	classify	name	comment	star	price	address	commentlist	comment_new	price_new	quality	environment	service
1	美食	泰国街边料理	74 条点评	准四星商户	人均￥48	黄兴路合生汇B2美食集市内	口味7.4 环境7.6 ...	74	48	7.4	7.6	7.4
2	美食	壹面如故(苏宁生活广场店)	265 条点评	准四星商户	人均￥21	邯郸路585号苏宁生活广场B1层	口味7.0 环境7.2 ...	265	21	7.0	7.2	7.2
3	美食	鮨谷•Aburiya(合生汇店)	2748 条点评	准五星商户	人均￥142	翔殷路1099号合生广场5楼23、28铺位	口味8.9 环境8.5 ...	2748	142	8.9	8.5	8.4
5	美食	麦当劳(万达店)	785 条点评	准四星商户	人均￥24	邯郸路600号万达商业广场B1楼A05号铺	口味7.4 环境7.2 ...	785	24	7.4	7.2	7.2
6	美食	蒸年青STEAMYOUNG(百联又一城购物中心店)	3779 条点评	准五星商户	人均￥70	淞沪路8号百联又一城购物中心7层	口味8.6 环境8.6 ...	3779	70	8.6	8.6	8.6
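
Splitting `commentlist` can also be done column-wise with the pandas string methods. This is a hedged sketch, assuming every value holds exactly three whitespace-separated scores, each prefixed by a two-character label; `sample`, `parts`, and `scores` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample; the first label differs by category (口味 vs 质量)
sample = pd.DataFrame({
    "commentlist": ["口味7.4   环境7.6   服务7.4", "质量7.1   环境6.9   服务7.0"],
})

# Split on whitespace into three columns (named 0, 1, 2)
parts = sample["commentlist"].str.split(expand=True)

# Drop the two-character label in every column and convert to float
scores = parts.apply(lambda col: col.str[2:].astype(float))
scores.columns = ["quality", "environment", "service"]

# Attach the three score columns to the original rows by index
sample = sample.join(scores)
print(sample[["quality", "environment", "service"]])
```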

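pickle is a module in Python's standard library; it ships with Python and needs no separate installation.
The pickle module implements basic serialization and deserialization. Serializing writes an object from a running program to a file for persistent storage; deserializing recreates the saved object from that file.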

# 4. Save the result as a .pkl file

import pickle

# The with-block closes the file automatically, even if dump() raises
with open('data.pkl', 'wb') as pic:
    pickle.dump(df_new, pic)
print('finished!')
# The data is now stored in data.pkl

finished!
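
The deserialization half is not shown in this article, so here is a minimal round-trip sketch. It uses a hypothetical one-row frame named `frame` in place of `df_new` and writes to a temporary directory instead of the real `data.pkl`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical small frame standing in for df_new
frame = pd.DataFrame({"name": ["泰国街边料理"], "price_new": [48]})

# Use a temporary directory so the sketch leaves no file behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")

    # Serialize the object to the file
    with open(path, "wb") as f:
        pickle.dump(frame, f)

    # Deserialize: rebuild the object that was saved
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(frame))
```

pandas also provides `DataFrame.to_pickle` and `pd.read_pickle` for the same round trip. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.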

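This exercise was good practice in working with DataFrame data, and it made the basic approach to slicing, cleaning, and processing data more familiar.

The more you know, the more you find you do not know.
With principle but no technique, technique can still be learned; with technique but no principle, one stops at technique.
If you have other questions, feel free to leave a comment so we can discuss, learn, and improve together.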
