2020-05-23

Inannan_wzh

Article link: https://blog.csdn.net/Inannan_wzh/article/details/106296517

Task 2: Data Reading and Data Augmentation

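1. Download the dataset
The pandas, os, and requests modules are used to download the data as a stream from the given links. The downloads are zip archives, so zipfile is also needed to unpack them. Two points to watch: no file path should contain Chinese characters, and the download loop should write the file to disk chunk by chunk with the iter_content() method.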

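```python
# Imports
import os
import zipfile

import pandas as pd
import requests

# Keep Chinese characters out of every path
links = pd.read_csv(r'D:\DataWhale - CVimagedetect\CV_DataSet\mchar_data_list_0515.csv')

# Create a folder for the data unless it already exists.
# os.path.join adds the separator that plain string concatenation would omit.
dir_name = 'The Street View House Numbers Dataset'
mypath = r'D:\DataWhale - CVimagedetect\CV_DataSet\image_DataSet'
data_dir = os.path.join(mypath, dir_name)
os.makedirs(data_dir, exist_ok=True)

# enumerate() yields (index, value) pairs, e.g. (0, 'xx')
for i, link in enumerate(links['link']):
    file_name = links['file'][i]
    print(file_name, '\t', link)
    file_name = os.path.join(data_dir, file_name)
    if not os.path.exists(file_name):
        # stream=True with iter_content() saves to disk while downloading;
        # a plain requests.get() holds the whole body in memory first
        response = requests.get(link, stream=True)
        with open(file_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)

# Unpack each archive unless its folder is already there
zip_list = ['mchar_train', 'mchar_test_a', 'mchar_val']
for little_zip in zip_list:
    if not os.path.exists(os.path.join(data_dir, little_zip)):
        with zipfile.ZipFile(os.path.join(data_dir, little_zip + '.zip'), 'r') as zip_file:
            zip_file.extractall(path=data_dir)
```

After extraction an extra __MACOSX folder appears. Its contents are unreadable: the folder is generated automatically when the uploader compresses files on a Mac, much like thumbnail cover files. It serves no purpose here and can be deleted.

2. Inspect the data
The dataset consists of three image folders, test (the test set, 40,000 images), train (the training set, 30,000 images), and val (the validation set, 10,000 images), plus two JSON files. Opening the image folders shows PNG crops that contain digits, without the character bounding boxes described in the task statement. The boxes live in the JSON files, which can be read with the json module: for each image in the training and validation sets they hold arrays with the position, size, and label (numeric value) of every character.

```python
import json

json_path = r'D:\DataWhale - CVimagedetect\CV_DataSet\image_DataSet\The Street View House Numbers Dataset\mchar_train.json'
with open(json_path, 'r', encoding='utf-8') as f:
    temp = json.load(f)
print(temp)
```

Printing temp produces the following message instead of the data:

```
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
```

The JSON file is large, so printing all of it exceeds Jupyter Notebook's output data-rate limit (a limit on how fast output is sent to the browser, not on memory). To raise it, run "jupyter notebook --generate-config", open the generated config file, uncomment the "iopub_data_rate_limit" setting, append a few zeros to its value, and restart Jupyter Notebook. My work computer did not recognize the jupyter notebook command: administrator rights were needed to add conda's Python path to the environment variables. An alternative is to run the command directly from Anaconda Prompt. After rerunning the cell, the whole JSON file is printed.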
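Raising the limit is only necessary when the whole file is printed at once. As a minimal sketch of an alternative, assuming the loaded annotations form a dict keyed by image file name, it may be enough to print the number of entries and the first few of them. The helper name summarize_annotations and the second sample entry are hypothetical and serve only as illustration.

```python
import itertools
import json


def summarize_annotations(annotations, n=3):
    """Return the number of entries and the first n (name, annotation) pairs."""
    first = list(itertools.islice(annotations.items(), n))
    return len(annotations), first


# Stand-in for the dict loaded from mchar_train.json.
# The first entry is the one shown in this post; the second is invented.
sample = {
    '000000.png': {'height': [219, 219], 'label': [1, 9],
                   'left': [246, 323], 'top': [77, 81], 'width': [81, 96]},
    '000001.png': {'height': [30], 'label': [5],
                   'left': [10], 'top': [12], 'width': [20]},
}

count, first = summarize_annotations(sample, n=1)
print(count, 'entries')
for name, ann in first:
    print(name, json.dumps(ann))
```

Because only a handful of entries are sent to the browser, the output stays far below the default rate limit, so no configuration change is needed for a quick look.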
Index a single entry to view the annotation of one image:

```python
print(temp['000000.png'])
# Output:
# {'height': [219, 219], 'label': [1, 9], 'left': [246, 323], 'top': [77, 81], 'width': [81, 96]}
```

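The result is the digit information for image 000000.png. Every field is a list with two elements, one per digit, because this image contains two digits: height and width hold the height and width of each digit, top and left hold the pixel coordinates of each digit within the image, and label holds the value of each digit.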
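To make the layout of these fields concrete, here is a minimal sketch that regroups one annotation into a box per digit. It assumes that left and top mark the top-left corner of each box, which the post does not state explicitly, and the helper name to_boxes is hypothetical.

```python
def to_boxes(annotation):
    """Regroup one annotation into per-digit boxes.

    Each box is (label, left, top, right, bottom) in pixels,
    assuming (left, top) is the top-left corner of the digit.
    """
    boxes = []
    for label, left, top, width, height in zip(
        annotation['label'], annotation['left'], annotation['top'],
        annotation['width'], annotation['height'],
    ):
        boxes.append((label, left, top, left + width, top + height))
    return boxes


# The annotation of 000000.png as printed above
ann = {'height': [219, 219], 'label': [1, 9],
       'left': [246, 323], 'top': [77, 81], 'width': [81, 96]}

for box in to_boxes(ann):
    print(box)
# (1, 246, 77, 327, 296)
# (9, 323, 81, 419, 300)
```

Reading the fields column by column like this shows that index 0 of every list describes the digit 1 and index 1 describes the digit 9.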