Reading and Writing Large Files in Python

Latest recommended article published on 2024-08-19 17:47:49

倾城一少

Category: Python. Tags: python, programming languages, HDF5, pandas, Powered by 金山文档

Article link: https://blog.csdn.net/u010329292/article/details/128924865

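This article describes the memory and I/O bottlenecks encountered when processing a large BSON file. After separating the validation set from the training set, the author compares the efficiency of storing the data in a pandas DataFrame versus a list, and finds the list to be better. In the end, the file is stored in pickle format, which resolves both the speed and the capacity problems.

Summary generated automatically by CSDN.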

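1. Preface

While working on a project I ran into a training set that was too large to load into memory, and using Keras's fit_generator() also seemed to hit an I/O bottleneck. So I decided to split the validation set out of the training set, so that only the validation set has to be read into memory each time, which saves some time. I ran into a series of problems along the way and am recording them here for future reference.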

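2. Reading

Note: pandas.DataFrame is a very convenient data structure, but be careful with it when reading large files, or things can easily go badly wrong.

The problem I faced: the training set is a 62 GB BSON file. Using an index, I needed to locate the validation-set records inside it, read them out, and write them to a separate file for later use.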
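To make the access pattern concrete, here is a minimal sketch of offset-based random reads using only the standard library. It does not use the real 62 GB file: the records, the file name `train.bin`, and the helpers `write_records` and `read_records` are all hypothetical stand-ins for illustration, and the "index" is simply a list of (offset, length) pairs.

```python
import os
import tempfile


def write_records(path, records):
    """Write byte records back to back and return their (offset, length) pairs."""
    index = []
    with open(path, "wb") as f:
        for rec in records:
            index.append((f.tell(), len(rec)))  # where this record starts, and its size
            f.write(rec)
    return index


def read_records(path, index, wanted):
    """Read only the records whose positions in `index` are listed in `wanted`."""
    out = []
    with open(path, "rb") as f:
        for i in wanted:
            offset, length = index[i]
            f.seek(offset)              # jump straight to the record (random read)
            out.append(f.read(length))  # read exactly that many bytes
    return out


# Toy data standing in for the large training file (hypothetical).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "train.bin")
records = [b"alpha", b"bravo-bravo", b"c", b"delta!"]
index = write_records(src, records)

# Pull out a "validation" subset without reading the whole file.
subset = read_records(src, index, wanted=[3, 1])
print(subset)
```

Only the requested byte ranges are read, so memory use depends on the size of the subset rather than on the size of the source file.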

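(1) Storing row by row in a DataFrame

The first approach that came to mind was to use a DataFrame and store each record in the df row by row as it was read.

Sample code: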

import os

import bson  # assumed to be the bson package bundled with pymongo
import pandas as pd
from tqdm import tqdm

# Input data files are available in the "../input/" directory.
data_dir = "../input/"
train_bson_path = os.path.join(data_dir, "train.bson")

# First load the lookup tables from the CSV files.
# Index and features (offset, length, ...) of every instance in the training set file.
train_offsets_df = pd.read_csv("train_offsets.csv", index_col=0)
# Index of the instances that belong to the validation set.
val_images_df = pd.read_csv("val_images.csv", index_col=0)
num_val_images = len(val_images_df)

val_full_dataset = pd.DataFrame(columns=["x", "y"])  # create df to save the validation set

# Open the training set file; the with block closes it once the loop finishes.
with open(train_bson_path, "rb") as train_bson_file, \
        tqdm(total=num_val_images) as pbar:  # show estimated time and progress
    for c, d in enumerate(val_images_df.itertuples()):
        offset_row = train_offsets_df.loc[d[1]]  # find the corresponding location
        # Read this product's data from the BSON file.
        train_bson_file.seek(offset_row["offset"])  # find where to start reading (random read)
        item_data = train_bson_file.read(offset_row["length"])
        # Grab the image from the product.
        item = bson.BSON(item_data).decode()
        bson_img = item["imgs"][d[3]]["picture"]
        val_full_dataset.loc[c] = {"x": bson_img, "y": d[2]}  # save into the dataframe
        pbar.update()

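The code runs, but in practice the read rate dropped sharply, from about 700 items/s at the start to 100 items/s, and it kept falling. This simple piece of code took 25 hours to run.

After a series of tests, I found that the bottleneck was val_full_dataset.loc[c] = {"x": bson_img, "y": d[2]}. The key point is that df.loc[] gets slower and slower as the df grows.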
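One way to observe this kind of slowdown on a small scale is to time row-by-row `df.loc[i] = ...` assignments in fixed-size chunks. The sketch below is not the author's original test: the function name `time_loc_growth`, the row count, the chunk size, and the dummy payload are all assumptions chosen so it finishes quickly. Absolute timings will vary by machine and pandas version, so no expected output is shown.

```python
import time

import pandas as pd


def time_loc_growth(n_rows, chunk):
    """Grow a DataFrame one row at a time with df.loc and time each chunk of rows."""
    df = pd.DataFrame(columns=["x", "y"])
    timings = []
    start = time.perf_counter()
    for i in range(n_rows):
        # Same pattern as the article's loop: assigning to a new label enlarges the frame.
        df.loc[i] = {"x": b"payload", "y": i}
        if (i + 1) % chunk == 0:
            now = time.perf_counter()
            timings.append(now - start)  # seconds spent on the last `chunk` rows
            start = now
    return df, timings


df, timings = time_loc_growth(n_rows=1000, chunk=250)
for k, seconds in enumerate(timings):
    print(f"rows {k * 250}-{(k + 1) * 250 - 1}: {seconds:.3f}s")
```

If later chunks take longer than earlier ones, the per-row cost is growing with the length of the frame, which is the behavior described above.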

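Next I tested df.append() and reached a conclusion similar to the one for loc: the running speed of both depends on the length of the df itself. So loc is the right tool only when the df has a fixed length.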
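As a rough illustration of the two alternatives this points to, the sketch below fills a preallocated, fixed-length DataFrame with `loc`, and separately collects rows in a plain Python list and builds the DataFrame once at the end. The row count `n` and the dummy payload are assumptions for illustration only; in the real task each row would hold the image bytes and its label.

```python
import pandas as pd

n = 1000  # hypothetical number of validation rows, known in advance

# Option A: preallocate a fixed-length frame, then fill existing cells with loc.
# No row is ever added, so the frame is never enlarged inside the loop.
fixed = pd.DataFrame(index=range(n), columns=["x", "y"])
for i in range(n):
    fixed.loc[i, "x"] = b"payload"
    fixed.loc[i, "y"] = i

# Option B: accumulate rows in a list and construct the DataFrame once.
rows = []
for i in range(n):
    rows.append((b"payload", i))  # list.append does not copy the existing rows
from_list = pd.DataFrame(rows, columns=["x", "y"])

print(len(fixed), len(from_list))
```

Both options avoid enlarging the DataFrame inside the loop. Option B also works when the final number of rows is not known ahead of time.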