There is a large dataset containing strings.
I just want to read it with read_fwf using widths, like this:
widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)
It would help me to mark the data, but the system crashes (it works with nrows=20000). So I decided to process the file in chunks (e.g. 20000 rows), like this:
cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    ...
My question is: what should I do in the loop to merge (concatenate?) the chunks back into a .csv file after some processing of each chunk (marking rows, dropping or modifying columns)? Or is there another way?
Solution
I'm going to assume that since reading the entire file
tp = pandas.read_fwf(file, widths=widths, header=None)
fails but reading in chunks works, the file is too big to be read at once and you encountered a MemoryError.
In that case, if you can process the data in chunks, then to concatenate the results in a CSV, you could use chunk.to_csv to write the CSV in chunks:
filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk (mark rows, drop or modify columns)
    # header=False avoids writing the column names once per chunk
    chunk.to_csv(filename, mode='a', header=False)
Note that mode='a' opens the file in append mode, so that the output of each
chunk.to_csv call is appended to the same file.
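One thing to watch with plain append mode: to_csv writes the column names by default, so each chunk would add its own header row, and if the output file already exists from a previous run, new rows pile up after the old ones. A minimal sketch of one way to handle both, with hypothetical file names and placeholder widths standing in for your own:

import pandas as pd

file = "data.fwf"        # assumed input path
out = "output.csv"       # assumed output path
widths = [3, 7, 9, 7]    # placeholder widths; use your real ones
cs = 20000

first = True
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk here, e.g. mark rows or drop a column:
    # chunk = chunk.drop(columns=[2])
    # first chunk: mode='w' truncates any existing file and writes the header;
    # later chunks: mode='a' appends rows without repeating the header
    chunk.to_csv(out, mode='w' if first else 'a', header=first, index=False)
    first = False

Writing index=False keeps pandas from adding a row-number column to the CSV; drop it if you want the index preserved.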