Chunking, Processing, and Merging Datasets in Pandas / Python


There is a large dataset containing strings.

I just want to open it via read_fwf using widths, like this:

widths = [3, 7, ..., 9, 7]

tp = pandas.read_fwf(file, widths=widths, header=None)

This would help me mark the data.

But the system crashes (it works with nrows=20000). So I decided to process it in chunks (e.g. 20000 rows), like this:

cs = 20000

for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    ...

My question is: what should I do in the loop to merge (concatenate?) the chunks back into a .csv file after some processing of each chunk (marking rows, dropping or modifying columns)? Or is there another way?

Solution

I'm going to assume that, since reading the entire file

tp = pandas.read_fwf(file, widths=widths, header=None)

fails but reading in chunks works, the file is too big to be read at once and you encountered a MemoryError.

In that case, if you can process the data in chunks, then to combine the results into a CSV, you could use chunk.to_csv to write the CSV in chunks:

filename = ...

for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')

Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to the same file.
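
Putting it together, here is a minimal sketch of the whole loop, assuming made-up widths, file names (data.fwf, processed.csv) and a trivial processing step; the useful details are writing the CSV header only for the first chunk, so it is not repeated on every append, and passing index=False so the row index is not written as an extra column:

import pandas as pd

widths = [3, 7, 9, 7]         # placeholder widths
infile = 'data.fwf'           # placeholder input file
outfile = 'processed.csv'     # placeholder output file
cs = 20000

first = True
for chunk in pd.read_fwf(infile, widths=widths, header=None, chunksize=cs):
    # example processing: mark the rows and drop an unwanted column
    chunk['mark'] = 'processed'
    chunk = chunk.drop(columns=[1])

    # overwrite on the first chunk, append afterwards; write the header only once
    chunk.to_csv(outfile, mode='w' if first else 'a', header=first, index=False)
    first = False

If the processed chunks are small enough to hold in memory, an alternative is to collect them in a list and call pd.concat on that list once, then write the combined DataFrame with a single to_csv; the append approach above keeps memory usage bounded regardless of file size.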
