Python CSV merge cells: Merging CSV files with different columns in Python

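This article shows how to merge multiple CSV files that contain different columns in Python, using csv.DictReader and csv.DictWriter to merge by column name rather than column position. Because pandas is ruled out for memory reasons, the approach reads and writes row by row, and leaves a cell empty in the merged CSV wherever the source file lacks that column.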

I have hundreds of large CSV files that I would like to merge into one. However, not all CSV files contain all columns. Therefore, I need to merge files based on column name, not column position.

Just to be clear: in the merged CSV, values should be empty for a cell coming from a line which did not have the column of that cell.

I cannot use the pandas module, because it makes me run out of memory.

Is there a module that can do that, or some easy code?

Solution

The csv.DictReader and csv.DictWriter classes should work well (see Python docs). Something like this:

import csv

inputs = ["in1.csv", "in2.csv"]  # etc

# First determine the field names from the top line of each input file
# Comment 1 below
fieldnames = []
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:  # Comment 2 below
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()  # Write the merged header row once, before any data
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # Comment 3 below
                writer.writerow(line)
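To make the expected result concrete, here is a minimal sketch that runs the same DictReader/DictWriter pattern on two tiny in-memory inputs instead of files. The sample data, the hard-coded fieldnames list and the use of io.StringIO are assumptions made purely for illustration; lineterminator="\n" is set only to keep the printed output simple.

```python
import csv
import io

# Two hypothetical inputs whose columns only partly overlap
in1 = "id,name\n1,Ann\n2,Bob\n"
in2 = "id,city\n3,Oslo\n"

# Union of both headers, in first-seen order (hard-coded here for brevity)
fieldnames = ["id", "name", "city"]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
for text in (in1, in2):
    # Each DictReader picks up the header of its own input
    for row in csv.DictReader(io.StringIO(text)):
        writer.writerow(row)  # fields missing from row are written as ""

merged = out.getvalue()
print(merged)
# id,name,city
# 1,Ann,
# 2,Bob,
# 3,,Oslo
```

Rows from the first input have an empty city cell and the row from the second input has an empty name cell, which is the behaviour the question asks for.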

Comments from above:

1. You need to give DictWriter all the possible field names in advance, so you have to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using sets instead of lists (the in operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. Sets would also lose the deterministic ordering of a list: your columns would come out in a different order each time you ran the code.

2. The code above is for Python 3, where the csv module expects files to be opened with newline="". Without it, newlines embedded in quoted fields are not interpreted correctly, and on platforms that write \r\n line endings an extra \r is added. For Python 2, remove the newline argument and open the files in binary mode ("rb" and "wb") instead.

3. At this point, line is a dict with the field names as keys and the column data as values. By default, DictWriter writes an empty string for any field that is missing from the dict, which produces the empty cells asked for. You can change how missing or unexpected values are handled through constructor arguments: restkey and restval on DictReader, restval and extrasaction on DictWriter.
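The constructor options mentioned in the last comment can be sketched as follows. This is an illustration on made-up in-memory data, not part of the original answer; the field names, the "NA" placeholder and the "extra" key are assumptions chosen for the example.

```python
import csv
import io

# DictWriter: control what is written for missing or unexpected keys
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["id", "name"],
    restval="NA",           # written for any field missing from a row (default "")
    extrasaction="ignore",  # drop keys that are not in fieldnames (default "raise")
    lineterminator="\n",    # only to keep the sample output simple
)
writer.writeheader()
writer.writerow({"id": "1"})                           # "name" is missing
writer.writerow({"id": "2", "name": "Bob", "x": "?"})  # "x" is not a known field
written = out.getvalue()
print(written)
# id,name
# 1,NA
# 2,Bob

# DictReader: control what short or long data rows turn into
src = io.StringIO("a,b\n1\n1,2,3\n")
rows = list(csv.DictReader(src, restkey="extra", restval=""))
print(rows)
# [{'a': '1', 'b': ''}, {'a': '1', 'b': '2', 'extra': ['3']}]
```

In a merge script like the one above, a data row with more cells than its header would carry a None key (the default restkey), which DictWriter rejects with a ValueError by default. Passing extrasaction="ignore" is one way to tolerate such rows, at the cost of silently dropping the surplus cells.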

This method should not run out of memory, because it never has the whole file loaded at once.
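On the list-versus-set trade-off raised in the first comment: since Python 3.7 a plain dict keeps insertion order, so its keys can serve as an ordered set with fast membership tests. A minimal sketch of the header-collection pass written that way, assuming a hypothetical helper name collect_fieldnames that is not part of the original answer:

```python
import csv


def collect_fieldnames(paths):
    """Return the union of header names across files, in first-seen order."""
    seen = {}  # dict keys: constant-time lookups, insertion order kept (Python 3.7+)
    for path in paths:
        with open(path, "r", newline="") as f_in:
            # Read only the first row; fall back to [] if the file is empty
            headers = next(csv.reader(f_in), [])
            # Existing keys keep their position, new ones are appended
            seen.update(dict.fromkeys(headers))
    return list(seen)
```

The result can be passed straight to csv.DictWriter as fieldnames. Only the header line of each file is read in this pass, so it stays cheap even for large inputs.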
