Splitting a CSV file by the value of a specific column using Python

I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB+) and split it into a number of smaller files, based on the value of a particular column in each row.

For example, the file may look like this:

    Category,Title,Sales
    "Books","Harry Potter",1441556
    "Books","Lord of the Rings",14251154
    "Series","Breaking Bad",6246234
    "Books","The Alchemist",12562166
    "Movie","Inception",1573437

And I would want to split the file into separate files:

Books.csv, Series.csv, Movie.csv

In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column but in future they may not be.

I've found a few solutions online but nothing in Python. There is a really simple AWK command that can do this in one line, but I cannot get access to AWK at work.

I've written the following code which works, but I think it is probably very inefficient. Can anybody suggest how to speed it up?

    import csv

    # Creates an empty set - this will be used to store the values that have already been used
    filelist = set()

    # Opens the large csv file in "read" mode
    with open('//directory/largefile', 'r') as csvfile:

        # Read the first row of the large file and store the whole row as a string (headerstring)
        read_rows = csv.reader(csvfile)
        headerrow = next(read_rows)
        headerstring = ','.join(headerrow)

        for row in read_rows:

            # Store the whole row as a string (rowstring)
            rowstring = ','.join(row)

            # Defines filename as the first entry in the row - this could be made dynamic so that the user inputs a column name to use
            filename = row[0]

            # This basically makes sure it is not looking at the header row.
            if filename != "Category":

                # If the filename is not in the filelist set, add it to the set and create a new csv file with the header row.
                if filename not in filelist:
                    filelist.add(filename)
                    with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                        f.write(headerstring)
                        f.write("\n")

                # Always append the current row to the csv file for its category,
                # including the row that caused the file to be created.
                with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                    f.write(rowstring)
                    f.write("\n")

Thanks!

Solution

A memory-efficient approach that also avoids repeatedly re-opening files to append (as long as you're not going to end up with a huge number of open file handles) is to use a dict that maps each category to a file object. If the file for a category isn't open yet, create it and write the header; then always write the row to the corresponding file, e.g.:

    import csv

    # newline='' lets the csv module handle line endings itself (Python 3)
    with open('somefile.csv', newline='') as fin:
        csvin = csv.DictReader(fin)

        # Category -> (open file, DictWriter) lookup
        outputs = {}

        for row in csvin:
            cat = row['Category']

            # Open a new file and write the header
            if cat not in outputs:
                fout = open('{}.csv'.format(cat), 'w', newline='')
                dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
                dw.writeheader()
                outputs[cat] = fout, dw

            # Always write the row
            outputs[cat][1].writerow(row)

    # Close all the files
    for fout, _ in outputs.values():
        fout.close()
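
If the number of categories might exceed what the OS allows in open file handles, or the column to split on isn't always `Category`, the same idea can be wrapped in a function. The following is only a sketch under some assumptions: the function name `split_csv_by_column`, its parameters, and the default `max_open` value are made up for illustration, and category values are assumed to be safe to use as file names. When the cap is reached it closes the file that was opened earliest and later reopens it in append mode if that category turns up again.

```python
import csv
from pathlib import Path


def split_csv_by_column(src, out_dir, column, max_open=100):
    """Split src into one CSV per distinct value of `column` (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    seen = set()      # categories whose file already exists with a header
    open_files = {}   # category -> (file object, DictWriter), in opening order

    try:
        with open(src, newline="") as fin:
            reader = csv.DictReader(fin)
            for row in reader:
                key = row[column]

                if key not in open_files:
                    # Stay under the cap by closing the earliest-opened file.
                    if len(open_files) >= max_open:
                        oldest = next(iter(open_files))
                        open_files.pop(oldest)[0].close()

                    # First sighting: create the file; later: append to it.
                    mode = "a" if key in seen else "w"
                    fout = open(out_dir / f"{key}.csv", mode, newline="")
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    if key not in seen:
                        writer.writeheader()
                        seen.add(key)
                    open_files[key] = (fout, writer)

                # Always write the row to its category's file.
                open_files[key][1].writerow(row)
    finally:
        # Close whatever is still open, even if an error occurred.
        for fout, _ in open_files.values():
            fout.close()

    return sorted(seen)
```

A call such as `split_csv_by_column('somefile.csv', 'subfiles', 'Category')` would write `subfiles/Books.csv`, `subfiles/Series.csv` and so on. With unsorted input and a small `max_open`, files get reopened more often, so the cap is a trade-off between handle usage and speed.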
