pythoncsv_用Python处理csv文件的一些小玩意儿

Python CSV Toolkit

整理了一些个人在利用python处理csv文件时经常用到的一些自定义方法,放在这里主要方便自己查阅,也可以给其他人做参考

目录

输出CSV文件某列的匹配/不匹配的记录

调整csv文件的列的顺序

CSV转换器

抽取特定列

除去完全重复的记录

根据列名排序

键值互换

输出CSV文件某列的匹配/不匹配的记录

主要用于从csv文件中抽取出匹配特定列的特定字段集合的记录,比如现有这么一个csv文件(表格化后)

name

age

sex

Danny

24

male

Daisy

23

female

Lancelot

23

unknown

Lydia

21

female

...

...

...

需要输出其中age为23的记录到新的csv文件,则我们可以先把23这么个关键词用一个列表收集起来,然后通过下列代码从csv文件中找出所有符合条件的记录并输出

import sys

import csv

# try to fix '_csv.Error: field larger than field limit (131072)'

csv.field_size_limit(sys.maxint)

# write to common csv file with delimiter ','

# output the rows with matched id in id_list to a new csv file

def csv_match(id_list,key,input_file,output_file):

with open(input_file, 'rb') as f:

reader = csv.DictReader(f)

rows = [row for row in reader if row[key] in set(id_list)]

header = rows[0].keys()

with open(output_file, 'w') as f:

f.write(','.join(header))

f.write('\n')

for data in rows:

f.write(",".join(data[h] for h in header))

f.write('\n')

调用的时候:

lst=['23']

csv_match(lst,'age','in.csv','out.csv')

key为需要匹配的列名,另外我们也可以提取不符合该条件的记录,‘取个反’就行了

# output the rows with not matched id in id_list to a new csv file

def csv_not_match(id_list, key, input_file, output_file):

with open(input_file, 'rb') as f:

reader = csv.DictReader(f)

rows = [row for row in reader if not row[key] in set(id_list)]

header = rows[0].keys()

with open(output_file, 'w') as f:

f.write(','.join(header))

f.write('\n')

for data in rows:

f.write(",".join(data[h] for h in header))

f.write('\n')

对于需要判断csv文件中多个列的值的情况,只需修改对应的判别条件和传入参数情况即可

# output the rows with matched key1 or key2 in refer_list to a new csv file

# @params

# refer_list: the list referred to

# key,key2: column name of csv file to check the value in the refer_list or not

def csv_match2(refer_list, key1, key2, input_file, output_file):

with open(input_file, 'rb') as f:

reader = csv.DictReader(f)

rows = [row for row in reader if (row[key1] in set(refer_list)) or (row[key2] in set(refer_list))]

header = rows[0].keys()

with open(output_file, 'w') as f:

f.write(','.join(header))

f.write('\n')

for data in rows:

f.write(",".join(data[h] for h in header))

f.write('\n')

调整csv文件的列的顺序

有时候我们输出的或者拿到的csv文件的列的顺序不够‘人性化’,为了让我们看起来更加直观,更舒服一点,我们可以按照我们的需要调整列的顺序

import csv

# reorder the column of the csv file to what you want

def csv_reorder(in_file, out_file,lst_order):

with open(in_file, 'rb') as infile, open(out_file, 'wb') as outfile:

fieldnames=lst_order

writer = csv.DictWriter(outfile, fieldnames=fieldnames)

writer.writeheader()

for row in csv.DictReader(infile):

writer.writerow(row)

其中lst_order为我们需要的列名顺序,用list存储,举个例子

season_id,league_name,league_size

2003,scottish-premiership,12

2016,1-hnl,10

2004,alka-superligaen,12

2006,allsvenskan,14

1992,premier-league,22

...

现在我们想调整他的顺序,按照league_name,season_id,league_size的顺序重新组合一下

则调用

lst_order = ['league_name','season_id','league_size']

csv_reorder('leagues_size.csv', 'leagues_size_new.csv', lst_order)

得到结果

league_name,season_id,league_size

scottish-premiership,2003,12

1-hnl,2016,10

alka-superligaen,2004,12

allsvenskan,2006,14

premier-league,1992,22

...

CSV转换器

这个主要是用来进行csv和python的一些内置的容器例如list,dict之类的转换,包括一些特殊的多级字典,或者是嵌套列表的字典等等,这里只是把他们打个包放在一起,具体的可以参照我之前写的一篇文章

import csv

#---------------------------------------------------csv <--> dict--------------------------------------------

# convert csv file to dict

# @params:

# key/value: the column of original csv file to set as the key and value of dict

def csv2dict(in_file,key,value):

new_dict = {}

with open(in_file, 'rb') as f:

reader = csv.reader(f, delimiter=',')

fieldnames = next(reader)

reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')

for row in reader:

new_dict[row[key]] = row[value]

return new_dict

# convert csv file to dict(key-value pairs each row)

# default: set row[0] as key and row[1] as value of the dict

def row_csv2dict(csv_file):

dict_club={}

with open(csv_file)as f:

reader=csv.reader(f,delimiter=',')

for row in reader:

dict_club[row[0]]=row[1]

return dict_club

# write dict to csv file

# write each key/value pair on a separate row

def dict2csv(dict, file):

with open(file, 'wb') as f:

w = csv.writer(f)

# write each key/value pair on a separate row

w.writerows(dict.items())

# write dict to csv file

# write all keys on one row and all values on the next

def dict2csv2(dict, file):

with open(file, 'wb') as f:

w = csv.writer(f)

# write all keys on one row and all values on the next

w.writerow(dict.keys())

w.writerow(dict.values())

# build a dict of list like {key:[...element of lst_inner_value...]}

# key is certain column name of csv file

# the lst_inner_value is a list of specific column name of csv file

def build_list_dict(source_file, key, lst_inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

for element in lst_inner_value:

new_dict.setdefault(row[key], []).append(row[element])

return new_dict

# sample:

# test_club=build_list_dict('test_info.csv','season',['move from','move to'])

# print test_club

# build specific nested dict from csv files

# @params:

# source_file

# outer_key:the outer level key of nested dict

# inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key

def build_level2_dict(source_file,outer_key,inner_key):

new_dict = {}

with open(source_file, 'rb')as csv_file:

reader = csv.reader(csv_file, delimiter=',')

fieldnames = next(reader)

inner_keyset=fieldnames

inner_keyset.remove(outer_key)

inner_keyset.remove(inner_key)

csv_file.seek(0)

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

item = new_dict.get(row[outer_key], dict())

item[row[inner_key]] = {k: row[k] for k in inner_keyset}

new_dict[row[outer_key]] = item

return new_dict

# build specific nested dict from csv files

# @params:

# source_file

# outer_key:the outer level key of nested dict

# inner_key:the inner level key of nested dict

# inner_value:set the inner value for the inner key

def build_level2_dict2(source_file,outer_key,inner_key,inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

item = new_dict.get(row[outer_key], dict())

item[row[inner_key]] = row[inner_value]

new_dict[row[outer_key]] = item

return new_dict

# build specific nested dict from csv files

# @params:

# source_file

# outer_key:the outer level key of nested dict

# lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct

# {outer_key:[{pairs of lst_inner_value}]}

def build_level2_dict3(source_file,outer_key,lst_inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value})

return new_dict

# build specific nested dict from csv files

# @params:

# source_file

# outer_key:the outer level key of nested dict

# lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct

# {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}}

def build_level2_dict4(source_file,outer_key,lst_inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

# print row

item = new_dict.get(row[outer_key], dict())

# item.setdefault('move from',[]).append(row['move from'])

# item.setdefault('move to', []).append(row['move to'])

for element in lst_inner_value:

item.setdefault(element, []).append(row[element])

new_dict[row[outer_key]] = item

return new_dict

# build specific nested dict from csv files

# @params:

# source_file

# outer_key:the outer level key of nested dict

# lst_inner_key:a list of column name

# lst_inner_value: a list of column name,for circumstance that the inner value of the same lst_inner_key are not distinct

# {outer_key:{lst_inner_key:[...lst_inner_value...]}}

def build_list_dict2(source_file,outer_key,lst_inner_key,lst_inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

# print row

item = new_dict.get(row[outer_key], dict())

item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value])

new_dict[row[outer_key]] = item

return new_dict

# dct=build_list_dict2('test_info.csv','season','move from','move to')

# build specific nested dict from csv files

# a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}}

# the params are extract from the csv column name as you like

def build_level3_dict(source_file,outer_key,inner_key1,inner_key2):

new_dict = {}

with open(source_file, 'rb')as csv_file:

reader = csv.reader(csv_file, delimiter=',')

fieldnames = next(reader)

inner_keyset=fieldnames

inner_keyset.remove(outer_key)

inner_keyset.remove(inner_key1)

inner_keyset.remove(inner_key2)

csv_file.seek(0)

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

item = new_dict.get(row[outer_key], dict())

sub_item = item.get(row[inner_key1], dict())

sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset}

item[row[inner_key1]] = sub_item

new_dict[row[outer_key]] = item

return new_dict

# build specific nested dict from csv files

# a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}}

# the params are extract from the csv column name as you like

def build_level3_dict2(source_file,outer_key,inner_key1,inner_key2,inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

item = new_dict.get(row[outer_key], dict())

sub_item = item.get(row[inner_key1], dict())

sub_item[row[inner_key2]] = row[inner_value]

item[row[inner_key1]] = sub_item

new_dict[row[outer_key]] = item

return new_dict

# build specific nested dict from csv files

# a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}}

# for multiple inner_value with the same inner_key2,thus gather them in a list

# the params are extract from the csv column name as you like

def build_level3_dict3(source_file,outer_key,inner_key1,inner_key2,inner_value):

new_dict = {}

with open(source_file, 'rb')as csv_file:

data = csv.DictReader(csv_file, delimiter=",")

for row in data:

item = new_dict.get(row[outer_key], dict())

sub_item = item.get(row[inner_key1], dict())

sub_item.setdefault(row[inner_key2], []).append(row[inner_value])

item[row[inner_key1]] = sub_item

new_dict[row[outer_key]] = item

return new_dict

#----------------------------------------------------------------------------------------------------------

#---------------------------------------------------csv <--> list--------------------------------------------

def list2csv(list, file):

# def list2csv(list):

# wr = csv.writer(open(file, 'wb'), quoting=csv.QUOTE_ALL)

wr=open(file,'w')

for word in list:

# print ''.join(word)

# wr.writerow([word])

wr.write(word+'\n')

# wr.writerow(str.split(word,'"')[0])

# print [word]

# test_list = ['United States', 'China', 'America', 'England']

# list2csv(test_list,'small_test.csv')

# write nested list of dict to csv

def nestedlist2csv(list, out_file):

with open(out_file, 'wb') as f:

w = csv.writer(f)

fieldnames=list[0].keys() # solve the problem to automatically write the header

w.writerow(fieldnames)

for row in list:

w.writerow(row.values())

# my_list = [{'players.vis_name': 'Khazri', 'players.role': 'Midfielder', 'players.country': 'Tunisia',

# 'players.last_name': 'Khazri', 'players.player_id': '989', 'players.first_name': 'Wahbi',

# 'players.date_of_birth': '08/02/1991', 'players.team': 'Bordeaux'},

# {'players.vis_name': 'Khazri', 'players.role': 'Midfielder', 'players.country': 'Tunisia',

# 'players.last_name': 'Khazri', 'players.player_id': '989', 'players.first_name': 'Wahbi',

# 'players.date_of_birth': '08/02/1991', 'players.team': 'Sunderland'},

# {'players.vis_name': 'Lewis Baker', 'players.role': 'Midfielder', 'players.country': 'England',

# 'players.last_name': 'Baker', 'players.player_id': '9574', 'players.first_name': 'Lewis',

# 'players.date_of_birth': '25/04/1995', 'players.team': 'Vitesse'}

# ]

# nestedlist2csv(my_list, 'dict2csv_test.csv')

# collect and convert the first column of csv file to list

def csv2list(csv_file):

lst = []

with open(csv_file, 'rb')as f:

reader = csv.reader(f, delimiter=',')

for row in reader:

lst.append(row[0])

return list(set(lst))

#----------------------------------------------------------------------------------------------------------

抽取特定列

抽取特定列的所有值并存储于列表

根据下标抽取特定列到某个新的csv文件

抽取特定列的所有值并存储于列表

获取某列原始的数据并保存为列表

# get certain column value of csv(for common csv file(','))

def get_origin_column_value(file, column_name):

with open(file, 'rb') as f:

role_list = []

reader = csv.reader(f, delimiter=',')

fieldnames = next(reader)

reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')

for row in reader:

role_list.append(row[column_name])

return role_list

对于某些有特殊需要的可以直接修改代码,比如对原始的列的值进行除重和排序后获取,如下

# get certain column value of csv(for common csv file(',')),and judge if it's repeated

def get_column_value2(file, column_name):

with open(file, 'rb') as f:

role_list = []

reader = csv.reader(f, delimiter=',')

fieldnames = next(reader)

reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')

for row in reader:

role_list.append(row[column_name])

role_set = set(role_list)

return sorted(list(role_set))

根据下标抽取特定列到某个新的csv文件

import csv

# extract certain column from csv file according to the column#

def column_extract(file_in,file_out,index):

with open(file_in,'r') as f_in:

with open(file_out,'w') as f_out:

for line in f_in:

f_out.write(line.split(',')[index])

f_out.write('\n') # comment if a new line already exists

除去完全重复的记录

# eliminated the completely repeated record in repeated file for further analysis

def eliminate_repeated_row(in_file,out_file):

with open(in_file,'rb') as in_file,open(out_file,'wb')as out_file:

seen=set()

for line in in_file:

# print line

if line in seen:continue

seen.add(line)

out_file.write(line)

对csv文件按照某一列排序

# sort the csv file by certain column to put the similar record together for further analysis

def sort_csv_byColumn(in_file, out_file,column_name):

with open(in_file, 'rb') as f:

reader = csv.reader(f, delimiter=',')

fieldnames = next(reader)

reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')

sorted_list = sorted(reader, key=lambda row: row[column_name], reverse=True)

# print sorted_list

csv_converter.nestedlist2csv(sorted_list, out_file)

例如我们按照league_name排序(注意这里调用了csv转换器中的方法将列表的字典转换为csv文件)

sort_csv_byColumn('leagues_size.csv','ordered_leagues_size.csv','league_name')

得到结果

season_id,league_name,league_size

2016,ykkonen,9

2003,ykkonen,14

2005,ykkonen,14

2006,ykkonen,14

2007,ykkonen,14

2010,ykkonen,13

2011,ykkonen,10

2009,ykkonen,14

2008,ykkonen,14

2012,ykkonen,10

2013,ykkonen,10

2014,ykkonen,10

2015,ykkonen,10

2016,wiener-stadtliga,16

1988,wiener-stadtliga,16

1993,wiener-stadtliga,16

1994,wiener-stadtliga,16

1995,wiener-stadtliga,16

1996,wiener-stadtliga,16

1997,wiener-stadtliga,16

1998,wiener-stadtliga,16

如果我们按league_size排序

sort_csv_byColumn('leagues_size.csv',

'orderedbysize_leagues_size.csv','league_size')

得到结果

season_id,league_name,league_size

2008,virsliga,9

2010,virsliga,9

2012,a-lyga,9

2012,a-pojat-sm-sarja,9

2013,a-pojat-sm-sarja,9

1953,salzburger-liga,9

2010,3-lig-grup-1,9

2013,armenian-first-league,9

2016,ykkonen,9

2014,stirling-sports-premiership,9

2014,hong-kong-premier-league,9

2015,hong-kong-premier-league,9

1996,s-league,9

2015,s-league,9

2013,united-football-league,9

2016,i-league,9

键值互换

csv文件每一条记录其实可以看作是一个字典,有时csv文件里有不同的键对应同一个值的情况,我们想讲记录反转一下,即让值作为键,对应的键作为值

# return a dict with the same value in original as new key and keys as value

def dict_same_value(original_dict):

new_dict={}

for k,v in original_dict.iteritems():

new_dict.setdefault(v,[]).append(k)

return new_dict

最后欢迎大家fork关于这个的github上的repository,一起丰富更多好玩的功能~

更新日志

1、2016-12-18 修复了从csv文件中获取特定的列的值保存为集合的问题,而是存储为原始的列表

2、2016-12-22 改进了csv转换器中的构建二级字典的方法,使其变得更加灵活

3、2016年12月24日14:57:48 在csv转换器部分加入三级字典构造的参照方法

4、2017年1月9日11:28:45 在csv转换器部分,三级字典构造中,加入了最内部存储值为列表的构造方法

5、2017年1月16日10:43:41 在csv转换器部分,加入了构造列表字典的方法以及构造特殊的二级字典(内部为列表)的方法

6、2017年2月9日10:58:17 在csv转换器部分,加入了新的构造特殊的二级字典(内部为列表)的方法

7、2017年2月10日11:21:45 在csv转换器部分,改进了简单的csv文件转换为字典的方法,此外在Csv_Match部分,加入了匹配判断多个列对应的元素条件的方法

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值