Python处理CSV文件
CSV(comma-separated value,都好分隔值)文件格式是一种非常简单的数据存储与分享方式。CSV文件中的单元格中存放的只是原始文件。
Excel打开CSV文件,其格式和普通表格没有区别;
用记事本打开CSV文件,其格式如下:
data1,data2,data3
data4,data5,data6
data7,data8,data9
即,每行表示一行数据,同行数据用逗号隔开。
下面需要用到的input_file.csv文件内容如下:
Supplier Name,Invoice Number,Part Number,Cost,Purchase Date
Supplier X,001-1001,2341,$500.00 ,1/20/14
Supplier X,001-1001,2341,$500.00 ,1/20/14
Supplier X,001-1001,5467,$750.00 ,1/20/14
Supplier X,001-1001,5467,$750.00 ,1/20/14
Supplier Y,50-9501,7009,$250.00 ,1/30/14
Supplier Y,50-9501,7009,$250.00 ,1/30/14
Supplier Y,50-9505,6650,$125.00 ,2002/3/14
Supplier Name,Invoice Number,Part Number,Cost,Purchase Date,
Supplier Y,50-9505,6650,$125.00 ,2002/3/14,
Supplier Z,920-4803,3321,$615.00 ,2002/3/14,
Supplier Z,920-4804,3321,$615.00 ,2002/10/14,
Supplier Z,920-4805,3321,$615.00 ,2/17/14,
Supplier Z,920-4806,3321,$345 ,615,2/24/14
使用csv模块读写CSV文件
csv模块用途:用于正确处理数据值中嵌入逗号和其他复杂模式。
注:数据中嵌入逗号会导致数据处理出现错误。例如数据 “63,456.333” 会被错误切分成 “63” 和 “456.333”。
示例:
import csv
in_csvfile = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
out_csvfile = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
with open(in_csvfile, 'r', newline = '') as csv_in_file:
with open(out_csvfile, 'w', newline = "") as csv_out_file:
#创建文件读取对象和写入对象
filereader = csv.reader(csv_in_file, delimiter = ',')
filewriter = csv.writer(csv_out_file, delimiter = ',')
for row_list in filereader:
print(row_list)
filewriter.writerow(row_list)
参数 delimiter = ','
是默认分隔符,所以如果输入文件和输出文件都是用逗号分隔的,就不需要这个参数;
如果,读取对象中 delimiter = '+'
,意为将输入文件按 “+” 进行分隔,写入对象中 delimiter = '-'
,意为将数据用 “-” 拼接后写入输出文件。
筛选特定的行
1.行中的值满足某个条件
示例:
import csv
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewritter = csv.writter(csv_out_file)
header = next(filereader)
filewritter.writerow(header) #首先将列名写入输出文档
for row_list in filereader:
supplier = str(row_list[0]).strip()
cost = str(row_list[3]).strip('$').replace(',', '')
if suppplier == 'Supplier Z' or float(cost) > 600.0:
filewritter.writerow(row_list)
2.行中的值属于某个集合
示例:
将时间在列表 important_dates
中的行写入输出文档。
import csv
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
important_dates = ['1/20/14', '1/30/14']
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewritter = csv.writter(csv_out_file)
header = next(filereader)
filewritter.writerow(header) #首先将列名写入输出文档
for row_list in filereader:
if row_list[4] in important_dates: #第四列的内容是日期
filewritter.writerow(row_list)
3.行中的值匹配于某个模式/正则表达式
示例:
匹配 “001-” 开头的行,写入到输出文档。
import csv
import re
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
#匹配“001-”开头的
pattern = re.compile(r'(?P<my_pattern_group>^001-.*)', re.I)
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewriter = csv.writer(csv_out_file)
header = next(filereader)
filewriter.writerow(header)
for row_list in filereader:
invoice_number = row_list[1]
if pattern.search(invoice_number): #只是这了条件改变了
filewriter.writerow(row_list)
选取特定的列
1.通过选取特定的列索引值
示例:
将第一列、第四列写入输出文档。
import csv
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
my_columns = [0, 3] #将第一列、第四列写入输出文档
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewriter = csv.writer(csv_out_file)
for row_list in filereader:
row_list_output = []
for index in my_columns:
row_list_output.append(row_list[index])
filewriter.writerow(row_list_output)
2.用过选取特定的列标题
示例:
将列标题为 “Cost” , “Supplier Name” 的列写入输出文档。
原则是先找出列标题对应的索引值,在按照索引值的方式进行后续操作。
import csv
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
my_columns_name = ['Cost', 'Supplier Name']
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewriter = csv.writer(csv_out_file)
header = next(filereader)
my_columns = []
#找出对应索引值
for index in range(len(header)):
if header[index] in my_columns_name:
my_columns.append(index)
#将列标题写入文档
row_list_output = []
for index in my_columns:
row_list_output.append(header[index])
filewriter.writerow(row_list_output)
#将数据内容写入文档
for row_list in filereader:
row_list_output = []
for index in my_columns:
row_list_output.append(row_list[index])
filewriter.writerow(row_list_output)
选取连续的行
示例:
将第4至16行写入输出文档。
import csv
input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
row_counter = 0
with open(input_file, 'r', newline = '') as csv_in_file:
with open(output_file, 'w', newline = '') as csv_out_file:
filereader = csv.reader(csv_in_file)
filewriter = csv.writer(csv_out_file)
for row in filereader:
if row_counter >= 3 and row_counter <= 15:
filewriter.writerow([value.strip() for value in row])
row_counter += 1
读取多个CSV文件
示例:
将 input_file1.csv 和 input_file2.csv 文件中的内容写入到输出文件。
import csv
import glob
import os
input_file_path = "E:\\CVSwork\\PythonDataAnalysisFoundation"
output_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\output_file.csv"
file_counter = 1
with open(output_file, 'w', newline = '') as csv_out_file:
filewriter = csv.writer(csv_out_file)
for input_file in glob.glob(os.path.join(input_file_path, 'input_*')):
with open(input_file, 'r', newline = '') as csv_in_file:
print("Operate file {}......".format(os.path.basename(input_file)))
filereader = csv.reader(csv_in_file)
if file_counter == 1:
header = next(filereader, None)
filewriter.writerow(header)
file_counter += 1
for row_list in filereader:
filewriter.writerow(row_list)
其中 os.path。basename()
函数的作用是:加入 input_file = "E:\\CVSwork\\PythonDataAnalysisFoundation\\input_file.csv"
,os.path.basename(input_file)
的结果便是 input_file.csv ,即舍去前面的路径,返回文件的基本文件名。
参考文献:
《Python数据分析基础》