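We rarely need to read files that are one or several gigabytes in size. But a 500 GB file cannot be loaded with a single f.read() call, because that would exhaust memory. Any file larger than available memory has to be read from disk in pieces, and this is where generators become especially useful. Here is the code; it is very simple.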
A generator that reads line by line:
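def read_file(file):
    with open(file, mode='r', encoding='utf8') as f:
        while True:
            one_line = f.readline()
            if not one_line:  # readline() returns '' only at end of file
                return
            # Strip after the end-of-file check so a blank line does not stop the loop
            yield one_line.strip()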
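Line-based reading assumes the file contains newlines at reasonable intervals. For a file that is one enormous line, or for binary data, a similar generator can hand back fixed-size chunks instead. The following is a minimal sketch and not part of the original code; the names read_in_chunks and count_bytes and the 1 MiB default for chunk_size are assumptions chosen for illustration.

```python
def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield successive chunks of at most chunk_size bytes from a file."""
    with open(path, mode='rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # read() returns b'' only at end of file
                return
            yield chunk


def count_bytes(path, chunk_size=1024 * 1024):
    """Add up the file size while holding only one chunk in memory at a time."""
    return sum(len(chunk) for chunk in read_in_chunks(path, chunk_size))
```

Only one chunk is alive at any moment, so peak memory is bounded by chunk_size rather than by the size of the file.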
Reading a CSV file line by line
Part of a file named student_info.csv looks like this:
Name,Date,English,Math,Chinese,Money,Other_1,Other_2,Other_3
XiaoMing,2020-07-20T01:07:00Z,42,93,0.45,3077,5739,0.54,1
XiaoHu,2020-07-20T01:07:31Z,320,852,0.38,18874,37143,0.51,1
XiaoWang,2020-07-20T01:07:48Z,38,118,0.32,3581,34875,0.1,1
XiaoYe,2020-07-20T01:12:48Z,312,477,0.65,3210,4935,0.65,1
XiaoAn,2020-07-20T01:14:04Z,163,263,0.62,2152,4117,0.52,1
XiaoChen,2020-07-20T01:17:30Z,8,10,0.8,423,777,0.54,0.98
XiaoPeng,2020-07-20T01:17:44Z,5,9,0.56,1053,1398,0.75,1
XiaoHong,2020-07-20T01:19:33Z,392,797,0.49,8969,15366,0.58,1
XiaoNing,2020-07-20T01:20:59Z,24,41,0.59,1387,2677,0.52,1
XiaoJing,2020-07-20T01:23:22Z,16,53,0.3,696,1378,0.51,1
XiaoChong,2020-07-20T01:23:45Z,76,111,0.68,2127,2713,0.78,1
XiaoMing,2020-07-20T01:24:01Z,52,135,0.39,3251,6695,0.49,0.99
Now use the generator to read the file line by line and print the first 6 columns of the first 5 rows:
import csv
from collections import namedtuple

def read_file(file):
    with open(file, mode='r', encoding='utf8') as f:
        while True:
            one_line = f.readline()
            if not one_line:  # readline() returns '' only at end of file
                return
            yield one_line.strip()

lines = read_file("student_info.csv")  # lines is a generator
csv_reader = csv.reader(lines)  # csv.reader accepts any iterable of strings
header = next(csv_reader)[:6]  # use only the first 6 columns
print(header)
# Student = namedtuple("Student", header)
Student = namedtuple("Student", "Name Date English Math Chinese Money")
for index, row in enumerate(csv_reader):
    student = Student._make(row[:6])  # build a record from the first 6 columns
    print(row[:4])  # print the first 4 columns as a list
    print(student.Name, student.Date, student.English, student.Math, student.Chinese, student.Money)
    if index == 4:  # stop after the first 5 data rows
        break
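The loop above stops with an index check. itertools.islice states "the first N rows" more directly, and the fields can be converted from strings to numbers as they are read. What follows is a sketch under the assumption that the file has the column layout shown earlier; parse_row and first_students are hypothetical names, and instead of going through read_file it passes the open file straight to csv.reader, opened with newline='' as the csv module documentation recommends.

```python
import csv
from collections import namedtuple
from itertools import islice

Student = namedtuple("Student", "Name Date English Math Chinese Money")


def parse_row(row):
    """Build a Student from the first 6 fields, converting the numeric ones."""
    name, date, english, math_score, chinese, money = row[:6]
    return Student(name, date, int(english), int(math_score),
                   float(chinese), int(money))


def first_students(path, limit=5):
    """Lazily yield at most `limit` Student records from a CSV file."""
    with open(path, mode='r', encoding='utf8', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in islice(reader, limit):
            yield parse_row(row)
```

Calling first_students("student_info.csv") returns a generator, so rows are read from disk only as the caller iterates, and reading stops once the limit is reached.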