I've been studying iterators and generators recently, so here are a few notes on scripts for reading large files.
The file being read is 1.16 GB.
# -*- coding: utf-8 -*-
"""
读取一个大文件,简单统计有多少行记录
@Time : 2022/5/29 16:04
@Auth : Eve
@File :read_big_data.py
@IDE :PyCharm
"""
import time
t1 = time.time()
cnt = 0
with open(r'D:\frequencepy\itertools\thisisbigdata.txt', encoding='utf-8') as f:
    for line in iter(f.readline, '\n'):
        # This check is required: at EOF readline() returns '' (not '\n'),
        # so without it iter() would keep yielding empty strings forever
        if len(line) == 0:
            break
        cnt += 1
print(cnt)
t2 = time.time()
print('Done, elapsed {}'.format(t2 - t1))
Output:
4279200
Done, elapsed 16.499059915542603
Compared with slurping the whole file in at once via readlines(), this runs somewhat faster. I haven't compared memory consumption in any depth yet; I'll add that later.
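For reference, a cleaner variant of the sentinel trick: readline() returns '' only at EOF, so passing '' as the sentinel lets iter() stop on its own, with no extra break, and it also avoids stopping early at the first blank line the way the '\n' sentinel does (same placeholder path as above):

cnt = 0
with open(r'D:\frequencepy\itertools\thisisbigdata.txt', encoding='utf-8') as f:
    # '' is readline()'s EOF value, so iteration ends exactly at end of file
    for line in iter(f.readline, ''):
        cnt += 1
print(cnt)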
An addendum from another blog: using a generator to read a large file, which is even more memory-friendly:
def read_file(file):
    with open(file, mode='r', encoding='utf8') as f:
        while True:
            one_line = f.readline()
            # readline() returns '' only at EOF; test before stripping,
            # otherwise a blank line mid-file would end the generator early
            if not one_line:
                return
            yield one_line.strip()
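A minimal usage sketch (the path is the same placeholder as above): because read_file() is a generator, only one line is held in memory at a time:

cnt = 0
for line in read_file(r'D:\frequencepy\itertools\thisisbigdata.txt'):
    cnt += 1
print(cnt)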
One more approach, seen in a Bilibili video. It also handles the case where a huge file sits entirely on a single line, since the record separator can be set by hand:
# -*- coding: utf-8 -*-
"""
另一种写法,使用python读取大文件
@Time : 2022/6/4 16:34
@Auth : Eve
@File :bigdata.py
@IDE :PyCharm
"""
import time
def myreadline(f, newline):
    buf = ""
    while True:
        # Emit every complete record currently sitting in the buffer
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        # Refill the buffer; read() returns '' at EOF
        chunk = f.read(4096 * 10)
        if not chunk:
            # Flush the leftovers; the guard avoids yielding an empty
            # trailing record when the file ends with the separator
            if buf:
                yield buf
            break
        buf += chunk
t1 = time.time()
with open(r"D:\frequencepy\itertools\thisisbigdata.txt", encoding='utf-8') as f:
    cnt = 0
    for line in myreadline(f, "\n"):
        # print(line)
        cnt += 1
t2 = time.time()
print(cnt)
print(t2 - t1)
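To illustrate the custom-separator point, a self-contained sketch using io.StringIO in place of a real file, splitting a single "line" on "|" (the data and separator here are made up for the demo):

import io

one_long_line = io.StringIO("a|b|c|d")
for record in myreadline(one_long_line, "|"):
    print(record)  # prints a, b, c, d on separate lines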