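Problem description:
Reading a file usually takes up a fair amount of memory. When memory is limited, how should we read and process the file?
Approach: read the whole file at once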
import tracemalloc

filepath = "test.txt"

# Start tracing memory allocations
tracemalloc.start()

def process_line(line):
    pass

with open(filepath, 'r') as f:
    lines = f.readlines()  # loads every line into memory at once
    for line in lines:
        process_line(line)

# Get the current and peak traced memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()
Output:
current memory usage: 5.999628067016602 MB
peak memory usage: 6.011771202087402MB
In the code above, we read the whole file into memory at once and then process each line.
Current memory usage: 5.99 MB
Peak memory usage: 6.01 MB
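The measurement steps used here (start tracing, run the code, read the current and peak values, stop tracing) can be wrapped in a small helper. The following is only a minimal sketch: `measure_memory` and `read_all_lines` are hypothetical names that do not appear in the original code, and the demo runs on a throwaway temporary file rather than on test.txt.

```python
import os
import tempfile
import tracemalloc


def measure_memory(func, *args, **kwargs):
    """Run func and return (result, current_bytes, peak_bytes) from tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, current, peak


def read_all_lines(path):
    # Same approach as above: load every line into a list at once.
    with open(path, "r") as f:
        return f.readlines()


# Demo on a temporary file so the sketch runs without test.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(10_000))
    demo_path = tmp.name

lines, current, peak = measure_memory(read_all_lines, demo_path)
print(f"current memory usage: {current / 1024**2:.3f} MB")
print(f"peak memory usage: {peak / 1024**2:.3f} MB")
os.remove(demo_path)
```

Because the list of lines is still referenced when the measurement is taken, the current value stays close to the peak, which matches the pattern in the numbers above.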
The file used here is small (576 KB). When a file is several GB or even tens of GB, far more than the system's memory, how should it be handled?
Solution 1: use an iterator
# Start tracing memory allocations
tracemalloc.start()

class LineIterator:
    def __init__(self, filepath) -> None:
        self.file = open(filepath, 'r')

    def __iter__(self):
        return self

    def __next__(self):
        line = self.file.readline()
        # The line could be processed or filtered here before it is returned
        if line:
            return line
        else:
            # An empty string means end of file: close it and stop iterating
            self.file.close()
            raise StopIteration

line_iter = LineIterator(filepath)
for line in line_iter:
    process_line(line)

# Get the current and peak traced memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()
Output:
current memory usage: 0.003922462463378906 MB
peak memory usage: 0.03148365020751953MB
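In the code above, we wrote our own iterator that reads the file one line at a time.
Current memory usage dropped from 5.99 MB to 0.0039 MB
Peak memory usage dropped from 6.01 MB to 0.031 MB
This greatly reduces memory usage.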
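The comment in `__next__` mentions that lines could be filtered before being returned. One possible way to do that is sketched below. `FilteredLineIterator` and its `predicate` parameter are hypothetical names introduced for illustration; the sketch also adds context-manager support, which is an assumption on top of the original class, so the file is closed even when the caller stops iterating early.

```python
import os
import tempfile


class FilteredLineIterator:
    """Return only the lines for which predicate(line) is true, one at a time."""

    def __init__(self, filepath, predicate):
        self.file = open(filepath, "r")
        self.predicate = predicate

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the file even if iteration was abandoned early.
        self.file.close()

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = self.file.readline()
            if not line:  # empty string means end of file
                self.file.close()
                raise StopIteration
            if self.predicate(line):
                return line


# Demo on a temporary file so the sketch runs without test.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("INFO start\nERROR disk full\nINFO done\nERROR timeout\n")
    demo_path = tmp.name

with FilteredLineIterator(demo_path, lambda line: line.startswith("ERROR")) as it:
    errors = [line.rstrip("\n") for line in it]

print(errors)
os.remove(demo_path)
```

Only one line is held at a time inside the iterator; the list comprehension in the demo collects results purely to make them easy to print.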
Solution 2: iterate over the file object directly
tracemalloc.start()

with open(filepath, 'r') as f:
    # A file object is itself an iterator that yields one line at a time
    for line in f:
        process_line(line)

# Get the current and peak traced memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()
Output:
current memory usage: 0.0012960433959960938 MB
peak memory usage: 0.021081924438476562MB
Current memory usage: 0.0012 MB
Peak memory usage: 0.021 MB
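Line-by-line iteration keeps memory low only as long as individual lines are short. As a variation that is not part of the original tests, a file with very long lines (or a binary file with no line breaks at all) could be read in fixed-size chunks instead. The sketch below assumes a hypothetical helper `read_in_chunks` and an arbitrary chunk size of 64 KiB.

```python
import os
import tempfile


def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            yield chunk


# Demo: a single 200,000-byte "line" with no newline characters.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 200_000)
    demo_path = tmp.name

sizes = [len(chunk) for chunk in read_in_chunks(demo_path)]
print(f"{len(sizes)} chunks, {sum(sizes)} bytes in total")
os.remove(demo_path)
```

With this approach the memory held at any moment is bounded by `chunk_size` rather than by the length of the longest line.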
Solution 3: use a generator
tracemalloc.start()

def generator(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line  # hand back one line at a time

gen = generator(filepath)
for line in gen:
    process_line(line)

# Get the current and peak traced memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()
Output:
current memory usage: 0.0008630752563476562 MB
peak memory usage: 0.021325111389160156MB
Current memory usage: 0.00086 MB
Peak memory usage: 0.021 MB
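Generators can also be chained, so that each processing step pulls one line at a time from the previous step and no intermediate list is built. The following is a minimal sketch of that idea; `read_lines`, `non_empty`, and `stripped` are hypothetical helper names, and the demo uses a temporary file rather than test.txt.

```python
import os
import tempfile


def read_lines(file_path):
    # Same idea as the generator above: yield one line at a time.
    with open(file_path, "r") as f:
        for line in f:
            yield line


def non_empty(lines):
    # Skip blank or whitespace-only lines without building a list.
    return (line for line in lines if line.strip())


def stripped(lines):
    # Remove the trailing newline from each line lazily.
    return (line.rstrip("\n") for line in lines)


# Demo on a temporary file so the sketch runs without test.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\n   \ngamma\n")
    demo_path = tmp.name

# list() is used only to show the result; on a large file,
# loop over the pipeline instead of materializing it.
result = list(stripped(non_empty(read_lines(demo_path))))
print(result)
os.remove(demo_path)
```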
Memory usage test when writing a file
import tracemalloc

file_path = "test.txt"

tracemalloc.start()

with open(file_path, "a") as f:
    # Append one line per iteration instead of building the whole content first
    for i in range(1000):
        f.write(str(i)+'\n')

current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()
Output with range(1000):
current memory usage: 0.001049041748046875 MB
peak memory usage: 0.0675973892211914MB
Output with range(100000):
current memory usage: 0.001049041748046875 MB
peak memory usage: 0.12307262420654297MB
Output with range(10000000):
current memory usage: 0.001049041748046875 MB
peak memory usage: 0.12307262420654297MB
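The peak stays flat as the number of lines grows, which is consistent with writes passing through the file object's buffer and being flushed to disk as the buffer fills, so the full content is never held in memory. For contrast, the sketch below compares this line-by-line style with building the entire content as one string before writing it. The function names (`write_streaming`, `write_joined`, `peak_bytes`) and the line count are assumptions chosen for illustration, and the exact numbers will vary by machine and Python version.

```python
import os
import tempfile
import tracemalloc


def write_streaming(path, n):
    # Write one line at a time, as in the test above.
    with open(path, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")


def write_joined(path, n):
    # Build the whole content in memory first, then write it in one call.
    with open(path, "w") as f:
        f.write("".join(str(i) + "\n" for i in range(n)))


def peak_bytes(func, *args):
    """Return the peak traced memory (in bytes) while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()


# Demo in a temporary directory so test.txt is left untouched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "out.txt")
n = 200_000

streaming_peak = peak_bytes(write_streaming, demo_path, n)
joined_peak = peak_bytes(write_joined, demo_path, n)
print(f"streaming peak: {streaming_peak / 1024**2:.3f} MB")
print(f"joined peak:    {joined_peak / 1024**2:.3f} MB")

os.remove(demo_path)
os.rmdir(demo_dir)
```

The joined version has to hold the full output string at once, so its peak grows with `n`, while the streaming version stays roughly constant.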
Reference:
https://www.bilibili.com/video/BV1jt421c7yN/?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click&vd_source=cf0b4c9c919d381324e8f3466e714d7a