[Python] Processing Files with Iterators and Generators: Handling Large Files with Little Memory

pzx_001

Last modified: 2024-07-24 11:11:59

Tags: python linux server

First published: 2024-07-24 10:37:06

Article link: https://blog.csdn.net/qq_42761751/article/details/140655539

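Problem description:

Reading a file often takes a lot of memory. When memory is limited, how should we read and process a file?

Baseline: reading the whole file at once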

import tracemalloc
filepath = "test.txt"
# Start tracing memory allocations
tracemalloc.start()
def process_line(line):
    # Placeholder for the real per-line processing
    pass
with open(filepath, 'r') as f:
    lines = f.readlines()
for line in lines:
    process_line(line)

# Get the current and peak memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()

Output:

current memory usage: 5.999628067016602 MB
peak memory usage: 6.011771202087402MB

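In the code above, the whole file is read into memory at once, and then each line is processed.
Current memory usage: 5.99 MB
Peak memory usage: 6.01 MB

And this is only a small file (576 KB). When a file is several GB or even tens of GB, far more than the system's memory, how should it be handled?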
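To see how the memory cost of `readlines()` scales with file size, the measurement can be repeated on generated files of different lengths. The following is a minimal sketch, not the author's test setup: the helper names `make_sample_file` and `peak_of_readlines`, the line counts, and the file content are all assumptions made for illustration.

```python
import os
import tempfile
import tracemalloc


def make_sample_file(num_lines):
    """Create a throwaway text file with num_lines short lines (content is arbitrary)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for i in range(num_lines):
            f.write(f"line {i}\n")
    return path


def peak_of_readlines(path):
    """Return (line_count, peak_bytes) while readlines() holds the whole file in memory."""
    tracemalloc.start()
    try:
        with open(path, "r") as f:
            lines = f.readlines()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(lines), peak


if __name__ == "__main__":
    for num_lines in (10_000, 100_000):
        sample = make_sample_file(num_lines)
        try:
            count, peak = peak_of_readlines(sample)
        finally:
            os.remove(sample)
        print(f"{count} lines: peak {peak / 1024**2:.2f} MB")
```

The exact numbers depend on the Python version and platform, but the peak should grow roughly in proportion to the number of lines, because every line is kept alive as a separate string in the list.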

Solution 1: use an iterator

# Start tracing memory allocations
tracemalloc.start()
class LineIterator:
    def __init__(self, filepath) -> None:
        self.file = open(filepath, 'r')
    
    def __iter__(self):
        return self
    
    def __next__(self):
        line = self.file.readline()
        # The line can be processed or filtered here before it is returned
        if line:
            return line
        else:
            self.file.close()
            raise StopIteration
            
line_iter = LineIterator(filepath)

for line in line_iter:
    process_line(line)
# Get the current and peak memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()

Output:

current memory usage: 0.003922462463378906 MB
peak memory usage: 0.03148365020751953MB

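In the code above, we wrote our own iterator that reads the file one line at a time.
Current memory usage dropped from 5.99 MB to 0.0039 MB
Peak memory usage dropped from 6.01 MB to 0.031 MB
This greatly reduces memory usage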
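One limitation of a hand-written iterator like this is that the file is only closed when the last line has been read, so a loop that exits early leaves it open. A possible variation, sketched here under the assumption that context-manager support is wanted (the class name `ClosingLineIterator` is hypothetical and not from the original post), adds `__enter__` and `__exit__` so a `with` block closes the file in every case:

```python
import os
import tempfile


class ClosingLineIterator:
    """Line-by-line iterator that can also be used as a context manager."""

    def __init__(self, filepath):
        self.file = open(filepath, "r")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions raised inside the with block

    def __iter__(self):
        return self

    def __next__(self):
        if self.file.closed:
            # Once exhausted or closed, keep raising StopIteration
            raise StopIteration
        line = self.file.readline()
        if line:
            return line
        self.close()
        raise StopIteration

    def close(self):
        if not self.file.closed:
            self.file.close()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("first\nsecond\nthird\n")
    try:
        with ClosingLineIterator(path) as lines:
            for line in lines:
                if line.startswith("second"):
                    break  # early exit: the with block still closes the file
        print("closed after early exit:", lines.file.closed)
    finally:
        os.remove(path)
```

Memory behaviour is the same as in the version above, since only one line is held at a time; the change only affects when the file handle is released.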

Solution 2: iterate over the file object directly

tracemalloc.start()
with open(filepath, 'r') as f:
    for line in f:
        process_line(line)

# Get the current and peak memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()

Output:

current memory usage: 0.0012960433959960938 MB
peak memory usage: 0.021081924438476562MB

Current memory usage: 0.0012 MB
Peak memory usage: 0.021 MB

Solution 3: use a generator

tracemalloc.start()

def generator(file_path):
    with open(file_path,'r') as f:
        for line in f:
            yield line

gen = generator(filepath)
for line in gen:
    process_line(line)

current, peak = tracemalloc.get_traced_memory()  # Get the current and peak memory usage
print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")
tracemalloc.stop()

Output:

current memory usage: 0.0008630752563476562 MB
peak memory usage: 0.021325111389160156MB

Current memory usage: 0.00086 MB
Peak memory usage: 0.021 MB
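Generators also compose well: each processing step can be its own generator that consumes the previous one, so filtering and transformation still handle one line at a time. The sketch below illustrates the idea; the function names and the sample log content are assumptions made for this example, not part of the original post.

```python
import os
import tempfile


def read_lines(path):
    """Yield lines one at a time, as in the generator approach."""
    with open(path, "r") as f:
        for line in f:
            yield line


def strip_newlines(lines):
    """Remove the trailing newline from each line."""
    for line in lines:
        yield line.rstrip("\n")


def keep_matching(lines, keyword):
    """Pass through only the lines that contain keyword."""
    for line in lines:
        if keyword in line:
            yield line


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("INFO start\nERROR disk full\nINFO done\nERROR timeout\n")
    try:
        # Nothing is read until the for loop starts pulling lines through the chain
        pipeline = keep_matching(strip_newlines(read_lines(path)), "ERROR")
        for line in pipeline:
            print(line)
    finally:
        os.remove(path)
```

Because every stage is lazy, the pipeline never builds an intermediate list, no matter how many stages are chained.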

Memory usage test when writing to a file

import tracemalloc
file_path = "test.txt"

tracemalloc.start()
with open(file_path, "a") as f:
    for i in range(1000):
        f.write(str(i)+'\n')

current, peak = tracemalloc.get_traced_memory()

print(f"current memory usage: {current/1024**2} MB")
print(f"peak memory usage: {peak/1024**2}MB")

Output with range(1000):

current memory usage: 0.001049041748046875 MB
peak memory usage: 0.0675973892211914MB

Output with range(100000):

current memory usage: 0.001049041748046875 MB
peak memory usage: 0.12307262420654297MB

Output with range(10000000):

current memory usage: 0.001049041748046875 MB
peak memory usage: 0.12307262420654297MB
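The flat peak in these runs is consistent with how buffered file writes work: each `f.write` call hands over one short string, and the file object flushes its buffer to disk as it fills, so the data written so far does not accumulate in memory. For contrast, the sketch below compares that streaming style with building the entire text in memory before writing it. The function names and the line count are assumptions chosen for illustration, and the sketch opens files in "w" mode rather than the "a" mode used above.

```python
import os
import tempfile
import tracemalloc


def write_streaming(path, count):
    """Write one line at a time; only the current line and the I/O buffer are held."""
    with open(path, "w") as f:
        for i in range(count):
            f.write(str(i) + "\n")


def write_all_at_once(path, count):
    """Build the whole text in memory first, then write it in a single call."""
    text = "".join(str(i) + "\n" for i in range(count))
    with open(path, "w") as f:
        f.write(text)


def peak_bytes(func, path, count):
    """Return the peak traced memory, in bytes, while func(path, count) runs."""
    tracemalloc.start()
    try:
        func(path, count)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        for func in (write_streaming, write_all_at_once):
            peak = peak_bytes(func, path, 100_000)
            print(f"{func.__name__}: peak {peak / 1024**2:.3f} MB")
    finally:
        os.remove(path)
```

Both functions produce the same file; the difference is only in how much of it exists in memory at any one moment.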

Reference:
https://www.bilibili.com/video/BV1jt421c7yN/?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click&vd_source=cf0b4c9c919d381324e8f3466e714d7a
