Reading large files in Python

布丁的自我修养

Category: python. Tags: python, large files

Article link: https://blog.csdn.net/budding0828/article/details/90574393

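Recently, while processing data, I found that files of several hundred megabytes or even a few gigabytes cannot be opened in text editors such as Notepad++ or Sublime Text; the editor simply crashes. In a program, the usual approach of loading the whole file into memory at once does not work either. I looked up several approaches online and have collected them here.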

Multi-line data: read line by line

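The with statement takes care of opening and closing the file, even when an exception is raised inside the block. In for line in f, the file object f is treated as an iterator that uses buffered I/O and manages memory automatically, so only a small part of the file is in memory at any time and large files are not a problem.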

#If the file is line based
with open(...) as f:
    for line in f:
        process(line) # <do something with line>

Reference: https://chenqx.github.io/2014/10/29/Python-fastest-way-to-read-a-large-file/
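As a concrete illustration of the line-by-line pattern, here is a minimal sketch that counts the lines containing a keyword. The function name count_matching_lines, the sample log content, and the UTF-8 encoding are assumptions made for the example, not part of the original post.

```python
import os
import tempfile


def count_matching_lines(path, keyword, encoding="utf-8"):
    """Count lines containing keyword, holding only one line in memory at a time."""
    count = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # the file object yields one line per iteration
            if keyword in line:
                count += 1
    return count


# Build a small sample file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("INFO start\nERROR disk full\nINFO done\nERROR timeout\n")
    sample_path = tmp.name

error_count = count_matching_lines(sample_path, "ERROR")
print(error_count)
os.remove(sample_path)
```

The same loop works unchanged on a multi-gigabyte file, because the iterator never materializes more than the current line plus an internal buffer.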

Single-line or hard-to-split data: read in chunks

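For files that are hard to read line by line, you can use the approach below: read the large file in fixed-size chunks, process each chunk, and let its memory be released before reading the next one. Note that chunk_size should be tuned to your processing needs: a smaller chunk_size lets you get through the whole file, whereas one that is too large can still exhaust memory.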

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


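# Open in binary mode so chunks are raw bytes, and let "with" close the file.
with open('really_big_file.dat', 'rb') as f:
    for piece in read_in_chunks(f):
        process_data(piece)  # process_data is a placeholder for your own handler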

Reference: https://www.cnblogs.com/bonelee/p/8446421.html
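One thing to watch with chunked reads is that a logical record can straddle two chunks. The sketch below shows one possible way to handle that, assuming the data is a single long line of records separated by a delimiter. The helper name iter_records and the comma delimiter are hypothetical choices for illustration.

```python
import io


def iter_records(file_object, delimiter=b",", chunk_size=1024 * 1024):
    """Yield delimiter-separated records from a binary file, reading it in chunks.

    The unfinished tail of each chunk is kept in a buffer and prepended to the
    next chunk, so records that cross a chunk boundary are still yielded whole.
    """
    buffer = b""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(delimiter)
        buffer = parts.pop()  # the last part may be incomplete, so keep it for later
        for part in parts:
            yield part
    if buffer:
        yield buffer  # whatever remains after the final delimiter


# A tiny chunk_size forces records to be split across chunk boundaries.
sample = io.BytesIO(b"a,bb,ccc,dddd")
records = list(iter_records(sample, delimiter=b",", chunk_size=3))
print(records)  # [b'a', b'bb', b'ccc', b'dddd']
```

Memory use is bounded by chunk_size plus the length of the longest record, so this only helps when individual records are reasonably small.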

Using mmap

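mmap maps a file (or another object) into the process's virtual address space, establishing a one-to-one correspondence between offsets in the file on disk and a range of virtual addresses. The file can then be accessed like an in-memory byte array, with the operating system loading pages on demand.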

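import mmap

# mmap.mmap expects a file descriptor, not a file object; open in binary mode.
with open(path, "rb") as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)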

import mmap

# write a simple example file (binary mode requires bytes)
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

Introduction to mmap: https://docs.python.org/2/library/mmap.html
Usage guide: https://www.cnblogs.com/zhoujinyi/p/6062907.html
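For read-only work on a large file, a mapping can be searched without ever loading the whole file. The following is a minimal sketch using mmap.find with ACCESS_READ. The function name find_all and the sample content are assumptions for illustration. Note that mapping an empty file raises ValueError, so the sketch assumes a non-empty file.

```python
import mmap
import os
import tempfile


def find_all(path, needle):
    """Return the byte offsets of every occurrence of needle (bytes) in the file."""
    offsets = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets


# Build a small sample file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"spam eggs spam ham spam\n")
    sample_path = tmp.name

offsets = find_all(sample_path, b"spam")
print(offsets)  # [0, 10, 19]
os.remove(sample_path)
```

Because the operating system pages data in on demand, the search touches only the parts of the file it scans, and nothing is copied into a Python bytes object up front.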
