Python basic file reading and writing, and 4 ways to read large files without running out of memory

绀目澄清

Last modified: 2023-03-12 20:18:49

First published: 2021-04-06 09:40:28

Article link: https://blog.csdn.net/u013628121/article/details/115454373

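Basic file reading and writing

Reading
path = './bookDownPageUrl.txt'
with open(path, 'r', encoding='utf-8') as f:
    # The calls below are alternatives. Each one starts at the current file position,
    # so once f.read() has consumed the file, the later calls return '' or [].
    text = f.read()  # read everything; returns one str
    text = f.read(6)  # read up to 6 characters, continuing from where the previous read stopped; if fewer than 6 remain, the rest is returned, and at end of file the result is the empty string ''
    text = f.readlines()  # read all lines into a list; each item keeps its newline
    lines = [line.strip() for line in f.readlines()]  # read all lines into a list without newlines (strip() also removes other leading and trailing whitespace)
    text = f.readlines(666)  # 666 is a size hint, not a line count: reading stops after the line that takes the total past about 666 characters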
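
To see how the read position carries over between calls, here is a minimal sketch. It writes a throwaway temporary file first; the file name `demo.txt` and its contents are made up for illustration only.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')
full_text = 'line one\nline two\nline three\n'
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write(full_text)

with open(demo_path, 'r', encoding='utf-8') as f:
    first = f.read(4)    # the first 4 characters
    second = f.read(4)   # continues where the previous read stopped
    rest = f.read()      # everything that is left
    at_eof = f.read(4)   # nothing left: the empty string

with open(demo_path, 'r', encoding='utf-8') as f:
    # The argument of readlines() is a size hint in characters, not a line count.
    batch = f.readlines(5)
    remaining = f.readlines()

print(repr(first), repr(second), repr(at_eof))
print(len(batch), 'line(s) in the first batch,', len(remaining), 'left')
```

Calling `readlines()` with a small hint in a loop until it returns an empty list is one way to walk through a file in batches.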



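Writing
save_path = './bookDownPageUrl.txt'
with open(save_path, 'w', encoding='utf-8') as f:  # 'w': an existing file is truncated, a missing file is created. 'a': append mode; a missing file is created and every write goes to the end of the file.
    f.write(text + '\n')
    f.writelines(['first line\n', 'second line\n', 'third line\n'])  # write every string in the list; writelines() does not add newlines, so include '\n' wherever a line should end
    f.flush()  # flush the file buffer so the data is handed to the operating system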

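mode: the file open mode

All text-mode combinations

r: read only. The file must exist.
w: write. Truncates an existing file; creates the file if it does not exist.
x: exclusive creation. Creates a new file for writing; raises FileExistsError if the file already exists.
a: append. Writes go to the end of the file; creates the file if it does not exist.
r+: read and write. The file must exist, and its content is kept.
w+: read and write. Truncates an existing file; creates the file if it does not exist.
x+: exclusive creation for reading and writing. Raises FileExistsError if the file already exists.
a+: append and read. Writes go to the end of the file; creates the file if it does not exist.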

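All binary-mode combinations

rb: binary read only. The file must exist.
wb: binary write. Truncates an existing file.
xb: binary exclusive creation. Creates a new file.
ab: binary append. Writes go to the end of the file.
rb+: binary read and write. The file must exist.
wb+: binary read and write. Truncates an existing file.
xb+: binary exclusive creation for reading and writing. Creates a new file.
ab+: binary append and read. Writes go to the end of the file.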

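What the mode letters mean

r: read only. The file must exist.
w: write. Truncates an existing file; creates the file if it does not exist.
x: exclusive creation. Creates a new file and opens it for writing; raises FileExistsError if the file already exists. "Exclusive" refers to creation only: open() fails rather than reuse or overwrite an existing file, which avoids the race condition between checking that a file exists and creating it. It does not lock the file, so other processes or threads can still open it for reading or writing.
a: append. Writes go to the end of the file; creates the file if it does not exist.
b: binary mode, for reading and writing bytes. It is combined with one of the other modes.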
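
The differences between the modes are easiest to see side by side. The following is a small sketch that runs against a temporary file; the file name `modes.txt` and the helper `read_all` are hypothetical names used only for this illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'modes.txt')


def read_all(path):
    """Return the whole file as one string."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# 'x' creates the file and fails if it already exists.
with open(demo_path, 'x', encoding='utf-8') as f:
    f.write('first\n')
try:
    with open(demo_path, 'x', encoding='utf-8') as f:
        pass
    x_failed = False
except FileExistsError:
    x_failed = True

# 'a' keeps the existing content and writes at the end.
with open(demo_path, 'a', encoding='utf-8') as f:
    f.write('second\n')
after_append = read_all(demo_path)

# 'r+' keeps the content; writing starts at position 0 and overwrites in place.
with open(demo_path, 'r+', encoding='utf-8') as f:
    f.write('FIRST')
after_r_plus = read_all(demo_path)

# 'w' truncates the file before writing.
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('third\n')
after_w = read_all(demo_path)

print(x_failed, repr(after_append), repr(after_r_plus), repr(after_w))
```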

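File buffer: buffering

with open(file_path, "w+", encoding="utf-8", buffering=1024) as f:
    f.write("hello\n")

For values greater than 1, buffering is the buffer size in bytes.

The default is -1, which lets Python choose the buffer size; the actual size depends on the system.

In text mode, line buffering is the default only for interactive files (those for which isatty() returns True, such as a terminal). Other text files are buffered in fixed-size chunks, like binary files. Passing buffering=1 selects line buffering, which is only available in text mode.

In binary mode, Python uses a fixed-size buffer by default. The size is chosen from the block size of the underlying device, falling back to io.DEFAULT_BUFFER_SIZE, and is typically 4096 or 8192 bytes. Once the data written exceeds the buffer size, the buffer is flushed and the data is written to the file.

Buffer size set to 0

No buffer is used and every write goes straight to the file. This is only allowed in binary mode; in text mode open() raises ValueError. It can hurt performance, because many small writes increase the load on the disk.

Buffer size set to another value

1. With a large buffer, data accumulates until the buffer is full and only then is written to the file.

2. When writing ends and the file is closed, whatever is left in the buffer is written to the file.

3. Calling flush() forces the buffered data to be written to the file and empties the buffer.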
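
The effect of the buffer can be observed by checking the file size before and after flush(). This is a minimal sketch in binary mode with an arbitrary 1024-byte buffer; the file name `buffered.bin` is hypothetical, and os.fstat() is used here only as a convenient way to ask the operating system for the current size.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'buffered.bin')

# Binary mode with a 1024-byte buffer (a size chosen only for this demo).
with open(demo_path, 'wb', buffering=1024) as f:
    f.write(b'hello')  # 5 bytes, well below the buffer size
    size_before_flush = os.fstat(f.fileno()).st_size
    f.flush()          # hand the buffered bytes to the operating system
    size_after_flush = os.fstat(f.fileno()).st_size

# Closing the file flushes anything that is still buffered.
size_after_close = os.path.getsize(demo_path)

print(size_before_flush, size_after_flush, size_after_close)
```

Until flush() is called or the buffer fills up, the 5 bytes live only in the Python process, which is why another program reading the file at that moment would not see them.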

50 MB buffer: 50 * 1024 * 1024 = 52428800 bytes
500 MB buffer: 500 * 1024 * 1024 = 524288000 bytes
1 GB buffer: 1024 * 1024 * 1024 = 1073741824 bytes
2 GB buffer: 2 * 1024 * 1024 * 1024 = 2147483648 bytes

# 50 MB buffer size, in bytes
buf_size_50M = 50 * 1024 * 1024

# 500 MB buffer size, in bytes
buf_size_500M = 500 * 1024 * 1024

# 1 GB buffer size, in bytes
buf_size_1GB = 1024 * 1024 * 1024

# 2 GB buffer size, in bytes
buf_size_2GB = 2 * 1024 * 1024 * 1024

File encoding: encoding

with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

ASCII: encoding="ascii"
UTF-8: encoding="utf-8"
UTF-8 with BOM: encoding="utf-8-sig"
GB2312: encoding="gb2312"
GBK: encoding="gbk"
Big5: encoding="big5"
UTF-16: encoding="utf-16"
UTF-16 LE: encoding="utf-16-le"
UTF-16 BE: encoding="utf-16-be"
ISO-8859-1: encoding="iso-8859-1"
ISO-8859-2: encoding="iso-8859-2"
ISO-8859-5: encoding="iso-8859-5"
ISO-8859-15: encoding="iso-8859-15"
Shift-JIS: encoding="shift_jis"
EUC-JP: encoding="euc-jp"
ISO-2022-JP: encoding="iso-2022-jp"
KOI8-R: encoding="koi8-r"
Windows-1250: encoding="windows-1250"
Windows-1251: encoding="windows-1251"
Windows-1252: encoding="windows-1252"
Windows-1253: encoding="windows-1253"
Windows-1254: encoding="windows-1254"
Windows-1255: encoding="windows-1255"
Windows-1256: encoding="windows-1256"
Windows-1257: encoding="windows-1257"
Windows-1258: encoding="windows-1258"
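
As an illustration of why the choice between two similar entries in this list matters, the sketch below writes a file as "utf-8-sig" and reads it back both ways. The file name `bom.txt` is a hypothetical name used only for this demo.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'bom.txt')

# "utf-8-sig" writes a byte order mark (BOM) in front of the text.
with open(demo_path, 'w', encoding='utf-8-sig') as f:
    f.write('hello')

with open(demo_path, 'rb') as f:
    raw = f.read()  # the raw bytes, including the 3-byte BOM

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the text.
with open(demo_path, 'r', encoding='utf-8') as f:
    as_utf8 = f.read()

# "utf-8-sig" strips the BOM when reading.
with open(demo_path, 'r', encoding='utf-8-sig') as f:
    as_utf8_sig = f.read()

print(raw, repr(as_utf8), repr(as_utf8_sig))
```

A stray U+FEFF at the start of the first line is a common sign that a file with a BOM was read as plain "utf-8".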

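Writing a file without the with statement

file_path = 'example.txt'
mode = 'w'
encoding = 'utf-8'

# Open the file
f = open(file_path, mode, encoding=encoding)
try:
    # Write the data
    f.write('hello, world!\n')
finally:
    # Close the file, even if the write raised an exception
    f.close()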

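Note: gb2312 can produce garbled text

Opening a file with open() and encoding='gb2312' sometimes produces garbled text or decoding errors.
This can happen even when the txt file is nominally GB2312.
The file may really be GBK or GB18030 but was identified as GB2312 by chardet, or the name GB2312 may have been used loosely for GBK content. Python's gb2312 codec covers only the GB2312 character set, so characters outside it fail to decode.
Any of these causes is possible.
When GB2312 gives garbled text, try GB18030. GB18030 is a superset of GBK, which in turn extends GB2312, so this often solves the problem.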
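
A small sketch of the situation described above, assuming a file whose bytes are really GBK. The sample string is made up for the demo; its second character, '體', is a traditional form that GBK contains and GB2312 does not.

```python
# Sample text chosen for this demo: '體' is covered by GBK but not by GB2312.
text = '身體'
raw = text.encode('gbk')  # bytes as they would sit in a GBK-encoded file

# Decoding with the narrower gb2312 codec fails on the second character.
try:
    raw.decode('gb2312')
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# gb18030 is a superset of GBK, so the same bytes decode cleanly.
decoded = raw.decode('gb18030')

print(gb2312_ok, decoded == text)
```

With errors='ignore' the failing bytes would be dropped silently instead of raising, which is one way garbled or incomplete text ends up in the output.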

import os

from chardet.universaldetector import UniversalDetector


def detect_encoding(filepath):
    """Detect the encoding of a file.
    Args:
        filepath: path of the file
    Return:
        file_encoding: detected encoding, or 'unknown'
        confidence: confidence of the detection, as a percentage
    """
    detector = UniversalDetector()
    detector.reset()
    # Feed the file line by line and stop as soon as chardet has decided
    with open(filepath, 'rb') as f:
        for each in f:
            detector.feed(each)
            if detector.done:
                break
    detector.close()
    file_encoding = detector.result['encoding']
    confidence = detector.result['confidence']
    if file_encoding is None:
        file_encoding = 'unknown'
        confidence = 0.0
    return file_encoding, confidence * 100


if __name__ == '__main__':
    target_encoding = 'utf-16'
    rootdir = r'D:\source'
    outdir = r'D:\result'
    os.makedirs(outdir, exist_ok=True)
    for (dirpath, _, filenames) in os.walk(rootdir):
        for filename in filenames:
            # Join with dirpath so that files in subdirectories are found too
            filepath = os.path.join(dirpath, filename)
            file_encoding, confidence = detect_encoding(filepath)

            # confidence is a percentage, so the threshold is 75, not 0.75
            if file_encoding != 'unknown' and confidence > 75:

                # Decode GB2312 as its superset GB18030 to avoid garbled text
                if file_encoding.upper() == 'GB2312':
                    file_encoding = 'GB18030'

                # errors='ignore' silently drops bytes that cannot be decoded
                with open(filepath, 'r', encoding=file_encoding, errors='ignore') as f:
                    text = f.read()

                outpath = os.path.join(outdir, filename)
                with open(outpath, 'w', encoding=target_encoding, errors='ignore') as f:
                    f.write(text)

                print(
                    f'[+] Converted: {filename}({file_encoding}) -> {outpath}({target_encoding}) [+]')

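4 ways to read a large file without running out of memory

No.1 Loop over the file, reading one line at a time

import time

path = './bookDownPageUrl.txt'
buffering = 1073741824  # 1 GB buffer, in bytes. Optional: the buffer itself can take up to this much memory, and the default size is usually enough

# Reading line by line handles a large file without running out of memory
with open(path, 'r', buffering=buffering, encoding='utf-8') as f:
    for line in f:
        print(line, end="")
        time.sleep(0.1)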

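No.2 Loop over the file, reading a fixed number of characters each time (100 here)


import time


# Read 100 characters at a time
chunk = 100
with open(path, 'r', buffering=52428800, encoding='utf-8') as f:  # 50 MB buffer, in bytes
    # Loop until the whole file has been read
    while True:
        # Read up to chunk characters
        content = f.read(chunk)
        # Stop condition: at end of file read() returns the empty string,
        # which is falsy, so "not content" is True
        if not content:
            # Nothing left to read, leave the loop
            break
        # Show what was read
        print(content, end="")
        time.sleep(1)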
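
The same fixed-size read loop can be wrapped in a generator so that callers simply iterate over blocks. This is a sketch under a few assumptions: the function name `read_in_chunks`, the file name `big.bin` and the sizes are all hypothetical, and the file is opened in binary mode, so the blocks are bytes rather than str.

```python
import os
import tempfile
from functools import partial


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes."""
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b'', which marks the end of the file.
        for block in iter(partial(f.read, chunk_size), b''):
            yield block


# Demo on a small temporary file of 2500 bytes.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'big.bin')
with open(demo_path, 'wb') as f:
    f.write(b'x' * 2500)

sizes = [len(block) for block in read_in_chunks(demo_path, chunk_size=1000)]
print(sizes)
```

Only one block is held in memory at a time. Because a multi-byte character can be split across two blocks in binary mode, text should be decoded with care or read in text mode as in the loop above.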

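No.3 Use the mmap module

import mmap
import time


def get_lines(path='./a.txt'):
    # Map the file read-only; mmap raises ValueError for an empty file
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # readline() returns b'' at the end of the map, which ends the loop;
            # a last line without a trailing newline is still returned
            for line in iter(m.readline, b''):
                yield line.decode('utf-8')


for line in get_lines(path):
    print(line, end="")
    time.sleep(0.1)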

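No.4 Use the yield keyword

import time


def get_lines(path):
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            # Read a batch of lines of roughly 5000 characters (a size hint, not a line count)
            data = f.readlines(5000)
            if not data:  # empty list: end of file
                break
            yield from data


for line in get_lines(path):
    print(line, end="")
    time.sleep(0.08)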

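300 Python interview questions from kenwoodjw

1. Given a file file.txt in JSON Lines format, about 10 KB in size, we process it like this:

def get_lines():
    l = []
    with open('file.txt', 'rb') as f:
        for eachline in f:
            l.append(eachline)
    return l

if __name__ == '__main__':
    for e in get_lines():
        process(e)  # process each line of data

Now file.txt is 10 GB, but only 4 GB of memory is available. If only the get_lines function may be modified and the rest of the code must stay unchanged, how should it be implemented? What needs to be considered?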

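def get_lines():
    with open('file.txt', 'rb') as f:
        while True:
            # Read a batch of lines totalling roughly 60000 bytes
            data = f.readlines(60000)
            if not data:  # empty list: end of file
                break
            # Yield single lines, so the unchanged caller still gets one line per iteration
            yield from data

Things to consider: with only 4 GB of memory the 10 GB file cannot be read in one go, so it has to be read in batches. The open file object keeps track of the read position between batches, which is why the file must stay open while the generator is running. Batches that are too small spend too much time on read calls.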
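
Since the question mentions the JSON Lines format, here is a sketch of the whole pipeline on a tiny temporary file. It is an illustration under assumptions, not the reference answer: `get_lines` takes a path argument here for convenience, `process` is a hypothetical stand-in that only parses the line, and `sample.jsonl` is a made-up file name.

```python
import json
import os
import tempfile


def get_lines(path):
    """Yield the file one line at a time; only one line is in memory at once."""
    with open(path, 'rb') as f:
        for line in f:
            yield line


def process(raw_line):
    """Hypothetical per-line handler: parse one JSON Lines record."""
    return json.loads(raw_line)


# Build a small JSON Lines file for the demo.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'sample.jsonl')
with open(demo_path, 'w', encoding='utf-8') as f:
    for i in range(3):
        f.write(json.dumps({'id': i}) + '\n')

ids = [process(e)['id'] for e in get_lines(demo_path)]
print(ids)
```

Iterating over the file object directly already reads through an internal buffer, so it is a reasonable default; explicit batches with readlines(hint) mainly help when the per-line overhead of the consumer matters.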

def get_lines(path='./a.txt', hint=999):
    with open(path, 'rb') as f:
        while True:
            # hint is a size in bytes, not a number of lines
            data = f.readlines(hint)
            if not data:  # empty list: end of file
                break
            yield data  # one batch (a list of lines) at a time


for lines in get_lines('./Data/小说合集1.txt'):
    for line in lines:
        print(line)


# Method 2

import mmap


def get_lines(path='./a.txt'):
    # Map the file read-only and return it line by line
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for line in iter(m.readline, b''):
                yield line.decode('utf-8')


for line in get_lines('./Data/小说合集1.txt'):
    print(line)