Reading large files in Python with limited memory: MemoryError when reading a large file; do I need multiprocessing in this case?

I have a file which stores data in the format below:

TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]

TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]

TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]

TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]

TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]

The method below, which I found in an answer to this question, fulfills the need (please refer to that link for clarification):

import re
from itertools import groupby

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    return regex.match(arg)

def func2(arg) -> str:
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""

def func3(arg) -> int:
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0

with open(your_input_file) as fr:
    collection = filter(func1, fr)
    collection = sorted(collection, key=func2)
    collection = sorted(collection, key=func3)
    for key, group in groupby(collection, key=func3):
        with open(f"ROLL_{key}", mode="w") as fw:
            fw.writelines(group)
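The two consecutive `sorted` calls rely on Python's sort being stable: sorting by timestamp first and then re-sorting by roll number groups the rolls together while preserving each roll's timestamp order. A tiny illustration with made-up (time, roll) tuples:

```python
# Made-up (time, roll) tuples, just to illustrate stable sorting.
records = [("10:02", 2), ("10:01", 1), ("10:00", 2), ("10:03", 1)]

by_time = sorted(records, key=lambda r: r[0])  # secondary key first
by_roll = sorted(by_time, key=lambda r: r[1])  # stable: ties keep time order

# by_roll now lists roll 1's records in time order, then roll 2's.
```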

The above code creates the files as I want and sorts the file contents by timestamp, and I get correct output. But when I tried it on a large file of size 1.7 GB, it raised a MemoryError, so I tried the following method:

Failed attempt:

from functools import partial

with open('my_file.txt') as fr:
    part_read = partial(fr.read, 1024 * 1024)
    iterator = iter(part_read, b'')
    for index, fra in enumerate(iterator, start=1):
        collection = filter(func1, fra)
        collection = sorted(collection, key=func2)
        collection = sorted(collection, key=func3)
        for key, group in groupby(collection, key=func3):
            fw = open(f'ROLL_{key}.txt', 'a')
            fw.writelines(group)

This attempt didn't give me any results: no files were created at all, and it takes an unexpectedly long time. Many answers say to read the file line by line, but how would I then sort it? Please suggest improvements to this code, or any new idea. Do I need multiprocessing here to process it faster, and if so, how do I use it?

And one main constraint: I can't store the data in any in-memory data structure, since the file can be huge.
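Given that constraint, one common alternative to sorting the whole file at once is a two-pass approach: stream the input line by line (so only one line is ever held in memory), append each matching record straight to its per-roll file, and then sort each per-roll file separately, since each is far smaller than the input. The sketch below is a hedged illustration of that idea, not the original poster's code; `big_input.txt`, `split_by_roll`, and `sort_roll_file` are made-up names, and the sample data is written out first so the sketch is runnable:

```python
import re

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

# Sample data in the question's format, created here so the sketch runs.
with open("big_input.txt", "w") as f:
    f.write(
        "TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]\n"
        "TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]\n"
        "TIME[03.26_12:28:30.753664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]\n"
    )

def split_by_roll(src):
    """Pass 1: stream line by line; append each record to its roll's file."""
    handles = {}  # open per-roll file handles, keyed by roll number
    try:
        with open(src) as fr:
            for line in fr:  # iterating a file object reads one line at a time
                match = regex.match(line)
                if not match:
                    continue
                roll = match.group(2)
                if roll not in handles:
                    handles[roll] = open(f"ROLL_{roll}.txt", "w")
                handles[roll].write(line)
    finally:
        for fh in handles.values():
            fh.close()
    return list(handles)

def sort_roll_file(path):
    """Pass 2: each per-roll file is small, so it can be sorted in memory."""
    with open(path) as fr:
        lines = sorted(fr, key=lambda l: regex.match(l).group(1))
    with open(path, "w") as fw:
        fw.writelines(lines)

for roll in split_by_roll("big_input.txt"):
    sort_roll_file(f"ROLL_{roll}.txt")
```

This keeps memory usage proportional to the largest single roll's records rather than the whole file, at the cost of a second pass over the per-roll files.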

Solution

Your failed attempt never terminates because in text mode `fr.read` returns `str`, so the `b''` sentinel passed to `iter` never matches; also, `filter(func1, fra)` iterates over the characters of the chunk, not its lines. If you want to read the file in chunks, use this:

import re
from itertools import groupby
from typing import Tuple

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def func1(arg) -> bool:
    return bool(regex.match(arg))

def func2(arg) -> Tuple[str, int]:
    match = regex.match(arg)
    if match:
        return match.group(1), int(match.group(2))
    return "", 0

def func3(arg) -> int:
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # Yield fixed-size chunks until EOF.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('b.txt') as fr:
    for chunk in read_in_chunks(fr):
        # keepends=True keeps the newlines so writelines() reproduces them
        collection = filter(func1, chunk.splitlines(keepends=True))
        collection = sorted(collection, key=func2)
        for key, group in groupby(collection, key=func3):
            # mode "a" appends across chunks ("wa" is not a valid mode)
            with open(f"ROLL_{key}", mode="a") as fw:
                fw.writelines(group)
