Automatically processing multiple txt files in Python: parallel processing of large text files


Sample records in the data file (SAM file):

M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0

M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0

......

The data files are line-based (one record per line).

The sizes of the data files vary from 1 GB to 5 GB.

I need to go through the records in the data file line by line, get a particular value (e.g. the 4th field, 66439384) from each line, and pass this value to another function for processing. Some result counters are then updated.

The basic workflow is like this:

# global variables; counters are updated in search_function according to the value passed
counter_a = 0
counter_b = 0
counter_c = 0

with open(textfile) as f:
    for line in f:
        value = line.split()[3]
        search_function(value)  # this function takes quite a long time to process

def search_function(value):
    # some condition checks:
    #     update counter_a, counter_b, or counter_c

With single-process code and a roughly 1.5 GB data file, it took about 20 hours to run through all the records in one file. I need much faster code because there are more than 30 data files of this kind.

I was thinking of processing the data file in N chunks in parallel, where each chunk performs the workflow above and updates the global counters (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it will work.
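For reference, one way such a chunked approach could look is sketched below. This is only an illustration, not an accepted solution: process_chunk, the filename data.sam, and the modulo conditions are hypothetical stand-ins, and each worker keeps its own local counters that the parent process sums afterwards, since ordinary globals are not shared between processes.

import multiprocessing
from itertools import islice

def process_chunk(lines):
    # hypothetical per-chunk worker: keeps local counters and returns
    # them, because module-level globals are not shared between processes
    a = b = c = 0
    for line in lines:
        value = int(line.split()[3])
        # placeholder conditions standing in for the real search logic
        if value % 3 == 0:
            a += 1
        elif value % 3 == 1:
            b += 1
        else:
            c += 1
    return a, b, c

def line_chunks(f, size=100000):
    # yield successive lists of `size` lines from an open file
    while True:
        block = list(islice(f, size))
        if not block:
            break
        yield block

if __name__ == "__main__":
    counter_a = counter_b = counter_c = 0
    with open("data.sam") as f, multiprocessing.Pool(24) as pool:
        for a, b, c in pool.imap_unordered(process_chunk, line_chunks(f)):
            counter_a += a
            counter_b += b
            counter_c += c
    print(counter_a, counter_b, counter_c)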

I have access to a server machine with 24 processors and around 40 GB of RAM.

Could anyone help with this? Thanks very much.

Solution

The simplest approach would probably be to do all 30 files at once with your existing code -- it would still take all day, but you'd have all the files done at the same time. (i.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard.)
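As an illustration of that file-level parallelism, here is a minimal sketch; process_one_file is a hypothetical wrapper around the existing single-file workflow, and the *.sam glob pattern is an assumption:

import glob
import multiprocessing

def process_one_file(path):
    # hypothetical wrapper around the existing single-file workflow;
    # each worker returns its counters instead of updating globals
    counter_a = counter_b = counter_c = 0
    with open(path) as f:
        for line in f:
            value = line.split()[3]
            # ... run the existing condition checks on `value` here,
            #     updating the local counters ...
    return path, counter_a, counter_b, counter_c

if __name__ == "__main__":
    # one task per data file, spread over the 24 available processors
    with multiprocessing.Pool(24) as pool:
        for path, a, b, c in pool.imap_unordered(process_one_file,
                                                 glob.glob("*.sam")):
            print(path, a, b, c)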

If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing the value, you can offload that using the multiprocessing module:

import time
import multiprocessing

def slowfunc(value):
    # stand-in for the expensive per-value analysis
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0

def add_to_counter(res):
    # runs only in the main process, so updating the globals is safe
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)

results = []
for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()
        while results and results[0].ready():
            r = results.pop(0)
            add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print(counter_a, counter_b, counter_c)

That will allow 50 slowfuncs to run in parallel, so instead of taking 1000 s (100,000 * 0.01 s), it takes about 20 s ((100,000 / 50) * 0.01 s) to complete. If you can restructure your function into a "slowfunc" and an "add_to_counter" like the above, you should be able to get a factor-of-24 speedup.
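Applied to the original problem, that split could look something like the sketch below. It assumes the expensive part is classifying each value; expensive_search and the category labels are made-up placeholders for the real condition checks:

import time

def expensive_search(value):
    # placeholder for the real, slow condition checks on one value
    time.sleep(0.01)
    return "a" if value % 2 == 0 else "b"

def slowfunc(value):
    # runs in the worker processes: do all the heavy work here and
    # return a small, picklable result
    return expensive_search(value)

counter_a = counter_b = counter_c = 0

def add_to_counter(category):
    # runs in the main process: only cheap counter updates
    global counter_a, counter_b, counter_c
    if category == "a":
        counter_a += 1
    elif category == "b":
        counter_b += 1
    else:
        counter_c += 1

The feeding and draining loop from the example above can stay as it is; only the two functions change, and the values fed in would come from line.split()[3] instead of range(100000).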
