Automatically processing multiple txt files in Python: parallel processing of large text files

weixin_39886929

Sample records in the data file (SAM file):

M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0

M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0

......

The data files are line-based (one record per line).

The data files vary in size from 1 GB to 5 GB.

I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Some result counters are then updated.

The basic workflow is like this:

    # Global variables; the counters are updated in search_function according to the value passed.
    counter_a = 0
    counter_b = 0
    counter_c = 0

    open textfile:
        for line in textfile:
            value = line.split()[3]
            search_function(value)  # this function takes a fairly long time to process

    def search_function(value):
        some condition checking:
            update counter_a or counter_b or counter_c

With single-process code and a data file of about 1.5 GB, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.

I was thinking of processing the data file in N chunks in parallel, where each chunk performs the above workflow and updates the global variables (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether this will work.

I have access to a server machine with 24 processors and around 40 GB of RAM.

Could anyone help with this? Thanks very much.

Solution

The simplest approach would probably be to run your existing code on all 30 files at once. Each file would still take all day, but all the files would finish together (i.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard).
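As a rough sketch of that "all files at once" idea, the script below gives each file to its own worker process and collects one set of counters per file. The `classify` function is a hypothetical placeholder for the real slow check, and returning a `Counter` from each worker instead of updating globals is an assumption of this sketch, not something from the question.

```python
import multiprocessing
import sys
from collections import Counter


def classify(value):
    # HYPOTHETICAL placeholder for the real (slow) per-value check.
    # It returns the name of the counter that should be incremented.
    return "counter_a" if value % 2 == 0 else "counter_b"


def process_file(path):
    """Run the original single-process loop over one file and return its counters."""
    counts = Counter()
    with open(path) as textfile:
        for line in textfile:
            fields = line.split()
            if len(fields) < 4 or line.startswith("@"):
                continue  # skip blank/short lines and SAM header lines
            counts[classify(int(fields[3]))] += 1
    return path, counts


if __name__ == "__main__":
    paths = sys.argv[1:]  # file names given on the command line
    if paths:
        # One worker per file, capped at the number of CPUs.
        n_workers = min(len(paths), multiprocessing.cpu_count())
        with multiprocessing.Pool(n_workers) as pool:
            for path, counts in pool.imap_unordered(process_file, paths):
                print(path, dict(counts))
```

Each worker keeps its own counters, so nothing is shared between processes and no locking is needed. With more files than processors, the remaining files simply start as workers become free.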

If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing the value, you can offload that using the multiprocessing module:

    import time
    import multiprocessing

    def slowfunc(value):
        # Stand-in for the expensive per-value analysis; runs in a worker process.
        time.sleep(0.01)
        return value**2 + 0.3*value + 1

    counter_a = counter_b = counter_c = 0

    def add_to_counter(res):
        # Runs only in the parent process, so plain globals are safe here.
        global counter_a, counter_b, counter_c
        counter_a += res
        counter_b -= (res - 10)**2
        counter_c += (int(res) % 2)

    if __name__ == "__main__":
        pool = multiprocessing.Pool(50)
        results = []
        for value in range(100000):
            r = pool.apply_async(slowfunc, [value])
            results.append(r)
            # don't let the queue grow too long
            if len(results) == 1000:
                results[0].wait()
            while results and results[0].ready():
                r = results.pop(0)
                add_to_counter(r.get())
        # collect whatever is still outstanding
        for r in results:
            r.wait()
            add_to_counter(r.get())
        pool.close()
        pool.join()
        print(counter_a, counter_b, counter_c)

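That will allow 50 slowfuncs to run in parallel, so instead of taking 1000 s (= 100k × 0.01 s), it takes about 20 s (= (100k / 50) × 0.01 s) to complete. The toy slowfunc only sleeps, which is why 50 workers help here; real CPU-bound work is limited by the number of processors. If you can restructure your function into "slowfunc" and "add_to_counter" parts like the above, you should be able to get up to a factor of 24 speedup on your 24-processor machine.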
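Applied to the SAM file from the question, the same split might look like the sketch below. It swaps the `apply_async` bookkeeping for `Pool.imap`, which returns results in input order as they become available. The body of `slowfunc` and the counter names "a", "b", "c" are hypothetical stand-ins, and the pool size of 24 just mirrors the 24 processors mentioned in the question.

```python
import multiprocessing
import sys


def slowfunc(value):
    # HYPOTHETICAL stand-in for the expensive analysis of one value.
    # It only computes a result; it never touches the counters.
    return "abc"[value % 3]


def read_values(path):
    """Yield the 4th whitespace-separated field of each record as an int."""
    with open(path) as textfile:
        for line in textfile:
            fields = line.split()
            if len(fields) >= 4 and not line.startswith("@"):
                yield int(fields[3])


def count_results(results):
    """The "add_to_counter" part: runs in the parent process only."""
    counters = {"a": 0, "b": 0, "c": 0}
    for res in results:
        counters[res] += 1
    return counters


if __name__ == "__main__" and len(sys.argv) > 1:
    # 24 workers is an assumption taken from the machine described in the question.
    with multiprocessing.Pool(24) as pool:
        # chunksize sends values to the workers in batches of 1000,
        # which cuts down on inter-process communication per value.
        results = pool.imap(slowfunc, read_values(sys.argv[1]), chunksize=1000)
        print(count_results(results))
```

Because only the parent process updates the counters, there is no need for shared memory or locks. This only pays off if the time is really spent inside `slowfunc` rather than in reading the file.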
