python创建excel重复写入,python multiprocessing：写入相同的excel文件

最新推荐文章于 2023-03-01 16:50:55 发布

梁浩赞

最新推荐文章于 2023-03-01 16:50:55 发布

阅读量400

点赞数

文章标签： python创建excel重复写入

I am new to Python and I am trying to save the results of five different processes to one excel file (each process write to a different sheet). I have read different posts here, but still can't get it done as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.

This is my code so far:

list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]

results = pd.DataFrame()

if __name__ == '__main__':

global list_of_days

writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')

nr_of_cores = multiprocessing.cpu_count()

l = multiprocessing.Lock()

pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))

pool.map(f, range(len(list_of_days)))

pool.close()

pool.join()

def init(l):

global lock

lock = l

def f(k):

global results

*** DO SOME STUFF HERE***

results = results[ *** finished pandas dataframe *** ]

lock.acquire()

results.to_excel(writer, sheet_name=list_of_days[k])

writer.save()

lock.release()

The result is that only one sheet gets created in excel (I assume it is the process finishing last). Some questions about this code:

How to avoid defining global variables?

Is it even possible to pass around dataframes?

Should I move the locking to main instead?

Really appreciate some input here, as I consider mastering multiprocessing as instrumental. Thanks

解决方案1) Why did you implement time.sleep in several places in your 2nd method?

In __main__, time.sleep(0.1), to give the started process a timeslice to startup.

In f2(fq, q), to give the queue a timeslice to flushed all buffered data to the pipe and

as q.get_nowait() are used.

In w(q), are only for testing simulating long run of writer.to_excel(...),

i removed this one.

2) What is the difference between pool.map and pool = [mp.Process( . )]?

Using pool.map needs no Queue, no parameter passed, shorter code.

The worker_process have to return immediately the result and terminates.

pool.map starts a new process as long as all iteration are done.

The results have to be processed after that.

Using pool = [mp.Process( . )], starts n processes.

A process terminates on queue.Empty

Can you think of a situation where you would prefer one method over the other?

Methode 1: Quick setup, serialized, only interested in the result to continue.

Methode 2: If you want to do all workload parallel.

You could't use global writer in processes.

The writer instance has to belong to one process.

Usage of mp.Pool, for instance:

def f1(k):

# *** DO SOME STUFF HERE***

results = pd.DataFrame(df_)

return results

if __name__ == '__main__':

pool = mp.Pool()

results = pool.map(f1, range(len(list_of_days)))

writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')

for k, result in enumerate(results):

result.to_excel(writer, sheet_name=list_of_days[k])

writer.save()

pool.close()

This leads to .to_excel(...) are called in sequence in the __main__ process.

If you want parallel .to_excel(...) you have to use mp.Queue().

For instance:

The worker process:

# mp.Queue exeptions have to load from

try:

# Python3

import queue

except:

# Python 2

import Queue as queue

def f2(fq, q):

while True:

try:

k = fq.get_nowait()

except queue.Empty:

exit(0)

# *** DO SOME STUFF HERE***

results = pd.DataFrame(df_)

q.put( (list_of_days[k], results) )

time.sleep(0.1)

The writer process:

def w(q):

writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')

while True:

try:

titel, result = q.get()

except ValueError:

writer.save()

exit(0)

result.to_excel(writer, sheet_name=titel)

The __main__ process:

if __name__ == '__main__':

w_q = mp.Queue()

w_p = mp.Process(target=w, args=(w_q,))

w_p.start()

time.sleep(0.1)

f_q = mp.Queue()

for i in range(len(list_of_days)):

f_q.put(i)

pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]

for p in pool:

p.start()

time.sleep(0.1)

for p in pool:

p.join()

w_q.put('STOP')

w_p.join()

Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6

梁浩赞

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫