Speeding up a for loop with Python multiprocessing: sharing a large pandas DataFrame with a multiprocessing for loop in Python

Latest recommended article published on 2024-03-11 18:50:50

weixin_39849479


Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

    import multiprocessing

    import pandas
    import pyodbc


    def download(args):
        """pyodbc code to download data from the SQL database"""


    def calc(dataset, index):
        # Keep only the rows that belong to this ID
        filter_data = dataset[dataset['ID'] == index]
        # ... run calculations on the filtered DataFrame ...
        # ... append results to a local csv ...


    if __name__ == '__main__':
        data_1 = download(args_1)
        data_2 = download(args_2)
        # Combine the downloaded DataFrames into one
        all_data = pandas.concat([data_1, data_2])
        unique_id = pandas.unique(all_data['ID'])

        pool = multiprocessing.Pool()
        # Note: all_data is passed (and copied) to a worker for every ID
        for x in unique_id:
            pool.apply_async(calc, args=(all_data, x))
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to finish

Solution

Q : "Sharing large pandas DataFrame with multiprocessing for loop in Python ?"

While the multiprocessing module does offer tools for sharing some data, using them here would be an anti-pattern for the stated goal, which is to gain performance by running the work inside a Pool-instance, in a "just"-[CONCURRENT]-fashion.

Why?

You pay immense costs to move the filtering into a Pool of independent ( "just"-[CONCURRENT] ) workers, yet each of them then waits to be served by the central Manager process, whose GIL-lock handles the requests one after another, so the work becomes pure-[SERIAL] again. Even worse, the task is RAM-I/O-bound, so routing every access to RAM through that single process chokes performance and moves things in principally the wrong direction.
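To make part of that add-on cost concrete, here is a minimal sketch using only the standard library. It assumes a small list of `(ID, value)` tuples as a hypothetical stand-in for the DataFrame, and a hypothetical helper `payload_bytes` that measures the pickled size of an object. `Pool.apply_async` pickles its arguments for every submitted task, so passing the full dataset with each ID ships one full copy per ID, while splitting the data once ships each row only once. Splitting first is one possible way to cut the transfer cost, not something the answer above prescribes.

```python
import pickle
from collections import defaultdict

# Hypothetical stand-in for the DataFrame: 100000 rows spread over 100 IDs.
rows = [(i % 100, float(i)) for i in range(100000)]


def payload_bytes(obj):
    """Size of the pickled object, i.e. what one task submission would ship."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Split once, in a single pass, instead of filtering the full data per ID.
groups = defaultdict(list)
for row_id, value in rows:
    groups[row_id].append(value)

# Bytes shipped per task when the full data is an argument of every task.
whole = payload_bytes(rows)

# Total traffic: one full copy per ID versus each group shipped exactly once.
total_full = whole * len(groups)
total_groups = sum(payload_bytes(g) for g in groups.values())

print("per-task payload (full data):", whole)
print("total shipped, full data per ID:", total_full)
print("total shipped, one group per ID:", total_groups)
```

The absolute numbers depend on the data. The point is only the ratio between the two totals, which grows with the number of IDs.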

THE ECONOMY OF ADD-ON COSTS v/s THE TRAP of AMDAHL's LAW :

The add-on costs, which are not visible from a few SLOC-s, can be ( and often are ) way higher than any performance benefit ( which remains only potential until the code is well engineered, tuned and validated in-vivo ) from running several streams of code-execution in a "just"-[CONCURRENT] fashion ( and a True-[PARALLEL] one is harder still ).
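The trade-off can be sketched numerically. The first function below is the classical Amdahl's law; the second adds an `overhead` term, which is an illustrative assumption here (setup, data transfer and teardown costs, expressed as a fraction of the original serial runtime) and not a formula taken from the answer. The parameter names `p`, `n` and `overhead` are likewise chosen for this example.

```python
def amdahl_speedup(p, n):
    """Classical Amdahl's law: p is the parallelisable fraction, n the worker count."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_with_overhead(p, n, overhead):
    """Same model plus add-on costs, given as a fraction of the serial runtime."""
    return 1.0 / ((1.0 - p) + p / n + overhead)


if __name__ == "__main__":
    # 90 % parallelisable work on 4 workers, with no add-on costs.
    print(amdahl_speedup(0.9, 4))
    # The same work when add-on costs amount to 80 % of the serial runtime.
    print(speedup_with_overhead(0.9, 4, 0.8))
```

Under these assumed numbers the second result falls below 1, meaning the "parallel" version would run slower than the plain serial loop.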
