Speeding up a for loop with Python multiprocessing: sharing a large pandas DataFrame with a multiprocessing loop in Python

Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

import multiprocessing

import pandas
import pyodbc


def download(args):
    """pyodbc code to download data from SQL database"""


def calc(dataset, index):
    filter_data = dataset[dataset['ID'] == index]
    # run calculations on filtered DataFrame
    # append results to local csv


if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = data_1.append(data_2)  # Append downloaded DataFrames into one
    unique_id = pandas.unique(all_data['ID'])

    pool = multiprocessing.Pool()
    for x in unique_id:
        # Every task pickles and ships the whole DataFrame to a worker
        pool.apply_async(calc, args=(all_data, x))
    pool.close()
    pool.join()  # wait for the queued tasks to finish

Solution

Q: "Sharing large pandas DataFrame with multiprocessing for loop in Python?"

While the multiprocessing module does offer tools for sharing some data, using them here would be an anti-pattern: the stated goal is performance, and operating this inside a Pool instance, in a "just"-[CONCURRENT] fashion, works against that goal.

Why?

You pay immense costs to move the filtering into a Pool of independent ( "just"-[CONCURRENT] ) workers, yet each of them then waits to be served through the central GIL-lock again, which turns the Manager's work back into pure-[SERIAL] processing. Even worse, the task is RAM-I/O-bound, so the performance suffocation from having no free access to RAM pushes things in exactly the wrong direction.
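One way to act on this, sketched here under stated assumptions rather than taken from the original answer, is to do the filtering once in the parent with `groupby` and hand each worker only the rows of its own ID, so that neither a Manager proxy nor a full copy of the DataFrame is involved. The `split_by_id` helper, the changed `calc` signature and the `value` column are hypothetical, chosen only for illustration. Every row is still pickled once on its way to a worker, so this only pays off if the per-ID calculation is heavy; otherwise iterating over `groupby('ID')` in a single process avoids all of the transfer costs.

```python
import multiprocessing

import pandas as pd


def calc(item):
    """Hypothetical per-ID calculation: receives only that ID's rows."""
    index, rows = item
    # Placeholder computation (an assumption): sum of a 'value' column.
    return index, rows['value'].sum()


def split_by_id(frame):
    """Filter once in the parent: one (ID, sub-frame) pair per unique ID."""
    return [(index, rows) for index, rows in frame.groupby('ID')]


if __name__ == '__main__':
    # Tiny stand-in for the downloaded data.
    all_data = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 3],
                             'value': [10, 20, 5, 1, 2, 3]})
    pool = multiprocessing.Pool(processes=2)
    try:
        # Each task pickles only its own group, never the whole frame.
        results = pool.map(calc, split_by_id(all_data))
    finally:
        pool.close()
        pool.join()
    print(sorted(results))
```

Compared with `dataset[dataset['ID'] == index]`, which rescans all rows for every ID, `groupby` partitions the frame in a single pass.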

THE ECONOMY OF ADD-ON COSTS v/s THE TRAP of AMDAHL's LAW :

The add-on costs, which are not visible from a few SLOC-s, can be ( and often are ) far higher than any in-vivo performance benefit of running several streams of code execution in a "just"-[CONCURRENT] fashion ( a True-[PARALLEL] one is harder still ), and that benefit remains only potential until the code has been well engineered, tuned and validated.
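The trade-off can be made concrete with a small calculation. The sketch below uses the classical Amdahl's law speedup, 1 / ((1 - p) + p / N), and adds an overhead term as an assumption of this example: `overhead_fraction` is a hypothetical parameter expressing the add-on costs (process start-up, pickling, inter-process transfers) as a fraction of the original serial runtime.

```python
def amdahl_speedup(parallel_fraction, workers, overhead_fraction=0.0):
    """Speedup over the serial run for a given parallelisable fraction.

    overhead_fraction is an assumed add-on cost, relative to the serial
    runtime, that is paid only when the work is distributed.
    """
    serial_fraction = 1.0 - parallel_fraction
    distributed_time = (serial_fraction
                        + overhead_fraction
                        + parallel_fraction / float(workers))
    return 1.0 / distributed_time


if __name__ == '__main__':
    # 90% parallelisable work on 4 workers, with growing add-on costs.
    for overhead in (0.0, 0.5, 1.0):
        print(overhead, round(amdahl_speedup(0.9, 4, overhead), 3))
```

Once the add-on costs approach the runtime of the serial version itself, the result falls below 1, meaning the "parallel" version is slower than the plain loop.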
