Parallel processing of a for loop in Python: how do I convert a for loop to parallel processing?

I am still at a very early stage of learning Python. Apologies in advance if this question sounds stupid.

I have this set of data (in table format) that I want to add a few calculated columns to. Basically I have some location lon/lat and destination lon/lat, plus the respective timestamps, and I'm calculating the average velocity between each pair.

Sample data looks like this:

```
print(data_all.head(3))

   id    lon_evnt   lat_evnt           event_time  \
0   1 -179.942833  41.012467  2017-12-13 21:17:54
1   2 -177.552817  41.416400  2017-12-14 03:16:00
2   3 -175.096567  41.403650  2017-12-14 09:14:06

   dest_data_generate_time   lat_dest    lon_dest  \
0  2017-12-13 22:33:37.980  37.798599 -121.292193
1  2017-12-14 04:33:44.393  37.798599 -121.292193
2  2017-12-14 10:33:51.629  37.798599 -121.292193

                             address_fields_dest
0  {'address': 'Nestle Way', 'city': 'Lathrop...
1  {'address': 'Nestle Way', 'city': 'Lathrop...
2  {'address': 'Nestle Way', 'city': 'Lathrop...
```

I then zipped the lon/lat together:

```python
data_all['ping_location'] = list(zip(data_all.lon_evnt, data_all.lat_evnt))
data_all['destination'] = list(zip(data_all.lon_dest, data_all.lat_dest))
```

Then I want to calculate the distance between each pair of location pings, grab some address info from a string (basically taking a substring), and then calculate the velocity:

```python
for idx, row in data_all.iterrows():
    dist = gcd.dist(row['destination'], row['ping_location'])
    data_all.loc[idx, 'gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2
    data_all.loc[idx, 'destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity, which is: v = d/t
    ## time is the difference btwn the destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    data_all.loc[idx, 'velocity km/hr'] = 0

    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        data_all.loc[idx, 'velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
```
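(`gcd.dist` is a great-circle distance helper that isn't shown in this post; for context, a minimal sketch of what it presumably computes, assuming `(lon, lat)` tuples in degrees and distances in kilometres, would be the standard haversine formula:)

```python
import math

def dist(a, b):
    """Great-circle distance in km between two (lon, lat) points (haversine)."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # mean Earth radius ~6371 km
```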

OK, now: this program took almost 7 hours to execute on 333k rows of data! I'm on Windows 10 with a 2-core CPU and 16 GB of RAM... which is not much, but 7 hours is definitely not OK. :(

How can I make the program run more efficiently? One way I'm thinking of: since the rows and their calculations are independent of each other, I can take advantage of parallel processing.

I've read many posts, but it seems like most of the parallel processing methods presented assume you're only applying one simple function; here I'm adding multiple new columns.

Any help is really appreciated! Or tell me if it's impossible to make pandas do parallel processing (I believe I've read that somewhere, but I'm not completely sure it's still 100% true).

I've read into a number of sample posts here, and a lot more that are not on Stack Overflow...

Solution

Here is a quick solution. I didn't try to optimize your code at all; I just fed it into a multiprocessing pool. This will run your function on each row individually, return the row with the new properties, and build a new dataframe from that output.

```python
import multiprocessing as mp
import datetime
import pandas as pd

def func(arg):
    idx, row = arg
    dist = gcd.dist(row['destination'], row['ping_location'])
    row['gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2
    row['destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity, which is: v = d/t
    ## time is the difference btwn the destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    row['velocity km/hr'] = 0

    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        row['velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
    return row

if __name__ == "__main__":
    # the pool must be created under the __main__ guard, or Windows
    # will re-execute this module in every worker process
    with mp.Pool(processes=mp.cpu_count()) as pool:
        new_rows = pool.map(func, [(idx, row) for idx, row in data_all.iterrows()])
    # each returned row is a Series, so rebuild the frame from the list
    data_all_new = pd.DataFrame(new_rows)
```
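If you want to cut down the per-row pickling overhead, a variation (an untested sketch; `process_chunk` is a hypothetical helper reusing `func` from above) is to ship whole chunks of the DataFrame to each worker instead of individual rows:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # run the per-row function over a whole chunk at once, so each
    # worker receives one DataFrame instead of thousands of rows
    return chunk.apply(lambda row: func((row.name, row)), axis=1)

if __name__ == "__main__":
    n_workers = mp.cpu_count()
    chunks = np.array_split(data_all, n_workers * 4)  # a few chunks per worker
    with mp.Pool(processes=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    data_all_new = pd.concat(results)
```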

In Python, there are several ways to run a for loop in parallel. Two common approaches are described below.

## 1. Using the multiprocessing module

The `multiprocessing` module is Python's built-in multi-process library; it achieves parallelism by spawning multiple processes. A simple example:

```python
import multiprocessing

def process(number):
    print(f"Processing {number}")

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    with multiprocessing.Pool() as pool:
        pool.map(process, numbers)
```

In the example above, we define a `process` function to handle each number, create a process pool with `multiprocessing.Pool`, and use `pool.map` to process the numbers in parallel.

## 2. Using the joblib module

The `joblib` module is a toolkit for parallelism that supports both multi-process and multi-threaded execution. A simple example:

```python
from joblib import Parallel, delayed

def process(number):
    print(f"Processing {number}")

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    Parallel(n_jobs=-1)(delayed(process)(number) for number in numbers)
```

In the example above, we again define a `process` function and use `Parallel` to run it on every number concurrently; `n_jobs=-1` means use all available CPU cores.

Note that when running in parallel, you must make sure the individual processes or threads do not interfere with one another. Where they share state, use a lock or another synchronization mechanism to avoid race conditions and keep the data correct, as in the sketch below.
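To illustrate that last point, here is a minimal sketch (the `init_worker` helper and its global are just for this example) of sharing a `multiprocessing.Lock` with pool workers via the pool's `initializer` hook:

```python
import multiprocessing

def init_worker(shared_lock):
    # stash the lock in a global so worker tasks can reach it;
    # locks cannot be passed as pool.map arguments directly
    global lock
    lock = shared_lock

def work(i):
    with lock:  # serialize access to the shared resource (stdout here)
        print(f"worker {i}")

if __name__ == "__main__":
    l = multiprocessing.Lock()
    with multiprocessing.Pool(initializer=init_worker, initargs=(l,)) as pool:
        pool.map(work, range(5))
```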