Speeding up pandas DataFrame `apply` with multiprocessing in Python

This post covers only a simple multiprocessing implementation; if you need something more elaborate, look elsewhere.

Optimization results

Serial processing time: 523.79 s.

Data split into 20 chunks and run with 4 processes (my machine has only four cores): 282.55 s, a speedup of roughly 1.85x.

Implementation

The process is as follows, and it works remarkably well:

import datetime
import json
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
def strToStrp(x):
    # Parse "YYYY-mm-dd HH:MM:SS" strings; return None for anything malformed.
    if x is None or not isinstance(x, str) or len(x) != 19:
        return None
    return datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
# The weather file is a single line of JSON-like text that uses single quotes,
# so convert them to double quotes before parsing.
with open("../state/data/weatherData/顺义2.json", "r") as f:
    s = f.read()
s = s.replace("'", "\"")
weatherDataPEK = json.loads(s)

def getWeather(x):
    # Find the weather record whose [uptime_i, uptime_{i+1}] interval
    # contains the row's scheduled time.
    schdDateTime = x["Schd_Dt"]
    schdTime = strToStrp(schdDateTime)
    if schdTime is None:
        return [None, None, None, None, None, None, None]
    dateKey = schdDateTime[:10]
    weatherThisDay = weatherDataPEK[dateKey]
    weatherThisDay.sort(key=lambda w: datetime.datetime.strptime(w["uptime"], "%Y-%m-%d %H:%M:%S"))
    for i in range(len(weatherThisDay) - 1):
        if strToStrp(weatherThisDay[i]["uptime"]) <= schdTime <= strToStrp(weatherThisDay[i + 1]["uptime"]):
            rec = weatherThisDay[i]
            return [rec["weather"], rec["weatid"], rec["temp"], rec["wind"],
                    rec["windid"], rec["winp"], rec["winpid"]]
    return [None, None, None, None, None, None, None]

def apply_f(df):
    # Row-wise apply over one chunk; each worker process runs this.
    return df.apply(getWeather, axis=1)


def init_process(global_vars):
    # Optional Pool initializer for sharing read-only state with workers.
    global a
    a = global_vars


if __name__ == '__main__':
    np.random.seed(0)
    df = pd.read_csv("../state/data/standardizationData/allData2019.csv", low_memory=False)

    # Serial baseline (the 523 s figure above):
    # t1 = time.time()
    # result_serial = df.apply(getWeather, axis=1)

    t2 = time.time()

    # Split into 20 chunks and run 4 worker processes
    # (my machine has only four cores).
    df_parts = np.array_split(df, 20)
    with Pool(processes=4) as pool:
        result_parts = pool.map(apply_f, df_parts)

    # To share read-only state with the workers, pass an initializer:
    # with Pool(processes=4, initializer=init_process, initargs=(a,)) as pool:
    #     result_parts = pool.map(apply_f, df_parts)

    result_parallel = pd.concat(result_parts)
    t3 = time.time()
    print("Parallel time =", t3 - t2)
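The split/map/concat pattern above generalizes into a small reusable helper. Below is a minimal, self-contained sketch; the name `parallel_apply` and the toy data are my own, not from the original post. Note that the row function must be defined at module top level so it can be pickled and sent to the worker processes.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool


def row_sum(row):
    # Example row function; must be a module-level (picklable) function.
    return row["a"] + row["b"]


def _apply_chunk(args):
    # Worker task: run func row-wise over one DataFrame chunk.
    chunk, func = args
    return chunk.apply(func, axis=1)


def parallel_apply(df, func, n_chunks=20, n_procs=4):
    # Split the DataFrame, map the chunks across worker processes,
    # then reassemble the partial results in order.
    idx_parts = np.array_split(np.arange(len(df)), n_chunks)
    chunks = [df.iloc[idx] for idx in idx_parts]
    with Pool(processes=n_procs) as pool:
        parts = pool.map(_apply_chunk, [(c, func) for c in chunks])
    return pd.concat(parts)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(100), "b": range(100)})
    out = parallel_apply(df, row_sum, n_chunks=4, n_procs=2)
    print(out.head())
```

Splitting into more chunks than processes (20 chunks for 4 workers in the post) gives `Pool.map` smaller tasks to balance across workers, at the cost of a little extra pickling overhead.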
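As an aside, the linear interval scan inside `getWeather` can be replaced by a binary search using the standard-library `bisect` module. A minimal sketch; the helper name `find_interval` and its exact boundary handling are my own, not from the original post:

```python
import bisect
import datetime


def find_interval(times, t):
    # times: an ascending list of datetimes.
    # Return the index i with times[i] <= t <= times[i + 1], else None.
    i = bisect.bisect_right(times, t) - 1
    if 0 <= i < len(times) - 1:
        return i
    return None


if __name__ == "__main__":
    times = [datetime.datetime(2019, 1, 1, h) for h in (0, 6, 12, 18)]
    print(find_interval(times, datetime.datetime(2019, 1, 1, 7)))  # → 1
```

Sorting and parsing each day's records once up front (instead of re-sorting `weatherThisDay` on every row, as `getWeather` currently does) would likely help more than the parallelism itself.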

This post draws on the blog article at https://blog.fangzhou.me/posts/20170702-python-parallelism/, with thanks to its author.
