Speeding up pandas DataFrame `apply` with multiprocessing in Python

This post covers only a simple multiprocessing implementation; if you need something more elaborate, look elsewhere.

Optimization results

Serial processing time: 523.79 s.

Data split into 20 chunks and run with 4 processes (my machine has only four cores): 282.55 s, a speedup of roughly 1.85x.

Implementation

The process is as follows, and it works remarkably well:

import datetime
import json
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
def strToStrp(x):
    # Parse "YYYY-mm-dd HH:MM:SS" strings; return None for anything malformed.
    if x is None or not isinstance(x, str) or len(x) != 19:
        return None
    return datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
# The weather file is a single line of JSON-like text that uses single quotes,
# so convert them to double quotes before parsing.
with open("../state/data/weatherData/顺义2.json", "r") as f:
    s = f.read()
s = s.replace("'", "\"")
weatherDataPEK = json.loads(s)

def getWeather(x):
    # Find the weather record whose [uptime_i, uptime_{i+1}] interval
    # contains the row's scheduled time.
    schdDateTime = x["Schd_Dt"]
    schdTime = strToStrp(schdDateTime)
    if schdTime is None:
        return [None, None, None, None, None, None, None]
    dateKey = schdDateTime[:10]
    weatherThisDay = weatherDataPEK[dateKey]
    weatherThisDay.sort(key=lambda w: datetime.datetime.strptime(w["uptime"], "%Y-%m-%d %H:%M:%S"))
    for i in range(len(weatherThisDay) - 1):
        if strToStrp(weatherThisDay[i]["uptime"]) <= schdTime <= strToStrp(weatherThisDay[i + 1]["uptime"]):
            rec = weatherThisDay[i]
            return [rec["weather"], rec["weatid"], rec["temp"], rec["wind"],
                    rec["windid"], rec["winp"], rec["winpid"]]
    return [None, None, None, None, None, None, None]

def apply_f(df):
    # Row-wise apply over one chunk; each worker process runs this.
    return df.apply(getWeather, axis=1)


def init_process(global_vars):
    # Optional Pool initializer for sharing read-only state with workers.
    global a
    a = global_vars


if __name__ == '__main__':
    np.random.seed(0)
    df = pd.read_csv("../state/data/standardizationData/allData2019.csv", low_memory=False)

    # Serial baseline (the 523 s figure above):
    # t1 = time.time()
    # result_serial = df.apply(getWeather, axis=1)

    t2 = time.time()

    # Split into 20 chunks and run 4 worker processes
    # (my machine has only four cores).
    df_parts = np.array_split(df, 20)
    with Pool(processes=4) as pool:
        result_parts = pool.map(apply_f, df_parts)

    # To share read-only state with the workers, pass an initializer:
    # with Pool(processes=4, initializer=init_process, initargs=(a,)) as pool:
    #     result_parts = pool.map(apply_f, df_parts)

    result_parallel = pd.concat(result_parts)
    t3 = time.time()
    print("Parallel time =", t3 - t2)
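The split/map/concat pattern above generalizes into a small reusable helper. Below is a minimal, self-contained sketch; the name `parallel_apply` and the toy data are my own, not from the original post. Note that the row function must be defined at module top level so it can be pickled and sent to the worker processes.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool


def row_sum(row):
    # Example row function; must be a module-level (picklable) function.
    return row["a"] + row["b"]


def _apply_chunk(args):
    # Worker task: run func row-wise over one DataFrame chunk.
    chunk, func = args
    return chunk.apply(func, axis=1)


def parallel_apply(df, func, n_chunks=20, n_procs=4):
    # Split the DataFrame, map the chunks across worker processes,
    # then reassemble the partial results in order.
    idx_parts = np.array_split(np.arange(len(df)), n_chunks)
    chunks = [df.iloc[idx] for idx in idx_parts]
    with Pool(processes=n_procs) as pool:
        parts = pool.map(_apply_chunk, [(c, func) for c in chunks])
    return pd.concat(parts)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(100), "b": range(100)})
    out = parallel_apply(df, row_sum, n_chunks=4, n_procs=2)
    print(out.head())
```

Splitting into more chunks than processes (20 chunks for 4 workers in the post) gives `Pool.map` smaller tasks to balance across workers, at the cost of a little extra pickling overhead.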
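As an aside, the linear interval scan inside `getWeather` can be replaced by a binary search using the standard-library `bisect` module. A minimal sketch; the helper name `find_interval` and its exact boundary handling are my own, not from the original post:

```python
import bisect
import datetime


def find_interval(times, t):
    # times: an ascending list of datetimes.
    # Return the index i with times[i] <= t <= times[i + 1], else None.
    i = bisect.bisect_right(times, t) - 1
    if 0 <= i < len(times) - 1:
        return i
    return None


if __name__ == "__main__":
    times = [datetime.datetime(2019, 1, 1, h) for h in (0, 6, 12, 18)]
    print(find_interval(times, datetime.datetime(2019, 1, 1, 7)))  # → 1
```

Sorting and parsing each day's records once up front (instead of re-sorting `weatherThisDay` on every row, as `getWeather` currently does) would likely help more than the parallelism itself.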

This post draws on the blog article at https://blog.fangzhou.me/posts/20170702-python-parallelism/, with thanks to its author.
