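As @Khris said in his comment, you should split your dataframe into a few large chunks and iterate over each chunk in parallel. You could split the dataframe into arbitrarily sized chunks, but it makes more sense to split it into equally sized chunks based on the number of processes you plan to use. Luckily, someone else has already figured out how to do that part for us:
# don't forget to import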
import pandas as pd
import multiprocessing
# create as many processes as there are CPUs on your machine
num_processes = multiprocessing.cpu_count()
# calculate the chunk size as an integer
chunk_size = max(1, df.shape[0] // num_processes)
# this solution was reworked from the above link.
# will work even if the length of the dataframe is not evenly divisible by num_processes
chunks = [df.iloc[i:i + chunk_size] for i in range(0, df.shape[0], chunk_size)]
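To see what the chunking step produces, here is a minimal, self-contained sketch on a toy dataframe. The names `split_into_chunks`, `sample_df`, and `sample_chunks` are hypothetical and only for illustration. Note one deliberate difference from the snippet above: this sketch rounds the chunk size up rather than down, so the number of chunks never exceeds the requested count. That is an assumption about what is convenient, not part of the original solution.

```python
import pandas as pd


def split_into_chunks(df, num_chunks):
    """Split df into consecutive row blocks of roughly equal size.

    Hypothetical helper for illustration only.
    """
    # Ceiling division: at most num_chunks pieces are produced.
    # max(1, ...) avoids a zero step in range() for an empty frame.
    chunk_size = max(1, -(-len(df) // num_chunks))
    # .iloc slices by position, so this works for any index type.
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]


# Toy data: 10 rows split for 4 workers.
sample_df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
sample_chunks = split_into_chunks(sample_df, 4)

# Row counts per chunk; the last chunk holds the remainder.
sizes = [len(c) for c in sample_chunks]

# Concatenating the chunks in order gives back the original frame.
reassembled = pd.concat(sample_chunks)
```

With 10 rows and 4 requested chunks, the pieces have 3, 3, 3, and 1 rows, and concatenating them restores the original dataframe, which is what you want before handing the list to a pool.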
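This creates a list containing our dataframe split into chunks. Now we need to pass it into our pool together with a function that will manipulate the data.
def func(d):
    # let's create a function that squares every value in the dataframe
    return d * d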
# create