Parallel processing in Python: how can I parallelize this nested loop?

I'm trying to reduce a list of names, and in order to perform this I'm using the fuzzywuzzy library.

I run two for loops, both over all the names. If two names have a fuzzy match score between 90 and 100, I rewrite the second name as the first name.

Here is an example of my dataset, data:

       nombre
    0  VICTOR MORENO MORENO
    1  SERGIO HERNANDEZ GUTIERREZ
    2  FRANCISCO JAVIER MUÑOZ LOPEZ
    3  JUAN RAYMUNDO MORALES MARTINEZ
    4  IVAN ERNESTO SANCHEZ URROZ

And here is my function:

    def fuzz_analisis0(top_names):
        for name2 in top_names["nombre"]:
            for name in top_names["nombre"]:
                if fuzz.ratio(name, name2) > 90 and fuzz.ratio(name, name2) < 100:
                    top_names[top_names["nombre"] == name] = name2

When I run this with:

fuzz_analisis0(data)

Everything works fine. Here is some output that shows it working:

    print(len(data))
    # 1400

    data = data.drop_duplicates()
    print(len(data))
    # 1256
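For reference, the deduplication logic above can be sketched with the standard library alone: difflib's SequenceMatcher is what fuzzywuzzy's pure-Python fuzz.ratio wraps (roughly, 100 times the similarity, rounded). The names below are illustrative, not from the real dataset:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Approximation of fuzzywuzzy's pure-Python fuzz.ratio:
    # 100 * SequenceMatcher similarity, rounded to an int.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def canonicalize(names):
    # Rewrite every near-duplicate (score strictly between 90 and 100)
    # with the matching name, mirroring the nested loop in the question.
    names = list(names)
    for name2 in names:
        for j, name in enumerate(names):
            if 90 < ratio(name, name2) < 100:
                names[j] = name2
    return names

names = ["VICTOR MORENO MORENO",
         "VICTOR MORENO MORENO.",   # near-duplicate with a stray dot
         "SERGIO HERNANDEZ GUTIERREZ"]
# Both MORENO variants collapse to the same spelling, so a
# drop-duplicates step afterwards would remove one row.
print(canonicalize(names))
```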

But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:

    cores = mp.cpu_count()
    df_split = np.array_split(data, cores, axis=0)
    pool = Pool(cores)
    df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
    pool.close()
    pool.join()
    pool.clear()

The function ends faster than expected and does not find any duplicates.

    print(len(data))
    # 1400

    data = data.drop_duplicates()
    print(len(data))
    # 1400

If anyone can help me figure out what is happening here and how to solve it, I'd be very grateful. Thanks in advance.

Edit:

I now have another function that works with the result of the previous one:

    def fuzz_analisis(dataframe, top_names):
        for index in top_names['nombre'].index:
            name2 = top_names.loc[index, 'nombre']
            for index2 in dataframe["nombre"].index:
                name = dataframe.loc[index2, 'nombre']
                if fuzz.ratio(name, name2) > 90 and fuzz.ratio(name, name2) < 100:
                    dataframe.loc[index, 'nombre'] = name

The dataframe looks like this:

    folio  nombre                         foto1    foto2    sexo       fecha_hora_registro
    131    JUAN DOMINGO GONZALEZ DELGADO  131.jpg  131.jpg  MASCULINO  2008-08-07 15:42:25
    132    FRANCISCO JAVIER VELA RAMIREZ  132.jpg  132.jpg  MASCULINO  2008-08-07 15:50:42
    133    JUAN CARLOS PEREZ MEDINA       133.jpg  133.jpg  MASCULINO  2008-08-07 16:37:24
    134    ARMANDO SALINAS SALINAS        134.jpg  134.jpg  MASCULINO  2008-08-07 17:18:12
    135    JOSE FELIX ZAMBRANO AMEZQUITA  135.jpg  135.jpg  MASCULINO  2008-08-07 17:55:05

Solution

When you find that your algorithm is slow, multiprocessing is one way to speed it up, but you should usually try to speed up the algorithm itself first. In fuzzywuzzy, fuzz.ratio computes a normalized Levenshtein distance, which is an O(N*M) operation, so you should minimise how often it is called. Note also why your parallel version found nothing: fuzz_analisis0 mutates its DataFrame in place and returns nothing, and each worker process receives a pickled copy of its chunk, so the in-place changes never reach the parent process. Here is an optimised version of mcskinner's multiprocessed solution, in which each worker returns its modified part:

    from functools import partial
    from fuzzywuzzy import fuzz
    import multiprocessing as mp
    import numpy as np

    def length_ratio(s1, s2):
        # Constant-time upper bound on fuzz.ratio, based only on the
        # length difference of the two strings.
        s1_len = len(s1)
        s2_len = len(s2)
        distance = s1_len - s2_len if s1_len > s2_len else s2_len - s1_len
        lensum = s1_len + s2_len
        return 100 - 100 * distance / lensum

    def fuzz_analisis0_partial(top_names, partial_top_names):
        for name2 in top_names["nombre"]:
            for name in partial_top_names["nombre"]:
                # Skip the expensive fuzz.ratio call when even the
                # length-based upper bound is below the threshold.
                if length_ratio(name, name2) < 90:
                    continue
                ratio = fuzz.ratio(name, name2)
                if ratio > 90 and ratio < 100:
                    partial_top_names[partial_top_names["nombre"] == name] = name2
        return partial_top_names

    cores = mp.cpu_count()
    df_split = np.array_split(data, cores, axis=0)
    pool = mp.Pool(cores)
    processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
    processed = np.vstack(list(processed_parts))
    pool.close()
    pool.join()
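Note that np.vstack turns the worker results into a plain ndarray. If you want to keep the result as a DataFrame, pd.concat is an alternative; the sketch below assumes pandas is available and uses a made-up frame in place of data, just to show that splitting and reassembling preserves the frame:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `data`.
data = pd.DataFrame({"nombre": ["A", "B", "C", "D"]})

# Split into chunks as the pool would, one per worker.
df_split = np.array_split(data, 2, axis=0)

# Workers would return their (possibly modified) parts; reassembling
# with pd.concat keeps the DataFrame type, index, and columns.
processed = pd.concat(df_split, axis=0)
print(len(processed))
```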

First of all, this solution only executes fuzz.ratio once per pair instead of twice, and since this call is what takes most of the time, that alone should give you about a 50% runtime improvement. As a second improvement, it checks a length-based ratio beforehand. This length-based ratio is always at least as big as fuzz.ratio, but can be calculated in constant time, so all name pairs with a big length difference are processed a lot faster. Besides this, make sure you are using fuzzywuzzy with python-Levenshtein installed, since that is a lot faster than the fallback version based on difflib. As an even faster alternative you could use RapidFuzz (I am the author of RapidFuzz). RapidFuzz already applies the length-based shortcut when you pass it a cutoff score, as in fuzz.ratio(name, name2, score_cutoff=90), so the length_ratio function is not required when using it.
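The upper-bound claim can be checked with a stdlib-only sketch. Here difflib's SequenceMatcher stands in for fuzzywuzzy's pure-Python fuzz.ratio; since that ratio is 200*M/(len1+len2) with M at most the shorter length, it can never exceed the length-based bound. The name pairs are illustrative:

```python
from difflib import SequenceMatcher

def length_ratio(s1, s2):
    # Constant-time upper bound: the number of matching characters
    # can never exceed the length of the shorter string.
    distance = abs(len(s1) - len(s2))
    return 100 - 100 * distance / (len(s1) + len(s2))

def fuzz_like_ratio(s1, s2):
    # Stand-in for fuzzywuzzy's pure-Python fuzz.ratio.
    return 100 * SequenceMatcher(None, s1, s2).ratio()

pairs = [
    ("VICTOR MORENO MORENO", "VICTOR MORENO"),
    ("JUAN CARLOS PEREZ MEDINA", "JUAN CARLOS PEREZ MEDINA JR"),
    ("ARMANDO SALINAS SALINAS", "IVAN ERNESTO SANCHEZ URROZ"),
]
for s1, s2 in pairs:
    # The cheap bound always dominates the expensive score.
    assert length_ratio(s1, s2) >= fuzz_like_ratio(s1, s2)
```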

Using RapidFuzz, the equivalent fuzz_analisis0_partial function can be written as follows:

    from rapidfuzz import fuzz

    def fuzz_analisis0_partial(top_names, partial_top_names):
        for name2 in top_names["nombre"]:
            for name in partial_top_names["nombre"]:
                # Returns 0 when the score is below score_cutoff, so
                # cheap mismatches short-circuit inside the library.
                ratio = fuzz.ratio(name, name2, score_cutoff=90)
                if ratio > 90 and ratio < 100:
                    partial_top_names[partial_top_names["nombre"] == name] = name2
        return partial_top_names
