Parallel processing in Python: how can I parallelize this nested loop?

I'm trying to reduce a list of names, and in order to perform this I'm using the fuzzywuzzy library.

I run two for loops, both over all the names. If two names have a fuzzy match score between 90 and 100, I rewrite the second name as the first name.

Here is an example of my dataset, data:

       nombre
    0  VICTOR MORENO MORENO
    1  SERGIO HERNANDEZ GUTIERREZ
    2  FRANCISCO JAVIER MUÑOZ LOPEZ
    3  JUAN RAYMUNDO MORALES MARTINEZ
    4  IVAN ERNESTO SANCHEZ URROZ

And here is my function:

    def fuzz_analisis0(top_names):
        for name2 in top_names["nombre"]:
            for name in top_names["nombre"]:
                if fuzz.ratio(name, name2) > 90 and fuzz.ratio(name, name2) < 100:
                    top_names[top_names["nombre"] == name] = name2

When I run this with:

fuzz_analisis0(data)

Everything works fine. Here is some output that shows it working:

    print(len(data))
    # 1400

    data = data.drop_duplicates()
    print(len(data))
    # 1256
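For reference, the deduplication logic above can be sketched with the standard library alone: difflib's SequenceMatcher is what fuzzywuzzy's pure-Python fuzz.ratio wraps (roughly, 100 times the similarity, rounded). The names below are illustrative, not from the real dataset:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Approximation of fuzzywuzzy's pure-Python fuzz.ratio:
    # 100 * SequenceMatcher similarity, rounded to an int.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def canonicalize(names):
    # Rewrite every near-duplicate (score strictly between 90 and 100)
    # with the matching name, mirroring the nested loop in the question.
    names = list(names)
    for name2 in names:
        for j, name in enumerate(names):
            if 90 < ratio(name, name2) < 100:
                names[j] = name2
    return names

names = ["VICTOR MORENO MORENO",
         "VICTOR MORENO MORENO.",   # near-duplicate with a stray dot
         "SERGIO HERNANDEZ GUTIERREZ"]
# Both MORENO variants collapse to the same spelling, so a
# drop-duplicates step afterwards would remove one row.
print(canonicalize(names))
```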

But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:

    cores = mp.cpu_count()
    df_split = np.array_split(data, cores, axis=0)
    pool = Pool(cores)
    df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
    pool.close()
    pool.join()
    pool.clear()

The function ends faster than expected and does not find any duplicates.

    print(len(data))
    # 1400

    data = data.drop_duplicates()
    print(len(data))
    # 1400

If anyone can help me figure out what is happening here and how to solve it, I'd be very grateful. Thanks in advance.

Edit:

I now have another function that works with the result of the previous one:

    def fuzz_analisis(dataframe, top_names):
        for index in top_names['nombre'].index:
            name2 = top_names.loc[index, 'nombre']
            for index2 in dataframe["nombre"].index:
                name = dataframe.loc[index2, 'nombre']
                if fuzz.ratio(name, name2) > 90 and fuzz.ratio(name, name2) < 100:
                    dataframe.loc[index, 'nombre'] = name

The dataframe looks like this:

    folio  nombre                         foto1    foto2    sexo       fecha_hora_registro
    131    JUAN DOMINGO GONZALEZ DELGADO  131.jpg  131.jpg  MASCULINO  2008-08-07 15:42:25
    132    FRANCISCO JAVIER VELA RAMIREZ  132.jpg  132.jpg  MASCULINO  2008-08-07 15:50:42
    133    JUAN CARLOS PEREZ MEDINA       133.jpg  133.jpg  MASCULINO  2008-08-07 16:37:24
    134    ARMANDO SALINAS SALINAS        134.jpg  134.jpg  MASCULINO  2008-08-07 17:18:12
    135    JOSE FELIX ZAMBRANO AMEZQUITA  135.jpg  135.jpg  MASCULINO  2008-08-07 17:55:05

Solution

When you find that your algorithm is slow, multiprocessing is one way to speed it up, but you should usually try to speed up the algorithm itself first. In fuzzywuzzy, fuzz.ratio computes a normalized Levenshtein distance, which is an O(N*M) operation, so you should minimise how often it is called. Note also why your parallel version found nothing: fuzz_analisis0 mutates its DataFrame in place and returns nothing, and each worker process receives a pickled copy of its chunk, so the in-place changes never reach the parent process. Here is an optimised version of mcskinner's multiprocessed solution, in which each worker returns its modified part:

    from functools import partial
    from fuzzywuzzy import fuzz
    import multiprocessing as mp
    import numpy as np

    def length_ratio(s1, s2):
        # Constant-time upper bound on fuzz.ratio, based only on the
        # length difference of the two strings.
        s1_len = len(s1)
        s2_len = len(s2)
        distance = s1_len - s2_len if s1_len > s2_len else s2_len - s1_len
        lensum = s1_len + s2_len
        return 100 - 100 * distance / lensum

    def fuzz_analisis0_partial(top_names, partial_top_names):
        for name2 in top_names["nombre"]:
            for name in partial_top_names["nombre"]:
                # Skip the expensive fuzz.ratio call when even the
                # length-based upper bound is below the threshold.
                if length_ratio(name, name2) < 90:
                    continue
                ratio = fuzz.ratio(name, name2)
                if ratio > 90 and ratio < 100:
                    partial_top_names[partial_top_names["nombre"] == name] = name2
        return partial_top_names

    cores = mp.cpu_count()
    df_split = np.array_split(data, cores, axis=0)
    pool = mp.Pool(cores)
    processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
    processed = np.vstack(list(processed_parts))
    pool.close()
    pool.join()
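Note that np.vstack turns the worker results into a plain ndarray. If you want to keep the result as a DataFrame, pd.concat is an alternative; the sketch below assumes pandas is available and uses a made-up frame in place of data, just to show that splitting and reassembling preserves the frame:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `data`.
data = pd.DataFrame({"nombre": ["A", "B", "C", "D"]})

# Split into chunks as the pool would, one per worker.
df_split = np.array_split(data, 2, axis=0)

# Workers would return their (possibly modified) parts; reassembling
# with pd.concat keeps the DataFrame type, index, and columns.
processed = pd.concat(df_split, axis=0)
print(len(processed))
```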

First of all, this solution only executes fuzz.ratio once per pair instead of twice, and since this call is what takes most of the time, that alone should give you about a 50% runtime improvement. As a second improvement, it checks a length-based ratio beforehand. This length-based ratio is always at least as big as fuzz.ratio, but can be calculated in constant time, so all name pairs with a big length difference are processed a lot faster. Besides this, make sure you are using fuzzywuzzy with python-Levenshtein installed, since that is a lot faster than the fallback version based on difflib. As an even faster alternative you could use RapidFuzz (I am the author of RapidFuzz). RapidFuzz already applies the length-based shortcut when you pass it a cutoff score, as in fuzz.ratio(name, name2, score_cutoff=90), so the length_ratio function is not required when using it.
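The upper-bound claim can be checked with a stdlib-only sketch. Here difflib's SequenceMatcher stands in for fuzzywuzzy's pure-Python fuzz.ratio; since that ratio is 200*M/(len1+len2) with M at most the shorter length, it can never exceed the length-based bound. The name pairs are illustrative:

```python
from difflib import SequenceMatcher

def length_ratio(s1, s2):
    # Constant-time upper bound: the number of matching characters
    # can never exceed the length of the shorter string.
    distance = abs(len(s1) - len(s2))
    return 100 - 100 * distance / (len(s1) + len(s2))

def fuzz_like_ratio(s1, s2):
    # Stand-in for fuzzywuzzy's pure-Python fuzz.ratio.
    return 100 * SequenceMatcher(None, s1, s2).ratio()

pairs = [
    ("VICTOR MORENO MORENO", "VICTOR MORENO"),
    ("JUAN CARLOS PEREZ MEDINA", "JUAN CARLOS PEREZ MEDINA JR"),
    ("ARMANDO SALINAS SALINAS", "IVAN ERNESTO SANCHEZ URROZ"),
]
for s1, s2 in pairs:
    # The cheap bound always dominates the expensive score.
    assert length_ratio(s1, s2) >= fuzz_like_ratio(s1, s2)
```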

Using RapidFuzz, the equivalent fuzz_analisis0_partial function can be written as follows:

    from rapidfuzz import fuzz

    def fuzz_analisis0_partial(top_names, partial_top_names):
        for name2 in top_names["nombre"]:
            for name in partial_top_names["nombre"]:
                # Returns 0 when the score is below score_cutoff, so
                # cheap mismatches short-circuit inside the library.
                ratio = fuzz.ratio(name, name2, score_cutoff=90)
                if ratio > 90 and ratio < 100:
                    partial_top_names[partial_top_names["nombre"] == name] = name2
        return partial_top_names
