Searching one Python dataframe/dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (a snippet of the relevant columns is included):

df1:

   PRODUCT_ID    PRODUCT_DESCRIPTION
0  165985858958  "Fish Burger with Lettuce"
1  185965653252  "Chicken Salad with Dressing"
2  165958565556  "Pork and Honey Rissoles"
3  655262522233  "Cheese, Ham and Tomato Sandwich"
4  857485966653  "Coleslaw with Yoghurt Dressing"
5  524156285551  "Lemon and Raspberry Cheesecake"

I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:

df2 (also saved as dict_2):

   PROD_ID  PROD_DESCRIPTION
0  548576   "Fish Burger"
1  156956   "Chckn Salad w/Ranch Dressing"
2  257848   "Rissoles - Lamb & Rosemary"
3  298770   "Lemn C-cake"
4  651452   "Potato Salad with Bacon"
5  100256   "Cheese Cake - Lemon Raspberry Coulis"

What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to handle the heavy-lifting part. I would then need to check the matches manually, but it would still be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:

   PRODUCT_ID    PRODUCT_DESCRIPTION                PROD_ID  PROD_DESCRIPTION
0  165985858958  "Fish Burger with Lettuce"         548576   "Fish Burger"
1  185965653252  "Chicken Salad with Dressing"      156956   "Chckn Salad w/Ranch Dressing"
2  165958565556  "Pork and Honey Rissoles"          257848   "Rissoles - Lamb & Rosemary"
3  655262522233  "Cheese, Ham and Tomato Sandwich"  NaN      NaN
4  857485966653  "Coleslaw with Yoghurt Dressing"   NaN      NaN
5  524156285551  "Lemon and Raspberry Cheesecake"   298770   "Lemn C-cake"
6  524156285551  "Lemon and Raspberry Cheesecake"   100256   "Cheese Cake - Lemon Raspberry Coulis"

I have already completed a join which has identified the exact matches. It's not important that the index is retained, as the Product IDs in each df are unique. The results can also be saved into a new dataframe, as this will then be applied to a third dataframe that has around 14 million rows.
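For reference, a minimal sketch of what that exact-match join might look like (this is an assumption, since the question doesn't show it; it joins on the raw description strings of the df1/df2 frames above):

import pandas as pd

# Exact matches only: inner join where the description text is identical
exact = df1.merge(df2, left_on='PRODUCT_DESCRIPTION',
                  right_on='PROD_DESCRIPTION', how='inner')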

I've worked through several related questions and answers (amongst others), and also tried various loops/functions/mapping etc., but have had no success: either the first "fuzzy match" is returned with a low score, or no matches are detected at all.

I like the idea of a matching/distance score column being generated, as it would allow me to speed up the manual checking process.

I'm using Python 2.7, pandas and have fuzzywuzzy installed.
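For context, fuzz.ratio returns an integer similarity score from 0 to 100. A quick check on the sample rows above (the expected scores come from the distance matrix in the answer below):

from fuzzywuzzy import fuzz

fuzz.ratio("Fish Burger with Lettuce", "Fish Burger")                       # 63
fuzz.ratio("Chicken Salad with Dressing", "Chckn Salad w/Ranch Dressing")   # 84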

Solution

Using fuzz.ratio as my distance metric, I calculate my distance matrix like this:

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# Pairwise similarity scores: one row per df1 product, one column per df2 product
df3 = pd.DataFrame(index=df1.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df1.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

df3 = df3.astype(int)  # cells were filled as objects; cast so max/idxmax work
print(df3)

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for an acceptable distance; I set 50. Then find, for each row, the df2 index with the maximum score:

threshold = df3.max(1) > 50  # True where the best score clears the cutoff
idxmax = df3.idxmax(1)       # df2 index of the best-scoring match per row

Make the assignments:

df1['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df1['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df1
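One small addition, not part of the original answer: since the question asks for a matching/distance score column to speed up the manual check, the winning score can be kept alongside the assignments in the same way:

# Best fuzz.ratio per row; NaN where no match cleared the threshold of 50
df1['MATCH_SCORE'] = np.where(threshold, df3.max(1), np.nan)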
