Searching one Python dataframe/dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (a snippet of the relevant columns is included):

df1:

   PRODUCT_ID    PRODUCT_DESCRIPTION
0  165985858958  "Fish Burger with Lettuce"
1  185965653252  "Chicken Salad with Dressing"
2  165958565556  "Pork and Honey Rissoles"
3  655262522233  "Cheese, Ham and Tomato Sandwich"
4  857485966653  "Coleslaw with Yoghurt Dressing"
5  524156285551  "Lemon and Raspberry Cheesecake"

I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:

df2 (also saved as dict_2):

   PROD_ID  PROD_DESCRIPTION
0  548576   "Fish Burger"
1  156956   "Chckn Salad w/Ranch Dressing"
2  257848   "Rissoles - Lamb & Rosemary"
3  298770   "Lemn C-cake"
4  651452   "Potato Salad with Bacon"
5  100256   "Cheese Cake - Lemon Raspberry Coulis"

What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to handle the heavy-lifting part. I would then need to check the matches manually, but it would still be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:

   PRODUCT_ID    PRODUCT_DESCRIPTION                PROD_ID  PROD_DESCRIPTION
0  165985858958  "Fish Burger with Lettuce"         548576   "Fish Burger"
1  185965653252  "Chicken Salad with Dressing"      156956   "Chckn Salad w/Ranch Dressing"
2  165958565556  "Pork and Honey Rissoles"          257848   "Rissoles - Lamb & Rosemary"
3  655262522233  "Cheese, Ham and Tomato Sandwich"  NaN      NaN
4  857485966653  "Coleslaw with Yoghurt Dressing"   NaN      NaN
5  524156285551  "Lemon and Raspberry Cheesecake"   298770   "Lemn C-cake"
6  524156285551  "Lemon and Raspberry Cheesecake"   100256   "Cheese Cake - Lemon Raspberry Coulis"

I have already completed a join which has identified the exact matches. It's not important that the index is retained, as the Product IDs in each df are unique. The results can also be saved into a new dataframe, as this will then be applied to a third dataframe that has around 14 million rows.
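For reference, a minimal sketch of what that exact-match join might look like (this is an assumption, since the question doesn't show it; it joins on the raw description strings of the df1/df2 frames above):

import pandas as pd

# Exact matches only: inner join where the description text is identical
exact = df1.merge(df2, left_on='PRODUCT_DESCRIPTION',
                  right_on='PROD_DESCRIPTION', how='inner')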

I've worked through several related questions and answers (amongst others), and also tried various loops/functions/mapping etc., but have had no success: either the first "fuzzy match" is returned with a low score, or no matches are detected at all.

I like the idea of a matching/distance score column being generated, as it would allow me to speed up the manual checking process.

I'm using Python 2.7, pandas and have fuzzywuzzy installed.
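For context, fuzz.ratio returns an integer similarity score from 0 to 100. A quick check on the sample rows above (the expected scores come from the distance matrix in the answer below):

from fuzzywuzzy import fuzz

fuzz.ratio("Fish Burger with Lettuce", "Fish Burger")                       # 63
fuzz.ratio("Chicken Salad with Dressing", "Chckn Salad w/Ranch Dressing")   # 84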

Solution

Using fuzz.ratio as my distance metric, I calculate my distance matrix like this:

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# Pairwise similarity scores: one row per df1 product, one column per df2 product
df3 = pd.DataFrame(index=df1.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df1.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

df3 = df3.astype(int)  # cells were filled as objects; cast so max/idxmax work
print(df3)

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for an acceptable distance; I set 50. Then find, for each row, the df2 index with the maximum score:

threshold = df3.max(1) > 50  # True where the best score clears the cutoff
idxmax = df3.idxmax(1)       # df2 index of the best-scoring match per row

Make the assignments:

df1['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df1['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df1
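One small addition, not part of the original answer: since the question asks for a matching/distance score column to speed up the manual check, the winning score can be kept alongside the assignments in the same way:

# Best fuzz.ratio per row; NaN where no match cleared the threshold of 50
df1['MATCH_SCORE'] = np.where(threshold, df3.max(1), np.nan)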
