我正在尝试使用以下代码在Python中模糊合并两个数据框:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')
想法是基于每个数据框的两列(名称和年份)进行模糊合并。我试图将这两个字段合并到一个字段中(NamePeriod),然后在该字段上合并,但出现以下错误:
TypeError: expected string or bytes-like object
任何想法如何执行此模糊合并?这些列在数据框中的外观如下:
print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])
print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])
Name ... NamePeriod
0 NaN ... NaN
1 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2019
2 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2018
3 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2017
4 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2016
... ... ...
15922 Huitao Technology Co., Ltd. ... NaN
15923 Leaping Group Co., Ltd. ... NaN
15924 PUYI, INC. ... NaN
15925 Puhui Wealth Investment Management Co., Ltd. ... NaN
15926 Tidal Royalty Corp. ... NaN
[15927 rows x 3 columns]
prospectus_issuer_name fyear NamePeriod
0 ALCAN ALUM LTD 1990 ALCAN ALUM LTD 1990
1 ALCAN ALUM LTD 1991 ALCAN ALUM LTD 1991
2 ALCAN ALUM LTD 1992 ALCAN ALUM LTD 1992
3 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
4 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
... ... ...
1798 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1799 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1800 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1801 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1802 BOC AVIATION PTE LTD GLOBAL 2016 BOC AVIATION PTE LTD GLOBAL 2016
[1803 rows x 3 columns]
这是我尝试运行的完整代码:
import pandas as pd
from rapidfuzz import process, utils
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)
from rapidfuzz import fuzz, utils
df3.dropna(subset = ["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
给我一个错误IndexError: list index out of range
解决方案
总结问题:
有两个DataFrame,都有名称和年份的键
您想合并两个DataFrame并删除所有重复的元素,其中重复的元素是具有相同年份和非常相似名称的元素
我正在使用以下两个示例DataFrames:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
'Period': [2019, 2019, 2018]})
df2 = pd.DataFrame({
'prospectus_issuer_name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
'fyear': [2019, 2019, 1992]})
我对这个问题的解决方法是从确定两个数据帧开始
df1.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
df2.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([df1, df2], ignore_index=True)
之后,可以遍历此新DataFrame并删除所有重复的行。我正在使用RapidFuzz在这里,因为它比FuzzyWuzzy(我是作者)快。以下代码将提前创建一个预处理名称列表,因为这些条目可能会被多次使用,并且预处理会占用大量的运行时间。之后,它会遍历所有行,并始终将其与所有具有较高索引的行(具有较低索引的行已进行比较,因为ratio(a,b)== ratio(b,a))进行比较,并且正确的年份。在正确的年份进行过滤可以使慢速字符串匹配算法的运行次数减少很多。对于年份相似且名称非常相似的所有行,将保留第一行,并删除其他行。您可能需要尝试使用score_cutoff和匹配算法,才能找到最适合您需求的算法。
from rapidfuzz import fuzz, utils
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)