I don't think you need to do this in pandas. Here is my sloppy solution, but it gets the output you want by way of a dictionary.

import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame([
    ['0016F00001c7GDZQA2', 'Daniela Abriani'],
    ['0016F00001c7GPnQAM', 'Daniel Abriani'],
    ['0016F00001c7JRrQAM', 'Nisha Well'],
    ['0016F00001c7Jv8QAE', 'Katherine'],
    ['0016F00001c7cXiQAI', 'Katerine'],
    ['0016F00001c7dA3QAI', 'Katherin'],
    ['0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'],
    ['0016F00001c8G8OQAU', 'Nisa Well']],
    columns=['ID', 'NAME'])

Get the unique hashes into a dictionary.
^{pr2}$
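
The snippet for this step is missing above. A minimal sketch, assuming hashdict simply maps each ID to its NAME (which is what the later loop over hashdict.items() expects); the two lists here stand in for the DataFrame columns:

```python
# Assumption: hashdict maps ID -> NAME. With the DataFrame above this would
# presumably be dict(zip(df['ID'], df['NAME'])); plain lists are used here
# so the sketch runs without pandas.
ids = ['0016F00001c7Jv8QAE', '0016F00001c7cXiQAI', '0016F00001c7dA3QAI']
names = ['Katherine', 'Katerine', 'Katherin']

hashdict = dict(zip(ids, names))
```

Because the IDs are the keys, two rows with the same name but different IDs both survive in the dictionary.
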
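Define a function, checkpair. You need it to remove reciprocal hash pairs. The approach below adds both (hash1, hash2) and (hash2, hash1), but I think you only want to keep one of each pair:

def checkpair(a, b, l):
    # drop the entry in l that holds the same two hashes in reverse order
    for x in l:
        if (a, b) == (x[2], x[0]):
            l.remove(x)
            break  # at most one reciprocal exists, so stop once it is removed
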
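
If removing items from the list in place feels fragile, the same de-duplication can be sketched as a function that builds a new list instead. drop_reciprocals is a hypothetical name, and the tuples are assumed to carry the two IDs at positions 0 and 2, as checkpair expects:

```python
def drop_reciprocals(matches):
    """Keep the first of each (id1, id2) / (id2, id1) pair.

    drop_reciprocals is a hypothetical helper, not part of the answer's code.
    """
    seen = set()
    kept = []
    for m in matches:
        pair = frozenset((m[0], m[2]))  # order-independent key for the two IDs
        if pair not in seen:
            seen.add(pair)
            kept.append(m)
    return kept


# Toy tuples laid out as (id1, name1, id2, name2, score)
pairs = [
    ('A', 'Nisha Well', 'B', 'Nisa Well', 95),
    ('B', 'Nisa Well', 'A', 'Nisha Well', 95),
    ('C', 'Katherine', 'D', 'Katerine', 94),
]
unique_pairs = drop_reciprocals(pairs)
```

The frozenset makes (A, B) and (B, A) compare equal, so only the first one seen is kept and the input list is never modified.
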
Now iterate over hashdict.items() to find the top 3 matches for each name. The fuzzywuzzy docs have more detail on the process module.

matches = []
for k, v in hashdict.items():
    # ask for 4 results because each name also matches itself
    top3 = process.extract(v, hashdict, limit=4)
    # drop the hash ID that was compared to itself
    top3 = [h for h in top3 if h[2] != k]
    # keep (id1, name1, id2, name2, score) tuples that meet the score threshold
    matches.extend((k, v, x[2], x[0], x[1]) for x in top3 if x[1] > 60)  # change score?

# remove reciprocal pairs; checkpair only ever deletes entries later in the
# list than the current one, so this loop does not skip anything
for m in matches:
    checkpair(m[0], m[2], matches)

df = pd.DataFrame(matches, columns=['id1', 'name1', 'id2', 'name2', 'score'])

# write to file
with pd.ExcelWriter('/path/to/your/file.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

Output:

                  id1           name1                 id2            name2  score
0  0016F00001c7JRrQAM      Nisha Well  0016F00001c8G8OQAU        Nisa Well     95
1  0016F00001c7GPnQAM  Daniel Abriani  0016F00001c7GDZQA2  Daniela Abriani     97
2  0016F00001c7Jv8QAE       Katherine  0016F00001c7dA3QAI         Katherin     94
3  0016F00001c7Jv8QAE       Katherine  0016F00001c7cXiQAI         Katerine     94
4  0016F00001c7dA3QAI        Katherin  0016F00001c7cXiQAI         Katerine     88
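
The matching loop relies on process.extract handing back (name, score, id) tuples when the choices are a dictionary. To make that shape concrete without installing anything, here is a rough sketch with difflib from the standard library standing in for fuzzywuzzy. top_matches is a hypothetical helper, and its scores will not match the fuzzywuzzy scores in the table above:

```python
from difflib import SequenceMatcher


def top_matches(query, choices, limit=3):
    """Return up to `limit` (name, score, key) tuples, best first.

    top_matches is a hypothetical stand-in for process.extract with a dict
    of choices; difflib scores are NOT the same as fuzzywuzzy's.
    """
    scored = [
        (name, round(100 * SequenceMatcher(None, query, name).ratio()), key)
        for key, name in choices.items()
    ]
    # highest score first; ties keep dictionary order
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:limit]


hashdict = {
    '0016F00001c7Jv8QAE': 'Katherine',
    '0016F00001c7cXiQAI': 'Katerine',
    '0016F00001c7JRrQAM': 'Nisha Well',
}
results = top_matches('Katherine', hashdict, limit=2)
```

The first result is the query matched against itself, which is why the loop asks for one more result than it needs and then filters out the entry whose key equals the current ID.
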