String Distance Matrix in Python using pdist

Question

How to calculate Jaro Winkler distance matrix of strings in Python?

I have a large array of hand-entered strings (names and record numbers) and I'm trying to find duplicates in the list, including duplicates that may have slight variations in spelling. A response to a similar question suggested using SciPy's pdist function with a custom distance function. I've tried to implement this solution with the jaro_winkler function in the Levenshtein package. The problem is that the jaro_winkler function requires string inputs, whereas the pdist function seems to require a 2D array input.

Example:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler
fname = np.array(['Bob','Carl','Kristen','Calr', 'Doug']).reshape(-1,1)
dm = pdist(fname, jaro_winkler)
dm = squareform(dm)

Expected Output - Something like this:

         Bob   Carl  Kristen  Calr  Doug
Bob      1.0   -     -        -     -
Carl     0.0   1.0   -        -     -
Kristen  0.0   0.46  1.0      -     -
Calr     0.0   0.93  0.46     1.0   -
Doug     0.53  0.0   0.0      0.0   1.0

Actual Error:

jaro_winkler expected two Strings or two Unicodes

I'm assuming this is because the jaro_winkler function is seeing an ndarray instead of a string, and I'm not sure how to convert the function input to a string in the context of the pdist function.

Does anyone have a suggestion to allow this to work? Thanks in advance!

Answer 1:

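You need to wrap the distance function. pdist passes each row of the 2D array to the metric as a 1D array, so the wrapper has to pull the string out with [0] before calling the string function. The following example demonstrates this with the Levenshtein distance: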

import numpy as np
from Levenshtein import distance
from scipy.spatial.distance import pdist, squareform

# my list of strings
strings = ["hello","hallo","choco"]

# prepare 2 dimensional array M x N (M entries (3) with N dimensions (1))
transformed_strings = np.array(strings).reshape(-1,1)

# calculate condensed distance matrix by wrapping the Levenshtein distance function
distance_matrix = pdist(transformed_strings,lambda x,y: distance(x[0],y[0]))

# get square matrix
print(squareform(distance_matrix))

Output:

array([[ 0., 1., 4.],
       [ 1., 0., 4.],
       [ 4., 4., 0.]])
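The same wrapper pattern should carry over to the jaro_winkler case from the question. The following is a minimal sketch rather than a definitive implementation: `string_distance` is a hypothetical helper name, and the `difflib` fallback is an assumption added only so the sketch still runs when the Levenshtein package is not installed. Because jaro_winkler returns a similarity (1.0 for identical strings) while squareform fills the diagonal with zeros, the sketch converts the score to a distance with 1 - similarity.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

try:
    from Levenshtein import jaro_winkler as similarity
except ImportError:
    # Assumption: stand-in similarity so the sketch runs without Levenshtein.
    # difflib's ratio is also in [0, 1] with 1.0 for identical strings,
    # but its values differ from Jaro-Winkler scores.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

names = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']

# pdist expects a 2D array: one row per string, one column
column = np.array(names).reshape(-1, 1)

def string_distance(x, y):
    # x and y arrive as one-element rows; [0] extracts the string.
    # 1 - similarity turns the score into a distance (0.0 = identical).
    return 1.0 - similarity(str(x[0]), str(y[0]))

condensed = pdist(column, string_distance)   # length m * (m - 1) / 2
dist_matrix = squareform(condensed)          # symmetric m x m, zero diagonal

print(np.round(dist_matrix, 2))
```

Small values in `dist_matrix` then point at likely duplicates; with this data the closest pair is Carl and Calr.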

Answer 2:

For anyone with a similar problem - One solution I just found is to extract the relevant code from the pdist function and add a [0] to the jaro_winkler function input to call the string out of the numpy array.

Example:

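# fname, jaro_winkler and squareform are the names from the question's example
X = np.asarray(fname, order='c')
m, n = X.shape
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
k = 0
for i in range(0, m - 1):  # range replaces Python 2's xrange
    for j in range(i + 1, m):
        # [0] pulls the string out of the one-element row
        dm[k] = jaro_winkler(X[i][0], X[j][0])
        k = k + 1
dms = squareform(dm)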

Even though this algorithm works I'd still like to learn if there's a "right" computer-sciency-way to do this with the pdist function. Thanks, and hope this helps someone!
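The nested i < j loops above visit the pairs in the same order that itertools.combinations produces, so the condensed vector can also be built without manual index bookkeeping. This is a sketch under assumptions: `condensed_index` is a hypothetical helper name, and the `difflib` fallback is only there so the code runs without the Levenshtein package.

```python
from itertools import combinations

try:
    from Levenshtein import jaro_winkler as similarity
except ImportError:
    # Assumption: stand-in similarity in [0, 1]; values differ from Jaro-Winkler.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

names = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']
m = len(names)

# Pairs come out as (0,1), (0,2), ..., (0,m-1), (1,2), ..., (m-2,m-1),
# the same order as the nested loops in this answer.
condensed = [similarity(names[i], names[j])
             for i, j in combinations(range(m), 2)]

def condensed_index(i, j, m):
    """Position of the pair (i, j), with i < j, in the condensed vector."""
    return m * i + j - ((i + 2) * (i + 1)) // 2

k = condensed_index(1, 3, m)  # the Carl / Calr pair
print(names[1], names[3], round(condensed[k], 2))
```

The index helper makes it possible to look up a single pair without expanding the vector into a square matrix.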

Answer 3:

Here's a concise solution that requires neither numpy nor scipy:

from Levenshtein import jaro_winkler
data = ['Bob','Carl','Kristen','Calr', 'Doug']
dm = [[ jaro_winkler(a, b) for b in data] for a in data]
print('\n'.join([''.join([f'{item:6.2f}' for item in row]) for row in dm]))

  1.00  0.00  0.00  0.00  0.53
  0.00  1.00  0.46  0.93  0.00
  0.00  0.46  1.00  0.46  0.00
  0.00  0.93  0.46  1.00  0.00
  0.53  0.00  0.00  0.00  1.00
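Since the original goal was to find likely duplicates, the scores can be filtered against a cutoff instead of being printed as a full matrix. The sketch below is one possible approach, not part of the answer above: the `THRESHOLD` value of 0.7 is an arbitrary assumption that would need tuning on real data, and the `difflib` fallback is only a stand-in for when the Levenshtein package is unavailable.

```python
from itertools import combinations

try:
    from Levenshtein import jaro_winkler as similarity
except ImportError:
    # Assumption: stand-in similarity in [0, 1]; values differ from Jaro-Winkler.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

data = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']

THRESHOLD = 0.7  # assumption: illustrative cutoff, tune for your data

# Score each unordered pair once and keep the ones above the cutoff
candidates = []
for a, b in combinations(data, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        candidates.append((a, b, score))

for a, b, score in candidates:
    print(f'{a} ~ {b}: {score:.2f}')
```

With this sample list only the Carl / Calr pair clears the cutoff.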
