Fastest pairwise distance metric in Python

weixin_39612720

Published 2020-11-24 08:46:27


I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) that does this with broadcasting, but it's inefficient because it calculates each distance twice, and it doesn't scale well.

Here's an example that gives me what I want with an array of 1000 numbers.

import numpy as np
import random

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
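To make the broadcasting step concrete, here is a minimal sketch on a three-element array (the values are chosen only for illustration). It shows that the result is a full symmetric matrix, so each off-diagonal distance appears twice.

```python
import numpy as np

r = np.array([1, 4, 9])

# r has shape (3,) and r[:, None] has shape (3, 1).
# Broadcasting expands both to (3, 3), so entry [i, j] is r[j] - r[i].
dists = np.abs(r - r[:, None])
print(dists)
# [[0 3 8]
#  [3 0 5]
#  [8 5 0]]
```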

What's the fastest implementation in scipy/numpy/scikit-learn that I can use for this, given that it has to scale to 1D arrays with more than 10k values?

Note: the matrix is symmetric, so I'm guessing it's possible to get at least a 2x speedup by exploiting that; I just don't know how.
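One way to exploit the symmetry in plain NumPy is to compute only the pairs above the diagonal. The sketch below uses np.triu_indices for that; the helper name upper_triangle_dists is hypothetical, not part of any library. Building the index arrays has its own cost in time and memory, so this illustrates the idea rather than guaranteeing a speedup over broadcasting.

```python
import numpy as np

def upper_triangle_dists(r):
    """Return |r[i] - r[j]| for every pair i < j as a flat array."""
    # Index pairs strictly above the diagonal (k=1 skips the zero diagonal).
    i, j = np.triu_indices(len(r), k=1)
    return np.abs(r[i] - r[j])

r = np.array([1, 4, 9, 2])
condensed = upper_triangle_dists(r)
print(condensed)  # pairs in order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
# [3 8 1 5 2 7]
```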

Solution

Neither of the other answers quite answered the question: one was in Cython, and the other was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.

Here's some code:

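import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]  # column vector of shape (1000, 1); options 2 and 3 need 2D input

def option1(r):
    # Broadcasting: full (n, n) matrix, each distance computed twice
    dists = np.abs(r - r[:, None])
    return dists

def option2(r):
    # pdist: condensed vector holding each pair once
    dists = scipy.spatial.distance.pdist(r, 'cityblock')
    return dists

def option3(r):
    # scikit-learn: full (n, n) matrix
    dists = sklearn.metrics.pairwise.manhattan_distances(r)
    return dists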
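Note that the options do not return the same shape: pdist returns a condensed vector with one entry per pair, while the other two return the full square matrix. For one-dimensional data the city-block and Euclidean distances coincide (both are the absolute difference), which is why 'cityblock' is a valid choice here. A small sketch, using an illustrative four-element array, of how scipy.spatial.distance.squareform recovers the full matrix when it is needed:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

r = np.array([1, 4, 9, 2])
c = r[:, None]                     # pdist needs a 2D (n_samples, n_features) array

condensed = pdist(c, 'cityblock')  # length n*(n-1)/2, pairs ordered (0,1), (0,2), ...
full = squareform(condensed)       # symmetric (n, n) matrix with a zero diagonal

# The expanded result matches the broadcasting approach.
assert np.array_equal(full, np.abs(r - r[:, None]))
```

If downstream code can work with the condensed form directly, skipping squareform avoids allocating the full matrix.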

Timing with IPython:

In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop

In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop

In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop

I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (using the np.abs solution as a common baseline to account for the different machines).
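For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The function name time_options and its parameters are assumptions made for illustration, and it covers only the NumPy and SciPy options. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
from scipy.spatial.distance import pdist

def time_options(n, repeat=3, number=10):
    """Return the best per-call time in seconds for each approach on n random integers."""
    rng = np.random.default_rng(0)
    r = rng.integers(1, 1000, size=n)
    c = r[:, None]  # 2D column vector required by pdist
    candidates = {
        "broadcast": lambda: np.abs(r - r[:, None]),
        "pdist": lambda: pdist(c, "cityblock"),
    }
    return {
        name: min(timeit.repeat(fn, repeat=repeat, number=number)) / number
        for name, fn in candidates.items()
    }

if __name__ == "__main__":
    for name, seconds in time_options(1000).items():
        print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Increasing n toward 10,000 shows how each approach scales, but keep in mind that the broadcast version then allocates a matrix of 10^8 entries.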