Optimized method for computing cosine distance in Python

I wrote a method to calculate the cosine distance between two arrays:

```python
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
```

Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?

Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:

```python
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")  # Steve
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):       # Mike's optimizations:
        ai = a[i]                 # only look up once
        bi = b[i]
        numerator += ai*bi        # faster than exponent (barely)
        denoma += ai*ai           # strip abs() since squaring makes it redundant
        denomb += bi*bi
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
```

Solution

If you can use SciPy, you can use cosine from spatial.distance:
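For instance, a minimal sketch of the SciPy route (`scipy.spatial.distance.cosine` computes exactly this quantity, 1 minus the cosine similarity):

```python
from scipy.spatial.distance import cosine

a = [1.0, 0.0, 1.0]
b = [0.0, 1.0, 1.0]

# dot(a, b) = 1, |a| = |b| = sqrt(2), so the distance is 1 - 1/2 = 0.5
d = cosine(a, b)
print(d)
```

The heavy lifting happens in compiled code, so for large vectors this should outperform any pure-Python loop.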

If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).

```python
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")
    numerator = sum(x * y for x, y in zip(a, b))
    denoma = sum(x ** 2 for x in a)
    denomb = sum(y ** 2 for y in b)
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))
```

It is better to raise an exception when the lengths of a and b are mismatched.

By using generator expressions inside calls to sum(), most of the work is done by the C code inside Python. This should be faster than an explicit for loop.

I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.

If you are doing bioinformatics in Python, you really should be using SciPy anyway.

EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.
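That measuring can be done with the standard timeit module. A sketch of such a comparison between the explicit loop and the generator-expression version (the list length and repeat count here are arbitrary choices, not the ones used in the original timing):

```python
import timeit
from math import sqrt

def cosine_loop(a, b):
    # explicit-loop version, as in the question
    num = denoma = denomb = 0.0
    for ai, bi in zip(a, b):
        num += ai * bi
        denoma += ai * ai
        denomb += bi * bi
    return 1 - num / (sqrt(denoma) * sqrt(denomb))

def cosine_genexp(a, b):
    # generator-expression version, as in the answer
    num = sum(x * y for x, y in zip(a, b))
    denoma = sum(x * x for x in a)
    denomb = sum(y * y for y in b)
    return 1 - num / (sqrt(denoma) * sqrt(denomb))

a = [float(i) for i in range(1000)]
b = list(reversed(a))

t_loop = timeit.timeit(lambda: cosine_loop(a, b), number=1000)
t_gen = timeit.timeit(lambda: cosine_genexp(a, b), number=1000)
print(f"loop: {t_loop:.3f}s  genexp: {t_gen:.3f}s")
```

Which one wins can vary by Python version, input length, and machine, which is exactly why measuring beats guessing.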

I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.

I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.

EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)
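If full SciPy is overkill but NumPy is available, a vectorized version is another option. This is a sketch under that assumption, not code from the original answer:

```python
import numpy as np

def cosine_distance_np(a, b):
    # convert once; vectorized dot product and norms run in compiled code
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must be same length")
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

For large arrays this avoids the per-element Python interpreter overhead entirely.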
