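The following method is about 30 times faster than scipy.spatial.distance.pdist. It runs quite fast on large matrices (assuming you have enough memory).
See the discussion below for how to optimize for sparsity.

import numpy

# base similarity matrix (all dot products)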
# for a scipy.sparse A, replace this with A.dot(A.T).toarray()
# (todense() returns a numpy.matrix, for which * below would be a matrix product)
similarity = numpy.dot(A, A.T)
# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)
# inverse squared magnitude
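# (silence the divide-by-zero warning; the resulting inf is handled next)
with numpy.errstate(divide='ignore'):
    inv_square_mag = 1 / square_mag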
# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0
# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)
# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
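
For anyone who wants to sanity-check the steps above, here is a minimal sketch that wraps them in a function and runs it on a tiny dense matrix. The function name cosine_similarity_matrix and the sample data are made up for illustration, and the row/column scaling is written with explicit broadcasting instead of the transpose trick, which should give the same result because the matrix is symmetric.

```python
import numpy

def cosine_similarity_matrix(A):
    """Pairwise cosine similarity between the rows of a dense 2-D array.

    Hypothetical helper: it only packages the steps shown above.
    """
    A = numpy.asarray(A, dtype=float)
    # all pairwise dot products between rows
    similarity = numpy.dot(A, A.T)
    # the squared norm of each row sits on the diagonal
    square_mag = numpy.diag(similarity)
    # inverse norms; all-zero rows get 0 instead of inf
    inv_mag = numpy.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / numpy.sqrt(square_mag[nonzero])
    # scale every row and every column by the inverse norms (broadcasting)
    return similarity * inv_mag[:, numpy.newaxis] * inv_mag[numpy.newaxis, :]

# made-up sample: rows 0 and 1 are parallel, row 2 is orthogonal to them,
# and row 3 is all zeros
A = numpy.array([[1.0, 0.0],
                 [2.0, 0.0],
                 [0.0, 3.0],
                 [0.0, 0.0]])
cosine = cosine_similarity_matrix(A)
```

Note that pdist with metric='cosine' returns cosine distances in condensed form, so the comparable quantity is 1 - cosine rather than cosine itself.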