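Problem: how to find the top N most similar vectors in a massive collection of high-dimensional vectors
Principle:
If two points are infinitely close to each other, no plane can separate them. The points can therefore be partitioned by cutting the space with hyperplanes: points that lie close together will most likely end up on the same side of each cut.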
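The splitting idea above can be sketched in a few lines. This is a simplified illustration, not Annoy's actual implementation: it assumes Euclidean geometry, and the function name `split_by_hyperplane` is hypothetical. It picks two random points and separates the data with the hyperplane that is equidistant from both.

```python
import random

def split_by_hyperplane(points, rng):
    """Split points by the hyperplane equidistant from two randomly chosen points."""
    a, b = rng.sample(points, 2)
    # The hyperplane's normal points from a to b and it passes through their midpoint.
    normal = [bi - ai for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left, right = [], []
    for p in points:
        # The sign of the projection tells which side of the hyperplane p is on.
        side = sum(n * x for n, x in zip(normal, p)) - offset
        (right if side > 0 else left).append(p)
    return left, right

rng = random.Random(0)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
left, right = split_by_hyperplane(points, rng)
print(len(left), len(right))
```

Applying the same split recursively to each side until the groups are small gives a tree. Building several such trees with different random cuts reduces the chance that two close points are separated in every one of them.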
Detailed explanation of the annoy algorithm: https://www.cnblogs.com/futurehau/p/6524396.html
GitHub project: https://github.com/spotify/annoy
Python demo code:
from annoy import AnnoyIndex
import random
import matplotlib.pyplot as plt

f = 2  # vector dimensionality
# The metric is spelled out here; 'angular' was the default when it was omitted.
t = AnnoyIndex(f, 'angular')
tmp = []  # every vector, kept so it can be plotted later
x = []
y = []
for i in range(500):
    v = [random.gauss(0, 1) for _ in range(f)]
    tmp.append(v)
    x.append(v[0])
    y.append(v[1])
    t.add_item(i, v)  # add the vector to the index
t.build(100)  # 100 trees
t.save('test.ann')
# ...
u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
nearest = u.get_nns_by_item(1, 40)  # the 40 nearest neighbors of item 1
target = tmp[1]
nearx = []
neary = []
nearest.pop(0)  # drop the first result, which is normally the query item itself
for i in nearest:
    near = tmp[i]
    # print(u.get_distance(1, i))
    print(u.get_item_vector(i))
    nearx.append(near[0])
    neary.append(near[1])

# Green: all points, red: the target, blue: its nearest neighbors.
plt.scatter(x, y, marker='x', color='g', label='all points', s=30)
plt.scatter(target[0], target[1], marker='*', color='r', label='target', s=30)
plt.scatter(nearx, neary, marker='+', color='b', label='nearest neighbors', s=30)
plt.title('Scatter')
plt.legend(loc='upper right')
plt.show()
Result:
In the plot, the red point is the target and the blue points are the ones most similar to it.
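Because Annoy returns approximate neighbors, one way to sanity-check a result like this is to compare it against an exact brute-force ranking. The sketch below assumes the angular metric, which Annoy documents as sqrt(2(1 - cos(u, v))). The helper names `angular_distance` and `exact_nearest` are hypothetical, and the vectors are regenerated here rather than taken from the demo.

```python
import math
import random

def angular_distance(u, v):
    """Angular distance as documented by Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.sqrt(2.0 * (1.0 - cos))

def exact_nearest(vectors, query_index, n):
    """Return the indices of the n vectors closest to vectors[query_index]."""
    query = vectors[query_index]
    order = sorted(range(len(vectors)),
                   key=lambda i: angular_distance(query, vectors[i]))
    return order[:n]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
exact = exact_nearest(vectors, 1, 40)
print(exact)
```

Running `exact_nearest(tmp, 1, 40)` on the demo's own `tmp` list and measuring its overlap with the indices from `get_nns_by_item(1, 40)` gives a rough recall figure. Brute force costs time linear in the number of vectors per query, which is exactly what the index is meant to avoid on large collections.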