SPTAG Installation and Testing

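SPTAG is an open-source project for storing vector data and querying it quickly. In my case it is useful for the following:
Storing large numbers of outputs from the last layer of deep learning models, i.e. text and image embedding vectors, and using them to compute text or image similarity. It currently supports L2 distance and cosine similarity.
For a detailed introduction, see: https://cloud.tencent.com/developer/article/1429751
GitHub repository: https://github.com/microsoft/SPTAG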
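To make the two supported measures concrete, here is a minimal NumPy sketch of L2 distance and cosine similarity. The function names are my own and are not part of SPTAG; the values SPTAG itself reports may be defined or scaled differently, so treat this only as a definition of the two measures.

```python
import numpy as np


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 0.0, 0.0], dtype=np.float32)
b = np.array([0.0, 1.0, 0.0], dtype=np.float32)
c = np.array([2.0, 0.0, 0.0], dtype=np.float32)

print(l2_distance(a, c))        # same direction, different length
print(cosine_similarity(a, c))  # depends on direction only, length is ignored
print(cosine_similarity(a, b))  # orthogonal vectors
```

L2 distance is sensitive to vector length, while cosine similarity only looks at direction, which is why the choice matters for embeddings that are not normalized.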
Download the dependencies listed in the official requirements.
The packages I used are as follows:
cmake >= 3.12.0: cmake-3.14.4.tar.gz
Installation reference: https://mp.csdn.net/mdeditor/90382970#
swig >= 3.0:
1. Install with the package manager: sudo apt-get install swig
During installation you may see: E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).
My fix: sudo apt-get -f install
2. Install from source, see: https://blog.csdn.net/zhangkzz/article/details/88555830
I used the package-manager install.
Check the version:
chl@chl:~$ swig -version
SWIG Version 3.0.8
Compiled with g++ [x86_64-pc-linux-gnu]
Configured options: +pcre
boost >= 1.67.0: boost_1_67_0.tar.gz
Installation steps:
Extract the archive and enter the resulting directory.
Run the following commands:
sudo ./bootstrap.sh
sudo ./b2
sudo ./b2 install
Installing version 1.70.0 produced errors for me, so I uninstalled it and installed 1.67.0 instead.
tbb >= 4.2: tbb-2019_U6.tar.gz
Installation reference: https://www.jianshu.com/p/57b67477ff53
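Since every dependency has a minimum version, a small helper for comparing version strings can save a failed build. This is a sketch with helper names of my own choosing; it only parses text such as the `swig -version` output shown above and does not call any tool itself.

```python
import re


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, required):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(required))
    found = found + (0,) * (width - len(found))
    required = required + (0,) * (width - len(required))
    return found >= required


# swig >= 3.0, using the version line printed by `swig -version`
print(meets_minimum(parse_version("SWIG Version 3.0.8"), (3, 0)))
# cmake >= 3.12.0, using the version of the tarball listed above
print(meets_minimum(parse_version("cmake version 3.14.4"), (3, 12, 0)))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "3.9" as newer than "3.12".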

Building SPTAG
After downloading the SPTAG project, enter its directory.
Run:
mkdir build
cd build && cmake .. && make -j32
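If you run into problems, they are mostly caused by incompatible dependency versions; uninstall the offending package and install the correct version.
The build targets Python 2 by default. To switch to Python 3, see the issues section of the project; I did not change it.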
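A quick way to confirm that the build produced a usable Python module is to try importing it. The sketch below assumes the module is named SPTAG and lives in a Release directory, as the test script in this post does with 'SPTAG/Release'; the helper name is my own, and the path should be adjusted to your checkout.

```python
import importlib
import sys


def try_import_sptag(release_dir):
    """Return the SPTAG module if it can be imported from release_dir, else None."""
    if release_dir not in sys.path:
        sys.path.append(release_dir)
    try:
        return importlib.import_module("SPTAG")
    except ImportError:
        return None


module = try_import_sptag("SPTAG/Release")
if module is None:
    print("SPTAG could not be imported; check the build output and the Python version")
else:
    print("SPTAG imported successfully")
```

An import failure here often points to a mismatch between the Python version used for the build and the interpreter running the script.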
Testing SPTAG
Here is the official sample code with my annotations added.
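The script passes vectors to SPTAG as raw bytes via x.tobytes() together with the type name 'Float'. Assuming 'Float' means 32-bit floats (my reading, not something this post confirms), the array must be created with dtype=np.float32. The sketch below uses only NumPy to show what that byte buffer looks like.

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)  # 2 vectors of dimension 3
raw = x.tobytes()

# 4 bytes per float32 value, stored row by row
print(len(raw))

# Reading the buffer back with the same dtype recovers the original array
restored = np.frombuffer(raw, dtype=np.float32).reshape(x.shape)
print(np.array_equal(x, restored))

# NumPy's default float64 would produce twice as many bytes for the same data
print(len(x.astype(np.float64).tobytes()))
```

If the dtype and the declared value type disagree, the receiving side would interpret the bytes as different numbers, so it is worth checking the dtype before building an index.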

# encoding: utf-8
import sys
sys.path.append('SPTAG/Release')
import SPTAG
import numpy as np
import time
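nr = 1024*100  # number of vectors in the index (102400)
nc = 512  # dimension of each vector
k = 3  # number of nearest neighbours returned per query
r = 2000  # number of query vectors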


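def testBuild(algo, distmethod, x, out):  # e.g. 'BKT', 'L2', x of shape (nr, nc); out ('testindices') is the folder the index is saved to
    i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])  # args: index algorithm, value type, vector dimension -> creates an empty index
    i.SetBuildParam("NumberOfThreads", '4')  # number of build threads
    i.SetBuildParam("DistCalcMethod", distmethod)  # distance measure
    ret = i.Build(x.tobytes(), x.shape[0])  # pass the vectors as raw bytes plus the vector count; builds the index with the chosen algorithm
    i.Save(out)  # writes the index files (vectors.bin, tree.bin, graph.bin) under out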


def testBuildWithMetaData(algo, distmethod, x, s, out):
    i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
    i.SetBuildParam("NumberOfThreads", '4')
    i.SetBuildParam("DistCalcMethod", distmethod)
    if i.BuildWithMetaData(x.tobytes(), s, x.shape[0]):  # build the index and attach the metadata s to the vectors in x
        i.Save(out)


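def testSearch(index, q, k):  # index: folder of the saved index ('testindices'); q: query vectors of shape (r, nc); k: number of nearest neighbours to return
    j = SPTAG.AnnIndex.Load(index)  # load the saved index
    for t in range(q.shape[0]):
        result = j.Search(q[t].tobytes(), k)  # search with one query vector at a time, passed as raw bytes
        print (result[0])  # ids: positions of the k nearest vectors in the index
        print (result[1])  # distances between q[t] and each returned vector (smaller means more similar)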


def testSearchWithMetaData(index, q, k):
    j = SPTAG.AnnIndex.Load(index)
    j.SetSearchParam("MaxCheck", '1024')  # set a search parameter
    for t in range(q.shape[0]):
        result = j.SearchWithMetaData(q[t].tobytes(), k)
        print (result[0])  # ids
        print (result[1])  # distances
        print (result[2])  # metadata of the returned vectors


def testAdd(index, x, out, algo, distmethod):  # index: existing index folder or None; x: vectors to add; out: folder to save to; algo: e.g. 'BKT'; distmethod: distance measure
    if index is not None:  # load the index if one already exists
        i = SPTAG.AnnIndex.Load(index)
    else:  # otherwise create a new one
        i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
    i.SetBuildParam("NumberOfThreads", '4')
    i.SetBuildParam("DistCalcMethod", distmethod)
    if i.Add(x.tobytes(), x.shape[0]):
        i.Save(out)


def testAddWithMetaData(index, x, s, out, algo, distmethod):
    if index is not None:  # load the index if one already exists
        i = SPTAG.AnnIndex.Load(index)
    else:  # otherwise create a new one
        i = SPTAG.AnnIndex(algo, 'Float', x.shape[1])
    i.SetBuildParam("NumberOfThreads", '4')
    i.SetBuildParam("DistCalcMethod", distmethod)
    if i.AddWithMetaData(x.tobytes(), s, x.shape[0]):  # s: metadata string with one line per vector; save only if the add succeeds
        i.Save(out)


def testDelete(index, x, out):
    i = SPTAG.AnnIndex.Load(index)  # load the saved index
    ret = i.Delete(x.tobytes(), x.shape[0])  # delete the vectors in x from the index; returns True on success, otherwise False
    print (ret)  # e.g. True
    i.Save(out)


def Test(algo, distmethod):
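    x = np.ones((nr, nc), dtype=np.float32) * np.reshape(np.arange(nr, dtype=np.float32), (nr, 1))  # index data, shape (nr, nc); row i is filled with the value i
    q = np.ones((r, nc), dtype=np.float32) * np.reshape(np.arange(r, dtype=np.float32), (r, 1)) * 2  # queries, shape (r, nc); row t is filled with the value 2*t
    m = ''  # metadata: one line per vector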
    for i in range(nr):
        m += str(i) + '\n'

    print ("Build.............................")
    start = time.time()
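    # testBuild(algo, distmethod, x, 'testindices')  # build the index; uncomment on the first run
    print ('Build time: {0} s'.format(time.time()-start))
    start = time.time()  # restart the timer so the query time excludes the build
    testSearch('testindices', q, k)  # search
    print ('Average query time: {0} s'.format((time.time() - start)/float(q.shape[0])))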
    # print ("Add.............................")
    # testAdd('testindices', x, 'testindices', algo, distmethod)  # add new vectors to the index
    # testSearch('testindices', q, k)  # search again
    # print ("Delete.............................")
    # testDelete('testindices', q, 'testindices')  # delete vectors
    # testSearch('testindices', q, k)  # search again
    #
    # print ("AddWithMetaData.............................")  # add vectors together with metadata
    # testAddWithMetaData(None, x, m, 'testindices', algo, distmethod)
    # print ("Delete.............................")
    # testSearchWithMetaData('testindices', q, k)  # search with metadata
    # testDelete('testindices', q, 'testindices')  # delete from the index that carries metadata
    # testSearchWithMetaData('testindices', q, k)


if __name__ == '__main__':
    Test('BKT', 'L2')
    # Test('KDT', 'L2')
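Because the test data is so regular (row i of x is filled with i, row t of q with 2*t), the expected neighbours are easy to compute by brute force, which gives a baseline for checking the ids SPTAG returns. The sketch below uses only NumPy on a scaled-down copy of the data; the function name is my own. SPTAG is an approximate method and this post does not establish how it scales its distance values, so compare ids rather than distances.

```python
import numpy as np


def brute_force_knn(x, query, k):
    """Return ids and squared L2 distances of the k rows of x closest to query."""
    dist = np.sum((x - query) ** 2, axis=1)
    ids = np.argsort(dist, kind="mergesort")[:k]  # stable sort keeps ties in id order
    return ids, dist[ids]


# Scaled-down version of the test data used in the script above
n, dim, n_query = 20, 4, 3
x = np.ones((n, dim), dtype=np.float32) * np.arange(n, dtype=np.float32).reshape(n, 1)
q = np.ones((n_query, dim), dtype=np.float32) * np.arange(n_query, dtype=np.float32).reshape(n_query, 1) * 2

for t in range(n_query):
    ids, dist = brute_force_knn(x, q[t], 3)
    print(ids)   # the closest row to q[t] is row 2*t
    print(dist)
```

With this data the nearest neighbour of query t is always row 2*t at distance zero, followed by its two adjacent rows at equal distance.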

Timing results:

Build.............................
Setting NumberOfThreads with value 4
Setting DistCalcMethod with value L2
Start to build BKTree 1
1 BKTree built, 102401 102400
build RNG graph!
Parallel TpTree Partition begin 
Finish Getting Leaves for Tree 0
Finish Getting Leaves for Tree 1
Finish Getting Leaves for Tree 2
Finish Getting Leaves for Tree 3
Finish Getting Leaves for Tree 4
Finish Getting Leaves for Tree 5
Finish Getting Leaves for Tree 6
Finish Getting Leaves for Tree 7
Finish Getting Leaves for Tree 8
Finish Getting Leaves for Tree 9
Finish Getting Leaves for Tree 10
Finish Getting Leaves for Tree 11
Finish Getting Leaves for Tree 12
Finish Getting Leaves for Tree 13
Finish Getting Leaves for Tree 14
Finish Getting Leaves for Tree 15
Finish Getting Leaves for Tree 16
Finish Getting Leaves for Tree 17
Finish Getting Leaves for Tree 18
Finish Getting Leaves for Tree 19
Finish Getting Leaves for Tree 20
Finish Getting Leaves for Tree 21
Finish Getting Leaves for Tree 22
Finish Getting Leaves for Tree 23
Finish Getting Leaves for Tree 24
Finish Getting Leaves for Tree 25
Finish Getting Leaves for Tree 26
Finish Getting Leaves for Tree 27
Finish Getting Leaves for Tree 28
Finish Getting Leaves for Tree 29
Finish Getting Leaves for Tree 30
Finish Getting Leaves for Tree 31
Parallel TpTree Partition done
Processing Tree 0 96%
Processing Tree 1 95%
Processing Tree 2 93%
Processing Tree 3 93%
Processing Tree 4 95%
Processing Tree 5 93%
Processing Tree 6 96%
Processing Tree 7 93%
Processing Tree 8 93%
Processing Tree 9 95%
Processing Tree 10 95%
Processing Tree 11 95%
Processing Tree 12 93%
Processing Tree 13 95%
Processing Tree 14 95%
Processing Tree 15 93%
Processing Tree 16 98%
Processing Tree 17 95%
Processing Tree 18 98%
Processing Tree 19 95%
Processing Tree 20 93%
Processing Tree 21 93%
Processing Tree 22 95%
Processing Tree 23 93%
Processing Tree 24 96%
Processing Tree 25 98%
Processing Tree 26 96%
Processing Tree 27 96%
Processing Tree 28 95%
Processing Tree 29 98%
Processing Tree 30 95%
Processing Tree 31 95%
Refine 1 99%Refine RNG, graph acc:1
Refine 2 99%Refine RNG, graph acc:1
Save Data To testindices/vectors.bin
Save Data (102400, 512) Finish!
Save BKT to testindices/tree.bin
Save BKT (1,102401) Finish!
Save Graph To testindices/graph.bin
Save Graph (102400,32) Finish!
Build time: 1592.27814603 s
Average query time: 0.000952189445496 s
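To put the two figures in more familiar units, the sketch below converts them to minutes and queries per second, and includes a small helper (my own, not part of SPTAG) for measuring average latency of any per-item call. Note that the script times the whole testSearch call, so the reported query figure also includes loading the index and printing the results.

```python
import time


def average_latency(fn, inputs):
    """Average wall-clock seconds per call of fn over inputs."""
    start = time.time()
    for item in inputs:
        fn(item)
    return (time.time() - start) / float(len(inputs))


build_seconds = 1592.27814603      # build time reported above
query_seconds = 0.000952189445496  # average query time reported above

print(build_seconds / 60.0)  # build time in minutes
print(1.0 / query_seconds)   # queries per second at that latency

# Example use of the helper with a stand-in workload (hypothetical, not SPTAG)
print(average_latency(sum, [[1, 2, 3]] * 1000))
```

To time only the search itself, the helper could wrap a single search call on an index that has already been loaded.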