[Linux C/C++ Development] Developing with the annoy Vector Search Library

Latest recommended article published on 2025-04-30 19:58:04

deepallin

Article link: https://blog.csdn.net/liangyuna8787/article/details/147578859

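Preface

AI applications often need to find similar, approximate, or neighbouring items. A traditional key-value database cannot answer this kind of "≈" query, because it only supports exact key lookups. This article covers annoy, a library that is most often used from Python. Its core is written in C++, and the C++ interface lets you pick a single-threaded or multi-threaded build policy directly, so the walkthrough below uses C++ throughout.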
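
To make the "≈" query concrete, here is a minimal brute-force baseline that scans every stored vector and returns the closest one. It is not part of annoy, and the function names are made up for illustration; it only shows the exact computation that an approximate index such as annoy tries to avoid doing in full.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two vectors of equal length.
float squared_l2(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Exact nearest neighbour by linear scan: O(n * d) work per query.
// Returns the position of the closest vector, or -1 if `items` is empty.
int brute_force_nearest(const std::vector<std::vector<float>>& items,
                        const std::vector<float>& query) {
    int best = -1;
    float best_dist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < items.size(); ++i) {
        const float d = squared_l2(items[i], query);
        if (d < best_dist) {
            best_dist = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A linear scan like this is fine for a few thousand vectors; the cost grows with every stored item, which is the motivation for a tree-based index.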

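annoy features

Efficient approximate nearest-neighbour search: annoy partitions the data space with a forest of random projection trees. Descending one tree takes roughly O(log n) time, which suits real-time search over large high-dimensional datasets (for example, millions of vectors);
Low memory use and good scalability: the index is a file on disk that annoy maps into memory with mmap, so pages are loaded on demand and several processes can share one index. An index larger than physical memory can still be opened, although queries slow down once pages have to be read from disk;
Flexible distance metrics: Euclidean, Angular (cosine-based), Manhattan, Hamming, and Dot Product distances are built in;
Dimensionality: annoy works best on low-dimensional data (up to about 100 dimensions) and still performs well at medium dimensionality (up to about 1000). It is much faster than brute-force search, at the cost of some accuracy.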

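Application scenarios

Recommender systems: similarity-based recommendation of music, video, and other content, by quickly matching user or item feature vectors;
Image and text retrieval: large-scale similar-content search over image feature vectors (for example, features extracted by ResNet) or text embeddings (for example, representations produced by BERT);
Real-time scenarios: latency-sensitive uses such as ad serving and real-time monitoring, where approximate search returns results quickly.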

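Limitations

No incremental updates: once built, the index cannot be modified, so adding data means rebuilding the whole index (a scheduled hourly or daily rebuild into a new index file is a practical workaround);
Accuracy drops at high dimensionality: beyond about 1000 dimensions, search accuracy can fall noticeably. More trees improve accuracy at the cost of memory, so with enough memory you can raise the tree count, but at some point a different library may be the better choice.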
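
The scheduled-rebuild workaround can be sketched as "write the new index to a temporary file, then rename it over the old one". The sketch below is an assumption about how one might do this on Linux, not something annoy provides: `rebuild_and_swap` and `write_placeholder_index` are hypothetical names, and the placeholder writer stands in for the real "build the index and call save(path)" step.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>

// Stand-in for "build the annoy index and save it to `path`".
// Here it only writes a marker string so the sketch stays self-contained.
bool write_placeholder_index(const std::string& path) {
    std::ofstream out(path);
    out << "placeholder-index";
    out.close();
    return !out.fail();
}

// Writes a fresh index to "<target>.tmp" via `write_index`, then renames it
// over `target`. On POSIX systems rename() replaces the destination
// atomically, so readers see either the old file or the new one.
bool rebuild_and_swap(const std::string& target,
                      const std::function<bool(const std::string&)>& write_index) {
    assert(!target.empty());
    const std::string tmp = target + ".tmp";
    if (!write_index(tmp)) {
        std::remove(tmp.c_str());  // discard a partial file, keep the old index
        return false;
    }
    return std::rename(tmp.c_str(), target.c_str()) == 0;
}
```

A process that has already loaded the old file keeps using it until it loads the index again, so a query service would also need some reload trigger of its own.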

The rest of this article uses face features as a worked example of annoy in C++.

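Face feature lookup: annoy in C++

Source and build

Repository: https://github.com/spotify/annoy

Build steps:

1. After extracting the source, create a build directory inside the annoy directory;

2. Enter the build directory and run cmake ..

3. Run make (use sudo make only if the build directory is not writable by your user). The build directory will then contain an include directory; copy it into your own project.

Side note: the C++ library is header-only, and the src directory of the source tree already contains the 3 header files that end up in include, so you can skip the build and copy them directly. The resources attached to this article include these 3 files.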

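Use case

Phone numbers and face features have been obtained from another system, and the backend needs to offer an efficient face feature search.

Design

Phone numbers and face feature vectors are stored in a plain-text file, data.txt (which makes it easy to rebuild the index periodically);
annoy builds the index file face_index.tree from data.txt (together with a file that maps index ids to phone numbers);
A query interface takes a face feature vector and returns the phone number of the most similar face.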
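
The article does not show data.txt itself. Judging from the loader code, each line is assumed to be a phone number followed by whitespace-separated feature values. Under that assumption, a stricter per-line parser could look like the sketch below; `FaceRecord` and `parse_record` are hypothetical names, and the sample phone number used in tests is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct FaceRecord {
    std::string phone;
    std::vector<float> features;
};

// Parses one line of the assumed form "<phone> <f1> <f2> ... <fN>".
// Returns false (leaving `out` untouched) if the line is empty, contains a
// non-numeric feature, or does not have exactly `dimension` features.
bool parse_record(const std::string& line, int dimension, FaceRecord& out) {
    assert(dimension > 0);
    std::istringstream iss(line);
    FaceRecord rec;
    if (!(iss >> rec.phone)) return false;
    float value = 0.0f;
    while (iss >> value) rec.features.push_back(value);
    if (!iss.eof()) return false;  // stopped on a token that is not a number
    if (rec.features.size() != static_cast<std::size_t>(dimension)) return false;
    out = rec;
    return true;
}
```

Rejecting short or malformed lines up front keeps vectors of the wrong length out of the index.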

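Implementation

main() takes a mode argument on the command line and supports both building the index and querying by face feature vector:

save: loads data.txt and uses annoy to generate the index file face_index.tree and the mapping file phone_map.txt;
query "<feature values>": loads face_index.tree, asks annoy for the id of the nearest item, looks that id up in phone_map.txt, and returns the corresponding phone number.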

//main.cpp
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include "./include/annoylib.h"
#include "./include/kissrandom.h"

using namespace std;
using namespace Annoy;

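const int DIMENSION = 7;  // face feature dimension
const string TREE_FILE = "face_index.tree";  // index file path
const string PHONE_MAP_FILE = "phone_map.txt";  // id -> phone number mapping file path
const string SRC_DATA_FILE = "data.txt";  // source data file

// Choosing between Euclidean and Angular for AnnoyIndex:
// Euclidean: Euclidean distance on the raw coordinates, so vector magnitude matters. Typical uses: image pixel features, physical coordinates (e.g. GPS points), anything that compares absolute distances.
// Angular: sqrt(2 * (1 - cosine_similarity)), i.e. the Euclidean distance between the normalized vectors. Typical uses: TF-IDF text vectors, user-behaviour embeddings, anything where direction matters more than magnitude.
// AnnoyIndexSingleThreadedBuildPolicy builds on one thread; AnnoyIndexMultiThreadedBuildPolicy builds on several
// (the multithreaded policy requires compiling with -DANNOYLIB_MULTITHREADED_BUILD).
// Load the data into the index and return the id -> phone number mapping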
unordered_map<int, string> load_data(
    AnnoyIndex<int, float, Euclidean, Kiss32Random, AnnoyIndexSingleThreadedBuildPolicy>& index
    ) {
    unordered_map<int, string> phone_map;
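    ifstream file(SRC_DATA_FILE);
    if (!file) {
        cerr << "Error: cannot open " << SRC_DATA_FILE << endl;
        return phone_map;
    }
    string line;
    int current_id = 0;

    while (getline(file, line)) {
        istringstream iss(line);
        string phone;
        vector<float> features(DIMENSION);

        iss >> phone;
        for (int i = 0; i < DIMENSION; ++i) {
            iss >> features[i];
        }
        if (iss.fail()) {
            // Blank or malformed line: skip it rather than index a bogus vector
            continue;
        }

        index.add_item(current_id, features.data());
        phone_map[current_id] = phone;
        current_id++;
    }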

    return phone_map;
}

// Save the id -> phone number mapping to a file
void save_phone_map(const unordered_map<int, string>& phone_map) {
    ofstream file(PHONE_MAP_FILE);
    for (const auto& entry : phone_map) {  // range-based for over (id, phone) pairs
        file << entry.first << " " << entry.second << "\n";
    }
}

// Load the id -> phone number mapping from the file
unordered_map<int, string> load_phone_map() {
    unordered_map<int, string> phone_map;
    ifstream file(PHONE_MAP_FILE);
    int id;
    string phone;

    while (file >> id >> phone) {
        phone_map[id] = phone;
    }

    return phone_map;
}

// Build and save the index
void build_and_save_index() {
    AnnoyIndex<int, float, Euclidean, Kiss32Random, AnnoyIndexSingleThreadedBuildPolicy> index(DIMENSION);
    auto phone_map = load_data(index);

    if (phone_map.empty()) {
        cerr << "Error: no data loaded" << endl;
        return;
    }

    // n_trees passed to build(): more trees give higher accuracy but a larger index and more memory.
    // Rough starting points:
    //   small dataset (< 100k items): 10-20 trees
    //   medium dataset (100k-1M items): 20-50 trees
    //   large dataset (> 1M items): 50-100 trees
    //   dimension <= 50: 10-30 trees
    //   dimension 50-200: 30-50 trees
    //   dimension > 200: 50-100 trees
    if (!index.build(10)) {
        cerr << "Failed to build the index" << endl;
        return;
    }

    if (!index.save(TREE_FILE.c_str())) {
        cerr << "Failed to save the index file" << endl;
    } else {
        cout << "Index file " << TREE_FILE << " saved" << endl;
        save_phone_map(phone_map);
        cout << "Phone mapping file " << PHONE_MAP_FILE << " saved" << endl;
    }
}

// Query the index
vector<string> query_index(const vector<float>& query_vec, int top_k = 3) {
    AnnoyIndex<int, float, Euclidean, Kiss32Random, AnnoyIndexSingleThreadedBuildPolicy> index(DIMENSION);
    if (!index.load(TREE_FILE.c_str())) {
        cerr << "Failed to load the index file" << endl;
        return {};
    }

    auto phone_map = load_phone_map();
    if (phone_map.empty()) {
        cerr << "Failed to load the phone mapping" << endl;
        return {};
    }

    vector<int> result_ids;
    vector<float> distances;
    // search_k = -1 selects the default, top_k * n_trees; larger values give more accurate results but slower queries
    index.get_nns_by_vector(query_vec.data(), top_k, -1, &result_ids, &distances);

    vector<string> result_phones;
    for (int id : result_ids) {
        if (phone_map.find(id) != phone_map.end()) {
            result_phones.push_back(phone_map.at(id));
        } else {
            cerr << "Warning: no phone number found for id " << id << endl;
        }
    }

    return result_phones;
}

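// Parse a comma-separated query string into a feature vector
vector<float> parse_query(const string& query_str) {
    vector<float> query;
    stringstream ss(query_str);
    string item;

    while (getline(ss, item, ',')) {
        try {
            query.push_back(stof(item));
        } catch (...) {
            cerr << "Invalid query value: " << item << endl;
        }
    }

    // A skipped invalid value also lands here, because the vector ends up too short
    if (query.size() != static_cast<size_t>(DIMENSION)) {
        cerr << "Error: the query vector must have " << DIMENSION << " dimensions" << endl;
        return {};
    }
    return query;
}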

int main(int argc, char* argv[]) {
    if (argc < 2) {
        cerr << "Usage: \n"
             << "  build the index: " << argv[0] << " save\n"
             << "  run a query:     " << argv[0] << " query \"0.1,0.2,0.3,0.4,0.5,0.6,0.7\"\n";
        return 1;
    }

    string mode(argv[1]);
    if (mode == "save") {
        build_and_save_index();
    } else if (mode == "query" && argc > 2) {
        auto query_vec = parse_query(argv[2]);
        if (!query_vec.empty()) {
            auto results = query_index(query_vec,1);
            cout << "Phone number of the most similar face: ";
            for (const auto& phone : results) {
                cout << phone << " ";
            }
            cout << endl;
        }
    } else {
        cerr << "Invalid arguments" << endl;
        return 1;
    }

    return 0;
}

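Compiling

g++ main.cpp -o face_search
# Optimized build with an explicit language standard (needed on older compilers whose default predates C++11):
# g++ -std=c++11 -O3 main.cpp -o face_search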

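Key functions explained

1. Creating the AnnoyIndex object

AnnoyIndex<int, float, Euclidean, Kiss32Random, AnnoyIndexSingleThreadedBuildPolicy> index(DIMENSION)

<int>: the key type of index items; each vector's unique id is an integer.
<float>: the element type of the vectors; each dimension is stored as a single-precision float.
Euclidean: use Euclidean (L2) distance as the similarity metric, for cases where geometric distance is what matters.
Kiss32Random: the KISS32 random number generator, used for the random splits made while the trees are built.
AnnoyIndexSingleThreadedBuildPolicy: build the index on a single thread, which avoids threading overhead and suits small datasets or constrained environments.
DIMENSION: constructor argument giving the vector dimension. The code above sets DIMENSION to 7 because the test data in data.txt has 7 feature values per line; for real face features with 128 dimensions, change it to 128.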
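
The metric chosen in the third template parameter changes what "nearest" means. The sketch below reimplements the two formulas in plain C++ for illustration only; it does not call annoy. The Angular formula follows the definition in the annoy README, sqrt(2 * (1 - cos(u, v))), and returning sqrt(2) for a zero-length vector is this sketch's own choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain Euclidean (L2) distance: sensitive to vector magnitude.
double euclidean_distance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// Angular distance as sqrt(2 * (1 - cosine)): depends only on direction.
double angular_distance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return std::sqrt(2.0);  // sketch's choice: treat a zero vector as orthogonal
    }
    const double cosine = dot / std::sqrt(norm_a * norm_b);
    return std::sqrt(std::max(0.0, 2.0 * (1.0 - cosine)));
}
```

Two vectors that point the same way but differ in length, such as (1, 0) and (2, 0), are 1 apart under Euclidean and 0 apart under Angular, which is why Angular is usually preferred for embeddings whose magnitude carries no meaning.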

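2. Adding data to the annoy index

index.add_item(current_id, features.data());

current_id: the id associated with this feature vector
features.data(): pointer to the feature values

3. Building the index trees

index.build(n_trees)

n_trees: the number of trees to build; more trees give higher accuracy but use more memory. Rough starting points:

Data	Number of trees
Small dataset (< 100k items)	10-20
Medium dataset (100k-1M items)	20-50
Large dataset (> 1M items)	50-100
Dimension <= 50	10-30
Dimension 50-200	30-50
Dimension > 200	50-100

4. Saving the index file

index.save(filepath)

filepath: path of the file to write

5. Loading the index file

index.load(filepath)

filepath: path of the index file to load

6. Querying for ids (and distances)

index.get_nns_by_vector(query_vec.data(), top_k, -1, &result_ids, &distances);

query_vec.data(): the feature values to search for
top_k: the number of nearest results to return
search_k: -1 selects the default, top_k * n_trees; larger values give more accurate results but slower queries
result_ids: output, the ids of the nearest items
distances: output, the distance between the query vector and each returned item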
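
A nearest-neighbour query always returns the closest stored items, even when none of them is actually close. For a face lookup it may therefore make sense to discard results whose distance exceeds a cut-off. The sketch below post-processes the two output vectors; `Match`, `filter_matches`, and the cut-off value are hypothetical, and a real threshold would have to be tuned on your own data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Match {
    int id;
    float distance;
};

// Keeps only the results whose distance is at most `max_distance`.
// `ids` and `distances` are the parallel vectors filled by the query call.
std::vector<Match> filter_matches(const std::vector<int>& ids,
                                  const std::vector<float>& distances,
                                  float max_distance) {
    assert(ids.size() == distances.size());
    std::vector<Match> accepted;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (distances[i] <= max_distance) {
            accepted.push_back({ids[i], distances[i]});
        }
    }
    return accepted;
}
```

An empty result then means "no sufficiently similar face", which the caller can report instead of returning an unrelated phone number.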

Multithreaded example shipped with the annoy source

The examples directory contains precision_test.cpp, a sample that uses the multithreaded build policy:

/*
 * precision_test.cpp

 *
 *  Created on: Jul 13, 2016
 *      Author: Claudio Sanhueza
 *      Contact: csanhuezalobos@gmail.com
 */

#include <iostream>
#include <iomanip>
#include "../src/kissrandom.h"
#include "../src/annoylib.h"
#include <chrono>
#include <algorithm>
#include <map>
#include <random>

using namespace Annoy;
int precision(int f=40, int n=1000000){
	std::chrono::high_resolution_clock::time_point t_start, t_end;

	std::default_random_engine generator;
	std::normal_distribution<double> distribution(0.0, 1.0);

	//******************************************************
        //Building the tree: Angular distance, Kiss32 random number generator, multithreaded build policy
	AnnoyIndex<int, double, Angular, Kiss32Random, AnnoyIndexMultiThreadedBuildPolicy> t = AnnoyIndex<int, double, Angular, Kiss32Random, AnnoyIndexMultiThreadedBuildPolicy>(f);

	std::cout << "Building index ... be patient !!" << std::endl;
	std::cout << "\"Trees that are slow to grow bear the best fruit\" (Moliere)" << std::endl;



	for(int i=0; i<n; ++i){
		double *vec = (double *) malloc( f * sizeof(double) );

		for(int z=0; z<f; ++z){
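                        vec[z] = (distribution(generator)); // fill the vector with normally distributed random values
		}

		t.add_item(i, vec);
		free(vec); // add_item copies the values, so the temporary buffer can be released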

                //std::cout << "Loading objects ...\t object: "<< i+1 << "\tProgress:"<< std::fixed << std::setprecision(2) << (double) i / (double)(n + 1) * 100 << "%\r";

	}
	std::cout << std::endl;
	std::cout << "Building index num_trees = 2 * num_features ...";
	t_start = std::chrono::high_resolution_clock::now();
        t.build(2 * f); // build the index with 2 * f trees
	t_end = std::chrono::high_resolution_clock::now();
	auto duration = std::chrono::duration_cast<std::chrono::seconds>( t_end - t_start ).count();
	std::cout << " Done in "<< duration << " secs." << std::endl;


	std::cout << "Saving index ...";
	t.save("precision.tree");
	std::cout << " Done" << std::endl;



	//******************************************************
	std::vector<int> limits = {10, 100, 1000, 10000};
	int K=10;
	int prec_n = 1000;

	std::map<int, double> prec_sum;
	std::map<int, double> time_sum;
	std::vector<int> closest;

	//init precision and timers map
	for(std::vector<int>::iterator it = limits.begin(); it!=limits.end(); ++it){
		prec_sum[(*it)] = 0.0;
		time_sum[(*it)] = 0.0;
	}

	// doing the work
	for(int i=0; i<prec_n; ++i){

		//select a random node
                int j = rand() % n; // pick a random test item

		std::cout << "finding nbs for " << j << std::endl;

		// getting the K closest
                t.get_nns_by_item(j, K, n, &closest, nullptr); // reference neighbours from a much wider search (search_k = n)

		std::vector<int> toplist;
		std::vector<int> intersection;

		for(std::vector<int>::iterator limit = limits.begin(); limit!=limits.end(); ++limit){

			t_start = std::chrono::high_resolution_clock::now();
                         // run the approximate search with the default search_k
                        t.get_nns_by_item(j, (*limit), -1, &toplist, nullptr); //search_k = -1 selects the default, "n_trees * n"
			t_end = std::chrono::high_resolution_clock::now();
			auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t_end - t_start ).count();

			//intersecting results
			std::sort(closest.begin(), closest.end(), std::less<int>());
			std::sort(toplist.begin(), toplist.end(), std::less<int>());
			intersection.resize(std::max(closest.size(), toplist.size()));
                        // compute the intersection of the two id lists
			std::vector<int>::iterator it_set = std::set_intersection(closest.begin(), closest.end(), toplist.begin(), toplist.end(), intersection.begin());
			intersection.resize(it_set-intersection.begin());

			// storing metrics
			int found = intersection.size();
			double hitrate = found / (double) K;
			prec_sum[(*limit)] += hitrate;

                        time_sum[(*limit)] += duration; // accumulate elapsed time in milliseconds


			//deallocate memory
			vector<int>().swap(intersection);
			vector<int>().swap(toplist);
		}

		//print resulting metrics
		for(std::vector<int>::iterator limit = limits.begin(); limit!=limits.end(); ++limit){
			std::cout << "limit: " << (*limit) << "\tprecision: "<< std::fixed << std::setprecision(2) << (100.0 * prec_sum[(*limit)] / (i + 1)) << "% \tavg. time: "<< std::fixed<< std::setprecision(6) << (time_sum[(*limit)] / (i + 1)) * 1e-03 << "s" << std::endl;
		}

		closest.clear(); vector<int>().swap(closest);

	}

	std::cout << "\nDone" << std::endl;
	return 0;
}


void help(){
	std::cout << "Annoy Precision C++ example" << std::endl;
	std::cout << "Usage:" << std::endl;
	std::cout << "(default)		./precision" << std::endl;
	std::cout << "(using parameters)	./precision num_features num_nodes" << std::endl;
	std::cout << std::endl;
}

void feedback(int f, int n){
	std::cout<<"Running precision example with:" << std::endl;
	std::cout<<"num. features: "<< f << std::endl;
	std::cout<<"num. nodes: "<< n << std::endl;
	std::cout << std::endl;
}


int main(int argc, char **argv) {
	int f, n;


	if(argc == 1){
		f = 40;
		n = 1000000;

		feedback(f,n);

		precision(40, 1000000);
	}
	else if(argc == 3){

		f = atoi(argv[1]);
		n = atoi(argv[2]);

		feedback(f,n);

		precision(f, n);
	}
	else {
		help();
		return EXIT_FAILURE;
	}


	return EXIT_SUCCESS;
}

Compiling

g++ -std=c++14 -O3 -DANNOYLIB_MULTITHREADED_BUILD -pthread precision_test.cpp -o precision_test
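
The hit-rate calculation at the heart of the sample can be isolated into a small function. This is a simplified sketch of the same idea (sort both id lists, intersect them, divide by the number of reference ids); `hit_rate` is a made-up name, and it assumes each list contains no duplicate ids.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Fraction of the reference ids that also appear among the candidates.
// Both vectors are taken by value because they are sorted in place.
double hit_rate(std::vector<int> reference, std::vector<int> candidates) {
    assert(!reference.empty());
    std::sort(reference.begin(), reference.end());
    std::sort(candidates.begin(), candidates.end());

    std::vector<int> common;
    std::set_intersection(reference.begin(), reference.end(),
                          candidates.begin(), candidates.end(),
                          std::back_inserter(common));

    return static_cast<double>(common.size()) / static_cast<double>(reference.size());
}
```

Using std::back_inserter avoids pre-sizing the output vector and trimming it afterwards, which is what the sample does with resize.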

Conclusion

annoy is convenient to use from Python. When a production deployment needs more performance, implementing the service in C++ is one way to get it.