Innovation Project Practical Training Research (Part 2): A Survey of Other Types of Search Indexes

1. Overview of Faiss

Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU.

Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.
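
For instance, cosine similarity can be obtained with an inner-product index on normalized vectors. A minimal sketch with random placeholder data (d, the dataset sizes and k are illustrative):

# Cosine similarity = inner product on L2-normalized vectors (illustrative sketch)
import numpy as np
import faiss

d = 64                                        # vector dimension (placeholder)
xb = np.random.random((1000, d)).astype('float32')   # database vectors (placeholder)
xq = np.random.random((5, d)).astype('float32')      # query vectors (placeholder)

faiss.normalize_L2(xb)                        # normalize in place
faiss.normalize_L2(xq)
index = faiss.IndexFlatIP(d)                  # exact inner-product (dot product) index
index.add(xb)
D, I = index.search(xq, 10)                   # D now contains cosine similarities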

Other methods, like HNSW and NSG, add an indexing structure on top of the raw vectors to make searching more efficient.

2. Overview of the Relevant Indexes

Flat index

Flat indexes simply encode the vectors into fixed-size codes and store them in an array of ntotal * code_size bytes.

At search time, all indexed vectors are decoded sequentially and compared to the query vector. For IndexPQ, the comparison is done in the compressed domain, which is faster.

Vector encodings

The available encodings are (from least to strongest compression):

  • no encoding at all (IndexFlat): the vectors are stored without compression;
  • 16-bit float encoding (IndexScalarQuantizer with QT_fp16): the vectors are compressed to 16-bit floats, which may cause some loss of precision; (the whole vector is compressed directly, without splitting)
  • 8/6/4-bit integer encoding (IndexScalarQuantizer with QT_8bit/QT_6bit/QT_4bit): vectors quantized to 256/64/16 levels;
  • PQ encoding (IndexPQ): vectors are split into sub-vectors that are each quantized to a few bits (usually 8). See the example below. (the vector is split into segments and each segment is quantized)
  • Residual encodings (IndexResidual): vectors are quantized and progressively refined by residual. At each quantization stage, the size of the codebook can be refined. (the quantized result is further refined by quantizing the residual)
# Python PQ example
import faiss                             # d, x_train, x_base, x_query and k are assumed to be defined elsewhere

m = 16                                   # number of subquantizers
n_bits = 8                               # bits allocated per subquantizer
pq = faiss.IndexPQ(d, m, n_bits)         # create the index
pq.train(x_train)                        # training
pq.add(x_base)                           # populate the index
D, I = pq.search(x_query, k)             # perform a search

The number of bits n_bits must be equal to 8, 12 or 16. The dimension d should be a multiple of m.
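
To complement the PQ example, here is a minimal sketch of the scalar-quantizer encodings from the list above, assuming d, x_train, x_base, x_query and k are defined as in the PQ example:

# Scalar quantizer: each vector component is quantized on its own
sq_fp16 = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_fp16)  # 16-bit floats
sq_8bit = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)  # 256 levels per component
sq_8bit.train(x_train)                   # learns the per-dimension value ranges
sq_8bit.add(x_base)
D, I = sq_8bit.search(x_query, k)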

Cell-probe methods (IndexIVF* indexes)

A typical way to speed up the process, at the cost of losing the guarantee of finding the exact nearest neighbor, is to employ a partitioning technique such as k-means. The corresponding algorithms are sometimes referred to as cell-probe methods.

IVFPQ is an enhanced version of PQ: it adds inverted lists to make search more efficient and suits scenarios that need fast retrieval over large datasets, while plain PQ focuses on cutting storage through quantization and suits storage-constrained environments. (Seen this way, IVFPQ is essentially PQ applied in a search setting.) Residual quantization goes further by quantizing the residual left after each quantization step.
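
A minimal sketch of an IVFPQ index, reusing d, x_train, x_base, x_query and k from the PQ example (nlist, nprobe and the PQ parameters are illustrative):

# IVFPQ: a coarse quantizer assigns vectors to inverted lists, PQ compresses the residuals
nlist = 1024                             # number of inverted lists (illustrative)
coarse = faiss.IndexFlatL2(d)            # coarse quantizer used for the assignment
ivfpq = faiss.IndexIVFPQ(coarse, d, nlist, 16, 8)   # 16 sub-quantizers, 8 bits each
ivfpq.train(x_train)                     # needs a training set well above nlist vectors
ivfpq.add(x_base)
ivfpq.nprobe = 16                        # inverted lists visited per query
D, I = ivfpq.search(x_query, k)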

IVF1024,PQNx4fs,RFlat (a construction sketch follows the breakdown below):

IVF1024

  • IVF stands for Inverted File, a data structure that groups the stored vectors into inverted lists so that, given a query vector, only a small subset of the lists has to be scanned.
  • 1024 is the number of inverted lists (clusters). The training vectors are partitioned into 1024 cells with k-means, each database vector is stored in the list of its nearest centroid, and at query time only the lists whose centroids are closest to the query are visited as candidates.

PQNx4fs

  • PQ stands for Product Quantization, a vector-quantization technique that maps continuous high-dimensional vectors onto small discrete codebooks, reducing storage requirements and speeding up search.
  • N is the number of sub-quantizers: the vector is split into N sub-vectors, each of which gets its own code.
  • x4 means 4 bits per sub-quantizer, i.e. each per-sub-vector codebook has 16 entries.
  • fs stands for fast scan, a SIMD-optimized implementation that keeps the 4-bit codes and distance look-up tables in CPU registers/cache so that distance computations are very fast.

RFlat

  • RFlat is a re-ranking (refinement) step used on top of the initial search results to improve accuracy: after the compressed index returns a set of candidate vectors, their exact distances to the query are computed on the uncompressed vectors and the candidates are re-ordered, so the points truly closest to the query come out on top.
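
The composite index above can be created with the index factory. A minimal sketch with N instantiated as 32 (d must then be a multiple of 32); the parameter-setting lines assume the k_factor and base_index fields of the underlying IndexRefine class, so they may need adjusting for a particular Faiss version:

# IVF1024,PQNx4fs,RFlat via the index factory (N = 32 here, purely illustrative)
index = faiss.index_factory(d, "IVF1024,PQ32x4fs,RFlat")
index.train(x_train)                     # needs a training set well above 1024 vectors
index.add(x_base)
rf = faiss.downcast_index(index)         # view the result as the refine wrapper
rf.k_factor = 4                          # RFlat: re-rank 4*k candidates with exact distances
faiss.downcast_index(rf.base_index).nprobe = 16   # IVF: lists visited per query
D, I = index.search(x_query, k)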

IndexHNSW variants

IndexHNSW supports the following Flat indexes: IndexHNSWFlat (no encoding), IndexHNSWSQ (scalar quantizer), IndexHNSWPQ (product quantizer), IndexHNSW2Level (two-level encoding).
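
A minimal sketch of the plain variant, IndexHNSWFlat (M, efConstruction and efSearch are illustrative; data as above):

# IndexHNSWFlat: graph-based index over uncompressed vectors, no training step
hnsw = faiss.IndexHNSWFlat(d, 32)        # 32 links per node (the M parameter)
hnsw.hnsw.efConstruction = 64            # graph quality at build time
hnsw.hnsw.efSearch = 32                  # speed/accuracy trade-off at query time
hnsw.add(x_base)                         # sequential adds only, no add_with_ids
D, I = hnsw.search(x_query, k)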

IndexLSH and its relationship with cell-probe methods

The most popular cell-probe method is probably the original Locality Sensitive Hashing method referred to as [E2LSH] (Locality Sensitive Hashing (LSH) Home Page). However this method and its derivatives suffer from two drawbacks:

  • They require a lot of hash functions (=partitions) to achieve acceptable results, leading to a lot of extra memory. Memory is not cheap.
  • The hash functions are not adapted to the input data. This is good for proofs but leads to suboptimal results in practice.
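
For reference, Faiss ships a binary LSH index; a minimal sketch (nbits is an illustrative choice):

# IndexLSH: random-projection hashing into nbits-bit binary codes
nbits = 2 * d                            # number of hash bits (illustrative)
lsh = faiss.IndexLSH(d, nbits)
lsh.train(x_train)                       # cheap; the default hashes are not data-adapted
lsh.add(x_base)
D, I = lsh.search(x_query, k)            # distances are Hamming distances between codes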

Indexes based on a residual quantizer
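
These indexes encode each vector by repeatedly quantizing the residual left by the previous stage (see the residual-encoding bullet above). A minimal sketch, assuming an IndexResidualQuantizer(d, M, nbits) constructor; the exact signature should be checked against the Faiss version in use:

# Residual quantizer sketch (constructor signature assumed, values illustrative)
M, n_bits = 8, 8                         # 8 refinement stages, 8 bits each -> 8 bytes per vector
rq = faiss.IndexResidualQuantizer(d, M, n_bits)
rq.train(x_train)                        # learns the stage codebooks
rq.add(x_base)
D, I = rq.search(x_query, k)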

3. Choosing an Index

If the dataset only has about 1,000-10,000 vectors, it is usually better to just use brute-force search.

This is done via a "Flat" index. If the whole dataset does not fit in RAM, you can build small indexes one after another, and combine the search results.
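
A minimal sketch of that shard-and-merge idea (chunk_size is illustrative, numpy is imported as np, and in practice each chunk would be loaded from disk rather than sliced from an in-memory array):

# Brute-force search over chunks, then merge the partial results
chunk_size = 100_000                     # illustrative chunk size
all_D, all_I = [], []
for start in range(0, x_base.shape[0], chunk_size):
    sub = faiss.IndexFlatL2(d)           # small brute-force index for this chunk
    sub.add(x_base[start:start + chunk_size])
    D, I = sub.search(x_query, k)
    all_D.append(D)
    all_I.append(I + start)              # shift ids back to the global numbering
D, I = np.hstack(all_D), np.hstack(all_I)
order = np.argsort(D, axis=1)[:, :k]     # keep the k best results over all chunks
D = np.take_along_axis(D, order, axis=1)
I = np.take_along_axis(I, order, axis=1)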

First, consider whether memory is a concern.

1. If it is not a concern: HNSWM, or IVF1024,PQNx4fs,RFlat

If you have plenty of RAM or the dataset is small, HNSW is the best option: it is a very fast and accurate index. 4 <= M <= 64 is the number of links per vector; larger M is more accurate but uses more memory. The speed/accuracy trade-off at query time is set via the efSearch parameter. Memory usage is (d * 4 + M * 2 * 4) bytes per vector.

HNSW does only support sequential adds (not add_with_ids) so here again, prefix with IDMap if needed. HNSW does not require training and does not support removing vectors from the index.
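
If custom ids are needed, the IDMap prefix mentioned above can be added; a minimal sketch (the id values are illustrative; numpy as np):

# Wrap HNSW in an IDMap so that add_with_ids works
ids = np.arange(len(x_base)).astype('int64') + 100_000   # arbitrary 64-bit ids (illustrative)
index = faiss.IndexIDMap(faiss.IndexHNSWFlat(d, 32))     # factory equivalent: "IDMap,HNSW32"
index.add_with_ids(x_base, ids)
D, I = index.search(x_query, k)          # I now reports the custom ids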

The second option is faster than HNSW. However it requires a re-ranking stage and thus there are two parameters to adjust: the k_factor of reranking and the nprobe of the IVF.

Supported on GPU: no.

2. If memory is somewhat of a concern: "...,Flat"

After clustering, "Flat" just organizes the vectors into buckets, so it does not compress them and the storage size is the same as that of the original dataset. The trade-off between speed and accuracy is set via the nprobe parameter.

Supported on GPU: yes (but the clustering method used must also be supported on GPU).
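
A minimal sketch of such an "IVF...,Flat" index (nlist and nprobe are illustrative; same placeholder variables as before):

# IVFFlat: k-means clustering into nlist buckets, vectors stored uncompressed
nlist = 100                              # number of buckets (illustrative)
quantizer = faiss.IndexFlatL2(d)
ivf_flat = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_flat.train(x_train)                  # runs k-means on the training sample
ivf_flat.add(x_base)
ivf_flat.nprobe = 8                      # buckets visited per query: speed vs. accuracy
D, I = ivf_flat.search(x_query, k)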

3. If memory is quite important: OPQM_D,...,PQMx4fsr

If storing the whole vectors is too expensive, this performs two operations:

  • an OPQ transform to dimension D to reduce the dimension
  • a PQ quantization of the vectors into M 4-bit codes.

Therefore the total storage is M/2 bytes per vector.

This uses a quantized index.

OPQM_D: applies an OPQ transform to M blocks in D dimensions.

4. If memory is very important: OPQM_D,...,PQM

PQM compresses the vectors using a product quantizer that outputs M-byte codes. M is typically <= 64, for larger codes SQ is usually as accurate and faster. OPQ is a linear transformation of the vectors to make them easier to compress. D is a dimension such that:

  • D is a multiple of M (required)
  • D <= d, with d the dimension of the input vectors (preferable)
  • D = 4*M (preferable)

Supported on GPU: yes
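
A minimal sketch of this memory-constrained recipe via the index factory, with M = 16 and D = 64 as illustrative values (so D = 4*M, and the input dimension d must be at least 64):

# OPQ16_64,IVF1024,PQ16: rotate/reduce to 64 dims, then 16-byte PQ codes per vector
index = faiss.index_factory(d, "OPQ16_64,IVF1024,PQ16")
index.train(x_train)                     # trains the OPQ rotation, the k-means and the PQ codebooks
index.add(x_base)
faiss.extract_index_ivf(index).nprobe = 16   # reach the IVF layer through the OPQ wrapper
D, I = index.search(x_query, k)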

Choosing the clustering options according to the dataset size

This question is used to fill in the clustering options (the ... above). The dataset is clustered into buckets and at search time, only a fraction of the buckets are visited (nprobe buckets). The clustering is performed on a representative sample of the dataset vectors, typically a sample of the dataset. We indicate the optimal size for this sample.

If below 1M vectors: ...,IVFK,...
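
As a rough sizing sketch, roughly following the Faiss guidelines' suggestion of K between 4*sqrt(N) and 16*sqrt(N) clusters and on the order of 30 training vectors per cluster (N is illustrative):

# Picking the number of clusters K from the dataset size N
import math
N = 200_000                              # dataset size (illustrative)
K = int(4 * math.sqrt(N))                # about 4*sqrt(N) clusters
n_train = 30 * K                         # at least ~30 training vectors per cluster
print(K, n_train)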

Based on this, we decided to use an IVFFlat index for our task.

4. A Survey of Existing Vector Databases

A survey of the current mainstream vector databases:

Among the mainstream vector databases, Milvus has a clear advantage in scale, retrieval performance and community influence, and its distributed architecture is also a better match for the idea of next-generation storage.

Weaviate comes with many ready-made use-case examples and fits closely with the currently popular GPT-related projects, but it still has to prove itself in large-scale production environments.

Chroma is a very lightweight database; under the hood it uses storage engines such as ClickHouse and DuckDB.

In the end our project did not use a vector database: our data volume is not very large and we only need the retrieval functionality, so building an index directly with a retrieval algorithm meets our needs and is more flexible and convenient.

References:

GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
