Innovation Project Practical Training Research (Part 2): A Survey of Other Types of Search Indexes

1. Overview of Faiss

Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU.

Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.
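
For instance, cosine similarity can be obtained with an inner-product index on normalized vectors. A minimal sketch with random placeholder data (d, the dataset sizes and k are illustrative):

# Cosine similarity = inner product on L2-normalized vectors (illustrative sketch)
import numpy as np
import faiss

d = 64                                        # vector dimension (placeholder)
xb = np.random.random((1000, d)).astype('float32')   # database vectors (placeholder)
xq = np.random.random((5, d)).astype('float32')      # query vectors (placeholder)

faiss.normalize_L2(xb)                        # normalize in place
faiss.normalize_L2(xq)
index = faiss.IndexFlatIP(d)                  # exact inner-product (dot product) index
index.add(xb)
D, I = index.search(xq, 10)                   # D now contains cosine similarities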

Other methods, like HNSW and NSG, add an indexing structure on top of the raw vectors to make searching more efficient.

2. Overview of the Relevant Indexes

Flat index

Flat indexes simply encode the vectors into fixed-size codes and store them in an array of ntotal * code_size bytes.

At search time, all indexed vectors are decoded sequentially and compared to the query vector. For IndexPQ, the comparison is done in the compressed domain, which is faster.

Vector encodings

The available encodings are (from least to strongest compression):

  • no encoding at all (IndexFlat): the vectors are stored without compression;
  • 16-bit float encoding (IndexScalarQuantizer with QT_fp16): the vectors are compressed to 16-bit floats, which may cause some loss of precision; (the whole vector is compressed directly, without splitting)
  • 8/6/4-bit integer encoding (IndexScalarQuantizer with QT_8bit/QT_6bit/QT_4bit): vectors quantized to 256/64/16 levels;
  • PQ encoding (IndexPQ): vectors are split into sub-vectors that are each quantized to a few bits (usually 8). See the example below. (the vector is split into segments and each segment is quantized)
  • Residual encodings (IndexResidual): vectors are quantized and progressively refined by residual. At each quantization stage, the size of the codebook can be refined. (the quantized result is further refined by quantizing the residual)
# Python PQ example
import faiss                             # d, x_train, x_base, x_query and k are assumed to be defined elsewhere

m = 16                                   # number of subquantizers
n_bits = 8                               # bits allocated per subquantizer
pq = faiss.IndexPQ(d, m, n_bits)         # create the index
pq.train(x_train)                        # training
pq.add(x_base)                           # populate the index
D, I = pq.search(x_query, k)             # perform a search

The number of bits n_bits must be equal to 8, 12 or 16. The dimension d should be a multiple of m.
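
To complement the PQ example, here is a minimal sketch of the scalar-quantizer encodings from the list above, assuming d, x_train, x_base, x_query and k are defined as in the PQ example:

# Scalar quantizer: each vector component is quantized on its own
sq_fp16 = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_fp16)  # 16-bit floats
sq_8bit = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)  # 256 levels per component
sq_8bit.train(x_train)                   # learns the per-dimension value ranges
sq_8bit.add(x_base)
D, I = sq_8bit.search(x_query, k)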

Cell-probe methods (IndexIVF* indexes)

A typical way to speed up the process, at the cost of losing the guarantee of finding the exact nearest neighbor, is to employ a partitioning technique such as k-means. The corresponding algorithms are sometimes referred to as cell-probe methods.

IVFPQ is an enhanced version of PQ: it adds inverted lists to make search more efficient and suits scenarios that need fast retrieval over large datasets, while plain PQ focuses on cutting storage through quantization and suits storage-constrained environments. (Seen this way, IVFPQ is essentially PQ applied in a search setting.) Residual quantization goes further by quantizing the residual left after each quantization step.
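
A minimal sketch of an IVFPQ index, reusing d, x_train, x_base, x_query and k from the PQ example (nlist, nprobe and the PQ parameters are illustrative):

# IVFPQ: a coarse quantizer assigns vectors to inverted lists, PQ compresses the residuals
nlist = 1024                             # number of inverted lists (illustrative)
coarse = faiss.IndexFlatL2(d)            # coarse quantizer used for the assignment
ivfpq = faiss.IndexIVFPQ(coarse, d, nlist, 16, 8)   # 16 sub-quantizers, 8 bits each
ivfpq.train(x_train)                     # needs a training set well above nlist vectors
ivfpq.add(x_base)
ivfpq.nprobe = 16                        # inverted lists visited per query
D, I = ivfpq.search(x_query, k)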

IVF1024,PQNx4fs,RFlat (a construction sketch follows the breakdown below):

IVF1024

  • IVF stands for Inverted File, a data structure that groups the stored vectors into inverted lists so that, given a query vector, only a small subset of the lists has to be scanned.
  • 1024 is the number of inverted lists (clusters). The training vectors are partitioned into 1024 cells with k-means, each database vector is stored in the list of its nearest centroid, and at query time only the lists whose centroids are closest to the query are visited as candidates.

PQNx4fs

  • PQ stands for Product Quantization, a vector-quantization technique that maps continuous high-dimensional vectors onto small discrete codebooks, reducing storage requirements and speeding up search.
  • N is the number of sub-quantizers: the vector is split into N sub-vectors, each of which gets its own code.
  • x4 means 4 bits per sub-quantizer, i.e. each per-sub-vector codebook has 16 entries.
  • fs stands for fast scan, a SIMD-optimized implementation that keeps the 4-bit codes and distance look-up tables in CPU registers/cache so that distance computations are very fast.

RFlat

  • RFlat is a re-ranking (refinement) step used on top of the initial search results to improve accuracy: after the compressed index returns a set of candidate vectors, their exact distances to the query are computed on the uncompressed vectors and the candidates are re-ordered, so the points truly closest to the query come out on top.
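
The composite index above can be created with the index factory. A minimal sketch with N instantiated as 32 (d must then be a multiple of 32); the parameter-setting lines assume the k_factor and base_index fields of the underlying IndexRefine class, so they may need adjusting for a particular Faiss version:

# IVF1024,PQNx4fs,RFlat via the index factory (N = 32 here, purely illustrative)
index = faiss.index_factory(d, "IVF1024,PQ32x4fs,RFlat")
index.train(x_train)                     # needs a training set well above 1024 vectors
index.add(x_base)
rf = faiss.downcast_index(index)         # view the result as the refine wrapper
rf.k_factor = 4                          # RFlat: re-rank 4*k candidates with exact distances
faiss.downcast_index(rf.base_index).nprobe = 16   # IVF: lists visited per query
D, I = index.search(x_query, k)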

IndexHNSW variants

IndexHNSW supports the following Flat indexes: IndexHNSWFlat (no encoding), IndexHNSWSQ (scalar quantizer), IndexHNSWPQ (product quantizer), IndexHNSW2Level (two-level encoding).
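
A minimal sketch of the plain variant, IndexHNSWFlat (M, efConstruction and efSearch are illustrative; data as above):

# IndexHNSWFlat: graph-based index over uncompressed vectors, no training step
hnsw = faiss.IndexHNSWFlat(d, 32)        # 32 links per node (the M parameter)
hnsw.hnsw.efConstruction = 64            # graph quality at build time
hnsw.hnsw.efSearch = 32                  # speed/accuracy trade-off at query time
hnsw.add(x_base)                         # sequential adds only, no add_with_ids
D, I = hnsw.search(x_query, k)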

IndexLSH and its relationship with cell-probe methods

The most popular cell-probe method is probably the original Locality Sensitive Hashing method referred to as [E2LSH] (Locality Sensitive Hashing (LSH) Home Page). However this method and its derivatives suffer from two drawbacks:

  • They require a lot of hash functions (=partitions) to achieve acceptable results, leading to a lot of extra memory. Memory is not cheap.
  • The hash functions are not adapted to the input data. This is good for proofs but leads to suboptimal results in practice.
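
For reference, Faiss ships a binary LSH index; a minimal sketch (nbits is an illustrative choice):

# IndexLSH: random-projection hashing into nbits-bit binary codes
nbits = 2 * d                            # number of hash bits (illustrative)
lsh = faiss.IndexLSH(d, nbits)
lsh.train(x_train)                       # cheap; the default hashes are not data-adapted
lsh.add(x_base)
D, I = lsh.search(x_query, k)            # distances are Hamming distances between codes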

Indexes based on a residual quantizer
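
These indexes encode each vector by repeatedly quantizing the residual left by the previous stage (see the residual-encoding bullet above). A minimal sketch, assuming an IndexResidualQuantizer(d, M, nbits) constructor; the exact signature should be checked against the Faiss version in use:

# Residual quantizer sketch (constructor signature assumed, values illustrative)
M, n_bits = 8, 8                         # 8 refinement stages, 8 bits each -> 8 bytes per vector
rq = faiss.IndexResidualQuantizer(d, M, n_bits)
rq.train(x_train)                        # learns the stage codebooks
rq.add(x_base)
D, I = rq.search(x_query, k)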

3. Choosing an Index

If the dataset only has about 1,000-10,000 vectors, it is usually better to just use brute-force search.

This is done via a "Flat" index. If the whole dataset does not fit in RAM, you can build small indexes one after another, and combine the search results.
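
A minimal sketch of that shard-and-merge idea (chunk_size is illustrative, numpy is imported as np, and in practice each chunk would be loaded from disk rather than sliced from an in-memory array):

# Brute-force search over chunks, then merge the partial results
chunk_size = 100_000                     # illustrative chunk size
all_D, all_I = [], []
for start in range(0, x_base.shape[0], chunk_size):
    sub = faiss.IndexFlatL2(d)           # small brute-force index for this chunk
    sub.add(x_base[start:start + chunk_size])
    D, I = sub.search(x_query, k)
    all_D.append(D)
    all_I.append(I + start)              # shift ids back to the global numbering
D, I = np.hstack(all_D), np.hstack(all_I)
order = np.argsort(D, axis=1)[:, :k]     # keep the k best results over all chunks
D = np.take_along_axis(D, order, axis=1)
I = np.take_along_axis(I, order, axis=1)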

First, consider whether memory is a concern.

1. If it is not a concern: HNSWM, or IVF1024,PQNx4fs,RFlat

If you have plenty of RAM or the dataset is small, HNSW is the best option: it is a very fast and accurate index. 4 <= M <= 64 is the number of links per vector; larger M is more accurate but uses more memory. The speed/accuracy trade-off at query time is set via the efSearch parameter. Memory usage is (d * 4 + M * 2 * 4) bytes per vector.

HNSW does only support sequential adds (not add_with_ids) so here again, prefix with IDMap if needed. HNSW does not require training and does not support removing vectors from the index.
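
If custom ids are needed, the IDMap prefix mentioned above can be added; a minimal sketch (the id values are illustrative; numpy as np):

# Wrap HNSW in an IDMap so that add_with_ids works
ids = np.arange(len(x_base)).astype('int64') + 100_000   # arbitrary 64-bit ids (illustrative)
index = faiss.IndexIDMap(faiss.IndexHNSWFlat(d, 32))     # factory equivalent: "IDMap,HNSW32"
index.add_with_ids(x_base, ids)
D, I = index.search(x_query, k)          # I now reports the custom ids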

The second option is faster than HNSW. However it requires a re-ranking stage and thus there are two parameters to adjust: the k_factor of reranking and the nprobe of the IVF.

Supported on GPU: no.

2. If memory is somewhat of a concern: "...,Flat"

After clustering, "Flat" just organizes the vectors into buckets, so it does not compress them and the storage size is the same as that of the original dataset. The trade-off between speed and accuracy is set via the nprobe parameter.

Supported on GPU: yes (but the clustering method used must also be supported on GPU).
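
A minimal sketch of such an "IVF...,Flat" index (nlist and nprobe are illustrative; same placeholder variables as before):

# IVFFlat: k-means clustering into nlist buckets, vectors stored uncompressed
nlist = 100                              # number of buckets (illustrative)
quantizer = faiss.IndexFlatL2(d)
ivf_flat = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_flat.train(x_train)                  # runs k-means on the training sample
ivf_flat.add(x_base)
ivf_flat.nprobe = 8                      # buckets visited per query: speed vs. accuracy
D, I = ivf_flat.search(x_query, k)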

3. If memory is quite important: OPQM_D,...,PQMx4fsr

If storing the whole vectors is too expensive, this performs two operations:

  • an OPQ transform to dimension D to reduce the dimension
  • a PQ quantization of the vectors into M 4-bit codes.

Therefore the total storage is M/2 bytes per vector.

This uses a quantized index.

OPQM_D: applies an OPQ transform to M blocks in D dimensions.

4. If memory is very important: OPQM_D,...,PQM

PQM compresses the vectors using a product quantizer that outputs M-byte codes. M is typically <= 64, for larger codes SQ is usually as accurate and faster. OPQ is a linear transformation of the vectors to make them easier to compress. D is a dimension such that:

  • D is a multiple of M (required)
  • D <= d, with d the dimension of the input vectors (preferable)
  • D = 4*M (preferable)

Supported on GPU: yes
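
A minimal sketch of this memory-constrained recipe via the index factory, with M = 16 and D = 64 as illustrative values (so D = 4*M, and the input dimension d must be at least 64):

# OPQ16_64,IVF1024,PQ16: rotate/reduce to 64 dims, then 16-byte PQ codes per vector
index = faiss.index_factory(d, "OPQ16_64,IVF1024,PQ16")
index.train(x_train)                     # trains the OPQ rotation, the k-means and the PQ codebooks
index.add(x_base)
faiss.extract_index_ivf(index).nprobe = 16   # reach the IVF layer through the OPQ wrapper
D, I = index.search(x_query, k)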

Choosing the clustering options according to the dataset size

This question is used to fill in the clustering options (the ... above). The dataset is clustered into buckets and at search time, only a fraction of the buckets are visited (nprobe buckets). The clustering is performed on a representative sample of the dataset vectors, typically a sample of the dataset. We indicate the optimal size for this sample.

If below 1M vectors: ...,IVFK,...
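
As a rough sizing sketch, roughly following the Faiss guidelines' suggestion of K between 4*sqrt(N) and 16*sqrt(N) clusters and on the order of 30 training vectors per cluster (N is illustrative):

# Picking the number of clusters K from the dataset size N
import math
N = 200_000                              # dataset size (illustrative)
K = int(4 * math.sqrt(N))                # about 4*sqrt(N) clusters
n_train = 30 * K                         # at least ~30 training vectors per cluster
print(K, n_train)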

Based on this, we decided to use an IVFFlat index for our task.

4. A Survey of Existing Vector Databases

A survey of the current mainstream vector databases:

Among the mainstream vector databases, Milvus has a clear advantage in scale, retrieval performance and community influence, and its distributed architecture is also a better match for the idea of next-generation storage.

Weaviate comes with many ready-made use-case examples and fits closely with the currently popular GPT-related projects, but it still has to prove itself in large-scale production environments.

Chroma is a very lightweight database; under the hood it uses storage engines such as ClickHouse and DuckDB.

In the end our project did not use a vector database: our data volume is not very large and we only need the retrieval functionality, so building an index directly with a retrieval algorithm meets our needs and is more flexible and convenient.

References:

GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
