Faiss (10): Analysis of the IVFPQ search process

翔底

Category: Faiss

Article link: https://blog.csdn.net/rangfei/article/details/108822676


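1. Overview

The previous posts covered index creation, training, and adding vectors; this one finally reaches search. In Faiss, those earlier steps are time-consuming but are done only once, whereas search is the operation that runs over and over in practice, so it is the focus of this whole study.

2. Process analysis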

2.1 python core

D, I = gpu_index.search(xq_t[x],top_k)

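The line above performs one search. gpu_index is the index instance, search is the method implemented by that index, and xq_t[x] is a 64-dimensional query vector, that is, the data to search for. top_k is the number of neighbors to return; for example, top_k = 100 returns the 100 vectors in the index that are closest to the query.

D holds the distances between the results and the query vector, sorted in ascending order, so the nearest neighbor comes first;
I holds the labels (IDs) of those results, in the same order as D.

With top_k set to 10, one search prints the following D and I: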

#D:
[[4.3474708 4.712     4.729798  4.830625  4.8382506 4.9058332 4.921468
  4.9473658 4.9496455 5.009597 ]]
  
#I:
[[436048 134399 127835  68701   4850  35935 116754 235634 399932 875034]]
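
To make the meaning of D and I concrete, here is a minimal pure-Python sketch of an exact (brute-force) top-k search. It is not how IVFPQ computes its results, which are approximate; brute_force_search and the toy database are hypothetical names and data used only for illustration. Faiss L2 indexes report squared Euclidean distances, so the sketch does the same.

```python
import heapq


def brute_force_search(database, query, top_k):
    """Return (D, I): squared L2 distances in ascending order and the matching labels."""
    scored = []
    for label, vec in enumerate(database):
        # Squared Euclidean distance between the query and one database vector.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, label))
    # nsmallest returns the top_k closest entries, nearest first.
    nearest = heapq.nsmallest(top_k, scored)
    distances = [dist for dist, _ in nearest]
    labels = [label for _, label in nearest]
    return distances, labels


# Toy data: four 2-dimensional vectors; the label of a vector is its row number.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
D, I = brute_force_search(database, [0.9, 0.1], top_k=2)
print(D, I)
```

The first entry of I is the label of the closest vector and the first entry of D is its distance, which is the same layout as the arrays printed above.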

2.2 faiss core

Because the index was copied to the GPU earlier, calling search runs the search function of the GPU index.

GpuIndex::search

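This function is defined in gpu/GpuIndex.cu; the GpuIndex class inherits from faiss::Index.
A .cu file is a CUDA source file. It is written much like C++, but it can mix host code, which runs on the CPU, with device kernels, which run on the GPU. GpuIndex::search itself is host code: it runs on the CPU and drives the GPU by allocating device memory, copying data between CPU and GPU memory, and issuing work on a CUDA stream.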

/*************************************
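* n: number of query vectors
* x: pointer to the first query vector
* k: number of nearest neighbors to return
* distances: output array of distances from each query to its results (CPU address)
* labels: output array of labels of the results (CPU address)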
**************************************/
void GpuIndex::search(Index::idx_t n,
                 const float* x,
                 Index::idx_t k,
                 float* distances,
                 Index::idx_t* labels) const {
  // Validity checks: the index must be trained, and n and k must be within limits
  ...

  // Make device_ the current CUDA device; the previous device is restored when scope is destroyed
  DeviceScope scope(device_);
  auto stream = resources_->getDefaultStream(device_);

  // Set up device-side buffers for the output distances and labels
  auto outDistances =
    toDevice<float, 2>(resources_, device_, distances, stream,
                       {(int) n, (int) k});

  auto outLabels =
    toDevice<faiss::Index::idx_t, 2>(resources_, device_, labels, stream,
                                     {(int) n, (int) k});

  bool usePaged = false;

  // If x lives in CPU memory and is large enough, search it page by page from the CPU
  if (getDeviceForAddress(x) == -1) {
    size_t dataSize = (size_t) n * this->d * sizeof(float);

    if (dataSize >= minPagedSize_) {
      searchFromCpuPaged_(n, x, k,
                          outDistances.data(),
                          outLabels.data());
      usePaged = true;
    }
  }

  // Otherwise run the search without paging
  if (!usePaged) {
    searchNonPaged_(n, x, k,
                    outDistances.data(),
                    outLabels.data());
  }

  // Copy back if necessary
  fromDevice<float, 2>(outDistances, distances, stream);
  fromDevice<faiss::Index::idx_t, 2>(outLabels, labels, stream);
}

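A search can run on the GPU only if:
the index has been trained;
n does not exceed the largest value an int can hold, std::numeric_limits<int>::max();
k does not exceed the largest k supported by GPU k-selection, (Index::idx_t) getMaxKSelection(), a limit that depends on the CUDA SDK version.
Faiss calls searchImpl_ with device-resident pointers. The input vectors may be too large for the GPU, but the output distances and labels are assumed to fit, so space for them is set aside on the GPU up front; if a case ever arises where everything is too large, another level of tiling would have to be added.
This is the top-level search function of GpuIndex. Every search that goes to the GPU passes through it, but it does not perform the search itself; it only prepares the context. Its workflow can be summarized as:
validate the arguments n, x and k and the state of the index;
select the GPU device and allocate device memory for the outputs;
run the search, paged or non-paged depending on where the query vectors live and how large they are;
copy the results back to the CPU.
The search itself is carried out by searchImpl_, which is called with device-resident pointers from inside searchNonPaged_().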

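Note:
This step does not copy the contents of the query vectors x to the GPU. It only sets up GPU buffers that correspond to the distances and labels arrays allocated on the CPU.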
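
The choice between the paged and the non-paged path in GpuIndex::search can be sketched as follows. This only illustrates the control flow: plan_search and page_ranges are hypothetical helpers, the 256 MiB threshold is an assumed example value standing in for minPagedSize_, and the real searchFromCpuPaged_ does considerably more than slice the input into row ranges.

```python
def plan_search(n, d, min_paged_size, x_on_gpu, bytes_per_float=4):
    """Mirror the dispatch in GpuIndex::search: page only large CPU-resident inputs."""
    if not x_on_gpu:
        data_size = n * d * bytes_per_float
        if data_size >= min_paged_size:
            return "paged"
    return "non_paged"


def page_ranges(n, rows_per_page):
    """Split n query rows into consecutive [start, end) tiles (hypothetical tiling)."""
    return [(start, min(start + rows_per_page, n))
            for start in range(0, n, rows_per_page)]


MIN_PAGED_SIZE = 256 * 1024 * 1024  # assumed example threshold, in bytes

# One 64-dimensional query is only 256 bytes, so it takes the non-paged path.
print(plan_search(1, 64, MIN_PAGED_SIZE, x_on_gpu=False))
# Two million queries are 512,000,000 bytes, which is above the threshold.
print(plan_search(2_000_000, 64, MIN_PAGED_SIZE, x_on_gpu=False))
print(page_ranges(10, 4))
```

Under these assumptions, the single-vector search shown in section 2.1 always ends up in searchNonPaged_.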

searchNonPaged_

This function is still preparing data for the search.

/*************************************
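* n: number of query vectors
* x: pointer to the first query vector
* k: number of nearest neighbors to return
* outDistancesData: pointer to the buffer that receives the output distances
* outIndicesData: pointer to the buffer that receives the output labels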
**************************************/
void GpuIndex::searchNonPaged_(int n,
                          const float* x,
                          int k,
                          float* outDistancesData,
                          Index::idx_t* outIndicesData) const {
  // Get the default CUDA stream for this device
  auto stream = resources_->getDefaultStream(device_);

  // Copy the contents of the query vectors x to the GPU
  auto vecs = toDevice<float, 2>(resources_,
                                 device_,
                                 const_cast<float*>(x),
                                 stream,
                                 {n, (int) this->d});
 
  // Call searchImpl_ of the concrete index (here GpuIndexIVFPQ) to do the actual search
  searchImpl_(n, vecs.data(), k, outDistancesData, outIndicesData);
}

As the source shows, this function does two things:

copy the query vectors from the CPU to the GPU;
call search on the IVFPQ index instance resident on the GPU.

GpuIndexIVFPQ::searchImpl_

The searchImpl_ call at the end of searchNonPaged_ dispatches to the searchImpl_ of the concrete index class, shown below and defined in gpu/GpuIndexIVFPQ.cu.

/*************************************
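* n: number of query vectors
* x: pointer to the first query vector, already copied to the GPU at this point
* k: number of nearest neighbors to return
* distances: output array of distances from each query to its results (GPU address)
* labels: output array of labels of the results (GPU address)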
**************************************/
void GpuIndexIVFPQ::searchImpl_(int n,
                           const float* x,
                           int k,
                           float* distances,
                           Index::idx_t* labels) const {
  // Device is already set in GpuIndex::search
  FAISS_ASSERT(index_);
  FAISS_ASSERT(n > 0);

  // Data is already resident on the GPU
  Tensor<float, 2, true> queries(const_cast<float*>(x), {n, (int) this->d});
  Tensor<float, 2, true> outDistances(distances, {n, k});

  static_assert(sizeof(long) == sizeof(Index::idx_t), "size mismatch");
  Tensor<long, 2, true> outLabels(const_cast<long*>(labels), {n, k});

  index_->query(queries, nprobe, k, outDistances, outLabels);
}

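Only at this point does the search enter the index instance resident in GPU memory. The work inside this function falls into two parts:

check the inputs, then wrap the query vectors already copied to GPU memory and the allocated distances and labels buffers in the tensor structures the search needs;
call index_->query to run the search. index_ is the GPU-side IVFPQ object held by the index; it contains the coarse quantizer and the inverted lists built when the vectors were trained and added.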

IVFPQ::query

/*************************************
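* queries: tensor holding the query vectors
* nprobe: number of clusters (inverted lists) visited per query
* k: number of nearest neighbors to return
* outDistances: tensor that receives the output distances
* outIndices: tensor that receives the output labels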
**************************************/
void IVFPQ::query(Tensor<float, 2, true>& queries,
             int nprobe,
             int k,
             Tensor<float, 2, true>& outDistances,
             Tensor<long, 2, true>& outIndices) {
  // Validate the arguments
  FAISS_ASSERT(nprobe <= GPU_MAX_SELECTION_K);
  FAISS_ASSERT(k <= GPU_MAX_SELECTION_K);

  // resources_ bundles the GPU resources; mem is the temporary-memory manager of the current device
  auto& mem = resources_->getMemoryManagerCurrentDevice();
  // Default CUDA stream of the current device
  auto stream = resources_->getDefaultStreamCurrentDevice();
  nprobe = std::min(nprobe, quantizer_->getSize());

  FAISS_ASSERT(queries.getSize(1) == dim_);
  FAISS_ASSERT(outDistances.getSize(0) == queries.getSize(0));
  FAISS_ASSERT(outIndices.getSize(0) == queries.getSize(0));

  // Reserve space for the closest coarse centroids
  DeviceTensor<float, 2, true>
    coarseDistances(mem, {queries.getSize(0), nprobe}, stream);
  DeviceTensor<int, 2, true>
    coarseIndices(mem, {queries.getSize(0), nprobe}, stream);

  // Find the `nprobe` closest coarse centroids; we can use int
  // indices both internally and externally
  quantizer_->query(queries,
                    nprobe,
                    coarseDistances,
                    coarseIndices,
                    true);

  if (precomputedCodes_) {
    runPQPrecomputedCodes_(queries,
                           coarseDistances,
                           coarseIndices,
                           k,
                           outDistances,
                           outIndices);
  } else {
    runPQNoPrecomputedCodes_(queries,
                             coarseDistances,
                             coarseIndices,
                             k,
                             outDistances,
                             outIndices);
  }

  // If the GPU isn't storing indices (they are on the CPU side), we
  // need to perform the re-mapping here
  // FIXME: we might ultimately be calling this function with inputs
  // from the CPU, these are unnecessary copies
  if (indicesOptions_ == INDICES_CPU) {
    HostTensor<long, 2, true> hostOutIndices(outIndices, stream);

    ivfOffsetToUserIndex(hostOutIndices.data(),
                         numLists_,
                         hostOutIndices.getSize(0),
                         hostOutIndices.getSize(1),
                         listOffsetToUserIndex_);

    // Copy back to GPU, since the input to this function is on the
    // GPU
    outIndices.copyFrom(hostOutIndices, stream);
  }
}

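As the code shows, IVFPQ::query does the following:

check the arguments and reserve space for the closest coarse centroids;
find the nprobe closest coarse centroids (quantizer_->query);
scan the inverted lists of those centroids, using runPQPrecomputedCodes_ when precomputed codes are enabled and runPQNoPrecomputedCodes_ otherwise;
if the user indices are stored on the CPU side (indicesOptions_ == INDICES_CPU), remap the returned list offsets to user indices (the program used here does not take this step).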
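
The two-stage structure of this query, coarse centroid selection followed by a scan of the selected inverted lists, can be sketched in plain Python. This is an illustration under stated assumptions rather than the Faiss implementation: ivf_search, squared_l2 and the toy data are hypothetical, and exact distances on raw vectors are swapped in for the PQ lookup-table scan that runPQPrecomputedCodes_ performs on compressed codes.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_search(query, centroids, inverted_lists, nprobe, k):
    """Scan only the nprobe lists whose coarse centroids are closest to the query."""
    # Same clamp as in IVFPQ::query: never probe more lists than exist.
    nprobe = min(nprobe, len(centroids))
    # Stage 1: rank the coarse centroids and keep the nprobe closest.
    coarse = sorted((squared_l2(query, c), list_id)
                    for list_id, c in enumerate(centroids))[:nprobe]
    # Stage 2: compute distances only for vectors stored in the selected lists.
    candidates = []
    for _, list_id in coarse:
        for label, vec in inverted_lists[list_id]:
            candidates.append((squared_l2(query, vec), label))
    candidates.sort()
    top = candidates[:k]
    return [dist for dist, _ in top], [label for _, label in top]


# Toy index: two coarse centroids; each inverted list stores (label, vector) pairs.
centroids = [[0.0, 0.0], [10.0, 10.0]]
inverted_lists = [
    [(7, [0.0, 1.0]), (3, [2.0, 2.0])],
    [(5, [9.0, 9.0])],
]
print(ivf_search([0.5, 0.5], centroids, inverted_lists, nprobe=1, k=2))
```

With nprobe=1 the vector labelled 5 is never examined, which is the trade-off nprobe controls: fewer lists scanned means less work but a higher chance of missing a true neighbor.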

IVFPQ::runPQPrecomputedCodes_

/*************************************
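* queries: tensor holding the query vectors
* coarseDistances: tensor holding the distances to the selected coarse centroids
* coarseIndices: tensor holding the labels of the selected coarse centroids
* k: number of nearest neighbors to return
* outDistances: tensor that receives the output distances
* outIndices: tensor that receives the output labels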
**************************************/
void IVFPQ::runPQPrecomputedCodes_(
  Tensor<float, 2, true>& queries,
  DeviceTensor<float, 2, true>& coarseDistances,
  DeviceTensor<int, 2, true>& coarseIndices,
  int k,
  Tensor<float, 2, true>& outDistances,
  Tensor<long, 2, true>& outIndices) {
  auto& mem = resources_->getMemoryManagerCurrentDevice();
  auto stream = resources_->getDefaultStreamCurrentDevice();

  // Compute precomputed code term 3, - 2 * (x|y_R)
  // This is done via batch MM
  // {sub q} x {(query id)(sub dim) * (code id)(sub dim)'} =>
  // {sub q} x {(query id)(code id)}
  DeviceTensor<float, 3, true> term3Transposed(
    mem,
    {queries.getSize(0), numSubQuantizers_, numSubQuantizerCodes_},
    stream);

  // These allocations within are only temporary, so release them when
  // we're done to maximize free space
  {
    auto querySubQuantizerView = queries.view<3>(
      {queries.getSize(0), numSubQuantizers_, dimPerSubQuantizer_});
    DeviceTensor<float, 3, true> queriesTransposed(
      mem,
      {numSubQuantizers_, queries.getSize(0), dimPerSubQuantizer_},
      stream);
    runTransposeAny(querySubQuantizerView, 0, 1, queriesTransposed, stream);

    DeviceTensor<float, 3, true> term3(
      mem,
      {numSubQuantizers_, queries.getSize(0), numSubQuantizerCodes_},
      stream);

    runIteratedMatrixMult(term3, false,
                          queriesTransposed, false,
                          pqCentroidsMiddleCode_, true,
                          -2.0f, 0.0f,
                          resources_->getBlasHandleCurrentDevice(),
                          stream);

    runTransposeAny(term3, 0, 1, term3Transposed, stream);
  }

  NoTypeTensor<3, true> term2;
  NoTypeTensor<3, true> term3;
  DeviceTensor<half, 3, true> term3Half;

  if (useFloat16LookupTables_) {
    term3Half =
      convertTensor<float, half, 3>(resources_, stream, term3Transposed);

    term2 = NoTypeTensor<3, true>(precomputedCodeHalf_);
    term3 = NoTypeTensor<3, true>(term3Half);
  } else {
    term2 = NoTypeTensor<3, true>(precomputedCode_);
    term3 = NoTypeTensor<3, true>(term3Transposed);
  }

  runPQScanMultiPassPrecomputed(queries,
                                coarseDistances, // term 1
                                term2, // term 2
                                term3, // term 3
                                coarseIndices,
                                useFloat16LookupTables_,
                                bytesPerVector_,
                                numSubQuantizers_,
                                numSubQuantizerCodes_,
                                deviceListDataPointers_,
                                deviceListIndexPointers_,
                                indicesOptions_,
                                deviceListLengths_,
                                maxListLength_,
                                k,
                                outDistances,
                                outIndices,
                                resources_);
}

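The precomputation allocates a lot of memory. Because these allocations are only temporary, they should be released as soon as the computation finishes, which is why the code encloses them in a brace-delimited scope (from the querySubQuantizerView declaration to the second runTransposeAny call). Inside that scope, term 3 of the precomputed-code distance, -2 * (x|y_R), is computed with a batched matrix multiplication (MM), using runTransposeAny and runIteratedMatrixMult.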
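
The three terms passed to runPQScanMultiPassPrecomputed come from expanding the squared L2 distance between a query x and a vector reconstructed as coarse centroid y_C plus PQ residual y_R: ||x - y_C - y_R||^2 = ||x - y_C||^2 + (||y_R||^2 + 2 * (y_C|y_R)) - 2 * (x|y_R). The sketch below checks this identity numerically on one small example. The function names and the sample vectors are made up for illustration, and the sketch works on whole vectors rather than per sub-quantizer as the GPU code does.

```python
def dot(a, b):
    """Inner product (a|b) of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def three_terms(x, y_c, y_r):
    """Split the squared distance from x to (y_c + y_r) into the three terms."""
    diff = [a - b for a, b in zip(x, y_c)]
    term1 = dot(diff, diff)                       # coarse distance ||x - y_C||^2
    term2 = dot(y_r, y_r) + 2.0 * dot(y_c, y_r)   # query independent, can be precomputed
    term3 = -2.0 * dot(x, y_r)                    # depends on the query, computed per search
    return term1, term2, term3


def direct_distance(x, y_c, y_r):
    """Squared L2 distance between x and the reconstructed vector y_c + y_r."""
    diff = [a - (c + r) for a, c, r in zip(x, y_c, y_r)]
    return dot(diff, diff)


x = [1.0, 2.0]      # query
y_c = [0.5, 1.5]    # coarse centroid
y_r = [0.25, -0.5]  # PQ-decoded residual
print(three_terms(x, y_c, y_r), direct_distance(x, y_c, y_r))
```

Term 1 is already available as coarseDistances from the coarse quantizer and term 2 does not involve the query, so only term 3 has to be computed for each search, which is what the scoped block does.
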
runTransposeAny() transposes a tensor between two of its dimensions. Its source comment describes it as follows:

/// Performs an out-of-place transposition between any two dimensions.
/// Best performance is if the transposed dimensions are not
/// innermost, since the reads and writes will be coalesced.
/// Could include a shared memory transposition if the dimensions
/// being transposed are innermost, but would require support for
/// arbitrary rectangular matrices.
/// This linearized implementation seems to perform well enough,
/// especially for cases that we care about (outer dimension
/// transpositions).

runIteratedMatrixMult() computes C_i = alpha * A_i * B_i + beta * C_i for each matrix i in the batch.
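
The transpose, multiply, transpose sequence used for term 3 can be followed on nested Python lists. This is a slow reference sketch, not the cuBLAS-backed GPU code: iterated_matmul and transpose01 are hypothetical stand-ins for runIteratedMatrixMult and runTransposeAny, and the PQ centroid layout (sub-quantizer, code id, sub-dimension) is assumed from the shape comments in the listing.

```python
def transpose01(t):
    """Swap the two outer dimensions of a nested-list 3-D tensor."""
    return [list(group) for group in zip(*t)]


def iterated_matmul(c, a, b, alpha, beta, trans_b=False):
    """In place, C_i = alpha * A_i * op(B_i) + beta * C_i for each batch index i."""
    for c_i, a_i, b_i in zip(c, a, b):
        # With trans_b the rows of B_i already are the columns of its transpose.
        cols = b_i if trans_b else list(zip(*b_i))
        for r, row in enumerate(a_i):
            for j, col in enumerate(cols):
                prod = sum(x * y for x, y in zip(row, col))
                c_i[r][j] = alpha * prod + beta * c_i[r][j]
    return c


# Shapes: 2 queries, 2 sub-quantizers, 2 dims per sub-quantizer, 2 codes each.
queries = [[[1.0, 0.0], [0.0, 1.0]],          # (query, sub-quantizer, sub-dim)
           [[2.0, 2.0], [1.0, 1.0]]]
pq_centroids = [[[1.0, 1.0], [0.0, 2.0]],     # (sub-quantizer, code, sub-dim)
                [[3.0, 0.0], [0.0, 0.5]]]

queries_t = transpose01(queries)              # (sub-quantizer, query, sub-dim)
term3 = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
iterated_matmul(term3, queries_t, pq_centroids, -2.0, 0.0, trans_b=True)
term3_transposed = transpose01(term3)         # (query, sub-quantizer, code)
print(term3_transposed)
```

Each entry term3_transposed[q][m][c] is -2 times the inner product of sub-vector m of query q with code c of sub-quantizer m, which is the lookup table the scan kernel adds to terms 1 and 2.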