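Overview
This post walks through the training process of an IVFPQ index, following the call path from the application down to the faiss core implementation.
Walkthrough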
app
Assume we already have a usable IVFPQ index instance named index. Training can then be invoked directly from the program:
index.train(learning_d)
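Here learning_d is the training set. Its size is the total number of database vectors multiplied by the learning ratio (the fraction of the database used for training).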
faiss core
train()
IndexIVFPQ does not override train(), so the call above resolves to the train() defined in its parent class IndexIVF:
// n: number of training vectors
// x: pointer to the training vectors
void IndexIVF::train (idx_t n, const float *x)
{
    if (verbose)
        printf ("Training level-1 quantizer\n");
    train_q1 (n, x, verbose, metric_type);
    if (verbose)
        printf ("Training IVF residual\n");
    train_residual (n, x);
    is_trained = true;
}
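verbose toggles debug output. It defaults to false in the constructor of the base class Index and does not affect the control flow.
Faiss organizes quantization into three levels: level 1 and level 2 are mandatory, level 3 is optional. MultiIndexQuantizer is one example of a level-1 (coarse) quantizer, IndexIVFPQ adds level 2, and IndexIVFPQR contains both level 2 and level 3.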
train_q1
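train_q1 is a member of the parent class Level1Quantizer and trains the level-1 (coarse) quantizer held in the quantizer member. One possible level-1 quantizer is MultiIndexQuantizer, which encodes input vectors with a 2×8 product quantizer (2 sub-vectors, 8 bits per sub-vector) and allows the residual of each input vector under its code to be computed. In the default path analyzed below, however, the code runs k-means through a Clustering object and uses quantizer as the assignment index.
Its definition is: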
// n: number of training vectors
// x: pointer to the training vectors
// verbose: whether to print progress messages (false by default)
// metric_type: distance metric used for search (e.g. METRIC_L2 or METRIC_INNER_PRODUCT); see Index.h
void Level1Quantizer::train_q1 (size_t n, const float *x, bool verbose, MetricType metric_type)
{
    size_t d = quantizer->d;
    if (quantizer->is_trained && (quantizer->ntotal == nlist)) {
        if (verbose)
            printf ("IVF quantizer does not need training.\n");
    } else if (quantizer_trains_alone == 1) {
        ...
    } else if (quantizer_trains_alone == 0) {
        if (verbose)
            printf ("Training level-1 quantizer on %ld vectors in %ldD\n",
                    n, d);
        // create a clustering object; cp.niter (number of k-means iterations) is 10 here
        Clustering clus (d, nlist, cp);
        quantizer->reset();
        // clustering_index optionally overrides the index used during clustering; null by default
        if (clustering_index) {
            clus.train (n, x, *clustering_index);
            quantizer->add (nlist, clus.centroids.data());
        } else {
            clus.train (n, x, *quantizer);
        }
        quantizer->is_trained = true;
    } else if (quantizer_trains_alone == 2) {
        ...
    }
}
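Training is skipped only when the quantizer is already trained and already holds exactly nlist centroids (quantizer->is_trained && quantizer->ntotal == nlist). In every other case the quantizer is trained through one of the remaining branches.
quantizer_trains_alone selects how the quantizer is trained:
- 0: use the quantizer as the assignment index inside k-means training
- 1: pass the training set directly to the quantizer's train()
- 2: run k-means on a flat index, then add the centroids to the quantizer
The default is 0, which is also what the demo uses, so only this case is analyzed here.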
clus.train
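clus runs k-means on the training set: it sets the cluster centers and then assigns every vector to a center. The number of training vectors should lie between nlist * min_points_per_centroid and nlist * max_points_per_centroid. Below the minimum a warning is printed; above the maximum, nlist * max_points_per_centroid vectors are randomly sampled from the training set and used for training.
The code also handles one more case: when the number of training vectors equals nlist, clus simply copies each vector into its own cluster and uses it as that cluster's center.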
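The sizing rules above can be summarized in a few lines. The following is a minimal sketch, not the Faiss code: plan_clustering is a hypothetical helper, and the defaults 39 and 256 are assumed values for min_points_per_centroid and max_points_per_centroid (256 is consistent with the log later in this post; check ClusteringParameters in your Faiss version for both).

```python
def plan_clustering(n, k, min_points_per_centroid=39, max_points_per_centroid=256):
    """Return (regime, n_used) for training k centroids from n vectors.

    Hypothetical helper that mirrors the rules described in the text;
    the default limits are assumptions, not read from the library.
    """
    if n < k:
        raise ValueError("need at least as many training vectors as centroids")
    if n == k:
        # Every training vector becomes a centroid; no k-means iterations.
        return "copy", n
    if n > k * max_points_per_centroid:
        # Too many points: train on a random subset of this size.
        return "subsample", k * max_points_per_centroid
    if n < k * min_points_per_centroid:
        # Too few points: training still runs, but a warning is printed.
        return "warn", n
    return "ok", n
```

With the values from the run shown later (n = 100000, k = 256) this returns ("subsample", 65536), which matches the "Sampling a subset of 65536 / 100000 for training" line in the log.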
train_residual
IndexIVFPQ overrides train_residual, so the implementation used here is:
void IndexIVFPQ::train_residual (idx_t n, const float *x)
{
    train_residual_o (n, x, nullptr);
}
void IndexIVFPQ::train_residual_o (idx_t n, const float *x, float *residuals_2)
{
    const float * x_in = x;
    // subsample at most nmax = pq.cp.max_points_per_centroid * pq.ksub vectors from the training set
    x = fvecs_maybe_subsample (
        d, (size_t*)&n, pq.cp.max_points_per_centroid * pq.ksub,
        x, verbose, pq.cp.seed);
    ScopeDeleter<float> del_x (x_in == x ? nullptr : x);
    const float *trainset;
    ScopeDeleter<float> del_residuals;
    if (by_residual) {
        if(verbose) printf("computing residuals\n");
        idx_t * assign = new idx_t [n]; // assignment to coarse centroids
        ScopeDeleter<idx_t> del (assign);
        quantizer->assign (n, x, assign);
        float *residuals = new float [n * d];
        del_residuals.set (residuals);
        for (idx_t i = 0; i < n; i++)
            quantizer->compute_residual (x + i * d, residuals+i*d, assign[i]);
        trainset = residuals;
    } else {
        trainset = x;
    }
    if (verbose)
        printf ("training %zdx%zd product quantizer on %ld vectors in %dD\n",
                pq.M, pq.ksub, n, d);
    pq.verbose = verbose;
    pq.train (n, trainset);
    // reorder the trained centroids (polysemous training); not executed in this walkthrough
    if (do_polysemous_training) {
        if (verbose)
            printf("doing polysemous training for PQ\n");
        PolysemousTraining default_pt;
        PolysemousTraining *pt = polysemous_training;
        if (!pt) pt = &default_pt;
        pt->optimize_pq_for_hamming (pq, n, trainset);
    }
    // prepare second-level residuals for refine PQ
    // (residuals left after pq.train, used for level-3 training; not executed in this walkthrough)
    if (residuals_2) {
        uint8_t *train_codes = new uint8_t [pq.code_size * n];
        ScopeDeleter<uint8_t> del (train_codes);
        pq.compute_codes (trainset, train_codes, n);
        for (idx_t i = 0; i < n; i++) {
            const float *xx = trainset + i * d;
            float * res = residuals_2 + i * d;
            pq.decode (train_codes + i * pq.code_size, res);
            for (int j = 0; j < d; j++)
                res[j] = xx[j] - res[j];
        }
    }
    if (by_residual) {
        precompute_table ();
    }
}
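The function first calls fvecs_maybe_subsample, which caps the training set at pq.cp.max_points_per_centroid * pq.ksub vectors (65536 in the run below). When the input is larger, the vectors actually used for training are a random subset of the original training set.
train_residual delegates to train_residual_o to carry out level-2 training. IndexIVFPQ performs only two levels of training: do_polysemous_training is false in this run, and because residuals_2 is nullptr, the second-level residuals that a level-3 refine_pq.train would need are not computed.
Inside train_residual_o, quantizer->assign maps each training vector to its nearest coarse centroid and writes the centroid indices into assign. quantizer->compute_residual then uses these indices to compute each vector's residual (the vector minus its assigned centroid), and the residuals are passed to pq.train for level-2 training.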
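To make the assign and compute_residual steps concrete, here is a small pure-Python sketch. It is not the Faiss implementation: nearest_centroid and compute_residuals are hypothetical names, the search is brute force, and squared L2 distance is assumed.

```python
def nearest_centroid(v, centroids):
    """Index of the centroid closest to v under squared L2 distance."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(v, c))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def compute_residuals(vectors, centroids):
    """Assign each vector to a coarse centroid and subtract that centroid.

    Returns (assign, residuals); the residuals play the role of the
    training set handed to the product quantizer.
    """
    assign = [nearest_centroid(v, centroids) for v in vectors]
    residuals = [
        [a - b for a, b in zip(v, centroids[j])]
        for v, j in zip(vectors, assign)
    ]
    return assign, residuals
```

For example, with centroids [[0.0, 0.0], [10.0, 10.0]] the vector [9.0, 11.0] is assigned to centroid 1 and its residual is [-1.0, 1.0]. Residuals are typically much smaller in magnitude than the original vectors, which is why the second-level quantizer is trained on them.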
pq.train
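train_q1 calls clus.train to set the cluster centers, but since it is called only once there, the result is only a coarse clustering.
pq.train splits each training vector into 16 sub-vectors (pq.M = 16, so 4 dimensions each for the 64-dimensional vectors in the run below) and calls clus.train separately on each slice, running k-means with pq.ksub = 256 centroids per slice.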
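The slicing that pq.train relies on can be sketched as follows. This is an illustration only: split_subvectors and pq_encode are hypothetical names, and codebooks stands in for the per-slice centroids that k-means would produce.

```python
def split_subvectors(v, M):
    """Split a d-dimensional vector into M contiguous sub-vectors of d // M dims."""
    d = len(v)
    if d % M != 0:
        raise ValueError("d must be a multiple of M")
    dsub = d // M
    return [v[m * dsub:(m + 1) * dsub] for m in range(M)]


def pq_encode(v, codebooks):
    """Encode v as one centroid index per slice.

    codebooks[m] is a (hypothetical) list of centroids for slice m,
    each of dimension len(v) // len(codebooks).
    """
    code = []
    for sub, centroids in zip(split_subvectors(v, len(codebooks)), codebooks):
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
        code.append(dists.index(min(dists)))
    return code
```

Splitting a 64-dimensional vector with M = 16 yields 16 slices of 4 dimensions, which is why the log reports "Clustering 65536 points in 4D to 256 clusters" once per slice. With 256 centroids per slice, each index fits in one byte, so a code takes 16 bytes per vector.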
precompute_table
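Once the cluster centers are computed, precompute_table precomputes tables of distance terms between the coarse centroids and the PQ centroids (the log below reports "precomputing IVFPQ tables type 1"), so that these terms do not have to be recomputed during search.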
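As far as I understand the comments in the Faiss IVFPQ source, the precomputed table rests on splitting the L2 distance between a query x and a reconstructed vector y_C + y_R (coarse centroid plus PQ centroid) into three terms, of which the middle one does not depend on the query. The sketch below only checks that algebra numerically; the function names are hypothetical and this is not the library's code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def direct_distance(x, y_c, y_r):
    """||x - y_c - y_r||^2 computed directly."""
    diff = [xi - ci - ri for xi, ci, ri in zip(x, y_c, y_r)]
    return dot(diff, diff)


def decomposed_distance(x, y_c, y_r):
    """Same distance as the sum of three terms."""
    d1 = [xi - ci for xi, ci in zip(x, y_c)]
    term1 = dot(d1, d1)                           # ||x - y_c||^2, from the coarse search
    term2 = dot(y_r, y_r) + 2 * dot(y_c, y_r)     # query-independent, can be tabulated
    term3 = -2 * dot(x, y_r)                      # depends on the query, computed at search time
    return term1 + term2 + term3
```

For x = [1.0, 2.0], y_c = [0.5, 1.0], y_r = [0.25, -0.5], both functions give 2.3125. Because term2 involves only centroids, it can be computed once after training, which is presumably why the call sits at the end of train_residual_o.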
Run output
IVF-PQ training...
Training level-1 quantizer
Training level-1 quantizer on 100000 vectors in 64D
Sampling a subset of 65536 / 100000 for training
Clustering 65536 points in 64D to 256 clusters, redo 1 times, 10 iterations
Preprocessing in 0.02 s
Iteration 9 (8.50 s, search 7.91 s): objective=304051 imbalance=1.005 nsplit=0
Training IVF residual
Input training set too big (max size is 65536), sampling 65536 / 100000 vectors
computing residuals
training 16x256 product quantizer on 65536 vectors in 64D
Training PQ slice 0/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.06 s, search 19.53 s): objective=1537.02 imbalance=1.049 nsplit=0
Training PQ slice 1/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.70 s, search 20.21 s): objective=1470.22 imbalance=1.038 nsplit=0
Training PQ slice 2/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.65 s, search 20.08 s): objective=1470.22 imbalance=1.037 nsplit=0
Training PQ slice 3/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.68 s, search 19.09 s): objective=1471.52 imbalance=1.033 nsplit=0
Training PQ slice 4/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (22.18 s, search 20.61 s): objective=1466.54 imbalance=1.037 nsplit=0
Training PQ slice 5/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.59 s, search 19.99 s): objective=1473.35 imbalance=1.033 nsplit=0
Training PQ slice 6/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (22.56 s, search 20.94 s): objective=1470.36 imbalance=1.031 nsplit=0
Training PQ slice 7/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (22.81 s, search 21.24 s): objective=1474.88 imbalance=1.030 nsplit=0
Training PQ slice 8/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.11 s, search 18.61 s): objective=1475.37 imbalance=1.033 nsplit=0
Training PQ slice 9/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.16 s, search 18.62 s): objective=1471.15 imbalance=1.034 nsplit=0
Training PQ slice 10/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.30 s, search 18.82 s): objective=1472.53 imbalance=1.034 nsplit=0
Training PQ slice 11/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.12 s, search 19.62 s): objective=1469.52 imbalance=1.028 nsplit=0
Training PQ slice 12/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.01 s, search 19.47 s): objective=1477.39 imbalance=1.039 nsplit=0
Training PQ slice 13/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.56 s, search 19.00 s): objective=1471.21 imbalance=1.034 nsplit=0
Training PQ slice 14/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (20.52 s, search 18.99 s): objective=1467.99 imbalance=1.035 nsplit=0
Training PQ slice 15/16
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (21.22 s, search 19.64 s): objective=1476.48 imbalance=1.032 nsplit=0
precomputing IVFPQ tables type 1
IVF-PQ Train done! Time: 348.63287377357483
IVF-PQ ntotal after training: 0
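A quick check on where the training time goes, using only numbers copied from the log above. It assumes the first figure on each final "Iteration" line is the elapsed time of that clustering run; training_time_breakdown is a hypothetical helper.

```python
# Elapsed seconds reported on the last iteration line of each clustering run.
LEVEL1_KMEANS_S = 8.50
PQ_SLICE_KMEANS_S = [
    21.06, 21.70, 21.65, 20.68, 22.18, 21.59, 22.56, 22.81,
    20.11, 20.16, 20.30, 21.12, 21.01, 20.56, 20.52, 21.22,
]
TOTAL_TRAIN_S = 348.63287377357483


def training_time_breakdown(level1_s, slice_s, total_s):
    """Return (seconds spent in k-means, seconds spent elsewhere)."""
    kmeans_s = level1_s + sum(slice_s)
    return kmeans_s, total_s - kmeans_s


kmeans_s, other_s = training_time_breakdown(
    LEVEL1_KMEANS_S, PQ_SLICE_KMEANS_S, TOTAL_TRAIN_S
)
```

Under that assumption, k-means accounts for about 347.73 s of the 348.63 s total, and the 16 PQ slices alone take roughly 339 s. Subsampling, residual computation and the precomputed table together take under a second.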
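Summary
IndexIVFPQ trains on the training set in two levels, with the second level trained on the residuals of the first.
- First, train_q1 runs coarse k-means (10 iterations) on at most 256 * nlist vectors.
- train_residual then computes the residuals against the coarse centroids and runs a finer-grained clustering on them: each residual vector is split into 16 sub-vectors, and k-means (25 iterations) computes the cluster centers for each slice.
- The trained index is the output.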
Note
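More training data is not always better. The faiss source sets a default maximum of 256 training vectors per centroid (max_points_per_centroid), so the training set for the coarse quantizer is capped at nlist * 256. If it is larger, a random subset is drawn from it.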