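Similarity Search Tools
1. Background
I wanted to implement similarity search over image feature vectors. The project is written in Java and uses PostgreSQL as its database. The options considered were:
- Vector database: Milvus. It is easy to deploy, has a visual interface (Attu), and provides a Java SDK, but it has to be deployed as a separate service.
- PostgreSQL extensions: Cube (supports 100 dimensions), Pase (supports 512 dimensions), and Vector, i.e. pgvector (supports 16000 dimensions for storage; its indexes support up to 2000 dimensions).
Because the extracted image feature vectors have 1024 dimensions, Cube and Pase are ruled out, which leaves Milvus and the PostgreSQL extension Vector.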
2. Usage
2.1 Milvus
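The Milvus website has detailed installation instructions and sample code, so they are not repeated here. Milvus was installed with Docker, version 2.2.9. Below is a simple utility class. The connection parameters are hardcoded rather than configurable, which readers may want to improve, and the result data is lightly formatted.
Result wrapper: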
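import lombok.Builder;
import lombok.Data;

// One search hit: similarity score plus the stored image path.
@Data
@Builder
public class MilvusRes {
    public float score;
    public String imagePath;
}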
Utility class:
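import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.MutationResult;
import io.milvus.grpc.SearchResultData;
import io.milvus.grpc.SearchResults;
import io.milvus.param.ConnectParam;
import io.milvus.param.MetricType;
import io.milvus.param.R;
import io.milvus.param.RpcStatus;
import io.milvus.param.collection.LoadCollectionParam;
import io.milvus.param.dml.InsertParam;
import io.milvus.param.dml.SearchParam;
import io.milvus.response.SearchResultsWrapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Slf4j
@Component
public class MilvusUtil {

    public MilvusServiceClient milvusServiceClient;

    @PostConstruct
    private void connectToServer() {
        // Host and port are hardcoded here; replace them with your own settings.
        milvusServiceClient = new MilvusServiceClient(
                ConnectParam.newBuilder()
                        .withHost("your service host")
                        .withPort(19530)
                        .build());
        // Load the collection into memory; Milvus can only search loaded collections.
        LoadCollectionParam faceSearchNewLoad = LoadCollectionParam.newBuilder()
                .withCollectionName("CollectionName")
                .build();
        R<RpcStatus> rpcStatusR = milvusServiceClient.loadCollection(faceSearchNewLoad);
        log.info("Milvus LoadCollection [{}]", rpcStatusR.getStatus());
    }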

    /**
     * Inserts one row and returns the Milvus status code (0 means success).
     */
    public int insertDataToMilvus(String id, String path, float[] feature) {
        // The SDK expects a float vector as List<Float>, so box the primitive array.
        List<Float> featureList = new ArrayList<>(feature.length);
        for (float v : feature) {
            featureList.add(v);
        }
        // Each Field holds a column of values; a single row is a one-element list per field.
        List<InsertParam.Field> fields = new ArrayList<>();
        fields.add(new InsertParam.Field("field1", Collections.singletonList(id)));
        fields.add(new InsertParam.Field("field2", Collections.singletonList(path)));
        fields.add(new InsertParam.Field("field3", Collections.singletonList(featureList)));
        InsertParam insertParam = InsertParam.newBuilder()
                .withCollectionName("CollectionName")
                //.withPartitionName("novel")
                .withFields(fields)
                .build();
        R<MutationResult> insert = milvusServiceClient.insert(insertParam);
        return insert.getStatus();
    }

    /**
     * Returns the top 10 most similar images by inner product.
     */
    public List<MilvusRes> searchImageByFeature(float[] feature) {
        List<Float> featureList = new ArrayList<>(feature.length);
        for (float v : feature) {
            featureList.add(v);
        }
        // Field names follow insertDataToMilvus: "field2" holds the image path,
        // "field3" holds the feature vector. Adjust them to your own schema.
        List<String> queryOutputFields = Arrays.asList("field2");
        SearchParam faceSearch = SearchParam.newBuilder()
                .withCollectionName("CollectionName")
                .withMetricType(MetricType.IP)
                .withVectorFieldName("field3")
                .withVectors(Collections.singletonList(featureList))
                .withOutFields(queryOutputFields)
                .withTopK(10)
                .build();
        // Run the search and log the elapsed time in milliseconds.
        long start = System.currentTimeMillis();
        R<SearchResults> respSearch = milvusServiceClient.search(faceSearch);
        log.info("MilvusServiceClient.search cost [{}]", System.currentTimeMillis() - start);
        // getData() is only usable when the call succeeded.
        if (respSearch.getStatus() != R.Status.Success.getCode() || respSearch.getData() == null) {
            log.error("Milvus search failed: {}", respSearch.getMessage());
            return Collections.emptyList();
        }
        // Parse the results; index 0 selects the hits of the first (and only) query vector.
        SearchResultData results = respSearch.getData().getResults();
        SearchResultsWrapper wrapperSearch = new SearchResultsWrapper(results);
        List<SearchResultsWrapper.IDScore> scores = wrapperSearch.getIDScore(0);
        List<?> imagePaths = wrapperSearch.getFieldData("field2", 0);
        List<MilvusRes> milvusResList = new ArrayList<>(scores.size());
        for (int i = 0; i < scores.size(); i++) {
            milvusResList.add(MilvusRes.builder()
                    .score(scores.get(i).getScore())
                    .imagePath(String.valueOf(imagePaths.get(i)))
                    .build());
        }
        return milvusResList;
    }
}
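The search above uses MetricType.IP (inner product). Inner product only ranks by angle when all vectors have the same length, so a common practice is to L2-normalize features before inserting and before searching; on unit-length vectors the inner product equals cosine similarity. Whether the features here are already normalized is not stated, so the following is a minimal sketch under that assumption. The class VectorMath and its methods are hypothetical helpers in plain Java, not part of the Milvus SDK.

```java
// Sketch only: VectorMath is a hypothetical helper, not part of the Milvus SDK.
final class VectorMath {
    private VectorMath() {}

    /** Returns a copy scaled to unit L2 length; a zero vector is returned unchanged. */
    static float[] l2Normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        float[] out = v.clone();
        if (sumSquares == 0.0) {
            return out;
        }
        double norm = Math.sqrt(sumSquares);
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (out[i] / norm);
        }
        return out;
    }

    /** Inner product of two vectors with the same number of dimensions. */
    static double innerProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                    "dimension mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }
}
```

With such a helper, one could pass VectorMath.l2Normalize(feature) to both insertDataToMilvus and searchImageByFeature, so that the returned score can be read as a cosine similarity.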
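[Figure: amount of data in the collection]
The performance test result (search latency in milliseconds) is as follows: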
MilvusServiceClient.search cost [24]
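A single timing like the one above can be skewed by connection setup or JIT warm-up. As a rough sketch of a more stable measurement, the helper below discards a few warm-up calls and then reports sorted latencies, from which a median can be taken. LatencyProbe is a hypothetical helper written for illustration, not part of any library, and the warm-up and run counts are arbitrary choices.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Sketch only: LatencyProbe is a hypothetical helper, not part of any library.
final class LatencyProbe {
    private LatencyProbe() {}

    /**
     * Calls the supplier `warmup` times without recording, then `runs` times with
     * recording, and returns the recorded latencies in milliseconds, sorted ascending.
     */
    static <T> double[] measureMillis(Supplier<T> call, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            call.get();
        }
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            call.get();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        Arrays.sort(millis);
        return millis;
    }

    /** Median of a non-empty array that is already sorted ascending. */
    static double median(double[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

For example, LatencyProbe.measureMillis(() -> milvusUtil.searchImageByFeature(feature), 5, 50) would time 50 searches after 5 warm-up calls, assuming milvusUtil and feature are available in the calling code.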
2.2 Vector
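The basics are covered on the following sites, so they are not repeated here.
- High-dimensional vector similarity search (pgvector) (aliyun.com)
- How to enable and use pgvector - Azure Database for PostgreSQL Flexible Server | Microsoft Learn
PostgreSQL is deployed with Docker, version 12.12. The extension is installed as follows: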
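# Enter the container
docker exec -it <CONTAINER_ID> /bin/bash
# 1. Update the apt package index
apt-get update
# Installing without updating first fails with errors like the following
# (this sample output comes from installing the postgis packages)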
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package postgresql-12-postgis-3
E: Unable to locate package postgresql-12-postgis-3-dbgsym
E: Unable to locate package postgresql-12-postgis-3-scripts
# 2. Install the extension
apt-get install postgresql-12-pgvector
Database operations:
-- Enable the vector extension
CREATE EXTENSION vector;
-- List the available extensions
SELECT * FROM pg_available_extensions;
-- Create the table
CREATE TABLE "public"."test" (
    "field1" VARCHAR ( 64 ),
    "field2" VARCHAR ( 128 ),
    "field3" vector ( 1024 ),
    CONSTRAINT "test_pkey" PRIMARY KEY ( "field1" )
);
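To write rows into this table from Java without an extra client library, one option is to send the vector in pgvector's text form, such as [0.1,0.2,0.3], and cast it with ::vector in the SQL. The sketch below only builds that text and the SQL string, so it runs without a database. PgVectorText is a hypothetical helper, and the rejection of empty, NaN, and infinite inputs reflects my understanding of what pgvector accepts rather than anything stated in this post.

```java
// Sketch only: PgVectorText is a hypothetical helper, not part of pgvector or JDBC.
final class PgVectorText {
    private PgVectorText() {}

    /** Insert statement for the "test" table above; bind the third parameter as a string. */
    static final String INSERT_SQL =
            "INSERT INTO test (field1, field2, field3) VALUES (?, ?, ?::vector)";

    /** Formats a float array as pgvector text, for example [1.0,0.5,-2.0]. */
    static String toLiteral(float[] v) {
        if (v.length == 0) {
            throw new IllegalArgumentException("vector must have at least one dimension");
        }
        StringBuilder sb = new StringBuilder(v.length * 10 + 2).append('[');
        for (int i = 0; i < v.length; i++) {
            if (!Float.isFinite(v[i])) {
                throw new IllegalArgumentException("non-finite value at index " + i);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }
}
```

With a java.sql.PreparedStatement created from PgVectorText.INSERT_SQL, the call would look like ps.setString(3, PgVectorText.toLiteral(feature)) after binding the id and path as the first two parameters.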
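When creating the index, choose the operator class that matches the distance metric used in queries. The statements below are alternatives; create only one of them:
-- Create the index
-- Without an operator class, ivfflat uses the default vector_l2_ops (L2 distance)
CREATE INDEX ON test USING ivfflat ( field3);
-- Inner product, with different numbers of lists
CREATE INDEX ON test USING ivfflat ( field3 vector_ip_ops) WITH (lists = 50);
CREATE INDEX ON test USING ivfflat ( field3 vector_ip_ops) WITH (lists = 500);
CREATE INDEX ON test USING ivfflat ( field3 vector_ip_ops) WITH (lists = 1024);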
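The pairing between metric, operator class, and query operator is easy to get wrong, because an index is only used when the query orders by the operator that matches its operator class. The sketch below encodes the pairings documented by pgvector (vector_l2_ops with <->, vector_ip_ops with <#>, vector_cosine_ops with <=>) and generates the SQL. The enum PgVectorMetric and its methods are hypothetical, and the lists heuristic (rows / 1000 up to one million rows, the square root of rows above that) is the starting point suggested in the pgvector README, not a value tuned for this dataset.

```java
// Sketch only: PgVectorMetric is a hypothetical helper; the operator classes and
// operators are the ones documented by pgvector.
enum PgVectorMetric {
    L2("vector_l2_ops", "<->"),
    INNER_PRODUCT("vector_ip_ops", "<#>"),   // <#> returns the NEGATIVE inner product
    COSINE("vector_cosine_ops", "<=>");

    final String opClass;
    final String operator;

    PgVectorMetric(String opClass, String operator) {
        this.opClass = opClass;
        this.operator = operator;
    }

    /** Starting value for ivfflat lists: rows / 1000 up to 1M rows, sqrt(rows) above. */
    static int suggestedLists(long rows) {
        if (rows < 0) {
            throw new IllegalArgumentException("rows must not be negative");
        }
        long lists = rows <= 1_000_000L ? rows / 1000 : (long) Math.sqrt((double) rows);
        return (int) Math.max(1, lists);
    }

    /** Table and column are concatenated as-is, so pass trusted identifiers only. */
    String createIndexSql(String table, String column, long rows) {
        return "CREATE INDEX ON " + table + " USING ivfflat (" + column + " " + opClass
                + ") WITH (lists = " + suggestedLists(rows) + ")";
    }

    /** Nearest-neighbour query; bind the query vector as text to the single parameter. */
    String topKSql(String table, String column, int k) {
        return "SELECT * FROM " + table + " ORDER BY " + column + " " + operator
                + " ?::vector LIMIT " + k;
    }
}
```

For a table of about 500,000 rows, PgVectorMetric.INNER_PRODUCT.createIndexSql("test", "field3", 500_000) yields a statement with lists = 500. Because <#> returns the negative inner product, ascending order puts the most similar rows first, and the value has to be multiplied by -1 to get a score comparable to the Milvus IP score.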