Daily Source Reading --- Day 1: similarity_search_with_score_by_vector

一只天蝎

Published 2024-09-04 16:03:25

Categories: LLM Learning, Programming Languages---Python. Tags: machine learning, artificial intelligence, python

Article link: https://blog.csdn.net/weixin_45880844/article/details/141895231

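# Find the documents in the collection most similar to a given embedding vector.
# Excerpt: this is a method of LangChain's FAISS vector store class. The imports below
# are the module-level names it relies on, assuming the langchain_community package layout.
import operator
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import numpy as np
from langchain_core.documents import Document
from langchain_community.vectorstores.faiss import dependable_faiss_import
from langchain_community.vectorstores.utils import DistanceStrategy
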
def similarity_search_with_score_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[Union[Callable, Dict[str, Any]]] = None,
        fetch_k: int = 20,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Return docs most similar to query.

        Args:
            embedding: Embedding vector to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            filter (Optional[Union[Callable, Dict[str, Any]]]): Filter by metadata.
                Defaults to None. If a callable, it must take as input the
                metadata dict of Document and return a bool.
            fetch_k: (Optional[int]) Number of Documents to fetch before filtering.
                      Defaults to 20.
            **kwargs: kwargs to be passed to similarity search. Can include:
                score_threshold: Optional, a floating point value between 0 to 1 to
                    filter the resulting set of retrieved docs

        Returns:
            List of documents most similar to the query text and L2 distance
            in float for each. Lower score represents more similarity.
        """
        # Import faiss, a library for efficient similarity search and clustering of dense vectors.
        faiss = dependable_faiss_import()
        # Wrap the embedding in a 2D float32 array of shape (1, dim), the format faiss expects.
        vector = np.array([embedding], dtype=np.float32)
        # If _normalize_L2 is set, L2-normalize the query vector in place.
        if self._normalize_L2:
            faiss.normalize_L2(vector)
        # Search the index: request k hits, or fetch_k hits when a filter will be applied afterwards.
        scores, indices = self.index.search(vector, k if filter is None else fetch_k)
        docs = []
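        # If a filter was given, turn it into a predicate over document metadata.
        # The loop below pairs each hit with the score already returned by the index search.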
        if filter is not None:
            filter_func = self._create_filter_func(filter)

        for j, i in enumerate(indices[0]):
            if i == -1:
                # This happens when not enough docs are returned.
                continue
            _id = self.index_to_docstore_id[i]
            doc = self.docstore.search(_id)
            if not isinstance(doc, Document):
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            if filter is not None:
                if filter_func(doc.metadata):
                    docs.append((doc, scores[0][j]))
            else:
                docs.append((doc, scores[0][j]))
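        # If score_threshold is given, keep scores >= threshold for MAX_INNER_PRODUCT and JACCARD,
        # and scores <= threshold for every other distance strategy.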
        score_threshold = kwargs.get("score_threshold")
        if score_threshold is not None:
            cmp = (
                operator.ge
                if self.distance_strategy
                in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
                else operator.le
            )
            docs = [
                (doc, similarity)
                for doc, similarity in docs
                if cmp(similarity, score_threshold)
            ]
        # Return at most k (document, score) tuples; fewer if filtering or the threshold removed some.
        return docs[:k]
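
The control flow above (search, optional filter, optional threshold, truncate to k) can be sketched without faiss. This is a minimal illustration, not the library's implementation: `toy_search` and `squared_l2` are hypothetical names, the brute-force sort stands in for `self.index.search`, and squared L2 distance is assumed as the metric, so the threshold always uses `<=` (the real method switches to `>=` for MAX_INNER_PRODUCT and JACCARD).

```python
from typing import Callable, Dict, List, Optional, Tuple

Metadata = Dict[str, str]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance; lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_search(
    query: List[float],
    vectors: List[List[float]],
    metadatas: List[Metadata],
    k: int = 4,
    filter_func: Optional[Callable[[Metadata], bool]] = None,
    fetch_k: int = 20,
    score_threshold: Optional[float] = None,
) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for the method above; returns (row index, score) pairs."""
    # 1. Rank every stored vector by distance (the index does this in the real code).
    ranked = sorted(
        ((i, squared_l2(query, v)) for i, v in enumerate(vectors)),
        key=lambda pair: pair[1],
    )
    # 2. Take k candidates, or fetch_k when a filter may discard some of them.
    candidates = ranked[: k if filter_func is None else fetch_k]
    # 3. Keep only candidates whose metadata passes the filter.
    hits = [
        (i, score)
        for i, score in candidates
        if filter_func is None or filter_func(metadatas[i])
    ]
    # 4. Apply the optional threshold (distance metric assumed, so keep score <= threshold).
    if score_threshold is not None:
        hits = [(i, s) for i, s in hits if s <= score_threshold]
    # 5. Truncate to k results.
    return hits[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
metadatas = [{"lang": "en"}, {"lang": "zh"}, {"lang": "en"}, {"lang": "zh"}]
query = [0.0, 0.0]

plain = toy_search(query, vectors, metadatas, k=2)
only_zh = toy_search(query, vectors, metadatas, k=2,
                     filter_func=lambda m: m["lang"] == "zh", fetch_k=2)
near = toy_search(query, vectors, metadatas, k=4, score_threshold=1.0)
```

In this toy data `only_zh` contains a single hit even though k=2. Only fetch_k=2 candidates are retrieved before filtering and one of them is rejected, which is why fetch_k usually needs to be comfortably larger than k when a filter is used.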

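embedding (List[float]): The embedding vector to look up similar documents for.
k (int, default 4): Number of documents to return.
filter (Optional[Union[Callable, Dict[str, Any]]], default None): Metadata filter applied to the results. It can be a callable or a dict; a callable must accept a document's metadata dict and return a bool.
fetch_k (int, default 20): Number of documents to retrieve before filtering is applied. It is only used when filter is not None.
kwargs (Any): Additional arguments passed to the similarity search. For example, score_threshold (Optional[float]): a float between 0 and 1 used to filter the set of retrieved documents.
Returns a list of (Document, score) tuples for the documents most similar to the query. With L2 distance the score is a float and a lower score means higher similarity; as the threshold code above shows, the comparison is reversed for MAX_INNER_PRODUCT and JACCARD, where scores at or above the threshold are kept.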
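
The filter parameter accepts either a callable or a dict, and the method normalizes both into one predicate through `self._create_filter_func`, whose body is not shown in this post. The sketch below is a simplified, hypothetical stand-in (`make_filter_func` is an invented name). It assumes that a dict filter requires every key to match and that a list value means "any of these values"; the real helper may support more than this.

```python
from typing import Any, Callable, Dict, Union

Metadata = Dict[str, Any]
Filter = Union[Callable[[Metadata], bool], Dict[str, Any]]


def make_filter_func(filter: Filter) -> Callable[[Metadata], bool]:
    """Hypothetical helper: normalize a filter into a metadata -> bool predicate."""
    if callable(filter):
        return filter  # already a predicate, use it as-is

    def matches(metadata: Metadata) -> bool:
        # Assumed semantics: every key must match; a list value means "any of these".
        for key, expected in filter.items():
            allowed = expected if isinstance(expected, list) else [expected]
            if metadata.get(key) not in allowed:
                return False
        return True

    return matches


by_dict = make_filter_func({"source": "wiki", "year": [2023, 2024]})
by_callable = make_filter_func(lambda m: m.get("year", 0) >= 2024)

doc_a = {"source": "wiki", "year": 2024}
doc_b = {"source": "blog", "year": 2024}
doc_c = {"source": "wiki", "year": 2022}

dict_results = [by_dict(m) for m in (doc_a, doc_b, doc_c)]
callable_results = [by_callable(m) for m in (doc_a, doc_b, doc_c)]
```

A dict is convenient for exact matches on metadata fields, while a callable covers anything a dict cannot express, such as the range check on `year` here.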