langchain delete bug: documents cannot be deleted, and ids end up scrambled after a deletion

langchain cannot delete: the definitive fix

@ Solved by: chiquitita-101
@ With help from: Howard_DL

File to modify: the delete_doc function in vectorstores/MyFAISS.py, under the langchain folder

Original code

    def delete_doc(self, source: str or List[str]):
        try:
            if isinstance(source, str):
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] == source]
                vs_path = os.path.join(os.path.split(os.path.split(source)[0])[0], "vector_store")
            else:
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] in source]
                vs_path = os.path.join(os.path.split(os.path.split(source[0])[0])[0], "vector_store")
            if len(ids) == 0:
                return f"docs delete fail"
            else:
                _reversed_index = {v: k for k, v in self.index_to_docstore_id.items()}
                index_to_delete = [_reversed_index[i] for i in ids]
                # Remove the matching ids from self.index
                self.index.remove_ids(np.array(index_to_delete, dtype=np.int64))  # removed from the faiss index
                for id in ids:
                    index = list(self.index_to_docstore_id.keys())[list(self.index_to_docstore_id.values()).index(id)]
                    self.index_to_docstore_id.pop(index)  # removed from index_to_docstore_id
                    self.docstore._dict.pop(id)  # removed from docstore._dict
                self.save_local(vs_path)
                return f"docs delete success"
        except Exception as e:
            print(e)
            return f"docs delete fail"

😜😜 Tempted to copy the fixed code straight away? How about a look at how deletion works first.

How deletion works

Three structures are involved: index, index_to_docstore_id, docstore._dict

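Cause of the bug: after a deletion, the vector idx values inside the faiss index stay contiguous, whereas the index_to_docstore_id dict is left with a gap once the matching idx is popped. For example, start with idx 0, 1, 2, 3 and delete 1 with remove_ids: the faiss index now holds vectors at idx 0, 1, 2, but the keys of index_to_docstore_id are 0, 2, 3, so they no longer match the vector idx in the faiss index.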
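The mismatch can be reproduced without faiss at all. The sketch below is only a simulation: a plain Python list stands in for a flat faiss index (assumed here to shift later vectors down after a removal, as described above), and the uuid_* and vec_* values are made up for illustration.

```python
# Simulation only: a Python list stands in for a flat faiss index,
# where removing a vector shifts the later vectors down by one.
vectors = ["vec_a", "vec_b", "vec_c", "vec_d"]  # positions 0..3
index_to_docstore_id = {0: "uuid_a", 1: "uuid_b", 2: "uuid_c", 3: "uuid_d"}

removed = 1
del vectors[removed]               # positions are now 0, 1, 2
index_to_docstore_id.pop(removed)  # keys are now 0, 2, 3

faiss_positions = list(range(len(vectors)))
dict_keys = sorted(index_to_docstore_id)
print(faiss_positions)  # [0, 1, 2]
print(dict_keys)        # [0, 2, 3]

# Position 1 now holds vec_c, but the dict has no key 1, and key 2
# (uuid_c) now lines up with the position that holds vec_d.
missing = [p for p in faiss_positions if p not in index_to_docstore_id]
print(missing)          # [1]
```

A search hit at position 1 would therefore fail to resolve, and a hit at position 2 would resolve to the wrong document.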

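self.index: loaded with index = faiss.read_index("xx.faiss")
self.index_to_docstore_id: a dict mapping an integer (int) to a uuid (str)
self.docstore._dict: a dict mapping a uuid (str) to a Document (an object holding one sentence chunk after splitting, together with its source path)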

self.index_to_docstore_id

Printed value:

type(self.index_to_docstore_id): <class 'dict'>

self.index_to_docstore_id: 
{0: '5b79d294-e5cb-43d7-8f29-fe4edfc6eb29', 
1: 'dac0d439-d844-4d65-8018-5cd4e0865807',
2: '1c70b121-aea3-4da9-ba51-347a40234213',
3: 'e265fbfa-7b4e-4906-9127-9c27443469ad',
4: 'afb78b0e-32c5-4a47-8196-b7c87f0b7a1f',
...}

self.docstore._dict:

Printed value:

type(self.docstore._dict): <class 'dict'>

self.docstore._dict: 
{'5b79d294-e5cb-43d7-8f29-fe4edfc6eb29': Document(page_content=' 沙漠一行精兵骑马缓缓而过,精兵身后背着弓箭,马匹缓慢的步子摇着挂瓶叮当作响。', metadata={'source': '/opt/ch/langchain-ChatGLM/knowledge_base/123/content/画皮剧本.txt'}),
...,
 'dac0d439-d844-4d65-8018-5cd4e0865807': Document(page_content='士兵甲:怎么整个山丘看起来都像桂花糕?', metadata={'source': '/opt/ch/langchain-ChatGLM/knowledge_base/123/content/画皮剧本.txt'})}
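To make the chain between the two dicts concrete, here is a minimal sketch of the lookups that delete_doc and a search both rely on. The Doc class is a hypothetical stand-in for langchain's Document, and the uuids and paths are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


index_to_docstore_id = {0: "uuid-0", 1: "uuid-1"}
docstore_dict = {
    "uuid-0": Doc("first chunk", {"source": "/data/a.txt"}),
    "uuid-1": Doc("second chunk", {"source": "/data/b.txt"}),
}


def resolve(position: int) -> Doc:
    """Map a vector position to its Doc: int -> uuid -> Doc."""
    return docstore_dict[index_to_docstore_id[position]]


def ids_for_source(source: str) -> list:
    """Collect the uuids of every chunk that came from `source`."""
    return [k for k, v in docstore_dict.items() if v.metadata["source"] == source]


print(resolve(1).page_content)        # second chunk
print(ids_for_source("/data/a.txt"))  # ['uuid-0']
```

Both directions depend on the integer keys matching the positions inside the faiss index, which is why a gap in the keys breaks retrieval.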

Fixed code


    def delete_doc(self, source: str or List[str]):
        try:
            if isinstance(source, str):
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] == source]
                vs_path = os.path.join(os.path.split(os.path.split(source)[0])[0], "vector_store")
            else:
                ids = [k for k, v in self.docstore._dict.items() if v.metadata["source"] in source]
                vs_path = os.path.join(os.path.split(os.path.split(source[0])[0])[0], "vector_store")
            if len(ids) == 0:
                return f"docs delete fail"
            else:
                _reversed_index = {v: k for k, v in self.index_to_docstore_id.items()}
                index_to_delete = [_reversed_index[i] for i in ids]
                # Delete
                # Remove the matching ids from self.index
                self.index.remove_ids(np.array(index_to_delete, dtype=np.int64))  # the faiss index
                for idx in index_to_delete:
                    self.index_to_docstore_id.pop(idx)  # dict: index -> uuid
                for id in ids:
                #    index = list(self.index_to_docstore_id.keys())[list(self.index_to_docstore_id.values()).index(id)]
                #    self.index_to_docstore_id.pop(index)
                    self.docstore._dict.pop(id)  # dict: uuid -> Document
                # Renumber the remaining keys so they are contiguous again
                index_to_docstore_id_items = sorted(self.index_to_docstore_id.items())  # e.g. keys 0,1,2,3 -> 0,1,3 -> 0,1,2
                for i in range(len(index_to_docstore_id_items)):
                    index_to_docstore_id_items[i] = (i, index_to_docstore_id_items[i][1])
                self.index_to_docstore_id.clear()
                self.index_to_docstore_id.update(index_to_docstore_id_items)
                self.save_local(vs_path)
                return f"docs delete success"
        except Exception as e:
            print(e)
            return f"docs delete fail"
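The renumbering step can be tried in isolation. The helper below is a hypothetical standalone version of it (remove_and_renumber is not part of langchain or MyFAISS.py), and it assumes, as the fix above does, that the faiss index keeps the surviving vectors in their original relative order.

```python
def remove_and_renumber(index_to_docstore_id: dict, positions_to_delete: list) -> dict:
    """Drop the given positions, then renumber the survivors 0..n-1 in order."""
    doomed = set(positions_to_delete)
    survivors = [
        uuid
        for pos, uuid in sorted(index_to_docstore_id.items())
        if pos not in doomed
    ]
    return dict(enumerate(survivors))


mapping = {0: "uuid_a", 1: "uuid_b", 2: "uuid_c", 3: "uuid_d"}
print(remove_and_renumber(mapping, [1]))
# {0: 'uuid_a', 1: 'uuid_c', 2: 'uuid_d'}
```

After this step the keys are 0, 1, 2 again, matching the contiguous positions left in the index.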