Hands-On Code: Building an Online Private Knowledge Base with the Assistant API

Article link: https://blog.csdn.net/qq_43588095/article/details/147422156

1. Overview of Vector Databases

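How vector databases differ from traditional databases: traditional databases (such as MySQL and Oracle) are searched with SQL statements and rely mainly on exact text matching, which suits keyword search. Vector databases are designed for semantic search: they store vectors and use a similarity measure (such as cosine similarity) to perform fuzzy matching, finding the stored vectors most similar to the query vector and returning the text associated with them.
How vector databases work: an embedding model converts each text chunk into a vector, mapping it to a point in a high-dimensional vector space. The similarity between two texts is then measured by the similarity of their vectors (for example, cosine similarity), which is what makes semantic search possible.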
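
To make the similarity step concrete, here is a minimal sketch of cosine-similarity search in plain Python. The three-dimensional vectors and the chunk labels are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query, chunks):
    """Return the text whose vector is closest to the query vector."""
    return max(chunks, key=lambda text: cosine_similarity(query, chunks[text]))

# Toy "embeddings" (made up for this example, not produced by a real model).
toy_chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "privacy notice": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

print(most_similar(query_vector, toy_chunks))
```

A vector database performs the same comparison at scale, usually with an approximate nearest-neighbor index instead of scanning every stored vector.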

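2. Working with Vector Databases in the Assistant API

Creating a vector database
- Interface and parameters: call client.vector_stores.create() to create a vector database. The main parameters are described below:

Parameter summary for client.vector_stores.create()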

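Method	Parameter	Sub-parameter/Option	Type	Required	Description	Example/Default
`client.vector_stores.create()`	`name`	-	`string`	No	Name of the vector store.	`"Legal_Docs"`
	`file_ids`	-	`array`	No	IDs of the files to attach at creation time (upload them first with `files.create()`).	`["file-123", "file-456"]`
	`chunking_strategy`	`type`	`string`	No	Chunking type: `"auto"` (default) or `"static"` (requires custom chunking parameters).	`"static"`
		`static.max_chunk_size_tokens`	`integer`	Required when `type="static"`	Maximum number of tokens per chunk (100 ≤ value ≤ 4096).	Default `800`
		`static.chunk_overlap_tokens`	`integer`	Required when `type="static"`	Number of tokens shared between adjacent chunks (must be ≤ `max_chunk_size_tokens/2`).	Default `400`
	`expires_after`	`anchor`	`string`	Yes (if an expiration policy is set)	Reference time for the expiration calculation; only `"last_active_at"` is supported.	`"last_active_at"`
		`days`	`integer`	Yes (if an expiration policy is set)	Number of days after `anchor` at which the store expires (1-365).	`30`
	`metadata`	-	`map`	No	Custom metadata (keys ≤ 64 characters, values ≤ 512 characters).	`{"category": "finance"}`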

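Key notes

Default behavior of chunking_strategy
- If it is not specified, type="auto" is used (chunks of 800 tokens with a 400-token overlap).
- The static type requires you to set max_chunk_size_tokens and chunk_overlap_tokens yourself, and the latter must not exceed 50% of the former.
Dependencies of expires_after
- anchor and days must be set together for the policy to take effect. For example, {"anchor": "last_active_at", "days": 7} means the store expires after 7 days of inactivity.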
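
The constraints above can be checked locally before any request is sent. The sketch below is one possible way to do that; build_chunking_strategy and build_expires_after are hypothetical helper names, not part of the OpenAI SDK, and the limits they enforce are the ones listed in the table above.

```python
def build_chunking_strategy(max_chunk_size_tokens=800, chunk_overlap_tokens=400):
    """Build a static chunking_strategy dict, enforcing the documented limits."""
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0 or chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError("chunk_overlap_tokens must be at most half of max_chunk_size_tokens")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }

def build_expires_after(days):
    """Build an expires_after dict; anchor and days always travel together."""
    if not 1 <= days <= 365:
        raise ValueError("days must be between 1 and 365")
    return {"anchor": "last_active_at", "days": days}

print(build_chunking_strategy(1000, 500))
print(build_expires_after(7))
```

The returned dictionaries can then be passed as the chunking_strategy and expires_after arguments of client.vector_stores.create().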

- Example code:

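# Create a vector database with the default chunking strategy
# client is an OpenAI client instance; its vector_stores.create() method creates the vector database
# The name parameter sets the vector database name to "llms_vector_store_1"
# The file_ids parameter lists the IDs of the files to store in this vector database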
vector_store_1 = client.vector_stores.create(
    name="llms_vector_store_1",
    file_ids=['file-ZYsCCgh19b9YTpq20iBRkQ2m', 'file-xzF4P1FUcFjYUP5jigQv1s4r', 'file-0dySxtvdUFHhEM7zrDNzxrwS']
)
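# Create a vector database with a custom chunking strategy
# Besides the name and the file ID list, the chunking_strategy parameter customizes how files are chunked
# type "static" selects the static (custom) chunking strategy
# In the static sub-dictionary, max_chunk_size_tokens sets the maximum number of tokens per chunk to 1000
# chunk_overlap_tokens sets the number of tokens shared between adjacent chunks to 500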
vector_store_2 = client.vector_stores.create(
    name="llms_vector_store_2",
    file_ids=['file-2zEAXzz8NSry24ZthHgfnCeF'],
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 1000,
            "chunk_overlap_tokens": 500
        }
    }
)

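File operations on a vector database
- List files: call client.files.list() to see the details of every file this client has uploaded to the OpenAI cloud servers; iterate over the data attribute to read each file's ID, filename, and other fields.
- Append a file to an existing vector database: call client.vector_stores.files.create() to add a file to the specified vector database. It accepts only one file per call and needs the vector database ID and the ID of the file to add.
  To add files in bulk, call client.vector_stores.file_batches.create() and pass several file IDs as a list in the file_ids field.
- Retrieve a file in a vector database: call client.vector_stores.files.retrieve() to see the chunking strategy applied when the file was stored, along with other details. It needs the vector database ID and the file ID.
  To see the storage details of the files that were added in a particular batch, call client.vector_stores.file_batches.list_files() with the vector database ID and the batch ID. To list every file in a vector database regardless of batch, client.vector_stores.files.list() takes only the vector database ID.
- Delete a file from a vector database: there are two ways. The first is client.vector_stores.files.delete(), which removes the file from the specified vector database only and needs the vector database ID and the file ID. The second is client.files.delete(), which deletes the file from the OpenAI cloud servers outright and therefore affects every service or tool that uses it.
- Update a vector database: call client.vector_stores.update() to change vector database properties such as the name, passing the vector database ID and the parameters to update (for example, name).
- Delete a vector database: call client.vector_stores.delete() with the ID of the vector database to delete the whole store. The deletion succeeded if the response contains deleted = true.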
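
Two of the operations above can be wrapped in small helpers, sketched here under some assumptions. list_uploaded_files and detach_file are hypothetical names; they only call client.files.list() and client.vector_stores.files.delete() as described in the list. So that the snippet runs without an API key, it is exercised against a stand-in object built from SimpleNamespace; with a real OpenAI() client the same functions would issue actual requests.

```python
from types import SimpleNamespace

def list_uploaded_files(client):
    """Return (file_id, filename) pairs from the data attribute of files.list()."""
    return [(f.id, f.filename) for f in client.files.list().data]

def detach_file(client, vector_store_id, file_id):
    """Remove a file from one vector store only; the uploaded file itself is kept."""
    result = client.vector_stores.files.delete(
        vector_store_id=vector_store_id,
        file_id=file_id,
    )
    return result.deleted

# Stand-in client with made-up IDs, used only to demonstrate the helpers offline.
fake_client = SimpleNamespace(
    files=SimpleNamespace(
        list=lambda: SimpleNamespace(data=[
            SimpleNamespace(id="file-aaa", filename="notes.pdf"),
            SimpleNamespace(id="file-bbb", filename="faq.md"),
        ])
    ),
    vector_stores=SimpleNamespace(
        files=SimpleNamespace(
            delete=lambda vector_store_id, file_id: SimpleNamespace(id=file_id, deleted=True)
        )
    ),
)

print(list_uploaded_files(fake_client))
print(detach_file(fake_client, "vs_demo", "file-aaa"))
```

Detaching through the vector store, rather than calling client.files.delete(), leaves the file available to any other feature that still references it.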

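3. Operational Notes

Where the chunking strategy appears: the chunking strategy cannot be read directly from a vector database object. A vector database can hold many files, added at different times and each with its own chunking strategy, so the strategy is bound to the file object. It appears only in the responses of operations that involve a file object, such as appending a file to a vector database or retrieving a file's details.
Impact of file operations: be careful when deleting files. If a file is used by several features at once (for example a local knowledge base, the code interpreter, or fine-tuning), deleting it from the OpenAI cloud servers may break those other features.
Importance of the retrieve method: retrieve is central to many object operations in the Assistant API. When choosing a vector database to work with, for example, passing the corresponding ID selects the specific object for subsequent operations. In an application, this can be used to let users switch between databases from the front end.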
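
As an illustration of the retrieve pattern, the sketch below looks up a vector database by name and then retrieves it by ID, which is roughly what a front-end "switch knowledge base" control would need. select_vector_store is a hypothetical helper, the stand-in client and its IDs are invented so the snippet runs offline, and only the first page returned by list() is searched.

```python
from types import SimpleNamespace

def select_vector_store(client, name):
    """Find a vector store by name on the first page of results, then retrieve it by ID."""
    for store in client.vector_stores.list().data:
        if store.name == name:
            return client.vector_stores.retrieve(vector_store_id=store.id)
    return None

# Stand-in data and client (made up for this example).
_stores = {
    "vs_one": SimpleNamespace(id="vs_one", name="contracts"),
    "vs_two": SimpleNamespace(id="vs_two", name="manuals"),
}
fake_client = SimpleNamespace(
    vector_stores=SimpleNamespace(
        list=lambda: SimpleNamespace(data=list(_stores.values())),
        retrieve=lambda vector_store_id: _stores[vector_store_id],
    )
)

selected = select_vector_store(fake_client, "manuals")
print(selected.id if selected else "not found")
```

Names are not guaranteed to be unique, so an application that lets users pick a store would normally keep the ID and call retrieve with it directly.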

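4. Inspecting Vector Database Object Attributes

The following code shows the detailed attributes of a vector database object:

# vector_store_1 is the vector database object created earlier
# to_dict() converts the vector database object to a dictionary so its attributes are easy to inspect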
vector_store_1.to_dict()

Running the code above returns the following result, annotated here with an explanation of each field:

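{
    # Unique identifier of the vector database
    'id': 'vs_101b5YM1VuUtVVRubxNu1MSa',
    # Creation timestamp of the vector database (Unix seconds)
    'created_at': 1727438119,
    # File count statistics
    'file_counts': {
        # Number of cancelled files
        'cancelled': 0,
        # Number of files that have finished processing
        'completed': 0,
        # Number of files that failed processing
        'failed': 0,
        # Number of files still being processed
        'in_progress': 3,
        # Total number of files
        'total': 3
    },
    # Timestamp of the vector database's last activity
    'last_active_at': 1727438119,
    # Metadata; an empty dictionary here
    'metadata': {},
    # Name of the vector database
    'name': 'llms_vector_store_1',
    # Object type; "vector_store" here
    'object': 'vector_store',
    # Status of the vector database; "in_progress" means files are still being processed
    'status': 'in_progress',
    # Number of bytes used; 0 here
    'usage_bytes': 0,
    # Expiration policy (anchor and days, counted from last_active_at); None here
    'expires_after': None,
    # Expiration time as an absolute timestamp; None here
    'expires_at': None
}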
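
Because a freshly created store reports status "in_progress", code that depends on the indexed files may need to wait. The following is a minimal polling sketch built on client.vector_stores.retrieve(); wait_until_ready and the FakeVectorStores stand-in are hypothetical and exist only so the example runs without network access, and the timing values are arbitrary.

```python
import time
from types import SimpleNamespace

def wait_until_ready(client, vector_store_id, poll_seconds=1.0, max_polls=60):
    """Poll the vector store until its status is no longer "in_progress"."""
    for _ in range(max_polls):
        store = client.vector_stores.retrieve(vector_store_id=vector_store_id)
        if store.status != "in_progress":
            return store
        time.sleep(poll_seconds)
    raise TimeoutError(f"{vector_store_id} was still in progress after {max_polls} polls")

class FakeVectorStores:
    """Stand-in that replays a fixed sequence of statuses (for demonstration only)."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def retrieve(self, vector_store_id):
        return SimpleNamespace(id=vector_store_id, status=next(self._statuses))

fake_client = SimpleNamespace(
    vector_stores=FakeVectorStores(["in_progress", "in_progress", "completed"])
)
store = wait_until_ready(fake_client, "vs_demo", poll_seconds=0)
print(store.status)
```

With a real client, the file_counts field of the returned object can also be checked to confirm that no file ended up in failed.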

5. Hands-On Code

from openai import OpenAI
client = OpenAI()

# Create a vector database (default chunking strategy)
vector_store_1 = client.vector_stores.create(
    name="llms_vector_store_1",
    file_ids=['file-ZYsCCgh19b9YTpq20iBRkQ2m', 'file-xzF4P1FUcFjYUP5jigQv1s4r', 'file-0dySxtvdUFHhEM7zrDNzxrwS']
)
print(vector_store_1.to_dict())

# Create a vector database (custom chunking strategy)
vector_store_2 = client.vector_stores.create(
    name="llms_vector_store_3",
    file_ids=['file-2zEAXzz8NSry24ZthHgfnCeF'],
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 1000,
            "chunk_overlap_tokens": 500
        }
    }
)
print(vector_store_2.to_dict())

# List all vector databases
vector_stores = client.vector_stores.list()
for vector_id in vector_stores.data:
    print(f"vector_id:{vector_id.id}-{vector_id.name}")

# Append a single file to an existing vector database
vector_store_file = client.vector_stores.files.create(
    vector_store_id="vs_Q2YAwM7TLVJ9yymhi0F7sKws",
    file_id="file-9hkFbPwsbnzltkh6oLdn7pjg"
)
print(vector_store_file.to_dict())

# Append several files to an existing vector database (batch)
vector_store_file_batch = client.vector_stores.file_batches.create(
    vector_store_id="vs_n0qrYqIMhN1NTWahmFrGsT2k",
    file_ids=["file-0dySxtvdUFHhEM7zrDNzxrwS", "file-cct6euj70Ek4WpMpasZUZ0LC"]
)
print(vector_store_file_batch.to_dict())

# Retrieve the details of a specific file in a specific vector database
vector_store_file = client.vector_stores.files.retrieve(
    vector_store_id="vs_n0qrYqIMhN1NTWahmFrGsT2k",
    file_id="file-0dySxtvdUFHhEM7zrDNzxrwS"
)
print(vector_store_file.to_dict())

# List the storage details of the files added to a vector database in a given batch
vector_store_files = client.vector_stores.file_batches.list_files(
    vector_store_id="vs_Q2YAwM7TLVJ9yymhi0F7sKws",
    batch_id="vsfb_da8915144e6144e582d3618dc5756265"
)
print(vector_store_files.to_dict())

# Delete a specific file from an existing vector database
deleted_vector_store_file = client.vector_stores.files.delete(
    vector_store_id="vs_n0qrYqIMhN1NTWahmFrGsT2k",
    file_id="file-2zEAXzz8NSry24ZthHgfnCeF"
)
print(deleted_vector_store_file)

# Retrieve a vector database
vector_store = client.vector_stores.retrieve(
    vector_store_id="vs_dab0nsKPP1IRuM45yfd70wg0"
)
print(vector_store.to_dict())

# Update the vector database name
vector_store = client.vector_stores.update(
    vector_store_id="vs_ztjSGgwqPnWxJsKcyJUnVmK1",
    name="test_kb_110"
)
print(vector_store.to_dict())

# Delete the vector database
deleted_vector_store = client.vector_stores.delete(
    vector_store_id="vs_ztjSGgwqPnWxJsKcyJUnVmK1"
)
print(deleted_vector_store.to_dict())