3.1 Chroma Usage Guide (Python)

Latest recommended article published on 2024-06-25 18:51:26

皮皮姑娘


Article link: https://blog.csdn.net/weixin_44078774/article/details/134739183


Initializing a persistent Chroma client

import chromadb

You can configure Chroma to save and load data on your local machine. Data is persisted automatically and loaded on startup if it exists.

client = chromadb.PersistentClient(path="/path/to/save/to")

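path is the directory where Chroma stores its database files on disk and from which it loads them on startup.

The client object has a few useful convenience methods.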

client.heartbeat() # returns a nanosecond heartbeat. Useful for making sure the client remains connected.
client.reset() # Empties and completely resets the database. ⚠️ This is destructive and not reversible.

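Running Chroma in client/server mode

Chroma can also be configured to run in client/server mode. In this mode, the Chroma client connects to a Chroma server running in a separate process.

To start the Chroma server, run the following command: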

chroma run --path /db_path

Then use the Chroma HTTP client to connect to the server:

import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

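That's it! With this one change, Chroma's API runs in client/server mode.

Using the Python HTTP-only client

If you run Chroma in client/server mode, you may not need the full Chroma library. Instead, you can install the chromadb-client package, a lightweight HTTP client for the server with a minimal dependency footprint.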

pip install chromadb-client

import chromadb
# Example setup of the client to connect to your chroma server
client = chromadb.HttpClient(host='localhost', port=8000)

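Note that the chromadb-client package is a subset of the full Chroma library and does not include all of its dependencies. If you want the full library, install the chromadb package instead. Most importantly, chromadb-client has no default embedding function: if you add() documents without embeddings, you must explicitly specify an embedding function and install its dependencies.

Using collections

Chroma lets you manage collections of embeddings through the collection primitive.

Creating, inspecting, and deleting collections

Chroma uses collection names in URLs, so names are subject to a few restrictions:

The name must be between 3 and 63 characters long;
The name must start and end with a lowercase letter or a digit, and may contain dots, dashes, and underscores in between;
The name must not contain two consecutive dots;
The name must not be a valid IP address.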
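The four naming rules can be checked locally before calling create_collection. The following is a minimal sketch using only the standard library; is_valid_collection_name is a hypothetical helper written for this guide, not part of the Chroma API, and Chroma's own validation may differ in detail.

```python
import ipaddress
import re

# Starts and ends with a lowercase letter or digit; dots, dashes and
# underscores are allowed in between; total length is 3 to 63 characters.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Return True if `name` satisfies the four naming rules listed above.

    Hypothetical helper for illustration; not Chroma's own validator.
    """
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:  # no two consecutive dots
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True  # not an IP address, so the name is acceptable
    return False  # parsed as a valid IP address
```

Running the check on the client side gives an early, readable error instead of a failed request.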

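A Chroma collection is created with a name and an optional embedding function. If you supply an embedding function, you must supply it every time you get the collection.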

collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)

If you have already generated the embeddings yourself, you can load them in directly:

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

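Note:
If you later want to get the collection (get_collection), you must do so with the same embedding function you supplied when creating it.

An embedding function takes text as input and performs tokenization and embedding. If no embedding function is supplied, Chroma uses its default Sentence Transformers model.

You can learn more about 🧬 embedding functions and how to create your own.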
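To make the idea of an embedding function concrete, here is a toy sketch of a callable that maps a list of texts to a list of fixed-length float vectors. Everything in it is an assumption for illustration: ToyHashEmbedding is a made-up class, the vectors carry no semantic meaning, and the call shape (a parameter named input, a list of vectors returned) reflects what recent chromadb releases appear to expect. Depending on your version, you may need to subclass the embedding function type that chromadb itself provides.

```python
import hashlib
from typing import List


class ToyHashEmbedding:
    """Deterministic toy embedder (hypothetical, NOT semantically meaningful)."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes of the digest to floats in [0, 1].
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors
```

The same text always produces the same vector, which is the property that matters when a collection is reopened with the embedding function it was created with.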

Use .get_collection to retrieve an existing collection by name and .delete_collection to delete one. You can also use .get_or_create_collection to get a collection if it exists, or create it if it does not.

collection = client.get_collection(name="test") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
collection = client.get_or_create_collection(name="test") # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
client.delete_collection(name="my_collection") # Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible

Collections have a few useful convenience methods.

collection.peek() # returns a list of the first 10 items in the collection
collection.count() # returns the number of items in the collection
collection.modify(name="new_name") # Rename the collection

Changing the distance function

create_collection also takes an optional metadata argument; setting its hnsw:space value customizes the distance method of the embedding space.

collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"}  # l2 is the default
)

Valid options for hnsw:space are "l2", "ip", and "cosine". The default is "l2", the squared L2 norm.

Distance	parameter	Equation
Squared L2	'l2'	$d = \sum\left(A_i-B_i\right)^2$
Inner product	'ip'	$d = 1.0 - \sum\left(A_i \times B_i\right)$
Cosine similarity	'cosine'	$d = 1.0 - \frac{\sum\left(A_i \times B_i\right)}{\sqrt{\sum\left(A_i^2\right)} \cdot \sqrt{\sum\left(B_i^2\right)}}$
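The three distances can be written out in a few lines of plain Python, which helps when sanity-checking the values a query returns. This is a sketch of the formulas only, with the inner-product and cosine options expressed as distances (1.0 minus the similarity); the function names are made up for this guide, and Chroma's HNSW index computes these internally.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """'l2': sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """'ip': 1.0 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """'cosine': 1.0 minus the cosine similarity (vectors must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For all three, a smaller value means a closer match. Only the cosine variant ignores vector length, which is why it is a common choice for text embeddings.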

Adding data to a collection

Use the .add method to add data to Chroma.

Raw documents:

collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

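If Chroma is passed a list of documents, it automatically tokenizes and embeds them with the collection's embedding function (the default is used if none was supplied when the collection was created). Chroma also stores the documents themselves. If a document is too large to embed with the chosen embedding function, an exception is raised.

Each document must have a unique associated id. Adding the same ID twice with .add stores only the initial value. An optional list of metadata dictionaries, one per document, can be supplied to store additional information and enable filtering.

Alternatively, you can supply a list of document-associated embeddings directly, and Chroma stores the associated documents without embedding them itself.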

collection.add(
    documents=["doc1", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

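If the supplied embeddings do not match the dimensionality of the collection, an exception is raised.

You can also store documents elsewhere and supply only a list of embeddings and metadata to Chroma. Use the ids to associate the embeddings with the documents stored elsewhere.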

collection.add(
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
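Because .add takes several parallel lists, a quick consistency check before the call can catch mismatched batches early. The sketch below is a hypothetical pre-flight helper (find_batch_problems is not a Chroma function); it only encodes the constraints described above: unique ids, one entry per id in every list, and a single embedding dimension.

```python
from typing import Any, Dict, List, Optional, Sequence


def find_batch_problems(
    ids: Sequence[str],
    documents: Optional[Sequence[str]] = None,
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    metadatas: Optional[Sequence[Dict[str, Any]]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(set(ids)) != len(ids):
        problems.append("ids must be unique")
    if documents is None and embeddings is None:
        problems.append("provide documents, embeddings, or both")
    columns = (
        ("documents", documents),
        ("embeddings", embeddings),
        ("metadatas", metadatas),
    )
    for name, values in columns:
        # Every supplied list needs exactly one entry per id.
        if values is not None and len(values) != len(ids):
            problems.append(f"{name} has {len(values)} entries, expected {len(ids)}")
    if embeddings is not None and len({len(v) for v in embeddings}) > 1:
        problems.append("all embeddings must have the same dimension")
    return problems
```

The helper cannot know the dimensionality of an existing collection, so a dimension mismatch against the collection itself is still reported by Chroma at insert time.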

Querying a collection

A Chroma collection can be queried in several ways with the .query method.

You can query with a set of query_embeddings.

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

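The query returns, in order, the n_results closest matches to each query_embedding. An optional where filter dictionary filters by the metadata associated with each document, and an optional where_document filter dictionary filters by the contents of the document.

If the supplied query_embeddings do not match the dimensionality of the collection, an exception is raised.

You can also query with a set of query_texts. Chroma first embeds each query_text with the collection's embedding function and then runs the query with the resulting embeddings.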

collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
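Since one call can carry several queries, the result is organized by column: each field holds one inner list per query. The sketch below regroups such a result into one list of row dictionaries per query. rows_per_query is a hypothetical helper, and the sample dictionary is hand-written to mirror the result layout as I understand it, not output captured from a real collection.

```python
from typing import Any, Dict, List

# Result column name -> key used in each row dict.
_SINGULAR = {"documents": "document", "metadatas": "metadata", "distances": "distance"}


def rows_per_query(result: Dict[str, Any]) -> List[List[Dict[str, Any]]]:
    """Regroup a column-oriented query result into rows, one list per query."""
    rows_by_query = []
    for q, ids in enumerate(result["ids"]):
        rows = []
        for i, item_id in enumerate(ids):
            row = {"id": item_id}
            for field, key in _SINGULAR.items():
                column = result.get(field)
                if column is not None:  # the field may have been excluded
                    row[key] = column[q][i]
            rows.append(row)
        rows_by_query.append(rows)
    return rows_by_query


# Hand-written sample for two queries (illustrative values only).
sample = {
    "ids": [["id1", "id2"], ["id3"]],
    "documents": [["doc1", "doc2"], ["doc3"]],
    "metadatas": [[{"chapter": "3"}, {"chapter": "3"}], [{"chapter": "29"}]],
    "distances": [[0.1, 0.4], [0.2]],
    "embeddings": None,
}
```

Calling rows_per_query(sample) yields two lists, the first with two rows and the second with one, each row carrying the id together with its document, metadata, and distance.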

You can also retrieve items from a collection by id with the .get method.

collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"}
)

.get also supports the where and where_document filters. If no ids are supplied, it returns all items in the collection that match the where and where_document filters.

Choosing which data is returned

When using get or query, the include parameter specifies which data to return: embeddings, documents, metadatas, and, for query only, distances. By default, Chroma returns the documents and metadatas, plus the distances of the results for query. Embeddings are excluded by default for performance reasons, and the ids are always returned. To choose the returned data, pass a list of the field names to the include parameter of query or get.

# Only get documents and ids
collection.get(
    include=["documents"]
)

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    include=["documents"]
)

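Using where filters

Chroma supports filtering queries by metadata and by document contents. The where filter filters by metadata, and the where_document filter filters by document contents.

Filtering by metadata

To filter on metadata, supply a where filter dictionary to the query. The dictionary must have the following structure: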

{
    "metadata_field": {
        <Operator>: <Value>
    }
}

Metadata filtering supports the following operators:

$eq - equal to (string, int, float)
$ne - not equal to (string, int, float)
$gt - greater than (int, float)
$gte - greater than or equal to (int, float)
$lt - less than (int, float)
$lte - less than or equal to (int, float)

Using the $eq operator is equivalent to giving the value directly in the where filter.

{
    "metadata_field": "search_string"
}

# is equivalent to

{
    "metadata_field": {
        "$eq": "search_string"
    }
}

NOTE

Where filters only search embeddings for which the key exists. If you run collection.get(where={"version": {"$ne": 1}}), items whose metadata has no version key are not returned.
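The comparison operators and the missing-key rule can be modeled with a small pure-Python matcher. This is only a sketch of the semantics described in this section, not Chroma's implementation; matches_field is a hypothetical name, and it handles a single-field filter with a single operator.

```python
import operator
from typing import Any, Dict

_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_field(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Evaluate a single-field where filter against one metadata dict."""
    field, condition = next(iter(where.items()))
    if field not in metadata:
        return False  # per the note: items without the key never match
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # shorthand form means equality
    op, value = next(iter(condition.items()))
    return _COMPARATORS[op](metadata[field], value)
```

With this model, matches_field({"author": "x"}, {"version": {"$ne": 1}}) is False even though the item has no version equal to 1, which is exactly the case the note warns about.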

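Filtering by document contents

To filter on document contents, supply a where_document filter dictionary to the query. The dictionary must have the following structure: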

# Filtering for a search_string
{
    "$contains": "search_string"
}

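Using logical operators

You can also combine multiple filters with the logical operators $and and $or.

The $and operator returns results that match all of the filters in the list.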

{
    "$and": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

The $or operator returns results that match any of the filters in the list.

{
    "$or": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

Using inclusion operators ($in and $nin)
The following inclusion operators are supported:

$in - a value is in predefined list (string, int, float, bool)
$nin - a value is not in predefined list (string, int, float, bool)

The $in operator returns results where the metadata attribute is in the supplied list:

{
  "metadata_field": {
    "$in": ["value1", "value2", "value3"]
  }
}

The $nin operator returns results where the metadata attribute is not in the supplied list:

{
  "metadata_field": {
    "$nin": ["value1", "value2", "value3"]
  }
}
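To see how the logical and inclusion operators compose, the sketch below evaluates a nested filter against plain metadata dictionaries. It is a model of the described semantics under stated assumptions, not Chroma's code: matches is a hypothetical helper, it covers only $and, $or, $in, $nin and equality, and it treats a missing key as a non-match, following the note earlier in this guide.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Recursively evaluate $and / $or / $in / $nin and plain equality."""
    if "$and" in where:
        return all(matches(metadata, sub) for sub in where["$and"])
    if "$or" in where:
        return any(matches(metadata, sub) for sub in where["$or"])
    field, condition = next(iter(where.items()))
    if field not in metadata:
        return False  # assumption: a missing key never matches
    if not isinstance(condition, dict):
        return metadata[field] == condition  # shorthand for $eq
    op, value = next(iter(condition.items()))
    if op == "$in":
        return metadata[field] in value
    if op == "$nin":
        return metadata[field] not in value
    if op == "$eq":
        return metadata[field] == value
    raise ValueError(f"operator {op!r} is not covered by this sketch")


# Chapter is 3 or 29, AND verse is 16 or 11.
where = {
    "$and": [
        {"chapter": {"$in": ["3", "29"]}},
        {"$or": [{"verse": "16"}, {"verse": "11"}]},
    ]
}
```

The same where dictionary can be passed unchanged to collection.get or collection.query; the local matcher is only a way to reason about which items it should select.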

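Practical examples
For additional examples and a demo of the inclusion operators, see the provided notebook.

Updating data in a collection

Any property of an item in a collection can be updated with .update.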

collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

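If an id is not found in the collection, an error is logged and the update is ignored. If documents are supplied without corresponding embeddings, the embeddings are recomputed with the collection's embedding function.

If the supplied embeddings do not match the dimensionality of the collection, an exception is raised.

Chroma also supports an upsert operation, which updates existing items or adds them if they do not exist yet.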

collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

If an id does not exist in the collection, the corresponding item is created as with add. Items with existing ids are updated as with update.
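The difference between add, update, and upsert comes down to how each treats an id that does or does not already exist. The toy class below models only that id handling as described in this guide, using a plain dictionary of documents. ToyCollection is a made-up stand-in, not a Chroma class, and it ignores embeddings, metadata, and logging.

```python
from typing import Dict, Sequence


class ToyCollection:
    """In-memory stand-in that mimics only the id semantics described above."""

    def __init__(self) -> None:
        self.items: Dict[str, str] = {}

    def add(self, ids: Sequence[str], documents: Sequence[str]) -> None:
        for item_id, doc in zip(ids, documents):
            # Adding an existing id keeps the value stored first.
            self.items.setdefault(item_id, doc)

    def update(self, ids: Sequence[str], documents: Sequence[str]) -> None:
        for item_id, doc in zip(ids, documents):
            if item_id in self.items:  # unknown ids are skipped
                self.items[item_id] = doc

    def upsert(self, ids: Sequence[str], documents: Sequence[str]) -> None:
        for item_id, doc in zip(ids, documents):
            self.items[item_id] = doc  # create or overwrite
```

In practice this means upsert is the safe choice for idempotent ingestion jobs, while add and update are useful when a duplicate or a missing id should not silently change data.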

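Deleting data from a collection

Chroma supports deleting items from a collection by id with .delete. The embeddings, documents, and metadata associated with each item are deleted. ⚠️ This is a destructive operation and cannot be undone.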

collection.delete(
    ids=["id1", "id2", "id3",...],
    where={"chapter": "20"}
)

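.delete also supports the where filter. If no ids are supplied, it deletes all items in the collection that match the where filter.

Authentication

Authentication can be configured only when Chroma runs in client/server mode.

Supported authentication methods: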

Authentication Method	Basic Auth (Pre-emptive)	Static API Token
Description	RFC 7617 Basic Auth with `user:password` base64-encoded `Authorization` header.	Static auth token in `Authorization: Bearer` or in `X-Chroma-Token:` headers.
Status	`Alpha`	`Alpha`
Server-Side Support	✅ `Alpha`	✅ `Alpha`
Client/Python	✅ `Alpha`	✅ `Alpha`
Client/JS	✅ `Alpha`	✅ `Alpha`

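Basic authentication

Server setup

Generating server-side credentials

Security practices
A good security practice is to store passwords securely. The example below hashes the plaintext password with bcrypt (currently the only hash function supported by Chroma's server-side authentication).

To generate the password hash, run the following command. Note that htpasswd must be installed on your system.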

htpasswd -Bbn admin admin > server.htpasswd

Running the server

Set the following environment variables:

export CHROMA_SERVER_AUTH_CREDENTIALS_FILE="server.htpasswd"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER='chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider'
export CHROMA_SERVER_AUTH_PROVIDER='chromadb.auth.basic.BasicAuthServerProvider'

Then run the server as usual:

chroma run --path /db_path

Client setup

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin:admin",
    )
)
client.heartbeat()  # this should work with or without authentication - it is a public endpoint

client.get_version()  # this should work with or without authentication - it is a public endpoint

client.list_collections()  # this is a protected endpoint and requires authentication
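For debugging with other HTTP tools, it can help to know what the Basic Auth header looks like on the wire. Per RFC 7617, it is the string user:password, base64-encoded, after the word Basic. The sketch below builds that header with the standard library; basic_auth_header is a hypothetical helper, and the Chroma client constructs the header for you when configured as above.

```python
import base64
from typing import Dict


def basic_auth_header(user: str, password: str) -> Dict[str, str]:
    """Build an RFC 7617 Basic Auth header for the given credentials."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Base64 is an encoding, not encryption, so Basic Auth should only be used over a connection you trust.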

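Static API token authentication

Tokens
Tokens must be alphanumeric ASCII strings. Tokens are case-sensitive.

Server setup

Security note
The current implementation of static API token authentication supports only environment-variable-based tokens.

Running the server

Set the following environment variables to use Authorization: Bearer test-token as your authentication header.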

export CHROMA_SERVER_AUTH_CREDENTIALS="test-token"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"

To use an authentication header of the form X-Chroma-Token: test-token instead, set one additional environment variable.

export CHROMA_SERVER_AUTH_CREDENTIALS="test-token"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"
export CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER="X_CHROMA_TOKEN"

Client setup

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
                      chroma_client_auth_credentials="test-token"))
client.heartbeat()  # this should work with or without authentication - it is a public endpoint

client.get_version()  # this should work with or without authentication - it is a public endpoint

client.list_collections()  # this is a protected endpoint and requires authentication
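The two token transports differ only in which HTTP header carries the token. The sketch below builds either header so the difference is easy to see or to reproduce with another HTTP tool. token_auth_header and its transport parameter are hypothetical; the value "X_CHROMA_TOKEN" is taken from the server setting shown earlier, while "AUTHORIZATION" as the name of the default transport is an assumption.

```python
from typing import Dict


def token_auth_header(token: str, transport: str = "AUTHORIZATION") -> Dict[str, str]:
    """Build the header that carries a static API token.

    `transport` mirrors the server-side transport header setting;
    "AUTHORIZATION" as the default label is an assumption.
    """
    if transport == "AUTHORIZATION":
        return {"Authorization": f"Bearer {token}"}
    if transport == "X_CHROMA_TOKEN":
        return {"X-Chroma-Token": token}
    raise ValueError(f"unknown transport: {transport!r}")
```

Whichever transport the server is configured for, the client must send the matching header, otherwise protected endpoints such as list_collections reject the request.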
