Vector Database Technology Series, Part 3: Introduction to Chroma

Latest recommended article published on 2025-04-09 18:52:09

恰恰虎

Tags: chromadb, database, vector

Article link: https://blog.csdn.net/tcy83/article/details/144943921

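I. Introduction

Chroma is an open-source, AI-native vector database designed to make it easier for developers to build large-model applications by bringing documents that carry knowledge, facts, and skills into large language models (LLMs). It provides a simple API for storing embeddings together with their metadata, embedding documents and queries, and searching embeddings. Its main characteristics are:

Lightweight: Chroma is a lightweight vector database built on top of a vector retrieval library. It needs no complex configuration or large-scale infrastructure, which makes it a good fit for small and medium-sized projects.
Easy to use: The API is simple and easy to integrate, so developers can get started quickly without elaborate setup.
Feature-rich: It supports storing embeddings and their metadata, embedding documents and queries, and searching embeddings.
Multi-language support: Client SDKs are available for Python and JavaScript, and the community provides clients for other languages.
Open source: Released under the Apache 2.0 license, with an active community that keeps driving it forward.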

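II. Core Concepts

1. Data organization components

Tenant

A tenant is a logical grouping of databases that models an organization or a user. One tenant can own multiple databases.

Collection

A collection is the grouping mechanism for embeddings, documents, and any additional metadata, and it is where they are stored. It is comparable to a table in a traditional database.

Document

A document is the raw chunk of text before it is vectorized. Note that it is not a file.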
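
Because a document is a chunk of text rather than a file, a long text is normally split before it is stored. The sketch below shows one simple way to do that, using fixed-size chunks with a small overlap. `split_into_chunks`, the chunk size, and the overlap are illustrative assumptions; Chroma neither provides nor prescribes a chunking strategy here.

```python
def split_into_chunks(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "Chroma stores text chunks, not files. Each chunk becomes one document."
for chunk in split_into_chunks(sample):
    print(repr(chunk))
```

Each string returned by the helper would then be stored as one document with its own id.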

Metadata

Metadata describes the attributes of a document. It is a set of key-value pairs whose values may be of type string, boolean, int, or float.
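
To make the type rule concrete, the following sketch checks a metadata dictionary before it is handed to Chroma. `is_valid_metadata` is a hypothetical helper written for this article, not part of the chromadb API, and it only encodes the four value types listed above.

```python
# Hypothetical helper (not part of chromadb): checks that a metadata dict
# only uses the value types listed above.
ALLOWED_TYPES = (str, bool, int, float)


def is_valid_metadata(metadata: dict) -> bool:
    """Return True if every key is a string and every value is str/bool/int/float."""
    return all(
        isinstance(key, str) and isinstance(value, ALLOWED_TYPES)
        for key, value in metadata.items()
    )


print(is_valid_metadata({"chapter": "3", "verse": 16, "score": 0.5, "draft": False}))
print(is_valid_metadata({"tags": ["a", "b"]}))  # a list is not one of the four types
```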

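2. Storage components

SQLite database

In Chroma's single-node mode, the data about tenants, databases, collections, and documents is all stored in a single SQLite database. As a lightweight relational database, SQLite gives Chroma a stable storage foundation.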

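3. Query processing components

Embedding function

Also called an embedding model, an embedding function exposes a uniform API that turns raw text into vectors. Chroma supports embedding models from several providers, such as OpenAI and Gemini; the default is all-MiniLM-L6-v2.

Distance function

A distance function measures the difference (distance) between two embedding vectors. Chroma supports cosine distance, Euclidean distance (L2), and inner product (IP). During a query, Chroma compares the input vector against the stored vectors using the selected metric and returns the most similar ones.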
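
The three metrics can be written out in a few lines of plain Python. The sketch below follows the formulas given in the Chroma documentation as I understand them: L2 is reported as the squared Euclidean distance, and both inner product and cosine are reported as one minus the similarity, so a smaller value always means a closer match. The function names are illustrative, and Chroma computes these inside its index rather than in Python.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: sum of (a_i - b_i)^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: list[float], b: list[float]) -> float:
    """One minus the dot product of a and b."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """One minus the cosine similarity of a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


u = [1.0, 0.0]
v = [0.0, 1.0]
print(squared_l2(u, v), inner_product_distance(u, v), cosine_distance(u, v))
print(squared_l2(u, u), inner_product_distance(u, u), cosine_distance(u, u))
```

Note that the inner-product form only behaves like a distance when the vectors are normalized to unit length.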

III. Basic Operations

1. Install the chromadb library

All of the examples below use Python.

pip install chromadb

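2. Create a client

Chroma supports three types of client.

(1) Non-persistent (in-memory) client

This client runs in memory and suits scenarios where the data does not need to be persisted, such as debugging and experiments.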

import chromadb

client = chromadb.Client()

(2) Persistent client

The local storage path can be set when the client is created.

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection")

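(3) HTTP mode

The first two are local modes, in which the Chroma server side and the client must be on the same machine. In client/server mode the server can be deployed independently and accessed through HttpClient.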

import chromadb

chroma_client = chromadb.HttpClient(host='localhost', port=8000)

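The rest of this article uses the persistent client.

3. Create a collection

The following parameters can be set when creating a collection:

name: the name of the collection; required.
embedding_function: the embedding function to use; if omitted, the default embedding model is used.
metadata: collection metadata, such as the index settings; optional.

Here the collection is created with get_or_create_collection, so that a new collection is not created on every run. The default embedding model is used.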

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

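Once the collection has been created, a chroma.sqlite3 file appears in the local myCollection2 directory.

4. Write data

The following parameters are used when writing data:

documents: the raw text chunks.
metadatas: metadata describing each text chunk, as key-value pairs.
ids: the unique identifier of each text chunk.
embeddings: for text chunks that have already been vectorized, the vectors can be written directly. If omitted, the specified (or default) embedding function vectorizes the documents at write time.

The following data is written with upsert, which updates a record if its id already exists and inserts it otherwise.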

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges"
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"]
)
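
The difference between a plain insert and an upsert can be modeled with an ordinary dictionary keyed by id. This is only a mental model of the behavior described above, not how Chroma stores data; `upsert_records` and `store` are hypothetical names.

```python
def upsert_records(store: dict, ids: list, documents: list, metadatas: list) -> None:
    """Insert each record, or overwrite it if its id is already present."""
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError("ids, documents and metadatas must have the same length")
    for record_id, document, metadata in zip(ids, documents, metadatas):
        store[record_id] = {"document": document, "metadata": metadata}


store = {}
batch = {
    "ids": ["id1", "id2"],
    "documents": ["This is a document about dog", "This is a document about oranges"],
    "metadatas": [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
}

upsert_records(store, **batch)
upsert_records(store, **batch)  # a second run overwrites instead of duplicating
print(len(store))
```

This is why the examples in this article can be rerun without piling up duplicate records.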

After the write, the stored data can be inspected in the SQLite file.
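
One way to look inside the file is Python's built-in sqlite3 module. The sketch below only lists the table names, because Chroma's table layout is an internal detail that may differ between versions. The path is assumed from the PersistentClient example above, and `list_tables` is a helper defined here, not part of chromadb.

```python
import os
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in the SQLite file at db_path."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Assumed location, based on PersistentClient(path="/chroma/myCollection2").
db_path = "/chroma/myCollection2/chroma.sqlite3"
if os.path.exists(db_path):
    print(list_tables(db_path))
```

The existence check avoids creating an empty database file by accident when the path is wrong.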

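5. Query data

(1) Vector query

Use query to run a vector similarity search.

query_texts: the text chunks to search for.
n_results: the number of results to return; the nearest matches are returned in order of decreasing similarity, that is, increasing distance.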

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges"
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"]
)


results = collection.query(
    query_texts=["This is a query document about cat"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

print(results)

The printed result is as follows:

{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document about dog', 'This is a document about oranges']], 'uris': None, 'data': None, 'metadatas': [[{'chapter': '3', 'verse': '16'}, {'chapter': '3', 'verse': '5'}]], 'distances': [[0.9647536639619662, 1.4269337554105601]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

The first entry has the smallest distance, so it is the most similar to the query.
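
The result is nested one level deeper than one might expect: each of ids, documents, metadatas, and distances holds one inner list per query text. The sketch below flattens the output printed above into rows. `rows_for_query` is a hypothetical helper, and the dictionary is copied from the printed result with the None fields and the included list left out.

```python
def rows_for_query(results: dict, query_index: int = 0) -> list[tuple]:
    """Return (id, document, metadata, distance) tuples for one query text."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    ))


results = {
    "ids": [["id1", "id2"]],
    "documents": [["This is a document about dog", "This is a document about oranges"]],
    "metadatas": [[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}]],
    "distances": [[0.9647536639619662, 1.4269337554105601]],
}

for record_id, document, metadata, distance in rows_for_query(results):
    print(f"{record_id}  {distance:.4f}  {document}  {metadata}")
```

With several query texts, query_index selects which query's matches to read.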

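(2) Metadata filtering

Chroma also supports filtering on metadata. For example, to find the most similar results among the records whose verse metadata is 5: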

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges"
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"]
)


results = collection.query(
    query_texts=["This is a query document about cat"], # Chroma will embed this for you
    n_results=2,  # how many results to return
    where={
        "verse": {
            "$eq": "5"
        }
    }
)

print(results)

The query result is as follows:

{'ids': [['id2']], 'embeddings': None, 'documents': [['This is a document about oranges']], 'uris': None, 'data': None, 'metadatas': [[{'chapter': '3', 'verse': '5'}]], 'distances': [[1.4269337554105601]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Only the record with id id2 matches the filter, so it is the only one returned.
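
The type of the value in the filter matters: the metadata above stores verse as the string "5", so the filter has to use a string as well. The sketch below is a simplified model of the $eq comparison in plain Python, not Chroma's implementation, and `matches_eq` is a hypothetical helper. In this model an integer 5 does not match the stored string; Chroma is expected to behave the same way, but that is worth verifying against the version in use.

```python
def matches_eq(metadata: dict, where: dict) -> bool:
    """Evaluate a filter of the form {"field": {"$eq": value}} against one record."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        stored, expected = metadata[field], condition["$eq"]
        # Compare the type as well as the value, so "5" (str) and 5 (int) differ.
        if type(stored) is not type(expected) or stored != expected:
            return False
    return True


records = {
    "id1": {"chapter": "3", "verse": "16"},
    "id2": {"chapter": "3", "verse": "5"},
}

string_filter = {"verse": {"$eq": "5"}}
int_filter = {"verse": {"$eq": 5}}

string_matches = [rid for rid, meta in records.items() if matches_eq(meta, string_filter)]
int_matches = [rid for rid, meta in records.items() if matches_eq(meta, int_filter)]
print(string_matches)
print(int_matches)
```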

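(3) Full-text search

Chroma supports full-text search over documents. For example, to find the most similar entries among the documents that contain apple: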

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges"
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"]
)


results = collection.query(
    query_texts=["This is a query document about cat"], # Chroma will embed this for you
    n_results=2,  # how many results to return
    where_document={"$contains":"apple"}
)

print(results)

The query result is as follows:

{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'data': None, 'metadatas': [[]], 'distances': [[]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

No document contains the string "apple", so the result is empty.
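
Conceptually, where_document narrows the set of candidates, and the similarity ranking applies to whatever is left. The sketch below models that with a plain substring test, reusing the documents and distances from the earlier output. It illustrates the behavior only and is not Chroma's implementation; `query_with_contains` is a hypothetical helper.

```python
def query_with_contains(candidates: list[tuple], substring: str, n_results: int) -> list[str]:
    """Return the ids of the closest candidates whose document contains substring.

    Each candidate is an (id, document, distance) tuple.
    """
    kept = [candidate for candidate in candidates if substring in candidate[1]]
    kept.sort(key=lambda candidate: candidate[2])  # smallest distance first
    return [candidate[0] for candidate in kept[:n_results]]


candidates = [
    ("id1", "This is a document about dog", 0.9647536639619662),
    ("id2", "This is a document about oranges", 1.4269337554105601),
]

print(query_with_contains(candidates, "apple", n_results=2))
print(query_with_contains(candidates, "oranges", n_results=2))
```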

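IV. A Worked Example

Next, we take 20 short lines of classical Chinese poetry as input, pick one more line as the query, and check how well vectorization and retrieval work.

The input lines (selected with Kimi) are: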

海内存知己，天涯若比邻。（王勃《送杜少府之任蜀州》）
大漠孤烟直，长河落日圆。（王维《使至塞上》）
春眠不觉晓，处处闻啼鸟。（孟浩然《春晓》）
会当凌绝顶，一览众山小。（杜甫《望岳》）
海上生明月，天涯共此时。（张九龄《望月怀远》）
举头望明月，低头思故乡。（李白《静夜思》）
山重水复疑无路，柳暗花明又一村。（陆游《游山西村》）
不识庐山真面目，只缘身在此山中。（苏轼《题西林壁》）
采菊东篱下，悠然见南山。（陶渊明《饮酒》）
谁言寸草心，报得三春晖。（孟郊《游子吟》）
忽如一夜春风来，千树万树梨花开。（岑参《白雪歌送武判官归京》）
落霞与孤鹜齐飞，秋水共长天一色。（王勃《滕王阁序》）
青山遮不住，毕竟东流去。（辛弃疾《菩萨蛮·书江西造口壁》）
春江潮水连海平，海上明月共潮生。（张若虚《春江花月夜》）
两岸猿声啼不住，轻舟已过万重山。（李白《早发白帝城》）
问渠那得清如许？为有源头活水来。（朱熹《观书有感》）
竹外桃花三两枝，春江水暖鸭先知。（苏轼《惠崇春江晚景》）
身无彩凤双飞翼，心有灵犀一点通。（李商隐《无题》）
众里寻他千百度，蓦然回首，那人却在，灯火阑珊处。（辛弃疾《青玉案·元夕》）
莫愁前路无知己，天下谁人不识君。（高适《别董大》）

The query line:

明月几时有，把酒问青天

The code is as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/shici")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="shici")

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "海内存知己，天涯若比邻",
        "大漠孤烟直，长河落日圆",
        "春眠不觉晓，处处闻啼鸟",
        "会当凌绝顶，一览众山小",
        "海上生明月，天涯共此时",
        "举头望明月，低头思故乡",
        "山重水复疑无路，柳暗花明又一村",
        "不识庐山真面目，只缘身在此山中",
        "采菊东篱下，悠然见南山",
        "谁言寸草心，报得三春晖",
        "忽如一夜春风来，千树万树梨花开",
        "落霞与孤鹜齐飞，秋水共长天一色",
        "青山遮不住，毕竟东流去",
        "春江潮水连海平，海上明月共潮生",
        "两岸猿声啼不住，轻舟已过万重山",
        "问渠那得清如许？为有源头活水来",
        "竹外桃花三两枝，春江水暖鸭先知",
        "身无彩凤双飞翼，心有灵犀一点通",
        "众里寻他千百度，蓦然回首，那人却在，灯火阑珊处",
        "莫愁前路无知己，天下谁人不识君"
    ],
    ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9", "id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17", "id18", "id19", "id20"]
)

results = collection.query(
    query_texts=["明月几时有，把酒问青天"], # Chroma will embed this for you
    n_results=5 # how many results to return
    # where_document={"$contains":"apple"}
)

print(results)

The printed result:

{'ids': [['id6', 'id5', 'id13', 'id1', 'id20']], 'embeddings': None, 'documents': [['举头望明月，低头思故乡', '海上生明月，天涯共此时', '青山遮不住，毕竟东流去', '海内存知己，天涯若比邻', '莫愁前路无知己，天下谁人不识君']], 'uris': None, 'data': None, 'metadatas': [[None, None, None, None, None]], 'distances': [[0.4827560689814884, 0.5092440264675281, 0.5768293567797822, 0.5936621113091055, 0.5975973195463598]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

As the output shows, the top two results both involve the moon and also express longing. Overall, the results are quite good.
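
One small refinement to the example: the id list is typed out by hand, and a list comprehension produces the same ids while staying in step with the documents if lines are added or removed. The documents list below is shortened to three lines for brevity.

```python
documents = [
    "海内存知己，天涯若比邻",
    "大漠孤烟直，长河落日圆",
    "春眠不觉晓，处处闻啼鸟",
]

# Build "id1", "id2", ... so that there is exactly one id per document.
ids = [f"id{index}" for index in range(1, len(documents) + 1)]

print(ids)
```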