Notes on langchain chroma and chromadb

chromadb can be used on its own or together with the langchain framework.

Environment:

        python 3.9

        langchain=0.2.16

        chromadb=0.5.3

chromadb usage example

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions


# Load the embedding model from a local path
en_embedding_name = "/home/model/peft_prac/all-MiniLM-L6-v2"
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=en_embedding_name,
    device="cuda:2",
    normalize_embeddings=True,
)

# Create a chromadb client and add a collection
collection_first = 'coll_1st'
client_test = chromadb.Client()
collection = client_test.create_collection(name=collection_first, embedding_function=ef)


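# Add data as three parallel lists: documents, metadatas, ids
collection.add(
    documents=["it's an apple", "this is a book"],
    metadatas=[{"source": "t4"}, {"source": "t5"}],
    ids=["id4", "id5"],
)

# Count the items in the collection
print(collection.count())

# Look up data
# Pass the same embedding function again, otherwise text queries on coll2 use chromadb's default one
coll2 = client_test.get_collection(collection_first, embedding_function=ef)
print('check_collection', coll2.peek(1))  # first item; embeddings are included
print('check_collection', coll2.get(ids=["id4"]))  # fetch by id; embeddings are None unless requested with include=["embeddings"]
# Get the collection if it exists, otherwise create it
# (bound to a separate name so that `collection` still refers to coll_1st below)
coll_testname = client_test.get_or_create_collection("testname")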

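# Update data: upsert updates ids that already exist and inserts the ones that do not
# (`...` stands for further entries)
collection.upsert(
    ids=["id4", ...],
    embeddings=[[1.1, 2.3, 3.2], ...],  # optional; if given, the length must match the collection's embedding dimension
    metadatas=[{"chapter": "3", "verse": "16"}, ...],
    documents=["it's a book", ...],
)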

# Search by embedding

collection.query(
    query_embeddings=[[1.1, 2.3, 3.2]],
    n_results=1,
    where={"style": "style2"}
)
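
To make the roles of n_results and where in the call above concrete, here is a minimal brute-force sketch in plain Python. It is not how chromadb works internally (chromadb uses an approximate index and supports more filter operators); the record layout, the function names and the equality-only filter are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory record: (id, embedding, metadata)
Record = Tuple[str, List[float], Dict[str, str]]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_query(
    records: List[Record],
    query_embedding: List[float],
    n_results: int = 1,
    where: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, float]]:
    """Keep records whose metadata equals every key/value in `where`,
    then return the n_results closest ones as (id, distance) pairs."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r[2].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: squared_l2(r[1], query_embedding))
    return [(r[0], squared_l2(r[1], query_embedding)) for r in ranked[:n_results]]


records = [
    ("a", [1.0, 2.0, 3.0], {"style": "style1"}),
    ("b", [1.1, 2.3, 3.2], {"style": "style2"}),
    ("c", [9.0, 9.0, 9.0], {"style": "style2"}),
]

hits = brute_force_query(records, [1.1, 2.3, 3.2], n_results=1, where={"style": "style2"})
print(hits)
```

The filter is applied before ranking, so a record with a non-matching style can never appear in the result, no matter how close its embedding is.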

# Search by text (run before the update above); the smaller the distance, the closer the meaning
print('chromadb_search', coll2.query(query_texts="it's a book", n_results=2))
output:
chromadb_search {'ids': [['id5', 'id4']], 'distances': [[0.3473210334777832, 1.2127960920333862]], 'metadatas': [[{'source': 't5'}, {'source': 't4'}]], 'embeddings': None, 'documents': [['this is a book', "it's an apple"]], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}

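# Search by text (run after the update). Note: the query now equals the stored document, and the result is a distance, not a similarity score, so it is close to 0 rather than 1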
print('chromadb_search', coll2.query(query_texts="it's a book", n_results=1))
output:
chromadb_search {'ids': [['id4']], 'distances': [[1.168771351402198e-12]], 'metadatas': [[{'info': 'new data', 'source': 't4'}]], 'embeddings': None, 'documents': [["it's a book"]], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
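
The distances printed above are easier to read once converted to a similarity. The sketch below assumes two things that the outputs do not state: that the collection uses chromadb's default squared L2 distance, and that the stored embeddings are unit-normalized (the third argument True passed to the embedding function above is normalize_embeddings). Under those assumptions, distance = 2 - 2 * cosine similarity. The helper names are made up for this example.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def l2sq_to_cosine(distance: float) -> float:
    """Invert distance = 2 - 2 * cos, valid only for unit-length vectors."""
    return 1.0 - distance / 2.0


# Check the identity on two small unit vectors
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
print(squared_l2(a, b), 2 - 2 * cosine(a, b))

# Apply it to the distances from the outputs above
for d in (0.3473210334777832, 1.2127960920333862, 1.168771351402198e-12):
    print(d, "->", round(l2sq_to_cosine(d), 3))
```

Read this way, "this is a book" has a cosine similarity of about 0.83 to the query, "it's an apple" about 0.39, and the exact match after the update is 1.0 up to floating-point error.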

        A chromadb client can manage multiple collections

langchain chroma usage example

import chromadb
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_chroma import Chroma


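# Load the embedding model. embedding_functions.SentenceTransformerEmbeddingFunction is not recommended here:
# it is a plain callable without the embed_documents()/embed_query() methods langchain expects,
# so db.add_documents() and db.similarity_search() do not work with it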
en_embedding_name = "/home/zmh/peft_prac/all-MiniLM-L6-v2"  
embeddings = HuggingFaceEmbeddings(
    model_name = en_embedding_name,
    model_kwargs={"device": "cuda:1"}
)
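
Why the chromadb embedding function does not fit here: langchain's vector stores call embed_documents() and embed_query() on the embedding object, while a chromadb embedding function is only callable with a list of texts. The adapter below is a sketch of that interface gap in plain Python. CallableEmbeddingsAdapter and toy_embedding are hypothetical names, the toy function stands in for a real model, and passing such an adapter to Chroma has not been verified here; HuggingFaceEmbeddings as used above remains the simpler choice.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class CallableEmbeddingsAdapter:
    """Hypothetical adapter: wraps a callable texts -> vectors and exposes
    the two method names langchain embedding objects provide."""

    def __init__(self, fn: Callable[[List[str]], Sequence[Sequence[float]]]) -> None:
        self._fn = fn

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        # Convert to plain lists of floats (a real model may return arrays)
        return [[float(x) for x in vec] for vec in self._fn(list(texts))]

    def embed_query(self, text: str) -> Vector:
        # A query is embedded like a one-element batch
        return self.embed_documents([text])[0]


def toy_embedding(texts: List[str]) -> List[List[int]]:
    """Stand-in for a real model: [character count, number of spaces]."""
    return [[len(t), t.count(" ")] for t in texts]


adapter = CallableEmbeddingsAdapter(toy_embedding)
print(adapter.embed_documents(["it's an apple", "this is a book"]))
print(adapter.embed_query("a book"))
```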

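# Create the db. It can also be persisted locally: if persist_directory already contains data, it is loaded on instantiation
# collection_name: reused with its data if it already exists, created otherwise; it becomes the collection this db works on
collection_test = 'llama2_demo'
db = Chroma(
    client=client_test,  # optional; here the chromadb.Client() created in the previous example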
    collection_name=collection_test,
    embedding_function=embeddings, 
    persist_directory='db/'
)

# Sample data
student_info = "Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA, is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking in her free time in hopes of working at a tech company after graduating from the University of Washington."

club_info = "The university chess club provides an outlet for students to come together and enjoy playing the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning the rules to experienced tournament players. The club typically meets a few times per week to play casual games participate in tournaments, analyze famous chess matches, and improve members' skills."

university_info = "The University of Washington, founded in 1861 in Seattle, is a public research university with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell. As the flagship institution of the six public universities in Washington state, UW encompasses over 500 buildings and 20 million square feet of space, including one of the largest library systems in the world."


texts_org = [student_info, club_info, university_info]
text_meta = [{"source": 'student_info'},  {"source": 'club_info'},  {"source": 'university_info'}]
text_ids = ['101',  '102',  '103']

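# Split the texts into Document objects (each text is shorter than chunk_size, so it stays a single chunk)
text_splitter = CharacterTextSplitter(separator='.', chunk_size=1000, chunk_overlap=0)
texts_document = text_splitter.create_documents(texts_org, metadatas=text_meta)
# Add the data; ids needs one entry per document
db.add_documents(texts_document, ids=text_ids)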
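
The splitter settings above (separator='.', chunk_size=1000, chunk_overlap=0) can be pictured as "split on the separator, then glue pieces back together until the chunk would exceed chunk_size". The function below is a simplified sketch of that idea in plain Python, not LangChain's implementation; split_on_separator is a hypothetical name and chunk overlap is ignored.

```python
from typing import List


def split_on_separator(text: str, separator: str = ".", chunk_size: int = 1000) -> List[str]:
    """Split on `separator`, drop empty pieces, then greedily merge pieces
    back with the separator while the chunk stays within chunk_size."""
    pieces = [p for p in text.split(separator) if p != ""]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk
            chunks.append(separator.join(current).strip())
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current).strip())
    return chunks


short_text = "The club meets weekly. Members of all levels are welcome."
print(split_on_separator(short_text))               # one chunk, since it fits in 1000 characters
print(split_on_separator("aaaa.bbbb.cccc", chunk_size=9))
```

Two consequences are visible in this walkthrough: each of the three sample texts is well under 1000 characters, so it yields exactly one document and the three ids line up; and the final period of a text is lost, because the empty piece after the last separator is dropped, which matches the page_content shown in the search output.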

# Inspect the data
coll = db._collection
print('coll', type(coll), coll.name, coll.metadata)
output:
coll <class 'chromadb.api.models.Collection.Collection'> llama2_demo None
print('sample of db_info',  coll.peek(1)) # get the first item
print("collection_info", coll.get()) # get all data in the collection


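# Search. similarity_search returns the Document objects only, without distance scores
# (similarity_search_with_score returns (Document, score) pairs instead)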
res = db.similarity_search("What is the student name?", k=2)
print('res',  res)
output:
res [Document(metadata={'source': 'student_info'}, page_content='Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA, is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking in her free time in hopes of working at a tech company after graduating from the University of Washington'), Document(metadata={'source': 'club_info'}, page_content="The university chess club provides an outlet for students to come together and enjoy playing the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning the rules to experienced tournament players. The club typically meets a few times per week to play casual games participate in tournaments, analyze famous chess matches, and improve members' skills")]


        A langchain Chroma instance works on a single collection

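Saving and loading a chroma db

# Save to disk
# docs, embedding_function and query are assumed to be defined already:
# a list of Document objects, a langchain embedding object, and a query string
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
results = db2.similarity_search(query, k=1)

# Load from disk (use the same embedding function that was used when saving)
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
results = db3.similarity_search(query, k=1)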

