LLM (9) | Running Mixtral 8x7B Locally with LlamaIndex

wshzd

Article link: https://blog.csdn.net/wshzd/article/details/135386016

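Mistral AI, a leading European AI company, recently released Mixtral 8x7B as an open model. It is a sparse "mixture of experts" model, named for its eight experts at the 7-billion-parameter scale. In its announcement post (https://mistral.ai/news/mixtral-of-experts/), Mistral AI reports that Mixtral 8x7B matches or outperforms GPT-3.5 and Llama 2 70B on many benchmarks.

Below, we use LlamaIndex to run Mixtral 8x7B locally: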

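Step 1: Install Ollama

Installing and running a local model used to be a real pain, but Ollama has made it simple. At the time of writing it runs on macOS and Linux (Windows support is expected soon, and in the meantime it works on Windows through Windows Subsystem for Linux). It is open source and free to download (https://ollama.ai/download).

Once it is installed, a single command gets you Mixtral: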

ollama run mixtral

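The first time you run this command it has to download the model, which can take a long time. Running Mixtral needs 48 GB of RAM; if your machine has less, you can install the Mistral 7B model instead: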

ollama run mistral

Note: the steps below use Mixtral, but Mistral works just as well.
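
The choice between the two models comes down to available memory. Here is a minimal sketch of that decision, assuming the 48 GB figure quoted in this post as the threshold. The helper pick_model is a hypothetical name introduced for illustration; it is not part of Ollama or LlamaIndex.

```python
# Hypothetical helper: pick an Ollama model tag from the machine's RAM.
# The 48 GB threshold is the figure quoted in this post, not an official limit.
MIXTRAL_MIN_RAM_GB = 48


def pick_model(total_ram_gb: float) -> str:
    """Return "mixtral" when there is enough RAM, otherwise "mistral"."""
    if total_ram_gb >= MIXTRAL_MIN_RAM_GB:
        return "mixtral"
    return "mistral"


if __name__ == "__main__":
    # Example: a 32 GB laptop falls back to the smaller model.
    print(pick_model(32))
```

The returned tag can be passed straight to the model argument used in the scripts that follow.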

Step 2: Install dependencies

pip install llama-index qdrant_client torch transformers

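Step 3: Smoke test

If Ollama is running and LlamaIndex is installed correctly, the following script checks that everything works. The scripts in this post use the import paths of llama-index releases before 0.10 (for example, from llama_index.llms import Ollama); later releases reorganized these modules, so pin an older release or adjust the imports if they fail.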

# Just runs .complete to make sure the LLM is listening
from llama_index.llms import Ollama

llm = Ollama(model="mixtral")
response = llm.complete("Who is Laurie Voss?")
print(response)
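
LlamaIndex's Ollama class talks to the local Ollama server over HTTP, so if the test above hangs or fails it can help to query the server directly. The sketch below assumes Ollama's default address http://localhost:11434 and its /api/generate endpoint; build_generate_request and complete are hypothetical helper names, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, with Ollama running:
#   print(complete("mixtral", "Who is Laurie Voss?"))
```

If this call works but the LlamaIndex script does not, the problem is more likely in the Python environment than in Ollama.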

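Step 4: Load the data and index it

Any data will do. Here we use a small JSON file of tweets (https://www.dropbox.com/scl/fi/6sos49fluvfilj3sqcvoj/tinytweets.json?rlkey=qmxlaqp000kmx8zktvaj4u1vh&dl=0) and store it in the open-source Qdrant vector database. Create a new Python file and load all of our dependencies: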

from pathlib import Path
import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    download_loader,
)
from llama_index.llms import Ollama
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

Then use JSONReader, from LlamaHub's collection of open-source data connectors, to load the tweets from the JSON file:

JSONReader = download_loader("JSONReader")
loader = JSONReader()
documents = loader.load_data(Path('./data/tinytweets.json'))
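
To give a feel for what turning JSON into indexable text involves, here is a simplified sketch that flattens nested JSON into one line of text per leaf value. It is not JSONReader's actual implementation, and the sample record is invented for illustration; the real structure of tinytweets.json may differ.

```python
import json
from typing import Any, Iterator


def flatten_json(node: Any, path: str = "") -> Iterator[str]:
    """Yield "path: value" lines for every leaf value in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_json(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten_json(value, f"{path}[{i}]")
    else:
        yield f"{path}: {node}"


# Invented sample record, only to show the shape of the output.
sample = json.loads('[{"tweet": {"full_text": "I like SQL.", "lang": "en"}}]')
lines = list(flatten_json(sample))
print("\n".join(lines))
```

Keeping the key path next to each value gives the embedding model some context about what a piece of text is.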

Next, initialize Qdrant and pass it into a storage context that we will use later:

client = qdrant_client.QdrantClient(
    path="./qdrant_data"
)
vector_store = QdrantVectorStore(client=client, collection_name="tweets")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

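Now set up our ServiceContext. We pass Mixtral to it as the LLM so that we can test that things work once indexing is done; indexing itself does not need Mixtral. Passing embed_model="local" tells LlamaIndex to embed your data locally, which is why torch and transformers are needed.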

llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

Now put it all together: build the index from the loaded documents using the service and storage contexts set up above, then give it a query:

index = VectorStoreIndex.from_documents(documents, service_context=service_context, storage_context=storage_context)
query_engine = index.as_query_engine()
response = query_engine.query("What does the author think about Star Trek? Give details.")
print(response)

Ollama has to start Mixtral to answer the question, which can take a while, so be patient! You should get output like this (but with more detail):

Based on the provided context information, the author has a mixed opinion about Star Trek.

Verify the index

To use the index we just built, start a new Python file and load the dependencies again:

import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.llms import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore

This time there is no need to load the data; that is already done. We still need the Qdrant client and Mixtral:

client = qdrant_client.QdrantClient(
    path="./qdrant_data"
)
vector_store = QdrantVectorStore(client=client, collection_name="tweets")
llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
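
Because this script relies on the index from Step 4 already being on disk, a small guard can fail early with a clear message instead of returning empty answers. This is a sketch that only checks for a non-empty ./qdrant_data directory; index_dir_ready is a hypothetical helper, and it does not inspect Qdrant's files.

```python
from pathlib import Path


def index_dir_ready(path: str) -> bool:
    """Return True if `path` is an existing, non-empty directory."""
    p = Path(path)
    return p.is_dir() and any(p.iterdir())


if __name__ == "__main__":
    # "./qdrant_data" is the path used when the index was built in Step 4.
    if not index_dir_ready("./qdrant_data"):
        print("No index found in ./qdrant_data; run the indexing script first.")
```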

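This time, instead of creating the index from documents, we load it directly from the vector store with from_vector_store. We also pass similarity_top_k=20 to the query engine, which means it fetches 20 tweets at a time (the default is 2) to get more context and answer the question better.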

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Does the author like SQL? Give details.")
print(response)
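
To make similarity_top_k concrete, here is a toy sketch of top-k retrieval. It uses cosine similarity as a common choice of metric, and the three-dimensional vectors are made up. In the real pipeline the embeddings come from the local embedding model and the search happens inside Qdrant, whose implementation is not shown here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]


# Made-up embeddings for three tweets.
docs = {
    "tweet_a": [1.0, 0.0, 0.0],
    "tweet_b": [0.9, 0.1, 0.0],
    "tweet_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```

A larger k hands the LLM more retrieved text to reason over, at the cost of a longer prompt and slower answers.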

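Build a small web service

An index that can only be run from a script is not very convenient, so let's wrap it in an API. This needs two new dependencies: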

pip install flask flask-cors

As before, load our dependencies into a new file:

from flask import Flask, request, jsonify
from flask_cors import CORS, cross_origin
import qdrant_client
from llama_index.llms import Ollama
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore

Get the vector store, the LLM, and the loaded index:

# re-initialize the vector store
client = qdrant_client.QdrantClient(
    path="./qdrant_data"
)
vector_store = QdrantVectorStore(client=client, collection_name="tweets")

# get the LLM again
llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# load the index from the vector store
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

Set up a very basic Flask server:

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

# This is just so you can easily tell the app is running
@app.route('/')
def hello_world():
    return 'Hello, World!'

And add a route that accepts a query (as form data), queries the LLM, and returns the response:

@app.route('/process_form', methods=['POST'])
@cross_origin()
def process_form():
    query = request.form.get('query')
    if query is not None:
        query_engine = index.as_query_engine(similarity_top_k=20)
        response = query_engine.query(query)
        return jsonify({"response": str(response)})
    else:
        return jsonify({"error": "query field is missing"}), 400

if __name__ == '__main__':
    app.run()

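Note: the last two lines are important! flask run is not compatible with the way LlamaIndex loads its dependencies, so the API has to be run directly like this (assuming your file is named app.py):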

python app.py
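
The branching logic of the /process_form route can also be exercised without starting Flask or Ollama. The sketch below restates it as a plain function; handle_query and fake_engine are hypothetical names introduced for illustration, and the return value mirrors the body and status code of the route.

```python
from typing import Callable, Mapping, Optional, Tuple


def handle_query(
    form: Mapping[str, str], run_query: Callable[[str], object]
) -> Tuple[dict, int]:
    """Mirror of the /process_form logic: answer the query or report it missing."""
    query: Optional[str] = form.get("query")
    if query is not None:
        return {"response": str(run_query(query))}, 200
    return {"error": "query field is missing"}, 400


# Stub standing in for query_engine.query, so no model is needed.
def fake_engine(query: str) -> str:
    return f"stub answer to: {query}"


print(handle_query({"query": "Does the author like SQL?"}, fake_engine))
print(handle_query({}, fake_engine))
```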

With the API up and running, you can send a request with cURL to verify it:

curl --location 'http://127.0.0.1:5000/process_form' \
  --form 'query="What does the author think about Star Trek?"'
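
The same request can be sent from Python using only the standard library. This sketch assumes the API is running on Flask's default address http://127.0.0.1:5000; build_query_request and ask are hypothetical helper names. Unlike curl --form, which sends multipart form data, it sends a URL-encoded form body; Flask's request.form reads both.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url: str, query: str) -> urllib.request.Request:
    """Build a form-encoded POST to the /process_form route."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/process_form", data=data, method="POST")


def ask(base_url: str, query: str, timeout: float = 600.0) -> str:
    """Send the query and return the "response" field (needs the API running)."""
    request = build_query_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, with app.py running:
#   print(ask("http://127.0.0.1:5000", "What does the author think about Star Trek?"))
```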

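Summary:

Got Ollama running Mixtral locally
Queried Mixtral 8x7B with LlamaIndex
Built and queried an index over the data using the Qdrant vector store
Wrapped the index in a very simple web API
All open source, free, and running locally!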

References:

[1] https://blog.llamaindex.ai/running-mixtral-8x7-locally-with-llamaindex-e6cebeabe0ab
