使用LlamaCPP进行大模型推理

最新推荐文章于 2024-08-31 22:01:46 发布

llzwxh888

最新推荐文章于 2024-08-31 22:01:46 发布

阅读量262

点赞数 4

文章标签： python

本文链接：https://blog.csdn.net/ppoojjj/article/details/140255722

版权

在本文中，我们将介绍如何使用llama-cpp-python库与LlamaIndex进行大模型推理。我们将使用llama-2-chat-13b-ggml模型，并且会展示如何正确配置提示格式。

以下是主要步骤的安装和配置指南。

安装

为确保LlamaCPP的最佳性能，建议安装支持GPU的版本。具体安装指南请参阅这里。

一般来说：

如果你有CUDA和NVidia GPU，请使用CuBLAS
如果你在M1/M2 MacBook上运行，请使用METAL
如果你在AMD/Intel GPU上运行，请使用CLBLAST

安装必要的包：

%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp

配置LLM

LlamaCPP的LLM是高度可配置的。根据所使用的模型，你需要传入messages_to_prompt和completion_to_prompt函数来帮助格式化模型输入。

示例代码如下：

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# 设置模型路径或者URL
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"

llm = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

快速开始

我们可以简单地使用LlamaCPP的complete方法根据提示生成回复。

response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

使用流式响应

我们可以使用stream_complete端点在生成响应时进行流式处理，而不是等待整个响应生成后再处理。

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)

配置查询引擎

我们可以将LlamaCPP LLM抽象传递给LlamaIndex查询引擎。

from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
)

# 使用Huggingface嵌入
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# 加载文档
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()

# 创建向量存储索引
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# 设置查询引擎
query_engine = index.as_query_engine(llm=llm)

response = query_engine.query("What did the author do growing up?")
print(response)