本地部署流式返回LLM模型，使用gguf模型，全部放在显存大约占用4G

本文链接：https://blog.csdn.net/z1162562943/article/details/139241160

环境准备：
我选择的是CUDA12.4版本
对应安装的llama_cpp是：

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

对于环境有不懂的，可以查阅llama_cpp的文档
https://llama-cpp-python.readthedocs.io/en/latest/

定义过程

from llama_cpp import Llama
model_name = 'qwen1_5-7b-chat-q3_k_m.gguf'
# model_name = 'Phi-3-mini-4k-instruct-q4.gguf'
print(model_name)
class chat_bot:
    
	def __init__(self,model_name):
	    self.llm = Llama(model_path=model_name,n_gpu_layers=20,n_ctx=4096,n_threads=8,chat_format="llama-2")
	
	def talk(self,messages):
	    print(messages)
	    output = self.llm.create_chat_completion(
	        messages=messages,max_tokens=8192,
	        stream=True,
	        temperature=0.7,
	    )
	    print(output)
	    return output
	    
self.llm = chat_bot(model_name)

流式调用，需要放在自己定义好的方法里面，这里只展示内容提取

response = self.llm.talk(messages)
for chunk in response:
    self.myLog(f"chat-gpt 返回的数据: {chunk}\r\n")
    if chunk["choices"] != []:
        if "content" in chunk["choices"][0]["delta"]:
            content = str(chunk["choices"][0]["delta"]["content"])
            if (content != ""):
                self.myLog(f"chat-gpt content: {content}\r\n")
                resText += content
                yield content

备注：
n_gpu_layers 参数是选择有多少层要在GPU中运算，-1为全部使用GPU。（这个很有用，可以根据显存动态调整GPU的使用率）
n_threads 用多少cpu核心处理。（32cpu核处理起来速度也不错）
chat_format 则是选择的模型格式，这决定了很多模型的参数，用对应的模型，设置对应的chat_format，可以得到很好的效果。

messages 格式需要根据模型判断，并不唯一。