Ollama in practice (2): deploying and converting GGUF models (llama.cpp)

被玩弄的小猫咪

Last modified 2024-10-04 14:05:38


Tags: llama

First published 2024-10-01 18:26:22

Article link: https://blog.csdn.net/yierbubu1212/article/details/142673286


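Deploying and converting GGUF models with ollama

GGUF files load quickly. llama.cpp handles the format conversion and quantization, which suits models you have fine-tuned yourself: convert or quantize them first, then deploy them with ollama.

For installing ollama, see the previous article: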

https://blog.csdn.net/yierbubu1212/article/details/142673245?spm=1001.2014.3001.5502

1. Downloading llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git

1.1 Setting up llama.cpp for model conversion

cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py -h

make

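2. Model format conversion (choose the type according to your available VRAM)

2.1 Recommended approach (convert the format first, then quantize)

First convert to GGUF without quantizing (bf16):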

python convert_hf_to_gguf.py /mnt/workspace/Qwen2.5-7B-Instruct-merge --outfile Qwen_instruct.gguf --outtype bf16
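Before quantizing, it can be worth confirming that the conversion really produced a GGUF file. The following is a minimal sketch, not part of llama.cpp: it assumes the little-endian GGUF header layout (4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count). The function names are hypothetical.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata kv count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_gguf_header(raw):
    """Parse the first bytes of a GGUF file; raise ValueError if they do not match."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return version, n_tensors, n_kv


def read_gguf_header(path):
    """Read and parse the header of the file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header("Qwen_instruct.gguf")` on the converted file should return a small version number and non-zero counts; a `ValueError` suggests the file is truncated or not GGUF at all.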

Then quantize (the first argument is the path of the GGUF file produced by the conversion step):

./llama-quantize /mnt/workspace/Qwen_instruct_7b_.gguf /mnt/workspace/Qwen_instruct_7b_q4.gguf Q4_K_M
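To make "choose according to your VRAM" concrete, the file size of each type can be roughly estimated from the parameter count. This is only a sketch under stated assumptions: the parameter count of about 7.6 billion is assumed for a Qwen2.5-7B model, bf16 uses 16 bits per weight, q8_0 uses 8.5 (32 int8 values plus one 16-bit scale per block), and the Q4_K_M figure is an approximate average because that type mixes several tensor formats. Runtime memory is higher than the file size, since the KV cache and buffers come on top.

```python
# Approximate bits per weight for each output type.
# The q4_k_m value is a rough average (assumption), not an exact constant.
BITS_PER_WEIGHT = {"bf16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}


def estimate_size_gb(n_params, outtype):
    """Estimate the GGUF file size in GB (10^9 bytes) for a given type."""
    return n_params * BITS_PER_WEIGHT[outtype] / 8 / 1e9


N_PARAMS = 7.6e9  # assumed parameter count for a Qwen2.5-7B model

for outtype in BITS_PER_WEIGHT:
    print(f"{outtype}: ~{estimate_size_gb(N_PARAMS, outtype):.1f} GB")
```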

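2.2 Quantizing during conversion (Safetensors to GGUF)

Notes

Convert Safetensors to GGUF and choose 8-bit quantization

The conversion runs on the CPU

(tq1 output may not be deployable with ollama, and its inference quality in llama.cpp is poor)

Mind the path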

python convert_hf_to_gguf.py /mnt/workspace/Qwen2.5-7B-Instruct-merge --outfile Qwen_instruct.gguf --outtype q8_0

python convert_hf_to_gguf.py ./mnt/workspace/Qwen2.5-7B-Instruct-merge --outfile Qwen_instruct.gguf --outtype q8_0 (wrong: the leading ./ makes the path relative to the current directory)
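The difference between the two paths can be sketched as follows. The working directory `/root/llama.cpp` is a hypothetical example, and `posixpath` is used so the result is the same on any platform.

```python
import posixpath


def describe(path, cwd):
    """Return (is_absolute, resolved_path) for a path argument used from `cwd`."""
    # Joining with an absolute path discards cwd; a relative path is appended to it.
    resolved = posixpath.normpath(posixpath.join(cwd, path))
    return posixpath.isabs(path), resolved


CWD = "/root/llama.cpp"  # hypothetical directory the command is run from

print(describe("/mnt/workspace/Qwen2.5-7B-Instruct-merge", CWD))
print(describe("./mnt/workspace/Qwen2.5-7B-Instruct-merge", CWD))
```

The first path stays under `/mnt/workspace`, while the second is looked up inside the llama.cpp directory, where the model does not exist.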

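2.3 Requantizing an already quantized model (only if needed; quality suffers)

Notes:

Do not write the path as ./mnt/workspace/Qwen_instruct_7b_.gguf; the leading ./ makes it relative and it will not resolve.

By default a q8_0 model cannot be requantized to Q4_K_M, but the `--allow-requantize` option permits it.

This option allows requantizing tensors that are already quantized, at the cost of possibly lower model quality.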

./llama-quantize --allow-requantize /mnt/workspace/Qwen_instruct_7b_.gguf /mnt/workspace/Qwen_instruct_7b_q4.gguf Q4_K_M

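Appendix:

General options of llama-quantize
--help: show help.
--allow-requantize: allow requantizing tensors that have already been quantized. Warning: this can severely reduce quality.
--leave-output-tensor: leave the output weights unquantized. Increases model size but may improve quality, especially when requantizing.
--pure: disable k-quant mixtures and quantize all tensors to the same type.
--imatrix file_name: use the importance matrix in the file to optimize quantization.
--include-weights tensor_name: use the importance matrix for the named tensor(s).
--exclude-weights tensor_name: do not use the importance matrix for the named tensor(s).
--output-tensor-type ggml_type: use the given ggml_type for the output weight tensor.
--token-embedding-type ggml_type: use the given ggml_type for the token embedding tensor.
--keep-split: write the quantized model in the same shards as the input.
--override-kv KEY=TYPE:VALUE: override a key/value pair in the model metadata. May be specified multiple times.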
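When quantizing from a script, the options above can be assembled into an argument list. This is a small sketch rather than an official wrapper: the helper name `quantize_command` is hypothetical, and only the `--allow-requantize` and `--imatrix` flags from the list are covered.

```python
import shlex


def quantize_command(src, dst, qtype, allow_requantize=False, imatrix=None,
                     binary="./llama-quantize"):
    """Build the argv list for llama-quantize; options go before the positional arguments."""
    cmd = [binary]
    if allow_requantize:
        cmd.append("--allow-requantize")
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    cmd += [src, dst, qtype]
    return cmd


cmd = quantize_command(
    "/mnt/workspace/Qwen_instruct_7b_.gguf",
    "/mnt/workspace/Qwen_instruct_7b_q4.gguf",
    "Q4_K_M",
    allow_requantize=True,
)
# Print the command as a shell line; subprocess.run(cmd, check=True) would execute it.
print(shlex.join(cmd))
```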

3. Inference with llama.cpp

./llama-cli -m Qwen_instruct.gguf -p "离婚期间打架怎么处理" -n 128

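4. Deploying with ollama

4.1 Create the model file (named `file` here)

Note: the model may keep generating text without stopping; this can be controlled through the prompt.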

FROM ./Qwen_instruct_7b_.gguf


# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20

TEMPLATE """{{ if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{ .System }}
{{- if .Tools }}

# Tools

You are provided with function signatures within <tools></tools> XML tags:
<tools>{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""

# set the system message
SYSTEM """You are Qwen, created by Alibaba Cloud. You are a helpful assistant."""
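If several fine-tuned variants need a model file, the FROM and PARAMETER lines can be generated from Python. This is a hedged sketch: the helper `render_modelfile` is hypothetical, and the TEMPLATE block above would still be appended unchanged. It also shows ollama's `stop` parameter as one possible way to end generation at `<|im_end|>`; the original controls runaway generation through the prompt instead, so treat the stop sequence as an untested assumption for this model.

```python
def render_modelfile(gguf_path, parameters, system=None):
    """Render the FROM, PARAMETER and SYSTEM lines of an ollama model file."""
    lines = [f"FROM {gguf_path}"]
    for name, values in parameters.items():
        # A parameter such as `stop` may be given several times.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            # Quote string values such as stop sequences; numbers are written as-is.
            text = f'"{value}"' if isinstance(value, str) else str(value)
            lines.append(f"PARAMETER {name} {text}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "./Qwen_instruct_7b_.gguf",
    {"temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.05, "top_k": 20,
     "stop": ["<|im_end|>"]},  # assumption: not part of the original model file
    system="You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
))
```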

4.2 Create the ollama model

`file` is the name of the file you created

ollama create mymodel -f file

4.3 Run the model in a terminal (in my tests it would not run from a notebook, so a terminal is the better choice)

ollama run mymodel
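Besides the interactive `ollama run`, a running ollama server also exposes an HTTP API, which may be easier to call from a script. The sketch below assumes the server was started with `ollama serve` on the default address `localhost:11434` and uses the `/api/generate` endpoint with streaming turned off. The helper names are hypothetical, and whether this avoids the notebook problem mentioned above has not been verified.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint


def build_request(model, prompt):
    """Build a non-streaming POST request for ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=300):
    """Send the prompt to a running ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `generate("mymodel", "your prompt")` returns the model's reply as a string.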

Note: my own fine-tuned qwen2.5-7b uses a little over 8 GB of VRAM

5. Results

llama.cpp
[Screenshot: llama.cpp output]

ollama
[Screenshot: ollama output]

6. Appendix

6.1 Linux commands you may need

Delete a directory
rm -rf path
Check disk space
df -h
Find large files (over 1 GiB)
find / -type f -size +1G
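The same checks can be done from Python, for example before writing a multi-gigabyte GGUF file. A minimal sketch using only the standard library; the function names are hypothetical.

```python
import os
import shutil


def free_disk_gb(path="/"):
    """Free space, in GB, on the filesystem that holds `path` (what df reports)."""
    return shutil.disk_usage(path).free / 1e9


def find_large_files(root, min_bytes=1 << 30):
    """List (path, size) for files under `root` larger than `min_bytes`, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # unreadable or vanished file, skip it
            if size > min_bytes:
                found.append((full, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, `find_large_files("/mnt/workspace")` lists leftover model files that can be deleted once the quantized version works.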

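6.2 Common ollama commands

ollama serve # start ollama
ollama create # create a model from a model file
ollama show  # show model information
ollama run  # run a model, pulling it first if it is not available locally
ollama pull  # pull a model from a registry
ollama push  # push a model to a registry
ollama list  # list downloaded models
ollama ps  # list running models
ollama cp  # copy a model
ollama rm  # remove a model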