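Overview
This post runs a large language model built with llama.cpp on the qemu-riscv64 platform, using RISC-V Vector extension (RVV) instructions for acceleration.
Reference blog posts: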
Accelerating llama.cpp with RISC-V Vector Extension
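Demo of RVV-based llama.cpp on the Banana Pi F3 RISC-V development board
Source code analysis reference:
KGback: [Project Analysis] llama.cpp
2024/10/02: The tools are ready, but the process is killed when run under QEMU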
Tool versions
QEMU:
GCC version:
GitHub Release
llama.cpp:
llama.cpp GitHub, pulled on October 2
llama-7b model version:
Hugging Face GGUF file
Build
Building llama.cpp
cd llama.cpp
make RISCV_CROSS_COMPILE=1
Run command
qemu-riscv64 -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-server -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 9
Problem
Observed behavior
Possible cause
The memory available to the run may be too small
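One quick way to test the memory hypothesis is to compare the size of the GGUF file with the memory the host can still provide. The sketch below is a minimal illustration, not part of the original setup. It assumes a Linux host with /proc/meminfo and treats the file size as a lower bound on the memory needed, since the KV cache and compute buffers come on top. The helper names meminfo_kb and model_fits and the MODEL variable are hypothetical.

```shell
#!/bin/sh
# Rough check: can the model file fit in free RAM plus free swap?
# Assumption: the weights need at least as much memory as the GGUF file size.

# Print the value (in kB) of a /proc/meminfo field, e.g. MemAvailable.
# The optional second argument overrides the file to read (used for testing).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Succeed if a model of $1 bytes fits into $2 kB of memory.
model_fits() {
    model_kb=$(( ($1 + 1023) / 1024 ))
    [ "$model_kb" -le "$2" ]
}

# Only runs when MODEL points at an existing file.
if [ -n "${MODEL:-}" ] && [ -f "$MODEL" ]; then
    size=$(wc -c < "$MODEL")
    avail=$(( $(meminfo_kb MemAvailable) + $(meminfo_kb SwapFree) ))
    if model_fits "$size" "$avail"; then
        echo "model may fit: ${size} bytes vs ${avail} kB available"
    else
        echo "model likely too large: ${size} bytes vs ${avail} kB available"
    fi
fi
```

Saved as, say, check_mem.sh, it could be run as MODEL=/path/to/model.gguf sh check_mem.sh. A "may fit" result is only a necessary condition, not a guarantee that the run will succeed.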
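2024/10/03: Switched to the latest version from the 10xE team, which fixes the tokenizer problem, but the process is still killed
Latest version GitHub link
Problem: the process is killed when running the 7B model
Observed behavior
Possible cause
It may be related to the swapfile available when running under QEMU
See an issue filed by a team member: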
Github Issue: qemu-riscv64 unexpectedly reached EOF error
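Before chasing the swapfile, it can help to confirm how the process actually ended. A shell reports a process terminated by signal N with exit status 128 + N, so a "Killed" message normally corresponds to status 137 (SIGKILL), which is what the kernel OOM killer sends. The helper below is a small sketch; the name describe_exit is hypothetical, and the OOM interpretation is an assumption that still needs checking against the kernel log.

```shell
#!/bin/sh
# Map a shell exit status to a short diagnosis.
# Statuses above 128 mean "terminated by signal (status - 128)".
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$(( status - 128 ))
        if [ "$sig" -eq 9 ]; then
            echo "killed by SIGKILL (possibly the kernel OOM killer)"
        else
            echo "terminated by signal $sig"
        fi
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error status $status"
    fi
}
```

Typical use is to run the qemu-riscv64 command and then call describe_exit $? on the next line. If it reports SIGKILL, the kernel log (for example via dmesg, which may need root) should show whether the OOM killer was responsible.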
Solution
Try a smaller model first; if that still fails, address the swapfile problem
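If the swapfile route turns out to be necessary, the usual Linux procedure is to allocate a file, restrict its permissions, format it as swap, and enable it. The sketch below only prints those commands as a dry run, so nothing on the host changes. The path /swapfile, the size 16G, and the function name swap_plan are assumptions for illustration, and whether extra swap actually resolves the kill in this setup is untested here.

```shell
#!/bin/sh
# Dry run: print the commands that would add a swapfile, without running them.
# Assumptions: Linux host, root access via sudo, hypothetical path and size.
# Note: on some filesystems a swapfile created with fallocate is rejected;
# writing the file with dd is the common fallback.
swap_plan() {
    path=$1
    size=$2
    echo "sudo fallocate -l $size $path"
    echo "sudo chmod 600 $path"
    echo "sudo mkswap $path"
    echo "sudo swapon $path"
}

swap_plan "${SWAP_PATH:-/swapfile}" "${SWAP_SIZE:-16G}"
```

The current state can be inspected with swapon --show or free -h before and after, to confirm the new swap space is active.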
Behavior when running a 3B model: failed to allocate buffer of size
kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64 -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 28
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 3072
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: llama.attention.head_count u32 = 24
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: general.file_type u32 = 27
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 59 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq3_s: 137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff