Environment
OS: CentOS 7
CPU: Intel® Xeon® E5-2680 v4 @ 2.40 GHz (14 cores / 28 threads)
RAM: 32 GB DDR3
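llama.cpp selects its SIMD kernels at build time, so it is worth checking which extensions the CPU exposes up front (the E5-2680 v4 supports AVX2, FMA, and F16C, but not AVX-512). A quick sketch, assuming a Linux /proc/cpuinfo:

```shell
# Report whether the instruction-set flags llama.cpp cares about are present.
for flag in avx2 fma f16c avx512f; do
  if grep -qw "$flag" /proc/cpuinfo; then
    echo "$flag: yes"
  else
    echo "$flag: no"
  fi
done
```

The same information shows up later in llama.cpp's own `system_info` line at startup.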
Dependencies
make --version
GNU Make 4.3
gcc --version
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
g++ --version
g++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
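CentOS 7 ships GCC 4.8 by default, which is too old for llama.cpp; a toolchain like the 11.x one above is typically enabled via Software Collections (devtoolset). A minimal sketch of checking the active compiler, parsing the version banner shown above with plain shell parameter expansion:

```shell
# Extract the major version from the `gcc --version` banner above.
ver_line="gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)"
major="${ver_line#gcc (GCC) }"   # strip the leading "gcc (GCC) " prefix
major="${major%%.*}"             # keep only the major version number
echo "gcc major version: $major"
```

In a real check you would feed `gcc --version | head -n1` in instead of the literal string.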
Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Wait a moment for the build to finish.
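By default `make` builds with a single job; on a 28-thread machine a parallel build is much faster. A sketch (the Makefile guard is only there so the snippet is safe to run outside the checkout):

```shell
# One make job per logical core (28 on this machine).
jobs="$(nproc)"
echo "building with ${jobs} jobs"
if [ -f Makefile ]; then
  make -j"${jobs}"
else
  echo "run this inside the llama.cpp checkout"
fi
```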
Inspect the build artifacts:
ls -lh
-rwxr-xr-x. 1 root root 1.6M Feb 23 07:54 main
-rwxr-xr-x. 1 root root 2.6M Feb 23 07:55 server
.....
Download the model
https://hf-mirror.com/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf
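The file is about 26 GiB, so a resumable download is worth having. Assuming the mirror follows the huggingface.co URL layout (`resolve/main/<file>`), the URL can be built like this and then passed to `wget -c`:

```shell
# Build the download URL; fetch it with: wget -c "$url"
repo="https://hf-mirror.com/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF"
file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf"
url="${repo}/resolve/main/${file}"
echo "$url"
```

`wget -c` continues a partial download instead of restarting, which matters at this file size.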
Test
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n -1 -e
llama_print_timings: load time = 2968.89 ms
llama_print_timings: sample time = 600.56 ms / 1261 runs ( 0.48 ms per token, 2099.71 tokens per second)
llama_print_timings: prompt eval time = 1853.61 ms / 19 tokens ( 97.56 ms per token, 10.25 tokens per second)
llama_print_timings: eval time = 297657.01 ms / 1260 runs ( 236.24 ms per token, 4.23 tokens per second)
llama_print_timings: total time = 300631.20 ms / 1279 tokens
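The headline number here is the eval rate: 1260 generated tokens over 297657.01 ms. A quick check that this matches the reported 4.23 tokens/s:

```shell
awk 'BEGIN {
  ms   = 297657.01   # eval time from the log above
  runs = 1260        # generated tokens
  printf "%.2f tokens/s\n", runs / (ms / 1000)
}'
```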
Interactive runs, first with a small context and default sampling, then with a larger context and tuned sampling parameters:
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf --color --ctx_size 200 -n -1 -ins -t 20
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 20
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 26.49 GiB (4.87 BPW)
llm_load_print_meta: general.name = emozilla
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: CPU buffer size = 27127.88 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 200
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 25.00 MiB
llama_new_context_with_model: KV self size = 25.00 MiB, K (f16): 12.50 MiB, V (f16): 12.50 MiB
llama_new_context_with_model: CPU input buffer size = 9.40 MiB
llama_new_context_with_model: CPU compute buffer size = 44.74 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 20 / 28 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
llama_print_timings: total time = 673176.72 ms / 2838 tokens
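The 25 MiB KV cache in the log is consistent with Mixtral 8x7B's attention shape (assumed here from the public model config: 32 layers, 8 KV heads, head dim 128, f16 K and V): 2 tensors (K and V) × 32 × 8 × 128 × 2 bytes = 128 KiB per token, times n_ctx = 200:

```shell
awk 'BEGIN {
  # K+V x layers x kv-heads x head-dim x f16 bytes = bytes per cached token
  per_tok = 2 * 32 * 8 * 128 * 2
  printf "%.2f MiB\n", per_tok * 200 / (1024 * 1024)
}'
```

This also explains why the f16 KV cache stays cheap even at --ctx_size 2048 (about 256 MiB).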
Summary
CPU usage (from top):
5867 root 20 0 26.9g 26.5g 26.5g R 1981 85.1 45:59.12 main
Roughly 2000% CPU with 20 threads; the thread count could be pushed a bit higher, e.g. to 24.
The speed is acceptable at about 4 tokens/s, and I'm still experimenting. This CPU holds up well; a high-end AMD CPU would probably do even better. The main win is that no GPU is needed.