Environment
OS: CentOS 7
CPU: Intel® Xeon® E5-2680 v4 @ 2.40 GHz (14 cores / 28 threads)
RAM: 32 GB DDR3
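llama.cpp selects its SIMD kernels at build time, so it is worth checking which extensions the CPU exposes up front (the E5-2680 v4 supports AVX2, FMA, and F16C, but not AVX-512). A quick sketch, assuming a Linux /proc/cpuinfo:

```shell
# Report whether the instruction-set flags llama.cpp cares about are present.
for flag in avx2 fma f16c avx512f; do
  if grep -qw "$flag" /proc/cpuinfo; then
    echo "$flag: yes"
  else
    echo "$flag: no"
  fi
done
```

The same information shows up later in llama.cpp's own `system_info` line at startup.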
Dependencies
make --version
GNU Make 4.3
gcc --version
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
g++ --version
g++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
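CentOS 7 ships GCC 4.8 by default, which is too old for llama.cpp; a toolchain like the 11.x one above is typically enabled via Software Collections (devtoolset). A minimal sketch of checking the active compiler, parsing the version banner shown above with plain shell parameter expansion:

```shell
# Extract the major version from the `gcc --version` banner above.
ver_line="gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)"
major="${ver_line#gcc (GCC) }"   # strip the leading "gcc (GCC) " prefix
major="${major%%.*}"             # keep only the major version number
echo "gcc major version: $major"
```

In a real check you would feed `gcc --version | head -n1` in instead of the literal string.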
Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Wait a moment for the build to finish.
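By default `make` builds with a single job; on a 28-thread machine a parallel build is much faster. A sketch (the Makefile guard is only there so the snippet is safe to run outside the checkout):

```shell
# One make job per logical core (28 on this machine).
jobs="$(nproc)"
echo "building with ${jobs} jobs"
if [ -f Makefile ]; then
  make -j"${jobs}"
else
  echo "run this inside the llama.cpp checkout"
fi
```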
Inspect the build artifacts:
ls -lh
-rwxr-xr-x. 1 root root 1.6M Feb 23 07:54 main
-rwxr-xr-x. 1 root root 2.6M Feb 23 07:55 server
.....
Download the model
https://hf-mirror.com/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf
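The file is about 26 GiB, so a resumable download is worth having. Assuming the mirror follows the huggingface.co URL layout (`resolve/main/<file>`), the URL can be built like this and then passed to `wget -c`:

```shell
# Build the download URL; fetch it with: wget -c "$url"
repo="https://hf-mirror.com/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF"
file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf"
url="${repo}/resolve/main/${file}"
echo "$url"
```

`wget -c` continues a partial download instead of restarting, which matters at this file size.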
Test
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n -1 -e
llama_print_timings: load time = 2968.89 ms
llama_print_timings: sample time = 600.56 ms / 1261 runs ( 0.48 ms per token, 2099.71 tokens per second)
llama_print_timings: prompt eval time = 1853.61 ms / 19 tokens ( 97.56 ms per token, 10.25 tokens per second)
llama_print_timings: eval time = 297657.01 ms / 1260 runs ( 236.24 ms per token, 4.23 tokens per second)
llama_print_timings: total time = 300631.20 ms / 1279 tokens
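The headline number here is the eval rate: 1260 generated tokens over 297657.01 ms. A quick check that this matches the reported 4.23 tokens/s:

```shell
awk 'BEGIN {
  ms   = 297657.01   # eval time from the log above
  runs = 1260        # generated tokens
  printf "%.2f tokens/s\n", runs / (ms / 1000)
}'
```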
Interactive runs, first with a small context and default sampling, then with a larger context and tuned sampling parameters:
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf --color --ctx_size 200 -n -1 -ins -t 20
./main -m /models/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 20
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 26.49 GiB (4.87 BPW)
llm_load_print_meta: general.name = emozilla
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: CPU buffer size = 27127.88 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 200
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 25.00 MiB
llama_new_context_with_model: KV self size = 25.00 MiB, K (f16): 12.50 MiB, V (f16): 12.50 MiB
llama_new_context_with_model: CPU input buffer size = 9.40 MiB
llama_new_context_with_model: CPU compute buffer size = 44.74 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 20 / 28 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
llama_print_timings: total time = 673176.72 ms / 2838 tokens
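The 25 MiB KV cache in the log is consistent with Mixtral 8x7B's attention shape (assumed here from the public model config: 32 layers, 8 KV heads, head dim 128, f16 K and V): 2 tensors (K and V) × 32 × 8 × 128 × 2 bytes = 128 KiB per token, times n_ctx = 200:

```shell
awk 'BEGIN {
  # K+V x layers x kv-heads x head-dim x f16 bytes = bytes per cached token
  per_tok = 2 * 32 * 8 * 128 * 2
  printf "%.2f MiB\n", per_tok * 200 / (1024 * 1024)
}'
```

This also explains why the f16 KV cache stays cheap even at --ctx_size 2048 (about 256 MiB).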
Summary
CPU usage (from top):
5867 root 20 0 26.9g 26.5g 26.5g R 1981 85.1 45:59.12 main
Roughly 2000% CPU with 20 threads; the thread count could be pushed a bit higher, e.g. to 24.
The speed is acceptable at about 4 tokens/s, and I'm still experimenting. This CPU holds up well; a high-end AMD CPU would probably do even better. The main win is that no GPU is needed.