Solved: 70B model quantizing on Mac: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192
Whenever the display turns off automatically, my Mac goes to sleep along with it, so the network drops in the middle of downloading a large model and the download cannot resume. Fiddling with the settings did not fully fix it; in the end the simplest trick worked.
- Under Displays → display energy settings → Energy ("Prevent automatic sleeping when the display is off"), you can stop the system from sleeping.
- Even with that setting the download still dropped partway through, so I tried playing a video while downloading with the display turned off, and it never disconnected again.
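Another option worth noting: macOS ships a caffeinate utility that keeps the machine awake for as long as a command runs, which avoids touching the display settings at all. A minimal sketch, where the URL is only a placeholder for whatever is being downloaded:
caffeinate -i curl -L -O https://example.com/llama-2-70b-chat.tar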
Deploying a LLaMA model on a Mac takes some extra setup so that it can run in memory on the ARM architecture. The two most-starred projects at the moment are llama2.c and llama.cpp:
llama.cpp is the older project.
The C version can only be used for inference, and I am not yet familiar with the cpp version, so I will start with the cpp version.
Download LLaMA-2
ls ../../llama/llama-2*
../../llama/llama-2-13b:
checklist.chk consolidated.00.pth consolidated.01.pth params.json
../../llama/llama-2-13b-chat:
checklist.chk consolidated.00.pth consolidated.01.pth params.json
../../llama/llama-2-70b:
checklist.chk consolidated.01.pth consolidated.03.pth consolidated.05.pth consolidated.07.pth
consolidated.00.pth consolidated.02.pth consolidated.04.pth consolidated.06.pth params.json
../../llama/llama-2-70b-chat:
checklist.chk consolidated.01.pth consolidated.03.pth consolidated.05.pth consolidated.07.pth params.json
consolidated.00.pth consolidated.02.pth consolidated.04.pth consolidated.06.pth ggml-model-f16.bin
../../llama/llama-2-7b:
checklist.chk consolidated.00.pth params.json
../../llama/llama-2-7b-chat:
checklist.chk consolidated.00.pth params.json
Deploy LLaMA-2
- Step 1. Download llama.cpp
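This is just a git clone (repository address at the time of writing):
git clone https://github.com/ggerganov/llama.cpp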
- Step 2. Build it:
cd llama.cpp; make
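As an aside, llama.cpp at the time also documented a Metal build flag for offloading work to the Apple GPU; I did not use it for this run, so treat it as an untested assumption:
LLAMA_METAL=1 make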
- Step 3. Prepare a Python 3.10.x environment:
pip install torch numpy sentencepiece
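One way to set that up from scratch, assuming python3.10 is already installed:
python3.10 -m venv .venv
source .venv/bin/activate
pip install torch numpy sentencepiece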
- Step 4. Convert the original model to ggml format:
python convert-pth-to-ggml.py models/7B/ 1
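For the 70B chat weights, using the paths from the listing above, the analogous call (not re-verified here) would be:
python convert-pth-to-ggml.py ../../llama/llama-2-70b-chat/ 1
which would produce the ggml-model-f16.bin that Step 5 consumes.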
- Step 5. Quantize the model to the desired precision. I ran three conversions; the resulting file size is noted after each command:
./quantize ../../llama/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-f32.bin 0   # F32, 275G
./quantize ../../llama/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-f16.bin 1   # F16, 138G
./quantize ../../llama/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q4-0.bin 2   # Q4_0, 39G
The quantize arguments are as follows:
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] model-f32.bin [model-quant.bin] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
Allowed quantization types:
2 or Q4_0 : 3.50G, +0.2499 ppl @ 7B
3 or Q4_1 : 3.90G, +0.1846 ppl @ 7B
8 or Q5_0 : 4.30G, +0.0796 ppl @ 7B
9 or Q5_1 : 4.70G, +0.0415 ppl @ 7B
10 or Q2_K : 2.67G, +0.8698 ppl @ 7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5505 ppl @ 7B
12 or Q3_K_M : 3.06G, +0.2437 ppl @ 7B
13 or Q3_K_L : 3.35G, +0.1803 ppl @ 7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.56G, +0.1149 ppl @ 7B
15 or Q4_K_M : 3.80G, +0.0535 ppl @ 7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0353 ppl @ 7B
17 or Q5_K_M : 4.45G, +0.0142 ppl @ 7B
18 or Q6_K : 5.15G, +0.0044 ppl @ 7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ 7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
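For reference, the type column suggests either the number or the name can be passed; a mid-range option such as Q5_K_M would look like this (illustrative only, not a run from this post):
./quantize ../../llama/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q5-k-m.bin Q5_K_M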
- Step 6. Run the model:
./main -m ./models/llama-2-70b-chat/ggml-model-q4-0.bin -n 8192
This fails with the error quoted in the title; the fix is to add the -gqa option: ./main -m ./models/llama-2-70b-chat/ggml-model-q4-0.bin -n 8192 -gqa 8
-gqa N, --gqa N grouped-query attention factor (TEMP!!! use 8 for LLaMAv2 70B) (default: 1)
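Put together, a full run with a prompt might look like this (the prompt text, -t thread count, and -n value are only examples; -gqa 8 is the essential part for the 70B model):
./main -m ./models/llama-2-70b-chat/ggml-model-q4-0.bin -gqa 8 -n 512 -t 16 -p "Hello, how are you?"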
With that it worked. It really is slow: after several minutes it had spit out only half a word, not even managing to finish "donnot". I'll take a nap first, wait for it, and give it some space.
main: build = 938 (c574bdd)
main: seed = 1690992788
llama.cpp: loading model from ./models/llama-2-70b-chat/ggml-model-f16.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 131565.25 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
llama_new_context_with_model: compute buffer total size = 145.35 MB
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0
Memory does not seem to be maxed out yet; going to sleep first.
4-bit
llama_print_timings: load time = 20689.72 ms
llama_print_timings: sample time = 1254.91 ms / 1729 runs ( 0.73 ms per token, 1377.79 tokens per second)
llama_print_timings: prompt eval time = 168671.04 ms / 1287 tokens ( 131.06 ms per token, 7.63 tokens per second)
llama_print_timings: eval time = 6960074.33 ms / 1723 runs ( 4039.51 ms per token, 0.25 tokens per second)
llama_print_timings: total time = 7130177.50 ms
8-bit
llama_print_timings: load time = 84462.46 ms
llama_print_timings: sample time = 362.71 ms / 472 runs ( 0.77 ms per token, 1301.32 tokens per second)
llama_print_timings: prompt eval time = 3594.17 ms / 2 tokens ( 1797.09 ms per token, 0.56 tokens per second)
llama_print_timings: eval time = 6717785.87 ms / 471 runs (14262.82 ms per token, 0.07 tokens per second)
llama_print_timings: total time = 6721791.83 ms
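As a quick sanity check on those numbers, tokens per second is just 1000 divided by the per-token milliseconds reported in the eval lines:
python3 -c "print(1000/4039.51, 1000/14262.82)"   # ~0.248 and ~0.070 tokens/s, matching the 0.25 and 0.07 above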