A First Look at LLaMA 2: A Hands-On Introduction to Large Language Models

1. Preparation

Ubuntu environment:

Distributor ID:    Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04

 

sudo apt update
sudo apt-get install gcc g++ python3 python3-pip
# install the Python dependencies
python3 -m pip install torch numpy sentencepiece
sudo apt-get install ccache
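
Before moving on, it can be worth confirming that the three Python packages import cleanly, since the conversion script used later depends on them. A minimal sketch (the file name check_deps.py is just illustrative):

# check_deps.py - confirm the packages needed by llama.cpp's convert.py import cleanly
import numpy
import sentencepiece
import torch

print("torch        ", torch.__version__)
print("numpy        ", numpy.__version__)
print("sentencepiece", sentencepiece.__version__)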

2. Building llama.cpp

Download llama.cpp from GitHub and build it.

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp.git

Build:

cd llama.cpp

I check out a relatively early commit here, since I have not yet gotten the newer versions working. If you have run a model with a recent version, please leave a comment so I can learn from it.

git checkout  f7fc5f6c6f3700da311de7fe93977c81798f02d3

make -j8

3. Downloading the Model

I use the 7B instruction-tuned Chinese-Alpaca-2 model. For other models, see the documentation in the repository below.

https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

Model comparison:

| Item | Chinese-LLaMA-2 | Chinese-Alpaca-2 |
|------|-----------------|------------------|
| Model type | Base model | Instruction/chat model (ChatGPT-like) |
| Released sizes | 1.3B, 7B, 13B | 1.3B, 7B, 13B |
| Training type | Causal LM (CLM) | Instruction fine-tuning |
| Training method | 7B, 13B: LoRA + full emb/lm-head; 1.3B: full-parameter training | 7B, 13B: LoRA + full emb/lm-head; 1.3B: full-parameter training |
| Trained from | Original Llama-2 (not the chat version) | Chinese-LLaMA-2 |
| Training data | Unlabeled general corpus (120 GB plain text) | Labeled instruction data (5 million examples) |
| Vocabulary size [1] | 55,296 | 55,296 |
| Context length [2] | Standard: 4K (12K-18K); long-context (PI): 16K (24K-32K); long-context (YaRN): 64K | Standard: 4K (12K-18K); long-context (PI): 16K (24K-32K); long-context (YaRN): 64K |
| Input template | Not required | Requires a specific template [3], similar to Llama-2-Chat (see the sketch after this table) |
| Suitable for | Text continuation: generate a continuation of the given text | Instruction following: Q&A, writing, chat, interaction, etc. |
| Not suitable for | Instruction following, multi-turn chat, etc. | Unconstrained free-form text generation |
| Preference alignment | | RLHF versions (1.3B, 7B) |
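
Because Alpaca-2 is an instruction/chat model, prompts are expected to be wrapped in its Llama-2-Chat style template (the "Input template" row above). The authoritative template and default system prompt are documented in the Chinese-LLaMA-Alpaca-2 repository; the sketch below only illustrates the general [INST] <<SYS>> ... <</SYS>> ... [/INST] shape, with a placeholder system prompt.

# build_prompt.py - hypothetical sketch of a Llama-2-Chat style prompt wrapper.
# The authoritative template and default system prompt are defined in the
# Chinese-LLaMA-Alpaca-2 repository; the strings below are placeholders.

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder system prompt

def build_prompt(instruction: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap a single-turn instruction in the [INST] <<SYS>> ... <</SYS>> ... [/INST] format."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"

if __name__ == "__main__":
    print(build_prompt("世界上最高的山峰是什么?"))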

4. Deploying the Model

During the build, make warns that ccache is not installed:

make -j8
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 -g 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  examples/deprecation-warning/deprecation-warning.o -o main  
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  examples/deprecation-warning/deprecation-warning.o -o server  
NOTICE: The 'main' binary is deprecated. Please use 'llama-cli' instead.
NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead.

Fix (install ccache):

sudo apt-get update

sudo apt-get install ccache

4.1 Converting the Model to GGUF (F16)

mkdir -p models/7b
# place the downloaded Chinese-Alpaca-2-7B files (pytorch_model-*.bin, config.json,
# tokenizer.model, ...) in models/7b/, then run from the llama.cpp root:
python3 convert.py models/7b/

The conversion may fail with the following error, which requires a small fix:

python3 convert.py  models/7b/
Traceback (most recent call last):
  File "/home/srd/llama/llama.cpp/convert.py", line 1523, in <module>
    main()
  File "/home/srd/llama/llama.cpp/convert.py", line 1426, in main
    if np.uint32(1) == np.uint32(1).newbyteorder("<"):
AttributeError: `newbyteorder` was removed from scalar types in NumPy 2.0. Use `sc.view(sc.dtype.newbyteorder(order))` instead.

Fix:

This error means the code is incompatible with the installed NumPy version. In NumPy 2.0, the newbyteorder method was removed from scalar types, so convert.py must be updated to use the new API:

 

- if np.uint32(1) == np.uint32(1).newbyteorder("<"):
+ if np.dtype(np.uint32).newbyteorder("<").type(1) == np.uint32(1):
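
For reference, the same little-endian check can also be written the way the NumPy 2.0 error message itself suggests, by viewing the scalar through a byte-swapped dtype. This standalone sketch only illustrates the API change and is not part of convert.py:

# numpy2_endian_check.py - little-endian check that works on NumPy >= 2.0
import sys
import numpy as np

sc = np.uint32(1)

# Pre-2.0 style (raises AttributeError on NumPy 2.x):
#   sc == sc.newbyteorder("<")

# 2.0-compatible style from the error message: byte order now lives on the dtype,
# so view the scalar through a dtype that has been forced to little-endian.
is_little_endian = bool(sc == sc.view(sc.dtype.newbyteorder("<")))

print(is_little_endian, sys.byteorder == "little")  # both True on x86_64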

Key lines from the conversion log:

Loading model file models/7b/pytorch_model-00001-of-00002.bin
Loading model file models/7b/pytorch_model-00001-of-00002.bin
Loading model file models/7b/pytorch_model-00002-of-00002.bin
params = Params(n_vocab=55296, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('models/7b'))
Loaded vocab file PosixPath('models/7b/tokenizer.model'), type 'spm'
Vocab info: <SentencePieceVocab with 55296 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting layer 24
Permuting layer 25
Permuting layer 26
Permuting layer 27
Permuting layer 28
Permuting layer 29
Permuting layer 30
Permuting layer 31
model.embed_tokens.weight                        -> token_embd.weight                        | F16    | [55296, 4096]
model.layers.0.self_attn.q_proj.weight           -> blk.0.attn_q.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.k_proj.weight           -> blk.0.attn_k.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.v_proj.weight           -> blk.0.attn_v.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.o_proj.weight           -> blk.0.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.0.attn_rot_embd
model.layers.0.mlp.gate_proj.weight              -> blk.0.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.0.mlp.up_proj.weight                -> blk.0.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.0.mlp.down_proj.weight              -> blk.0.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.0.input_layernorm.weight            -> blk.0.attn_norm.weight                   | F16    | [4096]
model.layers.0.post_attention_layernorm.weight   -> blk.0.ffn_norm.weight                    | F16    | [4096]
model.layers.1.self_attn.q_proj.weight           -> blk.1.attn_q.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.k_proj.weight           -> blk.1.attn_k.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.v_proj.weight           -> blk.1.attn_v.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.o_proj.weight           -> blk.1.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.1.attn_rot_embd
model.layers.1.mlp.gate_proj.weight              -> blk.1.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.1.mlp.up_proj.weight                -> blk.1.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.1.mlp.down_proj.weight              -> blk.1.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.1.input_layernorm.weight            -> blk.1.attn_norm.weight                   | F16    | [4096]
model.layers.1.post_attention_layernorm.weight   -> blk.1.ffn_norm.weight                    | F16    | [4096]
model.layers.2.self_attn.q_proj.weight           -> blk.2.attn_q.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.k_proj.weight           -> blk.2.attn_k.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.v_proj.weight           -> blk.2.attn_v.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.o_proj.weight           -> blk.2.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.2.attn_rot_embd
model.layers.2.mlp.gate_proj.weight              -> blk.2.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.2.mlp.up_proj.weight                -> blk.2.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.2.mlp.down_proj.weight              -> blk.2.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.2.input_layernorm.weight            -> blk.2.attn_norm.weight                   | F16    | [4096]
...
model.layers.31.post_attention_layernorm.weight  -> blk.31.ffn_norm.weight                   | F16    | [4096]
model.norm.weight                                -> output_norm.weight                       | F16    | [4096]
lm_head.weight                                   -> output.weight                            | F16    | [55296, 4096]
Writing models/7b/ggml-model-f16.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
[  1/291] Writing tensor token_embd.weight                      | size  55296 x   4096  | type F16  | T+   0
[  2/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   0
[  3/291] Writing tensor blk.0.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   1
[  4/291] Writing tensor blk.0.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   1
[  5/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   1
[  6/291] Writing tensor blk.0.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   1
[  7/291] Writing tensor blk.0.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   1
[  8/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   1
[  9/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   1
[ 10/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   1
[ 11/291] Writing tensor blk.1.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 12/291] Writing tensor blk.1.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 13/291] Writing tensor blk.1.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 14/291] Writing tensor blk.1.attn_output.weight               | size   4096 x   4096  | type F16  | T+   1
[ 15/291] Writing tensor blk.1.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   1
[ 16/291] Writing tensor blk.1.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   1
[ 17/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   1
[ 18/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   1
...
[286/291] Writing tensor blk.31.ffn_up.weight                   | size  11008 x   4096  | type F16  | T+  17
[287/291] Writing tensor blk.31.ffn_down.weight                 | size   4096 x  11008  | type F16  | T+  17
[288/291] Writing tensor blk.31.attn_norm.weight                | size   4096           | type F32  | T+  17
[289/291] Writing tensor blk.31.ffn_norm.weight                 | size   4096           | type F32  | T+  17
[290/291] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+  17
[291/291] Writing tensor output.weight                          | size  55296 x   4096  | type F16  | T+  18
Wrote models/7b/ggml-model-f16.gguf
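
As a sanity check, the hyperparameters printed in the Params(...) line above are enough to estimate the parameter count and therefore the expected size of the F16 GGUF file. The rough calculation below (a sketch; it ignores a few kilobytes of metadata) lands at about 6.93 B parameters and roughly 13.2 GB, matching the "model size = 13217.02 MB" reported by the quantizer in the next step.

# f16_size_estimate.py - rough parameter count from the Params(...) line above
n_vocab, n_embd, n_layer, n_ff = 55296, 4096, 32, 11008

per_layer = (
    4 * n_embd * n_embd      # attn_q, attn_k, attn_v, attn_output
    + 3 * n_embd * n_ff      # ffn_gate, ffn_up, ffn_down
    + 2 * n_embd             # attn_norm, ffn_norm
)
total = n_layer * per_layer + 2 * n_vocab * n_embd + n_embd  # + embeddings, lm_head, final norm

print(f"params ~= {total / 1e9:.2f} B")            # ~6.93 B
print(f"F16 size ~= {total * 2 / 2**20:.0f} MiB")  # ~13217 MiB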

4.2 4-bit Quantization

Quantization options:

./quantize --help
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 13.00G              @ 7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

Run the quantization:

./quantize models/7b/ggml-model-f16.gguf  ./models/7B_q4k.gguf Q4_K

Quantization log:

main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'models/7b/ggml-model-f16.gguf' to './models/7B_q4k.gguf' as Q4_K
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/7b/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 55296
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,55296]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,55296]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,55296]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llama_model_quantize_internal: meta size = 1248928 bytes
[   1/ 291]                    token_embd.weight - [ 4096, 55296,     1,     1], type =    f16, converting to q4_K .. size =   432.00 MiB ->   121.50 MiB
[   2/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[   3/ 291]                  blk.0.attn_k.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[   4/ 291]                  blk.0.attn_v.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q6_K .. size =    32.00 MiB ->    13.12 MiB
...
[ 290/ 291]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 291/ 291]                        output.weight - [ 4096, 55296,     1,     1], type =    f16, converting to q6_K .. size =   432.00 MiB ->   177.19 MiB
llama_model_quantize_internal: model size  = 13217.02 MB
llama_model_quantize_internal: quant size  =  4017.08 MB

main: quantize time = 49061.08 ms
main:    total time = 49061.08 ms

The quantized model is about 4.0 GB on disk: models/7B_q4k.gguf
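
This size is consistent with the bits-per-weight figure llama.cpp reports when it later loads the file (3.92 GiB at 4.86 BPW): quantizing roughly 6.93 B weights at about 4.86 bits each gives about 4 GB. A quick check of the arithmetic:

# q4k_size_estimate.py - check the quantized size against bits-per-weight
params = 6.93e9  # parameter count reported by the loader ("model params = 6.93 B")
bpw    = 4.86    # Q4_K_M mixture, as reported by llm_load_print_meta

size_bytes = params * bpw / 8
print(f"Q4_K size ~= {size_bytes / 2**20:.0f} MiB ({size_bytes / 2**30:.2f} GiB)")
# ~4015 MiB (3.92 GiB), within rounding of the 4017.08 MB reported above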

5. Running and Testing the Model

./main -m ./models/7B_q4k.gguf -c 512 -b 1024 -n 256 --keep 48 \
    --repeat_penalty 1.0 --color -i \
    -r "User:" -f prompts/chat-with-bob.txt
./main -m ./models/7B_q4k.gguf -c 512 -b 1024 -n 256 --keep 48 \
    --repeat_penalty 1.0 --color -i \
    -r "User:" -f prompts/chat-with-bob.txt
Log start
main: build = 2579 (f7fc5f6c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724083617
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./models/7B_q4k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 55296
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,55296]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,55296]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,55296]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 55296
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.93 B
llm_load_print_meta: model size       = 3.92 GiB (4.86 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  4017.08 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.21 MiB
llama_new_context_with_model:        CPU compute buffer size =   116.00 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 1024, n_predict = 256, n_keep = 49


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:ninihao
Bob: Hello, how can I help you today?
User:
 I have an assignment due tomorrow and I need help with it.

Alternatively, you can run the model with examples/chat.sh.
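
One detail worth noting in the load log: the "KV self size = 256.00 MiB" line follows directly from the run parameters. With -c 512 the context holds 512 tokens, and since this model has n_head_kv = n_head (no grouped-query attention), every one of the 32 layers caches a full 4096-wide F16 key and value vector per token. A quick check of the arithmetic:

# kv_cache_size.py - reproduce the "KV self size = 256.00 MiB" line from the log
n_ctx, n_layer, n_embd, f16_bytes = 512, 32, 4096, 2

kv_bytes = 2 * n_ctx * n_layer * n_embd * f16_bytes  # factor 2: one K cache and one V cache
print(f"KV cache = {kv_bytes / 2**20:.2f} MiB")      # 256.00 MiB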

5.1 Sample Output

User:ninihao
Bob: Hello, how can I help you today?
User:
 I have an assignment due tomorrow and I need help with it.

Now let's look at how it answers questions in Chinese:

User:你好,llama,我是你的忠实粉丝
Bob:您好,很高兴见到您。我是Bob,我的职业是助手。您可以向我提出任何问题,我会尽力回答。您有什么我可以帮助您的吗?
 User:好的,我想知道世界上最高的山峰是什么?
 Bob:非常好的问题。世界上最高的山峰是珠穆朗玛峰,位于尼泊尔和中国的边境。它海拔8848米。
 User:谢谢,Bob!你是个好助手!
 Bob:很高兴能为您效劳。如果您有任何其他问题,请随时联系我。
 User:你好,Bob。我是你的忠实粉丝。
 Bob:你好!很高兴能为您效劳。您有什么我可以帮助您的吗?
 User:我想知道这个国家的国旗是什么颜色。
 Bob:好的,我可以为您提供一些关于这个国家的国旗的信息。这个国家的国旗是绿色、白色和红色的。绿色代表自然,白色代表纯洁,而红色代表勇气和独立。
 User:谢谢,Bob!
 Bob:很高兴能为您效劳。如果您有任何其他问题,请随时联系我。
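
For scripted, single-turn use instead of the interactive session above, the same binary can be driven from Python. This is a minimal sketch, assuming the -p/--prompt and -n flags accepted by this build of main; the wrapper script name and the prompt text are only illustrative:

# run_once.py - hypothetical one-shot (non-interactive) wrapper around ./main
import subprocess

prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n什么是模型量化? [/INST]"

result = subprocess.run(
    ["./main", "-m", "./models/7B_q4k.gguf", "-n", "256", "-p", prompt],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # generated text goes to stdout; llama.cpp logs go to stderr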

 

6. Memory Requirements

cat examples/quantize/README.md

For more details, read the READMEs under llama.cpp. The full set of built commands is listed below, followed by the README's notes on memory and quantization:

llama-baby-llama               llama-cli                      llama-export-lora              llama-gritlm                   llama-lookup                   llama-parallel                 llama-quantize-stats           llama-speculative
llama-batched                  llama-convert-llama2c-to-ggml  llama-gbnf-validator           llama-imatrix                  llama-lookup-create            llama-passkey                  llama-retrieval                llama-tokenize
llama-batched-bench            llama-cvector-generator        llama-gguf                     llama-infill                   llama-lookup-merge             llama-perplexity               llama-save-load-state          llama-vdot
llama-bench                    llama-embedding                llama-gguf-hash                llama-llava-cli                llama-lookup-stats             llama-q8dot                    llama-server                   
llama-benchmark-matmult        llama-eval-callback            llama-gguf-split               llama-lookahead                llama-minicpmv-cli             llama-quantize                 llama-simple

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| Model | Original size | Quantized size (Q4_0) |
|------:|--------------:|----------------------:|
|    7B |         13 GB |                3.9 GB |
|   13B |         24 GB |                7.8 GB |
|   30B |         60 GB |               19.5 GB |
|   65B |        120 GB |               38.5 GB |

## Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

*(outdated)*

| Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
|    7B | perplexity   | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
|    7B | file size    |  13.0G |   3.5G |   3.9G |   4.3G |   4.7G |   6.7G |
|    7B | ms/tok @ 4th |    127 |     55 |     54 |     76 |     83 |     72 |
|    7B | ms/tok @ 8th |    122 |     43 |     45 |     52 |     56 |     67 |
|    7B | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
|   13B | perplexity   | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
|   13B | file size    |  25.0G |   6.8G |   7.6G |   8.3G |   9.1G |    13G |
|   13B | ms/tok @ 4th |      - |    103 |    105 |    148 |    160 |    131 |
|   13B | ms/tok @ 8th |      - |     73 |     82 |     98 |    105 |    128 |
|   13B | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |

7. Does the Model Work Offline?

Yes. Inference runs entirely locally, so the model still works without a network connection.


8. Memory Usage While Running

Before running:

free -mh
               total        used        free      shared  buff/cache   available
Mem:            15Gi       4.6Gi       1.1Gi       1.0Gi       9.6Gi       9.3Gi
Swap:          2.0Gi       1.4Gi       636Mi

 

After running:

       total        used        free      shared  buff/cache   available
Mem:            15Gi       5.0Gi       797Mi       958Mi       9.6Gi       8.9Gi
Swap:          2.0Gi       1.4Gi       636Mi
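
The "used" column only grows by roughly 0.4 GiB even though the model file is about 4 GiB. This is because llama.cpp maps the GGUF file into memory by default (mmap; it can be disabled with --no-mmap), so most of the weights are accounted for as file-backed page cache rather than process-private memory. Summing the buffers reported in the load log gives a rough upper bound on what the run actually needs:

# resident_memory_estimate.py - add up the buffers reported in the load log above
buffers_mib = {
    "model weights (mmap)": 4017.08,  # "CPU buffer size"
    "KV cache":             256.00,   # "KV self size" (F16 K + V)
    "compute buffer":       116.00,   # "CPU compute buffer size"
    "output buffer":        0.21,     # "CPU output buffer size"
}

total_mib = sum(buffers_mib.values())
print(f"expected peak ~= {total_mib:.0f} MiB ({total_mib / 1024:.1f} GiB)")  # ~4389 MiB (4.3 GiB)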

 

 
