2. Quantization in runq
1)
```c
// build the Transformer via the model .bin file
Transformer transformer;
build_transformer(&transformer, checkpoint_path);
```
> **s1:** Create an instance variable `transformer` of type `Transformer`. It bundles `Config` (the design blueprint of the model architecture), `weights` (which weights the model has), `state` (the activations of the current layer during the forward pass), `fd` (the file descriptor of the memory mapping), `*data` (the memory pointer), `file_size` (the size of the model weights), and so on.
2)
```c
void build_transformer(Transformer *t, char* checkpoint_path)
{
    // read in the Config and the Weights from the checkpoint
    read_checkpoint(checkpoint_path, &t->config, &t->weights, &t->fd, &t->data, &t->file_size);
    // allocate the RunState buffers
    malloc_run_state(&t->state, &t->config);
}
```
3)
```c
void read_checkpoint(char* checkpoint, Config* config, TransformerWeights* weights,
                     int* fd, float** data, ssize_t* file_size)
{
    // ... (reads the header fields and mmaps the file; detailed in s2 below) ...
    void* weights_ptr = ((char*)*data) + header_size; // skip header bytes. char is 1 byte
    memory_map_weights(weights, config, weights_ptr, shared_classifier);
}
```
**s2:** `read_checkpoint`, called from `build_transformer`, is passed the model path, the `Config` defined in runq.c, the model's weights (`TransformerWeights`), the memory mapping fd, the memory pointer, and the size; its job is to assign values to all of them.

**s2_1:** It first reads some bookkeeping from the .bin file: the `magic_number` and `version` described in (1) earlier, plus the `Config`, `shared_classifier`, `group_size`, etc. In the version-2 format the header as a whole is 256 bytes (export.py documents this).

As in (1):
```c
uint32_t magic_number;
if (fread(&magic_number, sizeof(uint32_t), 1, file) != 1) { exit(EXIT_FAILURE); }
```
reads one value of size `uint32_t` from `file` into `&magic_number`.

As in (2):
```c
if (fread(config, sizeof(Config), 1, file) != 1) { exit(EXIT_FAILURE); }
```
`config` is an instance of the `Config` struct in runq.c: `fread` reads `sizeof(Config)` bytes from the file and stores them in the memory `config` points to. This is exactly what export.py packed:
```python
header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
                     n_kv_heads, p.vocab_size, p.max_seq_len)
```
**s2_2:** Having read the parameters above, get the size of llama2_7b_q80.bin:
```c
fseek(file, 0, SEEK_END); // move file pointer to end of file
*file_size = ftell(file); // get the file size, in bytes
fclose(file);
```
`file` is a `FILE*` pointer; `fseek` moves it to the end of the file, and `ftell` then returns the offset there, i.e. the file size (this becomes `file_size` in the `Transformer` struct).
**s2_3:** With the size known, llama2_7b_q80.bin is read again: calling `mmap` maps the entire file into memory. `mmap` returns a pointer to the mapped region, which is assigned to the `data` pointer passed into the function (the memory pointer in the `Transformer` struct). For example:
```c
*fd = open(checkpoint, O_RDONLY); // open in read only mode (the fd in the Transformer struct)
*data = mmap(NULL, *file_size, PROT_READ, MAP_PRIVATE, *fd, 0);
```

**s2_4:** The .bin layout is header + weights, so the weights pointer = the mapped `data` pointer + the header size:
```c
void* weights_ptr = ((char*)*data) + header_size; // skip header bytes. char is 1 byte
```
Debug print:
```
(gdb) print (void*) *data
$13 = (void *) 0x7f4c09091000
(gdb) print weights_ptr
$14 = (void *) 0x7f4c09091100
```
The two pointers differ by 0x100 = 256 bytes, exactly the header size.
4)
```c
memory_map_weights(weights, config, weights_ptr, shared_classifier); // pass in TransformerWeights *w, the architecture Config, and the weights pointer (address)

void memory_map_weights(TransformerWeights *w, Config* p, void* ptr, uint8_t shared_classifier) {
```
**s2.5:** Of everything s2 set out to fill in, only `TransformerWeights weights` has not been assigned yet; this is the code that does it.

**s2.5.1:** The weights of `TransformerWeights *w` (the list of weights the transformer has) are assigned by walking a pointer; the runq.c code lines up with the export.py code. In runq.c:
```c
float* fptr = (float*) ptr; // cast our pointer to float*: the cast tells the compiler that ptr points at a contiguous array of floats
w->rms_att_weight = fptr;
fptr += p->n_layers * p->dim;
w->rms_ffn_weight = fptr;
fptr += p->n_layers * p->dim;
w->rms_final_weight = fptr;
fptr += p->dim;
```
Here,
```c
w->rms_att_weight = fptr;
fptr += p->n_layers * p->dim;
```
shows that the weights are stored contiguously in memory: with 2 layers of 512 floats each, i.e. 512*2 floats in total, adding that count to the pointer lands exactly on the start of the next weight.
**s2.5.2:** Next, the remaining fields of `TransformerWeights *w` are assigned from the `weights` list in export.py (again in one-to-one correspondence). Pay particular attention to the token_embedding layer. The call:
```c
// 1 = number of quantized tensors to create; p->vocab_size * p->dim = shape of the token_embedding weights
w->q_tokens = init_quantized_tensors(&ptr, 1, p->vocab_size * p->dim);
```
The definition of `init_quantized_tensors`:
```c
QuantizedTensor *init_quantized_tensors(void **ptr, int n, int size_each) {
    QuantizedTensor *res = malloc(n * sizeof(QuantizedTensor));
```
The function picks out, from the header+weights image mapped into memory, the quantized `q` and `s` values of a weight, and moves the pointer to where the next weight starts. The parameter `n` is how many tensors were quantized, and `size_each` is how many quantized values each tensor holds. `QuantizedTensor` is defined as:
```c
typedef struct {
    int8_t* q; // quantized values
    float* s; // scaling factors
} QuantizedTensor;
```
It allocates a `QuantizedTensor` array `res` and points each element's `q` and `s` at the right addresses inside the mapped region. This corresponds to export.py:
```python
serialize_int8(out_file, q) # save the tensor in int8
serialize_fp32(out_file, s) # save scale factors
```
For the token_embedding layer, `q` and `s` are `p->vocab_size * p->dim` bytes apart (export.py: `self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)`).
**s2.5.3:** The call:
```c
// dequantize token embedding table
w->token_embedding_table = malloc(p->vocab_size * p->dim * sizeof(float));
dequantize(w->q_tokens, w->token_embedding_table, p->vocab_size * p->dim);
```
The definition:
```c
void dequantize(QuantizedTensor *qx, float* x, int n) {
    for (int i = 0; i < n; i++) {
        x[i] = qx->q[i] * qx->s[i / GS];
    }
}
```
The dequantization formula is x = q * s. `s[i / GS]` means the values are grouped GS at a time, and the i-th value uses the scale factor of whichever group it falls into.
In general, take
```c
w->wq = init_quantized_tensors(&ptr, p->n_layers, p->dim * (p->n_heads * head_size));
```
as an example: `ptr` is the weights pointer; `p->n_layers` is the number of quantized tensors to create, i.e. the number of hidden layers in the model; and `p->dim * (p->n_heads * head_size)` matches export.py's `self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)`, whose number of weight parameters = input dimension (`args.dim`) × output dimension (`args.n_heads * self.head_dim`).

At this point the config, weights, fd, data, and file_size of the .bin file mapped into memory have all been assigned into the `Transformer` struct. The current `transformer` looks like this:
```
$29 = {config = {dim = 4096, hidden_dim = 11008, n_layers = 2, n_heads = 32, n_kv_heads = 32, vocab_size = 32000, seq_len = 2048}, weights = {
    q_tokens = 0x55f3576d2480, token_embedding_table = 0x7f4be9c90010, rms_att_weight = 0x7f4c09091100, rms_ffn_weight = 0x7f4c09099100, wq = 0x55f3576d24a0,
    wk = 0x55f3576d24d0, wv = 0x55f3576d2500, wo = 0x55f3576d2530, w1 = 0x55f3576d2560, w2 = 0x55f3576d2590, w3 = 0x55f3576d25c0,
    rms_final_weight = 0x7f4c090a1100, wcls = 0x55f3576d25f0}, state = {x = 0x55f3576d2610, xb = 0x55f3576d6620, xb2 = 0x55f3576da630, hb = 0x55f3576de640,
    hb2 = 0x55f3576e9250, xq = {q = 0x55f3576f3e60 "", s = 0x55f3576f4e70}, hq = {q = 0x55f3576f8e80 "", s = 0x55f3576fb990}, q = 0x55f3577065a0,
    k = 0x55f35770a5b0, v = 0x55f35770e5c0, att = 0x7f4be9c4f010, logits = 0x55f3577125d0, key_cache = 0x7f4be5c4e010, value_cache = 0x7f4be1c4d010}, fd = 3,
    data = 0x7f4c09091000, file_size = 708657408}
```