Series contents
GLM-4 (1) - Inference + Overview
GLM-4 (2) - RoPE
GLM-4 (3) - GLMBlock
GLM-4 (4) - SelfAttention
GLM-4 (5) - API & Function Calling
GLM-4 (6) - KV Cache / Prefill & Decode
Preface
The previous two posts covered GLM-4 inference + overview and rotary position embedding (RoPE); this post looks at the model architecture and its components. As we know, today's large language models are Transformer-based, and a Transformer is in turn a stack of TransformerBlock layers. In the GLM-4 code, GLMTransformer corresponds to the Transformer part, and GLMBlock corresponds to a TransformerBlock. These two are what we will mainly look at.
1. Model Architecture Overview
A quick summary of how the components relate: ChatGLMForConditionalGeneration is the complete model used for chat; its key component is ChatGLMModel, which you can think of as a full transformer; the core of ChatGLMModel is GLMTransformer; and GLMTransformer is a stack of multiple GLMBlock layers. Seen this way, the overall architecture becomes quite clear.
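If you want to see this nesting for yourself, here is a minimal sketch (assuming the Hugging Face repo id THUDM/glm-4-9b-chat and enough memory to load the 9B weights):

import torch
from transformers import AutoModelForCausalLM

# trust_remote_code is needed because the architecture lives in the repo's modeling_chatglm.py.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", torch_dtype=torch.bfloat16, trust_remote_code=True
)

print(type(model).__name__)                                # ChatGLMForConditionalGeneration
print(type(model.transformer).__name__)                    # ChatGLMModel
print(type(model.transformer.encoder).__name__)            # GLMTransformer
print(type(model.transformer.encoder.layers[0]).__name__)  # GLMBlock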
2. ChatGLMModel & GLMTransformer
Stepping through the code with a debugger, I inspected ChatGLMModel and got the information below. It contains some configuration fields, plus an "embedding" and an "encoder"; the structure of the "encoder" part is printed as well, and it is simply a stack of 40 GLMBlock layers. The next section draws the GLMBlock architecture diagram and walks through this structure alongside the code.
{
  "base_model": ChatGLMModel,
  "base_model_prefix": "transformer",
  "config": ChatGLMConfig,
  "dtype": torch.bfloat16,
  "dummy_inputs": {'input_ids': tensor([[7, 6, 0, 0, 1],
                                        [1, 2, 3, 0, 0],
                                        [0, 0, 0, 4, 5]])},  # why these values? (they are transformers' generic DUMMY_INPUTS placeholder, not model-specific)
  "embedding": Embedding((word_embeddings): Embedding(151552, 4096)),
  "encoder": GLMTransformer(
    (layers): ModuleList(
      (0-39): 40 x GLMBlock(
        (input_layernorm): RMSNorm()
        (self_attention): SelfAttention(
          (query_key_value): Linear(in_features=4096, out_features=4608, bias=True)
          (core_attention): SdpaAttention(
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (dense): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (post_attention_layernorm): RMSNorm()
        (mlp): MLP(
          (dense_h_to_4h): Linear(in_features=4096, out_features=27392, bias=False)
          (dense_4h_to_h): Linear(in_features=13696, out_features=4096, bias=False)
        )
      )
    )
    (final_layernorm): RMSNorm()
  ),
  "is_gradient_checkpointing": False,
  "is_parallelizable": False,
  "kv_channels": 128,
  "main_input_name": "input_ids",
  "multi_query_group_num": 2,
  "name_or_path": "/home/ubuntu/Projects_ubuntu/glm-4-9b-chat",
  "num_layer": 40,
  "output_layer": Linear(in_features=4096, out_features=151552, bias=False),
  "rotary_pos_emb": RotaryEmbedding(),
  "seq_length": 131072,  # from the config: the preset context length, i.e. 128k
  "supports_gradient_checkpointing": True,
  "training": False,
}
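One shape in the dump that may look odd is query_key_value's out_features=4608. As I read it, Q is projected at full width while K and V each get only multi_query_group_num heads (multi-query attention); a quick sanity check with the numbers from the dump:

# Sanity-checking query_key_value's out_features with values from the dump above.
hidden_size = 4096            # width of Q (num_attention_heads * head_dim)
kv_channels = 128             # per-head dimension
multi_query_group_num = 2     # number of K/V heads shared by all query heads

qkv_out = hidden_size + 2 * multi_query_group_num * kv_channels
print(qkv_out)                # 4608 == out_features of query_key_value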
3. GLMBlock
GLMBlock differs slightly from a TransformerBlock. According to the glm-4-9b-chat config, apply_residual_connection_post_layernorm=False, which means the residual connection comes from before the normalization (the solid black line in the diagram); if it were set to True, the block would match the TransformerBlock (the dashed black line).
For comparison with the original Transformer, its architecture diagram is also shown here:
Next, let's go over a few more details alongside the code:
- The normalization can be either RMSNorm or LayerNorm (a minimal RMSNorm sketch follows this list);
- Multi-query attention is used; I will cover that part in a separate post later;
- In the FFN, the dimension change across the two linear layers is not strictly h -> 4h and 4h -> h (h being the hidden size); also, the activation function used here is SwiGLU;
- Dropout is not shown in the diagram above.
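The RMSNorm() entries in the dump do not show their internals; conceptually it works like this (a minimal sketch, not the repo's exact implementation):

import torch

class SimpleRMSNorm(torch.nn.Module):
    """Minimal RMSNorm sketch: scale by the reciprocal root-mean-square of the
    last dimension, then apply a learned per-channel weight. Unlike LayerNorm,
    there is no mean subtraction and no bias."""
    def __init__(self, hidden_size, eps=1e-5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)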
# From modeling_chatglm.py (ChatGLMConfig and _config_to_kwargs are defined in the same file).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(torch.nn.Module):
    """MLP.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform nonlinear transformation, and project the
    state back into h hidden dimension.
    """

    def __init__(self, config: ChatGLMConfig, device=None):
        super(MLP, self).__init__()

        self.add_bias = config.add_bias_linear

        # Project to 4h. If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf
        self.dense_h_to_4h = nn.Linear(
            config.hidden_size,
            config.ffn_hidden_size * 2,
            bias=self.add_bias,
            device=device,
            **_config_to_kwargs(config)
        )

        def swiglu(x):
            # Split the doubled projection into two halves: silu(gate) * value.
            x = torch.chunk(x, 2, dim=-1)
            return F.silu(x[0]) * x[1]

        self.activation_func = swiglu

        # Project back to h.
        self.dense_4h_to_h = nn.Linear(
            config.ffn_hidden_size,
            config.hidden_size,
            bias=self.add_bias,
            device=device,
            **_config_to_kwargs(config)
        )

    def forward(self, hidden_states):
        # [s, b, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = self.activation_func(intermediate_parallel)
        # [s, b, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        return output
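To put numbers on the FFN bullet above: with glm-4-9b-chat's hidden_size=4096 and ffn_hidden_size=13696, the flow is h -> 2*ffn (for SwiGLU) -> ffn -> h rather than h -> 4h -> h. A small standalone shape check (a toy re-implementation, not the repo's class):

import torch
import torch.nn as nn
import torch.nn.functional as F

h, ffn = 4096, 13696                         # values from the glm-4-9b-chat config
dense_h_to_4h = nn.Linear(h, ffn * 2, bias=False)
dense_4h_to_h = nn.Linear(ffn, h, bias=False)

x = torch.randn(8, 1, h)                     # [s, b, h]
up = dense_h_to_4h(x)                        # [8, 1, 27392]
gate, value = torch.chunk(up, 2, dim=-1)     # split into two 13696-wide halves
out = dense_4h_to_h(F.silu(gate) * value)    # SwiGLU, then project back to [8, 1, 4096]
print(up.shape, out.shape)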
class GLMBlock(torch.nn.Module):
    """A single transformer layer.

    Transformer layer takes input with size [s, b, h] and returns an
    output of the same size.
    """

    def __init__(self, config: ChatGLMConfig, layer_number, device=None):
        super(GLMBlock, self).__init__()
        self.layer_number = layer_number

        self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm

        self.fp32_residual_connection = config.fp32_residual_connection

        LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
        # Layernorm on the input data.
        self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                             dtype=config.torch_dtype)

        # Self attention.
        self.self_attention = SelfAttention(config, layer_number, device=device)
        self.hidden_dropout = config.hidden_dropout

        # Layernorm on the attention output
        self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                                      dtype=config.torch_dtype)

        # MLP
        self.mlp = MLP(config, device=device)

    def forward(
            self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True,
    ):
        # hidden_states: [s, b, h]

        # Layer norm at the beginning of the transformer layer.
        # Unlike the original transformer block, GLMBlock applies layernorm right at the start (pre-LN).
        layernorm_output = self.input_layernorm(hidden_states)
        # Self attention.
        attention_output, kv_cache = self.self_attention(
            layernorm_output,
            attention_mask,
            rotary_pos_emb,
            kv_cache=kv_cache,
            use_cache=use_cache
        )  # (1, 8, 4096), (1, 2, 1, 2, 8, 128)

        # Residual connection.
        if self.apply_residual_connection_post_layernorm:
            residual = layernorm_output
        else:
            residual = hidden_states

        layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
        layernorm_input = residual + layernorm_input

        # Layer norm post the self attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)

        # MLP.
        mlp_output = self.mlp(layernorm_output)

        # Second residual connection.
        if self.apply_residual_connection_post_layernorm:
            residual = layernorm_output
        else:
            residual = layernorm_input

        output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
        output = residual + output

        return output, kv_cache  # (1, 8, 4096), (1, 2, 1, 2, 8, 128)
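Condensing the forward pass above, the residual wiring for the two settings of apply_residual_connection_post_layernorm looks like this (a sketch with placeholder callables; dropout omitted for clarity):

def glm_block_dataflow(x, norm1, attn, norm2, mlp, post_ln_residual=False):
    # post_ln_residual corresponds to apply_residual_connection_post_layernorm.
    y = norm1(x)
    x = (y if post_ln_residual else x) + attn(y)  # False (default): residual is the un-normalized input
    y = norm2(x)
    x = (y if post_ln_residual else x) + mlp(y)
    return x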
Summary
In this post we analyzed the core GLM-4 components, GLMTransformer and GLMBlock, and explained some of the details. The next post will cover the attention part.