Training FLOPS:
Source: Llama2 (DGXC Benchmarking) | NVIDIA NGC
The calculations for the 7b parameter model:
model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))
model flops breakdown:
attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups)/(number of attention heads) + (sequence length)/(hidden size))
mlp flops = 18 * (number of layers) * (FFN size) * (hidden size)
embedding flops = 6 * (vocab size) * (hidden size)
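A minimal Python sketch of the breakdown above; the function and parameter names are my own, the formula is the one from the DGXC Benchmarking page:

```python
def model_flops_per_sequence(seq_len, num_layers, hidden_size, ffn_size,
                             vocab_size, num_query_groups, num_attention_heads):
    # num_query_groups / num_attention_heads is 1 for standard multi-head
    # attention and < 1 for grouped-query attention (smaller K/V projections);
    # the seq_len / hidden_size term covers the attention-score and context
    # matmuls, whose cost grows with sequence length.
    attention_flops = (
        12 * num_layers * hidden_size**2
        * (1 + num_query_groups / num_attention_heads + seq_len / hidden_size)
    )
    # Gated MLP: three weight matrices of size (FFN size x hidden size),
    # at 6 FLOPs per parameter per token (forward + backward) -> factor 18.
    mlp_flops = 18 * num_layers * ffn_size * hidden_size
    # Output embedding / LM-head projection.
    embedding_flops = 6 * vocab_size * hidden_size
    return seq_len * (attention_flops + mlp_flops + embedding_flops)
```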
Llama 2 7b calculation (config: number of layers = 32, hidden size = 4096, number of attention heads = number of query groups = 32, FFN size = 11008, vocab size = 32000):
sequence length = 4096
attention flops = 12 * 32 * 4096^2 * (1 + 32/32 + 4096/4096) = 19,327,352,832
mlp flops = 18 * 32 * 11008 * 4096 = 25,971,130,368
embedding flops = 6 * 32000 * 4096 = 786,432,000
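Plugging the same Llama 2 7B numbers into plain arithmetic (equivalently, model_flops_per_sequence(4096, 32, 4096, 11008, 32000, 32, 32) from the sketch above) reproduces the component values and gives the total per-sequence model flops:

```python
# Llama 2 7B: 32 layers, hidden 4096, 32 heads (MHA, so 32 query groups),
# FFN size 11008, vocab size 32000, sequence length 4096.
seq_len, layers, hidden = 4096, 32, 4096
heads = groups = 32
ffn, vocab = 11008, 32000

attention = 12 * layers * hidden**2 * (1 + groups / heads + seq_len / hidden)
mlp = 18 * layers * ffn * hidden
embedding = 6 * vocab * hidden

print(int(attention), int(mlp), int(embedding))
# -> 19327352832 25971130368 786432000
print(int(seq_len * (attention + mlp + embedding)))
# -> 188763832320000, i.e. ~1.89e14 model flops per sequence
```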