Training a GPT-2 Model with Megatron-LM

Reference: pretraining, evaluating, and running inference on a GPT-2 model from scratch with Megatron-LM - 知乎 (zhihu.com)

1. Environment setup (a painful process)

Conclusions first, because the pitfalls took forever:

GPU: Tesla P100

CUDA 11.8 (other versions would work; going lower is safer)

PyTorch 2.1.0 (other versions would work; 2.1 still has a few minor pitfalls)

Megatron-LM (tag v2.5); the newer code paths can't be used because transformer_engine has GPU-architecture requirements

Pick a suitable PyTorch container image version (don't go too new, the GPU can't handle it and it cost me a lot of extra trouble); anything that ships apex is fine.

First, cd into the Megatron-LM directory and install the dependencies: pip install -r requirements.txt

TensorFlow is not needed.

PyTorch and CUDA versions must match.

Problems encountered while installing apex:

1. CUDA and torch version mismatch

Originally the machine had CUDA 11.4 and torch 1.12+cu113 (absurdly, there is no cu114 build of torch).

Edit setup.py and delete the version-match check (a sketch of what that check compares is below),

or reinstall CUDA and torch.

I did both, but this wasn't actually what I was stuck on.
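
The check in question compares the CUDA toolkit version reported by nvcc with torch.version.cuda. The standalone sketch below runs the same comparison (illustrative only, not apex's exact setup.py code); if the two differ, apex's setup.py raises, and the workaround is to remove that raise or align the versions:

# what the apex version check compares (approximately): system CUDA toolkit vs. torch's CUDA build
import subprocess
import torch

nvcc_out = subprocess.check_output(["nvcc", "--version"], text=True)
bare_metal = nvcc_out.split("release ")[-1].split(",")[0]      # e.g. "11.4"
torch_cuda = ".".join(torch.version.cuda.split(".")[:2])       # e.g. "11.3"
print(bare_metal, torch_cuda, "match" if bare_metal == torch_cuda else "mismatch -> apex raises")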

2. The C++ files won't compile! Why?! (gave up, never solved)

 from /root/yjy/Megatron-LM/apex/csrc/flatten_unflatten.cpp:1:
  /usr/include/c++/9/cwchar:44:10: fatal error: wchar.h: No such file or directory

ChatGPT told me to run:

sudo apt-get update
sudo apt-get install libc6-dev

But that errored out again, and I didn't dare touch linux-headers-5.4.0-165.

Gave up and switched to a container image instead!

Setting up the environment from a container image (another winding road of version swaps):

 PyTorch Release 21.05 - NVIDIA Docs

Pull the image and pick a suitable version (not too new, or it won't be compatible; this cost me extra trouble); anything that ships apex is fine.

docker run -dt --name pytorch_yjy --restart=always --gpus all \
--network=host \
--shm-size 8G \
-v /mnt/VMSTORE/yjy/Megatron-LM-GPT:/Megatron-LM-GPT \
-w /Megatron-LM-GPT \
nvcr.io/nvidia/pytorch:23.04-py3 \
/bin/bash
docker exec -it pytorch_yjy bash
Missing amp_C

apex installed successfully inside the container, but training errored out again: amp_C is missing!!!

Fix 1 (don't use this; more errors show up later):

This apex revision worked:

NVIDIA/apex at 3303b3e7174383312a3468ef390060c26e640cb1 (github.com)

Megatron-LLaMA/megatron/model/fused_layer_norm.py

But it then complains about missing _six, missing inf, and similar errors; those are easy to patch by hand.

Fix 2:

Install apex with python setup.py install, not pip.
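
For the compiled extensions (amp_C and the fused CUDA kernels) to be built at all, the extension flags have to be passed to setup.py. The command below follows the install line documented in apex's README for these older revisions; treat the exact flags as version-dependent:

cd apex
python setup.py install --cpp_ext --cuda_ext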

Other errors:

RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.

Related CSDN posts: "安装apex报错 / install the apex with cuda support" and "ModuleNotFoundError: No module named 'fused_layer_norm_cuda'"

Add the CUDA environment variables:

export CUDA_HOME=/usr/local/cuda-11.3
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
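
To check whether apex's compiled CUDA extensions actually made it into the environment (and not just the pure-Python parts), a quick probe like this helps; the module names are the ones the errors above complain about:

# sanity check: are apex's compiled CUDA extensions importable?
import importlib

for mod in ("amp_C", "fused_layer_norm_cuda", "fused_weight_gradient_mlp_cuda"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as err:
        print(f"{mod}: MISSING ({err})")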

Error: the 21.05 image does not ship transformer_engine.

Then it started segfaulting, and I had absolutely no idea why.

Switched to 22.10: no more segfault, but te.pytorch.DotProductAttention is missing!!! Installing transformer_engine threw yet more errors. My fault for not reading the docs: that API only exists in transformer_engine 0.6 and above.

Running in the 23.04 image also errors out, and the reason is simply that the CUDA, GPU, and torch versions don't match:

import torch

# Is CUDA available?
torch.cuda.is_available()
# Number of GPUs
torch.cuda.device_count()
# GPU name (device index starts at 0)
torch.cuda.get_device_name(0)
# Current device index
torch.cuda.current_device()
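
It also helps to check which CUDA build the installed torch wheel was compiled against, since that is exactly what has to line up with the container's toolkit; these are standard torch attributes:

import torch

print(torch.__version__)                # e.g. 2.1.0+cu118
print(torch.version.cuda)               # CUDA version torch was built against
print(torch.backends.cudnn.version())   # bundled cuDNN build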

Attempt 1: uninstall torch and reinstall it, but that broke transformer_engine (it could no longer link against the new torch).

Attempt 2: drop transformer_engine entirely.

vim megatron/core/transformer/custom_layers/transformer_engine.py

No good, it is still needed. Delete the container and start over, back to square one with the same error.

WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication

Check the GPU's compute capability (reference: the CSDN post 查看NVIDIA显卡计算能力).

 

The root cause: the GPU's compute capability doesn't match what this CUDA/transformer_engine stack expects.
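
The capability can be read straight from torch (a standard API). The Tesla P100 reports (6, 0), i.e. Pascal/sm_60, while transformer_engine requires a considerably newer GPU architecture, which is why it can't be used on this card:

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")   # Tesla P100 -> sm_60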

How do you layer a PyTorch install on top of a CUDA-only image?

In the end I reinstalled CUDA 11.8 inside the 23.04 image,

because an image that only has the CUDA environment is too bare; you have to reinstall far too much yourself!!

Then reinstalled torch.

But transformer_engine wouldn't reinstall, and while digging into why, I finally discovered that the Tesla P100 simply can't run it at all!!!!

 

In the end I went with Megatron-LM tag v2.5, which doesn't use transformer_engine at all!

2. Preparing the dataset

Dataset download link:

MEGA

Install the dependencies:

pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Go into the openwebtext folder, put the downloaded URL lists into the urls folder, then deduplicate and filter them:

python3 blacklist_urls.py urls clean_urls.txt

Download the page contents from the URLs with this tool:

yet-another-account/openwebtext at dependabot/pip/certifi-2022.12.7 (github.com)

python3 download.py clean_urls.txt --n_procs=15 --timeout=15 --output_dir scraped

Then it errored out:

Traceback (most recent call last):
  File "download.py", line 307, in <module>
    cdata = list(pool.imap(download, chunk, chunksize=1))
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
TypeError: 'ExtractResult' object is not iterable

Fix: the tldextract version was wrong; switching versions made it run.

The problem shows up when tldextract is used to extract the domain from a URL. The error message says the result of tldextract.extract(url) is not iterable, most likely because a library update changed how the result is meant to be consumed.
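
Besides pinning an older tldextract, the calling code can be made version-agnostic by reading the named attributes of ExtractResult instead of unpacking it by iteration; a small sketch (the commented-out line mirrors the style of code that breaks):

import tldextract

url = "https://blog.example.co.uk/post/123"

# old style that breaks on newer tldextract (ExtractResult is no longer iterable):
#   subdomain, domain, suffix = tldextract.extract(url)

# attribute access works across versions:
ext = tldextract.extract(url)
print(ext.subdomain, f"{ext.domain}.{ext.suffix}")   # blog example.co.uk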

 

After the download finishes, a scraped folder is created, and the text fetched from each URL is saved under its data subfolder.

Use merge_jsons.py to merge all the txt files in that folder into a single JSON file:

python3 tools/openwebtext/merge_jsons.py --data_path=scraped/data --output_file=data/merged_output.json

Data cleanup:

python3 cleanup_dataset.py tools/openwebtext/merged_output.json data/merged_cleand.json

Error: tokenizer not found.

I spent ages assuming it couldn't find the tokenizer inside megatron, or that the pip tokenizer package was missing; it turned out that a tokenizer.py file is simply missing from the repo.

Here is tokenizer.py:

# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
sys.path.append('..')

from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer


class Tokenizer:

    def __init__(self, cache_dir=None):
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
                                                       cache_dir=cache_dir)
        self.tokenizer.max_len = int(1e12)
        self.eod_token = self.tokenizer.encoder['<|endoftext|>']
        assert self.eod_token < 65535, 'vocab size will not fit in uint16'
        print('> GPT2 tokenizer with {} vocab size and eod token {} ...'.format(
            len(self.tokenizer.encoder), self.eod_token))

    def tokenize_document(self, document):
        tokens = self.tokenizer.encode(document)
        tokens.append(self.eod_token)
        return tokens
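
A quick way to sanity-check the file, assuming it is saved next to cleanup_dataset.py so the relative sys.path entry resolves to the repo root (hypothetical usage):

from tokenizer import Tokenizer

tok = Tokenizer()
ids = tok.tokenize_document("Hello Megatron-LM.")
print(len(ids), ids[-1] == tok.eod_token)   # the last appended id is <|endoftext|>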

 fix issue #33 missing modules by hyoo · Pull Request #89 · NVIDIA/Megatron-LM (github.com)

 cannot import name 'cached_path' from 'transformers'

cannot import name 'cached_path' from 'transformers' · Issue #1475 · ThilinaRajapakse/simpletransformers (github.com)

Shuffle the cleaned dataset:

shuf data/merged_cleand.json -o data/train_data.json

Data preprocessing:

python tools/preprocess_data.py \
       --input data/train_data_half.json \
       --output-prefix data/my-gpt2_half \
       --vocab-file model/gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file model/gpt2-merges.txt \
       --append-eod \
       --workers 20 

The output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. For GPT-2 training, pass the name without the extension as --data-path.

That completes the data processing!

#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output

# Model hyperparameters
GPT_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 1 \
    --global-batch-size 2 \
    --lr 0.00015 \
    --train-iters 5000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16
"

# Dataset and vocab path arguments
DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-impl mmap \
    --split 700,200,100
"

# Checkpoint saving, evaluation, and logging arguments
OUTPUT_ARGS="
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10
"

# Launch the training job
torchrun pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --save $MODEL_PATH \
    --load $CHECKPOINT_PATH

3. Model training

Check GPU utilization:

watch -n 5 nvidia-smi
python -m gpt2 train --train_corpus           data/wikitext-103-raw/wiki.train.raw \
                       --eval_corpus            data/wikitext-103-raw/wiki.test.raw \
                       --vocab_path             build/vocab.txt \
                       --save_checkpoint_path   ckpt-gpt2.pth \
                       --save_model_path        gpt2-pretrained.pth \
                       --batch_train            128 \
                       --batch_eval             128 \
                       --seq_len                64 \
                       --total_steps            1000000 \
                       --eval_steps             500 \
                       --save_steps             5000 \
                       --use_amp \
                       --use_grad_ckpt \
                       --gpus 4

Single node, single GPU:


CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 4 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python pretrain_gpt.py \
       $GPT_ARGS \
       $OUTPUT_ARGS \
       --save $MODEL_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

 

 Distributed training:

DP:
WORLD_SIZE=4
TENSOR_MP_SIZE=1
PIPELINE_MP_SIZE=1

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output/mp

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 2 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl torch
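
On newer PyTorch, torch.distributed.launch is deprecated in favor of torchrun; if the launcher complains, the same job can be started like this (identical arguments, only the launcher changes):

torchrun $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl torch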

 

 PP:
WORLD_SIZE=4
TENSOR_MP_SIZE=1
PIPELINE_MP_SIZE=4

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output/mp

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 2 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl local

 

TP:
WORLD_SIZE=4
TENSOR_MP_SIZE=4
PIPELINE_MP_SIZE=1

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output/tp

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 2 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl local

 

 MP:380M
WORLD_SIZE=4
TENSOR_MP_SIZE=2
PIPELINE_MP_SIZE=2

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output/mp

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 2 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl local

 

 

 MP: 1.6B (the hyperparameters are not quite right)
WORLD_SIZE=4
TENSOR_MP_SIZE=2
PIPELINE_MP_SIZE=2

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=model/model_optim_rng.pt
VOCAB_FILE=model/gpt2-vocab.json
MERGE_FILE=model/gpt2-merges.txt
DATA_PATH=data/my-gpt2_text_document
MODEL_PATH=model/output/mp

GPT_ARGS="--num-layers 48 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1600 \
          --micro-batch-size 16 \
          --global-batch-size 64 \
          --lr 0.00015 \
          --train-iters 5000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
                $GPT_ARGS \
                $OUTPUT_ARGS \
                --save $MODEL_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl local
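
For comparison (the heading above already notes these hyperparameters are off): the published GPT-2 1.5B (XL) sizing is 48 layers, hidden size 1600, 25 attention heads, and a 1024-token context, so a config actually aiming at ~1.5B would change roughly these arguments and keep the rest as in the script above:

# GPT-2 1.5B (XL) sizing, for reference
#   --num-layers 48
#   --hidden-size 1600
#   --num-attention-heads 25
#   --seq-length 1024
#   --max-position-embeddings 1024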

Training LLaMA-7B:

The wiki dataset's XML has already been converted to JSON, but how do you get the text from the URLs??

Related CSDN posts:

Wikidata 数据包下载+格式转换+入库MySQL

快速使用wikiextractor提取维基百科语料的简单用法

wiki中文文本语料下载并处理 ubuntu + python2.7

In the end I still used the openwebtext dataset.

python Megatron-LLaMA/tools/preprocess_data.py \
       --input data/train_halfdata.json \
       --output-prefix data/openwebtexthalf \
       --dataset-impl mmap \
       --tokenizer-type PretrainedFromHF \
       --tokenizer-name-or-path llama7B_hf \
       --append-eod \
       --workers 20 \
       --chunk-size 25

Convert the Hugging Face checkpoint into one Megatron can load. However, these cards can't fit the full 7B, and once I cut the hyperparameters down, the converted checkpoint no longer matched the smaller model.

python Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py \
--load_path "llama7B_hf" \
--save_path "llama7B_hf_ab" \
--target_tensor_model_parallel_size 2 \
--target_pipeline_model_parallel_size 4 \
--target_data_parallel_size 1 \
--target_params_dtype "fp16" \
--make_vocab_size_divisible_by 1 \
--print-checkpoint-structure \
--megatron-path "Megatron-LLaMA"

This converts the original LLaMA weights and tokenizer.model (downloaded from Meta) into Hugging Face format; the convert_llama_weights_to_hf.py script comes from the transformers repo:

python convert_llama_weights_to_hf.py \
    --input_dir llama7B \
    --model_size 7B \
    --output_dir llama7B_hf

After the conversion succeeds:

#!/bin/bash


DATASET="data/openwebtexthalf"

TP_SIZE=2
PP_SIZE=4
WORLD_SIZE=8
MICRO_BATCH_SIZE=1
# The int is the number of micro steps of gradient accumulation
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
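# With WORLD_SIZE=8, TP_SIZE=2 and PP_SIZE=4 the data-parallel size is 8/(2*4) = 1,
# so GLOBAL_BATCH_SIZE = 1 * MICRO_BATCH_SIZE * 8 = 8 sequences per optimizer step.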
# GLOBAL_BATCH_SIZE=128

JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

LOAD_CHECKPOINT_PATH="llama7B_hf_ab/"
SAVE_CHECKPOINT_PATH="model/llama-7/"
TOKENIZER_PATH="llama7B_hf_ab/"
TENSORBOARD_DIR="model/tensorboard/"

TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1

# Setting --tensorboard-queue-size to 1 significantly slows down the training
options=" \
    --finetune \
    --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
    --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 4096 \
        --max-position-embeddings 4096 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008\
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
    --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
    --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
    --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
        --recompute-granularity selective \
    "

torchrun --nproc_per_node=8 --master_port=6000 Megatron-LLaMA/pretrain_llama.py ${options}

 

It finally runs! The earlier failures were because Megatron-LLaMA's fused_kernels use something that made the compilation fail, and I never figured out why.

Approximate parameter count: hidden_size * ((4 * hidden_size + 3 * ffn_hidden_size) * num_layers + vocab_size + max_seq_len)
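
As a sanity check, plugging the LLaMA-7B numbers from the script above into this formula (and assuming the standard 32k LLaMA vocabulary; norm and bias parameters are ignored, so it is only an estimate) lands close to the advertised 7B:

# rough parameter-count estimate using the formula above, with LLaMA-7B sizes
hidden, ffn, layers, vocab, seq = 4096, 11008, 32, 32000, 4096

params = hidden * ((4 * hidden + 3 * ffn) * layers + vocab + seq)
print(f"{params / 1e9:.2f}B parameters")   # about 6.62B, roughly the advertised 7B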

 
