PyTorch Deep Learning Framework: Methods and Examples for Training Different Networks

Contents

Classification-bench
  Run
    Single GPU
    Single node, multiple GPUs
    Distributed multi-GPU
  References
Classification-Acc
  Run examples
    fp32
    fp16
  References
Objection-Faster-rcnn
  Run
    Single GPU
    Single node, multiple GPUs
    Multiple nodes, multiple GPUs
  References
Objection-MaskRCNN
  Run commands
    Single GPU
    Multi-GPU
  References
Objection-SSD
  Run
    Install dependencies
    Download the dataset
    Run the training scripts
      Single node, single GPU (FP32)
      Single node, multiple GPUs (FP32)
      Multiple nodes, multiple GPUs (FP32)
      Single node, single GPU (FP16)
      Single node, multiple GPUs (FP16)
      Multiple nodes, multiple GPUs (FP16)
  Dataset
    Publication/Attribution
    Training and test data separation
  Model
  Evaluation metrics
  References
Objection-YOLOv3
  Test procedure
    Data preprocessing
    Download the pretrained model
  Run examples
    Training
      Single GPU
      Multi-GPU
    Inference
    Detection
  References
NAS-darts
  Test procedure
    Prepare the data
    Pretrained models
  Run commands
    Architecture search
    Format conversion
    Architecture evaluation
  References
NLP-bert
  Run examples
    pre-train phase 1
      Single GPU
      Multi-GPU
    pre-train phase 2
      Single GPU
      Multi-GPU
    fine-tune training
      Single GPU
      Multi-GPU
  References
NLP-gnmt
  Run
    Install dependencies
    Download the dataset
    Preprocessing
    Single node, single GPU
    Single node, multiple GPUs
    Multiple nodes, multiple GPUs
  Model
    Publication/Attribution
    Structure
    Loss function
    Optimizer
    Learning rate schedule
  Evaluation
    Quality metric
    Quality target
    Evaluation frequency
    Evaluation thoroughness
Recommendation
  Test procedure
    Data processing
    Run commands
  References


Classification-bench

This test case is used for performance testing of PyTorch classification models.

  • The script supports PyTorch's nccl and gloo distributed communication backends.

Run

Single GPU
python3 `pwd`/main_bench.py --batch-size=64 --a=resnet50 -j 24 --epochs=1 --synthetic /path/to/any/existing/folder
Single node, multiple GPUs
mpirun -np 4  --bind-to none `pwd`/single_process.sh localhost inception_v3 64
Distributed multi-GPU
mpirun -np $np --hostfile hostfile --bind-to none `pwd`/single_process.sh $dist_url resnet50 64

Example hostfile format:

node1 slots=4  
node2 slots=4

References

examples/imagenet at main · pytorch/examples · GitHub

Classification-Acc

This test case is used for ResNet50 accuracy validation; the single-GPU run commands are as follows.

Run examples

fp32
python3 main_acc.py --batch-size=64 --arch=resnet50 -j 6 --epochs=90 --save-path=/path/to/{save_model_dir} /path/to/{ImageNet_pytorch_data_dir}/
fp16
python3 main_acc.py --batch-size=64 --arch=resnet50 -j 6 --epochs=90 --amp --opt-level O1 --loss-scale=dynamic --save-path=/path/to/{save_model_dir} /path/to/{ImageNet_pytorch_data_dir}/

References

examples/imagenet at main · pytorch/examples · GitHub

Objection-Faster-rcnn

This test case is used for testing the PyTorch object detection model Faster R-CNN.

Run

  • The get_dataset function in train.py must be adapted to the actual dataset: the location of the annotation json files, the number of classes, and so on; see the sketch below.
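A hedged sketch of what that adaptation typically looks like; the COCO-style layout, the get_coco builder (the reference detection code's own helper), and the class count are placeholders to adjust to your own data:

from coco_utils import get_coco  # helper shipped alongside train.py in the reference code

def get_dataset(name, image_set, transform, data_path):
    # Map the dataset name to (root path, dataset builder, number of classes).
    # Point data_path at the folder holding your images and annotation json files,
    # and set the class count to your own number of categories (COCO uses 91).
    paths = {
        "coco": (data_path, get_coco, 91),
    }
    p, ds_fn, num_classes = paths[name]
    ds = ds_fn(p, image_set=image_set, transforms=transform)
    return ds, num_classes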
Single GPU
python3 train.py  --batch-size=2 -j 8 --epochs=26 --data-path=/path/to/datasets/folder --output-dir=/path/to/result/save/folder
Single node, multiple GPUs
mpirun -np 4 --hostfile hostfile --bind-to none `pwd`/single_process.sh localhost
Multiple nodes, multiple GPUs
mpirun -np $np --hostfile hostfile --bind-to none `pwd`/single_process.sh ${master_ip}

References

vision/references/detection at main · pytorch/vision · GitHub

Objection-MaskRCNN

This test case is used for testing the PyTorch object detection model Mask R-CNN.

Run commands

Single GPU
python3 train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 \
     --lr-steps 16 22 --aspect-ratio-group-factor 3 \
     --data-path /path/to/{COCO2017_data_dir}  

If the download of "https://download.pytorch.org/models/resnet50-19c8e357.pth" to .cache/torch/checkpoints/resnet50-19c8e357.pth fails, download resnet50-19c8e357.pth in advance and copy it into .cache/torch/checkpoints/.
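For example (assuming the default torch cache location under the home directory):

wget https://download.pytorch.org/models/resnet50-19c8e357.pth
mkdir -p ~/.cache/torch/checkpoints
cp resnet50-19c8e357.pth ~/.cache/torch/checkpoints/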

Multi-GPU
python3 -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.005 --data-path /path/to/{COCO2017_data_dir} > train_2gpu_lr0.005.log 2>&1 &

Note: for multi-GPU runs, the learning rate scales with the number of GPUs as 0.02/8*$NGPU, e.g. lr_4gpu=0.01, lr_2gpu=0.005, lr_1gpu=0.0025.
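The value can be computed directly from the GPU count, for example (NGPU here is a placeholder for however many GPUs you launch with):

NGPU=2
LR=$(python3 -c "print(0.02 / 8 * ${NGPU})")
echo "lr for ${NGPU} GPUs: ${LR}"   # 2 GPUs -> 0.005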

References

vision/references/detection at main · pytorch/vision · GitHub

Objection-SSD

This script is a functional test case for the object detection model SSD_ResNet34, based on the MLPerf reference implementation. When the mAP reaches 0.23 the model is considered converged and the job finishes successfully.

Run

Install dependencies
Cython==0.28.4
mlperf-compliance==0.0.10
cycler==0.10.0
kiwisolver==1.0.1
matplotlib==2.2.2
numpy==1.14.5
Pillow==5.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.5
six==1.11.0
torchvision(if installed, ignore it)
apex(if installed, ignore it)
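The pinned versions above can be installed in one command, for example (skip torchvision and apex if they are already installed):

pip3 install Cython==0.28.4 mlperf-compliance==0.0.10 cycler==0.10.0 kiwisolver==1.0.1 matplotlib==2.2.2 numpy==1.14.5 Pillow==5.2.0 pyparsing==2.2.0 python-dateutil==2.7.3 pytz==2018.5 six==1.11.0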
Download the dataset
bash download_dataset.sh
Run the training scripts
  • config_singlenode.sh sets up the single-node environment and system hyperparameters; modify it as needed.
  • config_multinode.sh sets up the multi-node environment and system hyperparameters; modify it as needed.
Single node, single GPU (FP32)
python3 train_fp32.py \
                  --epochs "${NUMEPOCHS}" \
                  --warmup-factor 0 \
                  --lr "${LR}" \
                  --no-save \
                  --threshold=0.23 \
                  --data ${DATASET_DIR} \
                  --batch-size ${BATCH_SIZE} \
                  --warmup ${WARMUP}
Single node, multiple GPUs (FP32)
python3 -m bind_launch --nsockets_per_node ${NSOCKET} \
                  --ncores_per_socket ${SOCKETCORES} \
                  --nproc_per_node ${NTASKS_PER_NODE} \
                  --no_hyperthreads \
                  --no_membind \
                  train_fp32.py \
                  --epochs "${NUMEPOCHS}" \
                  --warmup-factor 0 \
                  --lr "${LR}" \
                  --no-save \
                  --threshold=0.23 \
                  --data ${DATASET_DIR} \
                  --batch-size ${BATCH_SIZE} \
                  --warmup ${WARMUP}
  • See the job submission script run_fp32_single.sh for reference.
Multiple nodes, multiple GPUs (FP32)
sh run_fp32_multi.sh
  • See the run_fp32_multi.sh script; the hostfile format is as follows:
  node1 slots=4  
  node2 slots=4
Single node, single GPU (FP16)
python3  train_fp16.py \
                  --epochs "${NUMEPOCHS}" \
                  --warmup-factor 0 \
                  --lr "${LR}" \
                  --no-save \
                  --threshold=0.23 \
                  --data ${DATASET_DIR} \
                  --opt-level O3 --loss-scale="dynamic" --keep-batchnorm-fp32 True \
                  --batch-size 180 \
                  --warmup ${WARMUP}
Single node, multiple GPUs (FP16)
python3 -m bind_launch --nsockets_per_node ${NSOCKET} \
                  --ncores_per_socket ${SOCKETCORES} \
                  --nproc_per_node ${NTASKS_PER_NODE} \
                  --no_hyperthreads \
                  --no_membind \
                  train_fp16.py \
                  --epochs "${NUMEPOCHS}" \
                  --warmup-factor 0 \
                  --lr "${LR}" \
                  --no-save \
                  --threshold=0.23 \
                  --data ${DATASET_DIR} \
                  --opt-level O3 --loss-scale="dynamic" --keep-batchnorm-fp32 True \
                  --batch-size 180 \
                  --warmup ${WARMUP}
  • See the job submission script run_fp16_single.sh for reference.
Multiple nodes, multiple GPUs (FP16)
sh run_fp16_multi.sh
  • Similarly, the hostfile setup follows the format shown above.

Dataset

Publication/Attribution

Microsoft COCO: Common Objects in Context. 2017.

Training and test data separation

Train on 2017 COCO train data set, compute mAP on 2017 COCO val data set.

Model

Publication/Attribution

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. In the Proceedings of the European Conference on Computer Vision (ECCV), 2016.

Backbone is ResNet34 pretrained on ILSVRC 2012 (from torchvision). Modifications to the backbone network: remove the conv_5x residual blocks, change the first 3x3 convolution of the conv_4x block from stride 2 to stride 1 (this increases the resolution of the feature map to which the detector heads are attached), and attach all 6 detector heads to the output of the last conv_4x residual block. Thus detections are attached to 38x38, 19x19, 10x10, 5x5, 3x3, and 1x1 feature maps.

Evaluation metrics

Quality metric

Metric is COCO box mAP (averaged over IoU of 0.5:0.95), computed over 2017 COCO val data.

Quality target

mAP of 0.23

Evaluation frequency

Evaluation thoroughness

All the images in COCO 2017 val data set.

References

training/single_stage_detector/ssd at master · mlcommons/training · GitHub

Objection-YOLOv3

This test case measures the training performance, inference performance, and detection accuracy of the YOLOv3 object detection model under the PyTorch framework on the ROCm platform. The test procedure is as follows.

Test procedure

Data preprocessing

Before running this test case, the COCO data must be converted into the format expected by the YOLOv3 model, i.e. the annotation json files in the dataset are converted into labels. The procedure is as follows:
1. Download the coco-to-yolo tool
git clone Bitbucket
cd coco-to-yolo
2. Download cocotoyolo.jar
wget http://commecica.com/wp-content/uploads/2018/07/cocotoyolo.jar
3. Convert the format
(1) train json to label
java -jar cocotoyolo.jar "/path/to/{COCO2017_data_dir}/annotations/instances_train2017.json" "/path/to/{COCO2017_data_dir}/images/train2017" "all" "coco/yolo/"
(2) val json to label
java -jar cocotoyolo.jar "/path/to/{COCO2017_data_dir}/annotations/instances_val2017.json" "/path/to/{COCO2017_data_dir}/images/val2017" "all" "coco/yolo/"
4. Step 3 generates coco/yolo/*.txt files. The list.txt file must be renamed to train2017.txt and val2017.txt respectively; the remaining txt files are the labels for the corresponding images in images/train2017 and images/val2017. See the renaming example below.
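A hedged example of the renaming (it assumes the generated list.txt is renamed right after each cocotoyolo.jar run, since both runs write into coco/yolo/):

mv coco/yolo/list.txt coco/yolo/train2017.txt   # after converting instances_train2017.json
mv coco/yolo/list.txt coco/yolo/val2017.txt     # after converting instances_val2017.json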

Download the pretrained model

Download link
https://drive.google.com/drive/folders/1LezFG5g3BCW6iYaV89B2i64cqEUZD7e0
After downloading, place it in the weights directory.

Run examples

Training
Single GPU
python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --data data/coco2017.data --batch 32 --accum 2 --device 0  

Before running, confirm the train2017.txt and val2017.txt data paths referenced in coco2017.data; a hedged example of the file is shown below.
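For reference, coco2017.data follows the usual ultralytics *.data layout; the paths here are placeholders for your own locations:

classes=80
train=/path/to/coco/yolo/train2017.txt
valid=/path/to/coco/yolo/val2017.txt
names=data/coco.names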

Multi-GPU
python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --data data/coco2017.data --batch 64 --accum 1 --device 0,1
Inference
python3 test.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --task benchmark --augment --device 1  

After the run completes, benchmark.txt and benchmark_yolov3.log are generated. benchmark.txt records the mAP@0.5:0.95 and mAP@0.5 values for 5 input image sizes and 2 IoU thresholds, and benchmark_yolov3.log records the inference/NMS/total time for each image.

Detection

detect.py is the practical application of the YOLOv3 model: it takes a specified image, detects the objects in it, and lets you inspect the accuracy. Run it with: python3 detect.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt
After the run completes, images annotated with detection boxes are generated.

References

GitHub - ultralytics/yolov3: YOLOv3 in PyTorch > ONNX > CoreML > TFLite

NAS-darts

This test case covers the DARTS algorithm from the neural architecture search (NAS) domain under the PyTorch framework on the ROCm platform. It consists of two parts, architecture search and architecture evaluation; the test procedure is as follows.

Test procedure

Prepare the data

CIFAR-10 is used as the example here; the PTB and ImageNet datasets can also be downloaded.

Pretrained models

Training can optionally start from existing pretrained models; download links:
CIFAR-10
PTB
ImageNet

Run commands
Architecture search
python3 cnn/train_search.py --batch_size 100  

After the run finishes, a ./search*/log.txt file is generated in the current directory; the architecture format is as follows:

genotype = Genotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_5x5', 1), ('dil_conv_5x5', 3), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('skip_connect', 0), ('skip_connect', 1), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2)], reduce_concat=range(2, 6))
Format conversion

The genotype obtained above must be converted to protobuf format with the nasnet/protoc tool, as follows:

cd nasnet/protoc  

Modify the main() function in util.py and fill the architecture description into LegacyGenotype():

def main():
    PDARTS = LegacyGenotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_5x5', 1), ('dil_conv_5x5', 3), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('skip_connect', 0), ('skip_connect', 1), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2)], reduce_concat=range(3, 6))
    new_PDARTS = convert_legacy_format_to_protobuf(PDARTS)
    save_genotype_to_file('pdarts.txt', new_PDARTS)

Run the following command to generate the pdarts.txt file:

python3 main.py
Architecture evaluation

Run example

cd evaluation

./evaluate.sh {node_name} 1 0 /path/to/pdarts.txt /path/to/{save_dir}

References

GitHub - quark0/darts: Differentiable architecture search for convolutional and recurrent networks

NLP-bert

This test case runs the BERT network with the PyTorch framework.

  • BERT training comes in two flavors, pre-train and fine-tune; pre-training is split into two phases.

  • BERT inference accuracy can be validated on different datasets.

  • See [README.md] for details on data generation and model conversion.

Run examples

Code examples are currently provided for the two pre-training phases on the English Wikipedia dataset and for fine-tune training on the SQuAD dataset.

pre-train phase 1
Parameters used in the commands below:
  • PATH_PHRASE1: path to the phase-1 training dataset, e.g. /workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
  • OUTPUT_DIR: output path, e.g. /workspace/results
  • PATH_CONFIG: config path, e.g. /workspace/bert_large_uncased
  • PATH_PHRASE2: path to the phase-2 training dataset, e.g. /workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
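For example, these can be exported before launching, using the example values above (adjust to your own paths):

export PATH_PHRASE1=/workspace/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export PATH_PHRASE2=/workspace/lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10
export OUTPUT_DIR=/workspace/results
export PATH_CONFIG=/workspace/bert_large_uncased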
Single GPU
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py  \
    --input_dir=${PATH_PHRASE1}    \
    --output_dir=${OUTPUT_DIR}/checkpoints1 \
    --config_file=${PATH_CONFIG}/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --json-summary dllogger.json
Multi-GPU
  • Method 1
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py  \
    --input_dir=${PATH_PHRASE1}    \
    --output_dir=${OUTPUT_DIR}/checkpoints \
    --config_file=${PATH_CONFIG}/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=16 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=100000 \
    --warmup_proportion=0.0 \
    --num_steps_per_checkpoint=20000 \
    --learning_rate=4.0e-4 \
    --seed=12439 \
    --gradient_accumulation_steps=1 \
    --allreduce_post_accumulation \
    --do_train \
    --json-summary dllogger.json
  • Method 2

hostfile:

node1 slots=4
node2 slots=4
#scripts/run_pretrain.sh defaults to four GPUs per node
cd scripts; bash run_pretrain.sh
pre-train phase 2
Single GPU
export HIP_VISIBLE_DEVICES=0
python3 run_pretraining_v1.py \
   --input_dir=${PATH_PHRASE2} \
   --output_dir=${OUTPUT_DIR}/checkpoints2 \
   --config_file=${PATH_CONFIG}/bert_config.json \
   --bert_model=bert-large-uncased \
   --train_batch_size=4 \
   --max_seq_length=512 \
   --max_predictions_per_seq=80 \
   --max_steps=400000 \
   --warmup_proportion=0.128 \
   --num_steps_per_checkpoint=200000 \
   --learning_rate=4e-3 \
   --seed=12439 \
   --gradient_accumulation_steps=1 \
   --allreduce_post_accumulation \
   --do_train \
   --phase2 \
   --phase1_end_step=0 \
   --json-summary dllogger.json
Multi-GPU
  • Method 1
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 run_pretraining_v1.py \
   --input_dir=${PATH_PHRASE2} \
   --output_dir=${OUTPUT_DIR}/checkpoints2 \
   --config_file=${PATH_CONFIG}/bert_config.json \
   --bert_model=bert-large-uncased \
   --train_batch_size=4 \
   --max_seq_length=512 \
   --max_predictions_per_seq=80 \
   --max_steps=400000 \
   --warmup_proportion=0.128 \
   --num_steps_per_checkpoint=200000 \
   --learning_rate=4e-3 \
   --seed=12439 \
   --gradient_accumulation_steps=1 \
   --allreduce_post_accumulation \
   --do_train \
   --phase2 \
   --phase1_end_step=0 \
   --json-summary dllogger.json
  • Method 2

hostfile:

node1 slots=4
node2 slots=4
#scripts/run_pretrain2.sh defaults to four GPUs per node
cd scripts; bash run_pretrain2.sh
fine-tune training
Single GPU
python3 run_squad_v1.py \
  --train_file squad/v1.1/train-v1.1.json \
  --init_checkpoint model.ckpt-28252.pt \
  --vocab_file vocab.txt \
  --output_dir SQuAD \
  --config_file bert_config.json \
  --bert_model=bert-large-uncased \
  --do_train \
  --train_batch_size 1 \
  --gpus_per_node 1
Multi-GPU

hostfile:

node1 slots=4
node2 slots=4
#scripts/run_squad_1.sh defaults to four GPUs per node
bash run_squad_1.sh

References

training_results_v0.7/NVIDIA/benchmarks/bert/implementations/pytorch at master · mlperf/training_results_v0.7 · GitHub DeepLearningExamples/PyTorch/LanguageModeling/BERT at master · NVIDIA/DeepLearningExamples · GitHub

NLP-gnmt

This script is a functional test case for the GNMT model from the NLP domain, based on the MLPerf reference implementation. When the target-bleu metric reaches 24.0 the model is considered converged and the job finishes successfully.

Run

Install dependencies
pip install sacrebleu==1.2.10
pip3 install --no-cache-dir https://github.com/mlperf/logging/archive/9ea0afa.zip
apex
GPU-related dependencies in seq2seq: CC=hipcc CXX=hipcc python3 setup.py install
Download the dataset
bash scripts/wmt16_en_de.sh
  • For a more detailed description of the dataset, see section 3 of README_orgin.md.
Preprocessing
python3 preprocess_data.py --dataset-dir /path/to/download/wmt16_de_en/ --preproc-data-dir /path/to/save/preprocess/data --max-length-train "75" --math fp32
Single node, single GPU
HIP_VISIBLE_DEVICES=0 python3 train.py \
    --save ${RESULTS_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --preproc-data-dir ${PREPROC_DATADIR}/${MAX_SEQ_LEN} \
    --target-bleu $TARGET \
    --epochs "${NUMEPOCHS}" \
    --math ${MATH} \
    --max-length-train ${MAX_SEQ_LEN} \
    --print-freq 10 \
    --train-batch-size $TRAIN_BATCH_SIZE \
    --test-batch-size $TEST_BATCH_SIZE \
    --optimizer Adam \
    --lr $LR \
    --warmup-steps $WARMUP_STEPS \
    --remain-steps $REMAIN_STEPS \
    --decay-interval $DECAY_INTERVAL \
    --no-log-all-ranks
  • See run_fp32_singleCard.sh for reference.
Single node, multiple GPUs
bash run_fp32_node.sh
  • See run_fp32_node.sh for reference.
Multiple nodes, multiple GPUs
bash run_fp32_multi.sh

Model

Publication/Attribution

The implemented model is similar to the one from Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper.

The most important difference is in the attention mechanism. This repository implements gnmt_v2 attention: the output of the first LSTM layer of the decoder goes into attention, and the re-weighted context is then concatenated with the inputs to all subsequent LSTM layers of the decoder at the current timestep; a small sketch of this data flow follows.
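
A minimal sketch of that data flow (the shapes and the use of nn.MultiheadAttention as a stand-in for the repository's Bahdanau attention are illustrative only):

import torch
import torch.nn as nn

hidden = 1024
lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)                # input = [h1 ; context]
attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)  # stand-in for Bahdanau attention

x = torch.randn(4, 1, hidden)                    # one decoder timestep, batch of 4
enc_out = torch.randn(4, 20, hidden)             # encoder outputs (attention keys/values)

h1, _ = lstm1(x)                                 # first decoder LSTM layer
context, _ = attn(h1, enc_out, enc_out)          # attention fed by the first layer's output
h2, _ = lstm2(torch.cat([h1, context], dim=-1))  # context concatenated with the layer input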

The same attention mechanism is also implemented in default GNMT-like models from tensorflow/nmt and NVIDIA/OpenSeq2Seq.

Structure
  • general:
    • encoder and decoder are using shared embeddings
    • data-parallel multi-gpu training
    • trained with label smoothing loss (smoothing factor 0.1)
  • encoder:
    • 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest of the layers are unidirectional
    • with residual connections starting from 3rd LSTM layer
    • uses standard pytorch nn.LSTM layer
    • dropout is applied on input to all LSTM layers, probability of dropout is set to 0.2
    • hidden state of LSTM layers is initialized with zeros
    • weights and biases of the LSTM layers are initialized with the uniform(-0.1, 0.1) distribution
  • decoder:
    • 4-layer unidirectional LSTM with hidden size 1024 and fully-connected classifier
    • with residual connections starting from 3rd LSTM layer
    • uses standard pytorch nn.LSTM layer
    • dropout is applied on input to all LSTM layers, probability of dropout is set to 0.2
    • hidden state of LSTM layers is initialized with zeros
    • weights and biases of the LSTM layers are initialized with the uniform(-0.1, 0.1) distribution
    • weights and biases of the fully-connected classifier are initialized with the uniform(-0.1, 0.1) distribution
  • attention:
    • normalized Bahdanau attention
    • model uses gnmt_v2 attention mechanism
    • output from first LSTM layer of decoder goes into attention, then re-weighted context is concatenated with the input to all subsequent LSTM layers in decoder at the current timestep
    • linear transform of keys and queries is initialized with uniform(-0.1, 0.1), normalization scalar is initialized with 1.0 / sqrt(1024), normalization bias is initialized with zero
  • inference:
    • beam search with beam size of 5
    • with coverage penalty and length normalization, coverage penalty factor is set to 0.1, length normalization factor is set to 0.6 and length normalization constant is set to 5.0
    • BLEU computed by sacrebleu

Implementation:

  • base Seq2Seq model: pytorch/seq2seq/models/seq2seq_base.py, class Seq2Seq
  • GNMT model: pytorch/seq2seq/models/gnmt.py, class GNMT
  • encoder: pytorch/seq2seq/models/encoder.py, class ResidualRecurrentEncoder
  • decoder: pytorch/seq2seq/models/decoder.py, class ResidualRecurrentDecoder
  • attention: pytorch/seq2seq/models/attention.py, class BahdanauAttention
  • inference (including BLEU evaluation and detokenization): pytorch/seq2seq/inference/inference.py, class Translator
  • beam search: pytorch/seq2seq/inference/beam_search.py, class SequenceGenerator
Loss function

Cross entropy loss with label smoothing (smoothing factor = 0.1); padding is not counted as part of the loss.

Loss function is implemented in pytorch/seq2seq/train/smoothing.py, class LabelSmoothing.
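
A hedged PyTorch sketch of such a loss (not the repository's LabelSmoothing class; smoothing factor 0.1, padding positions are masked out):

import torch
import torch.nn.functional as F

def label_smoothing_nll(logits, target, padding_idx, smoothing=0.1):
    # logits: (N, vocab), target: (N,); positions equal to padding_idx are ignored.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
    smooth = -log_probs.mean(dim=-1)           # spread the smoothing mass over the vocabulary
    loss = (1.0 - smoothing) * nll + smoothing * smooth
    mask = target.ne(padding_idx)              # drop padding tokens from the loss
    return loss[mask].sum()

# tiny usage example with random logits
loss = label_smoothing_nll(torch.randn(6, 100), torch.tensor([1, 5, 0, 0, 7, 3]), padding_idx=0)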

Optimizer

Adam optimizer with learning rate 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 and no weight decay. The network is trained with gradient clipping; the max L2 norm of the gradients is set to 5.0.

Optimizer is implemented in pytorch/seq2seq/train/fp_optimizers.py, class Fp32Optimizer.
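
The equivalent setup in plain PyTorch looks roughly like this (a sketch with a stand-in model and loss, not the repository's Fp32Optimizer):

import torch

model = torch.nn.Linear(8, 8)                       # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
loss = model(torch.randn(2, 8)).sum()               # stand-in loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip gradients to max L2 norm 5.0
optimizer.step()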

Learning rate schedule

The model is trained with an exponential learning rate warmup for 200 steps followed by step learning rate decay. Decay starts after 2/3 of the training steps and happens 4 times in total, at regularly spaced intervals, with a decay factor of 0.5.

Learning rate scheduler is implemented in pytorch/seq2seq/train/lr_scheduler.py, class WarmupMultiStepLR.
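
A small sketch of that schedule, assuming a hypothetical base learning rate and total step count (the warmup shape, an exponential ramp from 1% of the base LR, is an assumption):

import math

base_lr, warmup_steps, total_steps = 1e-3, 200, 30000    # base_lr/total_steps are example values
decays, factor = 4, 0.5
remain_steps = int(2 / 3 * total_steps)                  # decay starts after 2/3 of training
decay_interval = (total_steps - remain_steps) // decays  # 4 regularly spaced decays

def lr_at(step):
    if step < warmup_steps:
        # exponential warmup from base_lr * 0.01 up to base_lr
        return base_lr * math.exp(math.log(0.01) * (warmup_steps - step) / warmup_steps)
    if step < remain_steps:
        return base_lr
    num_decays = min(decays, (step - remain_steps) // decay_interval + 1)
    return base_lr * factor ** num_decays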

Evaluation

Quality metric

Uncased BLEU score on the newstest2014 en-de dataset. BLEU scores are reported by the sacrebleu package (version 1.2.10). Sacrebleu is executed with the following flags: --score-only -lc --tokenize intl.

Quality target

Uncased BLEU score of 24.00.

Evaluation frequency

Evaluation of BLEU score is done after every epoch.

Evaluation thoroughness

Evaluation uses all of newstest2014.en (3003 sentences).

Recommendation

This test case measures the performance of the NCF model from the recommendation domain under the PyTorch framework on the ROCm platform; it has been verified with ROCm 3.3 and PyTorch 1.5. The test procedure is as follows.

Test procedure

Data processing

Dataset download link
MovieLens | GroupLens
Convert the data format
ml-1m

python3 convert.py --path /path/to/{ml-1m_dir}/ratings.dat --output dataset/ml-1m  

ml-20m

python3 convert.py --path /path/to/{ml-20m_dir}/ratings.csv --output dataset/ml-20m  
Run commands
python3 -m torch.distributed.launch --nproc_per_node=<number_of_gpus> --use_env ncf.py --data <path_to_dataset> [other_parameters]  

Single-GPU example

python3 -m torch.distributed.launch --nproc_per_node=1 --use_env ncf.py --data=./dataset/ml-1m --checkpoint_dir=/path/to/{check_save_dir}  

4-GPU example

python3 -m torch.distributed.launch --nproc_per_node=4 --use_env ncf.py --data=./dataset/ml-1m --checkpoint_dir=/path/to/{check_save_dir}

References

DeepLearningExamples/PyTorch/Recommendation/NCF at 92829376a126286932496ff10d7cc655cb79af05 · NVIDIA/DeepLearningExamples · GitHub
