Clone the MLPerf training results repository locally
We use training_results_v0.6 rather than the reference implementations provided in the mlperf/training repository. Note that the reference implementations effectively serve as starting points for the benchmark implementations, but they are not fully optimized and are not intended for "real" performance measurement of software frameworks or hardware.
git clone https://github.com/Caiyishuai/training_results_v0.6
This repository contains one directory per vendor submission (Google, Intel, NVIDIA, etc.) with the code and scripts used to generate their results. We will run the benchmark on NVIDIA GPUs.
[root@2 ~]# cd training_results_v0.6/
[root@2 training_results_v0.6]# ls
Alibaba CONTRIBUTING.md Fujitsu Google Intel LICENSE NVIDIA README.md
[root@2 training_results_v0.6]# cd NVIDIA/; ls
benchmarks LICENSE.md README.md results systems
[root@2 NVIDIA]# cd benchmarks/; ls
gnmt maskrcnn minigo resnet ssd transformer
Download and verify the dataset
[root@2 implementations]# pwd
/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations
[root@2 implementations]# ls
data download_dataset2.sh download_dataset3.sh download_dataset.sh pytorch verify_dataset.sh wget-log
[root@2 implementations]# bash download_dataset.sh
Look at download_dataset.sh to see the exact download URLs. If your network connection is slow, you can copy the URLs into another download tool and then modify download_dataset.sh accordingly.
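For example, the Europarl archive (its URL appears in the script below) can be fetched separately with a resumable download and parked under data/ until the script is adjusted; a minimal sketch:
# Resumable download outside the script; -c continues a partially downloaded file.
wget -c -O data/de-en.tgz http://www.statmt.org/europarl/v7/de-en.tgz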
[root@2 implementations]# cat download_dataset.sh
#! /usr/bin/env bash
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
OUTPUT_DIR=${1:-"data"}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."
OUTPUT_DIR_DATA="${OUTPUT_DIR}/data"
mkdir -p $OUTPUT_DIR_DATA
echo "Downloading Europarl v7. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
http://www.statmt.org/europarl/v7/de-en.tgz
echo "Downloading Common Crawl corpus. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/common-crawl.tgz \
http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
echo "Downloading News Commentary v11. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/nc-v11.tgz \
http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
echo "Downloading dev/test sets"
wget -nc -nv -O ${OUTPUT_DIR_DATA}/dev.tgz \
http://data.statmt.org/wmt16/translation-task/dev.tgz
wget -nc -nv -O ${OUTPUT_DIR_DATA}/test.tgz \
http://data.statmt.org/wmt16/translation-task/test.tgz
………………
done
echo "All done."
If you have already downloaded the files into this directory by other means, you can replace the wget commands above with mv commands:
echo "Downloading Europarl v7. This may take a while..."
mv -i data/de-en.tgz ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz
echo "Downloading Common Crawl corpus. This may take a while..."
mv -i data/training-parallel-commoncrawl.tgz ${OUTPUT_DIR_DATA}/common-crawl.tgz
echo "Downloading News Commentary v11. This may take a while..."
mv -i data/training-parallel-nc-v11.tgz ${OUTPUT_DIR_DATA}/nc-v11.tgz
echo "Downloading dev/test sets"
mv -i data/dev.tgz ${OUTPUT_DIR_DATA}/dev.tgz
mv -i data/test.tgz ${OUTPUT_DIR_DATA}/test.tgz
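An alternative sketch that avoids editing the script at all: because the script downloads with wget -nc (skip files that already exist), the pre-downloaded archives can simply be placed at the paths the script expects and the unmodified script re-run.
# OUTPUT_DIR_DATA defaults to data/data; put the archives there under the names
# the script uses, then re-run it and let wget -nc skip the existing files.
mkdir -p data/data
mv -i data/de-en.tgz data/data/europarl-v7-de-en.tgz
mv -i data/training-parallel-commoncrawl.tgz data/data/common-crawl.tgz
mv -i data/training-parallel-nc-v11.tgz data/data/nc-v11.tgz
mv -i data/dev.tgz data/data/dev.tgz
mv -i data/test.tgz data/data/test.tgz
bash download_dataset.sh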
Run the verification script to confirm that the dataset was downloaded correctly.
[root@2 implementations]# du -sh data/
13G data/
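The verification script shipped alongside download_dataset.sh can then be run; a minimal sketch, assuming verify_dataset.sh exits with a nonzero status when a checksum does not match:
# Verify the downloaded archives before moving on to preprocessing and training.
bash verify_dataset.sh && echo "dataset OK"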
Prepare the configuration and start training
The scripts and code used to run the training job are located in the pytorch directory.
[root@2 implementations]# cd pytorch/
[root@2 pytorch]# ll
total 124
-rw-r--r-- 1 root root 5047 Jan 22 15:45 bind_launch.py
-rwxr-xr-x 1 root root 1419 Jan 22 15:45 config_DGX1_multi.sh
-rwxr-xr-x 1 root root 718 Jan 25 10:50 config_DGX1.sh
-rwxr-xr-x 1 root root 1951 Jan 22 15:45 config_DGX2_multi_16x16x32.sh
-rwxr-xr-x 1 root root 1950 Jan 22 15:45 config_DGX2_multi.sh
-rwxr-xr-x 1 root root 718 Jan 22 15:45 config_DGX2.sh
-rw-r--r-- 1 root root 1372 Jan 22 15:45 Dockerfile
-rw-r--r-- 1 root root 1129 Jan 22 15:45 LICENSE
-rw-r--r-- 1 root root 6494 Jan 22 15:45 mlperf_log_utils.py
-rw-r--r-- 1 root root 4145 Jan 22 15:45 preprocess_data.py
-rw-r--r-- 1 root root 12665 Jan 22 15:45 README.md
-rw-r--r-- 1 root root 43 Jan 22 15:45 requirements.txt
-rwxr-xr-x 1 root root 2220 Jan 22 15:45 run_and_time.sh
-rwxr-xr-x 1 root root 7173 Jan 25 10:56 run.sub
drwxr-xr-x 3 root root 45 Jan 22 15:45 scripts
drwxr-xr-x 7 root root 90 Jan 22 15:45 seq2seq
-rw-r--r-- 1 root root 1082 Jan 22 15:45 setup.py
-rw-r--r-- 1 root root 25927 Jan 22 15:45 train.py
-rw-r--r-- 1 root root 8056 Jan 22 15:45 translate.py
config_<system>.sh needs to be configured to reflect your system. If the system has 8 or 16 GPUs, the existing config_DGX1.sh or config_DGX2.sh configuration file can be used to launch the training job.
Parameters to edit: DGXNGPU=8, DGXSOCKETCORES=18, DGXNSOCKET=2
You can get GPU information with the nvidia-smi command and CPU information with the lscpu command, in particular:
Core(s) per socket: 18
Socket(s): 2
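For reference, a sketch of how the system-topology variables in config_DGX1.sh would be set for this machine (the actual file also contains training hyperparameters, which are left untouched):
# Match the config to the nvidia-smi and lscpu output above.
DGXNGPU=8          # number of GPUs reported by nvidia-smi
DGXSOCKETCORES=18  # "Core(s) per socket" from lscpu
DGXNSOCKET=2       # "Socket(s)" from lscpu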
Build the Docker image
docker build -t mlperf-nvidia:rnn_translator .
This takes quite a while.
[root@2 pytorch]# docker build -t mlperf-nvidia:rnn_translator .
Sending build context to Docker daemon 279kB
Step 1/12 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
Step 2/12 : FROM ${FROM_IMAGE_NAME}
19.05-py3: Pulling from nvidia/pytorch
7e6591854262: Pulling fs layer
089d60cb4e0a: Pulling fs layer
7e6591854262: Downloading [> ] 452.4kB/43.75MB
45085432511a: Waiting
6ca460804a89: Waiting
2631f04ebf64: Pulling fs layer
86f56e03e071: Pulling fs layer
234646620160: Waiting
7f717cd17058: Waiting
e69a2ba99832: Waiting
bc9bca17b13c: Waiting
1870788e477f: Waiting
603e0d586945: Waiting
717dfedf079c: Waiting
2631f04ebf64: Waiting
c5bd7559c3ad: Waiting
9c461696bc09: Download complete
059d4f560014: Waiting
f3f14cff44df: Waiting
603e0d586945: Downloading [==========> ] 102.7MB/492.7MB
c5bd7559c3ad: Pull complete
d82c679b8708: Pull complete
059d4f560014: Pull complete
f3f14cff44df: Pull complete
96502bde320c: Pull complete
bc5bb9379810: Pull complete
e4d8bb046bc2: Pull complete
4e2187010a7c: Pull complete
9d62684b94c3: Pull complete
e70e61e48991: Pull complete
683f2d0d75c5: Pull complete
d91684765fac: Pull complete
ceb6cf7ee657: Pull complete
8d2533535f88: Pull complete
15c2061baa94: Pull complete
fe35706ec086: Pull complete
ef06e50267e2: Pull complete
24569ba3e1d3: Pull complete
c49dc7cbf15c: Pull complete
34e55507c797: Pull complete
c26e49a3c2c6: Pull complete
7f6410878ec9: Pull complete
97f3bcccbcdf: Pull complete
3f9a50c314fa: Pull complete
d6c800c70bb2: Pull complete
9c785de98406: Pull complete
acb71385d77d: Pull complete
ea9fb98cc638: Pull complete
08e43405860a: Pull complete
02899df1d7b5: Pull complete
66e5d0f2b0fa: Pull complete
46bb7884fc3b: Pull complete
af50c16f8064: Pull complete
a8c14d818405: Pull complete
8c3f313defdf: Pull complete
Digest: sha256:6614fa29720fc253bcb0e99c29af2f93caff16976661f241ec5ed5cf08e7c010
Status: Downloaded newer image for nvcr.io/nvidia/pytorch:19.05-py3
---> 7e98758d4777
Step 3/12 : RUN apt-get update && apt-get install -y --no-install-recommends infiniband-diags pciutils && rm -rf /var/lib/apt/lists/*
---> Running in 7b374edf0b57
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [1905 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:8 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:9 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [15.9 kB]
Get:10 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [982 kB]
Get:11 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [8820 B]
Get:12 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [2414 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [16.4 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [1534 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [26.4 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [10.9 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [12.6 kB]
Fetched 19.1 MB in 30s (621 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
libibmad5 libibnetdisc5 libibumad3 libosmcomp3 libpci3
The following NEW packages will be installed:
infiniband-diags libibmad5 libibnetdisc5 libibumad3 libosmcomp3 libpci3
pciutils
0 upgraded, 7 newly installed, 0 to remove and 120 not upgraded.
Need to get 574 kB of archives.
After this operation, 2638 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libpci3 amd64 1:3.3.1-1.1ubuntu1.3 [24.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 pciutils amd64 1:3.3.1-1.1ubuntu1.3 [254 kB]
Get:3 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibumad3 amd64 1.3.10.2-1 [16.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibmad5 amd64 1.3.12-1 [29.9 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libosmcomp3 amd64 3.3.19-1 [22.2 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibnetdisc5 amd64 1.6.6-1 [22.8 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/universe amd64 infiniband-diags amd64 1.6.6-1 [205 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 574 kB in 5s (112 kB/s)
Selecting previously unselected package libpci3:amd64.
(Reading database ... 21560 files and directories currently installed.)
Preparing to unpack .../libpci3_1%3a3.3.1-1.1ubuntu1.3_amd64.deb ...
Unpacking libpci3:amd64 (1:3.3.1-1.1ubuntu1.3) ...
Selecting previously unselected package pciutils.
Preparing to unpack .../pciutils_1%3a3.3.1-1.1ubuntu1.3_amd64.deb ...
Unpacking pciutils (1:3.3.1-1.1ubuntu1.3) ...
Selecting previously unselected package libibumad3.
Preparing to unpack .../libibumad3_1.3.10.2-1_amd64.deb ...
Unpacking libibumad3 (1.3.10.2-1) ...
Selecting previously unselected package libibmad5.
Preparing to unpack .../libibmad5_1.3.12-1_amd64.deb ...
Unpacking libibmad5 (1.3.12-1) ...
Selecting previously unselected package libosmcomp3.
Preparing to unpack .../libosmcomp3_3.3.19-1_amd64.deb ...
Unpacking libosmcomp3 (3.3.19-1) ...
Selecting previously unselected package libibnetdisc5.
Preparing to unpack .../libibnetdisc5_1.6.6-1_amd64.deb ...
Unpacking libibnetdisc5 (1.6.6-1) ...
Selecting previously unselected package infiniband-diags.
Preparing to unpack .../infiniband-diags_1.6.6-1_amd64.deb ...
Unpacking infiniband-diags (1.6.6-1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Setting up libpci3:amd64 (1:3.3.1-1.1ubuntu1.3) ...
Setting up pciutils (1:3.3.1-1.1ubuntu1.3) ...
Setting up libibumad3 (1.3.10.2-1) ...
Setting up libibmad5 (1.3.12-1) ...
Setting up libosmcomp3 (3.3.19-1) ...
Setting up libibnetdisc5 (1.6.6-1) ...
Setting up infiniband-diags (1.6.6-1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Removing intermediate container 7b374edf0b57
---> 91942ef1e039
Step 4/12 : WORKDIR /workspace/rnn_translator
---> Running in 150b2d9df1cc
Removing intermediate container 150b2d9df1cc
---> 17720ab57857
Step 5/12 : COPY requirements.txt .
---> fc25fbdf0006
Step 6/12 : RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance && pip install --no-cache-dir -r requirements.txt
---> Running in 88b21caded36
Collecting https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance
Downloading https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip
Building wheels for collected packages: mlperf-compliance
Building wheel for mlperf-compliance (setup.py): started
Building wheel for mlperf-compliance (setup.py): finished with status 'done'
Stored in directory: /tmp/pip-ephem-wheel-cache-c_6ttc8p/wheels/9e/73/0a/3c481ccbda248a195828b8ea5173e83b8394051d8c40e08660
Successfully built mlperf-compliance
Installing collected packages: mlperf-compliance
Found existing installation: mlperf-compliance 0.0.10
Uninstalling mlperf-compliance-0.0.10:
Successfully uninstalled mlperf-compliance-0.0.10
Successfully installed mlperf-compliance-0.6.0
Requirement already satisfied: mlperf-compliance==0.6.0 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (0.6.0)
Requirement already satisfied: sacrebleu==1.2.10 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (1.2.10)
Requirement already satisfied: typing in /opt/conda/lib/python3.6/site-packages (from sacrebleu==1.2.10->-r requirements.txt (line 2)) (3.6.6)
Removing intermediate container 88b21caded36
---> 346646500f0f
Step 7/12 : COPY seq2seq/csrc seq2seq/csrc
---> 936e5bc1a41e
Step 8/12 : COPY setup.py .
---> 090cc90c4cb5
Step 9/12 : RUN pip install .
---> Running in 0547065d6492
Processing /workspace/rnn_translator
Requirement already satisfied: mlperf-compliance==0.6.0 in /opt/conda/lib/python3.6/site-packages (from gnmt==0.6.0) (0.6.0)
Requirement already satisfied: sacrebleu==1.2.10 in /opt/conda/lib/python3.6/site-packages (from gnmt==0.6.0) (1.2.10)
Requirement already satisfied: typing in /opt/conda/lib/python3.6/site-packages (from sacrebleu==1.2.10->gnmt==0.6.0) (3.6.6)
Building wheels for collected packages: gnmt
Building wheel for gnmt (setup.py): started
Building wheel for gnmt (setup.py): still running...
Building wheel for gnmt (setup.py): finished with status 'done'
Stored in directory: /tmp/pip-ephem-wheel-cache-_jrlxic9/wheels/84/b6/f1/20addc378b275e39e227da5ee58c19f8e2433a88fd6e5fbf7b
Successfully built gnmt
Installing collected packages: gnmt
Successfully installed gnmt-0.6.0
Removing intermediate container 0547065d6492
---> 7a7bb07a7855
Step 10/12 : COPY . .
---> dfa84645d44d
Step 11/12 : ENV LANG C.UTF-8
---> Running in 992046e4ef3e
Removing intermediate container 992046e4ef3e
---> d1e6862fe916
Step 12/12 : ENV LC_ALL C.UTF-8
---> Running in c67514666b6d
Removing intermediate container c67514666b6d
---> 2d4231f91c86
Successfully built 2d4231f91c86
Successfully tagged mlperf-nvidia:rnn_translator
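A quick check that the image was built and tagged as expected (the tag matches the one passed to docker build above):
# List the freshly built image; REPOSITORY/TAG should show mlperf-nvidia / rnn_translator.
docker images mlperf-nvidia:rnn_translator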
We can take a look at the Dockerfile:
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
FROM ${FROM_IMAGE_NAME}
# Install dependencies for system configuration logger
RUN apt-get update && apt-get install -y --no-install-recommends \
infiniband-diags \
pciutils && \
rm -rf /var/lib/apt/lists/*
# Install Python dependencies
WORKDIR /workspace/rnn_translator
COPY requirements.txt .
RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance \
&& pip install --no-cache-dir -r requirements.txt
# Copy & build extensions
COPY seq2seq/csrc seq2seq/csrc
COPY setup.py .
RUN pip install .
# Copy GNMT code
COPY . .
# Configure environment variables
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
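Because the base image is parameterized through ARG FROM_IMAGE_NAME, a different base can be selected at build time without editing the Dockerfile; for example (the value shown is just the default 19.05 image used above):
# Override the base image via the build argument declared in the Dockerfile.
docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3 \
             -t mlperf-nvidia:rnn_translator .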
Docker caveat: ARG cannot be used before FROM in a Dockerfile on older Docker versions; support for this usage was only introduced in Docker 17.05.0-ce (2017-05-04). Check the local Docker version:
[root@2 pytorch]# docker version
Client: Docker Engine - Community
Version: 20.10.2
API version: 1.41
Go version: go1.13.15
Git commit: 2291f61
Built: Mon Dec 28 16:17:48 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
…………
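The 20.10.2 engine shown above is well past 17.05, so nothing needs to change here. On an older engine, one workaround sketch is to hard-code the base image in place of the ARG/FROM pair:
# Only needed on Docker < 17.05: drop the ARG line and hard-code the FROM image.
sed -i -e '/^ARG FROM_IMAGE_NAME=/d' \
       -e 's|^FROM ${FROM_IMAGE_NAME}|FROM nvcr.io/nvidia/pytorch:19.05-py3|' Dockerfile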
For this test, config_DGX1.sh is used, so DGXSYSTEM is set to DGX1. PULL is also set to 0 to indicate that the locally built image should be used instead of pulling the Docker image from a registry. A new directory, logs, is created to store the benchmark log files, and the data directory path is supplied when launching the benchmark run, as shown below:
[root@2 pytorch]# DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data
[root@2 pytorch]# LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs
[root@2 pytorch]# PULL=0 DGXSYSTEM=DGX1
[root@2 pytorch]# ./run.sub
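Note that plain shell assignments like the ones above are not inherited by ./run.sub unless they are exported; if run.sub is meant to read these values from the environment (rather than from defaults edited directly in the script), export them first or pass them on the invocation line, e.g.:
# Make the variables visible to the child process started by ./run.sub.
export DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data
export LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs
export PULL=0 DGXSYSTEM=DGX1
./run.sub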
If the following error is reported, the machine has no usable GPU (the NVIDIA UVM kernel module could not be loaded):
[root@2 pytorch]# ./run.sub
mlperf-nvidia:rnn_translator
nvidia-docker | 2021/01/25 13:48:39 Error: Could not load UVM kernel module. Is nvidia-modprobe installed?
ERR: Base container launch failed.
Check the GPU information with nvidia-smi.
If everything goes well, the benchmark is executed 10 times and the log files are stored in the specified directory. Since 8 GPUs are specified in the config file, you will see all 8 GPUs being used to train the GNMT model. GPU usage can be monitored periodically with watch -d -n 1 nvidia-smi, as in the sketch below.
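A minimal way to keep an eye on the run (assuming one log file per run is written under $LOGDIR):
# From a second terminal: watch GPU utilization while training.
watch -d -n 1 nvidia-smi
# When the runs finish, the per-run log files should appear here.
ls -lh ${LOGDIR}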
PULL=0 can also be set directly inside run.sub:
# Pull latest image
PULL=0
if [[ "${PULL}" != "0" ]]; then
    DOCKERPULL="docker pull $CONT"
    pids=();
    for hostn in ${hosts[@]}; do
        timeout -k 600s 600s \
            $(eval echo $SRUN) $DOCKERPULL &
        pids+=($!);
    done
    wait "${pids[@]}"
    success=$? ; if [ $success -ne 0 ]; then echo "ERR: Image pull failed."; exit $success ; fi
fi
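Instead of hard-coding the value, an alternative sketch (assuming nothing else in run.sub depends on PULL being overwritten at this point) is to keep it overridable from the environment:
# Use the caller's PULL if it is set, otherwise default to 0 (use the local image).
PULL=${PULL:-0}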
Reference: https://segmentfault.com/a/1190000022834920?utm_source=tag-newest