Clone the MLPerf training results repository locally
We use training_results_v0.6 rather than the reference implementations provided in the mlperf/training repository. Note that the reference implementations effectively serve as starting points for the benchmark implementations, but they are not fully optimized and are not intended for "real" performance measurement of software frameworks or hardware.
git clone https://github.com/Caiyishuai/training_results_v0.6
This repository contains one directory per vendor submission (Google, Intel, NVIDIA, etc.) with the code and scripts used to generate their results. We will run the benchmark on NVIDIA GPUs.
[root@2 ~]# cd training_results_v0.6/
[root@2 training_results_v0.6]# ls
Alibaba CONTRIBUTING.md Fujitsu Google Intel LICENSE NVIDIA README.md
[root@2 training_results_v0.6]# cd NVIDIA/; ls
benchmarks LICENSE.md README.md results systems
[root@2 NVIDIA]# cd benchmarks/; ls
gnmt maskrcnn minigo resnet ssd transformer
Download and verify the dataset
[root@2 implementations]# pwd
/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations
[root@2 implementations]# ls
data download_dataset2.sh download_dataset3.sh download_dataset.sh pytorch verify_dataset.sh wget-log
[root@2 implementations]# bash download_dataset.sh
Look at download_dataset.sh to see the exact download URLs. If your network connection is slow, you can copy the URLs into another download tool and then modify download_dataset.sh accordingly.
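For example, the Europarl archive (its URL appears in the script below) can be fetched separately with a resumable download and parked under data/ until the script is adjusted; a minimal sketch:
# Resumable download outside the script; -c continues a partially downloaded file.
wget -c -O data/de-en.tgz http://www.statmt.org/europarl/v7/de-en.tgz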
[root@2 implementations]# cat download_dataset.sh
#! /usr/bin/env bash
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
OUTPUT_DIR=${1:-"data"}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."
OUTPUT_DIR_DATA="${OUTPUT_DIR}/data"
mkdir -p $OUTPUT_DIR_DATA
echo "Downloading Europarl v7. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
http://www.statmt.org/europarl/v7/de-en.tgz
echo "Downloading Common Crawl corpus. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/common-crawl.tgz \
http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
echo "Downloading News Commentary v11. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/nc-v11.tgz \
http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
echo "Downloading dev/test sets"
wget -nc -nv -O ${OUTPUT_DIR_DATA}/dev.tgz \
http://data.statmt.org/wmt16/translation-task/dev.tgz
wget -nc -nv -O ${OUTPUT_DIR_DATA}/test.tgz \
http://data.statmt.org/wmt16/translation-task/test.tgz
………………
done
echo "All done."
If you have already downloaded the files into this directory by other means, you can replace the wget commands above with mv commands:
echo "Downloading Europarl v7. This may take a while..."
mv -i data/de-en.tgz ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz
echo "Downloading Common Crawl corpus. This may take a while..."
mv -i data/training-parallel-commoncrawl.tgz ${OUTPUT_DIR_DATA}/common-crawl.tgz
echo "Downloading News Commentary v11. This may take a while..."
mv -i data/training-parallel-nc-v11.tgz ${OUTPUT_DIR_DATA}/nc-v11.tgz
echo "Downloading dev/test sets"
mv -i data/dev.tgz ${OUTPUT_DIR_DATA}/dev.tgz
mv -i data/test.tgz ${OUTPUT_DIR_DATA}/test.tgz
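An alternative sketch that avoids editing the script at all: because the script downloads with wget -nc (skip files that already exist), the pre-downloaded archives can simply be placed at the paths the script expects and the unmodified script re-run.
# OUTPUT_DIR_DATA defaults to data/data; put the archives there under the names
# the script uses, then re-run it and let wget -nc skip the existing files.
mkdir -p data/data
mv -i data/de-en.tgz data/data/europarl-v7-de-en.tgz
mv -i data/training-parallel-commoncrawl.tgz data/data/common-crawl.tgz
mv -i data/training-parallel-nc-v11.tgz data/data/nc-v11.tgz
mv -i data/dev.tgz data/data/dev.tgz
mv -i data/test.tgz data/data/test.tgz
bash download_dataset.sh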
Run the verification script to confirm that the dataset was downloaded correctly.
[root@2 implementations]# du -sh data/
13G data/
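The verification script shipped alongside download_dataset.sh can then be run; a minimal sketch, assuming verify_dataset.sh exits with a nonzero status when a checksum does not match:
# Verify the downloaded archives before moving on to preprocessing and training.
bash verify_dataset.sh && echo "dataset OK"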
Prepare the configuration and start training
The scripts and code used to run the training job are located in the pytorch directory.
[root@2 implementations]# cd pytorch/
[root@2 pytorch]# ll
total 124
-rw-r--r-- 1 root root 5047 Jan 22 15:45 bind_launch.py
-rwxr-xr-x 1 root root 1419 Jan 22 15:45 config_DGX1_multi.sh
-rwxr-xr-x 1 root root 718 Jan 25 10:50 config_DGX1.sh
-rwxr-xr-x 1 root root 1951 Jan 22 15:45 config_DGX2_multi_16x16x32.sh
-rwxr-xr-x 1 root root 1950 Jan 22 15:45 config_DGX2_multi.sh
-rwxr-xr-x 1 root root 718 Jan 22 15:45 config_DGX2.sh
-rw-r--r-- 1 root root 1372 Jan 22 15:45 Dockerfile
-rw-r--r-- 1 root root 1129 Jan 22 15:45 LICENSE
-rw-r--r-- 1 root root 6494 Jan 22 15:45 mlperf_log_utils.py
-rw-r--r-- 1 root root 4145 Jan 22 15:45 preprocess_data.py
-rw-r--r-- 1 root root 12665 Jan 22 15:45 README.md
-rw-r--r-- 1 root root 43 Jan 22 15:45 requirements.txt
-rwxr-xr-x 1 root root 2220 Jan 22 15:45 run_and_time.sh
-rwxr-xr-x 1 root root 7173 Jan 25 10:56 run.sub
drwxr-xr-x 3 root root 45 Jan 22 15:45 scripts
drwxr-xr-x 7 root root 90 Jan 22 15:45 seq2seq
-rw-r--r-- 1 root root 1082 Jan 22 15:45 setup.py
-rw-r--r-- 1 root root 25927 Jan 22 15:45 train.py
-rw-r--r-- 1 root root 8056 Jan 22 15:45 translate.py
config_<system>.sh needs to be configured to reflect your system. If the system has 8 or 16 GPUs, the existing config_DGX1.sh or config_DGX2.sh configuration file can be used to launch the training job.
Parameters to edit: DGXNGPU=8, DGXSOCKETCORES=18, DGXNSOCKET=2
You can get GPU information with the nvidia-smi command and CPU information with the lscpu command, in particular:
Core(s) per socket: 18
Socket(s): 2
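For reference, a sketch of how the system-topology variables in config_DGX1.sh would be set for this machine (the actual file also contains training hyperparameters, which are left untouched):
# Match the config to the nvidia-smi and lscpu output above.
DGXNGPU=8          # number of GPUs reported by nvidia-smi
DGXSOCKETCORES=18  # "Core(s) per socket" from lscpu
DGXNSOCKET=2       # "Socket(s)" from lscpu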
Build the Docker image
docker build -t mlperf-nvidia:rnn_translator .
This takes quite a while.
[root@2 pytorch]# docker build -t mlperf-nvidia:rnn_translator .
Sending build context to Docker daemon 279kB
Step 1/12 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
Step 2/12 : FROM ${FROM_IMAGE_NAME}
19.05-py3: Pulling from nvidia/pytorch
7e6591854262: Pulling fs layer
089d60cb4e0a: Pulling fs layer
7e6591854262: Downloading [> ] 452.4kB/43.75MB
45085432511a: Waiting
6ca460804a89: Waiting
2631f04ebf64: Pulling fs layer
86f56e03e071: Pulling fs layer
234646620160: Waiting
7f717cd17058: Waiting
e69a2ba99832: Waiting
bc9bca17b13c: Waiting
1870788e477f: Waiting
603e0d586945: Waiting
717dfedf079c: Waiting
2631f04ebf64: Waiting
c5bd7559c3ad: Waiting
9c461696bc09: Download complete
059d4f560014: Waiting
f3f14cff44df: Waiting
603e0d586945: Downloading [==========> ] 102.7MB/492.7MB
c5bd7559c3ad: Pull complete
d82c679b8708: Pull complete
059d4f560014: Pull complete
f3f14cff44df: Pull complete
96502bde320c: Pull complete
bc5bb9379810: Pull complete
e4d8bb046bc2: Pull complete
4e2187010a7c: Pull complete
9d62684b94c3: Pull complete
e70e61e48991: Pull complete
683f2d0d75c5: Pull complete
d91684765fac: Pull complete
ceb6cf7ee657: Pull complete
8d2533535f88: Pull complete
15c2061baa94: Pull complete
fe35706ec086: Pull complete
ef06e50267e2: Pull complete
24569ba3e1d3: Pull complete
c49dc7cbf15c: Pull complete
34e55507c797: Pull complete
c26e49a3c2c6: Pull complete
7f6410878ec9: Pull complete
97f3bcccbcdf: Pull complete
3f9a50c314fa: Pull complete
d6c800c70bb2: Pull complete
9c785de98406: Pull complete
acb71385d77d: Pull complete
ea9fb98cc638: Pull complete
08e43405860a: Pull complete
02899df1d7b5: Pull complete
66e5d0f2b0fa: Pull complete
46bb7884fc3b: Pull complete
af50c16f8064: Pull complete
a8c14d818405: Pull complete
8c3f313defdf: Pull complete
Digest: sha256:6614fa29720fc253bcb0e99c29af2f93caff16976661f241ec5ed5cf08e7c010
Status: Downloaded newer image for nvcr.io/nvidia/pytorch:19.05-py3
---> 7e98758d4777
Step 3/12 : RUN apt-get update && apt-get install -y --no-install-recommends infiniband-diags pciutils && rm -rf /var/lib/apt/lists/*
---> Running in 7b374edf0b57
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [1905 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:8 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:9 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [15.9 kB]
Get:10 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [982 kB]
Get:11 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [8820 B]
Get:12 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [2414 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [16.4 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [1534 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [26.4 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [10.9 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [12.6 kB]
Fetched 19.1 MB in 30s (621 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
libibmad5 libibnetdisc5 libibumad3 libosmcomp3 libpci3
The following NEW packages will be installed:
infiniband-diags libibmad5 libibnetdisc5 libibumad3 libosmcomp3 libpci3
pciutils
0 upgraded, 7 newly installed, 0 to remove and 120 not upgraded.
Need to get 574 kB of archives.
After this operation, 2638 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libpci3 amd64 1:3.3.1-1.1ubuntu1.3 [24.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 pciutils amd64 1:3.3.1-1.1ubuntu1.3 [254 kB]
Get:3 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibumad3 amd64 1.3.10.2-1 [16.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibmad5 amd64 1.3.12-1 [29.9 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libosmcomp3 amd64 3.3.19-1 [22.2 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libibnetdisc5 amd64 1.6.6-1 [22.8 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/universe amd64 infiniband-diags amd64 1.6.6-1 [205 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 574 kB in 5s (112 kB/s)
Selecting previously unselected package libpci3:amd64.
(Reading database ... 21560 files and directories currently installed.)
Preparing to unpack .../libpci3_1%3a3.3.1-1.1ubuntu1.3_amd64.deb ...
Unpacking libpci3:amd64 (1:3.3.1-1.1ubuntu1.3) ...
Selecting previously unselected package pciutils.
Preparing to unpack .../pciutils_1%3a3.3.1-1.1ubuntu1.3_amd64.deb ...
Unpacking pciutils (1:3.3.1-1.1ubuntu1.3) ...
Selecting previously unselected package libibumad3.
Preparing to unpack .../libibumad3_1.3.10.2-1_amd64.deb ...
Unpacking libibumad3 (1.3.10.2-1) ...
Selecting previously unselected package libibmad5.
Preparing to unpack .../libibmad5_1.3.12-1_amd64.deb ...
Unpacking libibmad5 (1.3.12-1) ...
Selecting previously unselected package libosmcomp3.
Preparing to unpack .../libosmcomp3_3.3.19-1_amd64.deb ...
Unpacking libosmcomp3 (3.3.19-1) ...
Selecting previously unselected package libibnetdisc5.
Preparing to unpack .../libibnetdisc5_1.6.6-1_amd64.deb ...
Unpacking libibnetdisc5 (1.6.6-1) ...
Selecting previously unselected package infiniband-diags.
Preparing to unpack .../infiniband-diags_1.6.6-1_amd64.deb ...
Unpacking infiniband-diags (1.6.6-1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Setting up libpci3:amd64 (1:3.3.1-1.1ubuntu1.3) ...
Setting up pciutils (1:3.3.1-1.1ubuntu1.3) ...
Setting up libibumad3 (1.3.10.2-1) ...
Setting up libibmad5 (1.3.12-1) ...
Setting up libosmcomp3 (3.3.19-1) ...
Setting up libibnetdisc5 (1.6.6-1) ...
Setting up infiniband-diags (1.6.6-1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Removing intermediate container 7b374edf0b57
---> 91942ef1e039
Step 4/12 : WORKDIR /workspace/rnn_translator
---> Running in 150b2d9df1cc
Removing intermediate container 150b2d9df1cc
---> 17720ab57857
Step 5/12 : COPY requirements.txt .
---> fc25fbdf0006
Step 6/12 : RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance && pip install --no-cache-dir -r requirements.txt
---> Running in 88b21caded36
Collecting https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance
Downloading https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip
Building wheels for collected packages: mlperf-compliance
Building wheel for mlperf-compliance (setup.py): started
Building wheel for mlperf-compliance (setup.py): finished with status 'done'
Stored in directory: /tmp/pip-ephem-wheel-cache-c_6ttc8p/wheels/9e/73/0a/3c481ccbda248a195828b8ea5173e83b8394051d8c40e08660
Successfully built mlperf-compliance
Installing collected packages: mlperf-compliance
Found existing installation: mlperf-compliance 0.0.10
Uninstalling mlperf-compliance-0.0.10:
Successfully uninstalled mlperf-compliance-0.0.10
Successfully installed mlperf-compliance-0.6.0
Requirement already satisfied: mlperf-compliance==0.6.0 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (0.6.0)
Requirement already satisfied: sacrebleu==1.2.10 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (1.2.10)
Requirement already satisfied: typing in /opt/conda/lib/python3.6/site-packages (from sacrebleu==1.2.10->-r requirements.txt (line 2)) (3.6.6)
Removing intermediate container 88b21caded36
---> 346646500f0f
Step 7/12 : COPY seq2seq/csrc seq2seq/csrc
---> 936e5bc1a41e
Step 8/12 : COPY setup.py .
---> 090cc90c4cb5
Step 9/12 : RUN pip install .
---> Running in 0547065d6492
Processing /workspace/rnn_translator
Requirement already satisfied: mlperf-compliance==0.6.0 in /opt/conda/lib/python3.6/site-packages (from gnmt==0.6.0) (0.6.0)
Requirement already satisfied: sacrebleu==1.2.10 in /opt/conda/lib/python3.6/site-packages (from gnmt==0.6.0) (1.2.10)
Requirement already satisfied: typing in /opt/conda/lib/python3.6/site-packages (from sacrebleu==1.2.10->gnmt==0.6.0) (3.6.6)
Building wheels for collected packages: gnmt
Building wheel for gnmt (setup.py): started
Building wheel for gnmt (setup.py): still running...
Building wheel for gnmt (setup.py): finished with status 'done'
Stored in directory: /tmp/pip-ephem-wheel-cache-_jrlxic9/wheels/84/b6/f1/20addc378b275e39e227da5ee58c19f8e2433a88fd6e5fbf7b
Successfully built gnmt
Installing collected packages: gnmt
Successfully installed gnmt-0.6.0
Removing intermediate container 0547065d6492
---> 7a7bb07a7855
Step 10/12 : COPY . .
---> dfa84645d44d
Step 11/12 : ENV LANG C.UTF-8
---> Running in 992046e4ef3e
Removing intermediate container 992046e4ef3e
---> d1e6862fe916
Step 12/12 : ENV LC_ALL C.UTF-8
---> Running in c67514666b6d
Removing intermediate container c67514666b6d
---> 2d4231f91c86
Successfully built 2d4231f91c86
Successfully tagged mlperf-nvidia:rnn_translator
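A quick check that the image was built and tagged as expected (the tag matches the one passed to docker build above):
# List the freshly built image; REPOSITORY/TAG should show mlperf-nvidia / rnn_translator.
docker images mlperf-nvidia:rnn_translator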
We can take a look at the Dockerfile:
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
FROM ${FROM_IMAGE_NAME}
# Install dependencies for system configuration logger
RUN apt-get update && apt-get install -y --no-install-recommends \
infiniband-diags \
pciutils && \
rm -rf /var/lib/apt/lists/*
# Install Python dependencies
WORKDIR /workspace/rnn_translator
COPY requirements.txt .
RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance \
&& pip install --no-cache-dir -r requirements.txt
# Copy & build extensions
COPY seq2seq/csrc seq2seq/csrc
COPY setup.py .
RUN pip install .
# Copy GNMT code
COPY . .
# Configure environment variables
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
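Because the base image is parameterized through ARG FROM_IMAGE_NAME, a different base can be selected at build time without editing the Dockerfile; for example (the value shown is just the default 19.05 image used above):
# Override the base image via the build argument declared in the Dockerfile.
docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3 \
             -t mlperf-nvidia:rnn_translator .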
Docker caveat: ARG cannot be used before FROM in a Dockerfile on older Docker versions; support for this usage was only introduced in Docker 17.05.0-ce (2017-05-04). Check the local Docker version:
[root@2 pytorch]# docker version
Client: Docker Engine - Community
Version: 20.10.2
API version: 1.41
Go version: go1.13.15
Git commit: 2291f61
Built: Mon Dec 28 16:17:48 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
…………
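The 20.10.2 engine shown above is well past 17.05, so nothing needs to change here. On an older engine, one workaround sketch is to hard-code the base image in place of the ARG/FROM pair:
# Only needed on Docker < 17.05: drop the ARG line and hard-code the FROM image.
sed -i -e '/^ARG FROM_IMAGE_NAME=/d' \
       -e 's|^FROM ${FROM_IMAGE_NAME}|FROM nvcr.io/nvidia/pytorch:19.05-py3|' Dockerfile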
For this test, config_DGX1.sh is used, so DGXSYSTEM is set to DGX1. PULL is also set to 0 to indicate that the locally built image should be used instead of pulling the Docker image from a registry. A new directory, logs, is created to store the benchmark log files, and the data directory path is supplied when launching the benchmark run, as shown below:
[root@2 pytorch]# DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data
[root@2 pytorch]# LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs
[root@2 pytorch]# PULL=0 DGXSYSTEM=DGX1
[root@2 pytorch]# ./run.sub
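Note that plain shell assignments like the ones above are not inherited by ./run.sub unless they are exported; if run.sub is meant to read these values from the environment (rather than from defaults edited directly in the script), export them first or pass them on the invocation line, e.g.:
# Make the variables visible to the child process started by ./run.sub.
export DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data
export LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs
export PULL=0 DGXSYSTEM=DGX1
./run.sub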
If the following error is reported, the machine has no usable GPU (the NVIDIA UVM kernel module could not be loaded):
[root@2 pytorch]# ./run.sub
mlperf-nvidia:rnn_translator
nvidia-docker | 2021/01/25 13:48:39 Error: Could not load UVM kernel module. Is nvidia-modprobe installed?
ERR: Base container launch failed.
Check the GPU information with nvidia-smi.
If everything goes well, the benchmark is executed 10 times and the log files are stored in the specified directory. Since 8 GPUs are specified in the config file, you will see all 8 GPUs being used to train the GNMT model. GPU usage can be monitored periodically with watch -d -n 1 nvidia-smi, as in the sketch below.
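A minimal way to keep an eye on the run (assuming one log file per run is written under $LOGDIR):
# From a second terminal: watch GPU utilization while training.
watch -d -n 1 nvidia-smi
# When the runs finish, the per-run log files should appear here.
ls -lh ${LOGDIR}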
PULL=0 can also be set directly inside run.sub:
# Pull latest image
PULL=0
if [[ "${PULL}" != "0" ]]; then
    DOCKERPULL="docker pull $CONT"
    pids=();
    for hostn in ${hosts[@]}; do
        timeout -k 600s 600s \
            $(eval echo $SRUN) $DOCKERPULL &
        pids+=($!);
    done
    wait "${pids[@]}"
    success=$? ; if [ $success -ne 0 ]; then echo "ERR: Image pull failed."; exit $success ; fi
fi
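Instead of hard-coding the value, an alternative sketch (assuming nothing else in run.sub depends on PULL being overwritten at this point) is to keep it overridable from the environment:
# Use the caller's PULL if it is set, otherwise default to 0 (use the local image).
PULL=${PULL:-0}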
Reference: https://segmentfault.com/a/1190000022834920?utm_source=tag-newest