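Hi everyone, I'm Liu Ming (刘明), founder of 明志科技 and a Huawei MindSpore evangelist.
I work mainly on front-end development, HarmonyOS development, and AI algorithm research.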
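Cache sharing
In single-node, multi-device distributed training, the cache also lets several copies of the same training script share one cache, all reading from and writing to it.
- Start the cache server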
$ cache_admin --start
Cache server startup completed successfully!
The cache server daemon has been created as process id 39337 and listening on port 50052
Recommendation:
Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup
- Create a cache session
Create a shell script, cache.sh, that will launch the Python training. The following commands generate a cache session id:
#!/bin/bash
# This shell script will launch parallel pipelines

# get path to dataset directory
if [ $# != 1 ]
then
    echo "Usage: sh cache.sh DATASET_PATH"
    exit 1
fi
dataset_path=$1

# generate a session id that these parallel pipelines can share
result=$(cache_admin -g 2>&1)
rc=$?
if [ $rc -ne 0 ]; then
    # print the captured output so the failure can be diagnosed
    echo "Failed to generate a cache session id: $result"
    exit 1
fi

# grab the session id from the result string (last whitespace-separated field)
session_id=$(echo $result | awk '{print $NF}')
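The extraction step above simply takes the last whitespace-separated field of whatever cache_admin -g printed. As a rough illustration, the same logic might look like this in Python. The helper name parse_session_id is hypothetical, and the sample line is only an assumption about the command's output format, not something taken from the script above.

```python
def parse_session_id(output: str) -> int:
    """Return the last whitespace-separated token of `output` as an integer.

    Mirrors `echo $result | awk '{print $NF}'`: newlines are treated like
    any other whitespace, so the final token of the whole text is used.
    """
    tokens = output.split()
    if not tokens:
        raise ValueError("empty cache_admin output")
    return int(tokens[-1])


if __name__ == "__main__":
    # Assumed example of what `cache_admin -g` might print.
    sample = "Session created for server on port 50052: 1456416665"
    print(parse_session_id(sample))
```

Converting to int has a side benefit: if the command printed an error message instead of an id, the conversion fails loudly rather than passing a bogus string on to the training script.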
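- Pass the session id to the training script
Continue the shell script by adding the following commands, which pass session_id and the other arguments to the Python training script at launch: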
# make the session_id available to the python scripts
num_devices=4

for p in $(seq 0 $((${num_devices}-1))); do
    python my_training_script.py --num_devices "$num_devices" --device "$p" --session_id "$session_id" --dataset_path "$dataset_path"
done
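To make the per-device launch explicit, here is a minimal Python sketch that builds the same argument lists the loop above produces, one per device. The function name build_commands is hypothetical; the flags and script name are the ones used in cache.sh. The sketch only prints the commands and leaves the actual launch (for example with subprocess.run) to the reader.

```python
import shlex
from typing import List


def build_commands(num_devices: int, session_id: int, dataset_path: str,
                   script: str = "my_training_script.py") -> List[List[str]]:
    """Build one argv list per device, all sharing the same session id."""
    commands = []
    for device in range(num_devices):
        commands.append([
            "python", script,
            "--num_devices", str(num_devices),
            "--device", str(device),
            "--session_id", str(session_id),
            "--dataset_path", dataset_path,
        ])
    return commands


if __name__ == "__main__":
    # The session id and dataset path here are placeholders for illustration.
    for argv in build_commands(4, 3392558708, "cifar-10-batches-bin/"):
        print(shlex.join(argv))
```

The only value that varies between the commands is --device; every process receives the same --session_id, which is what allows them to attach to one shared cache.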
- Create and apply a cache instance
The example below uses the CIFAR-10 dataset, laid out as follows:
├─cache.sh
├─my_training_script.py
└─cifar-10-batches-bin
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin
Create the Python script my_training_script.py. The code below receives the session_id passed in and supplies it as an argument when defining the cache instance.
import argparse

import mindspore.dataset as ds

# command-line arguments supplied by cache.sh
parser = argparse.ArgumentParser(description='Cache Example')
parser.add_argument('--num_devices', type=int, default=1, help='Device num.')
parser.add_argument('--device', type=int, default=0, help='Device id.')
parser.add_argument('--session_id', type=int, default=1, help='Session id.')
parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
args_opt = parser.parse_args()

# apply cache to dataset
test_cache = ds.DatasetCache(session_id=args_opt.session_id, size=0, spilling=False)
dataset = ds.Cifar10Dataset(dataset_dir=args_opt.dataset_path, num_samples=4, shuffle=False, num_parallel_workers=1,
                            num_shards=args_opt.num_devices, shard_id=args_opt.device, cache=test_cache)

# count the samples this shard receives
num_iter = 0
for _ in dataset.create_dict_iterator():
    num_iter += 1
print("Got {} samples on device {}".format(num_iter, args_opt.device))
- Run the training script
Run the shell script cache.sh to start distributed training:
$ sh cache.sh cifar-10-batches-bin/
Got 4 samples on device 0
Got 4 samples on device 1
Got 4 samples on device 2
Got 4 samples on device 3
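Running cache_admin --list_sessions shows that the current session contains only one set of cached data, which indicates that the cache was shared successfully.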
$ cache_admin --list_sessions
Listing sessions for server on port 50052
     Session    Cache Id  Mem cached  Disk cached  Avg cache size  Numa hit
  3392558708   821590605          16          n/a            3227        16
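The check described above (one session, one cache entry) can also be done programmatically. The following is a minimal sketch that parses the listing shown here and counts the distinct cache ids per session. The helper names and field names are hypothetical, and the parser assumes the six-column layout of this sample output; other cache_admin versions may print something different.

```python
from collections import defaultdict

# Column names assumed from the header of the sample listing.
FIELDS = ("session", "cache_id", "mem_cached", "disk_cached",
          "avg_cache_size", "numa_hit")


def parse_sessions(output: str) -> list:
    """Return one dict per data row, skipping the banner and header lines."""
    rows = []
    for line in output.splitlines():
        tokens = line.split()
        # A data row has exactly six fields and starts with two numeric ids.
        if len(tokens) == len(FIELDS) and tokens[0].isdigit() and tokens[1].isdigit():
            rows.append(dict(zip(FIELDS, tokens)))
    return rows


def caches_per_session(rows: list) -> dict:
    """Map each session id to the number of distinct cache ids it holds."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["session"]].add(row["cache_id"])
    return {session: len(ids) for session, ids in grouped.items()}


if __name__ == "__main__":
    sample = """Listing sessions for server on port 50052
Session Cache Id Mem cached Disk cached Avg cache size Numa hit
3392558708 821590605 16 n/a 3227 16
"""
    print(caches_per_session(parse_sessions(sample)))
```

A count of 1 for the session corresponds to the single row in the listing: all four pipelines attached to the same cache rather than each creating its own.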
- Destroy the cache session
Once training has finished, you can destroy the current cache session to release its memory:
$ cache_admin --destroy_session 3392558708
Drop session successfully for server on port 50052
- Shut down the cache server
When you are done, you can shut down the cache server:
$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.