Qizhi (启智)
Downloading files in a debug task
wget -O 102flowers.tgz 'https://open-data.obs.cn-south-222.ai.pcl.cn:443/attachment/8/2/821d7c66-c47b-4a99-8ede-c840bbef208c/102flowers.tgz?response-content-disposition=attachment%3B+filename%3D%22102flowers.tgz%22&AWSAccessKeyId=ZSCXA9TLRN1USYWIF7A5&Expires=1682262571&Signature=k%2B6e7uFeU2aw%2FWd4ado7NHPa7EU%3D'
tar zxvf 102flowers.tgz
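If wget is unavailable in the debug task, a rough Python equivalent of the two commands above (the signed URL expires, so url here is a hypothetical placeholder for your own link; the unverified SSL context is the same idea as the ssl tip below):

import shutil
import ssl
import urllib.request

url = 'https://.../102flowers.tgz'  # hypothetical placeholder for the signed link above
ctx = ssl._create_unverified_context()  # skip certificate verification
with urllib.request.urlopen(url, context=ctx) as resp, open('102flowers.tgz', 'wb') as f:
    shutil.copyfileobj(resp, f)  # stream to disk instead of loading into memory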
@学致 The courseware code is based on version 1.8; on Qizhi, choose the mindspore1.8.1 image and try adding these two lines:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
@学致 In train_dataset, valid_dataset, test_dataset = Multi30K(train_path), Multi30K(valid_path), Multi30K(test_path), add encoding='utf-8' here.
@学致 That is where the problem is: in the train and evaluate functions you will see a break; delete both and it works.
Q: One more question. def calculate_bleu(dataset, max_len=50) uses max_len=50, but we trained with max_len=32. Will that affect inference accuracy? Also, how was 32 chosen? Does this parameter affect model accuracy?
A: If a dataset has only a few especially long samples, dropping them beats padding everything for their sake.
Q: Saving with mindspore.dataset.Dataset.save into multiple files is very slow; is there a better way?
A: You can also try writing it manually like this @小马Running GO
# needs: from mindspore.mindrecord import FileWriter
ds_iter = data_set.create_dict_iterator(output_numpy=True, num_epochs=1)
fw = FileWriter(…, shard_num=8)
# add schema here, e.g. fw.add_schema(...) with fields matching your columns
for item in ds_iter:
    fw.write_raw_data([item], parallel_writer=True)
fw.commit()
- Use the latest master build of MindSpore.
- FileWriter(…, shard_num=8) shards the output into multiple files.
- fw.write_raw_data([item], parallel_writer=True): setting the second argument to True enables parallel writing.
Files saved in the work directory are not lost; anything stored elsewhere is wiped after a restart. The quota is 100 GB; beyond that nothing more fits. You can save, restart, then load the previous training results and continue, as long as they live under work; they are kept for 30 days.
That speed is wrong; AMP is definitely not enabled.
Q: A question about the code in the official MindSpore model zoo (https://gitee.com/mindspore/models/tree/master/official/nlp/Pangu_alpha):
The input format here is: 问{question}\n答{choice}\n该答案来自对话{context}
The prompt format here is: 问{question}\n答{choice}\n
Assume 问, 答, and the context are one token each, five tokens in total, with padding after.
Per the code (not_equal), assume input_mask is [1, 1, 1, 0].
Per the code (equal), assume input_mask_a is [0, 0, 1, 1].
Then input_mask_b should be [0, 0, 1, 0], which amounts to attending only to the context while ignoring the question and answer parts; intuitively the answer part should also be considered. Is this understanding correct?
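A quick numeric check of the masks in this question (plain NumPy; assuming input_mask_b is the elementwise product of the two masks, which reproduces the [0, 0, 1, 0] above):

import numpy as np

input_mask   = np.array([1, 1, 1, 0])   # not_equal(ids, pad): 1 on real tokens
input_mask_a = np.array([0, 0, 1, 1])   # equal(...): 1 from the context onwards
input_mask_b = input_mask * input_mask_a  # elementwise AND
print(input_mask_b)  # [0 0 1 0]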
input_mask_b is applied in compute_loss.
@CQU弟中弟 How do I enable AMP?
A: https://www.mindspore.cn/tutorials/zh-CN/r2.0.0-alpha/advanced/mixed_precision.html
model = auto_mixed_precision(model, 'O2')
@学致 In 1.8 it has no return value, meaning you should not write model = auto_mixed_precision(model); call auto_mixed_precision(model) directly. On 2.0, it is most likely slow because graph mode is not enabled.
Yes, I used the dynamic graph for easier debugging; 2.0 defaults to dynamic graph.
On Ascend, dynamic graph is still slow because the Qizhi environments have not upgraded the CANN package yet, so there are no binary operators.
Still, it does not reach the roughly two minutes per epoch of the code 瑟琳娜 provided, and that code is even on 1.8.
@CQU弟中弟 I copied a few lines of the ddpm code; on MS 2.0 with Ascend it is down to 30 seconds.
I see, that one is 2.0; 1.8 runs somewhat slower. The eval during training is inference, so judge by that speed; the first run is slow and may stutter, then it speeds up.
The manual white-list approach shared in the chat (for 1.8-style setups):
# needs: import mindspore.nn as nn; from mindspore import ops; from mindspore.common import dtype as mstype
class _OutputTo32(nn.Cell):
    "Wrap cell for amp. Cast network output back to float32"
    def __init__(self, op):
        super().__init__(auto_prefix=False)
        self._op = op

    def construct(self, *x):
        return ops.cast(self._op(*x), mstype.float32)
white_list = (nn.Dense, nn.MatMul)
def auto_white_list(network):
    """auto cast based on white list"""
    cells = network.name_cells()
    change = False
    for name in cells:
        subcell = cells[name]
        if subcell == network:
            continue
        if isinstance(subcell, white_list):
            network._cells[name] = _OutputTo32(subcell.to_float(mstype.float16))
            change = True
        else:
            auto_white_list(subcell)
    if isinstance(network, nn.SequentialCell) and change:
        network.cell_list = list(network.cells())
    return network
if device_target == 'Ascend':
    model = auto_white_list(model)
You have to explore this yourself; there is no dedicated tutorial. It relates to mixed precision, so read up on that topic. On 2.0, just follow the official tutorial.
Q: The core of this code is to cast the model computation to fp16 and cast the output back to fp32; is what each line means explained in the 2.0 mixed-precision tutorial?
A: @CQU弟中弟 On 2.0 you simply do not need to care about this. auto_mixed_precision means automatic mixed precision; 2.0 has O1 now, so just use O1: https://www.mindspore.cn/tutorials/zh-CN/r2.0.0-alpha/advanced/mixed_precision.html
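A minimal sketch of the 2.0 usage just described, assuming the mindspore.amp module from the linked tutorial (O1 casts white-listed cells such as Dense to float16 automatically):

import mindspore.nn as nn
from mindspore import amp

net = nn.Dense(16, 16)  # stand-in for your model; any nn.Cell works
net = amp.auto_mixed_precision(net, 'O1')  # no manual white-list wrapping needed on 2.0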
Q: @CQU弟中弟 In this code:
if token_type_ids is None:
    token_type_ids = ops.zeros_like(input_ids)
how should token_type_ids be understood when it is not None? Is it for words made up of many English letters, and what about Chinese?
A: It distinguishes the two sentences; it contains only 0s and 1s.
Q: So what is token_type_ids for?
A: Look at BERT's tasks... NSP.
Q: Tokens of the leading sentence are 0 and those of the following one are 1, is that the right understanding?
Q: Does bert_introduction.ipynb come with a corpus? I do not see one in the code and want to run it.
A: To run it, go to the pretrain directory; the notebook only excerpts the code and is really a slide deck. The example is here, not yet cleaned up: https://github.com/mindspore-lab/mindnlp/tree/master/examples/LLM/Bert-PET
Q: What does the jit=True parameter do?
A: Just-in-time compilation; see mindnlp's run_pretrain.py.
Q: What environment does this code need to run, 8x 910?
A: At minimum 2 cards, or 2 GPU cards.
Q: What does the line outputs = (prediction_scores, seq_relationship_score,) + outputs[2:] roughly do?
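That last line is plain Python tuple concatenation: it puts the two pretraining heads' scores (MLM and NSP) in front of whatever extras the encoder already returned in outputs[2:], e.g. hidden states and attentions. A toy sketch with strings standing in for tensors:

outputs = ('sequence_output', 'pooled_output', 'hidden_states', 'attentions')
prediction_scores, seq_relationship_score = 'mlm_scores', 'nsp_scores'
outputs = (prediction_scores, seq_relationship_score,) + outputs[2:]
print(outputs)  # ('mlm_scores', 'nsp_scores', 'hidden_states', 'attentions')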
In the Qizhi debug environment you have to download in chunks yourself; the OBS approach does not work there. A training job first saves to the /cache directory and then uses moxing to copy to the OBS directory given by the train_url parameter, after which the results appear in the training-results list on their own. The Zhisuan debug environment works the same way; a training job's outputs are saved under /cache/output, no moxing copy code is needed, and after training finishes the contents of the output directory are synced to the training-results list automatically. This sync is slow over the network, so large model files take a long time. In the Qizhi debug environment, as far as I just saw, you can get results directly in the work directory.
Do not store training-job outputs in the work directory: its capacity is limited and training jobs usually write many files that will not fit, and once the task ends the container may be cleaned up and recycled, so anything under work cannot be retrieved.
Splitting and downloading a model
split -d -b 99m model-7500.ckpt M75_
cat M75_* > model-7500.ckpt
rm M75_*
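If split/cat are unavailable (e.g. on Windows), a rough Python equivalent of the three commands above:

import glob

CHUNK = 99 * 1024 * 1024  # 99 MB, matching split -b 99m

# split model-7500.ckpt into M75_00, M75_01, ...
with open('model-7500.ckpt', 'rb') as src:
    idx = 0
    while True:
        part = src.read(CHUNK)
        if not part:
            break
        with open('M75_{:02d}'.format(idx), 'wb') as dst:
            dst.write(part)
        idx += 1

# cat M75_* > model-7500.ckpt (run on the machine that received the parts)
with open('model-7500.ckpt', 'wb') as dst:
    for name in sorted(glob.glob('M75_*')):
        with open(name, 'rb') as src:
            dst.write(src.read())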
Besides, it should not be this slow; let me check whether AMP is off. When your session time is up, you can click fork, debug again, and then open the Cloudbrain (云脑) tab in your forked repo.
Q: After I forked it and ran online, files and training results still cannot be saved, right?
A: Only the Cloudbrain quota here is fixed.
Debug environments all stop automatically once the full 4 hours are up.
https://github.com/mindspore-lab/mindnlp/tree/master/examples/LLM/Bert-PET
####### Downloading multiple files
Call os.path.join(data_dir + '/102flowers')  #### 102flowers/jpg/*image.jpg
[Modelarts Service Log]2023-04-25 13:31:54,243 - INFO - bootstrap proc-rank-0-device-0
Successfully Download s3://open-data/attachment/6/c/6c9fa2e0-ea38-4dc2-8500-bb94dfb5bbfc6c9fa2e0-ea38-4dc2-8500-bb94dfb5bbfc/ to /cache/data/data
Successfully Download s3://open-data/attachment/2/d/2d827568-420b-4964-9d3d-39557b6de0a52d827568-420b-4964-9d3d-39557b6de0a5/ to /cache/data/flower
download_input succeed
Single dataset: ds_train = create_dataset(os.path.join(data_dir, "train"), cfg.batch_size)
Multi means two or more archives; for example MNISTData.zip is one of them, and here its name MNISTData is inserted as an extra folder level. Single-dataset usage has no such level:
ds_train = create_dataset(os.path.join(data_dir + "/MNISTData", "train"), cfg.batch_size)
Differences between single-dataset and multi-dataset jobs on the NPU Qizhi cluster:
Hyperparameters differ:
For a single dataset, the path is passed via --data_url.
For multiple datasets, the paths are passed via --multi_data_url, and --data_url must still be kept (just defining it is enough).
### --data_url, --multi_data_url, --train_url, --device_target: these 4 parameters must be defined first in a multi-dataset task,
### otherwise an error will be reported.
### There is no need to add these parameters to the running parameters of the Qizhi platform,
### because they are predefined in the background; you only need to define them in your code.
parser.add_argument('--data_url',
                    help='path to training/inference dataset folder',
                    default='/cache/data1/')
parser.add_argument('--multi_data_url',
                    help='path to multi dataset',
                    default='/cache/data/')
parser.add_argument('--train_url',
                    help='model folder to save/load',
                    default='/cache/output/')
Dataset usage differs:
With the single dataset MNISTData.zip in this example, the data sits under /cache/data.
With multiple datasets, MNISTData.zip is used as: the data sits under /cache/data/MNISTData/.
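To make the difference concrete, a small sketch (data_dir is /cache/data as in the code below):

import os

data_dir = '/cache/data'
# single dataset: archive contents land directly under data_dir
single_train = os.path.join(data_dir, 'train')               # /cache/data/train
# multi dataset: one extra folder level named after the zip
multi_train = os.path.join(data_dir, 'MNISTData', 'train')   # /cache/data/MNISTData/train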
"""
######################## multi-dataset train lenet example ########################
This example is a multi-dataset training tutorial. If it is a single dataset, please refer to the single dataset
training tutorial train.py. This example cannot be used for a single dataset!
"""
"""
######################## Instructions for using the training environment ########################
1、(1)The structure of the dataset uploaded for multi-dataset training in this example
MNISTData.zip
├── test
└── train
checkpoint_lenet-1_1875.zip
├── checkpoint_lenet-1_1875.ckpt
(2)The dataset structure in the training image for multiple datasets in this example
workroot
├── MNISTData
| ├── test
| └── train
└── checkpoint_lenet-1_1875
├── checkpoint_lenet-1_1875.ckpt
2、Multi-dataset training requires predefined functions
(1)Copy multi-dataset from obs to training image
function MultiObsToEnv(multi_data_url, data_dir)
(2)Copy the output to obs
function EnvToObs(train_dir, obs_train_url)
(3)Download the input from Qizhi And Init
function DownloadFromQizhi(multi_data_url, data_dir)
(4)Upload the output to Qizhi
function UploadToQizhi(train_dir, obs_train_url)
3、4 parameters need to be defined
--data_url is the first dataset you selected on the Qizhi platform
--multi_data_url is the multi-dataset you selected on the Qizhi platform
--data_url, --multi_data_url, --train_url, --device_target: these 4 parameters must be defined first in a multi-dataset task,
otherwise an error will be reported.
There is no need to add these parameters to the running parameters of the Qizhi platform,
because they are predefined in the background; you only need to define them in your code
4、How the dataset is used
Multi-datasets use multi_data_url as input, data_dir + dataset name + file or folder name in the dataset as the
calling path of the dataset in the training image.
For example, the calling path of the train folder in the MNIST_Data dataset in this example is
data_dir + “/MNIST_Data” +“/train”
For details, please refer to the following sample code.
"""
import os
import argparse
import moxing as mox
#from config import mnist_cfg as cfg
from dataset import create_dataset
from dataset_distributed import create_dataset_parallel
from lenet import LeNet5
import json
import mindspore.nn as nn
from mindspore import context
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
from mindspore.train import Model
from mindspore.nn.metrics import Accuracy
from mindspore import load_checkpoint, load_param_into_net
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank
import time
### Copy multiple datasets from obs to training image
def MultiObsToEnv(multi_data_url, data_dir):
    # --multi_data_url is json data, need to do json parsing for multi_data_url
    multi_data_json = json.loads(multi_data_url)
    for i in range(len(multi_data_json)):
        path = data_dir + "/" + multi_data_json[i]["dataset_name"]
        if not os.path.exists(path):
            os.makedirs(path)
        try:
            mox.file.copy_parallel(multi_data_json[i]["dataset_url"], path)
            print("Successfully Download {} to {}".format(multi_data_json[i]["dataset_url"], path))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                multi_data_json[i]["dataset_url"], path) + str(e))
    # Set a cache file to determine whether the data has been copied to obs.
    # If this file exists during multi-card training, there is no need to copy the dataset multiple times.
    f = open("/cache/download_input.txt", 'w')
    f.close()
    try:
        if os.path.exists("/cache/download_input.txt"):
            print("download_input succeed")
    except Exception as e:
        print("download_input failed")
    return
### Copy the output model to obs
def EnvToObs(train_dir, obs_train_url):
    try:
        mox.file.copy_parallel(train_dir, obs_train_url)
        print("Successfully Upload {} to {}".format(train_dir, obs_train_url))
    except Exception as e:
        print('moxing upload {} to {} failed: '.format(train_dir, obs_train_url) + str(e))
    return
def DownloadFromQizhi(multi_data_url, data_dir):
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        MultiObsToEnv(multi_data_url, data_dir)
        context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
    if device_num > 1:
        # set device_id and init for multi-card training
        context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target, device_id=int(os.getenv('ASCEND_DEVICE_ID')))
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True, parameter_broadcast=True)
        init()
        # Copying obs data does not need to be executed multiple times, just let the 0th card copy the data
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            MultiObsToEnv(multi_data_url, data_dir)
        # If the cache file does not exist, the data copy has not finished,
        # so wait for the 0th card to finish copying
        while not os.path.exists("/cache/download_input.txt"):
            time.sleep(1)
    return
def UploadToQizhi(train_dir, obs_train_url):
    device_num = int(os.getenv('RANK_SIZE'))
    local_rank = int(os.getenv('RANK_ID'))
    if device_num == 1:
        EnvToObs(train_dir, obs_train_url)
    if device_num > 1:
        if local_rank % 8 == 0:
            EnvToObs(train_dir, obs_train_url)
    return
parser = argparse.ArgumentParser(description='MindSpore Lenet Example')
### --data_url, --multi_data_url, --train_url, --device_target: these 4 parameters must be defined first in a multi-dataset task,
### otherwise an error will be reported.
### There is no need to add these parameters to the running parameters of the Qizhi platform,
### because they are predefined in the background; you only need to define them in your code.
parser.add_argument('--data_url',
                    help='path to training/inference dataset folder',
                    default='/cache/data1/')
parser.add_argument('--multi_data_url',
                    help='path to multi dataset',
                    default='/cache/data/')
parser.add_argument('--train_url',
                    help='model folder to save/load',
                    default='/cache/output/')
parser.add_argument(
    '--device_target',
    type=str,
    default="Ascend",
    choices=['Ascend', 'CPU'],
    help='device where the code will be implemented (default: Ascend); to use the CPU on the Qizhi platform: device_target=CPU')
parser.add_argument('--epoch_size',
                    type=int,
                    default=5,
                    help='Training epochs.')
if __name__ == "__main__":
    args, unknown = parser.parse_known_args()
    data_dir = '/cache/data'
    train_dir = '/cache/output'
    try:
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)
        if not os.path.exists(train_dir):
            os.makedirs(train_dir)
    except Exception as e:
        print("path already exists")
    ### Initialize and copy data to training image
    DownloadFromQizhi(args.multi_data_url, data_dir)
    ### The dataset path is used here: data_dir + "/flower" + "/train"
############################### The download paths for the Qizhi and Zhisuan (C2Net) clusters must not be mixed ###################################
inflating: /cache/data/102flowers/102flowers/jpg/image_08169.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08170.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08171.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08172.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08173.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08174.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08175.jpg
##102flowers/jpg/*image.jpg <-----> 102flowers.zip
######## Multi-dataset on the Zhisuan cluster: note the extra /102flowers level: os.path.join(data_dir + '/102flowers', '102flowers')
#### Single dataset on Qizhi: no extra /102flowers level: os.path.join(data_dir, "102flowers")
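Side by side, the two layouts from the notes above (the doubled 102flowers comes from the zip name plus the folder inside the archive, per the unzip log):

import os

data_dir = '/cache/data'
# Zhisuan multi-dataset: /cache/data/102flowers/102flowers/jpg
zhisuan_jpgs = os.path.join(data_dir + '/102flowers', '102flowers', 'jpg')
# Qizhi single dataset: /cache/data/102flowers/jpg
qizhi_jpgs = os.path.join(data_dir, '102flowers', 'jpg')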
import ddpm as ddpm
import argparse
import os
from mindspore import context
from mindspore.communication.management import init
from mindspore.context import ParallelMode
import time
from upload import UploadOutput
import moxing as mox
import json
def parse_args():
    parser = argparse.ArgumentParser(description="train ddpm",
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--data_url',
help='path to training/inference dataset folder',
default= '/cache/data1/')
parser.add_argument('--multi_data_url',
help='path to multi dataset',
default= '/cache/data/')
parser.add_argument('--train_url',
default='./results',
type=str,
help='the path model and fig save path')
parser.add_argument('--ckpt_url',
type=str,
default=None,
help='the pretrain model path')
parser.add_argument('--steps',
default=10000,
type=int,
help='training steps')
parser.add_argument('--save_every',
default=1000,
type=int,
help='save every')
parser.add_argument('--num_samples',
default=4,
type=int,
help='num_samples must have a square root, like 4, 9, 16 ...')
parser.add_argument('--device_target',
default="Ascend",
type=str,
help='device target')
parser.add_argument('--image_size',
default=200,
type=int,
help='image size')
args, _ = parser.parse_known_args()
return args
#### This one downloads the model on the Zhisuan cluster
def C2netModelToEnv(model_url, model_dir):
    # --ckpt_url is json data, need to do json parsing for ckpt_url_json
    model_url_json = json.loads(model_url)
    print("model_url_json:", model_url_json)
    for i in range(len(model_url_json)):
        modelfile_path = model_dir + "/" + "checkpoint.ckpt"
        try:
            mox.file.copy(model_url_json[i]["model_url"], modelfile_path)
            print("Successfully Download {} to {}".format(model_url_json[i]["model_url"], modelfile_path))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                model_url_json[i]["model_url"], modelfile_path) + str(e))
    return
######### This is Qizhi-cluster code; data download code must not be mixed across clusters!!
### Copy multiple datasets from obs to training image
def MultiObsToEnv(multi_data_url, data_dir):
    # --multi_data_url is json data, need to do json parsing for multi_data_url
    multi_data_json = json.loads(multi_data_url)
    for i in range(len(multi_data_json)):
        path = data_dir + "/" + multi_data_json[i]["dataset_name"]
        if not os.path.exists(path):
            os.makedirs(path)
        try:
            mox.file.copy_parallel(multi_data_json[i]["dataset_url"], path)
            print("Successfully Download {} to {}".format(multi_data_json[i]["dataset_url"], path))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                multi_data_json[i]["dataset_url"], path) + str(e))
    # Set a cache file to determine whether the data has been copied to obs.
    # If this file exists during multi-card training, there is no need to copy the dataset multiple times.
    f = open("/cache/download_input.txt", 'w')
    f.close()
    try:
        if os.path.exists("/cache/download_input.txt"):
            print("download_input succeed")
    except Exception as e:
        print("download_input failed")
    return
Code explanation:
train.py: training launcher for the Qizhi cluster
train-new.py: training launcher for the Zhisuan cluster
pretrain.py: launcher for the Zhisuan cluster that loads the previous round's model
######### This is Zhisuan-cluster code
def C2netMultiObsToEnv(multi_data_url, data_dir):
    multi_data_json = json.loads(multi_data_url)
    for i in range(len(multi_data_json)):
        zipfile_path = data_dir + "/" + multi_data_json[i]["dataset_name"]
        try:
            mox.file.copy(multi_data_json[i]["dataset_url"], zipfile_path)
            print("Successfully Download {} to {}".format(multi_data_json[i]["dataset_url"], zipfile_path))
            filename = os.path.splitext(multi_data_json[i]["dataset_name"])[0]
            filePath = data_dir + "/" + filename
            if not os.path.exists(filePath):
                os.makedirs(filePath)
            os.system("unzip {} -d {}".format(zipfile_path, filePath))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                multi_data_json[i]["dataset_url"], zipfile_path) + str(e))
    f = open("/cache/download_input.txt", 'w')
    f.close()
    try:
        if os.path.exists("/cache/download_input.txt"):
            print("download_input succeed")
    except Exception as e:
        print("download_input failed")
    return
def EnvToObs(train_dir, obs_train_url):
    try:
        mox.file.copy_parallel(train_dir, obs_train_url)
        print("Successfully Upload {} to {}".format(train_dir, obs_train_url))
    except Exception as e:
        print('moxing upload {} to {} failed: '.format(train_dir, obs_train_url) + str(e))
    return
def DownloadFromQizhi(multi_data_url, data_dir):
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        # C2netMultiObsToEnv(multi_data_url, data_dir)
        MultiObsToEnv(multi_data_url, data_dir)
        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target)
    if device_num > 1:
        # set device_id and init for multi-card training
        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=int(os.getenv('ASCEND_DEVICE_ID')))
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True, parameter_broadcast=True)
        init()
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            # C2netMultiObsToEnv(multi_data_url, data_dir)
            MultiObsToEnv(multi_data_url, data_dir)
        while not os.path.exists("/cache/download_input.txt"):
            time.sleep(1)
    return
def DownloadModelFromQizhi(model_url, model_dir):
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        C2netModelToEnv(model_url, model_dir)
    if device_num > 1:
        # Copying obs data does not need to be executed multiple times, just let the 0th card copy the data
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            C2netModelToEnv(model_url, model_dir)
    return
def UploadToQizhi(train_dir, obs_train_url):
    device_num = int(os.getenv('RANK_SIZE'))
    local_rank = int(os.getenv('RANK_ID'))
    if device_num == 1:
        EnvToObs(train_dir, obs_train_url)
    if device_num > 1:
        if local_rank % 8 == 0:
            EnvToObs(train_dir, obs_train_url)
    return
def train_ddpm():
    steps = args_opt.steps
    image_size = args_opt.image_size
    data_dir = '/cache/data'
    train_dir = '/cache/output'
    model_dir = '/cache/pretrain'
    try:
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)
        if not os.path.exists(train_dir):
            os.makedirs(train_dir)
    except Exception as e:
        print("path already exists")
    # DownloadModelFromQizhi(args_opt.ckpt_url, model_dir)
    DownloadFromQizhi(args_opt.multi_data_url, data_dir)

if __name__ == '__main__':
    args_opt = parse_args()
    train_ddpm()
Both the single-dataset and the multi-dataset variants of this code can download the data; both create the extra 102flowers folder level:
inflating: /cache/data/102flowers/102flowers/jpg/image_08180.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08181.jpg
inflating: /cache/data/102flowers/102flowers/jpg/image_08182.jpg
import ddpm as ddpm
import argparse
import os
from mindspore import context
from mindspore.communication.management import init
from mindspore.context import ParallelMode
import time
from upload import UploadOutput
import moxing as mox
import json
def parse_args():
    parser = argparse.ArgumentParser(description="train ddpm",
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--ckpt_url',
type=str,
default=None,
help='the pretrain model path')
parser.add_argument('--data_url',
help='path to training/inference dataset folder',
default= '/cache/data1/')
parser.add_argument('--multi_data_url',
type=str,
default="C:\\Users\\Administrator\\PycharmProjects\\DDPM\\datasets\\test",
help='training data file path')
parser.add_argument('--train_url',
default='./results',
type=str,
help='the path model and fig save path')
parser.add_argument('--steps',
default=10000,
type=int,
help='training steps')
parser.add_argument('--save_every',
default=1000,
type=int,
help='save every')
parser.add_argument('--num_samples',
default=4,
type=int,
help='num_samples must have a square root, like 4, 9, 16 ...')
parser.add_argument('--device_target',
default="Ascend",
type=str,
help='device target')
parser.add_argument('--image_size',
default=200,
type=int,
help='image size')
args, _ = parser.parse_known_args()
return args
def C2netModelToEnv(model_url, model_dir):
    # --ckpt_url is json data, need to do json parsing for ckpt_url_json
    model_url_json = json.loads(model_url)
    print("model_url_json:", model_url_json)
    for i in range(len(model_url_json)):
        modelfile_path = model_dir + "/" + "checkpoint.ckpt"
        try:
            mox.file.copy(model_url_json[i]["model_url"], modelfile_path)
            print("Successfully Download {} to {}".format(model_url_json[i]["model_url"], modelfile_path))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                model_url_json[i]["model_url"], modelfile_path) + str(e))
    return
def C2netMultiObsToEnv(multi_data_url, data_dir):
    multi_data_json = json.loads(multi_data_url)
    for i in range(len(multi_data_json)):
        zipfile_path = data_dir + "/" + multi_data_json[i]["dataset_name"]
        try:
            mox.file.copy(multi_data_json[i]["dataset_url"], zipfile_path)
            print("Successfully Download {} to {}".format(multi_data_json[i]["dataset_url"], zipfile_path))
            filename = os.path.splitext(multi_data_json[i]["dataset_name"])[0]
            filePath = data_dir + "/" + filename
            if not os.path.exists(filePath):
                os.makedirs(filePath)
            os.system("unzip {} -d {}".format(zipfile_path, filePath))
        except Exception as e:
            print('moxing download {} to {} failed: '.format(
                multi_data_json[i]["dataset_url"], zipfile_path) + str(e))
    f = open("/cache/download_input.txt", 'w')
    f.close()
    try:
        if os.path.exists("/cache/download_input.txt"):
            print("download_input succeed")
    except Exception as e:
        print("download_input failed")
    return
def EnvToObs(train_dir, obs_train_url):
    try:
        mox.file.copy_parallel(train_dir, obs_train_url)
        print("Successfully Upload {} to {}".format(train_dir, obs_train_url))
    except Exception as e:
        print('moxing upload {} to {} failed: '.format(train_dir, obs_train_url) + str(e))
    return
def DownloadFromQizhi(multi_data_url, data_dir):
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        C2netMultiObsToEnv(multi_data_url, data_dir)
        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target)
    if device_num > 1:
        # set device_id and init for multi-card training
        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=int(os.getenv('ASCEND_DEVICE_ID')))
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True, parameter_broadcast=True)
        init()
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            C2netMultiObsToEnv(multi_data_url, data_dir)
        while not os.path.exists("/cache/download_input.txt"):
            time.sleep(1)
    return
def DownloadModelFromQizhi(model_url, model_dir):
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        C2netModelToEnv(model_url, model_dir)
    if device_num > 1:
        # Copying obs data does not need to be executed multiple times, just let the 0th card copy the data
        local_rank = int(os.getenv('RANK_ID'))
        if local_rank % 8 == 0:
            C2netModelToEnv(model_url, model_dir)
    return
def UploadToQizhi(train_dir, obs_train_url):
    device_num = int(os.getenv('RANK_SIZE'))
    local_rank = int(os.getenv('RANK_ID'))
    if device_num == 1:
        EnvToObs(train_dir, obs_train_url)
    if device_num > 1:
        if local_rank % 8 == 0:
            EnvToObs(train_dir, obs_train_url)
    return
def train_ddpm():
    steps = args_opt.steps
    image_size = args_opt.image_size
    data_dir = '/cache/data'
    train_dir = '/cache/output'
    model_dir = '/cache/pretrain'
    try:
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)
        if not os.path.exists(train_dir):
            os.makedirs(train_dir)
    except Exception as e:
        print("path already exists")
    # DownloadModelFromQizhi(args_opt.ckpt_url, model_dir)
    DownloadFromQizhi(args_opt.multi_data_url, data_dir)
print("List /cache/data: ", os.listdir(data_dir))
model = ddpm.Unet(
dim=image_size,
out_dim=3,
dim_mults=(1, 2, 4, 8)
)
diffusion = ddpm.GaussianDiffusion(
model,
image_size=image_size,
timesteps=20, # number of time steps
sampling_timesteps=10,
loss_type='l2' # L1 or L2
)
trainer = ddpm.Trainer(
diffusion,
os.path.join(data_dir+'/102flowers','102flowers'),
train_batch_size=1,
train_lr=8e-5,
train_num_steps=steps, # total training steps
gradient_accumulate_every=1, # gradient accumulation steps
ema_decay=0.995, # exponential moving average decay
save_and_sample_every=args_opt.save_every, # image sampling and step
num_samples=4,
results_folder=train_dir,
amp_level="O1",
distributed=False
)
# if args_opt.ckpt_url:
# trainer.load(os.path.join(model_dir, 'ddpm1.ckpt'))
trainer.train()
UploadToQizhi(train_dir, args_opt.train_url)
if __name__ == '__main__':
    args_opt = parse_args()
    train_ddpm()