[Distributed deep learning error] During multi-machine, multi-GPU training, the data fetched by the master node at a certain step suddenly contains 0 records, causing a crash

  1/Unknown - 59s 59s/step - loss: 1055.8333 - loss_array: 1055.4790 - mse: 1055.9851Shape of x before patch_embed: [50 5761]
  2/Unknown - 60s 559ms/step - loss: 1163.2756 - loss_array: 1163.3583 - mse: 1163.2401Shape of x before patch_embed: [50 5761]
  3/Unknown - 60s 531ms/step - loss: 1209.6768 - loss_array: 1211.8483 - mse: 1208.7461Shape of x before patch_embed: [50 5761]
  4/Unknown - 61s 515ms/step - loss: 1237.2795 - loss_array: 1242.0624 - mse: 1235.2297Shape of x before patch_embed: [50 5761]
  5/Unknown - 61s 508ms/step - loss: 1234.6672 - loss_array: 1242.5141 - mse: 1231.3042Shape of x before patch_embed: [50 5761]
  6/Unknown - 62s 504ms/step - loss: 1228.4303 - loss_array: 1239.4085 - mse: 1223.7252Shape of x before patch_embed: [50 5761]
  7/Unknown - 62s 501ms/step - loss: 1256.0232 - loss_array: 1271.7646 - mse: 1249.2767Shape of x before patch_embed: [50 5761]
  8/Unknown - 63s 499ms/step - loss: 1280.5443 - loss_array: 1301.5618 - mse: 1271.5366Shape of x before patch_embed: [50 5761]
  9/Unknown - 63s 496ms/step - loss: 1244.5045 - loss_array: 1269.5423 - mse: 1233.7739Shape of x before patch_embed: [50 5761]
 10/Unknown - 64s 496ms/step - loss: 1269.7253 - loss_array: 1300.3098 - mse: 1256.6178Shape of x before patch_embed: [50 5761]
 11/Unknown - 64s 497ms/step - loss: 1267.3302 - loss_array: 1303.8362 - mse: 1251.6846Shape of x before patch_embed: [50 5761]
 12/Unknown - 65s 496ms/step - loss: 1261.7344 - loss_array: 1303.3875 - mse: 1243.8829Shape of x before patch_embed: [50 5761]
 13/Unknown - 65s 496ms/step - loss: 1252.1316 - loss_array: 1300.0152 - mse: 1231.6099Shape of x before patch_embed: [50 5761]
 14/Unknown - 66s 495ms/step - loss: 1235.3027 - loss_array: 1288.9775 - mse: 1212.2992Shape of x before patch_embed: [50 5761]
 15/Unknown - 66s 495ms/step - loss: 1217.7271 - loss_array: 1277.5720 - mse: 1192.0793Shape of x before patch_embed: [50 5761]
 16/Unknown - 67s 494ms/step - loss: 1204.9285 - loss_array: 1270.5418 - mse: 1176.8085Shape of x before patch_embed: [50 5761]
 17/Unknown - 67s 493ms/step - loss: 1195.4688 - loss_array: 1268.1421 - mse: 1164.3231Shape of x before patch_embed: [50 5761]
 18/Unknown - 68s 492ms/step - loss: 1191.5787 - loss_array: 1271.7016 - mse: 1157.2404Shape of x before patch_embed: [50 5761]
 19/Unknown - 68s 492ms/step - loss: 1182.6924 - loss_array: 1270.5971 - mse: 1145.0192Shape of x before patch_embed: [50 5761]
 20/Unknown - 69s 492ms/step - loss: 1169.4586 - loss_array: 1264.4754 - mse: 1128.7369Shape of x before patch_embed: [50 5761]
 21/Unknown - 69s 492ms/step - loss: 1160.1130 - loss_array: 1263.2279 - mse: 1115.9209Shape of x before patch_embed: [50 5761]
 22/Unknown - 70s 492ms/step - loss: 1147.8940 - loss_array: 1258.5165 - mse: 1100.4843Shape of x before patch_embed: [50 5761]
 23/Unknown - 70s 492ms/step - loss: 1135.8645 - loss_array: 1253.9013 - mse: 1085.2773Shape of x before patch_embed: [50 5761]
 24/Unknown - 71s 492ms/step - loss: 1125.5957 - loss_array: 1251.3021 - mse: 1071.7214Shape of x before patch_embed: [50 5761]
 25/Unknown - 71s 493ms/step - loss: 1124.3586 - loss_array: 1258.7793 - mse: 1066.7496Shape of x before patch_embed: [50 5761]
 26/Unknown - 72s 492ms/step - loss: 1115.1522 - loss_array: 1257.2809 - mse: 1054.2399Shape of x before patch_embed: [50 5761]
 27/Unknown - 72s 492ms/step - loss: 1099.5878 - loss_array: 1248.1189 - mse: 1035.9316Shape of x before patch_embed: [50 5761]
 28/Unknown - 73s 492ms/step - loss: 1106.5978 - loss_array: 1264.7415 - mse: 1038.8218Shape of x before patch_embed: [50 5761]
 29/Unknown - 73s 492ms/step - loss: 1099.6489 - loss_array: 1266.0244 - mse: 1028.3451Shape of x before patch_embed: [50 5761]
 30/Unknown - 74s 492ms/step - loss: 1086.5773 - loss_array: 1259.5500 - mse: 1012.4459Shape of x before patch_embed: [50 5761]
 31/Unknown - 74s 491ms/step - loss: 1074.7220 - loss_array: 1254.7452 - mse: 997.5690 Shape of x before patch_embed: [50 5761]
 32/Unknown - 75s 491ms/step - loss: 1074.7183 - loss_array: 1262.9311 - mse: 994.0555Shape of x before patch_embed: [50 5761]
 33/Unknown - 75s 491ms/step - loss: 1070.3453 - loss_array: 1266.4588 - mse: 986.2966Shape of x before patch_embed: [50 5761]
 34/Unknown - 75s 490ms/step - loss: 1059.4283 - loss_array: 1261.8170 - mse: 972.6904Shape of x before patch_embed: [50 5761]
 35/Unknown - 76s 490ms/step - loss: 1053.1051 - loss_array: 1263.5522 - mse: 962.9136Shape of x before patch_embed: [50 5761]
 36/Unknown - 76s 490ms/step - loss: 1055.0686 - loss_array: 1273.9264 - mse: 961.2723Shape of x before patch_embed: [50 5761]
 37/Unknown - 77s 490ms/step - loss: 1047.8059 - loss_array: 1273.6761 - mse: 951.0045Shape of x before patch_embed: [50 5761]
 38/Unknown - 77s 490ms/step - loss: 1038.2676 - loss_array: 1270.1917 - mse: 938.8716Shape of x before patch_embed: [50 5761]
 39/Unknown - 78s 490ms/step - loss: 1030.4537 - loss_array: 1269.4447 - mse: 928.0291Shape of x before patch_embed: [50 5761]
 40/Unknown - 78s 490ms/step - loss: 1022.5446 - loss_array: 1268.2746 - mse: 917.2316Shape of x before patch_embed: [50 5761]
 41/Unknown - 79s 490ms/step - loss: 1013.1972 - loss_array: 1264.7225 - mse: 905.4007Shape of x before patch_embed: [50 5761]
 42/Unknown - 79s 490ms/step - loss: 1005.6382 - loss_array: 1263.4517 - mse: 895.1467Shape of x before patch_embed: [50 5761]
 43/Unknown - 80s 490ms/step - loss: 1005.3717 - loss_array: 1270.0076 - mse: 891.9564Shape of x before patch_embed: [50 5761]
 44/Unknown - 80s 489ms/step - loss: 994.2159 - loss_array: 1262.7508 - mse: 879.1295 Shape of x before patch_embed: [50 5761]
 45/Unknown - 81s 490ms/step - loss: 987.3069 - loss_array: 1262.7047 - mse: 869.2793Shape of x before patch_embed: [50 5761]
 46/Unknown - 81s 490ms/step - loss: 978.3075 - loss_array: 1258.8486 - mse: 858.0755Shape of x before patch_embed: [50 5761]
 47/Unknown - 82s 489ms/step - loss: 970.7357 - loss_array: 1256.9682 - mse: 848.0646Shape of x before patch_embed: [50 5761]
 48/Unknown - 82s 489ms/step - loss: 965.9424 - loss_array: 1256.6690 - mse: 841.3453Shape of x before patch_embed: [50 5761]
 49/Unknown - 83s 489ms/step - loss: 963.5757 - loss_array: 1260.1774 - mse: 836.4607Shape of x before patch_embed: [50 5761]
 50/Unknown - 83s 489ms/step - loss: 956.5359 - loss_array: 1257.9643 - mse: 827.3522Shape of x before patch_embed: [50 5761]
 51/Unknown - 84s 489ms/step - loss: 950.8594 - loss_array: 1257.9376 - mse: 819.2544Shape of x before patch_embed: [50 5761]
 52/Unknown - 84s 490ms/step - loss: 943.8573 - loss_array: 1255.5366 - mse: 810.2803Shape of x before patch_embed: [50 5761]
 53/Unknown - 85s 489ms/step - loss: 937.1182 - loss_array: 1253.3471 - mse: 801.5916Shape of x before patch_embed: [50 5761]
 54/Unknown - 85s 489ms/step - loss: 930.8763 - loss_array: 1251.7773 - mse: 793.3473Shape of x before patch_embed: [50 5761]
 55/Unknown - 86s 489ms/step - loss: 927.3854 - loss_array: 1253.5832 - mse: 787.5864Shape of x before patch_embed: [50 5761]
 56/Unknown - 86s 489ms/step - loss: 921.8481 - loss_array: 1252.9428 - mse: 779.9505Shape of x before patch_embed: [50 0]

tf.debugging.assert_greater_equal(shape_x[1], 1, message="The second dimension of x is invalid.")
Node: 'masked_autoencoder_vi_t/assert_greater_equal/Assert/Assert'
Detected at node 'masked_autoencoder_vi_t/assert_greater_equal/Assert/Assert' defined at (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/dist-packages/keras/src/engine/training.py", line 1303, in run_step
    outputs = model.train_step(data)
  File "/JPCM_server/1.JPCM/3.public/xialijian/model_pipeline_hydraulic/mae_model_wrapper_v8_new.py", line 268, in train_step
    y_pred, y_pred_array, mask = self(data, training=True)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/engine/training.py", line 569, in call
    return super().call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/engine/base_layer.py", line 1150, in call
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
    return fn(*args, **kwargs)
  File "/JPCM_server/1.JPCM/3.public/xialijian/model_pipeline_hydraulic/mae_model_wrapper_v8_new.py", line 248, in call
    latent, mask, ids_restore = self.forward_encoder(imgs)
  File "/JPCM_server/1.JPCM/3.public/xialijian/model_pipeline_hydraulic/mae_model_wrapper_v8_new.py", line 147, in forward_encoder
    tf.debugging.assert_greater_equal(shape_x[1], 1, message="The second dimension of x is invalid.")
Node: 'masked_autoencoder_vi_t/assert_greater_equal/Assert/Assert'
2 root error(s) found.
  (0) INVALID_ARGUMENT: assertion failed: [The second dimension of x is invalid.] [Condition x >= y did not hold element-wise:] [x (masked_autoencoder_vi_t/strided_slice:0) = ] [0] [y (masked_autoencoder_vi_t/assert_greater_equal/y:0) = ] [1]
     [[{{node masked_autoencoder_vi_t/assert_greater_equal/Assert/Assert}}]]
     [[Reshape_382/_308]]
  (1) INVALID_ARGUMENT: assertion failed: [The second dimension of x is invalid.] [Condition x >= y did not hold element-wise:] [x (masked_autoencoder_vi_t/strided_slice:0) = ] [0] [y (masked_autoencoder_vi_t/assert_greater_equal/y:0) = ] [1]
     [[{{node masked_autoencoder_vi_t/assert_greater_equal/Assert/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_61690]

This is the main error. I have searched through many resources but still have not been able to solve it.
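For reference, here is a minimal sketch of one possible guard: filtering out batches whose second dimension has collapsed to 0 (as in the `[50 0]` shape at step 56) before they reach `patch_embed`. This assumes a `tf.data` pipeline that yields a single tensor per step; `make_dataset` and `strategy` below are placeholders, not the actual project code.

```python
import tensorflow as tf

def drop_degenerate_batches(dataset: tf.data.Dataset) -> tf.data.Dataset:
    # Keep only elements whose second dimension is at least 1, i.e. the
    # condition checked by assert_greater_equal in forward_encoder.
    return dataset.filter(lambda x: tf.shape(x)[1] >= 1)

# Hypothetical usage:
# dataset = make_dataset()                         # yields tensors shaped [50, seq_len]
# dataset = drop_degenerate_batches(dataset)
# dist_dataset = strategy.experimental_distribute_dataset(dataset)
```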

### PyTorch Distributed Training on Windows

Distributed training with PyTorch on Windows relies on the `torch.distributed` module and the communication backends it supports there (in practice Gloo, since NCCL is not available on Windows). The key points are as follows:

#### 1. **Environment preparation**

To run distributed training successfully, the following conditions must be met:

- Install a recent version of PyTorch and make sure it is Windows-compatible and built with distributed support[^3].
- Configure the network: the machines need a fast, stable connection to each other so that they can exchange data over TCP/IP.

#### 2. **Initializing the distributed environment**

Use `init_process_group()` to initialize the distributed environment. Different backends can be chosen (Gloo is the main option supported on Windows), and the `init_method` URL specifies how the nodes communicate[^1].

```python
import torch
import torch.distributed as dist

def init_distributed_mode(world_size, rank):
    # Gloo is the recommended backend on Windows
    backend = 'gloo'
    # The init method can be a shared file or a TCP address
    url = 'tcp://localhost:29500'  # replace with the master node's IP and an open port
    dist.init_process_group(
        backend=backend,
        init_method=url,
        world_size=world_size,
        rank=rank
    )
```

#### 3. **Wrapping the model in DDP**

After instantiating the model, wrap it in `DistributedDataParallel`. With this in place each process trains on its own portion of the data, and gradients are synchronized across devices automatically.

```python
model = YourModel()
device = torch.device(f'cuda:{rank}')
model.to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
```

#### 4. **Dataset partitioning and data loader configuration**

Use `DistributedSampler` to split the dataset evenly across the processes, so that no two processes read the same data slice[^4].

```python
from torch.utils.data import DataLoader, DistributedSampler

dataset = YourDataset()
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

#### 5. **Defining the training logic**

Write the core training loop; note that only the main process (rank 0) saves checkpoints, to avoid write conflicts.

```python
def train_one_epoch(ddp_model, dataloader, optimizer, epoch, rank):
    # criterion, log_interval and path_to_save_checkpoint are assumed to be defined elsewhere
    ddp_model.train()
    for i, data in enumerate(dataloader):
        inputs, labels = data[0].to(rank), data[1].to(rank)
        outputs = ddp_model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if rank == 0 and i % log_interval == 0:
            print(f"Epoch {epoch}, Batch {i}/{len(dataloader)}, Loss: {loss.item()}")

    if rank == 0:
        checkpoint = {
            'state_dict': ddp_model.module.state_dict(),  # save the unwrapped model's weights
            'optimizer': optimizer.state_dict()
        }
        torch.save(checkpoint, path_to_save_checkpoint)
```

#### 6. **Launch script**

Finally, use the `torchrun` tool to simplify launching the job across multiple GPUs and nodes; there is no need to call a multiprocessing `spawn` function by hand.

Command-line example:

```bash
torchrun --nnodes=<num_nodes> \
         --node_rank=<current_node_index> \
         --nproc_per_node=<gpu_count_on_this_machine> \
         your_training_script.py ...
```
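Putting these steps together, below is a self-contained sketch that `torchrun` can launch on a single Windows machine. The model, data, hyperparameters, and file name are purely illustrative (a toy `nn.Linear` on random tensors, nothing from the original post); it relies on the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets for every worker, so `init_process_group` can use its default env-based initialization.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # Gloo is the backend available on Windows; the default env:// init
    # method picks up the addresses and ranks provided by torchrun.
    dist.init_process_group(backend="gloo")

    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

    # Toy data and model, just to make the script runnable end to end.
    features = torch.randn(1024, 16)
    targets = torch.randn(1024, 1)
    dataset = TensorDataset(features, targets)
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(16, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        # Reshuffle differently on every epoch across all workers.
        sampler.set_epoch(epoch)
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            loss = criterion(ddp_model(inputs), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On one machine with two workers this could be launched as, for example, `torchrun --nproc_per_node=2 toy_ddp.py` (the file name is hypothetical).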