train_transformer.py error: Assertion `srcIndex < srcSelectDimSize` failed.

Latest recommended article published on 2024-09-17 20:45:29

fange

Category: Problems/BUG. Tags: transformer, deep learning, artificial intelligence

Article link: https://blog.csdn.net/fange86126/article/details/126201133

Project used: https://github.com/SMART-TTS/SMART-Single_Emotional_TTS
Audio sample data: LJSpeech-1.1
The sample metadata looks like this:

LJ_NOR_10001.wav|the chronicles of newgate, volume two. by arthur griffiths. section eight: the beginnings of prison reform.
LJ_NOR_10002.wav|newgate prisoners were the victims to another most objectionable practice which obtained all over london.
LJ_NOR_10003.wav|persons committed to a metropolitan jail at that time were taken in gangs, men and women handcuffed together, or linked on to a long chain,
LJ_NOR_10004.wav|unless they could afford to pay for a vehicle out of their own funds.

Error: Assertion srcIndex < srcSelectDimSize failed.

(emo_tts3) D:\workspace_tts\emotion-fs-3>python train_transformer.py
Trainable Parameters: 15.927M
C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\modules\loss.py:94: UserWarning: Using a target size (torch.Size([8])) that is different to the input size (torch.Size([8, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.l1_loss(input, target, reduction=self.reduction)
| Epoch: 0, 0/330th loss : 0.9549 + 1.1547 + 0.0280 + 4.0505 = 1.2376
Validation| loss : 1.0039 + 1.2023 + 0.0212 + 3.8226 = 6.0500
| Epoch: 0, 1/330th loss : 0.9634 + 1.1604 + 0.0290 + 3.7260 = 1.1758
| Epoch: 0, 2/330th loss : 0.9589 + 1.1530 + 0.0286 + 3.8567 = 1.1994
| Epoch: 0, 3/330th loss : 0.9564 + 1.1508 + 0.0285 + 3.7279 = 1.1727
.
.
.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: block: [38,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: block: [38,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
.
.
.
Traceback (most recent call last):
  File "train_transformer.py", line 261, in <module>
    main()
  File "train_transformer.py", line 223, in main
    mel_pred, postnet_pred, attn_probs, decoder_outputs, attns_enc, attns_dec, attns_style, post_linear, duration_predictor_output, duration, weights = m.forward(character, mel_input, pos_text, pos_mel, mel, pos_mel, mel_max_length_array=mel_max_length_array)
  File "C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\workspace_tts\emotion-fs-3\network.py", line 288, in forward
    memory, c_mask, attns_enc, duration_mask = self.encoder(characters, pos=pos_text)
  File "C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\workspace_tts\emotion-fs-3\network.py", line 106, in forward
    x, attn = layer(x, x, mask=mask, query_mask=c_mask)
  File "C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\workspace_tts\emotion-fs-3\module.py", line 289, in forward
    result, attns = self.multihead(key, value, query, mask=mask, query_mask=query_mask, kv_mask=kv_mask)
  File "C:\Users\fangg\Anaconda3\envs\emo_tts3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\workspace_tts\emotion-fs-3\module.py", line 212, in forward
    attn = t.bmm(query, key.transpose(1, 2))    #batch matrix-matrix product
RuntimeError: CUDA error: device-side assert triggered
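
A note on reading this trace: the assertion comes from PyTorch's CUDA indexing kernel (Indexing.cu). It typically fires when an index passed to a lookup, such as an embedding table, is not smaller than the size of the indexed dimension. Because CUDA kernels run asynchronously, the Python frame reported last (here t.bmm) may not be the operation that actually failed. Below is a minimal sketch of a pre-flight check. The function name `find_out_of_range` and the sizes are hypothetical, not taken from the project. It works on plain Python integers, so a tensor would first need converting, for example with `.tolist()`.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(ids: Iterable[int], table_size: int) -> List[Tuple[int, int]]:
    """Return (position, index) pairs whose index lies outside [0, table_size)."""
    return [(pos, idx) for pos, idx in enumerate(ids) if not 0 <= idx < table_size]


# Hypothetical example: a lookup table with 5 rows and one bad index (7).
bad = find_out_of_range([0, 3, 7, 4], table_size=5)
print(bad)  # [(2, 7)]
```

If a check like this reports anything for the character ids fed to the encoder, the text preprocessing is probably producing symbols that the model's symbol table does not cover.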

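Problem description: running python train_transformer.py in this project sometimes raises the exception above, and sometimes the process just hangs and then exits without printing anything. Either way the underlying problem is the same: Assertion srcIndex < srcSelectDimSize failed. I first tried changing some of the main parameters in hyperparams.py, after reading many related articles online, but that had no effect. What finally gave me the idea was another author's article, "no cuda capable device", which attributed the error to wrong indices in the vocabulary. I do not know what his vocabulary looked like, but it occurred to me that my metadata_train.csv contains a lot of punctuation marks. I assumed these marks contribute little to training, so they seemed a likely cause. In the end I removed all the punctuation and started over: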
python prepare_data.py
python train_transformer.py
……
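To my surprise, it worked. I had spent a long time on this problem, and if this attempt had failed I was going to ask directly in the original project. Worth writing down.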
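
The punctuation removal described above can be sketched as follows. This is only an illustration under stated assumptions: each metadata line has the form `<wav name>|<transcript>` as in the samples shown earlier, only ASCII punctuation is removed, and the helper name `strip_punctuation` is hypothetical rather than part of the project.

```python
import string

# Assumption: every metadata line looks like "<wav name>|<transcript>".
# The table deletes all ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(line: str) -> str:
    """Remove ASCII punctuation from the transcript; keep the file name intact."""
    name, sep, text = line.rstrip("\n").partition("|")
    if not sep:  # no "|" separator found: return the line unchanged
        return line.rstrip("\n")
    # Delete punctuation, then collapse any leftover runs of whitespace.
    cleaned = " ".join(text.translate(_PUNCT_TABLE).split())
    return f"{name}|{cleaned}"


sample = "LJ_NOR_10004.wav|unless they could afford to pay for a vehicle out of their own funds."
print(strip_punctuation(sample))
```

Note that this also deletes apostrophes and hyphens inside words, which may or may not be desirable for a given dataset. After rewriting metadata_train.csv this way, prepare_data.py has to be run again before training.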