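The output of the last layer of an RNN-T model is a 4-D tensor of shape (N, T, U, C), where
- N: batch size. Typical value: a few dozen.
- T: number of acoustic feature frames output by the encoder. Typical value: several hundred.
- U: number of text tokens output by the decoder. Typical value: from a few dozen to over a hundred.
- C: vocabulary size. Typical value: from a few hundred to several thousand.
Therefore, the memory required to train an RNN-T model is proportional to NTUC, the product of the four numbers N, T, U, and C. By contrast, the memory required to train a CTC or attention-based model is generally proportional to NTC or NUC.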
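To make the difference concrete, the sketch below compares the size of an (N, T, U, C) output tensor with an (N, T, C) one. The specific values of N, T, U, and C are illustrative picks from the ranges listed above, not measurements, and the helper `tensor_bytes` is a hypothetical name introduced only for this example. It assumes 4-byte float32 elements and counts only the output tensor itself, not gradients or other activations.

```python
def tensor_bytes(*dims, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given dimensions.

    Assumes float32 (4 bytes per element) unless stated otherwise.
    """
    count = 1
    for d in dims:
        count *= d
    return count * bytes_per_element


# Illustrative values only, chosen from the typical ranges given in the text.
N, T, U, C = 30, 500, 50, 500

rnnt_bytes = tensor_bytes(N, T, U, C)  # RNN-T output: (N, T, U, C)
ctc_bytes = tensor_bytes(N, T, C)      # CTC output:   (N, T, C)

print(f"RNN-T output: {rnnt_bytes / 2**30:.2f} GiB")
print(f"CTC output:   {ctc_bytes / 2**20:.2f} MiB")
print(f"ratio:        {rnnt_bytes // ctc_bytes}x")
```

With these shapes the ratio between the two sizes is exactly U, which is why the extra U dimension dominates the memory cost of RNN-T training.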
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at