RNN-T models the acoustic and language features jointly, which eliminates the output-independence drawback of the CTC model. Nevertheless, this appealing property comes at the cost of high memory and computation consumption during training. Specifically, the RNN-T loss is computed over a 4-D lattice of shape (N, T, U, V), where N is the batch size, T is the output length of the acoustic encoder, U is the output length of the prediction network, and V is the vocabulary size.
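To make the memory cost of the (N, T, U, V) lattice concrete, the following sketch estimates the size of the joiner's logit tensor for one hypothetical training batch. The specific values of N, T, U, and V are illustrative assumptions, not numbers from the paper.

```python
# Hypothetical batch dimensions (not from the paper):
N, T, U, V = 32, 500, 100, 5000  # batch, encoder frames, label length, vocab
bytes_per_float = 4              # float32 logits

# The RNN-T joiner emits logits over the full 4-D lattice (N, T, U, V).
lattice_elements = N * T * U * V
lattice_gib = lattice_elements * bytes_per_float / 2**30

print(f"logit lattice: {lattice_elements:,} floats "
      f"~ {lattice_gib:.1f} GiB (before gradients/activations)")
```

Even at these moderate sizes the logits alone approach 30 GiB, before counting gradients and intermediate activations, which is why memory-efficient variants of the transducer loss are of practical interest.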