Paper Sharing - Consecutive Decoding for Speech-to-text Translation - Summary Report

1.Introduction

At present, speech-to-text translation is implemented by two types of models: the cascaded model and the end-to-end model.

The cascaded model divides speech-to-text translation into two parts. The first converts speech in the source language into text in the source language, which is called automatic speech recognition (ASR). The second translates the source-language text into target-language text, which is called machine translation (MT). ASR must be completed before MT can begin.

The end-to-end model constructs a single complete neural network that jointly optimizes speech recognition, post-recognition processing, and machine translation, establishing a direct mapping from the source-language speech signal to the target-language text. In other words, the end-to-end model takes source-language speech as input and directly outputs target-language text. It integrates all speech translation functions into a single model, unlike the cascaded model, which produces an intermediate result, the source-language transcript. Note that although end-to-end systems are very promising, cascaded systems still dominate practical deployment in industry.[1]

 

Each model has its own advantages. (a) The end-to-end model offers lower latency, smaller model size, and less error accumulation.[1] (b) The cascaded model can make better use of independent speech recognition or machine translation corpora.

2.Problems

Although both models have their advantages, they also have disadvantages.

A.The disadvantages of the end-to-end model

The first disadvantage of the end-to-end model is its complexity: it integrates multiple learning tasks into a single model, which places a heavy burden on that model. Another disadvantage is that it cannot make full use of external ASR or MT corpora. The end-to-end model takes only source-language speech as input and directly outputs target-language text, so its training dataset consists of source-language speech paired with target-language text. However, an external ASR corpus pairs source-language speech with source transcripts, while an external MT corpus pairs source transcripts with target-language text. Because of these differences in composition, the end-to-end model cannot make use of external ASR or MT corpora.

B.The disadvantages of the cascaded model

The main disadvantage of the cascaded model is error accumulation. Because the cascaded model completes MT after completing ASR, errors propagate during this process. In the ASR stage, recognition errors may arise from complex factors such as the speaker's accent, environmental noise, and homophones or easily confused words in the language. The output of ASR is then used as the input of MT, where error propagation occurs: an incorrect recognition leads to an incorrect translation of the source transcript. In the MT stage, there may also be translation errors such as repetition, omission, and inversion in spoken language, as well as unclear semantic logic and difficulty in segmentation, resulting in further accumulation of errors.

C.The problem with existing methods of integrating the benefits of cascaded and end-to-end models

Existing research usually makes the end-to-end model meet such requirements by resorting to pre-training or multitask learning to bridge the benefits of cascaded and end-to-end models. The existing method that integrates the advantages of both is to use the ASR corpus to pre-train the encoder, and then use the ST corpus to fine-tune the entire model. Here, pre-training refers to training the model with one dataset before formal training to obtain a set of parameters, which are then used as the initial parameters for fine-tuning, i.e., the formal training.[2] The advantage of pre-training is that you do not have to train the model from scratch every time, and if the pre-training tasks and datasets are strongly correlated with those of fine-tuning, the effect of pre-training can be substantial. However, this method still has a problem: it cannot make full use of the external MT corpus, because it has no phase that translates the source-language transcript into target-language text.

3.Motivation

The authors hope to integrate the advantages of the cascaded model and the end-to-end model while solving the problems of both: the proposed system retains the end-to-end system's ability to avoid error accumulation, and it can also make full use of external corpora (ASR or MT data).

The authors mention in the paper that they drew one inspiration each from the ASR model and the MT model. (a) A branch of ASR models produces an intermediate product called phonemes before the speech transcript is produced. A phoneme is the smallest speech unit used to distinguish words; for example, s and z in sip and zip are two different English phonemes. (b) The insight from the MT model is that if we decode not only the final target-language text but also the intermediate source-language transcript, the performance of the whole speech translation model improves. These two insights form the core ideas of the two stages of the COSTT framework. Therefore, the authors propose COSTT, a unified speech translation framework with consecutive decoding for jointly modeling speech recognition and translation.

(The following sections were summarized by my two other teammates; this report is a group assignment, haha.)

4.Basic idea to solve the problem

There are three basic ideas in the paper and they will be explained below.

The first basic idea is using consecutive decoding to solve error accumulation, or error propagation. The authors state that consecutive decoding is the most important basic idea, and its importance is also evident from the fact that the term appears in the paper's title. Consecutive decoding refers to sequentially generating the source transcript and the target translation text with a single decoder in one pass; the detailed process will be explained in section 5. This mechanism alleviates, to some extent, the error accumulation and error propagation that appear in cascaded models. The authors do not explain in detail why consecutive decoding solves error propagation. In my understanding, cascaded models suffer from error propagation because of two characteristics: they have multiple stages, and the output of one stage serves as the input of the next. In the cascaded models mentioned in previous sections, the source transcript is the output of the ASR stage and also the input of the MT stage, and together these two characteristics cause error propagation and accumulation. In consecutive decoding, however, the source transcript and the target translation text are produced sequentially by one decoder, so the source transcript never serves as the input of a next stage and errors do not propagate across stages. In other words, consecutive decoding avoids the two characteristics that cause error propagation in cascaded models, and thereby mitigates it.

The second basic idea is pre-training the consecutive decoder to make full use of the external MT corpus. In this idea, the external MT corpus is used to pre-train the decoder before formal training. The second basic idea therefore solves the end-to-end model's problem of being unable to make full use of the external MT corpus.
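The sketch below illustrates why this works (a minimal sketch under my own assumptions; `combined_sequence`, `StubDecoder`, and `train_step` are hypothetical stand-ins, not the authors' code): each external MT pair of source transcript and target translation already matches the combined output sequence that the consecutive decoder produces, so the decoder can be pre-trained on text alone, without any speech input.

```python
def combined_sequence(transcript, translation):
    # the single target sequence the consecutive decoder learns to emit
    return ["<asr>"] + transcript + ["<st>"] + translation + ["<eos>"]

class StubDecoder:
    """Stand-in with a hypothetical train_step, only to make the sketch runnable."""
    def train_step(self, target):
        print("training on:", target)

def pretrain_decoder(decoder, mt_corpus):
    # each external MT pair already fits the decoder's output format,
    # so the decoder can be trained on text pairs alone
    for transcript, translation in mt_corpus:
        decoder.train_step(combined_sequence(transcript, translation))

pretrain_decoder(StubDecoder(), [(["see", "you"], ["再见"])])  # made-up target word
# training on: ['<asr>', 'see', 'you', '<st>', '再见', '<eos>']
```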

The third basic idea is to adopt a CTC loss supervised by phoneme labels to accelerate the convergence of acoustic modeling and preserve more acoustic information, and to adopt a shrinking operation to improve the performance of the acoustic model. The CTC loss is used to align the source speech sequence, which is divided into many small time slices, with the phoneme sequence; the CTC loss function is then computed and gradient descent is applied to accelerate the convergence of the acoustic model. The detailed process of the CTC loss will be explained in section 5. Convergence of the model means that the loss values steadily fluctuate within an accepted interval. The alignment between the speech sequence and the phoneme sequence in the CTC loss helps preserve more speech information. The shrinking operation removes the blank time slices and merges the repeated time slices of the speech sequence. Its advantage is that it avoids meaningless computation and memory usage, which improves the performance of the acoustic model.
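To make the shrinking operation concrete, here is a minimal sketch (my own illustration, not the authors' code), assuming we are given per-time-slice hidden states and the per-slice label ids predicted via CTC: blank slices are dropped, and the hidden states of consecutive slices repeating the same label are averaged.

```python
import numpy as np

BLANK = 0  # conventional CTC blank id (an assumption for this sketch)

def shrink(hidden, labels):
    """hidden: (T, d) array of per-slice states; labels: length-T CTC label ids."""
    groups, run = [], []
    for t, lab in enumerate(labels):
        if lab == BLANK:
            run = []                      # blanks are dropped and end a run
            continue
        if run and labels[run[-1]] == lab:
            run.append(t)                 # extend a run of the repeated label
        else:
            run = [t]                     # start a new run for a new label
            groups.append(run)
    # average the hidden states inside each run of repeated labels
    return np.stack([hidden[g].mean(axis=0) for g in groups])

hidden = np.arange(12, dtype=float).reshape(6, 2)  # T=6 slices, d=2 features
labels = [1, 1, BLANK, 2, 2, 3]                    # made-up per-slice labels
print(shrink(hidden, labels).shape)                # (3, 2): one state per label run
```

Dropping blanks and averaging runs shortens the sequence, which is exactly where the savings in computation and memory come from.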

5.Main steps to the solution

The approach proposed in the paper is called COnSecutive Transcription and Translation (COSTT). The framework of COSTT is explained in detail below; the general framework is shown in Figure 3. COSTT contains two phases: the acoustic-semantic (AS) phase and the transcription-translation (TT) phase.[1] In brief, the input of the AS phase is the original speech sequence, and its output is an acoustic representation with some semantic information, h_AS. The TT phase first pre-trains the decoder, then accepts the output of the AS phase as its input, and outputs the combined sequence of the predicted speech transcript and the target translation text. From this general structure, we can conclude that COSTT has characteristics of both cascaded and end-to-end models. On one hand, COSTT has two phases: the AS phase together with the first part of the TT phase's output (the source transcript) serves as an ASR model, while the TT phase itself serves as an MT model.[1] This demonstrates the cascaded characteristics of COSTT. On the other hand, the first part of the TT phase's output (the source transcript) is auxiliary and can be ignored, which indicates the end-to-end characteristics of COSTT. Moreover, the whole framework corresponds to the encoder-decoder model: the AS phase serves as an encoder, encoding the original speech sequence into an acoustic representation with semantic information, while the TT phase serves as a decoder, decoding that representation into the source transcript and the target translation text.

Next, we explain the detailed process of the framework. First of all, the speech sequence undergoes sampling, which means manually selecting one frame out of every three frames of speech for subsequent processing. The sampled sequence then passes through a linear layer, multiple Transformer blocks with the shrinking operation, and multiple blocks without it. Note that although there is only one encoder in COSTT, that encoder has several layers, i.e., several blocks. Each block with the shrinking operation has two parts: self-attention with CTC, and the shrinking operation itself. First, the block applies a multi-head self-attention layer, a linear layer, and a softmax layer sequentially. The softmax layer computes the probabilities of all possible results for each time slice in the speech sequence, and these probabilities are used to compute the CTC loss function.

The detailed process of the CTC loss in the paper first aligns the speech sequence, divided into multiple time slices, with the phoneme sequence. Several repeated time slices may correspond to a single phoneme, and there are also blank time slices caused by pauses in the speech, so the relationship between the speech sequence and the phoneme sequence is many-to-one. During alignment, blank and repeated time slices are removed. After alignment, we compute the CTC loss function. Through alignment, we know that one phoneme sequence can correspond to multiple paths, where a path is one assignment of results to all the time slices of a speech sequence, including blank and repeated slices; one phoneme sequence corresponds to multiple paths because repeated and blank slices can be inserted into it in various combinations. The softmax layer has already given us the probabilities of all possible results for each time slice. Assuming the time slices in a path are independent, the probability of a path is the product of the probabilities of its slices' results. Computing this for every path corresponding to the phoneme sequence, we sum all the path probabilities, apply the logarithm to the sum, and take the negative of the result to obtain the CTC loss (a toy brute-force sketch of this computation follows after this walkthrough). With the CTC loss function, we can apply gradient descent to minimize the loss and accelerate the convergence of the AS phase.

The other part of the blocks with the shrinking operation is the shrinking operation itself, which removes blank time slices and averages repeated time slices of the speech sequence, producing an acoustic representation h'_AS. Then h'_AS still needs to be processed by multiple blocks without the shrinking operation to obtain the acoustic representation with some semantic information, h_AS. There are two reasons why blocks without shrinking are still needed before the end of the AS phase. The first is that h'_AS is an acoustic representation while the output of the TT phase is text, so a period of transition is needed between h'_AS and the start of the TT phase to extract semantic information. The other is that shrinking changes the acoustic representation internally, so some acoustic information must be re-extracted.
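To make the path-summing computation above concrete, here is a toy brute-force version (my own illustration with a made-up probability table and a two-phoneme vocabulary; real CTC implementations use dynamic programming rather than enumerating all paths):

```python
import itertools
import math

BLANK = "-"
vocab = [BLANK, "a", "b"]
T = 3  # three time slices in this toy speech sequence
# probs[t][s]: softmax probability of symbol s at time slice t (made-up values)
probs = [
    {"-": 0.1, "a": 0.8, "b": 0.1},
    {"-": 0.6, "a": 0.3, "b": 0.1},
    {"-": 0.1, "a": 0.1, "b": 0.8},
]

def collapse(path):
    """Merge repeated symbols, then drop blanks (the CTC alignment rule)."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)

def ctc_loss(target):
    # Sum the probability of every length-T path that collapses to the target
    # phoneme sequence (slices assumed independent), then take the negative log.
    total = 0.0
    for path in itertools.product(vocab, repeat=T):
        if collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return -math.log(total)

print(ctc_loss(("a", "b")))  # sums paths like (a,-,b), (a,a,b), (a,b,b), ...
```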

Next, we explain the TT phase. The TT phase itself is a decoder, and the decoder also has multiple layers, i.e., multiple decoder blocks. Each decoder block contains two sublayers: multi-head self-attention and multi-head cross-attention. The output of the AS phase, h_AS, serves as the input to each decoder block's multi-head cross-attention sublayer. Now let us walk through the detailed process of consecutive decoding using the example in Figure 1. First, we feed the symbol <asr>, which marks the start of the source transcript, into the decoder and obtain the next word, the first word of the source transcript: see. We then feed <asr> and see into the decoder to predict the next word, you. Repeating this, we predict the words of the combined sequence of source transcript and target translation text one by one; the principle is that the words obtained so far are used to predict the next word. Note that the source transcript and the target translation text are combined in one sequence rather than two separate sequences, and since the words of this combined sequence are predicted one by one, we can also understand the name consecutive decoding. Special symbols distinguish the source transcript from the target translation text within the sequence: for instance, <st> marks the start of the target translation text and <eos> marks the end of the sequence. As a consequence, the final result, the target translation text, is taken from the TT phase's output between the special symbols <st> and <eos>.
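The loop below is a minimal greedy sketch of this procedure (my own illustration; the real TT phase is a Transformer decoder, and `model.step`, `ScriptedModel`, and the target word are hypothetical stand-ins):

```python
def consecutive_decode(model, h_as, max_len=50):
    tokens = ["<asr>"]                        # start-of-transcript symbol
    while tokens[-1] != "<eos>" and len(tokens) < max_len:
        # use all words obtained so far (plus h_AS via cross-attention)
        # to predict the next word of the combined sequence
        tokens.append(model.step(tokens, h_as))
    st = tokens.index("<st>")                 # boundary between the two parts
    return tokens[1:st], tokens[st + 1:-1]    # (transcript, translation)

class ScriptedModel:
    """Replays a fixed output so the sketch runs; a real model is a Transformer."""
    script = ["see", "you", "<st>", "再见", "<eos>"]  # target word is made up
    def step(self, tokens, h_as):
        return self.script[len(tokens) - 1]

print(consecutive_decode(ScriptedModel(), h_as=None))
# (['see', 'you'], ['再见'])
```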

6.Two figures that verify the results

Finally, we use two figures to verify that the method indeed solves the problems above. First, Figure 4 verifies that consecutive decoding alleviates error propagation. In this example, there are errors in the speech transcript. The model without the consecutive decoding mechanism (Base ST) mistranslates most of the sentence (the bold part). With COSTT, although there are still errors in the speech transcript (the underlined part: "today" is wrongly recognized as "to day"), the target translation text is completely correct, which shows that the method solves the problem of error propagation to some extent.

Figure 5 intuitively demonstrates the effect of pre-training the decoder. The model with a pre-trained decoder achieves higher BLEU scores than the model without. BLEU is a method for automatically evaluating translation quality;[3] the higher the score, the better the translation. Hence, pre-training the decoder improves translation quality. Moreover, the curve with pre-training converges earlier than the curve without, i.e., it begins steadily fluctuating within an interval earlier. We can therefore conclude that pre-training the decoder also accelerates the convergence of the model to some extent. This proves that using the external MT corpus to pre-train the decoder is effective.
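As an aside, BLEU is easy to compute with existing toolkits; for example, a quick illustrative check with NLTK's sentence_bleu (toy sentences, not data from the paper):

```python
from nltk.translate.bleu_score import sentence_bleu

# one reference translation and one candidate, both tokenized
reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "a", "mat"]
print(sentence_bleu(reference, candidate))  # closer to 1.0 means better quality
```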

[1] The paper discussed in this report: Consecutive Decoding for Speech-to-text Translation.
[2] The first answer at https://stats.stackexchange.com/questions/193082/what-is-pre-training-a-neural-network
[3] The BLEU paper: Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation", ACL 2002.

 
