Introduction to the CTC loss

introduction

The CTC (Connectionist Temporal Classification) loss is a loss function commonly used with RNNs. It is widely used, but the existing tutorials are often difficult to follow. As a result, I want to write a tutorial that teaches the basic principle and usage of the CTC loss.

Suppose the alphabet is $A$. We augment $A$ with a blank character (denoted as $\_$) and get the new alphabet $A'$. Take a particular input: say you are handling a speech recognition project, the input is a sound sequence, and the output of the RNN gives the probability of outputting each character in $A'$ at each timestamp. To be more precise, let us assume that the output length of the RNN is exactly $T$. If we feed a particular input $x$ to the RNN, we get a 2-dimensional matrix of size $|A'| \times T$, and we denote the probability of outputting the $k$'th symbol in $A'$ at time $t$ as $y^t_k$. Then, using the symbol $\pi$ to denote a character sequence in $A'^T$ (a path), we get the following formula

$$P(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}$$

That is, the probability of outputting a path $\pi$ is the product of the probabilities of outputting each of its characters at the corresponding timestamps. Take a simple example: suppose the input is a sound sequence representing the word 'cat', and you dictate that the output of the RNN is of length 5. In this case, the output of the RNN may look like:

|   | 1   | 2   | 3   | 4   | 5   |
|---|-----|-----|-----|-----|-----|
| a | 0.1 | 0.1 | 0.5 | 0.7 | 0.1 |
| c | 0.6 | 0.1 | 0.2 | 0.1 | 0.2 |
| t | 0.2 | 0.2 | 0.2 | 0.1 | 0.6 |
| _ | 0.1 | 0.6 | 0.1 | 0.1 | 0.1 |

Here, for example, $y^1_a = 0.1$ means that at timestamp 1, the probability of the RNN outputting 'a' is 0.1. We can also calculate the probability of outputting a specific path $\pi$. For example, in this case, the probability of outputting the path 'c_aat' is

$$P(\pi = \text{c\_aat}\,|\,x) = 0.6 \times 0.6 \times 0.5 \times 0.7 \times 0.6$$
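To make this concrete, here is a minimal Python sketch that computes a path probability from the matrix above (the function name and the row ordering are my own choices, not part of any library):

```python
import numpy as np

# The RNN output matrix from the table above.
# Rows follow the order a, c, t, _ (blank); columns are timestamps 1..5.
alphabet = ['a', 'c', 't', '_']
y = np.array([
    [0.1, 0.1, 0.5, 0.7, 0.1],  # a
    [0.6, 0.1, 0.2, 0.1, 0.2],  # c
    [0.2, 0.2, 0.2, 0.1, 0.6],  # t
    [0.1, 0.6, 0.1, 0.1, 0.1],  # _
])

def path_probability(path, y, alphabet):
    """P(pi|x): multiply the output probability of each character at its timestamp."""
    return np.prod([y[alphabet.index(ch), t] for t, ch in enumerate(path)])

print(path_probability('c_aat', y, alphabet))  # 0.6 * 0.6 * 0.5 * 0.7 * 0.6 = 0.0756
```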

You can easily see from the example that we cannot compare the output of the RNN directly with the desired output. How can we design an efficient loss function that takes the output of the RNN and the desired output, and gives us the loss?

from path to label

How can we map from a sequence in the RNN's output, called a path, to the estimated output of the neural network, called a labelling sequence? By labelling sequence we mean the sequence that we want to compare with the input's ground truth; we denote it as $l$. Now think: how can we infer from the input $x$ that the output is $l$? The answer is that we take the path, eliminate all the consecutively repeated characters, and then eliminate all the blanks, obtaining the labelling sequence $l$. For example, if the path is 'c_aat', as in the previous section, the desired labelling sequence $l$ is 'cat', obtained after we remove the second character '_' and the fourth character 'a' from the path. We denote this operation as a function $F$: a function from $A'^T$ to $A^{\le T}$, that is, from a path of the input $x$, called $\pi$, to a possible labelling of $x$, named $l$. In other words, $F$ represents the operation of removing all the consecutively repeated characters (until only one is left) and then all the blanks from the input character sequence.

Then we can answer the question: what is the probability of obtaining a particular labelling $l$, given the input sequence $x$? The answer is the sum of the probabilities of all the paths that lead (via $F$) to the labelling $l$. That is,

$$P(l|x) = \sum_{F(\pi) = l} P(\pi|x)$$

For illustration, let's consider the cat example. Suppose that the length of the output of the RNN, that is, the length of the path, is 5. Then what is the probability of obtaining the labelling 'cat' given a particular input sequence $x$? It is just the sum of the probabilities of all the paths that lead to the labelling 'cat', for example 'c_aat', 'ccaat', '__cat' and so on. That is, in order to estimate $P(l = \text{cat}|x)$, we have to sum up $P(\pi = \text{c\_aat}|x)$, $P(\pi = \text{ccaat}|x)$, $P(\pi = \text{\_\_cat}|x)$, ... and the number of such paths grows exponentially with the path length! As a result, we have to handle the following tasks:

  • put forward an efficient algorithm to estimate $P(l|x)$
  • associate a loss function with $P(l|x)$ and suggest a method to compute its derivative efficiently
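Before designing anything efficient, it helps to see the naive version. The following sketch (reusing y, alphabet, and path_probability from the earlier snippet) implements $F$ and computes $P(l|x)$ by enumerating every possible path, which is only feasible for toy sizes:

```python
from itertools import product

def collapse(path):
    """F: collapse consecutive repeats, then remove all blanks."""
    out = []
    for ch in path:
        if not out or ch != out[-1]:
            out.append(ch)
    return ''.join(c for c in out if c != '_')

def label_probability_bruteforce(label, y, alphabet):
    """P(l|x) = sum of P(pi|x) over all paths pi with F(pi) = l.

    Enumerates all |A'|^T paths, so it is exponential in T -- illustration only.
    """
    T = y.shape[1]
    return sum(path_probability(path, y, alphabet)
               for path in product(alphabet, repeat=T)
               if collapse(path) == label)

print(label_probability_bruteforce('cat', y, alphabet))  # sums c_aat, ccaat, __cat, ...
```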

estimate $P(l|x)$

the forward pass

We take advantage of a technique called dynamic programming to estimate $P(l|x)$ quickly.

First, let's denote the length of the labelling $l$ as $U$. Then we insert a blank symbol between every two adjacent symbols in $l$, as well as at the beginning and the end of $l$. We call the obtained sequence $l'$; its length is $U' = 2U + 1$. That is, if the labelling $l$ is 'cat', then $l'$ is '_c_a_t_'. Then we use the symbol $\alpha(t,u)$ to represent the sum of the probabilities of all the path prefixes that satisfy the following properties:

  • the path's first $t$ characters are mapped (by $F$) to the first $\lfloor u/2 \rfloor$ characters of $l$;
  • the path's $t$'th character is the $u$'th character of $l'$.

How do we express these properties mathematically? First, $A'^t$ is the set of all length-$t$ character sequences, i.e., all possible first $t$ characters of a path. We denote the first $\lfloor u/2 \rfloor$ characters of $l$ as $l_{1:\lfloor u/2 \rfloor}$, and then the set of the length-$t$ prefixes of all the paths satisfying the properties can be expressed as

$$V(t,u) = \left\{\pi \,\middle|\, \pi \in A'^t,\ F(\pi) = l_{1:\lfloor u/2 \rfloor},\ \pi_t = l'_u \right\}$$

where $\lfloor \cdot \rfloor$ truncates to an integer. For example, if $l$ is 'cat' and $l'$ is '_c_a_t_', let $t=1$ and $u=1$. Then $V(1,1)$ is the set of the length-1 prefixes (just the first symbol) of all the paths that can be mapped to the first $0$ ($= \lfloor 1/2 \rfloor$) characters of $l$ (the empty string) and whose first symbol is $l'_1$, the blank. It is apparent that $V(1,1)$ contains only the single sequence '_', because the blank is the only symbol that $F$ maps to the empty string. Then, what is $V(1,2)$? It is the set of the length-1 prefixes of all the paths that can be mapped to the first $1$ ($= \lfloor 2/2 \rfloor$) character of $l$ (the symbol 'c' in the labelling 'cat') and whose first symbol is the second symbol of $l'$, that is, 'c'. It is also apparent that $V(1,2)$ contains only the single sequence 'c'.

Then, $\alpha(t,u)$ is defined to be the total probability of the length-$t$ prefixes in $V(t,u)$, that is,

$$\alpha(t,u) = \sum_{\pi \in V(t,u)} P(\pi|x) = \sum_{\pi \in V(t,u)} \prod_{i=1}^{t} y^i_{\pi_i}$$

So, how do we compute $\alpha(t,u)$? We already know that $V(1,1) = \{\text{'\_'}\}$ and $V(1,2) = \{l_1\} = \{\text{'c'}\}$. As a result, $\alpha(1,1) = y^1_{\_}$ and $\alpha(1,2) = y^1_{l_1}$. It takes no effort to figure out that $\alpha(1,3) = 0$, because a blank symbol cannot be mapped to 'c'. In fact, $\alpha(1,u) = 0$ for all $u \ge 3$. In this particular example, $u$ can be as large as $U' = 7$.

Now, for the situation where $t = 2$: $\alpha(2,1) = y^2_{\_} \times \alpha(1,1)$. Why? Because $\alpha(2,1)$ means that the first 2 characters of the path must be mapped to the empty string and the 2nd symbol of the path must be blank, and this happens with probability $y^2_{\_} \times \alpha(1,1)$.
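With the numbers from the table above, for instance, $\alpha(2,1) = y^2_{\_} \times \alpha(1,1) = 0.6 \times 0.1 = 0.06$.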

Now consider the general case. For $\alpha(t,u)$, if $l'_u$ is blank, then in order for the first $t$ symbols of the path to be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ with the $t$'th symbol of the path being $l'_u$ (blank), one of the following must happen, apart from the factor $y^t_{l'_u}$:

  • the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$'th symbol of the path is the $\lfloor u/2 \rfloor$'th symbol of $l$ (the $(u-1)$'th symbol of $l'$), giving $\alpha(t-1,u-1)$. In this case, the $(t-1)$'th symbol is kept by $F$ and the $t$'th symbol is eliminated by $F$.
  • the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$'th symbol of the path is blank (the $u$'th symbol of $l'$), giving $\alpha(t-1,u)$. In this case, both the $(t-1)$'th symbol and the $t$'th symbol are eliminated by $F$.

As a result,

$$\alpha(t,u) = y^t_{l'_u}\left(\alpha(t-1,u-1) + \alpha(t-1,u)\right) \quad \text{if } l'_u = \text{blank}$$

Then, what if $l'_u$ is not blank? Noting that $F$ eliminates all the consecutively repeated symbols and then all the blank symbols, we get that, apart from the factor $y^t_{l'_u}$, one of the following must happen:

  • the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor - 1$ symbols of $l$ and the $(t-1)$'th symbol of the path is the $(\lfloor u/2 \rfloor - 1)$'th symbol of $l$ (the $(u-2)$'th symbol of $l'$), giving $\alpha(t-1,u-2)$. In this case, both the $(t-1)$'th symbol and the $t$'th symbol are kept by $F$.
  • the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor - 1$ symbols of $l$ and the $(t-1)$'th symbol of the path is blank (the $(u-1)$'th symbol of $l'$), giving $\alpha(t-1,u-1)$. In this case, the $(t-1)$'th symbol is eliminated by $F$ and the $t$'th symbol is kept.
  • the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$'th symbol of the path is the $\lfloor u/2 \rfloor$'th symbol of $l$ (the $u$'th symbol of $l'$), giving $\alpha(t-1,u)$. In this case, the $(t-1)$'th symbol is kept by $F$ and the $t$'th symbol is eliminated because it is a repetition of the previous symbol.

As a result,

$$\alpha(t,u) = y^t_{l'_u}\left(\alpha(t-1,u-2) + \alpha(t-1,u-1) + \alpha(t-1,u)\right) \quad \text{if } l'_u \ne \text{blank}$$

However, there are still uncovered cases. What if $l$ has consecutively repeated characters? You may think this is impossible, but recall that in the function $F$ we first remove all the consecutively repeated symbols and then remove all the blank symbols, and the second step can produce new adjacent repeats: for example, the path 'l_l' is mapped to 'll'. If $l'_u = l'_{u-2}$, the term $\alpha(t-1,u-2)$ must be dropped, because a path in which $l'_{u-2}$ at step $t-1$ is immediately followed by the identical symbol $l'_u$ would have the two collapsed into one by $F$. So in this case $\alpha(t,u) = y^t_{l'_u}\left(\alpha(t-1,u-1) + \alpha(t-1,u)\right)$, no matter whether $l'_u$ is blank.

To summarise,

$$\alpha(t,u) = \begin{cases} y^t_{l'_u}\left(\alpha(t-1,u-1) + \alpha(t-1,u)\right) & \text{if } l'_u = \text{blank or } l'_{u-2} = l'_u \\ y^t_{l'_u}\left(\alpha(t-1,u-2) + \alpha(t-1,u-1) + \alpha(t-1,u)\right) & \text{otherwise} \end{cases}$$

To make the formula valid when $u = 1$ and $u = 2$, we must have $\alpha(t,0)$ defined; we define it to be 0.
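Putting the initialisation and the recursion together, here is a minimal sketch of the forward pass (the function and variable names are my own; it reuses y and alphabet from the earlier snippets):

```python
def forward_pass(y, label, alphabet, blank='_'):
    """Compute alpha(t, u) for all t in 1..T, u in 1..U' by dynamic programming."""
    T = y.shape[1]
    # Build l': a blank between every two symbols and at both ends, |l'| = 2U + 1.
    l_prime = [blank] + [s for ch in label for s in (ch, blank)]
    U_prime = len(l_prime)
    idx = [alphabet.index(ch) for ch in l_prime]  # row of each l'_u in y

    alpha = np.zeros((T, U_prime))  # alpha[t-1, u-1] holds alpha(t, u)
    alpha[0, 0] = y[idx[0], 0]      # alpha(1, 1) = y^1_blank
    alpha[0, 1] = y[idx[1], 0]      # alpha(1, 2) = y^1_{l_1}; alpha(1, u) = 0 for u >= 3

    for t in range(1, T):
        for u in range(U_prime):
            total = alpha[t - 1, u]
            if u >= 1:
                total += alpha[t - 1, u - 1]
            # The u-2 term is allowed only when l'_u is neither blank nor l'_{u-2}.
            if u >= 2 and l_prime[u] != blank and l_prime[u] != l_prime[u - 2]:
                total += alpha[t - 1, u - 2]
            alpha[t, u] = y[idx[u], t] * total
    return alpha

# A full path must end on the last symbol of l or on the trailing blank, so
# P(l|x) = alpha(T, U') + alpha(T, U'-1); it matches the brute-force value above.
alpha = forward_pass(y, 'cat', alphabet)
print(alpha[-1, -1] + alpha[-1, -2])
```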

the backward pass

Now we come to the backward pass. We define another variable $\beta(t,u)$: the probability that the last $T-t$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling, given that the $t$'th symbol of the path is $l'_u$. Take the cat example, where $l'$ is '_c_a_t_'. The variable $\alpha(3,4)$ answers the question: what is the probability that the first 3 symbols of the path can be mapped to the labelling prefix 'ca' and the 3rd symbol of the path is 'a'? On the other hand, supposing that the length of the output path is 7, the variable $\beta(3,4)$ answers the question: given that the 3rd symbol of the path is the fourth symbol of $l'$, namely 'a', what is the probability that symbols 4 through 7 of the path can be mapped to the rest of 'cat', namely 't'?

Now, how do we compute $\beta(t,u)$? Suppose the length of the path is $T$ and the total length of the augmented labelling $l'$ is $U' = 2U + 1$. We can start from the following initial cases:

$$\beta(T,U') = 1, \qquad \beta(T,U'-1) = 1$$

The variable $\beta(T,U')$ is the probability that the last $T-T$ symbols (no symbols) can be mapped to the last $U - \lfloor U'/2 \rfloor = 0$ symbols (no symbols), given that $\pi_T$ is equal to $l'_{U'}$. This is a no-symbols to no-symbols map, and its probability is dictated to be 1. The case of $\beta(T,U'-1)$ is similar. For every other $u$, $\beta(T,u) = 0$, since the remaining symbols of the labelling could never be produced.

How do we compute $\beta(t,u)$ in the general case? In order for the last $T-t$ symbols of the path to be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling $l$, given that $\pi_t = l'_u$, if the symbol $l'_u$ is blank, one of the following must be true:

  • the $(t+1)$'th symbol of the path is also blank, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of $l$, giving $y^{t+1}_{l'_u}\,\beta(t+1,u)$;
  • the $(t+1)$'th symbol of the path is not blank but $l'_{u+1}$, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor (u+1)/2 \rfloor$ symbols of $l$, giving $y^{t+1}_{l'_{u+1}}\,\beta(t+1,u+1)$.

As a result, we can imitate the inference in the last section (including the case of a non-blank $l'_u$) and get the formula for $\beta(t,u)$:

$$\beta(t,u) = \begin{cases} y^{t+1}_{l'_{u+1}}\,\beta(t+1,u+1) + y^{t+1}_{l'_u}\,\beta(t+1,u) & \text{if } l'_u = \text{blank or } l'_{u+2} = l'_u \\ y^{t+1}_{l'_{u+2}}\,\beta(t+1,u+2) + y^{t+1}_{l'_{u+1}}\,\beta(t+1,u+1) + y^{t+1}_{l'_u}\,\beta(t+1,u) & \text{otherwise} \end{cases}$$

To make the formula valid near the end of $l'$, we define $\beta(t,u) = 0$ for $u > U'$.

Now we can compute $\alpha(t,u)$ and $\beta(t,u)$ for every $t \in \{1,2,\dots,T\}$ and $u \in \{1,2,\dots,U'\}$.
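Here is the matching sketch of the backward pass, mirroring forward_pass above:

```python
def backward_pass(y, label, alphabet, blank='_'):
    """Compute beta(t, u) for all t, u by dynamic programming, from t = T down to 1."""
    T = y.shape[1]
    l_prime = [blank] + [s for ch in label for s in (ch, blank)]
    U_prime = len(l_prime)
    idx = [alphabet.index(ch) for ch in l_prime]

    beta = np.zeros((T, U_prime))
    beta[T - 1, U_prime - 1] = 1.0  # beta(T, U') = 1
    beta[T - 1, U_prime - 2] = 1.0  # beta(T, U'-1) = 1; beta(T, u) = 0 otherwise

    for t in range(T - 2, -1, -1):
        for u in range(U_prime):
            total = y[idx[u], t + 1] * beta[t + 1, u]
            if u + 1 < U_prime:
                total += y[idx[u + 1], t + 1] * beta[t + 1, u + 1]
            # The u+2 term is allowed only when l'_u is neither blank nor l'_{u+2}.
            if u + 2 < U_prime and l_prime[u] != blank and l_prime[u] != l_prime[u + 2]:
                total += y[idx[u + 2], t + 1] * beta[t + 1, u + 2]
            beta[t, u] = total
    return beta

beta = backward_pass(y, 'cat', alphabet)
```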

loss function

Note that $\alpha(t,u)$ is the probability that the first $t$ symbols of a path can be mapped to the first $\lfloor u/2 \rfloor$ symbols of a labelling $l$ with $\pi_t = l'_u$, and $\beta(t,u)$ is the probability that the last $T-t$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling $l$, given that $\pi_t = l'_u$. We can conclude that $\alpha(t,u)\,\beta(t,u)$ is the probability that all $T$ symbols of the path can be mapped to the $U$ symbols of the labelling $l$ and $\pi_t = l'_u$. At any time $t$, every such path goes through exactly one position $u$ of $l'$, so we can sum over all $u$ to get

$$P(l|x) = \sum_{u=1}^{U'} \alpha(t,u)\,\beta(t,u) \qquad \forall t \in \{1,2,\dots,T\}$$

For a particular training example $x$ with the ground truth $l$, we can define the function $L(x,l) = -\log P(l|x)$ to be the example loss; then the example loss is

$$L(x,l) = -\log \sum_{u=1}^{U'} \alpha(t,u)\,\beta(t,u) \qquad \forall t \in \{1,2,\dots,T\}$$
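Reusing the forward_pass and backward_pass sketches from above, both the identity and the loss take only a couple of lines (again an illustrative sketch, not a reference implementation):

```python
ab = forward_pass(y, 'cat', alphabet) * backward_pass(y, 'cat', alphabet)
print(ab.sum(axis=1))        # P(l|x) computed at every t -- all entries are equal
loss = -np.log(ab[0].sum())  # the example loss L(x, l) = -log P(l|x)
print(loss)
```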

Then we want to get the loss gradient $\frac{\partial L(x,l)}{\partial y^t_k}$, which is just the partial derivative of the loss function with respect to the output of the RNN. Recall that $y^t_k$ is the probability that the RNN outputs label $k$ at time $t$; then

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{\partial}{\partial y^t_k}\log P(l|x) = -\frac{1}{P(l|x)}\,\frac{\partial P(l|x)}{\partial y^t_k} = -\frac{1}{P(l|x)}\,\frac{\partial}{\partial y^t_k}\left(\sum_{u=1}^{U'}\alpha(t,u)\,\beta(t,u)\right) = -\frac{1}{P(l|x)}\sum_{u=1}^{U'}\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k}$$

Then what is $\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k}$ for a particular $t$ and $u$? Recall that $\alpha(t,u)\,\beta(t,u)$ is the probability that the path $\pi$ can be mapped to the labelling $l$ and the $t$'th symbol of $\pi$ is the $u$'th symbol of $l'$. Then $\alpha(t,u)\,\beta(t,u)$ is just the sum of the probabilities of all these paths, that is,

$$\alpha(t,u)\,\beta(t,u) = \sum_{\pi \in X(t,u)} \prod_{i=1}^{T} y^i_{\pi_i}$$

where $X(t,u)$ is the set of all the paths that can be mapped to $l$ and whose $t$'th symbol is $l'_u$. For a particular $t$ and $u$, if $l'_u = k$, then $\pi_t = l'_u = k$, so the factor $y^t_k$ appears exactly once in each product on the right-hand side of the expression for $\alpha(t,u)\,\beta(t,u)$, and we find that in this case

$$\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k} = \frac{\alpha(t,u)\,\beta(t,u)}{y^t_k}$$

On the other hand, if $l'_u \ne k$, then the factor $y^t_k$ does not appear on the right-hand side, so $\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k} = 0$. We can combine the two conditions into

$$\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k} = \begin{cases} \dfrac{\alpha(t,u)\,\beta(t,u)}{y^t_k} & \text{if } l'_u = k \\ 0 & \text{otherwise} \end{cases}$$

Then we can obtain

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{1}{P(l|x)}\sum_{u=1}^{U'}\frac{\partial\,\alpha(t,u)\,\beta(t,u)}{\partial y^t_k} = -\frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\frac{\alpha(t,u)\,\beta(t,u)}{y^t_k} = -\frac{1}{P(l|x)\,y^t_k}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u)$$

Congratulations! We have obtained the formula for $\frac{\partial L(x,l)}{\partial y^t_k}$! However, we must take one final step. How do we get the RNN's output $y^t_k$? In fact, we put the activations $a^t_k$ into a softmax layer, and the formula for the softmax layer is

$$y^t_k = \frac{e^{a^t_k}}{\sum_{k'=1}^{|A'|} e^{a^t_{k'}}} \qquad \forall t \in \{1,2,\dots,T\}$$
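For reference, a minimal, numerically stabilised version of this softmax, assuming the activations arrive as an $|A'| \times T$ matrix shaped like y:

```python
def softmax(a):
    """Column-wise softmax: turn activations a^t_k into probabilities y^t_k."""
    e = np.exp(a - a.max(axis=0, keepdims=True))  # subtract the max for stability
    return e / e.sum(axis=0, keepdims=True)
```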

Remember that $A'$ is the augmented alphabet (including the blank symbol); then

$$\frac{\partial L(x,l)}{\partial a^t_k} = \sum_{k'=1}^{|A'|}\frac{\partial L(x,l)}{\partial y^t_{k'}}\,\frac{\partial y^t_{k'}}{\partial a^t_k} = \sum_{k'=1}^{|A'|}\left(y^t_{k'}\,\delta_{kk'} - y^t_{k'}\,y^t_k\right)\frac{\partial L(x,l)}{\partial y^t_{k'}} = -\sum_{k'=1}^{|A'|}\frac{1}{P(l|x)\,y^t_{k'}}\left(y^t_{k'}\,\delta_{kk'} - y^t_{k'}\,y^t_k\right)\sum_{u:\,l'_u = k'}\alpha(t,u)\,\beta(t,u)$$

(here we used the softmax derivative $\frac{\partial y^t_{k'}}{\partial a^t_k} = y^t_{k'}\,\delta_{kk'} - y^t_{k'}\,y^t_k$; try to derive the expression by yourself).

Take a close look at the summed items. If $k' = k$ then $\delta_{kk'} = 1$, and

$$-\frac{1}{P(l|x)\,y^t_k}\left(y^t_k - y^t_k\,y^t_k\right)\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) = -\frac{1 - y^t_k}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) = \frac{y^t_k}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) - \frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u)$$

Then if $k' \ne k$, $\delta_{kk'} = 0$, and

$$-\frac{1}{P(l|x)\,y^t_{k'}}\left(-y^t_{k'}\,y^t_k\right)\sum_{u:\,l'_u = k'}\alpha(t,u)\,\beta(t,u) = \frac{y^t_k}{P(l|x)}\sum_{u:\,l'_u = k'}\alpha(t,u)\,\beta(t,u)$$

Then we sum them up, and finally we get

$$\frac{\partial L(x,l)}{\partial a^t_k} = \frac{y^t_k}{P(l|x)}\sum_{k'=1}^{|A'|}\,\sum_{u:\,l'_u = k'}\alpha(t,u)\,\beta(t,u) - \frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) = \frac{y^t_k}{P(l|x)}\sum_{u=1}^{U'}\alpha(t,u)\,\beta(t,u) - \frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) = \frac{y^t_k}{P(l|x)}\,P(l|x) - \frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u) = y^t_k - \frac{1}{P(l|x)}\sum_{u:\,l'_u = k}\alpha(t,u)\,\beta(t,u)$$

This is the final formula that we wanted to get. It is easily observed from the expression that $y^t_k$, $P(l|x)$, and $\alpha(t,u)\,\beta(t,u)$ can all be computed efficiently.
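Putting everything together, here is a sketch of the gradient with respect to the activations, built on the forward_pass and backward_pass sketches above (names are my own):

```python
def ctc_grad(y, label, alphabet, blank='_'):
    """dL/da^t_k = y^t_k - (1/P(l|x)) * sum of alpha(t,u)beta(t,u) over {u : l'_u = k}."""
    ab = forward_pass(y, label, alphabet, blank) * backward_pass(y, label, alphabet, blank)
    p = ab[0].sum()  # P(l|x); any timestep gives the same sum
    l_prime = [blank] + [s for ch in label for s in (ch, blank)]

    grad = y.copy()  # the y^t_k term, shape |A'| x T
    for u, ch in enumerate(l_prime):
        grad[alphabet.index(ch), :] -= ab[:, u] / p
    return grad

print(ctc_grad(y, 'cat', alphabet))  # same shape as y
```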
