introduction
The CTC loss is a loss function that is commonly used with RNNs. Although it is widely used, the existing tutorials are difficult to follow. As a result, I want to write a tutorial that teaches the basic principle and usage of the CTC loss.
Suppose the alphabet is $A$; we augment it with a blank symbol '_' to obtain $A' = A \cup \{\_\}$. At every timestamp $t$, the RNN outputs a probability $y^t_k$ for every symbol $k \in A'$. A length-$T$ sequence $\pi$ over $A'$ is called a path, and its probability given the input $x$ is

$$P(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}$$
That is, the probability of outputting a path $\pi$ is the product of the probabilities of outputting each of its characters at the corresponding timestamps. Take a simple example: suppose the input is a sound sequence representing the word 'cat', and you dictate that the output of the RNN has length 5. In this condition, the output of the RNN may be:
|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| a | 0.1 | 0.1 | 0.5 | 0.7 | … |
| c | 0.6 | 0.1 | 0.2 | 0.1 | … |
| t | 0.2 | 0.2 | 0.2 | 0.1 | … |
| _ | 0.1 | 0.6 | 0.1 | 0.1 | … |
Here, for example, $y^1_a = 0.1$ means that at timestamp 1, the probability of the RNN outputting 'a' is 0.1. We can also calculate the probability of outputting a specific path, for example

$$P(\pi = \text{c\_aat} \mid x) = y^1_c \, y^2_{\_} \, y^3_a \, y^4_a \, y^5_t$$
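To make this concrete, here is a minimal sketch in Python (the helper name `path_prob` and the dict layout of `y` are my own illustrative choices), using the four fully shown columns of the table above:

```python
# Per-timestep output distributions from the table above
# (only the four fully shown timesteps are used).
y = {
    "a": [0.1, 0.1, 0.5, 0.7],
    "c": [0.6, 0.1, 0.2, 0.1],
    "t": [0.2, 0.2, 0.2, 0.1],
    "_": [0.1, 0.6, 0.1, 0.1],
}

def path_prob(path, y):
    """P(pi|x) = product over t of y^t_{pi_t}."""
    p = 1.0
    for t, symbol in enumerate(path):
        p *= y[symbol][t]
    return p

print(path_prob("c_at", y))  # 0.6 * 0.6 * 0.5 * 0.1 = 0.018
```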
You can easily see from the example that we cannot compare the output of the RNN directly with the desired output. How can we design an efficient loss function that takes the output of the RNN and the desired output, and gives us the loss?
from path to label
How can we map a sequence in the RNN's output, called a path, to the estimated output of the neural network, called a labelling sequence? When we use the words labelling sequence, we want to express that we want to compare it with the input's ground truth. We denote it as $l$. Now please think: how can we infer $l$ from a path? We use a function $F$ that first merges consecutively repeated symbols and then removes all the blank symbols; for example, $F(\text{c\_aat}) = F(\text{ccaat}) = \text{cat}$.
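Here is a minimal sketch of $F$ in Python, treating paths as sequences of characters with '_' as the blank (the implementation details are my own):

```python
def F(path, blank="_"):
    """Map a path to a labelling: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev:          # merge consecutively repeated symbols
            out.append(s)
        prev = s
    return "".join(s for s in out if s != blank)  # remove the blanks

assert F("c_aat") == "cat"
assert F("ccaat") == "cat"
assert F("__cat") == "cat"
assert F("c_aa_t") == "cat"
```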
Then we can answer the question: what is the probability of obtaining a particular labelling $l$ given the input $x$?
For illustration, let's consider the cat example. Suppose the length of the output of the RNN, that is, the length of the path, is 5. Then what is the probability of obtaining the labelling 'cat' given a particular input sequence $x$? It is just the sum of the probabilities of all the paths that lead to the labelling 'cat', for example 'c_aat', 'ccaat', '__cat' and so on. That is, in order to estimate $P(l=\text{cat}|x)$, we have to sum up $P(\pi=\text{c\_aat}|x)$, $P(\pi=\text{ccaat}|x)$, $P(\pi=\text{\_\_cat}|x)$, … and the number of such paths grows exponentially with the path length (a brute-force check follows the list below)! As a result, we have to handle the following tasks:
- put forward an efficient algorithm to estimate $P(l|x)$
- associate a loss function with $P(l|x)$ and suggest a method to compute its derivative efficiently
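As promised, a brute-force check (reusing `F`, `path_prob` and `y` from the snippets above; the function name is my own):

```python
from itertools import product

def label_prob_brute_force(label, y, T, alphabet="act_"):
    """Sum P(pi|x) over every length-T path pi with F(pi) == label.

    The loop visits len(alphabet)**T paths, which is exactly why
    we will need a smarter algorithm below.
    """
    total = 0.0
    for path in product(alphabet, repeat=T):
        if F(path) == label:
            total += path_prob(path, y)
    return total

# Restricted to T = 4, since the table above shows four full timesteps:
print(label_prob_brute_force("cat", y, T=4))  # ≈ 0.0275
```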
estimate P(l|x)
the forward pass
We take advantage of dynamic programming to estimate $P(l|x)$ quickly.
Firstly, let's denote the length of the labelling $l$ as $U$. We augment $l$ by inserting a blank symbol at the beginning, at the end, and between every pair of adjacent characters, obtaining $l'$ of length $U' = 2U + 1$; for 'cat', $l'$ is '_c_a_t_'. The forward variable $\alpha(t,u)$ concerns the paths with the following two properties:

- the path's first $t$ characters are mapped (by $F$) to the first $\lfloor u/2 \rfloor$ characters of $l$;
- the $t$-th character of the path is the $u$-th character of $l'$.
How do we express these properties in a mathematical way? First, $A'^t$ is the set of all length-$t$ sequences over $A'$, that is, the set of all the paths' first $t$ characters. We can denote the first $\lfloor u/2 \rfloor$ characters of $l$ as $l_{1:\lfloor u/2 \rfloor}$, where we truncate the integer $u/2$. For example, if $l$ is 'cat' and $u = 5$, then $\lfloor u/2 \rfloor = 2$ and $l_{1:2} = \text{ca}$.
Then, the probability $\alpha(t,u)$ is defined to be the total probability of all the length-$t$ path prefixes that satisfy the two properties:

$$\alpha(t,u) = \sum_{\substack{\pi \in A'^t:\ F(\pi) = l_{1:\lfloor u/2 \rfloor},\ \pi_t = l'_u}} \ \prod_{i=1}^{t} y^i_{\pi_i}$$
So, how do we estimate $\alpha(t,u)$? We already know that $l'_1 = \_$ and $l'_2 = l_1 = \text{c}$. As a result, $\alpha(1,1) = y^1_{\_}$ and $\alpha(1,2) = y^1_{l_1}$ (in the cat example, 0.1 and 0.6). It takes no effort to figure out that $\alpha(1,3) = 0$, because a single blank symbol cannot be mapped to 'c'. In fact, $\alpha(1,u) = 0$ for all $u \geq 3$. In this particular example, $u$ can be as large as $U' = 7$.
Now consider the general case. For $\alpha(t,u)$, if $l'_u$ is blank, then in order for the first $t$ symbols of the path to be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ with $\pi_t = l'_u$, apart from the factor $y^t_{l'_u}$, one of the following things must happen:
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor u/2 \rfloor$-th symbol of $l$ (the $(u-1)$-th symbol of $l'$), that is, $\alpha(t-1,u-1)$. In this way, the $(t-1)$-th symbol is kept by $F$ and the $t$-th symbol is eliminated by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is also blank (the $u$-th symbol of $l'$), that is, $\alpha(t-1,u)$. In this way, both the $(t-1)$-th symbol and the $t$-th symbol are eliminated by $F$.
As a result,

$$\alpha(t,u) = y^t_{l'_u}\left[\alpha(t-1,u-1) + \alpha(t-1,u)\right] \quad \text{if } l'_u = \_$$
Then, what if $l'_u$ is not blank? Note that $F$ involves eliminating all the consecutively repeated symbols and the blank symbols, so apart from the factor $y^t_{l'_u}$, one of the following things must happen:
- the first $t-1$ symbols can be mapped to the first $\lfloor (u-2)/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor (u-2)/2 \rfloor$-th symbol of $l$ (the $(u-2)$-th symbol of $l'$), that is, $\alpha(t-1,u-2)$. In this way, both the $(t-1)$-th symbol and the $t$-th symbol are kept by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor (u-2)/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is blank (the $(u-1)$-th symbol of $l'$), that is, $\alpha(t-1,u-1)$. In this way, the $(t-1)$-th symbol is eliminated by $F$ and the $t$-th symbol is kept by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor u/2 \rfloor$-th symbol of $l$ (the $u$-th symbol of $l'$), that is, $\alpha(t-1,u)$. In this way, the $(t-1)$-th symbol is kept by $F$ and the $t$-th symbol is eliminated by $F$ because it is a repetition of the previous symbol.
As a result,

$$\alpha(t,u) = y^t_{l'_u}\left[\alpha(t-1,u-2) + \alpha(t-1,u-1) + \alpha(t-1,u)\right] \quad \text{if } l'_u \neq \_$$
However, there is an uncovered case: what if $l$ has consecutively repeated characters, so that $l'_u = l'_{u-2}$? You may think it is impossible, but think of the two l's in 'hello', where $l'$ is '_h_e_l_l_o_'. The function $F$ merges two adjacent occurrences of the same character into one, so a valid path must separate them with a blank; the path cannot jump from $l'_{u-2}$ directly to $l'_u$, and the term $\alpha(t-1,u-2)$ must be dropped.
To summarise,

$$\alpha(t,u) = \begin{cases} y^t_{l'_u}\left[\alpha(t-1,u) + \alpha(t-1,u-1)\right] & \text{if } l'_u = \_ \text{ or } l'_{u-2} = l'_u \\ y^t_{l'_u}\left[\alpha(t-1,u) + \alpha(t-1,u-1) + \alpha(t-1,u-2)\right] & \text{otherwise} \end{cases}$$
To make the formula compatible when $u=1$ and $u=2$, we must have $\alpha(t,0)$ defined; we can define it to be 0. Finally, a path maps to the complete labelling exactly when it ends at $l'_{U'}$ (the final blank) or $l'_{U'-1}$ (the final character), so $P(l|x) = \alpha(T,U') + \alpha(T,U'-1)$.
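Here is a sketch of the forward pass in Python, reusing the table `y` from the introduction; the function name `forward` and the array layout are my own choices:

```python
import numpy as np

def forward(y, label, blank="_"):
    """Compute the forward variables alpha(t, u) by the recursion above.

    y     : dict mapping each symbol of A' to its per-timestep probabilities
    label : the target labelling l, e.g. "cat"
    """
    T = len(next(iter(y.values())))
    lp = blank + blank.join(label) + blank   # augmented labelling l', U' = 2U + 1
    Up = len(lp)

    alpha = np.zeros((T, Up))
    alpha[0, 0] = y[blank][0]                # alpha(1,1) = y^1_blank
    alpha[0, 1] = y[lp[1]][0]                # alpha(1,2) = y^1_{l_1}

    for t in range(1, T):
        for u in range(Up):
            s = alpha[t - 1, u]
            if u >= 1:
                s += alpha[t - 1, u - 1]
            # alpha(t-1, u-2) is only added for a non-blank, non-repeated symbol
            if u >= 2 and lp[u] != blank and lp[u] != lp[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = y[lp[u]][t] * s
    return alpha, lp

alpha, lp = forward(y, "cat")
# P(l|x) = alpha(T, U') + alpha(T, U'-1), matching the brute force above:
print(alpha[-1, -1] + alpha[-1, -2])  # ≈ 0.0275
```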
the backward pass
Now we come to the backward pass. We define another variable called $\beta(t,u)$ to be the probability that the last $T-t$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling, given that the $t$-th symbol of the path is $l'_u$.
Now, how do we estimate $\beta(t,u)$? Suppose the length of the path is $T$ and the total length of the augmented labelling $l'$ is $U'$. The variable $\beta(T,U')$ is the probability that the last $T-T$ symbols (no symbols) can be mapped to the last $U - \lfloor U'/2 \rfloor = 0$ symbols (no symbols), given that $\pi_T$ is equal to $l'_{U'}$. This is a no-symbols to no-symbols map, and its probability is dictated to be 1. The case of $\beta(T,U'-1)$ is similar, so $\beta(T,U') = \beta(T,U'-1) = 1$; for $u < U'-1$, some characters of $l$ could never be produced, so $\beta(T,u) = 0$.
How do we estimate $\beta(t,u)$ in the general case? In order for the last $T-t$ symbols of the path to be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling $l$, given that $\pi_t = l'_u$ where $l'_u$ is blank, one of the following things must happen:
- the $(t+1)$-th symbol of the path is also blank, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of $l$; this contributes $y^{t+1}_{l'_u}\beta(t+1,u)$.
- the $(t+1)$-th symbol of the path is not blank, instead it is $l'_{u+1}$, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor (u+1)/2 \rfloor$ symbols of $l$; this contributes $y^{t+1}_{l'_{u+1}}\beta(t+1,u+1)$.
As a result, we can imitate the inference in the last section (including the treatment of non-blank symbols and of consecutively repeated characters) and get the formula for $\beta(t,u)$:

$$\beta(t,u) = \begin{cases} y^{t+1}_{l'_u}\beta(t+1,u) + y^{t+1}_{l'_{u+1}}\beta(t+1,u+1) & \text{if } l'_u = \_ \text{ or } l'_{u+2} = l'_u \\ y^{t+1}_{l'_u}\beta(t+1,u) + y^{t+1}_{l'_{u+1}}\beta(t+1,u+1) + y^{t+1}_{l'_{u+2}}\beta(t+1,u+2) & \text{otherwise} \end{cases}$$

where we define $\beta(t,u) = 0$ for $u > U'$ so that the formula stays valid at the boundary.
Now we can compute $\alpha(t,u)$ and $\beta(t,u)$ for each $t \in \{1,2,\cdots,T\}$ and $u \in \{1,2,\cdots,U'\}$.
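A matching sketch of the backward pass, continuing with the definitions from the previous snippets (the function name `backward` is my own):

```python
def backward(y, label, blank="_"):
    """Compute the backward variables beta(t, u) by the mirrored recursion."""
    T = len(next(iter(y.values())))
    lp = blank + blank.join(label) + blank
    Up = len(lp)

    beta = np.zeros((T, Up))
    beta[T - 1, Up - 1] = 1.0                # beta(T, U') = 1
    beta[T - 1, Up - 2] = 1.0                # beta(T, U'-1) = 1

    for t in range(T - 2, -1, -1):
        for u in range(Up):
            s = y[lp[u]][t + 1] * beta[t + 1, u]
            if u + 1 < Up:
                s += y[lp[u + 1]][t + 1] * beta[t + 1, u + 1]
            # the u+2 jump is only allowed for a non-blank, non-repeated symbol
            if u + 2 < Up and lp[u] != blank and lp[u + 2] != lp[u]:
                s += y[lp[u + 2]][t + 1] * beta[t + 1, u + 2]
            beta[t, u] = s
    return beta

beta = backward(y, "cat")
# As the next section shows, sum_u alpha(t,u) * beta(t,u) recovers P(l|x) for every t:
print((alpha * beta).sum(axis=1))  # each entry ≈ 0.0275
```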
loss function
Note that $\alpha(t,u)$ means the probability that the first $t$ symbols of a path can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ with $\pi_t = l'_u$, and $\beta(t,u)$ means the probability that the last $T-t$ symbols complete the remainder of $l$ given $\pi_t = l'_u$. Their product $\alpha(t,u)\beta(t,u)$ is therefore the total probability of all the paths mapped to $l$ that pass through $l'_u$ at time $t$, and summing over $u$ recovers the full probability for any $t$:

$$P(l|x) = \sum_{u=1}^{U'} \alpha(t,u)\beta(t,u)$$

For a particular training example $x$ with the ground truth $l$, we can define the loss function

$$L(x,l) = -\ln P(l|x)$$
Then we want to get the loss gradient $\frac{\partial L(x,l)}{\partial y^t_k}$, which is just the partial derivative of the loss function with respect to the output of the RNN. Recall that $y^t_k$ is the probability that the RNN outputs label $k$ at time $t$, and note that

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{1}{P(l|x)} \frac{\partial P(l|x)}{\partial y^t_k}$$
Then what is $\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k}$ for a particular $t$ and $u$? Observe that

$$\alpha(t,u)\beta(t,u) = \sum_{\pi \in X(t,u)} \prod_{i=1}^{T} y^i_{\pi_i}$$

where $X(t,u)$ is the set of all the paths that can be mapped to $l$ and whose $t$-th symbol is $l'_u$. If $l'_u = k$, the factor $y^t_k$ appears exactly once in every product, so

$$\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k} = \frac{\alpha(t,u)\beta(t,u)}{y^t_k}$$

On the other hand, if $l'_u \neq k$, then the term $y^t_k$ does not appear on the right-hand side of the expression; as a result, $\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k} = 0$. We can conclude from the two conditions that

$$\frac{\partial P(l|x)}{\partial y^t_k} = \frac{1}{y^t_k} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$

where $B(l,k) = \{u : l'_u = k\}$ is the set of positions in $l'$ where the symbol $k$ occurs.
Then we can obtain that

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{1}{P(l|x)\, y^t_k} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
Congratulations! We have obtained the formula for $\frac{\partial L(x,l)}{\partial y^t_k}$! However, we must take the final step. How do we get the RNN's output $y^t_k$? In fact, we put the activation $a^t_k$ into a softmax layer, and the formula for the softmax layer is

$$y^t_k = \frac{e^{a^t_k}}{\sum_{k' \in A'} e^{a^t_{k'}}}$$

Remember that $A'$ is the augmented alphabet (including the blank symbol). Then

$$\frac{\partial y^t_{k'}}{\partial a^t_k} = y^t_{k'}\,\delta_{kk'} - y^t_{k'}\, y^t_k, \qquad \frac{\partial L(x,l)}{\partial a^t_k} = \sum_{k' \in A'} \frac{\partial L(x,l)}{\partial y^t_{k'}} \frac{\partial y^t_{k'}}{\partial a^t_k}$$
Take a close look at the summed terms. If $k' = k$ then $\delta_{kk'} = 1$, and the term equals

$$\frac{\partial L(x,l)}{\partial y^t_k}\left(y^t_k - (y^t_k)^2\right) = -\frac{1 - y^t_k}{P(l|x)} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
Then, if $k' \neq k$, $\delta_{kk'} = 0$, and the term equals

$$\frac{\partial L(x,l)}{\partial y^t_{k'}}\left(-y^t_{k'}\, y^t_k\right) = \frac{y^t_k}{P(l|x)} \sum_{u \in B(l,k')} \alpha(t,u)\beta(t,u)$$
Then we sum them up over $k'$; using $\sum_{u=1}^{U'} \alpha(t,u)\beta(t,u) = P(l|x)$, we finally get

$$\frac{\partial L(x,l)}{\partial a^t_k} = y^t_k - \frac{1}{P(l|x)} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
This is the final formula that we want. It is easily observed from the expression that all of $y^t_k$, $P(l|x)$ and $\alpha(t,u)\beta(t,u)$ can be computed efficiently.
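To close the loop, here is a sketch that combines the snippets above into the loss and its gradient. Note that the formula gives the gradient with respect to the pre-softmax activations $a^t_k$, so this toy example simply treats the table `y` as if it had come out of a softmax layer; the function name `ctc_loss_and_grad` is my own.

```python
def ctc_loss_and_grad(y, label, blank="_"):
    """L(x,l) = -ln P(l|x) and dL/da^t_k for every symbol k and timestep t."""
    alpha, lp = forward(y, label, blank)
    beta = backward(y, label, blank)
    ab = alpha * beta
    p = ab[0].sum()                     # P(l|x); any row sums to the same value
    T = alpha.shape[0]
    grad = {
        k: [y[k][t] - sum(ab[t, u] for u in range(len(lp)) if lp[u] == k) / p
            for t in range(T)]
        for k in y                      # k ranges over the augmented alphabet A'
    }
    return -np.log(p), grad

loss, grad = ctc_loss_and_grad(y, "cat")
print(loss)       # -ln(0.0275) ≈ 3.59
print(grad["_"])  # gradient w.r.t. the blank activations a^t_blank
```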