introduction
The CTC loss is a loss function that is commonly used with RNNs. Although it is widely used, the existing tutorials are difficult to follow. As a result, I want to write a tutorial that teaches the basic principle and usage of the CTC loss.
Suppose the alphabet is $A$; we augment it with a blank symbol '_' to obtain $A' = A \cup \{\_\}$. At every timestamp $t$, the RNN outputs a probability $y^t_k$ for every symbol $k \in A'$. A length-$T$ sequence $\pi$ over $A'$ is called a path, and its probability given the input $x$ is

$$P(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}$$
That is, the probability of outputting a path $\pi$ is the product of the probabilities of outputting each of its characters at the corresponding timestamps. Take a simple example: suppose the input is a sound sequence representing the word 'cat', and you dictate that the output of the RNN has length 5. In this condition, the output of the RNN may be:
|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| a | 0.1 | 0.1 | 0.5 | 0.7 | … |
| c | 0.6 | 0.1 | 0.2 | 0.1 | … |
| t | 0.2 | 0.2 | 0.2 | 0.1 | … |
| _ | 0.1 | 0.6 | 0.1 | 0.1 | … |
Here, for example, $y^1_a = 0.1$ means that at timestamp 1, the probability of the RNN outputting 'a' is 0.1. We can also calculate the probability of outputting a specific path, for example

$$P(\pi = \text{c\_aat} \mid x) = y^1_c \, y^2_{\_} \, y^3_a \, y^4_a \, y^5_t$$
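To make this concrete, here is a minimal sketch in Python (the helper name `path_prob` and the dict layout of `y` are my own illustrative choices), using the four fully shown columns of the table above:

```python
# Per-timestep output distributions from the table above
# (only the four fully shown timesteps are used).
y = {
    "a": [0.1, 0.1, 0.5, 0.7],
    "c": [0.6, 0.1, 0.2, 0.1],
    "t": [0.2, 0.2, 0.2, 0.1],
    "_": [0.1, 0.6, 0.1, 0.1],
}

def path_prob(path, y):
    """P(pi|x) = product over t of y^t_{pi_t}."""
    p = 1.0
    for t, symbol in enumerate(path):
        p *= y[symbol][t]
    return p

print(path_prob("c_at", y))  # 0.6 * 0.6 * 0.5 * 0.1 = 0.018
```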
You can easily see from the example that we cannot compare the output of the RNN directly with the desired output. How can we design an efficient loss function that takes the output of the RNN and the desired output, and gives us the loss?
from path to label
How can we map a sequence in the RNN's output, called a path, to the estimated output of the neural network, called a labelling sequence? When we use the words labelling sequence, we want to express that we want to compare it with the input's ground truth. We denote it as $l$. Now please think: how can we infer $l$ from a path? We use a function $F$ that first merges consecutively repeated symbols and then removes all the blank symbols; for example, $F(\text{c\_aat}) = F(\text{ccaat}) = \text{cat}$.
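Here is a minimal sketch of $F$ in Python, treating paths as sequences of characters with '_' as the blank (the implementation details are my own):

```python
def F(path, blank="_"):
    """Map a path to a labelling: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev:          # merge consecutively repeated symbols
            out.append(s)
        prev = s
    return "".join(s for s in out if s != blank)  # remove the blanks

assert F("c_aat") == "cat"
assert F("ccaat") == "cat"
assert F("__cat") == "cat"
assert F("c_aa_t") == "cat"
```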
Then we can answer the question: what is the probability of obtaining a particular labelling $l$ given the input $x$?
For illustration, let's consider the cat example. Suppose the length of the output of the RNN, that is, the length of the path, is 5. Then what is the probability of obtaining the labelling 'cat' given a particular input sequence $x$? It is just the sum of the probabilities of all the paths that lead to the labelling 'cat', for example 'c_aat', 'ccaat', '__cat' and so on. That is, in order to estimate $P(l=\text{cat}|x)$, we have to sum up $P(\pi=\text{c\_aat}|x)$, $P(\pi=\text{ccaat}|x)$, $P(\pi=\text{\_\_cat}|x)$, … and the number of such paths grows exponentially with the path length (a brute-force check follows the list below)! As a result, we have to handle the following tasks:
- put forward an efficient algorithm to estimate $P(l|x)$
- associate a loss function with $P(l|x)$ and suggest a method to compute its derivative efficiently
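As promised, a brute-force check (reusing `F`, `path_prob` and `y` from the snippets above; the function name is my own):

```python
from itertools import product

def label_prob_brute_force(label, y, T, alphabet="act_"):
    """Sum P(pi|x) over every length-T path pi with F(pi) == label.

    The loop visits len(alphabet)**T paths, which is exactly why
    we will need a smarter algorithm below.
    """
    total = 0.0
    for path in product(alphabet, repeat=T):
        if F(path) == label:
            total += path_prob(path, y)
    return total

# Restricted to T = 4, since the table above shows four full timesteps:
print(label_prob_brute_force("cat", y, T=4))  # ≈ 0.0275
```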
estimate P(l|x)
the forward pass
We take advantage of dynamic programming to estimate $P(l|x)$ quickly.
Firstly, let's denote the length of the labelling $l$ as $U$. We augment $l$ by inserting a blank symbol at the beginning, at the end, and between every pair of adjacent characters, obtaining $l'$ of length $U' = 2U + 1$; for 'cat', $l'$ is '_c_a_t_'. The forward variable $\alpha(t,u)$ concerns the paths with the following two properties:

- the path's first $t$ characters are mapped (by $F$) to the first $\lfloor u/2 \rfloor$ characters of $l$;
- the $t$-th character of the path is the $u$-th character of $l'$.
How do we express these properties in a mathematical way? First, $A'^t$ is the set of all length-$t$ sequences over $A'$, that is, the set of all the paths' first $t$ characters. We can denote the first $\lfloor u/2 \rfloor$ characters of $l$ as $l_{1:\lfloor u/2 \rfloor}$, where we truncate the integer $u/2$. For example, if $l$ is 'cat' and $u = 5$, then $\lfloor u/2 \rfloor = 2$ and $l_{1:2} = \text{ca}$.
Then, the probability $\alpha(t,u)$ is defined to be the total probability of all the length-$t$ path prefixes that satisfy the two properties:

$$\alpha(t,u) = \sum_{\substack{\pi \in A'^t:\ F(\pi) = l_{1:\lfloor u/2 \rfloor},\ \pi_t = l'_u}} \ \prod_{i=1}^{t} y^i_{\pi_i}$$
So, how do we estimate $\alpha(t,u)$? We already know that $l'_1 = \_$ and $l'_2 = l_1 = \text{c}$. As a result, $\alpha(1,1) = y^1_{\_}$ and $\alpha(1,2) = y^1_{l_1}$ (in the cat example, 0.1 and 0.6). It takes no effort to figure out that $\alpha(1,3) = 0$, because a single blank symbol cannot be mapped to 'c'. In fact, $\alpha(1,u) = 0$ for all $u \geq 3$. In this particular example, $u$ can be as large as $U' = 7$.
Now consider the general case. For $\alpha(t,u)$, if $l'_u$ is blank, then in order for the first $t$ symbols of the path to be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ with $\pi_t = l'_u$, apart from the factor $y^t_{l'_u}$, one of the following things must happen:
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor u/2 \rfloor$-th symbol of $l$ (the $(u-1)$-th symbol of $l'$), that is, $\alpha(t-1,u-1)$. In this way, the $(t-1)$-th symbol is kept by $F$ and the $t$-th symbol is eliminated by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is also blank (the $u$-th symbol of $l'$), that is, $\alpha(t-1,u)$. In this way, both the $(t-1)$-th symbol and the $t$-th symbol are eliminated by $F$.
As a result,

$$\alpha(t,u) = y^t_{l'_u}\left[\alpha(t-1,u-1) + \alpha(t-1,u)\right] \quad \text{if } l'_u = \_$$
Then, what if $l'_u$ is not blank? Note that $F$ involves eliminating all the consecutively repeated symbols and the blank symbols, so apart from the factor $y^t_{l'_u}$, one of the following things must happen:
- the first $t-1$ symbols can be mapped to the first $\lfloor (u-2)/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor (u-2)/2 \rfloor$-th symbol of $l$ (the $(u-2)$-th symbol of $l'$), that is, $\alpha(t-1,u-2)$. In this way, both the $(t-1)$-th symbol and the $t$-th symbol are kept by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor (u-2)/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is blank (the $(u-1)$-th symbol of $l'$), that is, $\alpha(t-1,u-1)$. In this way, the $(t-1)$-th symbol is eliminated by $F$ and the $t$-th symbol is kept by $F$.
- the first $t-1$ symbols can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ and the $(t-1)$-th symbol of the path is the $\lfloor u/2 \rfloor$-th symbol of $l$ (the $u$-th symbol of $l'$), that is, $\alpha(t-1,u)$. In this way, the $(t-1)$-th symbol is kept by $F$ and the $t$-th symbol is eliminated by $F$ because it is a repetition of the previous symbol.
As a result,

$$\alpha(t,u) = y^t_{l'_u}\left[\alpha(t-1,u-2) + \alpha(t-1,u-1) + \alpha(t-1,u)\right] \quad \text{if } l'_u \neq \_$$
However, there is an uncovered case: what if $l$ has consecutively repeated characters, so that $l'_u = l'_{u-2}$? You may think it is impossible, but think of the two l's in 'hello', where $l'$ is '_h_e_l_l_o_'. The function $F$ merges two adjacent occurrences of the same character into one, so a valid path must separate them with a blank; the path cannot jump from $l'_{u-2}$ directly to $l'_u$, and the term $\alpha(t-1,u-2)$ must be dropped.
To summarise,

$$\alpha(t,u) = \begin{cases} y^t_{l'_u}\left[\alpha(t-1,u) + \alpha(t-1,u-1)\right] & \text{if } l'_u = \_ \text{ or } l'_{u-2} = l'_u \\ y^t_{l'_u}\left[\alpha(t-1,u) + \alpha(t-1,u-1) + \alpha(t-1,u-2)\right] & \text{otherwise} \end{cases}$$
To make the formula compatible when $u=1$ and $u=2$, we must have $\alpha(t,0)$ defined; we can define it to be 0. Finally, a path maps to the complete labelling exactly when it ends at $l'_{U'}$ (the final blank) or $l'_{U'-1}$ (the final character), so $P(l|x) = \alpha(T,U') + \alpha(T,U'-1)$.
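Here is a sketch of the forward pass in Python, reusing the table `y` from the introduction; the function name `forward` and the array layout are my own choices:

```python
import numpy as np

def forward(y, label, blank="_"):
    """Compute the forward variables alpha(t, u) by the recursion above.

    y     : dict mapping each symbol of A' to its per-timestep probabilities
    label : the target labelling l, e.g. "cat"
    """
    T = len(next(iter(y.values())))
    lp = blank + blank.join(label) + blank   # augmented labelling l', U' = 2U + 1
    Up = len(lp)

    alpha = np.zeros((T, Up))
    alpha[0, 0] = y[blank][0]                # alpha(1,1) = y^1_blank
    alpha[0, 1] = y[lp[1]][0]                # alpha(1,2) = y^1_{l_1}

    for t in range(1, T):
        for u in range(Up):
            s = alpha[t - 1, u]
            if u >= 1:
                s += alpha[t - 1, u - 1]
            # alpha(t-1, u-2) is only added for a non-blank, non-repeated symbol
            if u >= 2 and lp[u] != blank and lp[u] != lp[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = y[lp[u]][t] * s
    return alpha, lp

alpha, lp = forward(y, "cat")
# P(l|x) = alpha(T, U') + alpha(T, U'-1), matching the brute force above:
print(alpha[-1, -1] + alpha[-1, -2])  # ≈ 0.0275
```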
the backward pass
Now we come to the backward pass. We define another variable called $\beta(t,u)$ to be the probability that the last $T-t$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling, given that the $t$-th symbol of the path is $l'_u$.
Now, how do we estimate $\beta(t,u)$? Suppose the length of the path is $T$ and the total length of the augmented labelling $l'$ is $U'$. The variable $\beta(T,U')$ is the probability that the last $T-T$ symbols (no symbols) can be mapped to the last $U - \lfloor U'/2 \rfloor = 0$ symbols (no symbols), given that $\pi_T$ is equal to $l'_{U'}$. This is a no-symbols to no-symbols map, and its probability is dictated to be 1. The case of $\beta(T,U'-1)$ is similar, so $\beta(T,U') = \beta(T,U'-1) = 1$; for $u < U'-1$, some characters of $l$ could never be produced, so $\beta(T,u) = 0$.
How do we estimate $\beta(t,u)$ in the general case? In order for the last $T-t$ symbols of the path to be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of the labelling $l$, given that $\pi_t = l'_u$ where $l'_u$ is blank, one of the following things must happen:
- the $(t+1)$-th symbol of the path is also blank, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor u/2 \rfloor$ symbols of $l$; this contributes $y^{t+1}_{l'_u}\beta(t+1,u)$.
- the $(t+1)$-th symbol of the path is not blank, instead it is $l'_{u+1}$, and the last $T-(t+1)$ symbols of the path can be mapped to the last $U - \lfloor (u+1)/2 \rfloor$ symbols of $l$; this contributes $y^{t+1}_{l'_{u+1}}\beta(t+1,u+1)$.
As a result, we can imitate the inference in the last section (including the treatment of non-blank symbols and of consecutively repeated characters) and get the formula for $\beta(t,u)$:

$$\beta(t,u) = \begin{cases} y^{t+1}_{l'_u}\beta(t+1,u) + y^{t+1}_{l'_{u+1}}\beta(t+1,u+1) & \text{if } l'_u = \_ \text{ or } l'_{u+2} = l'_u \\ y^{t+1}_{l'_u}\beta(t+1,u) + y^{t+1}_{l'_{u+1}}\beta(t+1,u+1) + y^{t+1}_{l'_{u+2}}\beta(t+1,u+2) & \text{otherwise} \end{cases}$$

where we define $\beta(t,u) = 0$ for $u > U'$ so that the formula stays valid at the boundary.
Now we can compute $\alpha(t,u)$ and $\beta(t,u)$ for each $t \in \{1,2,\cdots,T\}$ and $u \in \{1,2,\cdots,U'\}$.
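A matching sketch of the backward pass, continuing with the definitions from the previous snippets (the function name `backward` is my own):

```python
def backward(y, label, blank="_"):
    """Compute the backward variables beta(t, u) by the mirrored recursion."""
    T = len(next(iter(y.values())))
    lp = blank + blank.join(label) + blank
    Up = len(lp)

    beta = np.zeros((T, Up))
    beta[T - 1, Up - 1] = 1.0                # beta(T, U') = 1
    beta[T - 1, Up - 2] = 1.0                # beta(T, U'-1) = 1

    for t in range(T - 2, -1, -1):
        for u in range(Up):
            s = y[lp[u]][t + 1] * beta[t + 1, u]
            if u + 1 < Up:
                s += y[lp[u + 1]][t + 1] * beta[t + 1, u + 1]
            # the u+2 jump is only allowed for a non-blank, non-repeated symbol
            if u + 2 < Up and lp[u] != blank and lp[u + 2] != lp[u]:
                s += y[lp[u + 2]][t + 1] * beta[t + 1, u + 2]
            beta[t, u] = s
    return beta

beta = backward(y, "cat")
# As the next section shows, sum_u alpha(t,u) * beta(t,u) recovers P(l|x) for every t:
print((alpha * beta).sum(axis=1))  # each entry ≈ 0.0275
```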
loss function
Note that $\alpha(t,u)$ means the probability that the first $t$ symbols of a path can be mapped to the first $\lfloor u/2 \rfloor$ symbols of $l$ with $\pi_t = l'_u$, and $\beta(t,u)$ means the probability that the last $T-t$ symbols complete the remainder of $l$ given $\pi_t = l'_u$. Their product $\alpha(t,u)\beta(t,u)$ is therefore the total probability of all the paths mapped to $l$ that pass through $l'_u$ at time $t$, and summing over $u$ recovers the full probability for any $t$:

$$P(l|x) = \sum_{u=1}^{U'} \alpha(t,u)\beta(t,u)$$

For a particular training example $x$ with the ground truth $l$, we can define the loss function

$$L(x,l) = -\ln P(l|x)$$
Then we want to get the loss gradient $\frac{\partial L(x,l)}{\partial y^t_k}$, which is just the partial derivative of the loss function with respect to the output of the RNN. Recall that $y^t_k$ is the probability that the RNN outputs label $k$ at time $t$, and note that

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{1}{P(l|x)} \frac{\partial P(l|x)}{\partial y^t_k}$$
Then what is $\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k}$ for a particular $t$ and $u$? Observe that

$$\alpha(t,u)\beta(t,u) = \sum_{\pi \in X(t,u)} \prod_{i=1}^{T} y^i_{\pi_i}$$

where $X(t,u)$ is the set of all the paths that can be mapped to $l$ and whose $t$-th symbol is $l'_u$. If $l'_u = k$, the factor $y^t_k$ appears exactly once in every product, so

$$\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k} = \frac{\alpha(t,u)\beta(t,u)}{y^t_k}$$

On the other hand, if $l'_u \neq k$, then the term $y^t_k$ does not appear on the right-hand side of the expression; as a result, $\frac{\partial \alpha(t,u)\beta(t,u)}{\partial y^t_k} = 0$. We can conclude from the two conditions that

$$\frac{\partial P(l|x)}{\partial y^t_k} = \frac{1}{y^t_k} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$

where $B(l,k) = \{u : l'_u = k\}$ is the set of positions in $l'$ where the symbol $k$ occurs.
Then we can obtain that

$$\frac{\partial L(x,l)}{\partial y^t_k} = -\frac{1}{P(l|x)\, y^t_k} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
Congratulations! We have obtained the formula for $\frac{\partial L(x,l)}{\partial y^t_k}$! However, we must take the final step. How do we get the RNN's output $y^t_k$? In fact, we put the activation $a^t_k$ into a softmax layer, and the formula for the softmax layer is

$$y^t_k = \frac{e^{a^t_k}}{\sum_{k' \in A'} e^{a^t_{k'}}}$$

Remember that $A'$ is the augmented alphabet (including the blank symbol). Then

$$\frac{\partial y^t_{k'}}{\partial a^t_k} = y^t_{k'}\,\delta_{kk'} - y^t_{k'}\, y^t_k, \qquad \frac{\partial L(x,l)}{\partial a^t_k} = \sum_{k' \in A'} \frac{\partial L(x,l)}{\partial y^t_{k'}} \frac{\partial y^t_{k'}}{\partial a^t_k}$$
Take a close look at the summed terms. If $k' = k$ then $\delta_{kk'} = 1$, and the term equals

$$\frac{\partial L(x,l)}{\partial y^t_k}\left(y^t_k - (y^t_k)^2\right) = -\frac{1 - y^t_k}{P(l|x)} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
Then, if $k' \neq k$, $\delta_{kk'} = 0$, and the term equals

$$\frac{\partial L(x,l)}{\partial y^t_{k'}}\left(-y^t_{k'}\, y^t_k\right) = \frac{y^t_k}{P(l|x)} \sum_{u \in B(l,k')} \alpha(t,u)\beta(t,u)$$
Then we sum them up over $k'$; using $\sum_{u=1}^{U'} \alpha(t,u)\beta(t,u) = P(l|x)$, we finally get

$$\frac{\partial L(x,l)}{\partial a^t_k} = y^t_k - \frac{1}{P(l|x)} \sum_{u \in B(l,k)} \alpha(t,u)\beta(t,u)$$
This is the final formula that we want. It is easily observed from the expression that all of $y^t_k$, $P(l|x)$ and $\alpha(t,u)\beta(t,u)$ can be computed efficiently.
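To close the loop, here is a sketch that combines the snippets above into the loss and its gradient. Note that the formula gives the gradient with respect to the pre-softmax activations $a^t_k$, so this toy example simply treats the table `y` as if it had come out of a softmax layer; the function name `ctc_loss_and_grad` is my own.

```python
def ctc_loss_and_grad(y, label, blank="_"):
    """L(x,l) = -ln P(l|x) and dL/da^t_k for every symbol k and timestep t."""
    alpha, lp = forward(y, label, blank)
    beta = backward(y, label, blank)
    ab = alpha * beta
    p = ab[0].sum()                     # P(l|x); any row sums to the same value
    T = alpha.shape[0]
    grad = {
        k: [y[k][t] - sum(ab[t, u] for u in range(len(lp)) if lp[u] == k) / p
            for t in range(T)]
        for k in y                      # k ranges over the augmented alphabet A'
    }
    return -np.log(p), grad

loss, grad = ctc_loss_and_grad(y, "cat")
print(loss)       # -ln(0.0275) ≈ 3.59
print(grad["_"])  # gradient w.r.t. the blank activations a^t_blank
```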