Kaldi’s PLDA implementation is based on [1], the so-called two-covariance PLDA of [2]. The authors derive clean update formulas for EM training and give a detailed comment in the source code. Here we add some explanations to make the derivation easier to follow.
A PDF version of this note can be found here.
1. Background
Recall that PLDA assumes a two-stage generative process:

1) generate the class center according to:

$$m_k \sim \mathcal{N}(\mu, \Phi_b)$$

2) then, generate the observed data by:

$$z_{ki} \sim \mathcal{N}(m_k, \Phi_w)$$
Here, $\mu$ is estimated by the global mean value:

$$\mu = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k} z_{ki}$$

where $z_{ki}$ denotes the $i$-th sample of the $k$-th class.
So let’s turn to the estimation of $\Phi_b$ and $\Phi_w$.

Note that, as $\mu$ is fixed, we can remove it from all samples. Hereafter, we assume all samples have been pre-processed by subtracting $\mu$ from them, so the class centers follow $m_k \sim \mathcal{N}(0, \Phi_b)$.
The prior distribution of an arbitrary sample $z$ is:

$$z \sim \mathcal{N}(0,\ \Phi_b + \Phi_w)$$
Let’s suppose the mean of a particular class is $m$, and suppose that the class has $n$ examples. Then

$$m \sim \mathcal{N}\!\left(0,\ \Phi_b + \frac{\Phi_w}{n}\right)$$

i.e. $m$ is Gaussian-distributed with zero mean and variance equal to the between-class variance plus $1/n$ times the within-class variance. Now, $m$ is observed: it is the average of all observed samples of that class.
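As a sanity check, here is a minimal Monte Carlo sketch of these two marginals. All values (the covariances `Phi_b`, `Phi_w`, the class size `n`, and the number of classes) are made-up toy choices for illustration, not anything taken from Kaldi:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi_b = np.array([[2.0, 0.5], [0.5, 1.0]])   # toy between-class covariance
Phi_w = np.array([[1.0, 0.2], [0.2, 0.8]])   # toy within-class covariance
n, num_classes = 10, 100000

# Generative process (global mean already removed):
# class center m_k ~ N(0, Phi_b), then z_ki ~ N(m_k, Phi_w).
centers = rng.multivariate_normal(np.zeros(2), Phi_b, size=num_classes)
noise = rng.multivariate_normal(np.zeros(2), Phi_w, size=(num_classes, n))
samples = centers[:, None, :] + noise

# Marginal of a single sample: cov(z) approaches Phi_b + Phi_w.
print(np.cov(samples.reshape(-1, 2).T))

# Marginal of a class average over n samples: cov(m) approaches Phi_b + Phi_w / n.
print(np.cov(samples.mean(axis=1).T))
```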
2. EM
We’re doing an EM procedure where we treat $m$ as the sum of two variables:

$$m = x + y$$

where $x \sim \mathcal{N}(0, \Phi_b)$ and $y \sim \mathcal{N}(0, \Phi_w/n)$. The distribution of $x$ will contribute to the statistics of $\Phi_b$, and that of $y$ to $\Phi_w$.
2.1 E Step
Note that given $m$, there’s only one latent variable in effect: since $y = m - x$, we can focus on working out the distribution of $x$, and then we can very simply get the distribution of $y$.
Given $m$, the posterior distribution of $x$ is:

$$p(x \mid m) \propto p(m \mid x)\, p(x) = \mathcal{N}(m;\ x,\ \Phi_w/n)\ \mathcal{N}(x;\ 0,\ \Phi_b)$$
Hereafter, we drop the condition on $m$ for brevity.
Since the product of two Gaussians is again a Gaussian, we get:

$$x \mid m \sim \mathcal{N}(w, \hat{\Phi})$$

where $\hat{\Phi} = (\Phi_b^{-1} + n\Phi_w^{-1})^{-1}$ and $w = \hat{\Phi}\, n\Phi_w^{-1} m$.
$\hat{\Phi}$ and $w$ can be inferred by comparing the first- and second-order coefficients with the standard form of the log Gaussian, as Kaldi’s comment does:

$$\log p(x) = C - \frac{1}{2}\left[x^T \Phi_b^{-1} x + (m-x)^T\, n\Phi_w^{-1}\, (m-x)\right] = C - \frac{1}{2} x^T(\Phi_b^{-1} + n\Phi_w^{-1})\,x + x^T z$$
Note: the C is different from line to line.
where $z = n\Phi_w^{-1} m$, and we can write this as:

$$\log p(x) = C - \frac{1}{2}(x - w)^T(\Phi_b^{-1} + n\Phi_w^{-1})(x - w)$$
where $x^T(\Phi_b^{-1} + n\Phi_w^{-1})\,w = x^T z$, i.e.

$$(\Phi_b^{-1} + n\Phi_w^{-1})\, w = z = n\Phi_w^{-1} m,$$

so

$$w = (\Phi_b^{-1} + n\Phi_w^{-1})^{-1}\, n\Phi_w^{-1} m = \hat{\Phi}\, n\Phi_w^{-1} m.$$
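The result can be cross-checked against standard Gaussian conditioning on the joint distribution of $(x, m)$, where $\mathrm{Cov}(m) = \Phi_b + \Phi_w/n$ and $\mathrm{Cov}(x, m) = \Phi_b$. Below is a small NumPy sketch with assumed toy values; it verifies numerically that both routes give the same posterior:

```python
import numpy as np

inv = np.linalg.inv
Phi_b = np.array([[2.0, 0.5], [0.5, 1.0]])   # toy between-class covariance
Phi_w = np.array([[1.0, 0.2], [0.2, 0.8]])   # toy within-class covariance
n = 10
m = np.array([0.7, -0.3])                    # an arbitrary observed class mean

# Posterior from completing the square, as derived above:
Phi_hat = inv(inv(Phi_b) + n * inv(Phi_w))   # (Phi_b^-1 + n Phi_w^-1)^-1
w = Phi_hat @ (n * inv(Phi_w) @ m)           # Phi_hat * n * Phi_w^-1 * m

# Same posterior via conditioning the joint Gaussian of (x, m):
K = Phi_b @ inv(Phi_b + Phi_w / n)           # regression coefficient of x on m
w_ref = K @ m                                # E[x | m]
Phi_ref = Phi_b - K @ Phi_b                  # Cov[x | m]

print(np.allclose(w, w_ref))                 # True
print(np.allclose(Phi_hat, Phi_ref))         # True
```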
2.2 M Step
The objective function of the EM update, written here for the $y$ part of a single class (the $x$ part is analogous), is:

$$\mathcal{Q}(\Phi_w/n) = \mathbb{E}\left[\log p(y)\right] = C - \frac{1}{2}\log\left|\Phi_w/n\right| - \frac{1}{2}\,\mathrm{tr}\!\left((\Phi_w/n)^{-1}\,\mathbb{E}[yy^T]\right)$$

where $\mathbb{E}[yy^T] = \hat{\Phi} + (m-w)(m-w)^T$, since $y = m - x$ has posterior mean $m - w$ and the same posterior covariance $\hat{\Phi}$ as $x$.
The derivative w.r.t. $\Phi_w/n$ is as follows:

$$\frac{\partial \mathcal{Q}}{\partial(\Phi_w/n)} = -\frac{1}{2}(\Phi_w/n)^{-1} + \frac{1}{2}(\Phi_w/n)^{-1}\,\mathbb{E}[yy^T]\,(\Phi_w/n)^{-1}$$
Setting it to zero, we have:

$$\frac{\Phi_w}{n} = \mathbb{E}[yy^T] = \hat{\Phi} + (m-w)(m-w)^T$$

i.e. this class contributes $n\left(\hat{\Phi} + (m-w)(m-w)^T\right)$ to the estimate of $\Phi_w$.
Similarly, for the $x$ part we have:

$$\Phi_b = \mathbb{E}[xx^T] = \hat{\Phi} + ww^T$$
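Since the zero-derivative argument only finds a stationary point, here is a quick numerical check (with an arbitrary SPD matrix standing in for $\mathbb{E}[yy^T]$, values assumed for illustration) that $\Phi = \mathbb{E}[yy^T]$ is indeed a maximum of $\mathcal{Q}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def Q(Phi, second_moment):
    """Q(Phi) = -1/2 log|Phi| - 1/2 tr(Phi^-1 E[y y^T]), dropping the constant C."""
    _, logdet = np.linalg.slogdet(Phi)
    return -0.5 * logdet - 0.5 * np.trace(np.linalg.inv(Phi) @ second_moment)

A = rng.standard_normal((3, 3))
Eyy = A @ A.T + 3.0 * np.eye(3)              # an arbitrary SPD "second moment"

q_at_optimum = Q(Eyy, Eyy)                   # objective at Phi = E[y y^T]
for _ in range(5):
    B = rng.standard_normal((3, 3))
    delta = 0.2 * (B + B.T) / 2.0            # small random symmetric perturbation
    print(Q(Eyy + delta, Eyy) <= q_at_optimum)   # True on every trial
```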
3. Summary
To recap: given the samples of the $k$-th class, we can calculate the following statistics:

$$n_k, \qquad m_k = \frac{1}{n_k}\sum_i z_{ki}, \qquad \hat{\Phi}_k = (\Phi_b^{-1} + n_k\Phi_w^{-1})^{-1}, \qquad w_k = \hat{\Phi}_k\, n_k\Phi_w^{-1} m_k$$
Given $K$ classes, the updated estimates via EM will be:

$$\Phi_b = \frac{1}{K}\sum_{k=1}^{K}\left(\hat{\Phi}_k + w_k w_k^T\right)$$

$$\Phi_w = \frac{1}{K}\sum_{k=1}^{K} n_k\left(\hat{\Phi}_k + (m_k - w_k)(m_k - w_k)^T\right)$$
Finally, Kaldi uses the following update formula for $\Phi_w$:

$$\Phi_w = \frac{1}{N}\left(S + \sum_{k=1}^{K} n_k\left(\hat{\Phi}_k + (m_k - w_k)(m_k - w_k)^T\right)\right)$$

where $N$ is the total number of samples, $S = \sum_k\sum_i (z_{ki} - c_k)(z_{ki} - c_k)^T$ is the within-class scatter matrix, and $c_k = \frac{1}{n_k}\sum_i z_{ki}$ is the mean of the samples of the $k$-th class.
Note that only the second term is the result of the EM derivation above, since $m = x + y$ takes just the pooled class means into consideration; the scatter matrix $S$ is added to account for the per-sample variation around each class mean.
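To tie the pieces together, here is a sketch of one full EM iteration implementing the update formulas above in NumPy. This is our own illustrative code, not Kaldi’s actual C++ implementation, and all names and toy values are ours:

```python
import numpy as np

def em_step(class_means, class_counts, S, Phi_b, Phi_w):
    """One EM iteration. class_means: (K, d) per-class means with the global
    mean removed; class_counts: (K,) counts n_k; S: within-class scatter."""
    inv = np.linalg.inv
    K, d = class_means.shape
    N = class_counts.sum()
    b_stats = np.zeros((d, d))
    w_stats = S.copy()                       # scatter-matrix term of Phi_w
    for m, n in zip(class_means, class_counts):
        Phi_hat = inv(inv(Phi_b) + n * inv(Phi_w))
        w = Phi_hat @ (n * inv(Phi_w) @ m)   # posterior mean of x
        b_stats += Phi_hat + np.outer(w, w)
        w_stats += n * (Phi_hat + np.outer(m - w, m - w))
    return b_stats / K, w_stats / N

# Toy usage: the estimates should move toward the true covariances.
rng = np.random.default_rng(2)
d, K, n = 2, 500, 20
true_b = np.array([[2.0, 0.5], [0.5, 1.0]])
true_w = np.array([[1.0, 0.2], [0.2, 0.8]])
centers = rng.multivariate_normal(np.zeros(d), true_b, size=K)
data = centers[:, None, :] + rng.multivariate_normal(np.zeros(d), true_w, size=(K, n))
mu = data.reshape(-1, d).mean(axis=0)        # global mean, removed up front
means = data.mean(axis=1) - mu
diffs = data - data.mean(axis=1, keepdims=True)
S = np.einsum('kid,kie->de', diffs, diffs)   # within-class scatter matrix
Phi_b, Phi_w = np.eye(d), np.eye(d)          # crude initialization
for _ in range(10):
    Phi_b, Phi_w = em_step(means, np.full(K, n), S, Phi_b, Phi_w)
print(Phi_b)                                 # ~ true_b
print(Phi_w)                                 # ~ true_w
```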
For other variants of EM training, see [2] and the references therein.
References