
PRML Chapter 10 Approximate Inference

10.1 Variational Inference

For observed variables $X = \{x_1, \dots, x_N\}$ and latent variables $Z = \{z_1, \dots, z_N\}$, our probabilistic model specifies the joint distribution $p(X, Z)$, and our goal is to find approximations for the posterior distribution $p(Z|X)$ as well as for the model evidence $p(X)$. As in our discussion of EM, we can decompose the log marginal probability as:

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$

where

$$\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$$

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{ \frac{p(Z|X)}{q(Z)} \right\} dZ$$

10.1.1 Factorized distributions

Suppose we partition the elements of $Z$ into $M$ disjoint groups and assume that $q$ factorizes with respect to these groups:

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

Substituting this into the formula above, keeping only the dependence on one factor, and denoting $q_j(Z_j)$ simply by $q_j$, we obtain:

$$\mathcal{L}(q) = \int q_j \ln \tilde{p}(X, Z_j)\, dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$$

where we have defined a new distribution $\tilde{p}(X, Z_j)$ by the relation:

$$\ln \tilde{p}(X, Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}$$

$$\mathbb{E}_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i$$

Maximizing $\mathcal{L}(q)$ with respect to $q_j$ is equivalent to minimizing the Kullback-Leibler divergence $\mathrm{KL}(q_j \,\|\, \tilde{p})$, and the minimum occurs when $q_j(Z_j) = \tilde{p}(X, Z_j)$. Thus we obtain the general expression for the optimal solution $q_j^*(Z_j)$:

$$\ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}$$

10.1.2 Properties of factorized approximations

The two forms of the Kullback-Leibler divergence, $\mathrm{KL}(q\,\|\,p)$ and $\mathrm{KL}(p\,\|\,q)$, are both members of the alpha family of divergences defined by:

$$D_\alpha(p \,\|\, q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\, dx\right)$$
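
$\mathrm{KL}(p\,\|\,q)$ corresponds to the limit $\alpha \to 1$ and $\mathrm{KL}(q\,\|\,p)$ to $\alpha \to -1$. A small numerical check of these limits on a discrete example (sums replacing the integral; the distributions below are made up for illustration):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def alpha_div(p, q, alpha):
    """D_alpha(p||q) = 4/(1-alpha^2) * (1 - sum p^{(1+alpha)/2} q^{(1-alpha)/2})."""
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)))

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(alpha_div(p, q, 1 - 1e-4), kl(p, q))   # nearly equal: alpha -> +1 recovers KL(p||q)
print(alpha_div(p, q, -1 + 1e-4), kl(q, p))  # nearly equal: alpha -> -1 recovers KL(q||p)
```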

10.1.3 Example: The univariate Gaussian
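
As a concrete illustration: for data $x_1, \dots, x_N$ drawn from $\mathcal{N}(x|\mu, \tau^{-1})$ with a Gaussian-Gamma prior $p(\mu|\tau) = \mathcal{N}(\mu|\mu_0, (\lambda_0\tau)^{-1})$, $p(\tau) = \mathrm{Gam}(\tau|a_0, b_0)$, the factorized approximation $q(\mu, \tau) = q(\mu)\,q(\tau)$ gives a Gaussian factor $q(\mu)$ and a Gamma factor $q(\tau)$ whose parameters depend on each other, so they are updated in turn. The sketch below uses the standard mean-field updates for this conjugate model (quoted from memory, so treat the exact hyperparameter formulas as an assumption); the data and prior values are illustrative:

```python
import numpy as np

# Mean-field VB for a univariate Gaussian: q(mu, tau) = q(mu) q(tau),
# with q(mu) = N(mu | mu_N, lam_N^{-1}) and q(tau) = Gam(tau | a_N, b_N).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)      # observed data (illustrative)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3          # prior hyperparameters (illustrative)

E_tau = a0 / b0                                   # initial guess for E[tau]
for _ in range(100):
    # update q(mu): its mean is fixed, its precision depends on E[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # update q(tau): needs E_mu[ sum_n (x_n - mu)^2 + lam0 (mu - mu0)^2 ]
    E_quad = np.sum((x - mu_N) ** 2) + N / lam_N + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * E_quad
    E_tau = a_N / b_N                             # feeds back into q(mu) on the next pass

print("E[mu] =", mu_N, "  E[tau] =", E_tau, " (data generated with precision", 1 / 1.5**2, ")")
```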

10.1.4 Model comparison

Consider a variational distribution over both the latent variables and the model index of the form $q(Z, m) = q(Z|m)\,q(m)$. We can readily verify the following decomposition based on this variational distribution:

$$\ln p(X) = \mathcal{L} - \sum_m \sum_Z q(Z|m)\, q(m) \ln\left\{\frac{p(Z, m|X)}{q(Z|m)\, q(m)}\right\}$$

where $\mathcal{L}$ is a lower bound on $\ln p(X)$ and is given by

$$\mathcal{L} = \sum_m \sum_Z q(Z|m)\, q(m) \ln\left\{\frac{p(Z, X, m)}{q(Z|m)\, q(m)}\right\}$$

Maximizing $\mathcal{L}$ with respect to the distribution $q(m)$ using a Lagrange multiplier gives the result:

$$q(m) \propto p(m)\exp(\mathcal{L}_m)$$

where $\mathcal{L}_m = \sum_Z q(Z|m)\ln\left\{\frac{p(Z, X|m)}{q(Z|m)}\right\}$ is the contribution to the bound from model $m$ alone.
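
In practice each $\mathcal{L}_m$ comes from a separate variational fit of candidate model $m$, and normalizing $p(m)\exp(\mathcal{L}_m)$ is just a softmax in log space. A tiny illustrative sketch (the bound values and the uniform prior are made up):

```python
import numpy as np

L_m = np.array([-1523.4, -1498.2, -1501.7])     # hypothetical lower bounds for 3 models
log_prior = np.log(np.full(3, 1.0 / 3.0))       # uniform p(m)

log_q = log_prior + L_m
log_q -= log_q.max()                            # subtract the max for numerical stability
q_m = np.exp(log_q) / np.exp(log_q).sum()
print(q_m)                                      # approximate posterior over models q(m)
```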

10.2 Illustration: Variational Mixture of Gaussians

The conditional distribution of $Z$, given the mixing coefficients $\pi$, can be written in the form:

$$p(Z|\pi) = \prod_{n=1}^{N}\prod_{k=1}^{K} \pi_k^{z_{nk}}$$

Similarly, the conditional distribution of the observed data vectors, given the latent variables and the component parameters, is:

$$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$$

We choose a Dirichlet distribution over the mixing coefficients $\pi$:

$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0)\prod_{k=1}^{K}\pi_k^{\alpha_0 - 1}$$

and introduce an independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component, given by:

$$p(\mu, \Lambda) = p(\mu|\Lambda)\,p(\Lambda) = \prod_{k=1}^{K}\mathcal{N}(\mu_k|m_0, (\beta_0\Lambda_k)^{-1})\,\mathcal{W}(\Lambda_k|W_0, \nu_0)$$

10.2.1 Variational distribution

In order to formulate a variational treatment of this model, we write down the joint distribution of all of the random variables:

$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\,p(Z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda)$$

Consider a variational distribution which factorizes between the latent variables and the parameters, so that:

$$q(Z, \pi, \mu, \Lambda) = q(Z)\,q(\pi, \mu, \Lambda)$$

Let us consider the derivation of the update equation for the factor $q(Z)$. The log of the optimized factor is given by:

$$\ln q^*(Z) = \mathbb{E}_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$

Absorbing terms that do not depend on $Z$ into the constant:

$$\ln q^*(Z) = \mathbb{E}_{\pi}[\ln p(Z|\pi)] + \mathbb{E}_{\mu, \Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const}$$

Substituting for the two conditional distributions on the right-hand side and again absorbing terms independent of $Z$ into the constant, we have:

$$\ln q^*(Z) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\rho_{nk} + \text{const}$$

where we have defined:

$$\ln\rho_{nk} = \mathbb{E}[\ln\pi_k] + \frac{1}{2}\mathbb{E}[\ln|\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k, \Lambda_k}\left[(x_n - \mu_k)^T\Lambda_k(x_n - \mu_k)\right]$$

Taking the exponential of both sides we obtain:

$$q^*(Z) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\rho_{nk}^{z_{nk}}$$

Normalizing this distribution we obtain:

$$q^*(Z) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}$$

where:

$$r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}}$$

For the distribution $q^*(Z)$ we have the standard result:

$$\mathbb{E}[z_{nk}] = r_{nk}$$

Let us now consider the factor $q(\pi, \mu, \Lambda)$ in the variational posterior distribution. We have:

$$\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K}\ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z[\ln p(Z|\pi)] + \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbb{E}[z_{nk}]\ln\mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$$

The right-hand side decomposes into terms involving only $\pi$ and terms involving only $(\mu_k, \Lambda_k)$, so this factor further factorizes as:

$$q(\pi, \mu, \Lambda) = q(\pi)\prod_{k=1}^{K}q(\mu_k, \Lambda_k)$$

The results are given by:

$$q^*(\pi) = \mathrm{Dir}(\pi|\alpha)$$

where $\alpha$ has components $\alpha_k$ given by:

$$\alpha_k = \alpha_0 + \sum_{n=1}^{N} r_{nk}$$

Using $q^*(\mu_k, \Lambda_k) = q^*(\mu_k|\Lambda_k)\,q^*(\Lambda_k)$, the posterior distribution is a Gaussian-Wishart distribution, given by:

$$q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k|m_k, (\beta_k\Lambda_k)^{-1}\right)\mathcal{W}(\Lambda_k|W_k, \nu_k)$$

where, with the statistics $N_k = \sum_{n=1}^{N} r_{nk}$, $\hat{x}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}x_n$ and $S_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(x_n - \hat{x}_k)(x_n - \hat{x}_k)^T$,

$$\beta_k = \beta_0 + N_k$$

$$m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k\hat{x}_k\right)$$

$$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}(\hat{x}_k - m_0)(\hat{x}_k - m_0)^T$$

$$\nu_k = \nu_0 + N_k$$
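
Putting the two stages together gives a coordinate-ascent loop analogous to EM: a variational E-step that evaluates the responsibilities $r_{nk}$, and a variational M-step that updates $\alpha_k, \beta_k, m_k, W_k, \nu_k$. The sketch below is a minimal NumPy implementation; it relies on the standard expectations $\mathbb{E}[\ln\pi_k] = \psi(\alpha_k) - \psi(\sum_j\alpha_j)$, $\mathbb{E}[\ln|\Lambda_k|] = \sum_{i=1}^{D}\psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D\ln 2 + \ln|W_k|$ and $\mathbb{E}[(x_n - \mu_k)^T\Lambda_k(x_n - \mu_k)] = D/\beta_k + \nu_k(x_n - m_k)^T W_k(x_n - m_k)$, which are quoted from memory here; the data, initialization and hyperparameter values are illustrative:

```python
import numpy as np
from scipy.special import digamma

# Minimal variational Bayesian mixture of Gaussians (a sketch, not a reference implementation).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(100, 2))])
N, D = X.shape
K = 5                                              # deliberately more components than needed

alpha0, beta0, nu0 = 1e-3, 1.0, float(D)           # prior hyperparameters
m0, W0_inv = np.zeros(D), np.eye(D)

r = rng.dirichlet(np.ones(K), size=N)              # random initial responsibilities

for _ in range(200):
    # ---- variational M-step: update q(pi) and q(mu_k, Lambda_k) ----
    Nk = r.sum(axis=0) + 1e-10
    xbar = (r.T @ X) / Nk[:, None]
    alpha, beta, nu = alpha0 + Nk, beta0 + Nk, nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]
        d0 = (xbar[k] - m0)[:, None]
        W_inv = W0_inv + Nk[k] * Sk + (beta0 * Nk[k] / (beta0 + Nk[k])) * (d0 @ d0.T)
        W[k] = np.linalg.inv(W_inv)

    # ---- variational E-step: recompute the responsibilities r_nk ----
    E_ln_pi = digamma(alpha) - digamma(alpha.sum())
    ln_rho = np.empty((N, K))
    for k in range(K):
        E_ln_det = (digamma(0.5 * (nu[k] - np.arange(D))).sum()
                    + D * np.log(2.0) + np.linalg.slogdet(W[k])[1])
        diff = X - m[k]
        E_quad = D / beta[k] + nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
        ln_rho[:, k] = E_ln_pi[k] + 0.5 * E_ln_det - 0.5 * D * np.log(2 * np.pi) - 0.5 * E_quad
    ln_rho -= ln_rho.max(axis=1, keepdims=True)    # normalize with the log-sum-exp trick
    r = np.exp(ln_rho)
    r /= r.sum(axis=1, keepdims=True)

# With a small alpha0, superfluous components receive vanishing responsibility (Section 10.2.4).
print("effective mixing proportions:", np.round(r.sum(axis=0) / N, 3))
```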

10.2.2 Variational lower bound

For the variational mixture of Gaussians, the lower bound is given by:

$$\mathcal{L} = \sum_{Z}\iiint q(Z, \pi, \mu, \Lambda)\ln\left\{\frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)}\right\}\, d\pi\, d\mu\, d\Lambda$$

10.2.3 Predictive density

Given a new value $\hat{x}$ of the observed variable, the predictive density is given by:

$$p(\hat{x}|X) = \sum_{\hat{z}}\iiint p(\hat{x}|\hat{z}, \mu, \Lambda)\,p(\hat{z}|\pi)\,p(\pi, \mu, \Lambda|X)\, d\pi\, d\mu\, d\Lambda$$

10.2.4 Determining the number of components

10.2.5 Induced factorizations

10.3 Variational Linear Regression

The joint distribution of all the variables is given by:

$$p(\mathbf{t}, w, \alpha) = p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha)$$

where

$$p(\mathbf{t}|w) = \prod_{n=1}^{N}\mathcal{N}(t_n|w^T\phi_n, \beta^{-1})$$

$$p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1}I)$$

$$p(\alpha) = \mathrm{Gam}(\alpha|a_0, b_0)$$

10.3.1 Variational distribution

Our first goal is to find an approximation to the posterior distribution $p(w, \alpha|\mathbf{t})$. The variational posterior distribution is given by the factorized expression:

$$q(w, \alpha) = q(w)\,q(\alpha)$$

Applying the general result for the optimal factors, we find that $q^*(\alpha)$ is a Gamma distribution:

$$q^*(\alpha) = \mathrm{Gam}(\alpha|a_N, b_N)$$

where

$$a_N = a_0 + \frac{M}{2}$$

$$b_N = b_0 + \frac{1}{2}\mathbb{E}[w^Tw]$$

and the distribution $q^*(w)$ is Gaussian:

$$q^*(w) = \mathcal{N}(w|m_N, S_N)$$

where

$$m_N = \beta S_N\Phi^T\mathbf{t}$$

$$S_N = \left(\mathbb{E}[\alpha]I + \beta\Phi^T\Phi\right)^{-1}$$
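
The two factors are coupled, since $q^*(w)$ uses $\mathbb{E}[\alpha] = a_N/b_N$ and $q^*(\alpha)$ uses $\mathbb{E}[w^Tw] = m_N^Tm_N + \mathrm{Tr}(S_N)$, so in practice the updates are iterated to convergence. A minimal sketch, assuming a known noise precision $\beta$ and a simple polynomial basis (data and names are illustrative):

```python
import numpy as np

# Iterative variational Bayes for linear regression; beta is treated as known.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=30)    # toy targets
Phi = np.vander(x, N=8, increasing=True)                   # polynomial design matrix, M = 8
M = Phi.shape[1]
beta = 1.0 / 0.1 ** 2                                      # assumed known noise precision

a0, b0 = 1e-3, 1e-3                                        # broad Gamma prior on alpha
E_alpha = a0 / b0
for _ in range(100):
    # q(w) = N(w | m_N, S_N)
    S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    # q(alpha) = Gam(alpha | a_N, b_N), with E[w^T w] = m_N^T m_N + Tr(S_N)
    a_N, b_N = a0 + M / 2.0, b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
    E_alpha = a_N / b_N

print("E[alpha] =", E_alpha)
```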

10.3.2 Predictive distribution

The predictive distribution over $t$, given a new input $x$, is evaluated using the Gaussian variational posterior:

$$p(t|x, \mathbf{t}) \simeq \mathcal{N}\!\left(t|m_N^T\phi(x), \sigma^2(x)\right)$$

where the input-dependent variance is given by

$$\sigma^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\,\phi(x)$$
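
Continuing the sketch above, the predictive mean and variance at new inputs follow directly from $m_N$, $S_N$ and $\beta$ (the helper below assumes the same polynomial basis):

```python
def predict(x_new, m_N, S_N, beta):
    """Predictive mean and variance under the Gaussian variational posterior."""
    phi = np.vander(np.atleast_1d(x_new), N=len(m_N), increasing=True)
    mean = phi @ m_N
    var = 1.0 / beta + np.einsum('ni,ij,nj->n', phi, S_N, phi)
    return mean, var

mean, var = predict(np.array([0.0, 0.5]), m_N, S_N, beta)
```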

10.3.3 Lower bound

10.4 Exponential Family Distributions

Suppose the joint distribution of observed and latent variables is a member of the exponential family, parameterized by natural parameters $\eta$, so that:

$$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n)\,g(\eta)\exp\left\{\eta^T u(x_n, z_n)\right\}$$

We shall also use a conjugate prior for $\eta$, which can be written as:

$$p(\eta|\nu_0) = f(\nu_0, x_0)\,g(\eta)^{\nu_0}\exp\left\{\nu_0\,\eta^T x_0\right\}$$

Now consider a variational distribution that factorizes between the latent variables and the parameters, so that $q(Z, \eta) = q(Z)\,q(\eta)$. The optimal factors are:

$$q^*(z_n) = h(x_n, z_n)\,g(\mathbb{E}[\eta])\exp\left\{\mathbb{E}[\eta]^T u(x_n, z_n)\right\}$$

$$q^*(\eta) = f(\nu_N, x_N)\,g(\eta)^{\nu_N}\exp\left\{\eta^T x_N\right\}$$

where

$$\nu_N = \nu_0 + N$$

$$x_N = x_0 + \sum_{n=1}^{N}\mathbb{E}_{z_n}[u(x_n, z_n)]$$

10.4.1 Variational message passing

The joint distribution corresponding to a directed graph can be written using the decomposition:

$$p(x) = \prod_i p(x_i|\mathrm{pa}_i)$$

Now consider a variational approximation in which the distribution $q(x)$ is assumed to factorize with respect to the $x_i$, so that:

$$q(x) = \prod_i q_i(x_i)$$

Substituting this factorization into the general result above, we obtain:

$$\ln q_j^*(x_j) = \mathbb{E}_{i\neq j}\left[\sum_i \ln p(x_i|\mathrm{pa}_i)\right] + \text{const}$$

Any term on the right-hand side that does not involve $x_j$ is absorbed into the constant, so the update for $q_j$ depends only on the nodes in the Markov blanket of $x_j$ (its parents, children and co-parents), which is what makes this a local message-passing scheme.

10.5 Local Variational Methods

For a convex function we can obtain lower bounds of the form $\eta x - g(\eta)$, where $g$ is the conjugate (dual) function:

$$g(\eta) = -\min_x\{f(x) - \eta x\} = \max_x\{\eta x - f(x)\}$$

$$f(x) = \max_\eta\{\eta x - g(\eta)\}$$

And for concave functions we obtain upper bounds in an analogous way, with max replaced by min:

$$f(x) = \min_\eta\{\eta x - g(\eta)\}$$

$$g(\eta) = \min_x\{\eta x - f(x)\}$$
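
As a quick sanity check of the convex case, take $f(x) = e^{-x}$; its conjugate is $g(\eta) = -\eta\ln(-\eta) + \eta$ for $\eta < 0$, so each $\eta x - g(\eta)$ is a tangent line lying below $f$ (the example function is chosen here purely for illustration):

```python
import numpy as np

# Convex case: f(x) = exp(-x), conjugate g(eta) = -eta*ln(-eta) + eta for eta < 0.
# For any eta < 0, eta*x - g(eta) is a lower bound on f, tight at x = -ln(-eta).
f = lambda x: np.exp(-x)
g = lambda eta: -eta * np.log(-eta) + eta

x = np.linspace(-1.0, 4.0, 200)
for eta in [-0.3, -1.0, -2.0]:
    bound = eta * x - g(eta)
    assert np.all(bound <= f(x) + 1e-12)                      # the line never exceeds f
    x_touch = -np.log(-eta)                                   # point of tangency
    assert np.isclose(eta * x_touch - g(eta), f(x_touch))     # ... where the bound is exact
```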

If the function of interest is neither convex nor concave, we first apply an invertible transformation that makes it so. An important example is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The resulting upper and lower bounds on the logistic sigmoid are:

$$\sigma(x) \leq \exp\left(\eta x - g(\eta)\right)$$

$$\sigma(x) \geq \sigma(\xi)\exp\left\{\frac{x - \xi}{2} - \lambda(\xi)\left(x^2 - \xi^2\right)\right\}$$

where:

$$\lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \frac{1}{2}\right]$$
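
A quick numerical check of the lower bound (only the $\xi$-bound is checked here, since the upper bound also needs the conjugate $g(\eta)$ of the log-sigmoid; the values of $x$ and $\xi$ are arbitrary):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lam = lambda xi: (sigmoid(xi) - 0.5) / (2.0 * xi)

def lower_bound(x, xi):
    """Quadratic-exponential lower bound on the logistic sigmoid, tight at x = +/- xi."""
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

x = np.linspace(-6.0, 6.0, 401)
for xi in [0.5, 1.0, 2.5]:
    assert np.all(lower_bound(x, xi) <= sigmoid(x) + 1e-12)   # never exceeds sigma(x)
    assert np.isclose(lower_bound(xi, xi), sigmoid(xi))       # exact at x = xi
    assert np.isclose(lower_bound(-xi, xi), sigmoid(-xi))     # ... and at x = -xi
```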

To see how these bounds can be used, suppose we wish to evaluate an integral of the form:

$$I = \int \sigma(a)\,p(a)\, da$$

Employing the variational lower bound $\sigma(a) \geq f(a, \xi)$, we obtain:

$$I \geq \int f(a, \xi)\,p(a)\, da = F(\xi)$$
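
When $p(a)$ is Gaussian, $f(a, \xi)$ is the exponential of a quadratic function of $a$, so $F(\xi)$ has a closed form. The sketch below simply checks the inequality $F(\xi) \leq I$ by numerical quadrature, for an illustrative Gaussian $p(a)$ and a few values of $\xi$:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lam = lambda xi: (sigmoid(xi) - 0.5) / (2.0 * xi)
f_bound = lambda a, xi: sigmoid(xi) * np.exp((a - xi) / 2.0 - lam(xi) * (a ** 2 - xi ** 2))

a = np.linspace(-20.0, 20.0, 20001)                                   # quadrature grid
da = a[1] - a[0]
p_a = np.exp(-0.5 * (a - 1.0) ** 2 / 4.0) / np.sqrt(2 * np.pi * 4.0)  # p(a) = N(a | 1, 2^2)

I = np.sum(sigmoid(a) * p_a) * da                  # the target integral
for xi in [0.5, 1.5, 3.0]:
    F_xi = np.sum(f_bound(a, xi) * p_a) * da       # F(xi) is always <= I
    print(f"xi={xi}:  F(xi)={F_xi:.4f}  <=  I={I:.4f}")
```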

10.6 Variational Logistic Regression

10.6.1 Variational posterior distribution

In the variational framework, we seek to maximize a lower bound on the marginal likelihood. For the Bayesian logistic regression model, the marginal likelihood takes the form:

$$p(\mathbf{t}) = \int p(\mathbf{t}|w)\,p(w)\, dw = \int\left[\prod_{n=1}^{N}p(t_n|w)\right]p(w)\, dw$$

The conditional distribution for a single target $t$ can be written as

$$p(t|w) = e^{at}\sigma(-a)$$

where $a = w^T\phi$. Using the variational lower bound on the logistic sigmoid function, we get:

$$p(t|w) = e^{at}\sigma(-a) \geq e^{at}\sigma(\xi)\exp\left\{-\frac{a + \xi}{2} - \lambda(\xi)\left(a^2 - \xi^2\right)\right\}$$

Using $a = w^T\phi$, and multiplying by the prior distribution, we obtain a bound on the joint distribution of $\mathbf{t}$ and $w$:

$$p(\mathbf{t}, w) = p(\mathbf{t}|w)\,p(w) \geq h(w, \xi)\,p(w)$$

where

$$h(w, \xi) = \prod_{n=1}^{N}\sigma(\xi_n)\exp\left\{w^T\phi_n t_n - \frac{w^T\phi_n + \xi_n}{2} - \lambda(\xi_n)\left([w^T\phi_n]^2 - \xi_n^2\right)\right\}$$

The Gaussian variational posterior then takes the form:

$$q(w) = \mathcal{N}(w|m_N, S_N)$$

where

$$m_N = S_N\left(S_0^{-1}m_0 + \sum_{n=1}^{N}\left(t_n - \frac{1}{2}\right)\phi_n\right)$$

$$S_N^{-1} = S_0^{-1} + 2\sum_{n=1}^{N}\lambda(\xi_n)\,\phi_n\phi_n^T$$

10.6.2 Optimizing the variational parameters

Substituting the inequality above back into the marginal likelihood gives:

$$\ln p(\mathbf{t}) = \ln\int p(\mathbf{t}|w)\,p(w)\, dw \geq \ln\int h(w, \xi)\,p(w)\, dw = \mathcal{L}(\xi)$$

There are two approaches to determining the $\xi_n$: treat them as parameters and re-estimate them within an EM algorithm, or integrate over $w$ analytically and maximize the resulting bound on $\ln p(\mathbf{t})$ directly.
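
A minimal sketch of the whole scheme, alternating the $q(w)$ updates above with an EM-style re-estimation of the variational parameters; the update $\xi_n^2 = \phi_n^T(S_N + m_Nm_N^T)\phi_n$ used below is the standard one for this bound, quoted from memory, and the data are synthetic:

```python
import numpy as np

# Variational Bayesian logistic regression with the local sigmoid bound (a sketch).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lam = lambda xi: (sigmoid(xi) - 0.5) / (2.0 * xi)

rng = np.random.default_rng(3)
N, D = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])   # design matrix with bias
w_true = np.array([-1.0, 2.0, 0.5])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)   # synthetic binary targets

m0, S0_inv = np.zeros(D), np.eye(D)      # prior N(w | 0, I)
xi = np.ones(N)                          # initial variational parameters
for _ in range(50):
    L = lam(xi)
    S_N = np.linalg.inv(S0_inv + 2.0 * (Phi.T * L) @ Phi)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # EM-style re-estimation of xi_n (assumed form: xi_n^2 = phi_n^T E[w w^T] phi_n)
    E_wwT = S_N + np.outer(m_N, m_N)
    xi = np.sqrt(np.einsum('ni,ij,nj->n', Phi, E_wwT, Phi))

print("posterior mean of w:", np.round(m_N, 2), " (true w:", w_true, ")")
```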

10.6.3 Inference of hyperparameters

10.7 Expectation Propagation
