PRML Chapter 10 Approximate Inference
10.1 Variational Inference
Consider observed variables $X=\{x_1,\dots,x_N\}$ and latent variables $Z=\{z_1,\dots,z_N\}$. Our probabilistic model specifies the joint distribution $p(X,Z)$, and our goal is to find an approximation for the posterior distribution $p(Z|X)$ as well as for the model evidence $p(X)$. As in our discussion of EM, we can decompose the log marginal probability using:
$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p)$$
where
$$\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$$
$$\mathrm{KL}(q \| p) = -\int q(Z)\ln \left\{ \frac{p(Z | X)}{q(Z)} \right\} dZ$$
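The decomposition can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-state joint table and the choice of $q$ are made-up values, and sums replace the integrals.

```python
import numpy as np

# Hypothetical values of p(X = x_obs, Z = z) for three latent states.
# They sum to the evidence p(X), not to one.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()            # p(X)
posterior = joint / evidence      # p(Z | X)

# An arbitrary approximating distribution q(Z) (an assumption of this sketch).
q = np.array([0.2, 0.5, 0.3])

elbo = np.sum(q * np.log(joint / q))        # L(q)
kl = -np.sum(q * np.log(posterior / q))     # KL(q || p)

print(np.log(evidence), elbo + kl)          # the two numbers agree
```

Because the KL term is non-negative, $\mathcal{L}(q)$ never exceeds $\ln p(X)$, which is why it is called a lower bound.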
10.1.1 Factorized distributions
Suppose we partition the elements of $Z$ into disjoint groups $Z_i$, $i=1,\dots,M$, and that $q$ factorizes with respect to these groups:
$$q(Z) = \prod_{i=1}^{M}q_{i}(Z_{i})$$
Substituting this into the formula for $\mathcal{L}(q)$ above and denoting $q_j(Z_j)$ simply by $q_j$, we obtain:
$$\mathcal{L}(q) = \int q_{j}\ln \tilde{p}(X, Z_{j})\,dZ_{j} - \int q_{j}\ln q_{j}\,dZ_{j} + \text{const}$$
where we have defined a new distribution $\tilde{p}(X, Z_{j})$ by the relation:
$$\ln \tilde{p}(X, Z_{j}) = E_{i\neq j}[\ln p(X, Z)] + \text{const}$$
$$E_{i\neq j}[\ln p(X, Z)] = \int \ln p(X,Z)\prod_{i\neq j}q_{i}\,dZ_{i}$$
With the factors $q_{i\neq j}$ held fixed, maximizing $\mathcal{L}(q)$ with respect to $q_j$ is equivalent to minimizing the Kullback-Leibler divergence between $q_j(Z_j)$ and $\tilde{p}(X,Z_j)$, and the minimum occurs when $q_j(Z_j)=\tilde{p}(X,Z_j)$. Thus we obtain the general expression for the optimal solution $q_j^*(Z_j)$:
$$\ln q^{*}_{j}(Z_{j}) = E_{i\neq j}[\ln p(X, Z)] + \text{const}$$
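Since each optimal factor depends on expectations under the other factors, the general result is applied as an iterative scheme. As a rough sketch, assume the target is a two-dimensional Gaussian over $z=(z_1,z_2)$ with mean `mu` and precision matrix `Lam` (both made-up values). Applying the general result then gives Gaussian factors with precision $\Lambda_{jj}$ and a mean that depends on the current mean of the other factor.

```python
import numpy as np

# Assumed target: N(z | mu, Lam^{-1}); the numbers are illustrative only.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.9],
                [0.9, 1.5]])      # symmetric positive definite precision

m = np.zeros(2)                   # initial means of q_1(z_1) and q_2(z_2)
for _ in range(100):
    # ln q_j* = E_{i != j}[ln p] + const yields a Gaussian in z_j with mean:
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

q_var = 1.0 / np.diag(Lam)                  # variances of the factors
true_var = np.diag(np.linalg.inv(Lam))      # true marginal variances

print(m, q_var, true_var)
```

In this toy case the factor means converge to the true mean, while the factor variances come out smaller than the true marginal variances.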
10.1.2 Properties of factorized approximations
The two forms of the Kullback-Leibler divergence, $\mathrm{KL}(p\|q)$ and $\mathrm{KL}(q\|p)$, are members of the alpha family of divergences defined by:
$$D_\alpha (p\|q)=\frac{4}{1-\alpha^2}\left(1-\int p(x)^{(1+\alpha)/2}q(x)^{(1-\alpha)/2}\,dx\right)$$
Here $\mathrm{KL}(p\|q)$ corresponds to the limit $\alpha\to 1$ and $\mathrm{KL}(q\|p)$ to the limit $\alpha\to -1$.
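A quick numerical illustration of the alpha family, using discrete distributions so that the integral becomes a sum. The two distributions are arbitrary examples, and the values of $\alpha$ near $\pm 1$ merely approximate the limits.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Discrete analogue of D_alpha(p || q); requires alpha != +1 or -1."""
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

def kl(p, q):
    """Discrete KL(p || q)."""
    return np.sum(p * np.log(p / q))

# Arbitrary example distributions (assumptions of this sketch).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print(alpha_divergence(p, q, 0.9999), kl(p, q))    # close to KL(p || q)
print(alpha_divergence(p, q, -0.9999), kl(q, p))   # close to KL(q || p)
```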
10.1.3 Example: The univariate Gaussian
10.1.4 Model comparison
Let $m$ index a set of candidate models with prior $p(m)$. We can readily verify the following decomposition based on the variational distribution $q(Z,m)=q(Z|m)q(m)$:
$$\ln p(X)=\mathcal{L}-\sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,m|X)}{q(Z|m)q(m)}\right\}$$
where $\mathcal{L}$ is a lower bound on $\ln p(X)$ and is given by
$$\mathcal{L}=\sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\}$$
We can maximize $\mathcal{L}$ with respect to the distribution $q(m)$ using a Lagrange multiplier, with the result:
$$q(m)\propto p(m)\exp(\mathcal{L}_m)$$
where $\mathcal{L}_m=\sum_Z q(Z|m)\ln\left\{\frac{p(Z,X|m)}{q(Z|m)}\right\}$ is the lower bound obtained for model $m$ alone.
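In practice the result $q(m)\propto p(m)\exp(\mathcal{L}_m)$ is evaluated in log space, because the per-model bounds are typically large negative numbers. A minimal sketch, with made-up priors and bound values:

```python
import numpy as np

# Hypothetical prior p(m) and per-model lower bounds L_m for three models.
prior = np.array([0.5, 0.3, 0.2])
L_m = np.array([-105.2, -101.7, -103.0])

log_unnorm = np.log(prior) + L_m
log_unnorm -= log_unnorm.max()        # subtract the max to avoid underflow
q_m = np.exp(log_unnorm)
q_m /= q_m.sum()                      # normalize so that q(m) sums to one

print(q_m)
```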
10.2 Illustration: Variational Mixture of Gaussians
The conditional distribution of $Z$, given the mixing coefficients $\pi$, takes the form:
$$p(Z | \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_{k}^{z_{nk}}$$
Similarly, the conditional distribution of the observed data vectors, given the latent variables and the component parameters, is:
$$p(X | Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K}\mathcal{N}(x_{n} | \mu_{k}, \Lambda_{k}^{-1})^{z_{nk}}$$
We choose a Dirichlet distribution over the mixing coefficients $\pi$:
$$p(\pi) = \mathrm{Dir}(\pi | \alpha_{0}) = C(\alpha_{0})\prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1}$$
We also introduce an independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component, given by:
$$p(\mu, \Lambda) = p(\mu | \Lambda)p(\Lambda) = \prod_{k=1}^{K}\mathcal{N}(\mu_{k} | m_{0}, (\beta_{0}\Lambda_{k})^{-1})\mathcal{W}(\Lambda_{k} | W_{0}, \nu_{0})$$
10.2.1 Variational distribution
In order to formulate a variational treatment of this model, we write down the joint distribution of all of the random variables:
$$p(X, Z, \pi, \mu, \Lambda) = p(X | Z, \mu, \Lambda)p(Z | \pi)p(\pi)p(\mu | \Lambda)p(\Lambda)$$
Consider a variational distribution which factorizes between the latent variables and the parameters, so that:
$$q(Z, \pi, \mu, \Lambda) = q(Z)q(\pi, \mu, \Lambda)$$
Let us consider the derivation of the update equation for the factor $q(Z)$. The log of the optimized factor is given by:
$$\ln q^{*}(Z) = E_{\pi, \mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$
Absorbing terms that do not depend on $Z$ into the constant gives:
$$\ln q^{*}(Z) = E_\pi[\ln p(Z | \pi)] + E_{\mu, \Lambda}[\ln p(X | Z,\mu, \Lambda)] + \text{const}$$
Substituting for the two conditional distributions on the right-hand side, and again absorbing terms independent of $Z$ into the constant, we have:
$$\ln q^*(Z)=\sum_{n=1}^N\sum_{k=1}^K z_{nk}\ln\rho_{nk}+\text{const}$$
where we have defined:
$$\ln \rho_{nk} = E[\ln \pi_{k}] + \frac{1}{2}E[\ln|\Lambda_{k}|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}E_{\mu_{k}, \Lambda_{k}}[(x_{n} - \mu_{k})^{T}\Lambda_{k}(x_{n} - \mu_{k})]$$
Here $D$ is the dimensionality of the data variable $x$. Taking the exponential of both sides we obtain:
$$q^{*}(Z) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\rho_{nk}^{z_{nk}}$$
Normalizing this distribution we obtain:
$$q^{*}(Z) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}$$
where:
$$r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}}$$
For the distribution $q^*(Z)$ we have the standard result:
$$E[z_{nk}] = r_{nk}$$
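Since $\ln\rho_{nk}$ can be a large negative number, the normalization is usually carried out with the log-sum-exp trick. The sketch below assumes the matrix of $\ln\rho_{nk}$ values has already been computed; the example matrix is made up.

```python
import numpy as np

def responsibilities(log_rho):
    """Turn an (N, K) array of ln(rho_nk) into responsibilities r_nk.

    Subtracting the row maximum before exponentiating leaves the ratio
    rho_nk / sum_j rho_nj unchanged but avoids numerical underflow.
    """
    shifted = log_rho - log_rho.max(axis=1, keepdims=True)
    rho = np.exp(shifted)
    return rho / rho.sum(axis=1, keepdims=True)

# Hypothetical ln(rho) values for N = 2 points and K = 2 components.
log_rho = np.array([[-1000.0, -1001.0],
                    [-3.0, -3.0]])
r = responsibilities(log_rho)
print(r)
```

Exponentiating the first row directly would underflow to zero and produce a division by zero, whereas the shifted version gives well-defined responsibilities.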
Next consider the factor $q(\pi, \mu, \Lambda)$ in the variational posterior distribution. We have:
$$\ln q^{*}(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K}\ln p(\mu_{k}, \Lambda_{k}) + E_{Z}[\ln p(Z | \pi)] + \sum_{k=1}^{K}\sum_{n=1}^{N}E[z_{nk}]\ln \mathcal{N}(x_{n} | \mu_{k}, \Lambda_{k}^{-1}) + \text{const}$$
The right-hand side separates into terms involving only $\pi$ and terms involving only $\mu_k$ and $\Lambda_k$, so the variational approximation factorizes further:
$$q(\pi,\mu,\Lambda)=q(\pi)\prod_{k=1}^{K}q(\mu_k,\Lambda_k)$$
The results are given by:
$$q^{*}(\pi) = \mathrm{Dir}(\pi | \alpha)$$
where $\alpha$ has components $\alpha_k$ given by:
$$\alpha_{k} = \alpha_{0} + \sum_{n=1}^{N}r_{nk}$$
Using $q^{*}(\mu_{k}, \Lambda_{k}) = q^{*}(\mu_{k} | \Lambda_{k})q^{*}(\Lambda_{k})$, the posterior distribution is a Gaussian-Wishart distribution and is given by:
$$q^{*}(\mu_{k}, \Lambda_{k}) = \mathcal{N}(\mu_{k} | m_{k}, (\beta_{k}\Lambda_{k})^{-1})\mathcal{W}(\Lambda_{k} | W_{k}, \nu_{k})$$
where
$$\beta_{k} = \beta_{0} + N_{k}$$
$$m_{k} = \frac{1}{\beta_{k}}(\beta_{0}m_{0} + N_{k}\bar{x}_{k})$$
$$W_{k}^{-1} = W_{0}^{-1} + N_{k}S_{k} + \frac{\beta_{0}N_{k}}{\beta_{0} + N_{k}}(\bar{x}_{k} - m_{0})(\bar{x}_{k} - m_{0})^{T}$$
$$\nu_{k} = \nu_{0} + N_{k}$$
Here $N_k$, $\bar{x}_k$ and $S_k$ are the responsibility-weighted statistics of the data:
$$N_k=\sum_{n=1}^{N}r_{nk},\qquad \bar{x}_k=\frac{1}{N_k}\sum_{n=1}^{N}r_{nk}x_n,\qquad S_k=\frac{1}{N_k}\sum_{n=1}^{N}r_{nk}(x_n-\bar{x}_k)(x_n-\bar{x}_k)^{T}$$
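The parameter updates can be sketched as a single function. This is an illustrative implementation under stated assumptions: the function name and argument layout are hypothetical, every component is assumed to receive non-zero total responsibility, and the weighted mean and scatter of the data are computed from the responsibilities $r_{nk}$.

```python
import numpy as np

def update_parameter_factors(X, r, alpha0, beta0, m0, W0, nu0):
    """Update Dirichlet and Gaussian-Wishart parameters from responsibilities.

    X: (N, D) data, r: (N, K) responsibilities. Assumes every N_k > 0.
    """
    N_k = r.sum(axis=0)                       # effective counts, shape (K,)
    xbar = (r.T @ X) / N_k[:, None]           # weighted means, shape (K, D)
    alpha = alpha0 + N_k
    beta = beta0 + N_k
    nu = nu0 + N_k
    m = (beta0 * m0 + N_k[:, None] * xbar) / beta[:, None]
    K, D = xbar.shape
    W0_inv = np.linalg.inv(W0)
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]                            # (N, D)
        NkSk = (r[:, k, None] * diff).T @ diff        # N_k * S_k
        dm = xbar[k] - m0
        W_inv = W0_inv + NkSk + beta0 * N_k[k] / (beta0 + N_k[k]) * np.outer(dm, dm)
        W[k] = np.linalg.inv(W_inv)
    return alpha, beta, m, W, nu

# Made-up data: two well separated groups with hard assignments.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
r = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
alpha, beta, m, W, nu = update_parameter_factors(
    X, r, alpha0=1.0, beta0=1.0, m0=np.zeros(2), W0=np.eye(2), nu0=2.0)
print(m)
```

With $\beta_0=1$ and two points per component, each posterior mean is pulled one third of the way from the weighted data mean towards $m_0$.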
10.2.2 Variational lower bound
For the variational mixture of Gaussians, the lower bound is given by:
$$\mathcal{L} = \sum_{Z}\int\int\int q(Z, \pi, \mu, \Lambda)\ln\left\{\frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)}\right\} d\pi\, d\mu\, d\Lambda$$
10.2.3 Predictive density
For a new value $\hat{x}$ of the observed variable, with corresponding latent variable $\hat{z}$, the predictive density is given by:
$$p(\hat{x} | X) = \sum_{\hat{z}}\int\int\int p(\hat{x} | \hat{z},\mu, \Lambda)p(\hat{z} | \pi)p(\pi, \mu, \Lambda | X)\,d\pi\, d\mu\, d\Lambda$$
10.2.4 Determining the number of components
10.2.5 Induced factorizations
10.3 Variational Linear Regression
The joint distribution of all the variables is given by:
$$p(t,w,\alpha)=p(t|w)p(w|\alpha)p(\alpha)$$
where
$$p(t|w)=\prod_{n=1}^N \mathcal{N}(t_n|w^T\phi_n,\beta^{-1})$$
$$p(w|\alpha)=\mathcal{N}(w|0,\alpha^{-1}I)$$
$$p(\alpha)=\mathrm{Gam}(\alpha|a_0,b_0)$$
10.3.1 Variational distribution
Our first goal is to find an approximation to the posterior distribution $p(w,\alpha|\mathbf{t})$. The variational posterior distribution is given by the factorized expression:
$$q(w,\alpha)=q(w)q(\alpha)$$
The result is:
$$q^*(\alpha)=\mathrm{Gam}(\alpha|a_N,b_N)$$
where
$$a_N=a_0+\frac{M}{2}$$
$$b_N=b_0+\frac{1}{2}E[w^Tw]$$
with $M$ the dimensionality of $w$, and the distribution $q^*(w)$ is Gaussian:
$$q^*(w)=\mathcal{N}(w|m_N,S_N)$$
where
$$m_N=\beta S_N\Phi^T t$$
$$S_N=(E[\alpha]I+\beta\Phi^T\Phi)^{-1}$$
The two factors are coupled through the moments $E[\alpha]=a_N/b_N$ and $E[w^Tw]=m_N^Tm_N+\mathrm{Tr}(S_N)$, so the updates are iterated until convergence.
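The two factors depend on each other through $E[\alpha]$ and $E[w^Tw]$, so they are re-estimated in turn. The sketch below assumes the standard moments $E[\alpha]=a_N/b_N$ for the Gamma factor and $E[w^Tw]=m_N^Tm_N+\mathrm{Tr}(S_N)$ for the Gaussian factor. The function name, the default hyperparameters and the synthetic data are all made up for illustration.

```python
import numpy as np

def variational_linear_regression(Phi, t, beta, a0=1e-2, b0=1e-2, n_iter=50):
    """Alternate the updates of q(w) and q(alpha) for a fixed noise precision beta."""
    N, M = Phi.shape
    a_N = a0 + M / 2.0              # fixed: does not depend on q(w)
    E_alpha = a0 / b0               # initial guess for E[alpha]
    for _ in range(n_iter):
        # Update q(w) given the current E[alpha].
        S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # Update q(alpha) given the current q(w).
        E_wTw = m_N @ m_N + np.trace(S_N)
        b_N = b0 + 0.5 * E_wTw
        E_alpha = a_N / b_N
    return m_N, S_N, a_N, b_N

# Synthetic data from t = 0.5 + 2 x + noise (noise std 0.1, so beta = 100).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + 0.1 * rng.standard_normal(x.shape)
m_N, S_N, a_N, b_N = variational_linear_regression(Phi, t, beta=100.0)
print(m_N)
```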
10.3.2 Predictive distribution
The predictive distribution over $t$, given a new input $x$, is evaluated using the Gaussian variational posterior:
$$p(t|x,\mathbf{t})\simeq \mathcal{N}(t|m^T_N\phi(x),\sigma^2(x))$$
where the input-dependent variance is given by
$$\sigma^2(x)=\frac{1}{\beta}+\phi(x)^TS_N\phi(x)$$
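A small worked example of the predictive mean and variance. The posterior parameters and the feature vector are made-up numbers, chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Mean and variance of the approximate predictive distribution at phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# Hypothetical posterior parameters and feature vector phi(x) = (1, 0.5).
m_N = np.array([0.5, 2.0])
S_N = np.array([[0.01, 0.0],
                [0.0, 0.04]])
beta = 25.0
mean, var = predictive(np.array([1.0, 0.5]), m_N, S_N, beta)
print(mean, var)
```

The variance is the sum of the noise term $1/\beta = 0.04$ and the parameter-uncertainty term $0.01 + 0.04\cdot 0.25 = 0.02$.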
10.3.3 Lower bound
10.4 Exponential Family Distributions
Suppose the joint distribution of observed and latent variables is a member of the exponential family, parameterized by natural parameters $\eta$ so that:
$$p(X,Z|\eta)=\prod_{n=1}^N h(x_n,z_n)g(\eta)\exp\{\eta^Tu(x_n,z_n)\}$$
We shall also use a conjugate prior for $\eta$, which can be written as:
$$p(\eta|\nu_0,\chi_0)=f(\nu_0,\chi_0)g(\eta)^{\nu_0}\exp\{\nu_0\eta^T\chi_0\}$$
Now consider a variational distribution that factorizes between the latent variables and the parameters, so that $q(Z,\eta)=q(Z)q(\eta)$. The result is
$$q^*(z_n)=h(x_n,z_n)g(E[\eta])\exp\{E[\eta^T]u(x_n,z_n)\}$$
$$q^*(\eta)=f(\nu_N,\chi_N)g(\eta)^{\nu_N}\exp\{\nu_N\eta^T\chi_N\}$$
where
$$\nu_N=\nu_0+N$$
$$\nu_N\chi_N=\nu_0\chi_0+\sum_{n=1}^N E_{z_n}[u(x_n,z_n)]$$
10.4.1 Variational message passing
The joint distribution corresponding to a directed graph can be written using the decomposition:
$$p(x)=\prod_i p(x_i|\mathrm{pa}_i)$$
where $\mathrm{pa}_i$ denotes the set of parents of node $i$. Now consider a variational approximation in which the distribution $q(x)$ is assumed to factorize with respect to the $x_i$ so that:
$$q(x)=\prod_i q_i(x_i)$$
Substituting the decomposition above into the general result gives:
$$\ln q_j^*(x_j)=E_{i\neq j}\left[\sum_i \ln p(x_i|\mathrm{pa}_i)\right]+\text{const}$$
10.5 Local Variational Methods
For convex functions, we can obtain lower bounds via the dual function $g(\eta)$:
$$g(\eta) = - \min_{x}\{ f(x) - \eta x \} = \max_{x}\{ \eta x - f(x) \}$$
$$f(x) = \max_{\eta}\{ \eta x - g(\eta) \}$$
so that, for any fixed $\eta$, the linear function $\eta x - g(\eta)$ is a lower bound on $f(x)$. For concave functions the corresponding relations give upper bounds:
$$f(x) = \min_{\eta}\{ \eta x- g(\eta) \}$$
$$g(\eta) = \min_{x}\{ \eta x - f(x) \}$$
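As an illustration of the convex case, take $f(x)=\exp(-x)$. For $\eta<0$ the maximum of $\eta x - f(x)$ is attained at $x=-\ln(-\eta)$, which gives $g(\eta)=\eta-\eta\ln(-\eta)$. The sketch below checks on a grid (the grid limits are arbitrary) that every line $\eta x - g(\eta)$ lies below $f$ and that maximizing over $\eta$ recovers $f$.

```python
import numpy as np

def f(x):
    return np.exp(-x)

def g(eta):
    # Dual function of f(x) = exp(-x), valid for eta < 0.
    return eta - eta * np.log(-eta)

x = np.linspace(-2.0, 3.0, 201)
# Slopes of the tangents to f at each grid point: f'(x) = -exp(-x).
etas = -np.exp(-x)

# bounds[i, j] = etas[i] * x[j] - g(etas[i]): one linear lower bound per row.
bounds = etas[:, None] * x[None, :] - g(etas)[:, None]

print(np.max(bounds - f(x)[None, :]))   # never above zero, up to rounding
```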
If the function is neither convex nor concave, we first apply an invertible transformation, either of the argument or of the function itself, to obtain a convex or concave form. An example is the logistic sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
For the logistic sigmoid, the resulting upper and lower bounds are:
$$\sigma(x) \leq \exp(\eta x - g(\eta))$$
$$\sigma(x) \geq \sigma(\xi)\exp\left\{ \frac{x-\xi}{2} - \lambda(\xi)(x^{2} - \xi^{2}) \right\}$$
where $g(\eta)=-\eta\ln\eta-(1-\eta)\ln(1-\eta)$ and:
$$\lambda(\xi) = \frac{1}{2\xi}\left[ \sigma(\xi) - \frac{1}{2} \right]$$
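The lower bound on the sigmoid can be checked numerically. In the sketch below $\lambda(\xi)$ is taken as the positive quantity $\frac{1}{2\xi}[\sigma(\xi)-\frac{1}{2}]=\tanh(\xi/2)/(4\xi)$, which is what the bound requires: with a negative coefficient the exponential would grow without limit in $x^2$ and exceed 1. The grid and the value of $\xi$ are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # Positive for every xi != 0; undefined at xi = 0 in this simple form.
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def lower_bound(x, xi):
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

x = np.linspace(-6.0, 6.0, 241)
xi = 2.5
gap = sigmoid(x) - lower_bound(x, xi)

print(gap.min())   # non-negative up to rounding; zero at x = xi and x = -xi
```

The bound is a Gaussian-like function of $x$ that touches the sigmoid at $x=\pm\xi$.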
To see how these bounds can be used, suppose we wish to evaluate an integral of the form:
$$I = \int \sigma(a)p(a)\, da$$
Employing the variational lower bound $\sigma(a)\geq f(a,\xi)$, where $\xi$ is a variational parameter, we obtain:
$$I \geq \int f(a, \xi)p(a)\, da = F(\xi)$$
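A numerical sketch of this bound, assuming for illustration that $p(a)$ is a unit-variance Gaussian with mean 1 and that $f(a,\xi)$ is the exponential-quadratic lower bound on the sigmoid with $\lambda(\xi)=\frac{1}{2\xi}[\sigma(\xi)-\frac{1}{2}]$. Both integrals are approximated by simple Riemann sums on a fixed grid, so the numbers are approximate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_bound(a, xi):
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)
    return sigmoid(xi) * np.exp((a - xi) / 2.0 - lam * (a ** 2 - xi ** 2))

a = np.linspace(-10.0, 10.0, 4001)
da = a[1] - a[0]
p = np.exp(-0.5 * (a - 1.0) ** 2) / np.sqrt(2.0 * np.pi)   # assumed p(a)

I = np.sum(sigmoid(a) * p) * da                 # target integral

def F(xi):
    return np.sum(f_bound(a, xi) * p) * da      # lower bound F(xi)

xis = [0.5, 1.0, 2.0, 4.0]
F_values = np.array([F(xi) for xi in xis])
print(I, F_values)
```

Each $F(\xi)$ lies below $I$, and $\xi$ can then be chosen to make the bound as tight as possible.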
10.6 Variational Logistic Regression
10.6.1 Variational posterior distribution
In the variational framework, we seek to maximize a lower bound on the marginal likelihood. For the Bayesian logistic regression model, the marginal likelihood takes the form:
$$p(\mathbf{t}) = \int p(\mathbf{t} | w)p(w)\,dw = \int \left[\prod_{n=1}^{N}p(t_{n} | w) \right]p(w)\, dw$$
The conditional distribution for a single target $t\in\{0,1\}$ can be written as
$$p(t | w) = e^{at}\sigma(-a)$$
where $a = w^{T}\phi$. Using the variational lower bound on the logistic sigmoid function we get:
$$p(t | w) = e^{at}\sigma(-a)\geq e^{at}\sigma(\xi)\exp\left\{ -\frac{a + \xi}{2} - \lambda(\xi)(a^{2} - \xi^{2}) \right\}$$
Using $a=w^T\phi$, introducing one variational parameter $\xi_n$ per data point, and multiplying by the prior distribution, we obtain the bound on the joint distribution of $\mathbf{t}$ and $w$:
$$p(\mathbf{t}, w) = p(\mathbf{t} | w)p(w) \geq h(w, \xi)p(w)$$
where
$$h(w, \xi) = \prod_{n=1}^{N}\sigma(\xi_{n})\exp\left\{w^{T}\phi_{n}t_{n} - (w^{T}\phi_{n} + \xi_{n})/2 - \lambda(\xi_{n})([w^{T}\phi_{n}]^{2} - \xi^{2}_{n}) \right\}$$
With a Gaussian prior $p(w)=\mathcal{N}(w|m_0,S_0)$, the Gaussian variational posterior takes the form:
$$q(w) = \mathcal{N}(w | m_{N}, S_{N})$$
where
$$m_{N} = S_{N}\left( S_{0}^{-1}m_{0} + \sum_{n=1}^{N}\left(t_{n} - \frac{1}{2}\right) \phi_{n} \right)$$
$$S_{N}^{-1} = S_{0}^{-1} + 2\sum_{n=1}^{N}\lambda(\xi_{n})\phi_{n}\phi_{n}^{T}$$
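The posterior update can be sketched together with a re-estimation step for the variational parameters. The re-estimation rule used here, $\xi_n^2=\phi_n^T(S_N+m_Nm_N^T)\phi_n$, is the standard EM result for this model and is brought in as an assumption of the sketch, as are the positive form $\lambda(\xi)=\frac{1}{2\xi}[\sigma(\xi)-\frac{1}{2}]$, the function names and the toy data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # Requires xi != 0; the initial value and updates below keep xi positive.
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def fit_variational_logistic(Phi, t, m0, S0, n_iter=20):
    N, M = Phi.shape
    xi = np.ones(N)                       # initial variational parameters
    S0_inv = np.linalg.inv(S0)
    for _ in range(n_iter):
        # Gaussian posterior q(w) = N(w | m_N, S_N) for the current xi.
        S_N = np.linalg.inv(S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi)
        m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
        # Assumed EM-style update: xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n.
        E_wwT = S_N + np.outer(m_N, m_N)
        xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, E_wwT, Phi))
    return m_N, S_N, xi

# Toy one-dimensional data with a bias feature (made-up values).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
t = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
Phi = np.column_stack([np.ones_like(x), x])
m_N, S_N, xi = fit_variational_logistic(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2))
print(m_N)
```

Each $\xi_n$ only affects the curvature term of the bound for its own data point, which is why the update decouples across $n$.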
10.6.2 Optimizing the variational parameters
Substituting the inequality above back into the marginal likelihood gives:
$$\ln p(\mathbf{t}) = \ln\int p(\mathbf{t} | w)p(w)\,dw \geq \ln \int h(w, \xi)p(w)\, dw = \mathcal{L}(\xi)$$
There are two approaches to determining the $\xi_n$: using the EM algorithm, with $w$ treated as a latent variable, or integrating over $w$ analytically and maximizing $\mathcal{L}(\xi)$ directly.