Pattern Recognition | PRML Chapter 8 Graphical Models

PRML Chapter 8 Graphical Models

8.1 Bayesian Networks

A specific graph can make probabilistic statements for a broad class of distributions. For example, we can write the joint distribution of three variables in the form:

$$p(a, b, c) = p(c|a,b)\,p(a,b) = p(c|a,b)\,p(b|a)\,p(a)$$

For a graph with K nodes, the joint distribution is given by:

$$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k | \text{pa}_k)$$
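
As a concrete illustration (not from the book), the sketch below evaluates this factorization for the three-node graph above, with binary variables and hypothetical conditional probability tables, and checks that the joint sums to one:

```python
# A minimal sketch: evaluating the factorized joint
# p(a, b, c) = p(c | a, b) p(b | a) p(a) for binary variables,
# using hypothetical conditional probability tables.
import numpy as np

p_a = np.array([0.6, 0.4])                      # p(a)
p_b_given_a = np.array([[0.7, 0.3],             # p(b | a), rows indexed by a
                        [0.2, 0.8]])
p_c_given_ab = np.array([[[0.9, 0.1],           # p(c | a, b), indexed [a, b, c]
                          [0.5, 0.5]],
                         [[0.3, 0.7],
                          [0.1, 0.9]]])

def joint(a, b, c):
    """p(a, b, c) = p(c | a, b) p(b | a) p(a)."""
    return p_a[a] * p_b_given_a[a, b] * p_c_given_ab[a, b, c]

# The factorized joint sums to one over all states.
total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(total)  # 1.0 (up to floating-point rounding)
```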

8.1.1 Example: Polynomial regression

The joint distribution can be written as:

$$p(\mathbf{t}, w) = p(w)\prod_{n=1}^{N} p(t_n | w)$$

It may be useful to make the parameters and stochastic variables explicit:

$$p(\mathbf{t}, w | \mathbf{x}, \alpha, \sigma^2) = p(w|\alpha)\prod_{n=1}^{N} p(t_n | w, x_n, \sigma^2)$$

To calculate the posterior distribution of $w$, we note that:

$$p(w|\mathbf{t}) \propto p(w)\prod_{n=1}^{N} p(t_n | w)$$

To predict $\hat{t}$ for a new input value $\hat{x}$, we write down the joint distribution of all the random variables conditioned on the deterministic parameters:

$$p(\hat{t},\mathbf{t},w|\hat{x},\mathbf{x},\alpha,\sigma^2)=\left[\prod_{n=1}^N p(t_n|x_n,w,\sigma^2)\right]p(w|\alpha)\,p(\hat{t}|\hat{x},w,\sigma^2)$$

The required predictive distribution for $\hat{t}$ is then obtained by marginalizing over $w$:

$$p(\hat{t}|\hat{x},\mathbf{x},\mathbf{t},\alpha,\sigma^2)\propto\int p(\hat{t},\mathbf{t},w|\hat{x},\mathbf{x},\alpha,\sigma^2)\,dw$$
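
For the Gaussian case this integral can be evaluated in closed form. The sketch below is a minimal illustration, assuming a polynomial basis, a Gaussian prior $w \sim \mathcal{N}(0, \alpha^{-1}I)$ and Gaussian noise of variance $\sigma^2$ (the standard setting from PRML Chapter 3); the data and the query point $\hat{x}=0.5$ are made up:

```python
# A minimal sketch (assumptions: Gaussian prior w ~ N(0, alpha^{-1} I) and
# Gaussian noise of variance sigma^2) of marginalizing over w to get the
# predictive distribution for a new input x_hat.
import numpy as np

def poly_features(x, degree=3):
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

alpha, sigma2 = 2.0, 0.04
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, np.sqrt(sigma2), 10)

Phi = poly_features(x)
S_N_inv = alpha * np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2
S_N = np.linalg.inv(S_N_inv)
m_N = S_N @ Phi.T @ t / sigma2               # posterior mean of w

phi_hat = poly_features(0.5)[0]
mean_hat = m_N @ phi_hat                     # predictive mean of t_hat
var_hat = sigma2 + phi_hat @ S_N @ phi_hat   # predictive variance of t_hat
print(mean_hat, var_hat)
```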

8.1.2 Generative models

8.1.3 Discrete variables

The probability distribution $p(x|\mu)$ for a single discrete variable $x$ having $K$ possible states is given by:

$$p(x|\mu)=\prod_{k=1}^K \mu_k^{x_k}$$

Suppose we have two discrete variables $x_1$ and $x_2$, each having $K$ states:

$$p(x_1,x_2|\mu)=\prod_{k=1}^K\prod_{l=1}^K \mu_{kl}^{x_{1k}x_{2l}}$$

We can use parameterized models for the conditional distributions. A more parsimonious form of the conditional distribution is obtained by using the logistic sigmoid function:

$$p(y=1|x_1,\dots,x_M)=\sigma\!\left(w_0+\sum_{i=1}^M w_i x_i\right)=\sigma(w^{\mathrm{T}} x)$$
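
A minimal sketch of this parameterization (hypothetical weights and parent states): the sigmoid form needs only $M+1$ parameters, compared with the $O(2^M)$ entries a full conditional probability table over binary parents would require:

```python
# A minimal sketch of the logistic-sigmoid parameterization:
# p(y = 1 | x_1, ..., x_M) uses only M + 1 parameters (w_0, ..., w_M).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([-1.0, 0.8, 0.5, 1.2])   # hypothetical weights, w[0] is the bias w_0
x = np.array([1, 0, 1])               # binary parent states x_1, ..., x_M

p_y1 = sigmoid(w[0] + w[1:] @ x)      # p(y = 1 | x_1, ..., x_M)
print(p_y1)
```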

8.1.4 Linear-Gaussian models

Suppose a single continuous random variable $x_i$ has a Gaussian distribution whose mean is a linear function of the states of its parents:

$$p(x_i | \text{pa}_i) = \mathcal{N}\!\left(x_i \,\Big|\, \sum_{j\in \text{pa}_i} w_{ij}x_j + b_i,\; v_i\right)$$

The log of the distribution is the log of the product of conditionals over all nodes:

$$\ln p(\mathbf{x}) = \sum_{i=1}^{D}\ln p(x_i | \text{pa}_i) = -\sum_{i=1}^{D}\frac{1}{2v_i}\left( x_i - \sum_{j\in \text{pa}_i} w_{ij}x_j - b_i \right)^{2} + \text{const}$$

Each variable $x_i$ has a (conditionally) Gaussian distribution, so we can write:

$$x_i=\sum_{j\in \text{pa}_i}w_{ij}x_j+b_i+\sqrt{v_i}\,\epsilon_i$$

where $\epsilon_i$ is a zero-mean, unit-variance Gaussian noise variable satisfying $E[\epsilon_i]=0$ and $E[\epsilon_i\epsilon_j]=I_{ij}$. We can then obtain the mean and covariance of the joint distribution recursively:

$$E[x_i] = \sum_{j\in \text{pa}_i} w_{ij}E[x_j] + b_i$$

$$\text{cov}[x_i, x_j] = \sum_{k\in \text{pa}_j} w_{jk}\,\text{cov}[x_i, x_k] + I_{ij}v_j$$
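
A minimal sketch of these two recursions on a hypothetical three-node chain $x_1 \rightarrow x_2 \rightarrow x_3$, visiting the nodes in topological order:

```python
# A minimal sketch (hypothetical chain x1 -> x2 -> x3) of the recursions for the
# mean and covariance of the joint Gaussian implied by a linear-Gaussian network.
import numpy as np

parents = {0: [], 1: [0], 2: [1]}          # pa_i for each node
w = {1: {0: 0.5}, 2: {1: -1.0}}            # weights w_ij for j in pa_i
b = np.array([1.0, 0.0, 2.0])              # biases b_i
v = np.array([1.0, 0.5, 0.2])              # conditional variances v_i

D = 3
mu = np.zeros(D)
cov = np.zeros((D, D))
for i in range(D):                         # nodes visited in topological order
    # E[x_i] = sum_{j in pa_i} w_ij E[x_j] + b_i
    mu[i] = b[i] + sum(w[i][j] * mu[j] for j in parents[i])
    for a in range(i + 1):
        # cov[x_a, x_i] = sum_{k in pa_i} w_ik cov[x_a, x_k] + I_{ai} v_i
        cov[a, i] = sum(w[i][k] * cov[a, k] for k in parents[i]) + (v[i] if a == i else 0.0)
        cov[i, a] = cov[a, i]
print(mu)
print(cov)
```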

8.2 Conditional Independence

We say that $a$ is conditionally independent of $b$ given $c$ if we have:

$$p(a | b, c) = p(a | c)$$

We use a shorthand notation for conditional independence in which:

$$a \perp\!\!\!\perp b \mid c$$

8.2.1 Three example graphs

  • Tail-to-Tail Node
  • Head-to-Tail Node
  • Head-to-Head Node

8.2.2 D-separation

Suppose we condition on $\mu$; the joint distribution of the observations can then be written as:

$$p(D|\mu)=\prod_{n=1}^N p(x_n|\mu)$$

The conditional independence properties can also be explored through the Markov blanket (or Markov boundary) of a node: the conditional distribution of $x_i$ given all of the remaining variables can be expressed in the form:

$$p(x_i|x_{\{j\neq i\}})=\frac{p(x_1,\dots,x_D)}{\int p(x_1,\dots,x_D)\,dx_i}=\frac{\prod_k p(x_k|\text{pa}_k)}{\int \prod_k p(x_k|\text{pa}_k)\,dx_i}$$

8.3 Markov Random Fields

8.3.1 Conditional independence properties

8.3.2 Factorization properties

The joint distribution can be written as a product of potential functions $\psi_C(\mathbf{x}_C)$ over the maximal cliques of the graph:

$$p(\mathbf{x}) = \frac{1}{Z}\prod_{C}\psi_C(\mathbf{x}_C)$$

$Z$ is called the partition function; it acts as a normalization constant and is given by:

$$Z = \sum_{\mathbf{x}}\prod_{C}\psi_C(\mathbf{x}_C)$$
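
For small graphs $Z$ can be computed by brute-force enumeration. The sketch below does this for a hypothetical three-node chain MRF with binary variables and a single shared pairwise potential:

```python
# A minimal sketch (hypothetical 3-node chain MRF, binary variables) of
# computing the partition function Z by enumerating all joint states.
import itertools
import numpy as np

def psi(x_c, x_d):
    """Hypothetical pairwise potential favouring equal neighbouring states."""
    return np.exp(1.0 if x_c == x_d else -1.0)

Z = 0.0
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    Z += psi(x1, x2) * psi(x2, x3)   # product over the maximal cliques {1,2}, {2,3}
print(Z)
```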

The potential functions can be expressed in terms of an energy function:

$$\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}$$

8.3.3 Illustration: Image de-noising

The complete energy function for the model takes the form:

$$E(\mathbf{x},\mathbf{y})=h\sum_i x_i-\beta\sum_{\{i,j\}}x_i x_j-\eta\sum_i x_i y_i$$

which defines a joint distribution over $\mathbf{x}$ and $\mathbf{y}$ given by:

$$p(\mathbf{x},\mathbf{y})=\frac{1}{Z}\exp\{-E(\mathbf{x},\mathbf{y})\}$$
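
One simple way to find a low-energy $\mathbf{x}$ given the observed image $\mathbf{y}$ is coordinate-wise minimization (iterated conditional modes). The sketch below is a minimal version of that idea with hypothetical parameter values, not a reproduction of the book's experiment:

```python
# A minimal sketch of minimizing E(x, y) by iterated conditional modes (ICM):
# visit each pixel in turn and keep whichever state x_i in {-1, +1} gives the
# lower energy with its neighbours held fixed; y is the observed noisy image.
import numpy as np

h, beta, eta = 0.0, 1.0, 2.1   # hypothetical parameter values

def icm_denoise(y, n_sweeps=10):
    x = y.copy()                               # initialise x with the noisy image
    H, W = y.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # sum of neighbouring states for the -beta * x_i * x_j terms
                nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W)
                # local energy for x_ij = s is h*s - beta*s*nb - eta*s*y_ij
                e_plus = h - beta * nb - eta * y[i, j]
                e_minus = -h + beta * nb + eta * y[i, j]
                x[i, j] = 1 if e_plus < e_minus else -1
    return x

# Usage: y is a 2-D array of noisy pixel values in {-1, +1}.
```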

8.3.4 Relation to directed graphs

8.4 Inference in Graphical Models

8.4.1 Inference on a chain

The joint distribution for a chain of $N$ nodes takes the form:

$$p(\mathbf{x}) = \frac{1}{Z}\psi_{1,2}(x_1, x_2)\,\psi_{2,3}(x_2, x_3)\cdots\psi_{N-1,N}(x_{N-1}, x_N)$$

The marginal distribution for a specific node $x_n$ is obtained by summing over all of the other variables:

$$p(x_n)=\sum_{x_1}\cdots\sum_{x_{n-1}}\sum_{x_{n+1}}\cdots\sum_{x_N}p(\mathbf{x})$$

A more efficient algorithm is obtained by exploiting the conditional independence properties of the graphical model: the marginal can be rearranged as:

$$\begin{aligned} p(x_n) = \frac{1}{Z} & \left[ \sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[ \sum_{x_2}\psi_{2,3}(x_2, x_3)\left[ \sum_{x_1}\psi_{1,2}(x_1, x_2) \right] \right] \cdots \right] \\ & \left[ \sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1})\cdots \left[ \sum_{x_N}\psi_{N-1,N}(x_{N-1}, x_N) \right] \cdots \right] \end{aligned}$$

We can see that the expression for the marginal $p(x_n)$ decomposes into the product of two factors times the normalization constant:

$$p(x_n) = \frac{1}{Z}\mu_{\alpha}(x_n)\,\mu_{\beta}(x_n)$$

The messages $\mu_{\alpha}(x_n)$ and $\mu_{\beta}(x_n)$ can be evaluated recursively:

$$\mu_{\alpha}(x_n) = \sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n)\,\mu_{\alpha}(x_{n-1})$$

$$\mu_{\beta}(x_n) = \sum_{x_{n+1}}\psi_{n,n+1}(x_n,x_{n+1})\,\mu_{\beta}(x_{n+1})$$

We can calculate the joint distribution of the neighboring nodes:

$$p(x_{n-1}, x_n) = \frac{1}{Z}\mu_{\alpha}(x_{n-1})\,\psi_{n-1,n}(x_{n-1}, x_n)\,\mu_{\beta}(x_n)$$
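
The sketch below runs these recursions on a hypothetical chain of five binary nodes with a single shared pairwise potential, then normalizes $\mu_\alpha(x_n)\mu_\beta(x_n)$ to obtain the marginal of the middle node (the factor $1/Z$ cancels in the normalization):

```python
# A minimal sketch (hypothetical chain of N binary nodes, shared pairwise
# potential) of the forward/backward message recursions and the node marginal.
import numpy as np

N, K = 5, 2
psi = np.array([[2.0, 0.5],
                [0.5, 2.0]])                 # psi_{n-1,n}(x_{n-1}, x_n), shared by all links

# forward messages mu_alpha[n], one vector over the states of x_n
mu_alpha = [None] * N
mu_alpha[0] = np.ones(K)
for n in range(1, N):
    mu_alpha[n] = psi.T @ mu_alpha[n - 1]    # sum over x_{n-1}

# backward messages mu_beta[n]
mu_beta = [None] * N
mu_beta[N - 1] = np.ones(K)
for n in range(N - 2, -1, -1):
    mu_beta[n] = psi @ mu_beta[n + 1]        # sum over x_{n+1}

unnormalised = mu_alpha[2] * mu_beta[2]
p_x2 = unnormalised / unnormalised.sum()     # marginal of the middle node
print(p_x2)
```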

8.4.2 Trees

  • Undirected tree
  • Directed tree
  • Polytree

8.4.3 Factor graphs

A factor graph introduces additional nodes for the factors themselves, in addition to the nodes representing the variables. To achieve this, we write the joint distribution over a set of variables as a product of factors:

$$p(\mathbf{x})=\prod_s f_s(\mathbf{x}_s)$$

8.4.4 The sum-product algorithm

Our goal is to calculate the marginal for a variable node $x$; this marginal is given by the product of the incoming messages along all of the links arriving at that node.

If a leaf node is a variable node, then the message that it sends along its one and only link is given by:

$$\mu_{x\rightarrow f}(x) = 1$$

If the leaf node is a factor node, the message it sends takes the form:

$$\mu_{f\rightarrow x}(x) = f(x)$$

More generally, the messages can be computed recursively:

$$\mu_{f_s\rightarrow x}(x) = \sum_{x_1}\cdots\sum_{x_M} f_s(x, x_1, \dots, x_M)\prod_{m\in \text{ne}(f_s)\setminus x}\mu_{x_m\rightarrow f_s}(x_m)$$

$$\mu_{x_m\rightarrow f_s}(x_m) = \prod_{l\in \text{ne}(x_m)\setminus f_s}\mu_{f_l\rightarrow x_m}(x_m)$$

The marginal distribution of the variable node $x$ is then given by:

$$p(x) = \prod_{s\in \text{ne}(x)}\mu_{f_s\rightarrow x}(x)$$
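
A minimal worked example of these message updates on a hypothetical factor graph with factors $f_a(x_1, x_2)$ and $f_b(x_2, x_3)$ (made-up factor tables), computing the marginal of $x_2$:

```python
# A minimal sketch of sum-product messages on a tiny factor graph:
# variables x1, x2, x3; factors f_a(x1, x2) and f_b(x2, x3).
import numpy as np

f_a = np.array([[1.0, 3.0],       # f_a(x1, x2), indexed [x1, x2]
                [2.0, 1.0]])
f_b = np.array([[4.0, 1.0],       # f_b(x2, x3), indexed [x2, x3]
                [1.0, 2.0]])

# leaf variable nodes x1 and x3 send unit messages to their factors
mu_x1_to_fa = np.ones(2)
mu_x3_to_fb = np.ones(2)

# factor-to-variable messages: sum out every variable except the recipient
mu_fa_to_x2 = (f_a * mu_x1_to_fa[:, None]).sum(axis=0)   # sum over x1
mu_fb_to_x2 = (f_b * mu_x3_to_fb[None, :]).sum(axis=1)   # sum over x3

p_x2 = mu_fa_to_x2 * mu_fb_to_x2                          # product of incoming messages
p_x2 /= p_x2.sum()                                        # normalise to get p(x2)
print(p_x2)
```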

8.4.5 The max-sum algorithm

Our goal is to find a setting of the variables that has the largest probability, and to find the value of that probability. Writing out the max operator in terms of its components:

$$\max_{\mathbf{x}} p(\mathbf{x}) = \max_{x_1}\dots\max_{x_M} p(\mathbf{x})$$

$$p(\mathbf{x}) = \prod_s f_s(\mathbf{x}_s)$$

In terms of message passing, the max-sum algorithm takes the form:

$$\mu_{f\rightarrow x}(x) = \max_{x_1, \dots, x_M}\left[ \ln f(x, x_1, \dots, x_M) + \sum_{m\in \text{ne}(f)\setminus x} \mu_{x_m\rightarrow f}(x_m) \right]$$

$$\mu_{x\rightarrow f}(x) = \sum_{l\in \text{ne}(x)\setminus f}\mu_{f_l\rightarrow x}(x)$$

The initial messages sent by the leaf nodes are given by:

$$\mu_{x\rightarrow f}(x) = 0$$

$$\mu_{f\rightarrow x}(x) = \ln f(x)$$

while at the root node the maximum probability can then be computed, by analogy with the sum-product algorithm, using:

$$p^{\max} = \max_{x}\left[ \sum_{s\in \text{ne}(x)}\mu_{f_s\rightarrow x}(x) \right]$$
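
A minimal sketch of max-sum on the same kind of hypothetical binary chain used above: messages are propagated in the log domain toward the root, and the maximum of the incoming message at the root gives the unnormalized value of $\max_{\mathbf{x}} \ln p(\mathbf{x})$ (i.e. up to the constant $-\ln Z$):

```python
# A minimal sketch of max-sum on a hypothetical chain of N binary nodes with a
# single shared pairwise potential, propagating log-domain messages to the root.
import numpy as np

N = 5
ln_psi = np.log(np.array([[2.0, 0.5],
                          [0.5, 2.0]]))          # ln psi_{n-1,n}, shared by all links

msg = np.zeros(2)                                # leaf variable node sends mu = 0
for _ in range(N - 1):
    # mu_{f->x}(x_n) = max_{x_{n-1}} [ ln psi(x_{n-1}, x_n) + mu(x_{n-1}) ]
    msg = (ln_psi + msg[:, None]).max(axis=0)

ln_p_max = msg.max()                             # unnormalised max log probability
print(ln_p_max)
```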

8.4.6 Exact inference in general graphs

8.4.7 Loopy belief propagation

8.4.8 Learning the graph structure

From a Bayesian viewpoint, we would like to compute a posterior distribution over graph structures and to make predictions by averaging with respect to this distribution.

$$p(m|D)\propto p(m)\,p(D|m)$$
