PRML Chapter 8 Graphical Models
8.1 Bayesian Networks
A specific graph describes probabilistic statements that hold for a whole class of distributions. For three variables, we can write the joint distribution in the form:
$$p(a, b, c) = p(c|a, b)\,p(a, b) = p(c|a, b)\,p(b|a)\,p(a)$$
For a graph with $K$ nodes, the joint distribution is given by:

$$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k | \mathrm{pa}_k)$$

where $\mathrm{pa}_k$ denotes the set of parents of $x_k$.
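As a quick illustration, here is a minimal Python sketch of this factorization for the three-variable graph above, with binary variables and made-up conditional probability tables:

```python
import numpy as np

# p(a,b,c) = p(c|a,b) p(b|a) p(a) for binary a, b, c.
# All table entries are hypothetical.
p_a = np.array([0.6, 0.4])                           # p(a)
p_b_given_a = np.array([[0.7, 0.3],                  # p(b|a=0)
                        [0.2, 0.8]])                 # p(b|a=1)
p_c_given_ab = np.array([[[0.9, 0.1], [0.5, 0.5]],
                         [[0.3, 0.7], [0.2, 0.8]]])  # p(c|a,b), indexed [a, b, c]

def joint(a, b, c):
    """Evaluate p(a,b,c) as the product of the graph's factors."""
    return p_a[a] * p_b_given_a[a, b] * p_c_given_ab[a, b, c]

# The factorization defines a valid distribution: it sums to one.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```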
8.1.1 Example: Polynomial regression
The joint distribution can be written as:
$$p(\mathbf{t}, w) = p(w)\prod_{n=1}^{N} p(t_n | w)$$
It is often useful to make the deterministic parameters and the stochastic variables explicit:
$$p(\mathbf{t}, w | \mathbf{x}, \alpha, \sigma^2) = p(w|\alpha)\prod_{n=1}^{N} p(t_n | w, x_n, \sigma^2)$$
To calculate the posterior distribution of $w$, we note that:

$$p(w|\mathbf{t}) \propto p(w)\prod_{n=1}^{N} p(t_n|w)$$
To predict $\hat{t}$ for a new input value $\hat{x}$, we write down the joint distribution of all the random variables conditioned on the deterministic parameters:

$$p(\hat{t}, \mathbf{t}, w | \hat{x}, \mathbf{x}, \alpha, \sigma^2) = \left[\prod_{n=1}^{N} p(t_n | x_n, w, \sigma^2)\right] p(w|\alpha)\, p(\hat{t} | \hat{x}, w, \sigma^2)$$
The required predictive distribution for $\hat{t}$ is then obtained by marginalizing over $w$:

$$p(\hat{t} | \hat{x}, \mathbf{x}, \mathbf{t}, \alpha, \sigma^2) \propto \int p(\hat{t}, \mathbf{t}, w | \hat{x}, \mathbf{x}, \alpha, \sigma^2)\, dw$$
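For the Gaussian prior $p(w|\alpha)$ and Gaussian noise model used here, this integral is available in closed form (standard Bayesian linear regression). Below is a minimal sketch of the resulting predictive mean and variance; the data, the polynomial degree, and the values of $\alpha$ and the noise precision $\beta = 1/\sigma^2$ are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)   # toy targets
alpha, beta, degree = 2.0, 25.0, 5                   # hypothetical settings

def phi(x):
    """Polynomial basis functions phi(x) = (1, x, ..., x^degree)."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                          # posterior mean

def predict(x_hat):
    """Predictive mean and variance of t_hat at a new input x_hat."""
    ph = phi(x_hat)
    return ph @ m_N, 1.0 / beta + np.sum(ph @ S_N * ph, axis=1)

print(predict(0.5))
```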
8.1.2 Generative models
8.1.3 Discrete variables
The probability distribution $p(x|\mu)$ for a single discrete variable $x$ having $K$ possible states, represented in 1-of-$K$ coding, is given by:

$$p(x|\mu) = \prod_{k=1}^{K}\mu_k^{x_k}$$
Suppose we have two discrete variables, each having $K$ states:

$$p(x_1, x_2|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K} \mu_{kl}^{x_{1k} x_{2l}}$$
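A small sketch showing how the 1-of-$K$ exponents simply pick out one entry of the table $\mu$; here $K = 3$ and $\mu$ is a hypothetical table of joint probabilities carrying $K^2 - 1$ free parameters:

```python
import numpy as np

K = 3
rng = np.random.default_rng(1)
mu = rng.dirichlet(np.ones(K * K)).reshape(K, K)   # hypothetical mu_kl, sums to 1

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

x1, x2 = one_hot(0, K), one_hot(2, K)
# prod_k prod_l mu_kl^{x1k x2l} selects exactly the entry mu[0, 2]
p = np.prod(mu ** np.outer(x1, x2))
print(p, mu[0, 2])   # identical values
```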
We can also use parameterized models for the conditional distributions. A more parsimonious form for the conditional distribution of a binary node with $M$ binary parents is obtained by using the logistic sigmoid function:

$$p(y=1|x_1, \dots, x_M) = \sigma\!\left(w_0 + \sum_{i=1}^{M} w_i x_i\right) = \sigma(w^{\mathrm{T}} x)$$
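A sketch of why this form is parsimonious: a binary node with $M$ binary parents needs only $M + 1$ weights rather than $2^M$ table entries. The weights and parent states below are hypothetical:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([-1.0, 0.5, 2.0, -0.3])   # w_0 plus M = 3 parent weights (hypothetical)
parents = np.array([1, 1, 0])           # states of x_1, ..., x_M
x = np.concatenate(([1.0], parents))    # prepend x_0 = 1 to absorb the bias w_0
print(sigmoid(w @ x))                   # p(y = 1 | x_1, ..., x_M)
```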
8.1.4 Linear-Gaussian models
Suppose each continuous random variable $x_i$ has a Gaussian distribution whose mean is a linear function of the states of its parents:

$$p(x_i | \mathrm{pa}_i) = \mathcal{N}\!\left(x_i \,\Big|\, \sum_{j\in \mathrm{pa}_i} w_{ij} x_j + b_i,\; v_i\right)$$
The log of the joint distribution is then the sum of the log-conditionals over all nodes:

$$\ln p(\mathbf{x}) = \sum_{i=1}^{D} \ln p(x_i | \mathrm{pa}_i) = -\sum_{i=1}^{D}\frac{1}{2v_i}\left(x_i - \sum_{j\in \mathrm{pa}_i} w_{ij} x_j - b_i\right)^2 + \mathrm{const}$$
Since each $x_i$ is Gaussian given its parents, we can write:

$$x_i = \sum_{j\in \mathrm{pa}_i} w_{ij} x_j + b_i + \sqrt{v_i}\,\epsilon_i$$

where $\epsilon_i$ satisfies $E[\epsilon_i] = 0$ and $E[\epsilon_i \epsilon_j] = I_{ij}$. We can then obtain the mean and covariance of the joint distribution recursively:
$$E[x_i] = \sum_{j\in \mathrm{pa}_i} w_{ij} E[x_j] + b_i$$

$$\mathrm{cov}[x_i, x_j] = \sum_{k\in \mathrm{pa}_j} w_{jk}\,\mathrm{cov}[x_i, x_k] + I_{ij} v_j$$
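These recursions can be evaluated by visiting the nodes in topological order, so that a node's parents are always processed first. A minimal sketch for the chain $x_1 \rightarrow x_2 \rightarrow x_3$ with made-up weights, biases, and noise variances:

```python
import numpy as np

# Hypothetical chain x1 -> x2 -> x3 (0-indexed below).
parents = {0: [], 1: [0], 2: [1]}
w = {(1, 0): 0.5, (2, 1): -1.0}          # w_ij for each edge j -> i
b = np.array([1.0, 0.0, 2.0])
v = np.array([1.0, 0.5, 0.2])
D = 3

mean = np.zeros(D)
cov = np.zeros((D, D))
for i in range(D):                       # topological order
    mean[i] = sum(w[i, j] * mean[j] for j in parents[i]) + b[i]
    for a in range(i + 1):               # fill row/column i from earlier entries
        c = sum(w[i, j] * cov[a, j] for j in parents[i])
        if a == i:
            c += v[i]
        cov[i, a] = cov[a, i] = c
print(mean)
print(cov)
```

For this chain the entries can be checked by hand, e.g. $\mathrm{var}[x_2] = w_{21}^2 v_1 + v_2 = 0.75$.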
8.2 Conditional Independence
We say that $a$ is conditionally independent of $b$ given $c$ if:

$$p(a | b, c) = p(a | c)$$
We use a shorthand notation for conditional independence:

$$a \perp\!\!\!\perp b \mid c$$
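Conditional independence is a property of the joint distribution, so it can be verified numerically. The sketch below builds $p(a, b, c)$ from a tail-to-tail graph $a \leftarrow c \rightarrow b$ (with random, hypothetical tables), so $a \perp\!\!\!\perp b \mid c$ holds by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
p_c = rng.dirichlet(np.ones(2))                 # p(c)
p_a_c = rng.dirichlet(np.ones(2), size=2).T    # p(a|c), indexed [a, c]
p_b_c = rng.dirichlet(np.ones(2), size=2).T    # p(b|c), indexed [b, c]
joint = np.einsum('ac,bc,c->abc', p_a_c, p_b_c, p_c)   # p(a,b,c)

pc = joint.sum(axis=(0, 1))                    # marginal p(c)
p_ab_given_c = joint / pc                      # p(a,b|c)
factored = np.einsum('ac,bc->abc',
                     joint.sum(axis=1) / pc,   # p(a|c)
                     joint.sum(axis=0) / pc)   # p(b|c)
print(np.allclose(p_ab_given_c, factored))     # True: p(a,b|c) = p(a|c) p(b|c)
```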
8.2.1 Three example graphs
- Tail-to-tail node: the path between its children is blocked once the node is observed.
- Head-to-tail node: observing the node blocks the path along the chain.
- Head-to-head node: the path is blocked when the node is unobserved, and becomes unblocked once the node, or any of its descendants, is observed.
8.2.2 D-separation
Suppose we condition on $\mu$; the joint distribution of the observations then factorizes as:

$$p(D|\mu) = \prod_{n=1}^{N} p(x_n|\mu)$$
The conditional independence properties can also be explored through the Markov blanket (or Markov boundary) of a node. The conditional distribution of $x_i$ given all the remaining variables can be expressed in the form:

$$p(x_i | x_{\{j\neq i\}}) = \frac{p(x_1, \dots, x_D)}{\int p(x_1, \dots, x_D)\, dx_i} = \frac{\prod_k p(x_k | \mathrm{pa}_k)}{\int \prod_k p(x_k | \mathrm{pa}_k)\, dx_i}$$

Every factor that does not depend on $x_i$ cancels between numerator and denominator, so only the parents, children, and co-parents of $x_i$ remain; these nodes constitute its Markov blanket.
8.3 Markov Random Fields
8.3.1 Conditional independence properties
8.3.2 Factorization properties
The joint distribution can be written as a product of potential functions $\psi_C(x_C)$ over the maximal cliques of the graph:

$$p(x) = \frac{1}{Z}\prod_{C}\psi_C(x_C)$$
Here $Z$, called the partition function, acts as a normalization constant and is given by:

$$Z = \sum_{x}\prod_{C}\psi_C(x_C)$$
Provided the potentials are strictly positive, they can be expressed in terms of an energy function:

$$\psi_C(x_C) = \exp\{-E(x_C)\}$$
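Because $Z$ sums over every joint configuration, it is exponentially expensive in general, but for a tiny model it can be evaluated by brute force. A sketch for a three-node binary chain with maximal cliques $\{x_0, x_1\}$ and $\{x_1, x_2\}$ and a made-up energy that favors agreement between neighbors:

```python
import numpy as np
from itertools import product

cliques = [(0, 1), (1, 2)]              # maximal cliques of a 3-node chain

def psi(xc):
    """Potential exp{-E(x_C)} with a hypothetical agreement energy."""
    E = 0.0 if xc[0] == xc[1] else 1.0
    return np.exp(-E)

# Partition function: sum the product of clique potentials over all states.
Z = sum(np.prod([psi(tuple(x[i] for i in C)) for C in cliques])
        for x in product([0, 1], repeat=3))

def p(x):
    return np.prod([psi(tuple(x[i] for i in C)) for C in cliques]) / Z

print(Z, sum(p(x) for x in product([0, 1], repeat=3)))   # the p(x) sum to 1
```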
8.3.3 Illustration: Image de-noising
The complete energy function for the model takes the form:
$$E(x, y) = h\sum_i x_i - \beta\sum_{\{i,j\}} x_i x_j - \eta\sum_i x_i y_i$$
which defines a joint distribution over $x$ and $y$ given by:

$$p(x, y) = \frac{1}{Z}\exp\{-E(x, y)\}$$
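De-noising then amounts to seeking a low-energy $x$ given the observed image $y$. One simple method used in PRML for this model is ICM (iterated conditional modes): repeatedly set each $x_i$ to whichever of $\{-1, +1\}$ lowers $E(x, y)$ with the other pixels held fixed. The sketch below uses a toy $16 \times 16$ image; the coefficients and noise level are made up:

```python
import numpy as np

h, beta, eta = 0.0, 1.0, 2.1
rng = np.random.default_rng(3)
clean = -np.ones((16, 16), dtype=int)
clean[4:12, 4:12] = 1                                        # a simple square "image"
y = clean * np.where(rng.random(clean.shape) < 0.1, -1, 1)   # flip ~10% of pixels
x = y.copy()

for _ in range(5):                                           # a few ICM sweeps
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            nbrs = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                       if 0 <= a < x.shape[0] and 0 <= b < x.shape[1])
            # pick the state with lower energy: x_ij = +1 iff
            # E(x_ij = +1) < E(x_ij = -1)
            x[i, j] = 1 if (-h + beta * nbrs + eta * y[i, j]) > 0 else -1

print((y != clean).mean(), (x != clean).mean())              # error rate before/after
```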
8.3.4 Relation to directed graphs
8.4 Inference in Graphical Models
8.4.1 Inference on a chain
The joint distribution for a chain of $N$ nodes takes the form:

$$p(x) = \frac{1}{Z}\,\psi_{1,2}(x_1, x_2)\,\psi_{2,3}(x_2, x_3)\cdots\psi_{N-1,N}(x_{N-1}, x_N)$$
The marginal distribution for a specific node $x_n$ is obtained by summing over all the other variables:

$$p(x_n) = \sum_{x_1}\cdots\sum_{x_{n-1}}\sum_{x_{n+1}}\cdots\sum_{x_N} p(x)$$
A more efficient algorithm is obtained by exploiting the conditional independence properties of the graphical model: rearranging the sums and products, the marginal can be expressed as:

$$
\begin{aligned}
p(x_n) = \frac{1}{Z} & \left[\sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n)\cdots\left[\sum_{x_2}\psi_{2,3}(x_2, x_3)\left[\sum_{x_1}\psi_{1,2}(x_1, x_2)\right]\right]\cdots\right] \\
& \left[\sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1})\cdots\left[\sum_{x_N}\psi_{N-1,N}(x_{N-1}, x_N)\right]\cdots\right]
\end{aligned}
$$
We can see that the expression for the marginal $p(x_n)$ decomposes into the product of two factors times the normalization constant:

$$p(x_n) = \frac{1}{Z}\mu_{\alpha}(x_n)\,\mu_{\beta}(x_n)$$
The messages $\mu_{\alpha}(x_n)$ and $\mu_{\beta}(x_n)$ can be evaluated recursively:

$$\mu_{\alpha}(x_n) = \sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n)\,\mu_{\alpha}(x_{n-1})$$

$$\mu_{\beta}(x_n) = \sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1})\,\mu_{\beta}(x_{n+1})$$
Similarly, we can calculate the joint distribution of two neighboring nodes:

$$p(x_{n-1}, x_n) = \frac{1}{Z}\mu_{\alpha}(x_{n-1})\,\psi_{n-1,n}(x_{n-1}, x_n)\,\mu_{\beta}(x_n)$$
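A sketch of the full forward-backward computation on a chain of $N$ discrete nodes with $K$ states, using random (hypothetical) pairwise potentials. Each message is a length-$K$ vector, so the total cost is $O(NK^2)$ rather than the $O(K^N)$ of the naive sum:

```python
import numpy as np

N, K = 5, 3
rng = np.random.default_rng(4)
psi = [rng.random((K, K)) for _ in range(N - 1)]   # psi[n](x_n, x_{n+1})

alpha = [np.ones(K)]                    # message into node 1 (nothing to its left)
for n in range(N - 1):
    alpha.append(psi[n].T @ alpha[-1])  # mu_alpha: sum over x_n

beta = [np.ones(K)]                     # message into node N (nothing to its right)
for n in reversed(range(N - 1)):
    beta.append(psi[n] @ beta[-1])      # mu_beta: sum over x_{n+1}
beta = beta[::-1]

Z = alpha[-1] @ beta[-1]                # the same value at every node
marginals = [a * b / Z for a, b in zip(alpha, beta)]
print(marginals[2], marginals[2].sum())  # p(x_3), sums to 1
```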
8.4.2 Trees
- Undirected tree: an undirected graph in which there is one, and only one, path between any pair of nodes.
- Directed tree: a directed graph with a single root node, in which every other node has exactly one parent.
- Polytree: a directed graph whose nodes may have several parents but whose underlying undirected graph remains a tree.
8.4.3 Factor graphs
Factor graphs introduce additional nodes for the factors themselves, in addition to the nodes representing the variables. To achieve this, we write the joint distribution over a set of variables as a product of factors:

$$p(x) = \prod_{s} f_s(x_s)$$
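A factor graph can be represented as plain data: each factor records the variables in its scope together with its table. The variable names and tables below are hypothetical:

```python
import numpy as np

# Each factor: (scope of variables, table indexed by their states).
factors = {
    "f_a": (("x1", "x2"), np.array([[0.9, 0.1], [0.4, 0.6]])),
    "f_b": (("x2", "x3"), np.array([[0.5, 0.5], [0.2, 0.8]])),
    "f_c": (("x2",),      np.array([0.3, 0.7])),
}

def unnormalized_p(assignment):
    """Evaluate prod_s f_s(x_s) at a full assignment {variable: state}."""
    p = 1.0
    for scope, table in factors.values():
        p *= table[tuple(assignment[v] for v in scope)]
    return p

print(unnormalized_p({"x1": 0, "x2": 1, "x3": 0}))
```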
8.4.4 The sum-product algorithm
Our goal is to calculate the marginal for a variable node $x$; this marginal is given by the product of the incoming messages along all of the links arriving at that node.
If a leaf node is a variable node, then the message that it sends along its one and only link is given by:

$$\mu_{x\rightarrow f}(x) = 1$$
If the leaf node is a factor node, the message it sends takes the form:

$$\mu_{f\rightarrow x}(x) = f(x)$$
The two kinds of messages can then be computed recursively. A factor-to-variable message sums the factor, weighted by the incoming variable messages, over all of its variables except the recipient, while a variable-to-factor message is the product of the incoming messages from the variable's other neighboring factors:

$$\mu_{f_s\rightarrow x}(x) = \sum_{x_1}\cdots\sum_{x_M} f_s(x, x_1, \dots, x_M)\prod_{m\in \mathrm{ne}(f_s)\setminus x}\mu_{x_m\rightarrow f_s}(x_m)$$

$$\mu_{x_m\rightarrow f_s}(x_m) = \prod_{l\in \mathrm{ne}(x_m)\setminus f_s}\mu_{f_l\rightarrow x_m}(x_m)$$
The marginal distribution of a variable node $x$ is then:

$$p(x) = \prod_{s\in \mathrm{ne}(x)}\mu_{f_s\rightarrow x}(x)$$
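A worked sketch of these message updates on the chain-structured factor graph $x_1 - f_a - x_2 - f_b - x_3$, computing the marginal of $x_2$; the factor tables are hypothetical, and the final normalization covers the case where the product of factors is unnormalized:

```python
import numpy as np

f_a = np.array([[0.9, 0.1], [0.4, 0.6]])   # f_a(x1, x2)
f_b = np.array([[0.5, 0.5], [0.2, 0.8]])   # f_b(x2, x3)

# Leaf variable nodes send the unit message mu_{x->f}(x) = 1.
mu_x1_fa = np.ones(2)
mu_x3_fb = np.ones(2)

# Factor-to-variable messages: sum out everything except the recipient.
mu_fa_x2 = f_a.T @ mu_x1_fa                # sum over x1 of f_a(x1, x2)
mu_fb_x2 = f_b @ mu_x3_fb                  # sum over x3 of f_b(x2, x3)

# The marginal at x2 is the product of its incoming messages.
p_x2 = mu_fa_x2 * mu_fb_x2
p_x2 /= p_x2.sum()                         # normalize
print(p_x2)
```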
8.4.5 The max-sum algorithm
Our goal is now to find a setting of the variables that has the largest probability, and to find the value of that probability. We simply write out the max operator in terms of its components:

$$\max_{x} p(x) = \max_{x_1}\cdots\max_{x_M} p(x)$$

where, as before,

$$p(x) = \prod_{s} f_s(x_s)$$
Working in the log domain, the max-sum algorithm can be expressed in terms of message passing:

$$\mu_{f\rightarrow x}(x) = \max_{x_1, \dots, x_M}\left[\ln f(x, x_1, \dots, x_M) + \sum_{m\in \mathrm{ne}(f)\setminus x}\mu_{x_m\rightarrow f}(x_m)\right]$$

$$\mu_{x\rightarrow f}(x) = \sum_{l\in \mathrm{ne}(x)\setminus f}\mu_{f_l\rightarrow x}(x)$$
The initial messages sent by the leaf nodes are given by:

$$\mu_{x\rightarrow f}(x) = 0$$

$$\mu_{f\rightarrow x}(x) = \ln f(x)$$
At the root node, the maximum probability can then be computed, by analogy with the sum-product algorithm, using:

$$p^{\max} = \max_{x}\left[\sum_{s\in \mathrm{ne}(x)}\mu_{f_s\rightarrow x}(x)\right]$$
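A sketch of max-sum on the same chain $x_1 - f_a - x_2 - f_b - x_3$ used in the sum-product example, working in the log domain; back-pointers are stored during the forward pass so that backtracking recovers the maximizing configuration itself, not just the maximum value:

```python
import numpy as np

f_a = np.array([[0.9, 0.1], [0.4, 0.6]])   # f_a(x1, x2)
f_b = np.array([[0.5, 0.5], [0.2, 0.8]])   # f_b(x2, x3)
log_fa, log_fb = np.log(f_a), np.log(f_b)

# Forward pass toward the root x3; the leaf x1 sends the zero message.
msg_fa_x2 = log_fa.max(axis=0)                     # max over x1
back_x1 = log_fa.argmax(axis=0)                    # best x1 for each x2
scores = log_fb + msg_fa_x2[:, None]               # indexed [x2, x3]
msg_fb_x3 = scores.max(axis=0)                     # max over x2
back_x2 = scores.argmax(axis=0)                    # best x2 for each x3

# At the root: take the max, then backtrack through the stored pointers.
x3 = int(msg_fb_x3.argmax())
x2 = int(back_x2[x3])
x1 = int(back_x1[x2])
print("max log-potential:", msg_fb_x3.max(), " argmax:", (x1, x2, x3))
```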
8.4.6 Exact inference in general graphs
8.4.7 Loopy belief propagation
8.4.8 Learning the graph structure
From a Bayesian viewpoint, we would like to compute a posterior distribution over graph structures, indexed by $m$, and to make predictions by averaging with respect to this distribution:

$$p(m|D) \propto p(m)\,p(D|m)$$