PRML Chapter 4 Linear Models for Classification
4.1 Discriminant Functions
4.1.1 Two classes
The simplest representation of a linear discriminant function can be expressed as:
$$y(x) = w^{T}x + w_{0}$$
Any point $x$ on the decision surface satisfies $y(x) = 0$, so the normal distance from the origin to the decision surface is given by
$$\frac{w^{T}x}{\|w\|} = -\frac{w_{0}}{\|w\|}$$
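As a small numerical illustration of this geometry, the sketch below evaluates $y(x)$ and the signed perpendicular distance $y(x)/\|w\|$ of a point from the decision surface. The parameter values and the helper names `discriminant` and `signed_distance` are assumptions chosen for illustration only.

```python
import math

def discriminant(w, w0, x):
    """Evaluate y(x) = w^T x + w0 for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def signed_distance(w, w0, x):
    """Signed perpendicular distance of x from the surface y(x) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return discriminant(w, w0, x) / norm_w

# Hypothetical parameters with ||w|| = 5
w, w0 = [3.0, 4.0], -10.0

# Distance from the origin to the surface: -w0 / ||w||
origin_distance = -w0 / math.sqrt(sum(wi * wi for wi in w))
print(origin_distance)

# The origin itself lies on the negative side of the surface
print(signed_distance(w, w0, [0.0, 0.0]))
```

The sign of the distance tells us on which side of the decision surface a point lies, which is exactly the class decision made by $y(x)$.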
4.1.2 Multiple classes
Consider a single $K$-class discriminant comprising $K$ linear functions of the form
$$y_{k}(x) = w_{k}^{T}x + w_{k0}$$
We assign a point $x$ to class $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$. The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(x) = y_j(x)$, and the corresponding $(D-1)$-dimensional hyperplane is
$$(w_{k} - w_{j})^{T}x + (w_{k0} - w_{j0}) = 0$$
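The decision rule can be sketched in a few lines: compute all $K$ scores and pick the largest. The weights below are hypothetical, and `classify` is an illustrative helper name, not something defined in the text.

```python
def classify(W, w0, x):
    """Return the index k that maximizes y_k(x) = w_k^T x + w_k0."""
    scores = [sum(wi * xi for wi, xi in zip(wk, x)) + bk
              for wk, bk in zip(W, w0)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical 3-class problem in 2-D (one weight vector per class)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w0 = [0.0, 0.0, 0.0]

label_a = classify(W, w0, [2.0, 0.5])    # scores: 2.0, 0.5, -2.5
label_b = classify(W, w0, [-1.0, -3.0])  # scores: -1.0, -3.0, 4.0
print(label_a, label_b)
```

Because every point receives exactly one label, this construction avoids the ambiguous regions that arise when combining several two-class discriminants.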
4.1.3 Least squares for classification
Each class $C_k$ is described by its own linear model so that
$$y_k(x) = w_k^{T}x + w_{k0}$$
and we can group these together to get
$$y(x) = \tilde{W}^{T}\tilde{x}$$
where the $k$-th column of $\tilde{W}$ is $\tilde{w}_k = (w_{k0}, w_k^{T})^{T}$ and $\tilde{x} = (1, x^{T})^{T}$. With $\tilde{X}$ the matrix whose $n$-th row is $\tilde{x}_n^{T}$ and $T$ the matrix whose $n$-th row is the target vector $t_n^{T}$, the sum-of-squares error function can be written as
$$E_{D}(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{ (\tilde{X}\tilde{W} - T)^{T}(\tilde{X}\tilde{W} - T) \right\}$$
Setting the derivative with respect to $\tilde{W}$ to zero, we obtain the solution
$$\tilde{W} = (\tilde{X}^{T}\tilde{X})^{-1}\tilde{X}^{T}T = \tilde{X}^{\dagger}T$$
where $\tilde{X}^{\dagger}$ is the pseudo-inverse of $\tilde{X}$. We then obtain the discriminant function in the form
$$y(x) = \tilde{W}^{T}\tilde{x} = T^{T}(\tilde{X}^{\dagger})^{T}\tilde{x}$$
4.1.4 Fisher’s linear discriminant
The mean vectors of the two classes are given by
$$m_1=\frac{1}{N_1}\sum_{n\in C_1} x_n,\quad m_2=\frac{1}{N_2}\sum_{n\in C_2}x_n$$
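To reduce the overlap of the classes in the projected space, we might choose $w$ to maximize the separation of the projected class means, $m_{2} - m_{1} = w^{T}(\mathbf{m}_{2} - \mathbf{m}_{1})$, where on the left-hand side $m_k = w^{T}\mathbf{m}_k$ denotes the projected mean of class $C_k$.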
The within-class variance of the transformed data from class $C_k$ is given by:
$$s_{k}^{2} = \sum_{n\in C_{k}}(y_{n} - m_{k})^{2}$$
where $y_n = w^{T}x_n$. The Fisher criterion is defined to be the ratio of the between-class variance to the total within-class variance:
$$J(w) = \frac{(m_{2} - m_{1})^{2}}{s_{1}^{2} + s_{2}^{2}}$$
The Fisher criterion can be rewritten in terms of the between-class covariance matrix $S_B$ and the within-class covariance matrix $S_W$:
$$S_{B} = (m_{2} - m_{1})(m_{2} - m_{1})^{T}$$
$$S_{W} = \sum_{n\in C_{1}}(x_{n} - m_{1})(x_{n} - m_{1})^{T} + \sum_{n\in C_{2}}(x_{n}-m_{2})(x_{n} - m_{2})^{T}$$
$$J(w) = \frac{w^{T}S_{B}w}{w^{T}S_{W}w}$$
Differentiating $J(w)$ with respect to $w$, we find that $J(w)$ is maximized when:
$$(w^{T}S_{B}w)S_{W}w = (w^{T}S_{W}w)S_{B}w$$
Since $S_B w$ always points in the direction of $(m_2 - m_1)$, and the scalar factors do not affect the direction of $w$, we then obtain:
$$w \propto S_{W}^{-1}(m_{2} - m_{1})$$
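The result $w \propto S_W^{-1}(m_2 - m_1)$ can be checked on a toy data set. The sketch below is restricted to 2-D inputs so that the matrix inverse can be written in closed form; the data points and the helper names `mean`, `scatter` and `fisher_direction` are assumptions made for illustration.

```python
def mean(points):
    """Mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """Sum of outer products (x - m)(x - m)^T as a 2x2 nested list."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S

def fisher_direction(class1, class2):
    """Return S_W^{-1} (m2 - m1) for 2-D data."""
    m1, m2 = mean(class1), mean(class2)
    S1, S2 = scatter(class1, m1), scatter(class2, m2)
    # Entries of S_W = S1 + S2, written as [[a, b], [c, d]]
    a, b = S1[0][0] + S2[0][0], S1[0][1] + S2[0][1]
    c, d = S1[1][0] + S2[1][0], S1[1][1] + S2[1][1]
    det = a * d - b * c
    dm = [m2[0] - m1[0], m2[1] - m1[1]]
    # Closed-form inverse of a 2x2 matrix applied to dm
    return [(d * dm[0] - b * dm[1]) / det, (-c * dm[0] + a * dm[1]) / det]

# Hypothetical toy data
class1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]

w = fisher_direction(class1, class2)
proj1 = [w[0] * x[0] + w[1] * x[1] for x in class1]
proj2 = [w[0] * x[0] + w[1] * x[1] for x in class2]
print(w, max(proj1), min(proj2))
```

For this data the projections of the two classes do not overlap, so a single threshold on $w^{T}x$ separates them. Only the direction of $w$ matters; its scale is arbitrary.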
4.1.5 Relation to least squares
The Fisher solution can be obtained as a special case of least squares for the two-class problem. For class $C_1$ we take the targets to be $N/N_1$ and for $C_2$ to be $-N/N_2$, where $N = N_1 + N_2$ is the total number of data points.
The sum-of-squares error function can be written:
$$E=\frac{1}{2}\sum_{n=1}^{N}(w^{T}x_n+w_0-t_n)^2$$
Setting the derivatives of $E$ with respect to $w_0$ and $w$ to zero, we obtain:
$$\sum_{n=1}^{N}(w^{T}x_n+w_0-t_n)=0$$
$$\sum_{n=1}^{N}(w^{T}x_n+w_0-t_n)x_n=0$$
Thus we can get:
$$w_0=-w^{T}m$$
$$\left(S_W+\frac{N_1N_2}{N}S_B\right)w=N(m_1-m_2)\ \rightarrow\ w\propto S_W^{-1}(m_2-m_1)$$
4.1.6 Fisher’s discriminant for multiple classes
For the multi-class problem, as in the two-class case, we define the following quantities in the input space:
- Mean vectors
$$m_k=\frac{1}{N_k}\sum_{n\in C_k}x_n,\quad m=\frac{1}{N}\sum_{n=1}^{N}x_n=\frac{1}{N}\sum_{k=1}^{K}N_km_k$$
- Within-class covariance matrix
$$S_W=\sum_{k=1}^{K} S_k,\quad S_k=\sum_{n\in C_k}(x_n-m_k)(x_n-m_k)^{T}$$
- Between-class covariance matrix
$$S_B=\sum_{k=1}^{K} N_k(m_k-m)(m_k-m)^{T}$$
- Total covariance matrix
$$S_T=\sum_{n=1}^{N}(x_n-m)(x_n-m)^{T},\quad S_T=S_W+S_B$$
Next we introduce $D' > 1$ linear ‘features’ $y_k=w_k^{T}x$, and the feature values can be grouped together: $y=W^{T}x$, where the $w_k$ are the columns of $W$. We can define similar matrices in the projected $D'$-dimensional $y$-space:
$$s_W=\sum_{k=1}^{K}\sum_{n\in C_k}(y_n-\mu_k)(y_n-\mu_k)^{T}$$
$$s_B=\sum_{k=1}^{K} N_k(\mu_k-\mu)(\mu_k-\mu)^{T}$$
$$\mu_k=\frac{1}{N_k}\sum_{n\in C_k}y_n,\quad \mu=\frac{1}{N}\sum_{k=1}^{K}N_k \mu_k$$
One of the many possible criteria is $J(W)=\mathrm{Tr}(s_W^{-1}s_B)$. Since $y=W^{T}x$ implies $s_W=W^{T}S_WW$ and $s_B=W^{T}S_BW$, we should maximize
$$J(W)=\mathrm{Tr}\left\{(W^{T}S_WW)^{-1}(W^{T}S_BW)\right\}$$
4.1.7 The perceptron algorithm
A generalized linear model takes the form:
$$y(x) = f(w^{T}\phi(x))$$
where the nonlinear activation function is given by:
$$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$
An alternative error function, known as the perceptron criterion, is:
$$E_{P}(w) = -\sum_{n\in M}w^{T}\phi_{n}t_{n}$$
where $\phi_n = \phi(x_n)$, the targets are $t_n \in \{-1, +1\}$, and $M$ denotes the set of misclassified patterns. Applying stochastic gradient descent to this error function gives the update:
$$w^{(\tau+1)}=w^{(\tau)}-\eta\nabla E_P(w)=w^{(\tau)}+\eta\phi_n t_n$$
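The update rule can be sketched as a short training loop: cycle through the patterns and, whenever one is misclassified, add $\eta\phi_n t_n$ to the weights. The data set, the constant bias feature in the first position of each $\phi_n$, and the function name `perceptron_train` are assumptions for illustration.

```python
def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Apply w <- w + eta * phi_n * t_n to each misclassified pattern."""
    w = [0.0] * len(Phi[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, target in zip(Phi, t):
            a = sum(wi * pi for wi, pi in zip(w, phi))
            prediction = 1 if a >= 0 else -1   # activation f(a)
            if prediction != target:
                w = [wi + eta * pi * target for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:                      # every pattern is correct
            break
    return w

# Hypothetical linearly separable data; first feature is a constant bias
Phi = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, -1.0, -1.0], [1.0, -2.0, -3.0]]
t = [1, 1, -1, -1]

w = perceptron_train(Phi, t)
predictions = [1 if sum(wi * pi for wi, pi in zip(w, phi)) >= 0 else -1
               for phi in Phi]
print(w, predictions)
```

The loop stops as soon as a full pass makes no mistakes. For data that is not linearly separable it would simply run until `max_epochs` is reached.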
4.2 Probabilistic Generative Models
For the two-class problem, the posterior probability for class $C_1$ can be written as:
$$p(C_{1} | x) = \sigma(a)$$
where $a = \ln\frac{p(x|C_{1})p(C_{1})}{p(x|C_{2})p(C_{2})}$ and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid.
For the case of $K>2$ classes, we have:
$$p(C_{k} | x) = \frac{p(x|C_{k})p(C_{k})}{\sum_{j}p(x|C_{j})p(C_{j})} = \frac{\exp(a_{k})}{\sum_{j}\exp(a_{j})}$$
where $a_{k} = \ln\left(p(x | C_{k})p(C_{k})\right)$. This normalized exponential is also known as the softmax function.
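To see that the sigmoid and softmax forms are just Bayes' theorem rewritten, the sketch below compares them with a direct computation of the posterior. The likelihood and prior values are hypothetical, and the overflow-safe formulations are an implementation choice rather than part of the derivation.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def softmax(a):
    """Normalized exponential; subtracting max(a) leaves the result unchanged."""
    shift = max(a)
    exps = [math.exp(ak - shift) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values of p(x|C_k) and p(C_k) at some fixed x
likelihood = [0.30, 0.10]
prior = [0.40, 0.60]

joint = [l * p for l, p in zip(likelihood, prior)]
posterior_bayes = joint[0] / sum(joint)

a = math.log(joint[0] / joint[1])
posterior_sigmoid = sigmoid(a)
posterior_softmax = softmax([math.log(j) for j in joint])
print(posterior_bayes, posterior_sigmoid, posterior_softmax[0])
```

All three numbers agree up to rounding, since the two-class sigmoid is the $K=2$ special case of the softmax.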
4.2.1 Continuous inputs
Assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix. The density for class $C_k$ is then given by:
$$p(x | C_{k}) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{ -\frac{1}{2}(x-\mu_{k})^{T}\Sigma^{-1}(x-\mu_{k}) \right\}$$
Considering first the case of two classes, we have:
$$p(C_{1} | x) = \sigma(w^{T}x + w_{0})$$
where $w = \Sigma^{-1}(\mu_{1} - \mu_{2})$ and $w_{0} = -\frac{1}{2}\mu_{1}^{T}\Sigma^{-1}\mu_{1} + \frac{1}{2}\mu_{2}^{T}\Sigma^{-1}\mu_{2} + \ln\frac{p(C_{1})}{p(C_{2})}$.
For the general case of $K$ classes, we have:
$$a_k(x)=w_k^{T}x+w_{k0}$$
where $w_k=\Sigma^{-1}\mu_k$ and $w_{k0}=-\frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k+\ln p(C_k)$.
4.2.2 Maximum likelihood solution
Having specified a parametric form for the class-conditional densities, we can determine the values of the parameters and the prior probabilities using maximum likelihood.
For the case of two classes, with $t_n = 1$ denoting class $C_1$, $t_n = 0$ denoting class $C_2$, and prior $p(C_1) = \pi$, the likelihood function is:
$$p(\mathbf{t}, \mathbf{X} | \pi, \mu_{1}, \mu_{2}, \Sigma) = \prod_{n=1}^{N}[\pi N(x_{n} | \mu_{1}, \Sigma)]^{t_{n}}[(1-\pi)N(x_{n} | \mu_{2}, \Sigma)]^{1-t_{n}}$$
The solution is:
$$\pi = \frac{N_{1}}{N}$$
$$\mu_{1} = \frac{1}{N_{1}}\sum_{n=1}^{N}t_{n}x_{n}$$
$$\mu_{2} = \frac{1}{N_{2}}\sum_{n=1}^{N}(1-t_{n})x_{n}$$
$$\Sigma = \frac{N_{1}}{N}S_{1} + \frac{N_{2}}{N}S_{2}$$
where $S_{1} = \frac{1}{N_{1}}\sum_{n\in C_{1}}(x_{n} - \mu_{1})(x_{n} - \mu_{1})^{T}$ and $S_{2} = \frac{1}{N_{2}}\sum_{n\in C_{2}}(x_{n} - \mu_{2})(x_{n} - \mu_{2})^{T}$.
The multi-class case follows the same idea.
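As a rough illustration of the maximum likelihood solution, the sketch below fits the two-class model to scalar inputs, so that $\Sigma$ reduces to a single shared variance, and then evaluates $p(C_1|x)=\sigma(wx+w_0)$. The data values and the function names `fit_shared_gaussian` and `posterior_c1` are assumptions for illustration.

```python
import math

def fit_shared_gaussian(x, t):
    """ML estimates (pi, mu1, mu2, variance) for scalar inputs, t_n in {0, 1}."""
    N = len(x)
    N1 = sum(t)
    N2 = N - N1
    pi = N1 / N
    mu1 = sum(tn * xn for tn, xn in zip(t, x)) / N1
    mu2 = sum((1 - tn) * xn for tn, xn in zip(t, x)) / N2
    S1 = sum((xn - mu1) ** 2 for tn, xn in zip(t, x) if tn == 1) / N1
    S2 = sum((xn - mu2) ** 2 for tn, xn in zip(t, x) if tn == 0) / N2
    var = (N1 / N) * S1 + (N2 / N) * S2   # weighted average of S1 and S2
    return pi, mu1, mu2, var

def posterior_c1(x_new, pi, mu1, mu2, var):
    """p(C1|x) = sigmoid(w x + w0) for the shared-variance Gaussian model."""
    w = (mu1 - mu2) / var
    w0 = -mu1 ** 2 / (2 * var) + mu2 ** 2 / (2 * var) + math.log(pi / (1 - pi))
    return 1.0 / (1.0 + math.exp(-(w * x_new + w0)))

# Hypothetical 1-D data: three points per class
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
t = [1, 1, 1, 0, 0, 0]

pi, mu1, mu2, var = fit_shared_gaussian(x, t)
p_mid = posterior_c1(4.5, pi, mu1, mu2, var)
print(pi, mu1, mu2, var, p_mid)
```

With equal priors and a shared variance, the posterior equals one half exactly midway between the two class means, which is what the last value shows.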
4.2.3 Discrete features
For binary features $x_i \in \{0, 1\}$ that are treated as independent given the class (the naive Bayes assumption), the class-conditional distributions take the form:
$$p(x|C_k)=\prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}$$
Substituting into $a_k = \ln\left(p(x|C_k)p(C_k)\right)$ gives the arguments of the softmax, which are linear functions of the inputs $x_i$:
$$a_k(x)=\sum_{i=1}^{D}\left[x_i \ln\mu_{ki}+(1-x_i)\ln(1-\mu_{ki})\right]+\ln p(C_k)$$
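A short sketch of this discrete case: compute $a_k(x)$ for each class and pass the values through a softmax to obtain posterior probabilities. The parameters $\mu_{ki}$, the priors, and the function names `log_joint` and `posterior` are hypothetical.

```python
import math

def log_joint(x, mu_k, prior_k):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    return sum(xi * math.log(m) + (1 - xi) * math.log(1 - m)
               for xi, m in zip(x, mu_k)) + math.log(prior_k)

def posterior(x, mu, prior):
    """Softmax over the a_k(x) values, shifted by the maximum for stability."""
    a = [log_joint(x, mu_k, p_k) for mu_k, p_k in zip(mu, prior)]
    shift = max(a)
    exps = [math.exp(ak - shift) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 2 classes, 3 binary features
mu = [[0.9, 0.2, 0.7], [0.1, 0.6, 0.4]]
prior = [0.5, 0.5]

post = posterior([1, 0, 1], mu, prior)
print(post)
```

For the input `[1, 0, 1]` the joint probabilities are proportional to $0.9 \times 0.8 \times 0.7$ and $0.1 \times 0.4 \times 0.4$, so the first class dominates.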
4.2.4 Exponential family
For members of the exponential family, the distribution of $x$ can be written in the form:
$$p(x|\lambda_k)=h(x)g(\lambda_k)\exp\{\lambda_k^{T}u(x)\}$$
If we let $u(x)=x$ and introduce a scaling parameter $s$, we obtain:
$$p(x | \lambda_{k}, s) = \frac{1}{s}h\left(\frac{1}{s}x\right)g(\lambda_{k})\exp\left\{\frac{1}{s}\lambda_{k}^{T}x\right\}$$
Consequently, for the two-class problem the posterior class probability is given by a logistic sigmoid acting on a linear function $a(x)$, in which the factors $\frac{1}{s}h\left(\frac{1}{s}x\right)$ shared by both classes cancel:
$$a(x)=\frac{1}{s}(\lambda_1-\lambda_2)^{T}x+\ln g(\lambda_1)-\ln g(\lambda_2)+\ln p(C_1)-\ln p(C_2)$$
And for the $K$-class problem:
$$a_{k}(x) = \frac{1}{s}\lambda_{k}^{T}x + \ln g(\lambda_{k}) + \ln p(C_{k})$$
4.3 Probabilistic Discriminative Models
4.3.1 Fixed basis functions
4.3.2 Logistic regression
For the two-class classification problem, the posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector:
$$p(C_{1} | \phi) = y(\phi) = \sigma(w^{T}\phi)$$
We now use maximum likelihood to determine the parameters of logistic regression. For a data set $\{\phi_n, t_n\}$ with $t_n \in \{0, 1\}$, the likelihood function is:
$$p(\mathbf{t} | w) = \prod_{n=1}^{N}y_{n}^{t_{n}}\{1-y_{n}\}^{1-t_{n}}$$
where $\mathbf{t}=(t_1,\ldots,t_N)^{T}$ and $y_n=p(C_1|\phi_n)$. We can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form:
$$E(w) = -\ln p(\mathbf{t} | w) = -\sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}$$
where $y_n=\sigma(a_n)$ and $a_n=w^{T}\phi_n$. Taking the gradient of the error function with respect to $w$, we obtain:
$$\nabla E(w) = \sum_{n=1}^{N}(y_{n} - t_{n})\phi_{n}$$
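The gradient formula can be checked numerically. The sketch below implements the cross-entropy error and its analytic gradient, then compares the gradient with central finite differences. The data set, the weight vector, and the step size `eps` are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    total = 0.0
    for phi, tn in zip(Phi, t):
        y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        total -= tn * math.log(y) + (1 - tn) * math.log(1 - y)
    return total

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n."""
    g = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        for i, pi in enumerate(phi):
            g[i] += (y - tn) * pi
    return g

# Hypothetical data: phi_n = (1, x_n) with binary targets
Phi = [[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, 2.0]]
t = [0, 1, 0, 1]
w = [0.1, -0.2]

analytic = gradient(w, Phi, t)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    diff = cross_entropy(w_plus, Phi, t) - cross_entropy(w_minus, Phi, t)
    numeric.append(diff / (2 * eps))
print(analytic, numeric)
```

The two gradients should agree to several decimal places. Note that the error contribution of each pattern is simply the prediction error $y_n - t_n$ times its feature vector.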
4.3.3 Iterative reweighted least squares
The Newton-Raphson update for minimizing a function $E(w)$ takes the form:
$$w^{\text{new}} = w^{\text{old}} - H^{-1}\nabla E(w)$$
where $H$ is the Hessian matrix of second derivatives of $E(w)$. Applying the Newton-Raphson update to the cross-entropy error function for the logistic regression model, the gradient and Hessian of the error function are given by:
$$\nabla E(w) = \sum_{n=1}^{N}(y_{n} - t_{n})\phi_{n} = \Phi^{T}(\mathbf{y} - \mathbf{t})$$
$$H = \nabla\nabla E(w) = \sum_{n=1}^{N}y_{n}(1-y_{n})\phi_{n}\phi_{n}^{T} = \Phi^{T}R\Phi$$
where $R$ is the $N \times N$ diagonal matrix with elements
$$R_{nn} = y_{n}(1-y_{n})$$
The Newton-Raphson update formula for the logistic regression model becomes:
$$w^{\text{new}} = w^{\text{old}} - (\Phi^{T}R\Phi)^{-1}\Phi^{T}(\mathbf{y} - \mathbf{t}) = (\Phi^{T}R\Phi)^{-1}\Phi^{T}Rz$$
$$z = \Phi w^{\text{old}} - R^{-1}(\mathbf{y} - \mathbf{t})$$
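A minimal sketch of these Newton-Raphson updates, restricted to two parameters so that the linear system can be solved in closed form. The data set is hypothetical and deliberately not linearly separable, so that a finite maximum likelihood solution exists. The function names and the fixed iteration count are also assumptions.

```python
import math

def gradient_and_hessian(w, Phi, t):
    """Return grad E = Phi^T (y - t) and H = Phi^T R Phi for two parameters."""
    grad = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for phi, tn in zip(Phi, t):
        y = 1.0 / (1.0 + math.exp(-(w[0] * phi[0] + w[1] * phi[1])))
        r = y * (1.0 - y)                      # diagonal element R_nn
        for i in range(2):
            grad[i] += (y - tn) * phi[i]
            for j in range(2):
                H[i][j] += r * phi[i] * phi[j]
    return grad, H

def irls(Phi, t, iterations=25):
    """Repeat w <- w - H^{-1} grad E, solving the 2x2 system in closed form."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        g, H = gradient_and_hessian(w, Phi, t)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        step = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
                (H[0][0] * g[1] - H[1][0] * g[0]) / det]
        w = [w[0] - step[0], w[1] - step[1]]
    return w

# Hypothetical, deliberately non-separable data: phi_n = (1, x_n)
Phi = [[1.0, x] for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]
t = [0, 0, 1, 0, 1, 1]

w_fit = irls(Phi, t)
grad_at_fit, _ = gradient_and_hessian(w_fit, Phi, t)
print(w_fit, grad_at_fit)
```

At convergence the gradient is numerically zero. Because $R$ depends on $w$, the weighted least-squares problem has to be re-solved at every iteration, which is where the name iterative reweighted least squares comes from.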
4.3.4 Multiclass logistic regression
For the multiclass problem, the posterior probabilities are given by:
$$p(C_k|\phi)=y_k(\phi)=\frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where $a_k=w_k^{T}\phi$.
Similarly, using a 1-of-$K$ coding for the targets $t_{nk}$, we can write down the likelihood function:
$$p(\mathbf{T}|w_1,\ldots,w_K)=\prod_{n=1}^{N}\prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}}=\prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
The cross-entropy error function for the multiclass classification problem is:
$$E(w_1,\ldots,w_K)=-\ln p(\mathbf{T}|w_1,\ldots,w_K)=-\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$$
The gradient and the blocks of the Hessian are:
$$\nabla_{w_j}E(w_1,\ldots,w_K)=\sum_{n=1}^{N}(y_{nj}-t_{nj})\phi_n$$
$$\nabla_{w_k}\nabla_{w_j}E(w_1,\ldots,w_K)=\sum_{n=1}^{N} y_{nk}(I_{kj}-y_{nj})\phi_n\phi_n^{T}$$
where $I_{kj}$ are the elements of the identity matrix. For $k=j$ this reduces to the two-class form $y_{nj}(1-y_{nj})\phi_n\phi_n^{T}$.
4.3.5 Probit regression
If the target is set to $t=1$ when $a \geq \theta$ and $t=0$ otherwise, where the threshold $\theta$ is drawn from a probability density $p(\theta)$, then the corresponding activation function is given by the cumulative distribution function:
$$f(a)=\int_{-\infty}^{a}p(\theta)\,d\theta$$
Suppose the density is a zero-mean, unit-variance Gaussian:
$$\Phi(a) = \int_{-\infty}^{a} N(\theta | 0, 1)\, d\theta$$
which is known as the probit function. Many numerical packages provide for the evaluation of a closely related function defined by:
$$\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^{2})\,d\theta$$
which is known as the erf function. It is related to the probit function by:
$$\Phi(a) = \frac{1}{2}\left\{ 1 + \mathrm{erf}\left(\frac{a}{\sqrt{2}}\right)\right\}$$
The generalized linear model based on a probit activation function is known as probit regression.
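As a quick check of the link between the two functions, the sketch below evaluates the Gaussian cumulative distribution through the identity $\Phi(a)=\frac{1}{2}\{1+\mathrm{erf}(a/\sqrt{2})\}$, using `math.erf` from the Python standard library, and compares it with direct numerical integration of the density. The integration limits and step count are arbitrary choices.

```python
import math

def probit_cdf(a):
    """Phi(a) for a standard Gaussian, via Phi(a) = (1 + erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def gaussian_pdf(theta):
    """Density N(theta | 0, 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def probit_cdf_numeric(a, lower=-10.0, steps=20000):
    """Trapezoidal integration of N(theta | 0, 1) from `lower` to a."""
    h = (a - lower) / steps
    total = 0.5 * (gaussian_pdf(lower) + gaussian_pdf(a))
    total += sum(gaussian_pdf(lower + i * h) for i in range(1, steps))
    return total * h

print(probit_cdf(0.0))
print(probit_cdf(1.0), probit_cdf_numeric(1.0))
```

The two evaluations at $a=1$ agree closely, and $\Phi(0)=0.5$ by symmetry of the Gaussian.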
4.3.6 Canonical link functions
If we assume that the conditional distribution of the target variable belongs to the exponential family, and we choose the activation function to be the inverse of the canonical link function, then the gradient of the error function takes the form:
$$\nabla E(w) = \frac{1}{s}\sum_{n=1}^{N}\{y_{n} - t_{n}\}\phi_{n}$$
For the Gaussian $s=\beta^{-1}$, whereas for the logistic model $s=1$.
4.4 The Laplace Approximation
The Laplace approximation aims to find a Gaussian approximation $q(z)$ to a probability density defined over a set of continuous variables. Suppose the distribution is defined by:
$$p(z) = \frac{1}{Z}f(z)$$
where $Z = \int f(z)\,dz$ is the normalization coefficient. Expanding $\ln f(z)$ to second order around a stationary point $z_0$, where $\nabla f(z)|_{z=z_0} = 0$:
$$\ln f(z) \simeq \ln f(z_{0}) - \frac{1}{2}(z-z_{0})^{T}A(z-z_{0})$$
where the $M \times M$ Hessian matrix is $A = -\nabla\nabla\ln f(z)|_{z=z_{0}}$. Taking the exponential of both sides we obtain:
$$f(z) \simeq f(z_{0})\exp \left\{ -\frac{1}{2}(z - z_{0})^{T}A(z-z_{0}) \right\}$$
Since $q(z)$ is proportional to this approximation to $f(z)$, normalizing gives:
$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}}\exp \left\{ -\frac{1}{2}(z-z_{0})^{T}A(z-z_{0}) \right\} = N(z | z_{0}, A^{-1})$$
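A one-dimensional sketch of this procedure: find a mode $z_0$ of $\ln f(z)$ with Newton's method, then take $A$ as the negative second derivative at the mode. The example density $f(z) = z^{k}e^{-z}$ with $k = 10$ is an assumption chosen because its mode ($z_0 = k$) and curvature ($A = 1/k$) are known in closed form. The finite-difference step `h` is likewise an arbitrary choice.

```python
import math

def log_f(z, k=10.0):
    """ln f(z) for the unnormalized density f(z) = z^k exp(-z), z > 0."""
    return k * math.log(z) - z

def laplace_1d(log_density, z_init, iterations=50, h=1e-4):
    """Return (z0, A): a mode of log_density and A = -d^2/dz^2 ln f at z0."""
    z = z_init
    for _ in range(iterations):
        # Central finite differences for the first and second derivatives
        d1 = (log_density(z + h) - log_density(z - h)) / (2 * h)
        d2 = (log_density(z + h) - 2 * log_density(z) + log_density(z - h)) / h ** 2
        z -= d1 / d2                     # Newton step towards the mode
    A = -(log_density(z + h) - 2 * log_density(z) + log_density(z - h)) / h ** 2
    return z, A

z0, A = laplace_1d(log_f, z_init=5.0)
variance = 1.0 / A                       # q(z) = N(z | z0, 1/A)
print(z0, A, variance)
```

For this density the sketch should recover $z_0 \approx 10$ and a variance of about $10$, matching the analytic values. The Gaussian $q(z)$ is only well defined when $A > 0$, that is, when $z_0$ is a local maximum.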
4.4.1 Model comparison and BIC
4.5 Bayesian Logistic Regression
4.5.1 Laplace approximation
We seek a Gaussian approximation to the posterior distribution. With a Gaussian prior $p(w) = N(w | w_0, S_0)$, the log of the posterior is:
$$\ln p(w | \mathbf{t}) = -\frac{1}{2}(w - w_{0})^{T}S_{0}^{-1}(w - w_{0}) + \sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n})\} + \text{const}$$
where $y_n = \sigma(w^{T}\phi_n)$. Maximizing this expression gives the MAP solution $w_{MAP}$, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log posterior:
$$S_{N}^{-1} = -\nabla\nabla \ln p(w | \mathbf{t}) = S_{0}^{-1} + \sum_{n=1}^{N}y_{n}(1-y_{n})\phi_{n}\phi_{n}^{T}$$
The Gaussian approximation to the posterior distribution therefore takes the form:
$$q(w) = N(w | w_{MAP}, S_{N})$$
4.5.2 Predictive distribution
The variational approximation to the predictive distribution for a new feature vector $\phi$ is:
$$p(C_{1} | \phi, \mathbf{t}) = \int \sigma(a)p(a)\,da = \int\sigma(a)N(a | \mu_{a}, \sigma_{a}^{2})\,da$$
where $a = w^{T}\phi$, and under the Gaussian approximation $q(w)$ its mean and variance are $\mu_a = w_{MAP}^{T}\phi$ and $\sigma_a^{2} = \phi^{T}S_N\phi$.
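The integral over $a$ has no closed form, but because it is one-dimensional it is easy to evaluate numerically. The sketch below uses a simple trapezoidal rule, which is an assumption made for illustration and not the analytic approximation developed in the book. The integration width and step count are also arbitrary.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def predictive(mu_a, var_a, steps=4000, width=8.0):
    """Trapezoidal estimate of the integral of sigmoid(a) N(a | mu_a, var_a) da."""
    sd = math.sqrt(var_a)
    lo, hi = mu_a - width * sd, mu_a + width * sd
    h = (hi - lo) / steps

    def integrand(a):
        gauss = math.exp(-0.5 * ((a - mu_a) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        return sigmoid(a) * gauss

    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    return total * h

print(predictive(0.0, 4.0))              # symmetric case
print(predictive(2.0, 4.0), sigmoid(2.0))
```

When $\mu_a = 0$ the result is one half by symmetry. For $\mu_a > 0$ the predictive probability is pulled towards one half relative to $\sigma(\mu_a)$, reflecting the uncertainty in $w$.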