
PRML Chapter 4 Linear Models for Classification

4.1 Discriminant Functions

4.1.1 Two classes

The simplest representation of a linear discriminant function can be expressed as:

$$y(x) = w^T x + w_0$$

The normal distance from the origin to the decision surface is given by

$$\frac{w^T x}{\|w\|} = -\frac{w_0}{\|w\|}$$

4.1.2 Multiple classes

Considering a single K-class discriminant comprising K linear functions of the form

$$y_k(x) = w_k^T x + w_{k0}$$

and we assign a point $x$ to class $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$. The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(x) = y_j(x)$, and corresponds to a $(D-1)$-dimensional hyperplane defined by

$$(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$$

4.1.3 Least squares for classification

Each class $C_k$ is described by its own linear model so that

$$y_k(x) = w_k^T x + w_{k0}$$

and we can group these together to get

$$y(x) = \tilde{W}^T \tilde{x}$$

The sum-of-squares error function can be written as

$$E_D(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{(\tilde{X}\tilde{W} - T)^T(\tilde{X}\tilde{W} - T)\right\}$$

Setting the derivative with respect to $\tilde{W}$ to zero, we obtain the solution for $\tilde{W}$:

$$\tilde{W} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T T = \tilde{X}^{\dagger} T$$

We then obtain the discriminant function in the form

$$y(x) = \tilde{W}^T\tilde{x} = T^T(\tilde{X}^{\dagger})^T\tilde{x}$$
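As a concrete illustration, here is a minimal NumPy sketch of the least-squares classifier above; the synthetic two-class data, the variable names, and the 1-of-K target coding are illustrative assumptions, not part of the text.

```python
import numpy as np

# illustrative two-class synthetic data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 3.0], 1.0, (50, 2))])
labels = np.repeat([0, 1], 50)

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented inputs x~ = (1, x)
T = np.eye(2)[labels]                                # 1-of-K target coding

# W~ = (X~^T X~)^{-1} X~^T T, i.e. the pseudo-inverse solution
W_tilde = np.linalg.pinv(X_tilde) @ T

# y(x) = W~^T x~ ; assign each point to the class with the largest output
y = X_tilde @ W_tilde
pred = y.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```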

4.1.4 Fisher’s linear discriminant

The mean vectors of the two classes are given by

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n\in C_1} x_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n\in C_2} x_n$$

The simplest measure of class separation in the projected space is the separation of the projected class means, so we might choose $w$ to maximize $m_2 - m_1 = w^T(\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = w^T\mathbf{m}_k$ is the mean of the projected data from class $C_k$.

The within-class variance of the transformed data from class $C_k$ is given by:

$$s_k^2 = \sum_{n\in C_k}(y_n - m_k)^2$$

where $y_n = w^T x_n$. The Fisher criterion is defined as the ratio of the between-class variance to the within-class variance:

$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$

The Fisher criterion can be rewritten in the following form, where $S_B$ is the between-class covariance matrix and $S_W$ is the within-class covariance matrix:

$$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

$$S_W = \sum_{n\in C_1}(x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n\in C_2}(x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T$$

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

Differentiating with respect to $w$, we find that $J(w)$ is maximized when:

$$(w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w$$

We then obtain:

$$w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
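A minimal sketch of the two-class Fisher direction, assuming NumPy; the function name and the synthetic data are illustrative, and $w$ is only determined up to a scale factor.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w ∝ S_W^{-1}(m2 - m1) for two classes of row vectors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrix S_W, as defined in the text
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)          # normalize: only the direction matters

# illustrative two-class data
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, (100, 2))
X2 = rng.normal([2.0, 2.0], 1.0, (100, 2))
print("projection direction:", fisher_direction(X1, X2))
```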

4.1.5 Relation to least squares

The Fisher solution can be obtained as a special case of least squares for the two-class problem. For class $C_1$ we take the targets to be $N/N_1$, and for class $C_2$ we take them to be $-N/N_2$, where $N_1$ and $N_2$ are the numbers of patterns in the two classes and $N$ is the total number of patterns.

The sum-of-squares error function can be written as:

$$E = \frac{1}{2}\sum_{n=1}^N (w^T x_n + w_0 - t_n)^2$$

Setting the derivatives of $E$ with respect to $w_0$ and $w$ to zero, we obtain:

$$\sum_{n=1}^N (w^T x_n + w_0 - t_n) = 0$$

$$\sum_{n=1}^N (w^T x_n + w_0 - t_n)\, x_n = 0$$

From these we obtain:

$$w_0 = -w^T \mathbf{m}$$

$$\left(S_W + \frac{N_1 N_2}{N} S_B\right) w = N(\mathbf{m}_1 - \mathbf{m}_2) \quad\Rightarrow\quad w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$

where $\mathbf{m} = \frac{1}{N}\sum_{n=1}^N x_n$ is the mean of the total data set.

4.1.6 Fisher’s discriminant for multiple classes

For the multi-class problem, analogously to the two-class case, we define the following quantities in the input space:

  • Mean vectors
    $$\mathbf{m}_k = \frac{1}{N_k}\sum_{n\in C_k} x_n, \qquad \mathbf{m} = \frac{1}{N}\sum_{n=1}^N x_n = \frac{1}{N}\sum_{k=1}^K N_k \mathbf{m}_k$$

  • Within-class covariance matrix
    $$S_W = \sum_{k=1}^K S_k, \qquad S_k = \sum_{n\in C_k}(x_n - \mathbf{m}_k)(x_n - \mathbf{m}_k)^T$$

  • Between-class covariance matrix
    $$S_B = \sum_{k=1}^K N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$$

  • Total covariance matrix
    $$S_T = \sum_{n=1}^N(x_n - \mathbf{m})(x_n - \mathbf{m})^T, \qquad S_T = S_W + S_B$$

Next we introduce $D' > 1$ linear 'features' $y_k = w_k^T x$, and the feature values can be grouped together as $y = W^T x$. We can define similar matrices in the projected $D'$-dimensional $y$-space:

$$s_W = \sum_{k=1}^K \sum_{n\in C_k}(y_n - \mu_k)(y_n - \mu_k)^T$$

$$s_B = \sum_{k=1}^K N_k(\mu_k - \mu)(\mu_k - \mu)^T$$

$$\mu_k = \frac{1}{N_k}\sum_{n\in C_k} y_n, \qquad \mu = \frac{1}{N}\sum_{k=1}^K N_k \mu_k$$

One of many possible choices of criterion is $J(W) = \mathrm{Tr}(s_W^{-1} s_B)$, which can be rewritten in terms of the original covariance matrices as $J(W) = \mathrm{Tr}\{(W S_W W^T)^{-1}(W S_B W^T)\}$; this is the quantity to be maximized with respect to the projection matrix $W$.
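A sketch of the multi-class Fisher projection under this trace criterion, assuming NumPy: taking the columns of the projection as the leading eigenvectors of $S_W^{-1}S_B$ is the standard way to maximize it, and at most $K-1$ directions are useful because $S_B$ has rank at most $K-1$. The function name and arguments are illustrative.

```python
import numpy as np

def fisher_multiclass(X, labels, d_prime):
    """Return a D x D' projection matrix whose columns are Fisher directions."""
    classes = np.unique(labels)
    m = X.mean(axis=0)                                   # overall mean
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                   # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)        # between-class scatter
    # leading eigenvectors of S_W^{-1} S_B maximize Tr{(W S_W W^T)^{-1}(W S_B W^T)}
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:d_prime]
    return eigvecs.real[:, order]

# usage: Y = X @ fisher_multiclass(X, labels, d_prime=2) gives the projected features
```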

4.1.7 The perceptron algorithm

A generalized linear model takes the form:

$$y(x) = f(w^T\phi(x))$$

and the nonlinear activation function is given by:

$$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$

An alternative error function, known as the perceptron criterion, is given by the following, where $\mathcal{M}$ denotes the set of misclassified patterns:

$$E_P(w) = -\sum_{n\in\mathcal{M}} w^T \phi_n t_n$$

The model can be trained with the stochastic gradient descent algorithm:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_P(w) = w^{(\tau)} + \eta\,\phi_n t_n$$
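A minimal perceptron sketch, assuming NumPy, targets $t_n \in \{-1, +1\}$, and the simple feature map $\phi(x) = (1, x)$; the function name, learning-rate default, and epoch limit are illustrative.

```python
import numpy as np

def perceptron(X, t, eta=1.0, n_epochs=100):
    """Stochastic gradient descent on the perceptron criterion E_P."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # phi_n = (1, x_n)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        n_errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:               # misclassified pattern
                w += eta * phi_n * t_n               # w <- w + eta * phi_n * t_n
                n_errors += 1
        if n_errors == 0:                            # converged (separable data)
            break
    return w
```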

4.2 Probabilistic Generative Models

For the two-class problem, the posterior probability of class $C_1$ can be written as:

$$p(C_1|x) = \sigma(a)$$

where $a = \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$ and $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid.

For the case of $K > 2$ classes, we have:

$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_j p(x|C_j)p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where $a_k = \ln\left(p(x|C_k)\,p(C_k)\right)$; the normalized exponential on the right-hand side is known as the softmax function.

4.2.1 Continuous inputs

Assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix. The density for class $C_k$ is then given by:

$$p(x|C_k) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right\}$$

Considering first the case of two classes, we have:

$$p(C_1|x) = \sigma(w^T x + w_0)$$

where $w = \Sigma^{-1}(\mu_1 - \mu_2)$ and $w_0 = -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}$.

For the general case of $K$ classes, we have:

$$a_k(x) = w_k^T x + w_{k0}$$

where $w_k = \Sigma^{-1}\mu_k$ and $w_{k0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln p(C_k)$.
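A sketch of the resulting posterior for the shared-covariance Gaussian model, assuming NumPy and already-estimated parameters `mus` (a sequence of class means), `Sigma`, and `priors`; these argument names are illustrative.

```python
import numpy as np

def posterior_shared_cov(x, mus, Sigma, priors):
    """Class posteriors p(C_k|x) as a softmax over a_k(x) = w_k^T x + w_k0."""
    Sigma_inv = np.linalg.inv(Sigma)
    a = np.array([mu @ Sigma_inv @ x                 # w_k^T x with w_k = Sigma^{-1} mu_k
                  - 0.5 * mu @ Sigma_inv @ mu        # -1/2 mu_k^T Sigma^{-1} mu_k
                  + np.log(p)                        # + ln p(C_k)
                  for mu, p in zip(mus, priors)])
    a -= a.max()                                     # subtract max for numerical stability
    return np.exp(a) / np.exp(a).sum()               # softmax over the a_k
```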

4.2.2 Maximum likelihood solution

Once we have specified the class-conditional densities, we can determine the parameter values and the prior class probabilities using maximum likelihood.

For the case of two classes, we have:

$$p(\mathbf{t}, \mathbf{X}|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N \left[\pi N(x_n|\mu_1,\Sigma)\right]^{t_n}\left[(1-\pi)N(x_n|\mu_2,\Sigma)\right]^{1-t_n}$$

The maximum likelihood solutions are $\pi = N_1/N$ together with:

$$\mu_1 = \frac{1}{N_1}\sum_{n=1}^N t_n x_n$$

$$\mu_2 = \frac{1}{N_2}\sum_{n=1}^N (1-t_n)\, x_n$$

$$\Sigma = \frac{N_1}{N}S_1 + \frac{N_2}{N}S_2$$

where $S_1 = \frac{1}{N_1}\sum_{n\in C_1}(x_n-\mu_1)(x_n-\mu_1)^T$ and $S_2 = \frac{1}{N_2}\sum_{n\in C_2}(x_n-\mu_2)(x_n-\mu_2)^T$.
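A sketch of these maximum-likelihood estimates for the two-class case, assuming NumPy and binary targets with $t_n = 1$ for class $C_1$; the function name is illustrative.

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for the two-class shared-covariance model."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                          # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1              # mu1 = (1/N1) sum_n t_n x_n
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2        # mu2 = (1/N2) sum_n (1-t_n) x_n
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2                # weighted average of S1 and S2
    return pi, mu1, mu2, Sigma
```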

The extension to the multi-class problem follows the same idea.

4.2.3 Discrete features

In this case, we assume binary feature values $x_i \in \{0, 1\}$ and adopt the naive Bayes assumption, giving class-conditional distributions of the form:

$$p(x|C_k) = \prod_{i=1}^D \mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}$$

Substituting into the general expression for $a_k(x)$ gives:

$$a_k(x) = \sum_{i=1}^D \left[x_i\ln\mu_{ki} + (1-x_i)\ln(1-\mu_{ki})\right] + \ln p(C_k)$$
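A sketch of evaluating $a_k(x)$ for the discrete-feature case, assuming NumPy, binary inputs, estimated parameters `mu[k, i]`, and class priors; the clipping constant is only there to avoid $\ln 0$ and is an illustrative choice.

```python
import numpy as np

def a_k_bernoulli(x, mu, priors, eps=1e-12):
    """a_k(x) for binary features x (shape D), mu (shape K x D), priors (shape K)."""
    mu = np.clip(mu, eps, 1.0 - eps)                 # avoid log(0)
    log_lik = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1)
    return log_lik + np.log(priors)                  # add ln p(C_k)

# the posteriors p(C_k|x) then follow from a softmax over the returned a_k values
```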

4.2.4 Exponential family

For members of the exponential family, the distribution of $x$ can be written in the form:
$$p(x|\lambda_k) = h(x)\,g(\lambda_k)\exp\{\lambda_k^T u(x)\}$$

If we let $u(x) = x$ and introduce a scaling parameter $s$, then:

$$p(x|\lambda_k, s) = \frac{1}{s}\,h\!\left(\frac{1}{s}x\right)g(\lambda_k)\exp\left\{\frac{1}{s}\lambda_k^T x\right\}$$

Consequently, for the two-class problem the posterior class probability is given by a logistic sigmoid acting on a linear function $a(x)$:

$$a(x) = (\lambda_1-\lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$$

And for the $K$-class problem:

$$a_k(x) = \frac{1}{s}\lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$$

4.3 Probabilistic Discriminative Models

4.3.1 Fixed basis functions

4.3.2 Logistic regression

For the two-class classification problem, the posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\phi$:

$$p(C_1|\phi) = y(\phi) = \sigma(w^T\phi)$$

We now use maximum likelihood to determine the parameters of the logistic regression model. The likelihood function is:

$$p(\mathbf{t}|w) = \prod_{n=1}^N y_n^{t_n}\{1-y_n\}^{1-t_n}$$

where $\mathbf{t} = (t_1,\ldots,t_N)^T$ and $y_n = p(C_1|\phi_n)$. We can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form:

$$E(w) = -\ln p(\mathbf{t}|w) = -\sum_{n=1}^N\left\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\right\}$$

where $y_n = \sigma(a_n)$ and $a_n = w^T\phi_n$. Taking the gradient of the error function with respect to $w$, we obtain:

$$\nabla E(w) = \sum_{n=1}^N (y_n - t_n)\,\phi_n$$
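A minimal gradient-descent sketch for two-class logistic regression, assuming NumPy, an $N \times M$ design matrix `Phi`, and binary targets `t`; the step size and iteration count are illustrative, and IRLS (next section) is usually preferred in practice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(Phi, t, eta=0.1, n_iters=1000):
    """Minimize the cross-entropy error using the gradient Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        w -= eta * Phi.T @ (y - t)        # gradient step
    return w
```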

4.3.3 Iterative reweighted least squares

The Newton-Raphson update, for minimizing a function $E(w)$, takes the form:

$$w^{\mathrm{new}} = w^{\mathrm{old}} - \mathbf{H}^{-1}\nabla E(w)$$

Applying the Newton-Raphson update to the cross-entropy error function for the logistic regression model, the gradient and Hessian of the error function are given by:

$$\nabla E(w) = \sum_{n=1}^N (y_n - t_n)\,\phi_n = \Phi^T(\mathbf{y} - \mathbf{t})$$

$$\mathbf{H} = \nabla\nabla E(w) = \sum_{n=1}^N y_n(1-y_n)\,\phi_n\phi_n^T = \Phi^T R\Phi$$

where $R$ is the $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1-y_n)$.

The Newton-Raphson update formula for the logistic regression model becomes:

$$w^{\mathrm{new}} = w^{\mathrm{old}} - (\Phi^T R\Phi)^{-1}\Phi^T(\mathbf{y}-\mathbf{t}) = (\Phi^T R\Phi)^{-1}\Phi^T R\,\mathbf{z}$$

$$\mathbf{z} = \Phi w^{\mathrm{old}} - R^{-1}(\mathbf{y}-\mathbf{t})$$
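An IRLS sketch based on the update above, assuming NumPy and the same `Phi`, `t` as in the previous block; the clipping of `y` is an illustrative safeguard that keeps $R$ invertible.

```python
import numpy as np

def irls(Phi, t, n_iters=20):
    """Iterative reweighted least squares for logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        y = np.clip(y, 1e-10, 1.0 - 1e-10)           # keep R invertible
        r = y * (1.0 - y)                            # diagonal of R
        z = Phi @ w - (y - t) / r                    # z = Phi w_old - R^{-1}(y - t)
        # w_new = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
    return w
```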

4.3.4 Multiclass logistic regression

For this problem, the posterior probabilities are given by:

$$p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where $a_k = w_k^T\phi$.

Similarly, we can write down the likelihood function:

$$p(\mathbf{T}|w_1,\ldots,w_K) = \prod_{n=1}^N\prod_{k=1}^K p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^N\prod_{k=1}^K y_{nk}^{t_{nk}}$$

Taking the negative logarithm gives the cross-entropy error function for the multiclass classification problem:

$$E(w_1,\ldots,w_K) = -\ln p(\mathbf{T}|w_1,\ldots,w_K) = -\sum_{n=1}^N\sum_{k=1}^K t_{nk}\ln y_{nk}$$

The gradient with respect to $w_j$ and the corresponding block of the Hessian are:

$$\nabla_{w_j}E(w_1,\ldots,w_K) = \sum_{n=1}^N (y_{nj}-t_{nj})\,\phi_n$$

$$\nabla_{w_k}\nabla_{w_j}E(w_1,\ldots,w_K) = \sum_{n=1}^N y_{nk}(I_{kj} - y_{nj})\,\phi_n\phi_n^T$$

where $I_{kj}$ are the elements of the identity matrix.
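A sketch of multiclass logistic regression trained by batch gradient descent on the gradient above, assuming NumPy, an $N \times M$ design matrix `Phi`, and 1-of-K coded targets `T`; names and defaults are illustrative, and a Newton scheme using the block Hessian would also work.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)             # subtract row max for stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_logreg_gd(Phi, T, eta=0.1, n_iters=1000):
    """Gradient descent with grad_{w_j} E = sum_n (y_nj - t_nj) phi_n."""
    W = np.zeros((Phi.shape[1], T.shape[1]))         # one column w_k per class
    for _ in range(n_iters):
        Y = softmax(Phi @ W)                         # y_nk
        W -= eta * Phi.T @ (Y - T)                   # all class gradients at once
    return W
```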

4.3.5 Probit regression

Suppose we evaluate $a = w^T\phi$ and assign the point to class $C_1$ whenever $a \geq \theta$, where the threshold $\theta$ is drawn from a probability density $p(\theta)$. The corresponding activation function is then given by the cumulative distribution function:

$$f(a) = \int_{-\infty}^a p(\theta)\,d\theta$$

As a specific example, suppose the density is given by a zero-mean, unit-variance Gaussian:

$$\Phi(a) = \int_{-\infty}^a N(\theta|0,1)\,d\theta$$

which is known as the probit function. Many numerical packages provide an evaluation of the closely related function defined by:

$$\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}}\int_0^a \exp(-\theta^2)\,d\theta$$

which is known as the erf function (or error function). It is related to the probit function by:

$$\Phi(a) = \frac{1}{2}\left\{1 + \mathrm{erf}\!\left(\frac{a}{\sqrt{2}}\right)\right\}$$

The generalized linear model based on a probit activation function is known as probit regression.
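A small sketch of the probit function written via erf, assuming SciPy's `scipy.special.erf`; it simply restates the relation above.

```python
import numpy as np
from scipy.special import erf

def probit(a):
    """Phi(a) = 0.5 * (1 + erf(a / sqrt(2))) for a standard Gaussian threshold."""
    return 0.5 * (1.0 + erf(a / np.sqrt(2.0)))

print(probit(0.0))   # 0.5, the mass of a zero-mean unit-variance Gaussian below 0
```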

4.3.6 Canonical link functions

If we assume that the conditional distribution of the target variable comes from the exponential family, and we choose the corresponding canonical link function as the link function (the link function being the inverse of the activation function), then we have:

$$\nabla E(w) = \frac{1}{s}\sum_{n=1}^N\{y_n - t_n\}\,\phi_n$$

For the Gaussian, $s = \beta^{-1}$, whereas for the logistic model $s = 1$.

4.4 The Laplace Approximation

The Laplace approximation aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose the distribution is defined by:

$$p(z) = \frac{1}{Z}f(z)$$

where $Z = \int f(z)\,dz$ is the normalization coefficient.

Expanding $\ln f(z)$ around a stationary point $z_0$:

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2}(z-z_0)^T A(z-z_0)$$

where $A = -\nabla\nabla\ln f(z)\big|_{z=z_0}$. Taking the exponential of both sides, we obtain:

$$f(z) \simeq f(z_0)\exp\left\{-\frac{1}{2}(z-z_0)^T A(z-z_0)\right\}$$

Since the approximating distribution $q(z)$ is proportional to $f(z)$, normalization gives:

$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(z-z_0)^T A(z-z_0)\right\} = N(z|z_0, A^{-1})$$
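A one-dimensional sketch of the Laplace approximation, assuming SciPy; the mode is found numerically, $A$ is taken from a finite-difference second derivative, and the example unnormalized density $f(z) = z^2 e^{-z}$ is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_approx(log_f, bounds=(1e-6, 20.0), h=1e-4):
    """Return (z0, var) so that q(z) = N(z | z0, var) approximates p(z) ∝ f(z)."""
    res = minimize_scalar(lambda z: -log_f(z), bounds=bounds, method="bounded")
    z0 = res.x                                                   # mode of f(z)
    # A = -d^2 ln f / dz^2 at z0, via central differences
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2
    return z0, 1.0 / A

log_f = lambda z: 2.0 * np.log(z) - z        # ln f(z) for f(z) = z^2 exp(-z)
z0, var = laplace_approx(log_f)
print("q(z) = N(z | %.3f, %.3f)" % (z0, var))   # mode z0 = 2, variance 1/A = 2
```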

4.4.1 Model comparison and BIC

4.5 Bayesian Logistic Regression

4.5.1 Laplace approximation

We seek a Gaussian approximation to the posterior distribution. With a Gaussian prior $p(w) = N(w|m_0, S_0)$, the log posterior takes the form:

$$\ln p(w|\mathbf{t}) = -\frac{1}{2}(w-m_0)^T S_0^{-1}(w-m_0) + \sum_{n=1}^N\left\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\right\} + \text{const}$$

The covariance is then given by the inverse of the matrix of second derivatives of the negative log likelihood:

$$S_N^{-1} = -\nabla\nabla\ln p(w|\mathbf{t}) = S_0^{-1} + \sum_{n=1}^N y_n(1-y_n)\,\phi_n\phi_n^T$$

The Gaussian approximation to the posterior distribution therefore takes the form:

$$q(w) = N(w|w_{\mathrm{MAP}}, S_N)$$

4.5.2 Predictive distribution

The predictive distribution for class $C_1$, obtained by marginalizing with respect to the Gaussian posterior approximation $q(w)$, can be written as:

$$p(C_1|\mathbf{t}) = \int\sigma(a)\,p(a)\,da = \int\sigma(a)\,N(a|\mu_a,\sigma_a^2)\,da$$
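A sketch of evaluating this predictive probability, assuming NumPy and a Gaussian posterior approximation $q(w) = N(w|w_{\mathrm{MAP}}, S_N)$; here the one-dimensional integral over $a = w^T\phi$ is estimated by simple Monte Carlo, which is one convenient alternative to the probit-based approximation discussed in the book.

```python
import numpy as np

def predictive_prob(phi, w_map, S_N, n_samples=100000, seed=0):
    """Approximate p(C1|phi, t) = ∫ sigma(a) N(a | mu_a, sigma_a^2) da by sampling."""
    rng = np.random.default_rng(seed)
    mu_a = w_map @ phi                         # mean of a = w^T phi under q(w)
    var_a = phi @ S_N @ phi                    # variance of a under q(w)
    a = rng.normal(mu_a, np.sqrt(var_a), n_samples)
    return float((1.0 / (1.0 + np.exp(-a))).mean())   # E[sigma(a)]
```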
