Linear Classification

Reference:

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

- Chapter 4 up to and including 4.3.2


In the linear regression models, the model prediction $y(\mathbf x,\mathbf w)$ was given by a linear function of the parameters $\mathbf w$. In the simplest case, the model is also linear in the input variables and therefore takes the form $y(\mathbf x)=\mathbf w^T\mathbf x+w_0$, so that $y$ is a real number.

For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range $(0,1)$. To achieve this, we consider a generalization of this model in which we transform the linear function of $\mathbf w$ using a nonlinear function $f(\cdot)$ so that
$$
g(\mathbf x)=f(\mathbf w^T\mathbf x+w_0)\tag{GLM}
$$
where $f(\cdot)$ is known as an activation function.

The decision surfaces correspond to $g(\mathbf x)=\mathrm{constant}$, so that $\mathbf w^T\mathbf x+w_0=\mathrm{constant}$, and hence the decision surfaces are linear functions of $\mathbf x$, even if the function $f(\cdot)$ is nonlinear. For this reason, the models described by $(\mathrm{GLM})$ are called generalized linear models.

Discriminant Functions (Nonprobabilistic Methods)

A discriminant is a function that takes an input vector $\mathbf x$ and assigns it to one of $K$ classes, denoted $\mathcal C_k$. In this case, probabilities play no role. In this chapter, we shall restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes.

Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that
$$
y(\mathbf x)=\mathbf w^T\mathbf x+w_0
$$
where $\mathbf w$ is called a weight vector, and $w_0$ is a bias. The negative of the bias is sometimes called a threshold. The decision rule is
$$
\begin{aligned}
&\text{assign }\mathbf x\text{ to class }\mathcal C_1 \quad \text{if }y(\mathbf x)\ge 0~(\text{or }\mathbf w^T\mathbf x\ge -w_0)\\
&\text{assign }\mathbf x\text{ to class }\mathcal C_2 \quad \text{if }y(\mathbf x)< 0~(\text{or }\mathbf w^T\mathbf x< -w_0)
\end{aligned}
$$
The corresponding decision boundary is therefore defined by the relation $y(\mathbf x)=0$, which corresponds to a $(D-1)$-dimensional hyperplane within the $D$-dimensional input space.

The weight vector $\mathbf w$ is orthogonal to every vector lying within the decision surface, and so $\mathbf w$ determines the orientation of the decision surface. The signed normal distance from the origin to the decision surface is given by $-w_0/\|\mathbf w\|$, so the bias parameter $w_0$ determines the location of the decision surface. (More generally, the normal distance between the parallel hyperplanes $\mathbf a^T\mathbf x+b_1=0$ and $\mathbf a^T\mathbf x+b_2=0$ is $|b_1-b_2|/\|\mathbf a\|$.)
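The decision rule and the geometry above can be sketched numerically. This is a minimal illustration, assuming NumPy; the function names and the example values of `w`, `w0`, and `x` are made up for the sketch and do not come from the text.

```python
import numpy as np

def discriminant(x, w, w0):
    """Evaluate y(x) = w^T x + w0."""
    return float(np.dot(w, x) + w0)

def classify(x, w, w0):
    """Return 1 for class C1 (y >= 0) and 2 for class C2 (y < 0)."""
    return 1 if discriminant(x, w, w0) >= 0 else 2

def signed_distance(x, w, w0):
    """Signed normal distance from x to the decision surface y(x) = 0."""
    return discriminant(x, w, w0) / np.linalg.norm(w)

# Illustrative values (assumptions, not from the text).
w = np.array([3.0, 4.0])
w0 = -5.0
x = np.array([3.0, 4.0])

label = classify(x, w, w0)
dist_x = signed_distance(x, w, w0)
# Distance from the origin to the surface, measured along w: -w0 / ||w||.
origin_to_surface = -w0 / np.linalg.norm(w)
```

Here `dist_x` is positive on the $\mathcal C_1$ side of the hyperplane and negative on the $\mathcal C_2$ side, while `origin_to_surface` reproduces the expression $-w_0/\|\mathbf w\|$.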


As with the linear regression models, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy ‘input’ value $x_0=1$ and then define $\tilde{\mathbf w}=(w_0,\mathbf w)$ and $\tilde{\mathbf x}=(x_0,\mathbf x)$ so that
$$
y(\mathbf x)=\tilde{\mathbf w}^T\tilde{\mathbf x}
$$

Multiple classes

Consider a single $K$-class discriminant comprising $K$ linear functions of the form
$$
y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}
$$
with the decision rule
$$
\text{assign }\mathbf x\text{ to class }\mathcal C_k \quad \text{if }y_k(\mathbf x)\ge y_j(\mathbf x),\ \forall j\ne k
$$
The decision boundary between class $\mathcal C_k$ and class $\mathcal C_j$ is therefore given by $y_k(\mathbf x)=y_j(\mathbf x)$ and hence corresponds to a $(D-1)$-dimensional hyperplane defined by
$$
(\mathbf w_k-\mathbf w_j)^T\mathbf x+(w_{k0}-w_{j0})=0
$$
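The $K$-class rule amounts to an argmax over the $K$ linear scores. The sketch below assumes NumPy and uses 0-based class indices; the weight matrix `W` (one row per class), the biases `w0`, and the test points are illustrative assumptions.

```python
import numpy as np

def classify_multiclass(x, W, w0):
    """Return (index of the largest y_k, all scores) for y_k = w_k^T x + w_k0.

    W has shape (K, D) with row k holding w_k; w0 has shape (K,).
    Class indices are 0-based here.
    """
    scores = W @ x + w0
    return int(np.argmax(scores)), scores

# Illustrative three-class problem in two dimensions (assumed values).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.zeros(3)

k, scores = classify_multiclass(np.array([2.0, 1.0]), W, w0)

# A point on the boundary between classes 0 and 1 satisfies
# (w_0 - w_1)^T x + (w_00 - w_10) = 0, so the two scores coincide.
_, boundary_scores = classify_multiclass(np.array([1.0, 1.0]), W, w0)
```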


Least Squares for Classification

Consider a general classification problem with $K$ classes, with a 1-of-$K$ binary coding scheme for the target vector $\mathbf t$. For instance, if we have $K=5$ classes, then a pattern from class 2 would be given the target vector $\mathbf t=(0,1,0,0,0)^T$.

Each class $\mathcal C_k$ is described by its own linear model so that
$$
y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}
$$
where $k=1,\cdots,K$. We can conveniently group these together using vector notation so that
$$
\mathbf y(\mathbf x)=\tilde{\mathbf W}^T\tilde{\mathbf x}
$$
where $\tilde{\mathbf W}$ is a matrix whose $k$th column comprises the $(D+1)$-dimensional vector $\tilde{\mathbf w}_k=(w_{k0},\mathbf w_k^T)^T$ and $\tilde{\mathbf x}$ is the corresponding augmented input vector $(1,\mathbf x^T)^T$ with a dummy input $x_0=1$. A new input $\mathbf x$ is then assigned to the class for which the output $y_k=\tilde{\mathbf w}_k^T\tilde{\mathbf x}$ is largest.

Then consider a training data set $\{\mathbf x_n,\mathbf t_n\}$ where $n=1,\cdots,N$, and define a matrix $\mathbf T$ whose $n$th row is the vector $\mathbf t_n^T$, together with a matrix $\tilde{\mathbf X}$ whose $n$th row is $\tilde{\mathbf x}_n^T$. The sum-of-squares error function can then be written as
$$
E_D(\tilde{\mathbf W})=\frac{1}{2}\mathrm{Tr}\left\{(\tilde{\mathbf X}\tilde{\mathbf W}-\mathbf T)^T(\tilde{\mathbf X}\tilde{\mathbf W}-\mathbf T)\right\}
$$
Setting the derivative with respect to $\tilde{\mathbf W}$ to zero, and rearranging, we then obtain the solution for $\tilde{\mathbf W}$ in the form
$$
\tilde{\mathbf W}=(\tilde{\mathbf X}^T\tilde{\mathbf X})^{-1}\tilde{\mathbf X}^T\mathbf T=\tilde{\mathbf X}^{\dagger}\mathbf T
$$
where $\tilde{\mathbf X}^{\dagger}$ is the pseudo-inverse of $\tilde{\mathbf X}$. We then obtain the discriminant function in the form
$$
\mathbf y(\mathbf x)=\tilde{\mathbf W}^T\tilde{\mathbf x}=\mathbf T^T(\tilde{\mathbf X}^{\dagger})^T\tilde{\mathbf x}
$$
However, recall that least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian. Therefore, least squares may suffer from some severe problems, such as a lack of robustness to outliers.
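The closed-form solution can be tried on a toy data set. The sketch below assumes NumPy, 0-based integer class labels, and a small hand-made data set; `fit_least_squares` and `predict` are hypothetical helper names.

```python
import numpy as np

def augment(X):
    """Prepend the dummy input x_0 = 1 to every row of X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X, labels, K):
    """Solve W_tilde = pinv(X_tilde) @ T with 1-of-K target rows."""
    T = np.eye(K)[labels]                  # N x K one-hot target matrix
    return np.linalg.pinv(augment(X)) @ T  # (D+1) x K weight matrix

def predict(W_tilde, X):
    """Assign each row of X to the class with the largest output y_k."""
    return np.argmax(augment(X) @ W_tilde, axis=1)

# Two well-separated clusters (illustrative data, 0-based labels).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

W_tilde = fit_least_squares(X, labels, K=2)
pred = predict(W_tilde, X)
outputs = augment(X) @ W_tilde
```

The rows of `outputs` sum to one because the 1-of-$K$ targets do, but individual entries are not constrained to $(0,1)$, which is one reason they cannot be read as probabilities.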


Fisher’s linear discriminant

Another way to view a linear classification model without probabilistic interpretation is in terms of dimensionality reduction. Consider the case of two classes, and suppose we take the $D$-dimensional input vector $\mathbf x$ and project it down to one dimension using
$$
y=\mathbf w^T\mathbf x
$$
If we place a threshold on $y$ and classify $y\ge -w_0$ as class $\mathcal C_1$, and otherwise class $\mathcal C_2$, then we obtain our standard linear classifier discussed in the previous section.

In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector w \mathbf w w, we can select a projection that maximizes the class separation.

To begin with, consider a two-class problem in which there are $N_1$ points of class $\mathcal C_1$ and $N_2$ points of class $\mathcal C_2$, so that the mean vectors of the two classes are given by
$$
\mathbf m_1=\frac{1}{N_1}\sum_{n\in\mathcal C_1}\mathbf x_n\qquad\mathbf m_2=\frac{1}{N_2}\sum_{n\in\mathcal C_2}\mathbf x_n
$$
The simplest measure of the separation of the classes, when projected onto $\mathbf w$, is the separation of the projected class means. This suggests that we might choose $\mathbf w$ so as to maximize
$$
m_2-m_1=\mathbf w^T(\mathbf m_2-\mathbf m_1)
$$
where $m_k=\mathbf w^T\mathbf m_k$ is the mean of the projected data from class $\mathcal C_k$. Since this expression can be made arbitrarily large simply by scaling $\mathbf w$, the maximization is carried out subject to $\|\mathbf w\|=1$.

However, it can happen that the two classes are well separated in the original two-dimensional space $(x_1,x_2)$ but have considerable overlap when projected onto the line joining their means, as is shown in the left figure below.

[Figure omitted]

The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

The within-class variance of the transformed data from class $\mathcal C_k$ is given by
$$
s_k^2=\sum_{n\in\mathcal C_k}(y_n-m_k)^2
$$
where $y_n=\mathbf w^T\mathbf x_n$. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by
$$
J(\mathbf w)=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}=\frac{\mathbf w^T\mathbf S_\mathrm{B}\mathbf w}{\mathbf w^T\mathbf S_\mathrm{W}\mathbf w}
$$
where $\mathbf S_\mathrm{B}$ is the between-class covariance matrix and is given by
$$
\mathbf S_\mathrm{B}=(\mathbf m_2-\mathbf m_1)(\mathbf m_2-\mathbf m_1)^T
$$
and $\mathbf S_\mathrm{W}$ is the total within-class covariance matrix, given by
$$
\mathbf S_\mathrm{W}=\sum_{n\in\mathcal C_1}(\mathbf x_n-\mathbf m_1)(\mathbf x_n-\mathbf m_1)^T+\sum_{n\in\mathcal C_2}(\mathbf x_n-\mathbf m_2)(\mathbf x_n-\mathbf m_2)^T
$$
Differentiating $J(\mathbf w)$ with respect to $\mathbf w$, we find that $J(\mathbf w)$ is maximized when
$$
(\mathbf w^T\mathbf S_\mathrm{B}\mathbf w)\,\mathbf S_\mathrm{W}\mathbf w=(\mathbf w^T\mathbf S_\mathrm{W}\mathbf w)\,\mathbf S_\mathrm{B}\mathbf w
$$
From the expression for $\mathbf S_\mathrm{B}$, we see that $\mathbf S_\mathrm{B}\mathbf w$ is always in the direction of $(\mathbf m_2-\mathbf m_1)$. Furthermore, we do not care about the magnitude of $\mathbf w$, only its direction, and so we can drop the scalar factors $(\mathbf w^T\mathbf S_\mathrm{B}\mathbf w)$ and $(\mathbf w^T\mathbf S_\mathrm{W}\mathbf w)$. Multiplying both sides by $\mathbf S_\mathrm{W}^{-1}$, we therefore obtain
$$
\mathbf w\propto\mathbf S_\mathrm{W}^{-1}(\mathbf m_2-\mathbf m_1)
$$
We have now obtained a specific choice of direction for projection of the data down to one dimension. The projected data can subsequently be used to construct a discriminant by choosing a threshold on $y$ and classifying $y\ge -w_0$ as class $\mathcal C_1$, and otherwise class $\mathcal C_2$.
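As a rough numerical check of the Fisher direction, the sketch below compares the criterion $J(\mathbf w)$ for $\mathbf w\propto\mathbf S_\mathrm{W}^{-1}(\mathbf m_2-\mathbf m_1)$ with its value for the line joining the class means. It assumes NumPy; the synthetic Gaussian data, the seed, and the function names are illustrative assumptions.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2), computed on the projected data."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y2.mean() - y1.mean()) ** 2
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

# Synthetic elongated classes (assumed data, for illustration only).
rng = np.random.default_rng(0)
cov = np.array([[4.0, 3.5], [3.5, 4.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fisher = fisher_direction(X1, X2)
mean_diff = X2.mean(axis=0) - X1.mean(axis=0)
w_means = mean_diff / np.linalg.norm(mean_diff)

J_fisher = fisher_criterion(w_fisher, X1, X2)
J_means = fisher_criterion(w_means, X1, X2)
```

Because `w_fisher` points from $\mathbf m_1$ towards $\mathbf m_2$ in the projected space, class $\mathcal C_2$ projects to larger values of $y$ in this sketch; which side of the threshold is labelled $\mathcal C_1$ therefore depends on the sign chosen for $\mathbf w$.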

The perceptron algorithm

The perceptron corresponds to a two-class model in which the input vector $\mathbf x$ is first transformed using a fixed nonlinear transformation to give a feature vector $\boldsymbol\phi(\mathbf x)$, and this is then used to construct a generalized linear model of the form
$$
y(\mathbf x)=f(\mathbf w^T\boldsymbol\phi(\mathbf x))
$$
where the nonlinear activation function $f(\cdot)$ is given by a step function of the form
$$
f(a)=\begin{cases}+1, & a\ge 0\\ -1, & a<0\end{cases}
$$
We use target values $t=+1$ for class $\mathcal C_1$ and $t=-1$ for class $\mathcal C_2$, which matches the choice of activation function.

Then let us see how to define the error function. We are seeking a weight vector $\mathbf w$ such that patterns $\mathbf x_n$ in class $\mathcal C_1$ will have $\mathbf w^T\boldsymbol\phi(\mathbf x_n)>0$, whereas patterns in class $\mathcal C_2$ have $\mathbf w^T\boldsymbol\phi(\mathbf x_n)<0$. Using the $t\in\{-1,+1\}$ target coding scheme it follows that we would like all patterns to satisfy $\mathbf w^T\boldsymbol\phi(\mathbf x_n)t_n>0$. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern $\mathbf x_n$ it tries to maximize the quantity $\mathbf w^T\boldsymbol\phi(\mathbf x_n)t_n$, or equivalently minimize $-\mathbf w^T\boldsymbol\phi(\mathbf x_n)t_n$. The perceptron criterion is therefore given by
$$
E_P(\mathbf w)=-\sum_{n\in\mathcal M}\mathbf w^T\boldsymbol\phi(\mathbf x_n)t_n
$$
where $\mathcal M$ denotes the set of all misclassified patterns.

We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector $\mathbf w$ is then given by
$$
\mathbf w^{(\tau+1)}=\mathbf w^{(\tau)}-\eta\nabla E_{\mathrm P}(\mathbf w)=\mathbf w^{(\tau)}+\eta\boldsymbol\phi_n t_n
$$
where $\boldsymbol\phi_n=\boldsymbol\phi(\mathbf x_n)$ for a misclassified pattern $\mathbf x_n$, $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm. Because the perceptron function $y(\mathbf x,\mathbf w)$ is unchanged if we multiply $\mathbf w$ by a positive constant, we can set the learning rate parameter $\eta$ equal to $1$ without loss of generality.

The perceptron learning algorithm has a simple interpretation, as follows. If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal C_1$ we add the vector $\boldsymbol\phi(\mathbf x_n)$ onto the current estimate of weight vector $\mathbf w$ while for class $\mathcal C_2$ we subtract the vector $\boldsymbol\phi(\mathbf x_n)$ from $\mathbf w$.
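The update rule can be sketched as a short training loop with $\eta=1$. The sketch assumes NumPy, a zero initial weight vector, feature vectors that already include a constant bias feature, and a cap on the number of passes; these choices, the function name, and the toy data are assumptions made for illustration.

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    """Cycle through the patterns, updating w on each misclassified one.

    Phi: (N, M) matrix whose rows are feature vectors phi(x_n).
    t:   (N,) targets in {-1, +1}.
    A pattern counts as misclassified when w^T phi_n t_n <= 0, so points
    lying exactly on the boundary also trigger an update (an assumption
    that lets training start from w = 0).
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:
                w = w + phi_n * t_n  # add phi for C1, subtract for C2
                updated = True
        if not updated:              # a full pass with no mistakes
            break
    return w

# Linearly separable toy data; the first feature is the constant 1.
Phi = np.array([[1.0, 2.0, 2.0],
                [1.0, 3.0, 1.0],
                [1.0, -1.0, -2.0],
                [1.0, -2.0, -1.0]])
t = np.array([1, 1, -1, -1])

w = train_perceptron(Phi, t)
margins = (Phi @ w) * t
```

For linearly separable data such as this, the loop stops once every entry of `margins` is positive; for data that are not separable it simply runs until `max_epochs` is reached.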


Probabilistic Generative Models

We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. Here we shall adopt a generative approach in which we model the class-conditional densities $p(\mathbf x|\mathcal C_k)$, as well as the class priors $p(\mathcal C_k)$, and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem.

Consider first of all the case of two classes. The posterior probability for class $\mathcal C_1$ can be written as
$$
\begin{aligned}
p(\mathcal C_1|\mathbf x)&=\frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)+p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}\\
&=\frac{1}{1+\exp(-a)}=\sigma(a)
\end{aligned}
$$
where we have defined
$$
a=\ln\frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}
$$
and $\sigma(a)$ is the logistic sigmoid function defined by
$$
\sigma(a)=\frac{1}{1+\exp(-a)}
$$


The logistic sigmoid satisfies the following symmetry property
$$
\sigma(-a)=1-\sigma(a)
$$
and the derivative of $\sigma(a)$ is
$$
\sigma'(a)=\frac{\exp(-a)}{(1+\exp(-a))^2}=\frac{1}{1+\exp(-a)}\left(1-\frac{1}{1+\exp(-a)}\right)=\sigma(a)(1-\sigma(a))
$$
The inverse of the logistic sigmoid is given by
$$
a=\ln\left(\frac{\sigma}{1-\sigma}\right)
$$
and is known as the logit function. It represents the log of the ratio of probabilities $\ln[p(\mathcal C_1|\mathbf x)/p(\mathcal C_2|\mathbf x)]$ for the two classes, also known as the log odds.
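These three properties (symmetry, the derivative identity, and the logit as inverse) are easy to check numerically. The following is a small sanity check, assuming NumPy; the grid of test points and the finite-difference step `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the logistic sigmoid: ln(s / (1 - s))."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)

# Symmetry: sigma(-a) = 1 - sigma(a).
symmetry_gap = np.max(np.abs(sigmoid(-a) - (1.0 - s)))

# Derivative: compare a central difference with sigma(a) * (1 - sigma(a)).
h = 1e-6
numeric_derivative = (sigmoid(a + h) - sigmoid(a - h)) / (2.0 * h)
derivative_gap = np.max(np.abs(numeric_derivative - s * (1.0 - s)))

# Inverse: logit(sigma(a)) should recover a.
inverse_gap = np.max(np.abs(logit(s) - a))
```

All three gaps should be at the level of floating-point or finite-difference error.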

Note that we have simply rewritten the posterior probabilities in an equivalent form, and so the appearance of the logistic sigmoid may seem rather vacuous. However, it will have significance provided $a(\mathbf x)$ takes a simple functional form. We shall shortly consider situations in which $a(\mathbf x)$ is a linear function of $\mathbf x$, in which case the posterior probability is governed by a generalized linear model.

For the case of $K>2$ classes, we have
$$
p(\mathcal C_k|\mathbf x)=\frac{p(\mathbf x|\mathcal C_k)p(\mathcal C_k)}{\sum_j p(\mathbf x|\mathcal C_j)p(\mathcal C_j)}=\frac{\exp(a_k)}{\sum_j\exp(a_j)}
$$
which is known as the softmax function. Here the quantities $a_k$ are defined by
$$
a_k=\ln\left(p(\mathbf x|\mathcal C_k)p(\mathcal C_k)\right)
$$
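The equivalence between Bayes' theorem and the softmax of the $a_k$ can be illustrated with made-up numbers. The sketch assumes NumPy; the likelihood and prior values are invented for the example, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than something taken from the text.

```python
import numpy as np

def softmax(a):
    """exp(a_k) / sum_j exp(a_j), shifted by max(a) to avoid overflow."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - np.max(a))  # the shift cancels in the ratio
    return e / e.sum()

# Invented values of p(x | C_k) at one point x, and priors p(C_k).
likelihoods = np.array([0.2, 0.05, 0.1])
priors = np.array([0.5, 0.3, 0.2])

a = np.log(likelihoods * priors)   # a_k = ln( p(x|C_k) p(C_k) )
posterior_softmax = softmax(a)

joint = likelihoods * priors
posterior_bayes = joint / joint.sum()

# With K = 2 the softmax reduces to the logistic sigmoid of a_1 - a_2.
two_class = softmax(a[:2])[0]
sigmoid_form = 1.0 / (1.0 + np.exp(-(a[0] - a[1])))
```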

Gaussian class-conditional densities

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. (See Classifiers Based on Bayes Decision Theory: The Bayesian Classifier for Normally Distributed Classes)

The density for class $\mathcal C_k$ is given by
$$
p(\mathbf x|\mathcal C_k)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf x-\boldsymbol\mu_k)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu_k)\right\}
$$
Considering first the case of two classes, we have
$$
p(\mathcal C_1|\mathbf x)=\sigma(\mathbf w^T\mathbf x+w_0)
$$
where we have defined
$$
\begin{aligned}
\mathbf w&=\boldsymbol\Sigma^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_2)\\
w_0&=-\frac{1}{2}\boldsymbol\mu_1^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_1+\frac{1}{2}\boldsymbol\mu_2^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_2+\ln\frac{p(\mathcal C_1)}{p(\mathcal C_2)}
\end{aligned}
$$
We see that the quadratic terms in $\mathbf x$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices), leading to a linear function of $\mathbf x$ in the argument of the logistic sigmoid.

The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal C_k|\mathbf x)$ are constant and so will be given by linear functions of $\mathbf x$, and therefore the decision boundaries are linear in input space. The prior probabilities $p(\mathcal C_k)$ enter only through the bias parameter $w_0$, so that changes in the priors have the effect of making parallel shifts of the decision boundary and more generally of the parallel contours of constant posterior probability.
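To see the cancellation of the quadratic terms concretely, the sketch below evaluates the posterior in two ways: directly from Bayes' theorem with Gaussian densities, and as $\sigma(\mathbf w^T\mathbf x+w_0)$. It assumes NumPy; the means, covariance, prior, and test point are arbitrary illustrative values, and the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def linear_params(mu1, mu2, Sigma, prior1):
    """Weight vector w and bias w0 for two classes sharing Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return w, w0

# Arbitrary illustrative parameters.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
prior1 = 0.3
x = np.array([0.4, -0.7])

w, w0 = linear_params(mu1, mu2, Sigma, prior1)
posterior_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

joint1 = prior1 * gaussian_pdf(x, mu1, Sigma)
joint2 = (1.0 - prior1) * gaussian_pdf(x, mu2, Sigma)
posterior_bayes = joint1 / (joint1 + joint2)
```

Changing `prior1` alters only `w0`, which matches the remark that the priors shift the decision boundary parallel to itself.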

For the general case of $K$ classes, we have
$$
a_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}
$$
where we have defined
$$
\begin{aligned}
\mathbf w_k&=\boldsymbol\Sigma^{-1}\boldsymbol\mu_k\\
w_{k0}&=-\frac{1}{2}\boldsymbol\mu_k^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_k+\ln p(\mathcal C_k)
\end{aligned}
$$
We see that the $a_k(\mathbf x)$ are again linear functions of $\mathbf x$ as a consequence of the cancellation of the quadratic terms due to the shared covariances. The resulting decision boundaries, corresponding to the minimum misclassification rate, will occur when two of the posterior probabilities (the two largest) are equal, and so will be defined by linear functions of $\mathbf x$, and so again we have a generalized linear model.

Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf x|\mathcal C_k)$, we can then determine the values of the parameters, together with the prior class probabilities $p(\mathcal C_k)$, using maximum likelihood.

Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set $\{\mathbf x_n,t_n\}$ where $n=1,\cdots,N$. Here $t_n=1$ denotes class $\mathcal C_1$ and $t_n=0$ denotes class $\mathcal C_2$. We denote the prior class probability $p(\mathcal C_1)=\pi$, so that $p(\mathcal C_2)=1-\pi$. For a data point $\mathbf x_n$ from class $\mathcal C_1$, we have $t_n=1$ and hence
$$
p(\mathbf x_n,\mathcal C_1)=p(\mathcal C_1)p(\mathbf x_n|\mathcal C_1)=\pi\,\mathcal N(\mathbf x_n|\boldsymbol\mu_1,\boldsymbol\Sigma)
$$
Similarly for class $\mathcal C_2$, we have $t_n=0$ and hence
$$
p(\mathbf x_n,\mathcal C_2)=p(\mathcal C_2)p(\mathbf x_n|\mathcal C_2)=(1-\pi)\,\mathcal N(\mathbf x_n|\boldsymbol\mu_2,\boldsymbol\Sigma)
$$
Thus the likelihood function is given by
$$
p(\mathbf t\mid\pi,\boldsymbol\mu_1,\boldsymbol\mu_2,\boldsymbol\Sigma)=\prod_{n=1}^{N}\left[\pi\,\mathcal N(\mathbf x_n|\boldsymbol\mu_1,\boldsymbol\Sigma)\right]^{t_n}\left[(1-\pi)\,\mathcal N(\mathbf x_n|\boldsymbol\mu_2,\boldsymbol\Sigma)\right]^{1-t_n}
$$
where $\mathbf t=(t_1,\ldots,t_N)^T$. As usual, it is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$. The terms in the log likelihood function that depend on $\pi$ are
$$
\sum_{n=1}^{N}\left\{t_n\ln\pi+(1-t_n)\ln(1-\pi)\right\}
$$
Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain
$$
\pi=\frac{1}{N}\sum_{n=1}^{N}t_n=\frac{N_1}{N}=\frac{N_1}{N_1+N_2}
$$
where $N_1$ denotes the total number of data points in class $\mathcal C_1$, and $N_2$ denotes the total number of data points in class $\mathcal C_2$. Thus the maximum likelihood estimate for $\pi$ is simply the fraction of points in class $\mathcal C_1$, as expected. This result is easily generalized to the multiclass case, where again the maximum likelihood estimate of the prior probability associated with class $\mathcal C_k$ is given by the fraction of the training set points assigned to that class.

Now consider the maximization with respect to $\boldsymbol\mu_1$. Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol\mu_1$, giving
$$
\sum_{n=1}^{N}t_n\ln\mathcal N(\mathbf x_n|\boldsymbol\mu_1,\boldsymbol\Sigma)=-\frac{1}{2}\sum_{n=1}^{N}t_n(\mathbf x_n-\boldsymbol\mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x_n-\boldsymbol\mu_1)+\text{const.}
$$
Setting the derivative with respect to $\boldsymbol\mu_1$ to zero and rearranging, we obtain
$$
\boldsymbol\mu_1=\frac{1}{N_1}\sum_{n=1}^{N}t_n\mathbf x_n
$$
which is simply the mean of all the input vectors $\mathbf x_n$ assigned to class $\mathcal C_1$. By a similar argument, the corresponding result for $\boldsymbol\mu_2$ is given by
$$
\boldsymbol\mu_2=\frac{1}{N_2}\sum_{n=1}^{N}(1-t_n)\mathbf x_n
$$
which again is the mean of all the input vectors $\mathbf x_n$ assigned to class $\mathcal C_2$.

Finally, consider the maximum likelihood solution for the shared covariance matrix $\boldsymbol\Sigma$. Picking out the terms in the log likelihood function that depend on $\boldsymbol\Sigma$, we have
$$
\begin{aligned}
&-\frac{1}{2}\sum_{n=1}^{N}t_n\ln|\boldsymbol\Sigma|-\frac{1}{2}\sum_{n=1}^{N}t_n(\mathbf x_n-\boldsymbol\mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x_n-\boldsymbol\mu_1)\\
&-\frac{1}{2}\sum_{n=1}^{N}(1-t_n)\ln|\boldsymbol\Sigma|-\frac{1}{2}\sum_{n=1}^{N}(1-t_n)(\mathbf x_n-\boldsymbol\mu_2)^T\boldsymbol\Sigma^{-1}(\mathbf x_n-\boldsymbol\mu_2)\\
&=-\frac{N}{2}\ln|\boldsymbol\Sigma|-\frac{N}{2}\operatorname{Tr}\left\{\boldsymbol\Sigma^{-1}\mathbf S\right\}
\end{aligned}
$$
where we have defined
$$
\begin{aligned}
\mathbf S&=\frac{N_1}{N}\mathbf S_1+\frac{N_2}{N}\mathbf S_2\\
\mathbf S_1&=\frac{1}{N_1}\sum_{n\in\mathcal C_1}(\mathbf x_n-\boldsymbol\mu_1)(\mathbf x_n-\boldsymbol\mu_1)^T\\
\mathbf S_2&=\frac{1}{N_2}\sum_{n\in\mathcal C_2}(\mathbf x_n-\boldsymbol\mu_2)(\mathbf x_n-\boldsymbol\mu_2)^T
\end{aligned}
$$
Setting the differential with respect to $\boldsymbol\Sigma$ to zero,
$$
d\left(\ln|\boldsymbol\Sigma|+\operatorname{Tr}\left\{\boldsymbol\Sigma^{-1}\mathbf S\right\}\right)=\operatorname{Tr}(\boldsymbol\Sigma^{-1}d\boldsymbol\Sigma)-\operatorname{Tr}(\boldsymbol\Sigma^{-1}\mathbf S\boldsymbol\Sigma^{-1}d\boldsymbol\Sigma)=0\Longrightarrow\boldsymbol\Sigma=\mathbf S
$$
We see that $\boldsymbol\Sigma=\mathbf S$, which represents a weighted average of the covariance matrices associated with each of the two classes separately.
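The maximum likelihood estimates for $\pi$, $\boldsymbol\mu_1$, $\boldsymbol\mu_2$ and $\boldsymbol\Sigma$ translate almost line by line into code. The sketch below assumes NumPy and a tiny hand-made data set chosen so that the results can be checked by hand; the function name is hypothetical.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for two classes with shared Sigma.

    X: (N, D) inputs; t: (N,) targets with 1 for class C1 and 0 for C2.
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                     # fraction of points in C1
    mu1 = (t[:, None] * X).sum(axis=0) / N1         # mean of C1 inputs
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2  # mean of C2 inputs
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2           # weighted average S
    return pi, mu1, mu2, Sigma

# Hand-made data: four points in C1, two points in C2.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [5.0, 5.0], [7.0, 5.0]])
t = np.array([1, 1, 1, 1, 0, 0])

pi, mu1, mu2, Sigma = fit_shared_covariance_gaussians(X, t)
```

For this data set the per-class covariances are $\mathbf S_1=\mathbf I$ and $\mathbf S_2=\mathrm{diag}(1,0)$, so the shared estimate is $\tfrac{4}{6}\mathbf S_1+\tfrac{2}{6}\mathbf S_2=\mathrm{diag}(1,\tfrac{2}{3})$.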

Probabilistic Discriminative Models

We have seen how to model the class-conditional densities $p(\mathbf x|\mathcal C_k)$, as well as the class priors $p(\mathcal C_k)$, and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem. For Gaussian class-conditional densities with a shared covariance matrix, the posterior probability can be written as a logistic sigmoid acting on a linear function of $\mathbf x$. An alternative approach is to assume directly that the posterior takes the form of a generalized linear model, without the Gaussian class-conditional assumption.

So far, we have considered classification models that work directly with the original input vector $\mathbf x$. However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions $\boldsymbol\phi(\mathbf x)$ (as we did in linear regression). The resulting decision boundaries will be linear in the feature space $\boldsymbol\phi$, and these correspond to nonlinear decision boundaries in the original $\mathbf x$ space, as illustrated in Figure 4.12.

[Figure 4.12 omitted]

Logistic regression

We restrict the posterior probability of class $\mathcal C_1$ to be a logistic sigmoid acting on a linear function of the feature vector $\boldsymbol\phi$ so that
$$
p(\mathcal C_1|\boldsymbol\phi)=y(\boldsymbol\phi)=\sigma(\mathbf w^T\boldsymbol\phi)
$$
and $p(\mathcal C_2|\boldsymbol\phi)=1-p(\mathcal C_1|\boldsymbol\phi)$. This model is known as logistic regression, although it is a model for classification.

For an $M$-dimensional feature space $\boldsymbol\phi$, this model has $M$ adjustable parameters. By contrast, if we had fitted Gaussian class-conditional densities using maximum likelihood, we would have (for two-class classification) $2M$ parameters for the means and $M(M+1)/2$ parameters for the (shared) covariance matrix. Together with the class prior $p(\mathcal C_1)$, this gives a total of $M(M+5)/2+1$ parameters, which grows quadratically with $M$, in contrast to the linear dependence on $M$ of the number of parameters in logistic regression.
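As a rough worked example of this count (the value $M=100$ is chosen here purely for illustration), the generative model requires
$$
\underbrace{2M}_{\text{means}}+\underbrace{\frac{M(M+1)}{2}}_{\text{covariance}}+\underbrace{1}_{\text{prior}}=200+5050+1=5251=\frac{M(M+5)}{2}+1
$$
parameters, compared with only $M=100$ for logistic regression.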

We now use maximum likelihood to determine the parameters of the logistic regression model.

For a data set $\{\boldsymbol\phi_n,t_n\}$, where $t_n\in\{0,1\}$ and $\boldsymbol\phi_n=\boldsymbol\phi(\mathbf x_n)$, the likelihood function can be written
$$
p(\mathbf t\mid\mathbf w)=\prod_{n=1}^{N}y_n^{t_n}\{1-y_n\}^{1-t_n}
$$
where $\mathbf t=(t_1,\ldots,t_N)^T$ and $y_n=p(\mathcal C_1|\boldsymbol\phi_n)$. As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form
$$
E(\mathbf w)=-\ln p(\mathbf t\mid\mathbf w)=-\sum_{n=1}^{N}\left\{t_n\ln y_n+(1-t_n)\ln(1-y_n)\right\}
$$
where $y_n=\sigma(a_n)$ and $a_n=\mathbf w^T\boldsymbol\phi_n$. Differentiating $E(\mathbf w)$ with respect to $\mathbf w$, we obtain
$$
\begin{aligned}
dE(\mathbf w)&=-\sum_{n=1}^{N}\left[t_n\frac{1}{y_n}dy_n-(1-t_n)\frac{1}{1-y_n}dy_n\right]\\
&\stackrel{a}{=}-\sum_{n=1}^{N}\left[t_n(1-y_n)\boldsymbol\phi_n^Td\mathbf w-(1-t_n)y_n\boldsymbol\phi_n^Td\mathbf w\right]\\
&=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol\phi_n^Td\mathbf w
\end{aligned}
$$
where $\stackrel{a}{=}$ is due to the property $\sigma'(a_n)=\sigma(a_n)(1-\sigma(a_n))$, which gives $dy_n=y_n(1-y_n)\boldsymbol\phi_n^Td\mathbf w$. The gradient is therefore $\nabla E(\mathbf w)=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol\phi_n$.

The contribution to the gradient from data point $n$ is given by the ‘error’ $y_n-t_n$ between the target value and the prediction of the model, times the basis function vector $\boldsymbol\phi_n$. Furthermore, it takes precisely the same form as the gradient of the sum-of-squares error function for the linear regression model. (See Linear Regression: Maximum likelihood and least squares $(\mathrm{LM}.9)$)
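The gradient $\sum_n(y_n-t_n)\boldsymbol\phi_n$ can be checked against finite differences and used in a simple optimizer. The sketch below assumes NumPy and uses plain batch gradient descent with a fixed step size, which is a choice made here for brevity rather than a method described above; the toy data are deliberately not linearly separable so that the maximum likelihood solution is finite.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

def fit_logistic(Phi, t, lr=0.1, steps=500):
    """Batch gradient descent from w = 0 (step size is an assumption)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        w = w - lr * gradient(w, Phi, t)
    return w

# Toy data with a bias feature; the classes overlap on purpose.
Phi = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5],
                [1.0, 3.5], [1.0, 1.0], [1.0, 3.0]])
t = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# Finite-difference check of the analytic gradient at an arbitrary point.
w_test = np.array([0.3, -0.2])
h = 1e-6
numeric_grad = np.array([
    (cross_entropy(w_test + h * e, Phi, t)
     - cross_entropy(w_test - h * e, Phi, t)) / (2.0 * h)
    for e in np.eye(2)
])
analytic_grad = gradient(w_test, Phi, t)

w_fit = fit_logistic(Phi, t)
```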
