Reference:
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
- Chapter 4 up to and including 4.3.2
Content
In linear regression models, the model prediction $y(\mathbf x,\mathbf w)$ is given by a linear function of the parameters $\mathbf w$. In the simplest case, the model is also linear in the input variables and therefore takes the form $y(\mathbf x)=\mathbf w^T\mathbf x+w_0$, so that $y$ is a real number.
For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range $(0,1)$. To achieve this, we consider a generalization of this model in which we transform the linear function of $\mathbf w$ using a nonlinear function $f(\cdot)$ so that

$$
g(\mathbf x)=f(\mathbf w^T\mathbf x+w_0)\tag{GLM}
$$

Here $f(\cdot)$ is known as an activation function.
The decision surfaces correspond to $g(\mathbf x)=\mathrm{constant}$, so that $\mathbf w^T \mathbf x+w_0=\mathrm{constant}$, and hence the decision surfaces are linear functions of $\mathbf x$, even if the function $f(\cdot)$ is nonlinear. For this reason, the models described by $(\mathrm{GLM})$ are called generalized linear models.
Discriminant Functions (Nonprobabilistic Methods)
A discriminant is a function that takes an input vector $\mathbf x$ and assigns it to one of $K$ classes, denoted $\mathcal C_k$. In this case, probabilities play no role. In this chapter, we shall restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes.
Two classes
The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that

$$
y(\mathbf x)=\mathbf w^T \mathbf x+w_0
$$

where $\mathbf w$ is called a weight vector, and $w_0$ is a bias. The negative of the bias is sometimes called a threshold.
An input vector $\mathbf x$ is assigned according to the rule

$$
\begin{aligned}
&\text{assign }\mathbf x\text{ to class }\mathcal C_1 \quad \text{if } y(\mathbf x)\ge 0~(\text{or } \mathbf w^T \mathbf x\ge -w_0)\\
&\text{assign }\mathbf x\text{ to class }\mathcal C_2 \quad \text{if } y(\mathbf x)< 0~(\text{or } \mathbf w^T \mathbf x< -w_0)
\end{aligned}
$$
The corresponding decision boundary is therefore defined by the relation $y(\mathbf x) = 0$, which corresponds to a $(D-1)$-dimensional hyperplane within the $D$-dimensional input space.
The weight vector $\mathbf w$ is orthogonal to every vector lying within the decision surface, and so $\mathbf w$ determines the orientation of the decision surface. The normal distance from the origin to the decision surface is given by $-w_0/\|\mathbf w\|$, so the bias parameter $w_0$ determines the location of the decision surface. (More generally, the normal distance between the parallel hyperplanes $\mathbf a^T \mathbf x+b_1=0$ and $\mathbf a^T\mathbf x+b_2=0$ is $|b_1-b_2|/\|\mathbf a\|$.)
As with the linear regression models, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy ‘input’ value $x_0 = 1$ and then define $\tilde {\mathbf w}=(w_0,\mathbf w)$ and $\tilde {\mathbf x}=(x_0, \mathbf x)$ so that

$$
y(\mathbf x)=\tilde {\mathbf w}^T\tilde{\mathbf x}
$$
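The two-class rule can be sketched in a few lines of plain Python. This is only an illustration: the function names and the numeric values of `w` and `w0` are assumptions chosen for the example, not part of the reference. The signed distance of a point from the decision surface is $y(\mathbf x)/\|\mathbf w\|$, which at the origin gives $w_0/\|\mathbf w\|$.

```python
import math

def discriminant(w, w0, x):
    """Return y(x) = w^T x + w0 for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(w, w0, x):
    """Assign class 1 if y(x) >= 0, otherwise class 2."""
    return 1 if discriminant(w, w0, x) >= 0 else 2

def signed_distance(w, w0, x):
    """Signed perpendicular distance from x to the surface y(x) = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return discriminant(w, w0, x) / norm

# Illustrative (made-up) parameters with ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0
print(classify(w, w0, [3.0, 4.0]))         # y = 9 + 16 - 5 = 20, so class 1
print(classify(w, w0, [0.0, 0.0]))         # y = -5, so class 2
print(signed_distance(w, w0, [0.0, 0.0]))  # w0 / ||w|| = -1.0
```

The origin lies at signed distance $-1$ from the surface, so the surface lies at normal distance $-w_0/\|\mathbf w\| = 1$ from the origin, in agreement with the text.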
Multiple classes
Consider a single $K$-class discriminant comprising $K$ linear functions of the form

$$
y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}
$$

with the decision rule

$$
\text{assign }\mathbf x\text{ to class }\mathcal C_k \quad \text{if } y_k(\mathbf x)\ge y_j(\mathbf x),\ \forall j\ne k
$$

The decision boundary between class $\mathcal C_k$ and class $\mathcal C_j$ is therefore given by $y_k(\mathbf x)=y_j(\mathbf x)$ and hence corresponds to a $(D-1)$-dimensional hyperplane defined by

$$
(\mathbf w_k-\mathbf w_j)^T\mathbf x+(w_{k0}-w_{j0})=0
$$
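As a small sketch of the $K$-class rule, the snippet below evaluates all $K$ linear functions and returns the index of the largest. The helper name `multiclass_predict` and the weights are hypothetical, chosen only so that the scores are easy to check by hand. Classes are indexed from 0.

```python
def multiclass_predict(W, b, x):
    """Return the index k that maximizes y_k(x) = w_k^T x + w_k0.

    W is a list of K weight vectors and b the list of K biases.
    Ties are broken in favour of the smallest index.
    """
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
              for w_k, b_k in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up weights for K = 3 classes in D = 2 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

print(multiclass_predict(W, b, [2.0, 1.0]))    # scores [2, 1, -3]  -> 0
print(multiclass_predict(W, b, [0.5, 3.0]))    # scores [0.5, 3, -3.5] -> 1
print(multiclass_predict(W, b, [-1.0, -2.0]))  # scores [-1, -2, 3] -> 2
```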
Least Squares for Classification
Consider a general classification problem with $K$ classes, with a 1-of-$K$ binary coding scheme for the target vector $\mathbf t$. For instance, if we have $K=5$ classes, then a pattern from class 2 would be given the target vector $\mathbf t=(0,1,0,0,0)^T$.
Each class $\mathcal C_k$ is described by its own linear model so that

$$
y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}
$$

where $k=1,\cdots,K$. We can conveniently group these together using vector notation so that

$$
\mathbf y(\mathbf x)=\tilde {\mathbf W}^T\tilde{\mathbf x}
$$

where $\tilde{\mathbf W}$ is a matrix whose $k$th column comprises the $(D+1)$-dimensional vector $\tilde{\mathbf w}_k=(w_{k0},\mathbf w_k^T)^T$ and $\tilde {\mathbf x}$ is the corresponding augmented input vector $(1,\mathbf x^T)^T$ with a dummy input $x_0 = 1$. We obtain the predicted target vector $\mathbf t$ by assigning $\mathbf x$ to the class for which the output $y_k=\tilde {\mathbf w}_k^T\tilde{\mathbf x}$ is largest.
Then consider a training data set $\{\mathbf x_n,\mathbf t_n\}$ where $n=1,\cdots,N$, and define a matrix $\mathbf T$ whose $n$th row is the vector $\mathbf t_n^T$, together with a matrix $\tilde {\mathbf X}$ whose $n$th row is $\tilde{\mathbf x}_n^T$. The sum-of-squares error function can then be written as

$$
E_D(\tilde{\mathbf W})=\frac{1}{2}\mathrm{Tr}\left\{ (\tilde {\mathbf X}\tilde{\mathbf W}-\mathbf T)^T(\tilde {\mathbf X}\tilde{\mathbf W}-\mathbf T) \right\}
$$
Setting the derivative w.r.t. $\tilde {\mathbf W}$ to zero, and rearranging, we then obtain the solution for $\tilde{\mathbf W}$ in the form

$$
\tilde{\mathbf W}=(\tilde {\mathbf X}^T\tilde{\mathbf X})^{-1}\tilde {\mathbf X}^T\mathbf T=\tilde {\mathbf X}^{\dagger}\mathbf T
$$

where $\tilde {\mathbf X}^{\dagger}$ is the pseudo-inverse of the matrix $\tilde {\mathbf X}$.
We then obtain the discriminant function in the form

$$
\mathbf y(\mathbf x)=\tilde{\mathbf W}^T\tilde{\mathbf x}=\mathbf T^T(\tilde {\mathbf X}^{\dagger})^T\tilde{\mathbf x}
$$
However, recall that least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian. Therefore, least squares may suffer from some severe problems when used for classification (for example, a lack of robustness to outliers).
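To make the closed-form solution concrete, here is a minimal sketch for scalar inputs ($D=1$), where $\tilde{\mathbf X}^T\tilde{\mathbf X}$ is a $2\times 2$ matrix that can be inverted by hand. The function names and the toy data are assumptions for illustration only. Classes are indexed from 0, and the code assumes the inputs are not all identical, so that the determinant is non-zero.

```python
def fit_least_squares_1d(xs, labels, K):
    """Solve (X~^T X~) W~ = X~^T T for scalar inputs with a bias term.

    xs: list of floats; labels: class indices in 0..K-1 (1-of-K targets
    are built implicitly). Returns W~ as [biases, slopes], each of length K.
    """
    N = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = N * sxx - sx * sx  # det of X~^T X~ = [[N, sx], [sx, sxx]]
    # Rows of X~^T T: per-class counts and per-class sums of x.
    b0 = [sum(1.0 for l in labels if l == k) for k in range(K)]
    b1 = [sum(x for x, l in zip(xs, labels) if l == k) for k in range(K)]
    # Apply the 2x2 inverse (1/det) [[sxx, -sx], [-sx, N]].
    biases = [(sxx * b0[k] - sx * b1[k]) / det for k in range(K)]
    slopes = [(-sx * b0[k] + N * b1[k]) / det for k in range(K)]
    return [biases, slopes]

def predict(W, x):
    """Return (argmax class, list of outputs y_k(x))."""
    scores = [W[0][k] + W[1][k] * x for k in range(len(W[0]))]
    return max(range(len(scores)), key=scores.__getitem__), scores

W = fit_least_squares_1d([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1], K=2)
print(predict(W, 1.5))
print(predict(W, 3.0))
```

On this toy data the outputs sum to one for every input, but for $x=3$ one output exceeds 1 and the other is negative. This illustrates that least-squares outputs cannot be read as probabilities.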
Fisher’s linear discriminant
Another way to view a linear classification model without probabilistic interpretation is in terms of dimensionality reduction. Consider the case of two classes, and suppose we take the $D$-dimensional input vector $\mathbf x$ and project it down to one dimension using

$$
y=\mathbf w^T\mathbf x
$$

If we place a threshold on $y$ and classify $y\ge -w_0$ as class $\mathcal C_1$, and otherwise class $\mathcal C_2$, then we obtain our standard linear classifier discussed in the previous section.
In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original $D$-dimensional space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector $\mathbf w$, we can select a projection that maximizes the class separation.
To begin with, consider a two-class problem in which there are $N_1$ points of class $\mathcal C_1$ and $N_2$ points of class $\mathcal C_2$, so that the mean vectors of the two classes are given by

$$
\mathbf m_1=\frac{1}{N_1}\sum_{n\in \mathcal C_1} \mathbf x_n\quad \quad\quad\mathbf m_2=\frac{1}{N_2}\sum_{n\in \mathcal C_2} \mathbf x_n
$$
The simplest measure of the separation of the classes, when projected onto $\mathbf w$, is the separation of the projected class means. This suggests that we might choose $\mathbf w$ so as to maximize

$$
m_2-m_1=\mathbf w^T(\mathbf m_2-\mathbf m_1)
$$

where $m_k=\mathbf w^T\mathbf m_k$ is the mean of the projected data from class $\mathcal C_k$. Since this expression can be made arbitrarily large simply by scaling $\mathbf w$, the weight vector is constrained to have unit length.
However, it can happen that the two classes are well separated in the original two-dimensional space $(x_1,x_2)$ but have considerable overlap when projected onto the line joining their means, as is shown in the left figure below.
The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.
The within-class variance of the transformed data from class $\mathcal C_k$ is given by

$$
s_k^2=\sum_{n\in \mathcal C_k}(y_n-m_k)^2
$$

where $y_n=\mathbf w^T\mathbf x_n$. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by

$$
J(\mathbf w)=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}=\frac{\mathbf w^T\mathbf S_\mathrm{B}\mathbf w}{\mathbf w^T\mathbf S_\mathrm{W}\mathbf w}
$$
where $\mathbf{S}_{\mathrm{B}}$ is the between-class covariance matrix and is given by

$$
\mathbf{S}_{\mathrm{B}}=\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)^{T}
$$

and $\mathbf{S}_{\mathrm{W}}$ is the total within-class covariance matrix, given by

$$
\mathbf{S}_{\mathrm{W}}=\sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)^{T}+\sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)^{T}
$$
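For completeness, the matrix form of the Fisher criterion can be checked by substituting $y_n=\mathbf w^T\mathbf x_n$ and $m_k=\mathbf w^T\mathbf m_k$ into the scalar definitions. The short derivation below is a sketch of that step and uses only the symbols already defined:

$$
\begin{aligned}
(m_2-m_1)^2 &=\left(\mathbf w^T(\mathbf m_2-\mathbf m_1)\right)^2
=\mathbf w^T(\mathbf m_2-\mathbf m_1)(\mathbf m_2-\mathbf m_1)^T\mathbf w
=\mathbf w^T\mathbf S_{\mathrm B}\mathbf w\\
s_k^2 &=\sum_{n\in\mathcal C_k}\left(\mathbf w^T\mathbf x_n-\mathbf w^T\mathbf m_k\right)^2
=\mathbf w^T\left[\sum_{n\in\mathcal C_k}(\mathbf x_n-\mathbf m_k)(\mathbf x_n-\mathbf m_k)^T\right]\mathbf w\\
s_1^2+s_2^2 &=\mathbf w^T\mathbf S_{\mathrm W}\mathbf w
\end{aligned}
$$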
Differentiating $J(\mathbf w)$ with respect to $\mathbf w$, we find that $J(\mathbf w)$ is maximized when

$$
\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{B}} \mathbf{w}\right) \mathbf{S}_{\mathrm{W}} \mathbf{w}=\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{W}} \mathbf{w}\right) \mathbf{S}_{\mathrm{B}} \mathbf{w}
$$

From the expression of $\mathbf{S}_{\mathrm{B}}$, we see that $\mathbf{S}_{\mathrm{B}} \mathbf w$ is always in the direction of $\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)$. Furthermore, we do not care about the magnitude of $\mathbf{w}$, only its direction, and so we can drop the scalar factors $\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{B}} \mathbf{w}\right)$ and $\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{W}} \mathbf{w}\right)$. Multiplying both sides by $\mathbf{S}_{\mathrm{W}}^{-1}$, we therefore obtain

$$
\mathbf w\propto \mathbf{S}_{\mathrm{W}}^{-1} \left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)
$$
We have thus obtained a specific choice of direction for projecting the data down to one dimension. The projected data can subsequently be used to construct a discriminant by choosing a threshold on $y$ and classifying $y\ge -w_0$ as class $\mathcal C_1$, and otherwise as class $\mathcal C_2$.
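The following sketch computes the Fisher direction for two-dimensional data, where $\mathbf S_{\mathrm W}$ is a $2\times 2$ matrix that can be inverted by hand. The function names and the toy points are assumptions made for illustration. The code assumes $\mathbf S_{\mathrm W}$ is non-singular. It also evaluates $J(\mathbf w)$ so the Fisher direction can be compared with the naive choice $\mathbf w=\mathbf m_2-\mathbf m_1$.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def fisher_direction(class1, class2):
    """Return w proportional to S_W^{-1} (m2 - m1) for 2-D points."""
    m1, m2 = mean2(class1), mean2(class2)
    s11 = s12 = s22 = 0.0
    # Accumulate S_W = sum over both classes of (x - m_k)(x - m_k)^T.
    for points, m in ((class1, m1), (class2, m2)):
        for p in points:
            d0, d1 = p[0] - m[0], p[1] - m[1]
            s11 += d0 * d0
            s12 += d0 * d1
            s22 += d1 * d1
    det = s11 * s22 - s12 * s12  # assumed non-zero
    dm0, dm1 = m2[0] - m1[0], m2[1] - m1[1]
    # S_W^{-1} = (1/det) [[s22, -s12], [-s12, s11]].
    return [(s22 * dm0 - s12 * dm1) / det, (-s12 * dm0 + s11 * dm1) / det]

def fisher_criterion(w, class1, class2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) computed on projected data."""
    stats = []
    for points in (class1, class2):
        ys = [w[0] * p[0] + w[1] * p[1] for p in points]
        m = sum(ys) / len(ys)
        stats.append((m, sum((y - m) ** 2 for y in ys)))
    (m1, s1), (m2, s2) = stats
    return (m2 - m1) ** 2 / (s1 + s2)

# Made-up, elongated and overlapping-in-projection classes.
c1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.5), (4.0, 5.0)]
c2 = [(2.0, 0.5), (3.0, 2.0), (4.0, 2.5), (5.0, 4.0)]
w_fisher = fisher_direction(c1, c2)
m1, m2 = mean2(c1), mean2(c2)
w_means = [m2[0] - m1[0], m2[1] - m1[1]]
print(fisher_criterion(w_fisher, c1, c2) >= fisher_criterion(w_means, c1, c2))
```

Because $J(\mathbf w)$ is invariant to rescaling $\mathbf w$, only the direction returned by `fisher_direction` matters.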
The perceptron algorithm
The perceptron corresponds to a two-class model in which the input vector $\mathbf x$ is first transformed using a fixed nonlinear transformation to give a feature vector $\boldsymbol \phi(\mathbf x)$, and this is then used to construct a generalized linear model of the form

$$
y(\mathbf x)=f(\mathbf w^T\boldsymbol\phi(\mathbf x))
$$

where the nonlinear activation function $f(\cdot)$ is given by a step function of the form

$$
f(a)=\left\{\begin{aligned} &+1,&& a\ge 0\\ &-1,&& a<0\end{aligned}\right.
$$
We use the target coding $t=+1$ for class $\mathcal C_1$ and $t=-1$ for class $\mathcal C_2$, which matches the choice of activation function.
Next we define the error function. We are seeking a weight vector $\mathbf w$ such that patterns $\mathbf x_n$ in class $\mathcal C_1$ will have $\mathbf w^T\boldsymbol \phi(\mathbf x_n)>0$, whereas patterns in class $\mathcal C_2$ have $\mathbf w^T\boldsymbol \phi(\mathbf x_n)<0$. Using the $t \in \{-1, +1\}$ target coding scheme it follows that we would like all patterns to satisfy $\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n>0$. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern $\mathbf x_n$ it tries to maximize the quantity $\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n$, or equivalently minimize $-\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n$. The perceptron criterion is therefore given by

$$
E_P(\mathbf w)=-\sum_{n\in \mathcal M}\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n
$$

where $\mathcal M$ denotes the set of all misclassified patterns.
We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector $\mathbf{w}$ is then given by

$$
\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E_{\mathrm{P}}(\mathbf{w})=\mathbf{w}^{(\tau)}+\eta \boldsymbol\phi_{n} t_{n}
$$

where $\boldsymbol\phi_n=\boldsymbol\phi(\mathbf x_n)$ is the feature vector of the misclassified pattern under consideration, $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm. Because the perceptron function $y(\mathbf{x}, \mathbf{w})$ is unchanged if we multiply $\mathbf w$ by a constant, we can set the learning rate parameter $\eta$ equal to $1$ without loss of generality.
The perceptron learning algorithm has a simple interpretation, as follows. If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal{C}_{1}$ we add the vector $\boldsymbol \phi\left(\mathbf{x}_{n}\right)$ onto the current estimate of weight vector $\mathbf{w}$ while for class $\mathcal{C}_{2}$ we subtract the vector $\boldsymbol \phi\left(\mathbf{x}_{n}\right)$ from $\mathbf{w}$.
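The update rule can be sketched as a short training loop with $\eta=1$. This is an illustrative implementation, not code from the reference: the function name, the `max_epochs` cap, and the toy data are assumptions. The first component of each feature vector is a constant 1 acting as the bias. Patterns with $\mathbf w^T\boldsymbol\phi_n t_n = 0$ are treated as misclassified so that learning can start from $\mathbf w=\mathbf 0$.

```python
def train_perceptron(features, targets, max_epochs=100):
    """Perceptron learning with eta = 1.

    features: list of feature vectors phi(x_n), each including a constant
    bias component; targets: +1 for class C1 and -1 for class C2.
    Stops early once a full pass makes no updates.
    """
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(features, targets):
            a = sum(wi * pi for wi, pi in zip(w, phi))
            if a * t <= 0:  # misclassified, or exactly on the boundary
                # add phi for class C1 (t = +1), subtract it for C2 (t = -1)
                w = [wi + t * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Made-up linearly separable data; phi = (1, x1, x2).
Phi = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
t = [1, 1, -1, -1]
w = train_perceptron(Phi, t)
print(w)  # [1.0, 2.0, 1.0] after a single update on the first pattern
```

For data that are not linearly separable the loop does not converge, which is why the sketch caps the number of passes.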
Probabilistic Generative Models
We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. Here we shall adopt a generative approach in which we model the class-conditional densities $p(\mathbf x|\mathcal C_k)$, as well as the class priors $p(\mathcal C_k)$, and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem.
Consider first of all the case of two classes. The posterior probability for class $\mathcal C_1$ can be written as

$$
\begin{aligned}
p(\mathcal C_1|\mathbf x)&=\frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)+p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}\\
&=\frac{1}{1+\exp(-a)}=\sigma (a)
\end{aligned}
$$

where we have defined

$$
a=\ln \frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}
$$

and $\sigma(a)$ is the logistic sigmoid function defined by

$$
\sigma (a)=\frac{1}{1+\exp(-a)}
$$
It satisfies the following symmetry property

$$
\sigma (-a)=1-\sigma (a)
$$

and the derivative of $\sigma (a)$ is

$$
\sigma '(a)=\frac{\exp(-a)}{(1+\exp(-a))^2}=\frac{1}{1+\exp (-a)}\left(1-\frac{1}{1+\exp (-a)}\right)=\sigma(a)(1-\sigma(a))
$$
The inverse of the logistic sigmoid is given by

$$
a=\ln \left(\frac{\sigma}{1-\sigma}\right)
$$

and is known as the logit function. It represents the log of the ratio of probabilities $\ln [p(\mathcal C_1|\mathbf x)/p(\mathcal C_2|\mathbf x)]$ for the two classes, also known as the log odds.
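The three properties of the logistic sigmoid (symmetry, the form of its derivative, and the logit as its inverse) can be checked numerically. The snippet below is a small sanity check under the assumption of moderate values of $a$. The naive formula can overflow in `math.exp` for very negative $a$. The test point `a = 0.7` and the step `h` are arbitrary choices.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, fine for moderate |a|."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(s):
    """Inverse of the sigmoid: the log odds ln(s / (1 - s))."""
    return math.log(s / (1.0 - s))

a = 0.7
h = 1e-6
# Central finite difference approximation to sigma'(a).
numeric_derivative = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)

print(abs(sigmoid(-a) - (1 - sigmoid(a))) < 1e-12)                      # symmetry
print(abs(numeric_derivative - sigmoid(a) * (1 - sigmoid(a))) < 1e-8)   # derivative
print(abs(logit(sigmoid(a)) - a) < 1e-12)                               # inverse
```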
Note that we have simply rewritten the posterior probabilities in an equivalent form, and so the appearance of the logistic sigmoid may seem rather vacuous. However, it will have significance provided $a(\mathbf x)$ takes a simple functional form. We shall shortly consider situations in which $a(\mathbf x)$ is a linear function of $\mathbf x$, in which case the posterior probability is governed by a generalized linear model.
For the case of $K>2$ classes, we have

$$
p(\mathcal C_k|\mathbf x)=\frac{p(\mathbf x|\mathcal C_k)p(\mathcal C_k)}{\sum_j p(\mathbf x|\mathcal C_j)p(\mathcal C_j)}=\frac{\exp(a_k)}{\sum_j\exp(a_j)}
$$

which is known as the softmax function. Here the quantities $a_k$ are defined by

$$
a_k=\ln \big(p(\mathbf x|\mathcal C_k)p(\mathcal C_k)\big)
$$
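A minimal softmax sketch follows. Subtracting the maximum of the $a_k$ before exponentiating is a standard numerical-stability trick that is not part of the text. It leaves the result unchanged because the common factor cancels in the ratio. The example also checks that for $K=2$ the softmax reduces to the logistic sigmoid of $a_1-a_2$.

```python
import math

def softmax(a):
    """Return exp(a_k) / sum_j exp(a_j) for a list of activations."""
    m = max(a)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
print(abs(sum(p) - 1.0) < 1e-12)  # posteriors sum to one

# For two classes, softmax([a1, a2])[0] equals sigma(a1 - a2).
a1, a2 = 2.0, 0.5
two_class = softmax([a1, a2])[0]
print(abs(two_class - 1.0 / (1.0 + math.exp(-(a1 - a2)))) < 1e-12)
```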
Gaussian class-conditional densities
Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. (See Classifiers Based on Bayes Decision Theory: The Bayesian Classifier for Normally Distributed Classes)
The density for class $\mathcal C_k$ is given by

$$
p(\mathbf x|\mathcal C_k)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol \Sigma|^{1/2}}\exp \left\{-\frac{1}{2}(\mathbf x-\boldsymbol \mu_k)^T{\boldsymbol \Sigma}^{-1}(\mathbf x-\boldsymbol \mu_k) \right\}
$$
Consider first the case of two classes, for which we have

$$
p\left(\mathcal{C}_{1} | \mathbf{x}\right)=\sigma\left(\mathbf{w}^{T} \mathbf{x}+w_{0}\right)
$$

where we have defined

$$
\begin{aligned}
\mathbf{w} &=\boldsymbol \Sigma^{-1}\left(\boldsymbol \mu_{1}-\boldsymbol \mu_{2}\right) \\
w_{0} &=-\frac{1}{2} \boldsymbol \mu_{1}^{T}\boldsymbol \Sigma^{-1} \boldsymbol \mu_{1}+\frac{1}{2} \boldsymbol \mu_{2}^{T} \boldsymbol \Sigma^{-1} \boldsymbol \mu_{2}+\ln \frac{p\left(\mathcal{C}_{1}\right)}{p\left(\mathcal{C}_{2}\right)}
\end{aligned}
$$

We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices), leading to a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid.
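The cancellation can be made explicit by substituting the Gaussian densities into the definition of $a$. The normalization constants are identical for both classes and drop out of the ratio. The following is a sketch of that intermediate step and uses the symmetry of $\boldsymbol\Sigma^{-1}$:

$$
\begin{aligned}
a &=\ln \frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}\\
&=-\frac{1}{2}(\mathbf x-\boldsymbol\mu_1)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu_1)+\frac{1}{2}(\mathbf x-\boldsymbol\mu_2)^T\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu_2)+\ln\frac{p(\mathcal C_1)}{p(\mathcal C_2)}\\
&=(\boldsymbol\mu_1-\boldsymbol\mu_2)^T\boldsymbol\Sigma^{-1}\mathbf x-\frac{1}{2}\boldsymbol\mu_1^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_1+\frac{1}{2}\boldsymbol\mu_2^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_2+\ln\frac{p(\mathcal C_1)}{p(\mathcal C_2)}
\end{aligned}
$$

The two terms $\mp\frac{1}{2}\mathbf x^T\boldsymbol\Sigma^{-1}\mathbf x$ cancel in the last step, and the remaining expression is $\mathbf w^T\mathbf x+w_0$ with $\mathbf w$ and $w_0$ as defined above.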
The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal C_k|\mathbf x)$ are constant and so will be given by linear functions of $\mathbf x$, and therefore the decision boundaries are linear in input space. The prior probabilities $p(\mathcal C_k)$ enter only through the bias parameter $w_0$ so that changes in the priors have the effect of making parallel shifts of the decision boundary and more generally of the parallel contours of constant posterior probability.
For the general case of $K$ classes, we have

$$
a_{k}(\mathbf{x})=\mathbf{w}_{k}^{T} \mathbf{x}+w_{k 0}
$$

where we have defined

$$
\begin{aligned}
\mathbf{w}_{k} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k} \\
w_{k 0} &=-\frac{1}{2} \boldsymbol{\mu}_{k}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k}+\ln p\left(\mathcal{C}_{k}\right)
\end{aligned}
$$

We see that the $a_{k}(\mathbf{x})$ are again linear functions of $\mathbf x$ as a consequence of the cancellation of the quadratic terms due to the shared covariances. The resulting decision boundaries, corresponding to the minimum misclassification rate, will occur when two of the posterior probabilities (the two largest) are equal, and so will be defined by linear functions of $\mathbf x$, and so again we have a generalized linear model.
Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf x|\mathcal C_k)$, we can then determine the values of the parameters, together with the prior class probabilities $p(\mathcal C_k)$, using maximum likelihood.
Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set $\{ \mathbf x_n,t_n\}$ where $n=1,\cdots,N$. Here $t_n=1$ denotes class $\mathcal C_1$ and $t_n=0$ denotes class $\mathcal C_2$. We denote the prior class probability $p(\mathcal C_1)=\pi$, so that $p(\mathcal C_2)=1-\pi$. For a data point $\mathbf x_n$ from class $\mathcal C_1$, we have $t_n=1$ and hence

$$
p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n}|\mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n}| \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}\right)
$$

Similarly for class $\mathcal{C}_{2}$, we have $t_{n}=0$ and hence

$$
p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} |\mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}\right)
$$
Thus the likelihood function is given by

$$
p\left(\mathbf{t} \mid \pi, \boldsymbol \mu_{1}, \boldsymbol \mu_{2}, \boldsymbol \Sigma\right)=\prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol \mu_{1}, \boldsymbol \Sigma\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol \mu_{2}, \boldsymbol \Sigma\right)\right]^{1-t_{n}}
$$

where $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{T}$. As usual, it is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$. The terms in the log likelihood function that depend on $\pi$ are

$$
\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}
$$

Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain

$$
\pi=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}
$$

where $N_{1}$ denotes the total number of data points in class $\mathcal{C}_{1}$, and $N_{2}$ denotes the total number of data points in class $\mathcal{C}_{2}$. Thus the maximum likelihood estimate for $\pi$ is simply the fraction of points in class $\mathcal{C}_{1}$, as expected. This result is easily generalized to the multiclass case where again the maximum likelihood estimate of the prior probability associated with class $\mathcal{C}_{k}$ is given by the fraction of the training set points assigned to that class.
Now consider the maximization with respect to $\boldsymbol \mu_{1}$. Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol \mu_{1}$, giving

$$
\sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}\right)=-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)+\text{const.}
$$

Setting the derivative with respect to $\boldsymbol \mu_{1}$ to zero and rearranging, we obtain

$$
\boldsymbol \mu_1=\frac{1}{N_1}\sum_{n=1}^N t_n \mathbf x_n
$$

which is simply the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_{1}$. By a similar argument, the corresponding result for $\boldsymbol \mu_{2}$ is given by

$$
\boldsymbol \mu_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N}\left(1-t_{n}\right) \mathbf{x}_{n}
$$

which again is the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_{2}$.
Finally, consider the maximum likelihood solution for the shared covariance matrix $\boldsymbol \Sigma$. Picking out the terms in the log likelihood function that depend on $\boldsymbol \Sigma$, we have

$$
\begin{aligned}
&-\frac{1}{2} \sum_{n=1}^{N} t_{n} \ln |\boldsymbol{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right) \\
&-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right) \ln |\boldsymbol{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right) \\
&=-\frac{N}{2} \ln |\boldsymbol{\Sigma}|-\frac{N}{2} \operatorname{Tr}\left\{\boldsymbol{\Sigma}^{-1} \mathbf{S}\right\}
\end{aligned}
$$

where we have defined

$$
\begin{aligned}
\mathbf{S} &=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2} \\
\mathbf{S}_{1} &=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol \mu_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol \mu_{1}\right)^{T} \\
\mathbf{S}_{2} &=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol \mu_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol \mu_{2}\right)^{T}
\end{aligned}
$$

Setting the differential with respect to $\boldsymbol\Sigma$ to zero gives

$$
d\left(\ln |\boldsymbol{\Sigma}|+ \operatorname{Tr}\left\{\boldsymbol{\Sigma}^{-1} \mathbf{S}\right\}\right)=\operatorname{Tr}\left(\boldsymbol{\Sigma}^{-1} d\boldsymbol{\Sigma}\right)-\operatorname{Tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{S}\boldsymbol{\Sigma}^{-1}d\boldsymbol{\Sigma}\right)=0\Longrightarrow\boldsymbol \Sigma=\mathbf{S}
$$

Thus the maximum likelihood solution is $\boldsymbol \Sigma=\mathbf{S}$, which represents a weighted average of the covariance matrices associated with each of the two classes separately.
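The maximum likelihood estimates above can be sketched for one-dimensional inputs, where the shared covariance matrix reduces to a single variance. The function names and the toy data are assumptions made for this illustration. The code assumes both classes are non-empty and the pooled variance is non-zero. The fitted parameters are then plugged into $p(\mathcal C_1|x)=\sigma(wx+w_0)$ with $w=(\mu_1-\mu_2)/\sigma^2$.

```python
import math

def fit_shared_gaussian_1d(xs, ts):
    """ML estimates for two 1-D Gaussian classes sharing one variance.

    ts[n] = 1 marks class C1 and ts[n] = 0 marks class C2.
    Returns (pi, mu1, mu2, var).
    """
    N = len(xs)
    N1 = sum(ts)
    N2 = N - N1
    pi = N1 / N
    mu1 = sum(t * x for x, t in zip(xs, ts)) / N1
    mu2 = sum((1 - t) * x for x, t in zip(xs, ts)) / N2
    # S = (N1/N) S1 + (N2/N) S2: pooled squared deviations divided by N.
    var = sum(t * (x - mu1) ** 2 + (1 - t) * (x - mu2) ** 2
              for x, t in zip(xs, ts)) / N
    return pi, mu1, mu2, var

def posterior_c1(x, pi, mu1, mu2, var):
    """p(C1 | x) = sigma(w x + w0) for the shared-variance model."""
    w = (mu1 - mu2) / var
    w0 = (-mu1 ** 2 + mu2 ** 2) / (2 * var) + math.log(pi / (1 - pi))
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))

# Made-up symmetric toy data.
xs = [1.0, 2.0, 3.0, -1.0, -2.0, -3.0]
ts = [1, 1, 1, 0, 0, 0]
pi, mu1, mu2, var = fit_shared_gaussian_1d(xs, ts)
print(pi, mu1, mu2)                         # 0.5 2.0 -2.0
print(posterior_c1(0.0, pi, mu1, mu2, var)) # 0.5 by symmetry
```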
Probabilistic Discriminative Models
We have seen how to model the class-conditional densities $p(\mathbf x|\mathcal C_k)$, as well as the class priors $p(\mathcal C_k)$, and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem. For [Gaussian class-conditional densities](#Gaussian class-conditional densities), the posterior probability can be written as a logistic sigmoid acting on a linear function of $\mathbf x$. An alternative approach is to assume directly that the posterior takes the form of a generalized linear model, without the Gaussian class-conditional assumption.
So far, we have considered classification models that work directly with the original input vector $\mathbf x$. However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions $\boldsymbol\phi(\mathbf x)$ (as we did in linear regression). The resulting decision boundaries will be linear in the feature space $\boldsymbol\phi$, and these correspond to nonlinear decision boundaries in the original $\mathbf x$ space, as illustrated in Figure 4.12.
Logistic regression
We model the posterior probability of class $\mathcal C_1$ as a logistic sigmoid acting on a linear function of the feature vector $\boldsymbol \phi$ so that

$$
p(\mathcal C_1|\boldsymbol \phi)=y(\boldsymbol \phi)=\sigma (\mathbf w^T\boldsymbol \phi)
$$

and $p(\mathcal C_2|\boldsymbol \phi)=1-p(\mathcal C_1|\boldsymbol \phi)$. This model is known as logistic regression, although it is a model for classification.
For an $M$-dimensional feature space $\boldsymbol \phi$, this model has $M$ adjustable parameters. By contrast, if we had fitted [Gaussian class conditional densities](#Gaussian class-conditional densities) using maximum likelihood, we would have (for two-class classification) $2M$ parameters for the means and $M(M+1)/2$ parameters for the (shared) covariance matrix. Together with the class prior $p(\mathcal C_1)$, this gives a total of $M(M+5)/2+1$ parameters, which grows quadratically with $M$, in contrast to the linear dependence on $M$ of the number of parameters in logistic regression.
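The parameter counts are simple arithmetic, and tabulating them for a few values of $M$ makes the linear versus quadratic growth tangible. The helper names below are illustrative only.

```python
def logistic_param_count(M):
    """Logistic regression: one weight per feature dimension."""
    return M

def generative_param_count(M):
    """Two Gaussian classes with a shared covariance matrix, plus the prior."""
    means = 2 * M
    covariance = M * (M + 1) // 2  # symmetric M x M matrix
    prior = 1
    return means + covariance + prior  # equals M(M+5)/2 + 1

for M in (2, 10, 100):
    print(M, logistic_param_count(M), generative_param_count(M))
# 2 2 8
# 10 10 76
# 100 100 5251
```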
We now use maximum likelihood to determine the parameters of the logistic regression model.
For a data set $\left\{\boldsymbol \phi_{n}, t_{n}\right\}$, where $t_{n} \in\{0,1\}$ and $\boldsymbol \phi_n=\boldsymbol\phi(\mathbf x_n)$, the likelihood function can be written

$$
p(\mathbf{t} \mid \mathbf{w})=\prod_{n=1}^{N} y_{n}^{t_{n}}\left\{1-y_{n}\right\}^{1-t_{n}}
$$

where $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{T}$ and $y_{n}=p\left(\mathcal{C}_{1} |\boldsymbol \phi_{n}\right)$. As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross entropy error function in the form

$$
E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}
$$
where $y_{n}=\sigma\left(a_{n}\right)$ and $a_{n}=\mathbf{w}^{T}\boldsymbol \phi_{n}$. Differentiating $E(\mathbf w)$ w.r.t. $\mathbf w$, we obtain

$$
\begin{aligned}
d E(\mathbf{w})&=-\sum_{n=1}^{N}\left[t_n\frac{1}{y_n}dy_n-(1-t_n)\frac{1}{1-y_n}dy_n\right]\\
&\stackrel{a}{=}-\sum_{n=1}^{N}\left[t_n(1-y_n)\boldsymbol \phi_n^Td\mathbf w-(1-t_n)y_n\boldsymbol \phi_n^Td\mathbf w\right]\\
&=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol \phi_n^Td\mathbf w
\end{aligned}
$$

where $\stackrel{a}{=}$ is due to the property $\sigma '(a_n)=\sigma(a_n)(1-\sigma(a_n))$, so that $dy_n=y_n(1-y_n)\boldsymbol\phi_n^Td\mathbf w$. Hence the gradient is $\nabla E(\mathbf w)=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol\phi_n$.
The contribution to the gradient from data point $n$ is given by the ‘error’ $y_n - t_n$ between the target value and the prediction of the model, times the basis function vector $\boldsymbol \phi_n$. Furthermore, it takes precisely the same form as the gradient of the sum-of-squares error function for the linear regression model. (See Linear Regression: Maximum likelihood and least squares $(\mathrm{LM}.9)$)
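As a rough sketch of how this gradient can be used, the code below minimizes the cross entropy error by plain batch gradient descent. This is an assumption of the example rather than a method prescribed by the text: the learning rate, the number of steps, the function names, and the toy data are all illustrative. The data are deliberately not linearly separable, so the maximum likelihood solution is finite.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    total = 0.0
    for phi_n, t_n in zip(Phi, t):
        y_n = sigmoid(sum(wi * pi for wi, pi in zip(w, phi_n)))
        total -= t_n * math.log(y_n) + (1 - t_n) * math.log(1 - y_n)
    return total

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n."""
    grad = [0.0] * len(w)
    for phi_n, t_n in zip(Phi, t):
        y_n = sigmoid(sum(wi * pi for wi, pi in zip(w, phi_n)))
        for i, p in enumerate(phi_n):
            grad[i] += (y_n - t_n) * p
    return grad

def fit(Phi, t, lr=0.1, steps=500):
    """Batch gradient descent from w = 0 (hypothetical settings)."""
    w = [0.0] * len(Phi[0])
    for _ in range(steps):
        g = gradient(w, Phi, t)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Made-up overlapping data; phi = (1, x) with a constant bias feature.
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 0, 1, 1, 1]
w = fit(Phi, t)
print(cross_entropy(w, Phi, t) < cross_entropy([0.0, 0.0], Phi, t))
```

Comparing `gradient` against a finite-difference approximation of `cross_entropy` is a quick way to confirm that the analytic expression $\sum_n (y_n-t_n)\boldsymbol\phi_n$ has been coded correctly.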