Linear Classification

拉普拉斯的汪

Category: Machine Learning. Tags: machine learning

Article link: https://blog.csdn.net/qq_39599295/article/details/110299417


Reference:

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

- Chapter 4 up to and including 4.3.2

Content

In linear regression models, the model prediction $y(\mathbf x,\mathbf w)$ is given by a linear function of the parameters $\mathbf w$ . In the simplest case, the model is also linear in the input variables and therefore takes the form $y(\mathbf x)=\mathbf w^T\mathbf x+w_0$ , so that $y$ is a real number.

For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range $(0, 1)$ . To achieve this, we consider a generalization of this model in which we transform the linear function of $\mathbf w$ using a nonlinear function $f(\cdot)$ so that
$g(\mathbf x)=f(\mathbf w^T\mathbf x+w_0)\tag{GLM}$
$f(\cdot)$ is known as an activation function.

The decision surfaces correspond to $g(\mathbf x)=\mathrm{constant}$ , so that $\mathbf w^T \mathbf x+w_0=\mathrm{constant}$ , and hence the decision surfaces are linear functions of $\mathbf x$ even if the function $f(\cdot)$ is nonlinear. For this reason, the class of models described by (GLM) is called generalized linear models.

Discriminant Functions (Nonprobabilistic Methods)

A discriminant is a function that takes an input vector $\mathbf x$ and assigns it to one of $K$ classes, denoted $\mathcal C_k$ . In this case, probabilities play no role. In this chapter, we shall restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes.

Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that
$y(\mathbf x)=\mathbf w^T \mathbf x+w_0$
where $\mathbf w$ is called a weight vector, and $w_0$ is a bias. The negative of the bias is sometimes called a threshold.
$\begin{aligned} &\text{assign }\mathbf x\text{ to class }\mathcal C_1 \quad \text{ if }y(\mathbf x)\ge 0~(\text{or }\mathbf w^T \mathbf x\ge -w_0)\\ &\text{assign }\mathbf x\text{ to class }\mathcal C_2 \quad \text{ if }y(\mathbf x)< 0~(\text{or }\mathbf w^T \mathbf x< -w_0) \end{aligned}$
The corresponding decision boundary is therefore defined by the relation $y(\mathbf x) = 0$ , which corresponds to a $(D - 1)$ -dimensional hyperplane within the $D$ -dimensional input space.

The weight vector $\mathbf w$ is orthogonal to every vector lying within the decision surface, and so $\mathbf w$ determines the orientation of the decision surface. The signed normal distance from the origin to the decision surface is $-w_0/\|\mathbf w\|$ , so the bias parameter $w_0$ determines the location of the decision surface. More generally, the normal distance between the parallel hyperplanes $\mathbf a^T \mathbf x+b_1=0$ and $\mathbf a^T\mathbf x+b_2=0$ is $|b_1-b_2|/\|\mathbf a\|$ .
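The geometry above can be checked numerically. The following is a minimal sketch, assuming NumPy; the weight vector, bias, and test points are illustrative values chosen so that the arithmetic is easy to follow, and the function names are hypothetical.

```python
import numpy as np

def discriminant(x, w, w0):
    """Evaluate y(x) = w^T x + w0."""
    return float(w @ x + w0)

def classify(x, w, w0):
    """Return 1 for class C_1 (y(x) >= 0) and 2 for class C_2 (y(x) < 0)."""
    return 1 if discriminant(x, w, w0) >= 0 else 2

def signed_distance(x, w, w0):
    """Signed perpendicular distance from x to the surface y(x) = 0."""
    return discriminant(x, w, w0) / np.linalg.norm(w)

# Illustrative parameters (assumed, not from the text): ||w|| = 5.
w = np.array([3.0, 4.0])
w0 = -10.0

# Distance from the origin to the decision surface: -w0 / ||w||.
origin_to_surface = -w0 / np.linalg.norm(w)

# A point lying exactly on the boundary: 3*2 + 4*1 - 10 = 0.
x_on_boundary = np.array([2.0, 1.0])

print(origin_to_surface)                      # 2.0
print(discriminant(x_on_boundary, w, w0))     # 0.0
print(classify(np.array([4.0, 4.0]), w, w0))  # 1
print(classify(np.array([0.0, 0.0]), w, w0))  # 2
```

Dividing $y(\mathbf x)$ by $\|\mathbf w\|$ gives a signed distance, so its sign also tells us on which side of the hyperplane a point lies.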


As with the linear regression models, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy ‘input’ value $x_0 = 1$ and then define $\tilde {\mathbf w}=(w_0,\mathbf w)$ and $\tilde {\mathbf x}=(x_0, \mathbf x)$ so that
$y(\mathbf x)=\tilde {\mathbf w}^T\tilde{\mathbf x}$

Multiple classes

Consider a single $K$ -class discriminant comprising $K$ linear functions of the form
$y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}$
together with the decision rule
$\text{assign }\mathbf x\text{ to class }\mathcal C_k \quad \text{ if }y_k(\mathbf x)> y_j(\mathbf x),\forall j\ne k$

The decision boundary between class $\mathcal C_k$ and class $\mathcal C_j$ is therefore given by $y_k(\mathbf x)=y_j(\mathbf x)$ and hence corresponds to a $(D - 1)$ -dimensional hyperplane defined by
$(\mathbf w_k-\mathbf w_j)^T\mathbf x+(w_{k0}-w_{j0})=0$
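The $K$ -class rule amounts to an argmax over $K$ linear scores. A minimal sketch, assuming NumPy; the weights and inputs are made-up illustrative values, and classes are indexed from 0 here rather than from 1.

```python
import numpy as np

def predict_class(X, W, b):
    """Return, for each row x of X, the index k that maximizes w_k^T x + w_k0.

    W has shape (K, D) with row k equal to w_k; b has shape (K,).
    """
    scores = X @ W.T + b          # shape (N, K): one score per class
    return np.argmax(scores, axis=1)

# Illustrative three-class discriminant in two dimensions (assumed values).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.zeros(3)

X = np.array([[2.0, 0.5],
              [0.1, 3.0],
              [-1.0, -2.0]])
labels = predict_class(X, W, b)

# A point on the boundary between classes 0 and 1 satisfies
# (w_0 - w_1)^T x + (w_00 - w_10) = 0, i.e. x_1 - x_2 = 0 for these weights.
boundary_point = np.array([1.0, 1.0])
boundary_scores = W @ boundary_point + b
```

For these values the two leading scores at `boundary_point` coincide, which is exactly the condition $y_k(\mathbf x)=y_j(\mathbf x)$ .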


Least Squares for Classification

Consider a general classification problem with $K$ classes, with a 1-of-K binary coding scheme for the target vector $\mathbf t$ . For instance, if we have $K = 5$ classes, then a pattern from class 2 would be given the target vector $\mathbf t=(0,1,0,0,0)^T$ .

Each class $\mathcal C_k$ is described by its own linear model so that
$y_k(\mathbf x)=\mathbf w_k^T\mathbf x+w_{k0}$
where $k=1,\cdots,K$ . We can conveniently group these together using vector notation so that
$\mathbf y(\mathbf x)=\tilde {\mathbf W}^T\tilde{\mathbf x}$
where $\tilde{\mathbf W}$ is a matrix whose $k$ th column comprises the $(D + 1)$ -dimensional vector $\tilde{\mathbf w}_k=(w_{k0},\mathbf w_k^T)^T$ and $\tilde {\mathbf x}$ is the corresponding augmented input vector $(1,\mathbf x^T)^T$ with a dummy input $x_0 = 1$ . A new input $\mathbf x$ is then assigned to the class for which the output $y_k=\tilde {\mathbf w}_k^T\tilde{\mathbf x}$ is largest.

Then consider a training data set $\{\mathbf x_n,\mathbf t_n\}$ where $n=1,\cdots,N$ , and define a matrix $\mathbf T$ whose $n^{th}$ row is the vector $\mathbf t_n^T$ , together with a matrix $\tilde {\mathbf X}$ whose $n^{th}$ row is $\tilde{\mathbf x}_n^T$ . The sum-of-squares error function can then be written as
$E_D(\tilde{\mathbf W})=\frac{1}{2}\mathrm{Tr}\left\{ (\tilde {\mathbf X}\tilde{\mathbf W}-\mathbf T)^T(\tilde {\mathbf X}\tilde{\mathbf W}-\mathbf T) \right\}$
Setting the derivative w.r.t. $\tilde {\mathbf W}$ to zero, and rearranging, we then obtain the solution for $\tilde{\mathbf W}$ in the form
$\tilde{\mathbf W}=(\tilde {\mathbf X}^T\tilde{\mathbf X})^{-1}\tilde {\mathbf X}^T\mathbf T=\tilde {\mathbf X}^{\dagger}\mathbf T$
We then obtain the discriminant function in the form
$\mathbf y(\mathbf x)=\tilde{\mathbf W}^T\tilde{\mathbf x}=\mathbf T^T(\tilde {\mathbf X}^{\dagger})^T\tilde{\mathbf x}$
However, recall that least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian. Least squares can therefore suffer from severe problems as a classifier, for example a lack of robustness to outliers.
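The closed-form solution can be sketched in a few lines. This is a minimal illustration, assuming NumPy and synthetic data: the three cluster centres, the noise level, and the function names are assumptions made for the example, and classes are indexed from 0.

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Return W_tilde = pinv(X_tilde) @ T for 1-of-K targets T."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend dummy input x_0 = 1
    T = np.eye(K)[labels]                       # row n is the target vector t_n^T
    return np.linalg.pinv(X_tilde) @ T          # shape (D + 1, K)

def predict(W_tilde, X):
    """Assign each input to the class with the largest output y_k."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Synthetic, well-separated data (assumed for illustration only).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 50)
X = means[labels] + 0.5 * rng.standard_normal((150, 2))

W_tilde = fit_least_squares(X, labels, K=3)
accuracy = np.mean(predict(W_tilde, X) == labels)

# With 1-of-K targets and a bias term, the K outputs sum to one for every input,
# although individual outputs are not constrained to lie in (0, 1).
outputs = np.hstack([np.ones((X.shape[0], 1)), X]) @ W_tilde
```

The outputs summing to one does not make them probabilities; this is one reason the least-squares classifier is usually replaced by the probabilistic models discussed later.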


Fisher’s linear discriminant

Another way to view a linear classification model without a probabilistic interpretation is in terms of dimensionality reduction. Consider the case of two classes, and suppose we take the $D$ -dimensional input vector $\mathbf x$ and project it down to one dimension using
$y=\mathbf w^T\mathbf x$
If we place a threshold on $y$ and classify $y\ge -w_0$ as class $\mathcal C_1$ , and otherwise class $\mathcal C_2$ , then we obtain our standard linear classifier discussed in the previous section.

In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector $\mathbf w$ , we can select a projection that maximizes the class separation.

To begin with, consider a two-class problem in which there are $N_1$ points of class $\mathcal C_1$ and $N_2$ points of class $\mathcal C_2$ , so that the mean vectors of the two classes are given by
$\mathbf m_1=\frac{1}{N_1}\sum_{n\in \mathcal C_1} \mathbf x_n\quad \quad\quad\mathbf m_2=\frac{1}{N_2}\sum_{n\in \mathcal C_2} \mathbf x_n$
The simplest measure of the separation of the classes, when projected onto $\mathbf w$ , is the separation of the projected class means. This suggests that we might choose $\mathbf w$ so as to maximize
$m_2-m_1=\mathbf w^T(\mathbf m_2-\mathbf m_1)$
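where $m_k=\mathbf w^T\mathbf m_k$ is the mean of the projected data from class $\mathcal C_k$ . Because this expression can be made arbitrarily large simply by increasing the magnitude of $\mathbf w$ , we constrain $\mathbf w$ to have unit length, $\|\mathbf w\|=1$ .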

However, it can happen that the two classes are well separated in the original two-dimensional space $(x_1,x_2)$ but have considerable overlap when projected onto the line joining their means (see the left-hand plot of Figure 4.6 in Bishop).


The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

The within-class variance of the transformed data from class $\mathcal C_k$ is given by
$s_k^2=\sum_{n\in \mathcal C_k}(y_n-m_k)^2$
where $y_n=\mathbf w^T\mathbf x_n$ . The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by
$J(\mathbf w)=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}=\frac{\mathbf w^T\mathbf S_\mathrm{B}\mathbf w}{\mathbf w^T\mathbf S_\mathrm{W}\mathbf w}$
where $\mathbf{S}_{\mathrm{B}}$ is the between-class covariance matrix and is given by
$\mathbf{S}_{\mathrm{B}}=\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)^{{T}}$
and $\mathbf{S}_{\mathrm{W}}$ is the total within-class covariance matrix, given by
$\mathbf{S}_{\mathrm{W}}=\sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{1}\right)^{{T}}+\sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)\left(\mathbf{x}_{n}-\mathbf{m}_{2}\right)^{T}$
Differentiating $J(\mathbf w)$ with respect to $\mathbf w$ , we find that $J(\mathbf w)$ is maximized when
$\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{B}} \mathbf{w}\right) \mathbf{S}_{\mathrm{W}} \mathbf{w}=\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{W}} \mathbf{w}\right) \mathbf{S}_{\mathrm{B}} \mathbf{w}$
From the expression of $\mathbf{S}_{\mathrm{B}}$ , we see that $\mathbf{S}_{\mathrm{B}} \mathbf w$ is always in the direction of $\left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)$ . Furthermore, we do not care about the magnitude of $\mathbf{w}$ , only its direction, and so we can drop the scalar factors $\left(\mathbf{w}^{{T}} \mathbf{S}_{\mathrm{B}} \mathbf{w}\right)$ and $\left(\mathbf{w}^{T} \mathbf{S}_{\mathrm{W}} \mathbf{w}\right)$ . Multiplying both sides by $\mathbf{S}_{\mathrm{W}}^{-1}$ , we obtain
$\mathbf w\propto \mathbf{S}_{\mathrm{W}}^{-1} \left(\mathbf{m}_{2}-\mathbf{m}_{1}\right)$
This gives a specific choice of direction for projecting the data down to one dimension. The projected data can subsequently be used to construct a discriminant, by choosing a threshold on $y$ and classifying $y\ge -w_0$ as class $\mathcal C_1$ , and otherwise class $\mathcal C_2$ .
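The Fisher direction can be compared against the naive direction joining the class means. A minimal sketch, assuming NumPy and synthetic Gaussian data with strongly correlated features; the data parameters and function names are assumptions made for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the unit vector proportional to S_W^{-1} (m_2 - m_1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids forming the inverse explicitly
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """Evaluate J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2) on the projected data."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y2.mean() - y1.mean()) ** 2
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

# Synthetic classes sharing a strongly correlated covariance (assumed values).
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fisher = fisher_direction(X1, X2)

# Naive alternative: project onto the line joining the class means.
diff = X2.mean(axis=0) - X1.mean(axis=0)
w_means = diff / np.linalg.norm(diff)

J_fisher = fisher_criterion(w_fisher, X1, X2)
J_means = fisher_criterion(w_means, X1, X2)
```

Since $J(\mathbf w)$ is invariant to rescaling $\mathbf w$ , normalizing to unit length is only a convenience; `J_fisher` is never smaller than the criterion for any other direction.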

The perceptron algorithm

The perceptron corresponds to a two-class model in which the input vector $\mathbf x$ is first transformed using a fixed nonlinear transformation to give a feature vector $\boldsymbol \phi(\mathbf x)$ , and this is then used to construct a generalized linear model of the form
$y(\mathbf x)=f(\mathbf w^T\boldsymbol\phi(\mathbf x))$
where the nonlinear activation function $f (\cdot)$ is given by a step function of the form
$f(a)=\left\{\begin{aligned} &+1,&& a\ge 0\\ &-1,&& a<0\end{aligned}\right.$
For the perceptron it is convenient to use target values $t = + 1$ for class $\mathcal C_1$ and $t = - 1$ for class $\mathcal C_2$ , which matches the choice of activation function.

We now define the error function. We are seeking a weight vector $\mathbf w$ such that patterns $\mathbf x_n$ in class $\mathcal C_1$ have $\mathbf w^T\boldsymbol \phi(\mathbf x_n)>0$ , whereas patterns in class $\mathcal C_2$ have $\mathbf w^T\boldsymbol \phi(\mathbf x_n)<0$ . Using the $t \in \{-1, +1\}$ target coding scheme, it follows that we would like all patterns to satisfy $\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n>0$ . The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern $\mathbf x_n$ it tries to minimize the quantity $-\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n$ (equivalently, to maximize $\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n$ ). The perceptron criterion is therefore given by
$E_P(\mathbf w)=-\sum_{n\in \mathcal M}\mathbf w^T\boldsymbol \phi(\mathbf x_n)t_n$
where $\mathcal M$ denotes the set of all misclassified patterns.

We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector $\mathbf{w}$ is then given by
$\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E_{\mathrm{P}}(\mathbf{w})=\mathbf{w}^{(\tau)}+\eta \boldsymbol\phi_{n} t_{n}$
where $\eta$ is the learning rate parameter, $\tau$ is an integer that indexes the steps of the algorithm, and $\boldsymbol\phi_n=\boldsymbol\phi(\mathbf x_n)$ is the feature vector of the misclassified pattern selected at step $\tau$ . Because the perceptron function $y(\mathbf{x}, \mathbf{w})$ is unchanged if we multiply $\mathbf w$ by a positive constant, we can set the learning rate parameter $\eta$ equal to $1$ without loss of generality.

The perceptron learning algorithm has a simple interpretation, as follows. If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal{C}_{1}$ we add the vector $\boldsymbol \phi\left(\mathbf{x}_{n}\right)$ onto the current estimate of weight vector $\mathbf{w}$ while for class $\mathcal{C}_{2}$ we subtract the vector $\boldsymbol \phi\left(\mathbf{x}_{n}\right)$ from $\mathbf{w}$ .
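The update rule can be sketched as follows. This is a minimal illustration, assuming NumPy, $\eta=1$ , a feature map that simply prepends a constant bias feature, and a tiny hand-made separable data set. Treating $\mathbf w^T\boldsymbol\phi_n t_n = 0$ as a misclassification is a choice made here so that the all-zero starting vector gets updated.

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    """Cycle through the patterns, updating w on each misclassified one.

    Phi: (N, M) feature matrix; t: (N,) targets in {-1, +1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        n_errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:   # pattern is not on the correct side
                w += phi_n * t_n         # add phi_n for C_1, subtract it for C_2
                n_errors += 1
        if n_errors == 0:                # every pattern correctly classified
            break
    return w

# Hand-made linearly separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.5]])
t = np.array([1, 1, 1, -1, -1, -1])
Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # phi(x) = (1, x_1, x_2)

w = train_perceptron(Phi, t)
print(w)   # [1. 2. 2.]
```

On this data set the first pattern triggers the only update, after which all six patterns satisfy $\mathbf w^T\boldsymbol\phi_n t_n>0$ . For data that are not linearly separable the loop would simply stop after `max_epochs` without converging.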


Probabilistic Generative Models

We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. Here we shall adopt a generative approach in which we model the class-conditional densities $p(\mathbf x|\mathcal C_k)$ , as well as the class priors $p(\mathcal C_k)$ , and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem.

Consider first of all the case of two classes. The posterior probability for class $\mathcal C_1$ can be written as
$\begin{aligned} p(\mathcal C_1|\mathbf x)&=\frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)+p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}\\ &=\frac{1}{1+\exp(-a)}=\sigma (a) \end{aligned}$
where we have defined
$a=\ln \frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}$
and $\sigma(a)$ is the logistic sigmoid function defined by
$\sigma (a)=\frac{1}{1+\exp(-a)}$


It satisfies the following symmetry property
$\sigma (-a)=1-\sigma (a)$
and the derivative of $\sigma (a)$ is
$\sigma '(a)=\frac{\exp(-a)}{(1+\exp(-a))^2}=\frac{1}{1+\exp (-a)}(1-\frac{1}{1+\exp (-a)})=\sigma(a)(1-\sigma(a))$
The inverse of the logistic sigmoid is given by
$a=\ln (\frac{\sigma}{1-\sigma})$
and is known as the logit function. It represents the log of the ratio of probabilities $\ln [p(\mathcal C_1|\mathbf x)/p(\mathcal C_2|\mathbf x)]$ for the two classes, also known as the log odds.
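The three properties of the logistic sigmoid stated above (symmetry, the form of the derivative, and the logit as its inverse) are easy to confirm numerically. A minimal sketch, assuming NumPy; the grid of test values and the finite-difference step are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: ln(s / (1 - s)), the log odds."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)

# Symmetry: sigma(-a) = 1 - sigma(a).
symmetry_ok = np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Derivative: compare a central finite difference with sigma(a) (1 - sigma(a)).
h = 1e-6
numeric_derivative = (sigmoid(a + h) - sigmoid(a - h)) / (2.0 * h)
derivative_ok = np.allclose(numeric_derivative,
                            sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-8)

# Inverse: logit(sigma(a)) recovers a.
inverse_ok = np.allclose(logit(sigmoid(a)), a)

print(symmetry_ok, derivative_ok, inverse_ok)   # True True True
```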

Note that we have simply rewritten the posterior probabilities in an equivalent form, and so the appearance of the logistic sigmoid may seem rather vacuous. However, it will have significance provided $a(\mathbf x)$ takes a simple functional form. We shall shortly consider situations in which $a(\mathbf x)$ is a linear function of $\mathbf x$ , in which case the posterior probability is governed by a generalized linear model.

For the case of $K > 2$ classes, we have
$p(\mathcal C_k|\mathbf x)=\frac{p(\mathbf x|\mathcal C_k)p(\mathcal C_k)}{\sum_j p(\mathbf x|\mathcal C_j)p(\mathcal C_j)}=\frac{\exp(a_k)}{\sum_j\exp(a_j)}$
which is known as the softmax function. Here the quantities $a_k$ are defined by
$a_k=\ln \left(p(\mathbf x|\mathcal C_k)p(\mathcal C_k)\right)$
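A short numerical check shows that the softmax of the $a_k$ reproduces the posterior from Bayes' theorem, and that for two classes it reduces to the logistic sigmoid of $a_1-a_2$ . This is a minimal sketch, assuming NumPy; the likelihood and prior values are made up, and subtracting the maximum before exponentiating is a standard numerical safeguard that is not part of the text.

```python
import numpy as np

def softmax(a):
    """Normalized exponential exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - np.max(a))   # shifting by a constant leaves the result unchanged
    return e / e.sum()

# Made-up class-conditional density values p(x|C_k) at one point x, and priors.
likelihoods = np.array([0.2, 0.5, 0.1])
priors = np.array([0.5, 0.3, 0.2])

# Posterior directly from Bayes' theorem.
joint = likelihoods * priors
posterior_bayes = joint / joint.sum()

# Posterior via softmax of a_k = ln(p(x|C_k) p(C_k)).
a = np.log(likelihoods * priors)
posterior_softmax = softmax(a)

# Two-class special case: softmax reduces to the logistic sigmoid of a_1 - a_2.
two_class = softmax(a[:2])[0]
sigmoid_form = 1.0 / (1.0 + np.exp(-(a[0] - a[1])))
```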

Gaussian class-conditional densities

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. (See Classifiers Based on Bayes Decision Theory: The Bayesian Classifier for Normally Distributed Classes)

The density for class $\mathcal C_k$ is given by
$p(\mathbf x|\mathcal C_k)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol \Sigma|^{1/2}}\exp \left\{-\frac{1}{2}(\mathbf x-\boldsymbol \mu_k)^T{\boldsymbol \Sigma}^{-1}(\mathbf x-\boldsymbol \mu_k) \right\}$
Considering first the case of two classes, we have
$p\left(\mathcal{C}_{1} | \mathbf{x}\right)=\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}\right)$
where we have defined
$\begin{aligned} \mathbf{w} &=\boldsymbol \Sigma^{-1}\left(\boldsymbol \mu_{1}-\boldsymbol \mu_{2}\right) \\ w_{0} &=-\frac{1}{2} \boldsymbol \mu_{1}^{\mathrm{T}}\boldsymbol \Sigma^{-1} \boldsymbol \mu_{1}+\frac{1}{2} \boldsymbol \mu_{2}^{\mathrm{T}} \boldsymbol \Sigma^{-1} \boldsymbol \mu_{2}+\ln \frac{p\left(\mathcal{C}_{1}\right)}{p\left(\mathcal{C}_{2}\right)} \end{aligned}$
We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices) leading to a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid.
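As a worked check of this cancellation, substituting the two Gaussian densities into the definition of $a$ gives the following intermediate steps. This is a sketch of the algebra that uses only the symmetry of $\boldsymbol\Sigma^{-1}$ ; the normalization constants cancel because both classes share $\boldsymbol\Sigma$ .
$\begin{aligned} a &=\ln \frac{p(\mathbf x|\mathcal C_1)p(\mathcal C_1)}{p(\mathbf x|\mathcal C_2)p(\mathcal C_2)}\\ &=-\frac{1}{2}(\mathbf x-\boldsymbol \mu_1)^{\mathrm T}\boldsymbol \Sigma^{-1}(\mathbf x-\boldsymbol \mu_1)+\frac{1}{2}(\mathbf x-\boldsymbol \mu_2)^{\mathrm T}\boldsymbol \Sigma^{-1}(\mathbf x-\boldsymbol \mu_2)+\ln \frac{p(\mathcal C_1)}{p(\mathcal C_2)}\\ &=\left(-\frac{1}{2}\mathbf x^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x+\boldsymbol \mu_1^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x-\frac{1}{2}\boldsymbol \mu_1^{\mathrm T}\boldsymbol \Sigma^{-1}\boldsymbol \mu_1\right)+\left(\frac{1}{2}\mathbf x^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x-\boldsymbol \mu_2^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x+\frac{1}{2}\boldsymbol \mu_2^{\mathrm T}\boldsymbol \Sigma^{-1}\boldsymbol \mu_2\right)+\ln \frac{p(\mathcal C_1)}{p(\mathcal C_2)}\\ &=(\boldsymbol \mu_1-\boldsymbol \mu_2)^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x-\frac{1}{2}\boldsymbol \mu_1^{\mathrm T}\boldsymbol \Sigma^{-1}\boldsymbol \mu_1+\frac{1}{2}\boldsymbol \mu_2^{\mathrm T}\boldsymbol \Sigma^{-1}\boldsymbol \mu_2+\ln \frac{p(\mathcal C_1)}{p(\mathcal C_2)} \end{aligned}$
The two $\mathbf x^{\mathrm T}\boldsymbol \Sigma^{-1}\mathbf x$ terms cancel in the third line, leaving $a=\mathbf w^{\mathrm T}\mathbf x+w_0$ with $\mathbf w$ and $w_0$ as defined above.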

The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal C_k|\mathbf x)$ are constant and so will be given by linear functions of $\mathbf x$ , and therefore the decision boundaries are linear in input space. The prior probabilities $p(\mathcal C_k)$ enter only through the bias parameter $w_0$ so that changes in the priors have the effect of making parallel shifts of the decision boundary and more generally of the parallel contours of constant posterior probability.
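The effect of the priors can be illustrated by computing $\mathbf w$ and $w_0$ for two different values of $p(\mathcal C_1)$ . This is a minimal sketch, assuming NumPy; the means, covariance, and priors are made-up values and `linear_params` is a hypothetical helper name.

```python
import numpy as np

def linear_params(mu1, mu2, Sigma, prior1):
    """Return (w, w0) for the two-class shared-covariance Gaussian model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return w, w0

# Made-up model parameters for illustration.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])

w_a, w0_a = linear_params(mu1, mu2, Sigma, prior1=0.5)
w_b, w0_b = linear_params(mu1, mu2, Sigma, prior1=0.9)

# The orientation w is identical; only the bias changes, by the change in log odds.
bias_change = w0_b - w0_a                        # equals ln(0.9 / 0.1) = ln 9
boundary_shift = bias_change / np.linalg.norm(w_a)
```

Because only $w_0$ changes, the boundary moves parallel to itself by `boundary_shift` in the direction of $-\mathbf w$ , enlarging the region assigned to the class whose prior has increased.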

For the general case of $K$ classes
$a_{k}(\mathbf{x})=\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}+w_{k 0}$
where we have defined
$\begin{aligned} \mathbf{w}_{k} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k} \\ w_{k 0} &=-\frac{1}{2} \boldsymbol{\mu}_{k}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k}+\ln p\left(\mathcal{C}_{k}\right) \end{aligned}$
We see that the $a_{k}(\mathbf{x})$ are again linear functions of $\mathbf x$ as a consequence of the cancellation of the quadratic terms due to the shared covariances. The resulting decision boundaries, corresponding to the minimum misclassification rate, will occur when two of the posterior probabilities (the two largest) are equal, and so will be defined by linear functions of $\mathbf x$ , and so again we have a generalized linear model.
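For $K$ classes, the linear activations $a_k(\mathbf x)$ followed by a softmax should agree with the posterior computed directly from the Gaussian densities. This is a minimal sketch, assuming NumPy; the three means, the shared covariance, the priors, and the test point are made-up values.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Made-up parameters: three classes sharing one covariance matrix.
mus = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
priors = np.array([0.5, 0.3, 0.2])

Sigma_inv = np.linalg.inv(Sigma)
W = mus @ Sigma_inv                                   # row k is w_k^T = (Sigma^{-1} mu_k)^T
w0 = -0.5 * np.sum(W * mus, axis=1) + np.log(priors)  # w_k0 for each class

x = np.array([0.5, 1.5])
posterior_linear = softmax(W @ x + w0)

# Reference: Bayes' theorem applied to the full Gaussian densities.
joint = np.array([gaussian_pdf(x, mu, Sigma) * p for mu, p in zip(mus, priors)])
posterior_bayes = joint / joint.sum()
```

The term $-\frac{1}{2}\mathbf x^{\mathrm T}\boldsymbol\Sigma^{-1}\mathbf x$ and the normalization constant are common to all classes, so omitting them from $a_k$ does not change the softmax output.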

Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf x|\mathcal C_k)$ , we can then determine the values of the parameters, together with the prior class probabilities $p(\mathcal C_k)$ , using maximum likelihood.

Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set $\{ \mathbf x_n,t_n\}$ where $n=1,\cdots,N$ . Here $t_n=1$ denotes class $\mathcal C_1$ and $t_n=0$ denotes class $\mathcal C_2$ . We denote the prior class probability $p(\mathcal C_1)=\pi$ , so that $p(\mathcal C_2)=1-\pi$ . For a data point $\mathbf x_n$ from class $\mathcal C_1$ , we have $t_n=1$ and hence
$p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n}|\mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n}| \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)$
Similarly for class $\mathcal{C}_{2},$ we have $t_{n}=0$ and hence
$p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} |\mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)$
Thus the likelihood function is given by
$p\left(\mathbf{t} \mid \pi, \boldsymbol \mu_{1}, \boldsymbol \mu_{2}, \boldsymbol \Sigma\right)=\prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol \mu_{1}, \boldsymbol \Sigma\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol \mu_{2}, \boldsymbol \Sigma\right)\right]^{1-t_{n}}$
where $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{\mathrm{T}}$ . As usual, it is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$ . The terms in the log likelihood function that depend on $\pi$ are
$\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}$
Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain
$\pi=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}$
where $N_{1}$ denotes the total number of data points in class $\mathcal{C}_{1},$ and $N_{2}$ denotes the total number of data points in class $\mathcal{C}_{2}$ . Thus the maximum likelihood estimate for $\pi$ is simply the fraction of points in class $\mathcal{C}_{1}$ as expected. This result is easily generalized to the multiclass case where again the maximum likelihood estimate of the prior probability associated with class $\mathcal{C}_{k}$ is given by the fraction of the training set points assigned to that class.

Now consider the maximization with respect to $\boldsymbol \mu_{1}$ . Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol \mu_{1}$ , giving
$\sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} |\boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)=-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)+\text { const. }$
Setting the derivative with respect to $\boldsymbol \mu_{1}$ to zero and rearranging, we obtain
$\boldsymbol \mu_1=\frac{1}{N_1}\sum_{n=1}^N t_n \mathbf x_n$
which is simply the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_{1} .$ By a similar argument, the corresponding result for $\boldsymbol \mu_{2}$ is given by
$\boldsymbol \mu_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N}\left(1-t_{n}\right) \mathbf{x}_{n}$
which again is the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_{2}$ .

Finally, consider the maximum likelihood solution for the shared covariance matrix $\boldsymbol \Sigma$ . Picking out the terms in the log likelihood function that depend on $\boldsymbol \Sigma$ , we have
$\begin{aligned} &-\frac{1}{2} \sum_{n=1}^{N} t_{n} \ln |\mathbf{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right) \\ &-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right) \ln |\mathbf{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right) \\ &=-\frac{N}{2} \ln |\mathbf{\Sigma}|-\frac{N}{2} \operatorname{Tr}\left\{\mathbf{\Sigma}^{-1} \mathbf{S}\right\} \end{aligned}$
where we have defined
$\begin{aligned} \mathbf{S} &=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2} \\ \mathbf{S}_{1} &=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol \mu_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol \mu_{1}\right)^{\mathrm{T}} \\ \mathbf{S}_{2} &=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol \mu_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol \mu_{2}\right)^{\mathrm{T}} \end{aligned}$
Setting the derivative to zero,
$d(\ln |\mathbf{\Sigma}|+ \operatorname{Tr}\left\{\mathbf{\Sigma}^{-1} \mathbf{S}\right\})=\operatorname{Tr}(\mathbf{\Sigma}^{-1} d\mathbf{\Sigma})-\operatorname{Tr}(\mathbf{\Sigma}^{-1} \mathbf{S}\mathbf{\Sigma}^{-1}d\mathbf{\Sigma})=0\Longrightarrow\boldsymbol \Sigma=\mathbf{S}$
We see that $\boldsymbol \Sigma=\mathbf{S},$ which represents a weighted average of the covariance matrices associated with each of the two classes separately.
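The maximum likelihood estimates derived above translate directly into code. This is a minimal sketch, assuming NumPy and a tiny hand-made data set chosen so that the results can be verified by hand; the function name is hypothetical.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """Maximum likelihood estimates of pi, mu_1, mu_2 and the shared Sigma.

    X: (N, D) inputs; t: (N,) labels with 1 for class C_1 and 0 for class C_2.
    """
    N = X.shape[0]
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                   # fraction of points in C_1
    mu1 = (t[:, None] * X).sum(axis=0) / N1       # mean of class C_1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1                           # per-class covariance estimates
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average S
    return pi, mu1, mu2, Sigma

# Hand-made data: three points in C_1, two in C_2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.0]])
t = np.array([1, 1, 1, 0, 0])

pi, mu1, mu2, Sigma = fit_shared_covariance_gaussians(X, t)
print(pi)      # 0.6
print(mu1)     # [2. 2.]
print(mu2)     # [6.5 6. ]
print(Sigma)   # [[0.5 0.4], [0.4 0.8]] up to rounding
```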

Probabilistic Discriminative Models

We have seen how to model the class-conditional densities $p(\mathbf x|\mathcal C_k)$ , as well as the class priors $p(\mathcal C_k)$ , and then use these to compute posterior probabilities $p(\mathcal C_k|\mathbf x)$ through Bayes' theorem. For Gaussian class-conditional densities with a shared covariance matrix, the posterior probability can be written as a logistic sigmoid acting on a linear function of $\mathbf x$ . An alternative approach is to assume the generalized linear model form for the posterior directly, without the Gaussian class-conditional assumption.

So far, we have considered classification models that work directly with the original input vector $\mathbf x$ . However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions $\boldsymbol\phi(\mathbf x)$ (as we did in linear regression). The resulting decision boundaries will be linear in the feature space $\boldsymbol\phi$ , and these correspond to nonlinear decision boundaries in the original $\mathbf x$ space, as illustrated in Figure 4.12 of Bishop.


Logistic regression

We model the posterior probability of class $\mathcal C_1$ as a logistic sigmoid acting on a linear function of the feature vector $\boldsymbol \phi$ so that
$p(\mathcal C_1|\boldsymbol \phi)=y(\boldsymbol \phi)=\sigma (\mathbf w^T\boldsymbol \phi)$
and $p(\mathcal C_2|\boldsymbol \phi)=1-p(\mathcal C_1|\boldsymbol \phi)$ . This model is known as logistic regression, although it is a model for classification.

For an $M$ -dimensional feature space $\boldsymbol \phi$ , this model has $M$ adjustable parameters. By contrast, if we had fitted Gaussian class-conditional densities using maximum likelihood, we would have (for two-class classification) $2 M$ parameters for the means and $M (M + 1) / 2$ parameters for the (shared) covariance matrix. Together with the class prior $p(\mathcal C_1)$ , this gives a total of $M (M + 5) / 2 + 1$ parameters, which grows quadratically with $M$ , in contrast to the linear dependence on $M$ of the number of parameters in logistic regression.

We now use maximum likelihood to determine the parameters of the logistic regression model.

For a data set $\left\{\boldsymbol \phi_{n}, t_{n}\right\},$ where $t_{n} \in\{0,1\}$ and $\boldsymbol \phi_n=\phi(\mathbf x_n)$ , the likelihood function can be written
$p(\mathbf{t} \mid \mathbf{w})=\prod_{n=1}^{N} y_{n}^{t_{n}}\left\{1-y_{n}\right\}^{1-t_{n}}$
where $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{\mathrm{T}}$ and $y_{n}=p\left(\mathcal{C}_{1} |\boldsymbol \phi_{n}\right) .$ As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross entropy error function in the form
$E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}$
where $y_{n}=\sigma\left(a_{n}\right)$ and $a_{n}=\mathbf{w}^{\mathrm{T}}\boldsymbol \phi_{n}$ . Differentiating $E(\mathbf w)$ with respect to $\mathbf w$ , we obtain
$\begin{aligned} d E(\mathbf{w})&=-\sum_{n=1}^{N}[t_n\frac{1}{y_n}dy_n-(1-t_n)\frac{1}{1-y_n}dy_n]\\ &\stackrel{a}{=}-\sum_{n=1}^{N}[t_n(1-y_n)\boldsymbol \phi_n^Td\mathbf w-(1-t_n)y_n\boldsymbol \phi_n^Td\mathbf w]\\ &=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol \phi_n^Td\mathbf w \end{aligned}$
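where $\stackrel{a}{=}$ follows from $dy_n=\sigma '(a_n)\,da_n=y_n(1-y_n)\boldsymbol \phi_n^Td\mathbf w$ . The gradient of the error function is therefore $\nabla E(\mathbf w)=\sum_{n=1}^{N}(y_n-t_n)\boldsymbol \phi_n$ .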

The contribution to the gradient from data point $n$ is given by the 'error' $y_n - t_n$ between the prediction of the model and the target value, times the basis function vector $\boldsymbol \phi_n$ . Furthermore, the gradient takes precisely the same form as the gradient of the sum-of-squares error function for the linear regression model. (See Linear Regression: Maximum likelihood and least squares, equation (LM.9).)
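The gradient expression can be verified against finite differences and then used in a simple fitting loop. This is a minimal sketch, assuming NumPy, synthetic noisy one-dimensional data, a feature map $\boldsymbol\phi(x)=(1,x)^T$ , and plain batch gradient descent with an arbitrarily chosen step size. Gradient descent is swapped in here for simplicity and is not a method prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient(w, Phi, t):
    """Gradient of E(w): sum_n (y_n - t_n) phi_n."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

def fit(Phi, t, lr=0.1, steps=2000):
    """Batch gradient descent on the mean cross-entropy (an assumed optimizer)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, Phi, t) / len(t)
    return w

# Synthetic, overlapping classes (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
t = (x + 0.5 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])

# Finite-difference check of the analytic gradient at an arbitrary point.
w_test = np.array([0.3, -0.7])
eps = 1e-6
numeric_grad = np.array([
    (cross_entropy(w_test + eps * e, Phi, t)
     - cross_entropy(w_test - eps * e, Phi, t)) / (2.0 * eps)
    for e in np.eye(2)
])

w_fit = fit(Phi, t)
```

If the classes were perfectly separable, the maximum likelihood solution would push $\|\mathbf w\|$ towards infinity, so a fixed number of steps (or regularization) would be needed to keep the weights finite.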
