Algorithm Learning: Supervised Learning - Logistic Regression - Two-Class Classification

Introduction

This report focuses on two-class classification and introduces Logistic Regression, including its theory, explanations, formula derivations, and potential applications.
Note: This report is a study note based on “CS229 Lecture Notes” by Andrew Ng, Stanford University.


Theories of Logistic Regression

Classification and regression are the two major problems that machine learning can solve. From the perspective of the type of predicted value, quantitative prediction of a continuous variable is called regression, while qualitative prediction of a discrete variable is called classification.

Logistic Regression, despite having the word “Regression” in its name, is actually a classification method for two-class problems (i.e., problems with only two kinds of outputs). In this method the variable we are trying to predict is a variable y that takes on one of two values, 0 or 1. Class 0 is called the “Negative Class” and class 1 the “Positive Class”:
$$y \in \{0, 1\}$$

Note: The formula above gives the value space of the label $y$.


Procedure Explanation

The theory of logistic regression is similar to that of linear regression, and the procedure can be briefly described as follows:

(1) Find an appropriate prediction function (called the Hypothesis in Andrew Ng’s class), generally written as the function $h$.

(2) Define a decision boundary, that is, determine exactly when logistic regression classifies the data as class 1 and when as class 0.

(3) Construct a cost function. In this article, the cross-entropy function derived from maximum likelihood estimation is used as the cost function.

(4) Use gradient descent to minimize the cost function.

1. Hypothesis Representation

In this section, let’s introduce the logistic regression hypothesis first. According to the steps above, we need to find a suitable hypothesis representation $h_θ(x)$. The hypothesis of logistic regression is
$$h_θ(x) = g(θ^T x) = \frac{1}{1 + e^{-θ^T x}}$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
is called the logistic function or the sigmoid function. It looks like this:
[Figure: the sigmoid function, reproduced from https://blog.csdn.net/iracer/article/details/50684609]
Notice that it starts off near 0, rises until it crosses 0.5 at $z = 0$, and then flattens out again. It has the following characteristics: if $z \gg 0$, then $g(z) \to 1$; if $z \ll 0$, then $g(z) \to 0$. This has a strong practical implication: the output of the hypothesis, $h_θ(x)$, can be interpreted as the probability that the sample belongs to class 1.

Based on this, the probabilities that an input $x$ is classified as class 1 and as class 0 are, respectively:
$$P(y=1 \mid x; θ) = h_θ(x)$$
$$P(y=0 \mid x; θ) = 1 - h_θ(x)$$
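To make the hypothesis concrete, here is a minimal sketch of the sigmoid and the hypothesis in Python/NumPy (the function names and the example numbers are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}); works elementwise on NumPy arrays
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); x is expected to include the intercept term x_0 = 1
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -1.2, 0.8])   # illustrative parameters
x = np.array([1.0, 2.0, 3.0])        # x_0 = 1, x_1 = 2, x_2 = 3
p1 = hypothesis(theta, x)            # P(y = 1 | x; theta)
print(p1, 1.0 - p1)                  # the two probabilities sum to 1
```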

2. Decision Boundary

The decision boundary is a property of the hypothesis under its parameters: once we have the parameters $θ$, the decision boundary is determined. The decision boundary is $θ^T x = 0$:
If $θ^T x \ge 0$, predict $y = 1$;
If $θ^T x < 0$, predict $y = 0$.

Let’s start with an example:

[Figure: a dataset separated by a linear decision boundary]

The figure above depicts a linear decision boundary, whose form is as follows:
$$θ^T x = θ_0 x_0 + θ_1 x_1 + θ_2 x_2, \quad x_0 = 1$$
If $θ = [-3, 1, 1]^T$, the prediction function is constructed as:
$$h_θ(x) = g(θ^T x) = \frac{1}{1 + e^{3 - x_1 - x_2}}$$

Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$, that is, if $x_1 + x_2 \ge 3$;
Predict $y = 0$ if $-3 + x_1 + x_2 < 0$, that is, if $x_1 + x_2 < 3$.
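Here is a minimal sketch of this example in Python/NumPy, assuming the $θ = [-3, 1, 1]^T$ above (the function name is illustrative):

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    # Predict y = 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), with x_0 = 1.
    x = np.array([1.0, x1, x2])
    return 1 if theta @ x >= 0 else 0

print(predict(2.0, 2.0))  # x_1 + x_2 = 4 >= 3, so the prediction is 1
print(predict(1.0, 1.0))  # x_1 + x_2 = 2 <  3, so the prediction is 0
```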

3. Cost Function

Assuming that the $m$ training examples were generated independently, the loss function and the $J$ function, derived from maximum likelihood estimation, are as follows:
$$Loss(h_θ(x), y) = \begin{cases} -\log(h_θ(x)), & \text{if } y = 1 \\ -\log(1 - h_θ(x)), & \text{if } y = 0 \end{cases}$$
$$J(θ) = \frac{1}{m}\sum_{i=1}^{m} Loss(h_θ(x^{(i)}), y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)})) \right)$$

And then we’re going to do some formula derivations:
$$P(y=1 \mid x; θ) = h_θ(x)$$
$$P(y=0 \mid x; θ) = 1 - h_θ(x)$$
These two cases can be combined into a single formula:
$$P(y \mid x; θ) = (h_θ(x))^{y} (1 - h_θ(x))^{1-y}$$
The likelihood function is:
$$L(θ) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; θ) = \prod_{i=1}^{m} (h_θ(x^{(i)}))^{y^{(i)}} (1 - h_θ(x^{(i)}))^{1 - y^{(i)}}$$
Taking the logarithm:
$$l(θ) = \log L(θ) = \sum_{i=1}^{m}\left( y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)})) \right)$$
Maximum likelihood estimation chooses the $θ$ that maximizes $l(θ)$. Since gradient descent minimizes, we negate it (and scale by $1/m$) to obtain the final J function:
$$J(θ) = -\frac{1}{m} l(θ)$$
Minimizing $J(θ)$ then yields $θ$.

Note: The likelihood of the parameter $θ$ given the observed data $x$ equals the probability of observing $x$ given $θ$: $L(θ \mid x) = P(x \mid θ)$.
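Here is a minimal sketch of this cost function in Python/NumPy (the names and the toy data are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # X: (m, n+1) design matrix with x_0 = 1 in the first column; y: (m,) labels in {0, 1}
    m = len(y)
    h = sigmoid(X @ theta)
    # J(theta) = -1/m * sum( y*log(h) + (1-y)*log(1-h) )
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([-3.0, 1.0, 1.0])
print(cost(theta, X, y))
```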

4. Gradient Descent

Then we use gradient descent to train this model. The update process for $θ$ is derived as follows:

$$θ_j := θ_j - \alpha \frac{\partial}{\partial θ_j} J(θ)$$

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{h_θ(x^{(i)})} \frac{\partial}{\partial θ_j} h_θ(x^{(i)}) - (1 - y^{(i)}) \frac{1}{1 - h_θ(x^{(i)})} \frac{\partial}{\partial θ_j} h_θ(x^{(i)}) \right)$$

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{h_θ(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_θ(x^{(i)})} \right) \frac{\partial}{\partial θ_j} h_θ(x^{(i)})$$

For $g(z)$:

$$g'(z) = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)(1 - g(z))$$

$$\frac{\partial}{\partial θ_j} g(θ^T x^{(i)}) = g(θ^T x^{(i)})\left(1 - g(θ^T x^{(i)})\right) \frac{\partial}{\partial θ_j} θ^T x^{(i)}$$
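As a quick numerical check of the identity $g'(z) = g(z)(1 - g(z))$, here is a minimal sketch (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference derivative
analytic = sigmoid(z) * (1.0 - sigmoid(z))                     # g(z) * (1 - g(z))
print(np.allclose(numerical, analytic))  # expected: True
```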

Substituting back into the original expression:

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{g(θ^T x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - g(θ^T x^{(i)})} \right) g(θ^T x^{(i)})\left(1 - g(θ^T x^{(i)})\right) \frac{\partial}{\partial θ_j} θ^T x^{(i)}$$

$$\qquad\quad = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\left(1 - g(θ^T x^{(i)})\right) - (1 - y^{(i)}) g(θ^T x^{(i)}) \right) x_j^{(i)}$$

$$\qquad\quad = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} - g(θ^T x^{(i)}) \right) x_j^{(i)}$$

$$\qquad\quad = \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Finally, we obtain the parameter $θ$ update process:

$$θ_j := θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 0, 1, \ldots, n$$

This has a lot of similarities to the linear regression gradient descent process.
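As a sanity check on the derivation, here is a minimal sketch that computes the gradient $\frac{\partial}{\partial θ_j} J(θ) = \frac{1}{m}\sum_i (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}$ per parameter and compares it against a numerical finite-difference gradient of $J$ (all names and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    # per-parameter form: dJ/dtheta_j = 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    m, n = X.shape
    h = sigmoid(X @ theta)
    return np.array([np.sum((h - y) * X[:, j]) / m for j in range(n)])

# compare the analytic gradient with a finite-difference approximation
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 2.0, 2.5],
              [1.0, 3.0, 0.5]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2, 0.3])
eps = 1e-6
numerical = np.array([(cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(gradient(theta, X, y), numerical, atol=1e-6))  # expected: True
```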

5. Vectorization

To facilitate code implementation, the parameter update process needs to be vectorized further, and the following is the mathematical derivation of the vectorization process:

  1. The matrix form of the training data is given as follows, where each row of $x$ is a single training example and each column is a feature:

$$x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix} = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \qquad (1)$$

  2. The matrix form of the parameters to be solved is:
$$θ = \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} \qquad (2)$$

  3. Write $x \cdot θ$ as $A$:
$$A = x \cdot θ = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \cdot \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} = \begin{bmatrix} θ_0 x_0^{(1)} + θ_1 x_1^{(1)} + \cdots + θ_n x_n^{(1)} \\ θ_0 x_0^{(2)} + θ_1 x_1^{(2)} + \cdots + θ_n x_n^{(2)} \\ \vdots \\ θ_0 x_0^{(m)} + θ_1 x_1^{(m)} + \cdots + θ_n x_n^{(m)} \end{bmatrix} = \begin{bmatrix} A^{(1)} \\ A^{(2)} \\ \vdots \\ A^{(m)} \end{bmatrix} \qquad (3)$$

  4. Write $h_θ(x) - y$ as $E$:
$$E = h_θ(x) - y = \begin{bmatrix} g(A^{(1)}) - y^{(1)} \\ g(A^{(2)}) - y^{(2)} \\ \vdots \\ g(A^{(m)}) - y^{(m)} \end{bmatrix} = \begin{bmatrix} E^{(1)} \\ E^{(2)} \\ \vdots \\ E^{(m)} \end{bmatrix} = g(A) - y \qquad (4)$$

  5. Substitute the result of formula (4) into the parameter $θ$ update process:
$$θ_j := θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\quad\;\; = θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m} E^{(i)} x_j^{(i)} \qquad (5)$$

$$\quad\;\; = θ_j - \alpha \frac{1}{m}\left( x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(m)} \right) E$$

  6. Collecting (5) over all parameters:
$$θ = \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} := \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} - \frac{1}{m}\alpha \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ \vdots & \vdots & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix} \cdot E$$

$$\quad\;\; = θ - \frac{1}{m}\alpha x^T E \qquad (6)$$

In summary, the vectorized parameter update proceeds in the following steps:
$$A = x \cdot θ \;\to\; E = g(A) - y \;\to\; θ := θ - \frac{1}{m}\alpha x^T E$$
The composite representation is:
$$θ := θ - \frac{1}{m}\alpha x^T \left( g(x \cdot θ) - y \right)$$
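Putting the pieces together, here is a minimal sketch of this vectorized update as a training loop in Python/NumPy (the learning rate, iteration count, and toy data below are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=10000):
    # X: (m, n+1) design matrix with x_0 = 1 in the first column; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        A = X @ theta                              # A = x . theta
        E = sigmoid(A) - y                         # E = g(A) - y
        theta = theta - (alpha / m) * (X.T @ E)    # theta := theta - (alpha/m) x^T E
    return theta

# toy dataset: y = 1 roughly when x_1 + x_2 >= 3
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.5],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train(X, y)
print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # predicted labels for the training points
```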

Potential Applications

There are many application scenarios for the logistic regression algorithm. It is already widely used, for example, in application recommendation and in predicting a buyer’s choice, where as many features as possible are extracted to accurately judge and cater to user preferences.

Logistic regression also has potential application value in many fields, including finance, for instance in credit risk: modeling the probability of default (PD). As a measure of the default rate, PD modeling relies on historical data, that is, records of defaults within a financial institution’s credit portfolio over a past period. Default and non-default can be represented by 0 and 1 respectively, so default status becomes a binary label. With features and labels, a model can be built, and since the output is a binary variable, logistic regression is a natural choice; its model parameters are clearly defined and easily accepted by the industry and regulatory authorities.

Logistic regression can also play a role in the field of circuits. By extracting features from the inputs and outputs of an integrated circuit, a machine can automatically report damage information about the circuit without it having to be disassembled.
