Algorithm Learning: Supervised Learning - Logistic Regression - Two-Class Classification

Introduction

This report focuses on two-class classification and introduces Logistic Regression, including its theory, explanations, formula derivations, and potential applications.
Note: This report is a study note based on “CS229 Lecture Notes” by Andrew Ng, Stanford University.


Theories of Logistic Regression

Classification and regression are the two major problems that machine learning can solve. From the perspective of the type of predicted value, quantitative prediction of a continuous variable is called regression, while qualitative prediction of a discrete variable is called classification.

Logistic Regression, despite having the word “Regression” in its name, is actually a classification method for two-class problems (i.e., problems with only two kinds of outputs). In this method the variable we are trying to predict is a variable y that takes on one of two values, 0 or 1. Class 0 is called the “Negative Class” and class 1 the “Positive Class”:
$$y \in \{0, 1\}$$

Note: The formula above gives the value space of the label $y$.


Procedure Explanation

The theory of logistic regression is similar to that of linear regression, and the procedure can be briefly described as follows:

(1) Find an appropriate prediction function (called the Hypothesis in Andrew Ng’s class), generally written as the function $h$.

(2) Define a decision boundary, that is, determine exactly when logistic regression classifies the data as class 1 and when as class 0.

(3) Construct a cost function. In this article, the cross-entropy function derived from maximum likelihood estimation is used as the cost function.

(4) Use gradient descent to minimize the cost function.

1. Hypothesis Representation

In this section, let’s introduce the logistic regression hypothesis first. According to the steps above, we need to find a suitable hypothesis representation $h_θ(x)$. The hypothesis of logistic regression is
$$h_θ(x) = g(θ^T x) = \frac{1}{1 + e^{-θ^T x}}$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
is called the logistic function or the sigmoid function. It looks like this:
[Figure: the sigmoid function, reproduced from https://blog.csdn.net/iracer/article/details/50684609]
Notice that it starts off near 0, rises until it crosses 0.5 at $z = 0$, and then flattens out again. It has the following characteristics: if $z \gg 0$, then $g(z) \to 1$; if $z \ll 0$, then $g(z) \to 0$. This has a strong practical implication: the output of the hypothesis, $h_θ(x)$, can be interpreted as the probability that the sample belongs to class 1.

Based on this, the probabilities that an input $x$ is classified as class 1 and as class 0 are, respectively:
$$P(y=1 \mid x; θ) = h_θ(x)$$
$$P(y=0 \mid x; θ) = 1 - h_θ(x)$$
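To make the hypothesis concrete, here is a minimal sketch of the sigmoid and the hypothesis in Python/NumPy (the function names and the example numbers are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}); works elementwise on NumPy arrays
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); x is expected to include the intercept term x_0 = 1
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -1.2, 0.8])   # illustrative parameters
x = np.array([1.0, 2.0, 3.0])        # x_0 = 1, x_1 = 2, x_2 = 3
p1 = hypothesis(theta, x)            # P(y = 1 | x; theta)
print(p1, 1.0 - p1)                  # the two probabilities sum to 1
```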

2. Decision Boundary

The decision boundary is a property of the hypothesis under its parameters: once we have the parameters $θ$, the decision boundary is determined. The decision boundary is $θ^T x = 0$:
If $θ^T x \ge 0$, predict $y = 1$;
If $θ^T x < 0$, predict $y = 0$.

Let’s start with an example:

[Figure: a dataset separated by a linear decision boundary]

The figure above depicts a linear decision boundary, whose form is as follows:
$$θ^T x = θ_0 x_0 + θ_1 x_1 + θ_2 x_2, \quad x_0 = 1$$
If $θ = [-3, 1, 1]^T$, the prediction function is constructed as:
$$h_θ(x) = g(θ^T x) = \frac{1}{1 + e^{3 - x_1 - x_2}}$$

Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$, that is, if $x_1 + x_2 \ge 3$;
Predict $y = 0$ if $-3 + x_1 + x_2 < 0$, that is, if $x_1 + x_2 < 3$.
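Here is a minimal sketch of this example in Python/NumPy, assuming the $θ = [-3, 1, 1]^T$ above (the function name is illustrative):

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    # Predict y = 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), with x_0 = 1.
    x = np.array([1.0, x1, x2])
    return 1 if theta @ x >= 0 else 0

print(predict(2.0, 2.0))  # x_1 + x_2 = 4 >= 3, so the prediction is 1
print(predict(1.0, 1.0))  # x_1 + x_2 = 2 <  3, so the prediction is 0
```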

3. Cost Function

Assuming that the $m$ training examples were generated independently, the loss function and the $J$ function, derived from maximum likelihood estimation, are as follows:
$$Loss(h_θ(x), y) = \begin{cases} -\log(h_θ(x)), & \text{if } y = 1 \\ -\log(1 - h_θ(x)), & \text{if } y = 0 \end{cases}$$
$$J(θ) = \frac{1}{m}\sum_{i=1}^{m} Loss(h_θ(x^{(i)}), y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)})) \right)$$

And then we’re going to do some formula derivations:
$$P(y=1 \mid x; θ) = h_θ(x)$$
$$P(y=0 \mid x; θ) = 1 - h_θ(x)$$
These two cases can be combined into a single formula:
$$P(y \mid x; θ) = (h_θ(x))^{y} (1 - h_θ(x))^{1-y}$$
The likelihood function is:
$$L(θ) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; θ) = \prod_{i=1}^{m} (h_θ(x^{(i)}))^{y^{(i)}} (1 - h_θ(x^{(i)}))^{1 - y^{(i)}}$$
Taking the logarithm:
$$l(θ) = \log L(θ) = \sum_{i=1}^{m}\left( y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)})) \right)$$
Maximum likelihood estimation chooses the $θ$ that maximizes $l(θ)$. Since gradient descent minimizes, we negate it (and scale by $1/m$) to obtain the final J function:
$$J(θ) = -\frac{1}{m} l(θ)$$
Minimizing $J(θ)$ then yields $θ$.

Note: The likelihood of the parameter $θ$ given the observed data $x$ equals the probability of observing $x$ given $θ$: $L(θ \mid x) = P(x \mid θ)$.
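Here is a minimal sketch of this cost function in Python/NumPy (the names and the toy data are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # X: (m, n+1) design matrix with x_0 = 1 in the first column; y: (m,) labels in {0, 1}
    m = len(y)
    h = sigmoid(X @ theta)
    # J(theta) = -1/m * sum( y*log(h) + (1-y)*log(1-h) )
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([-3.0, 1.0, 1.0])
print(cost(theta, X, y))
```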

4. Gradient Descent

Then we use gradient descent to train this model. The update process for $θ$ is derived as follows:

$$θ_j := θ_j - \alpha \frac{\partial}{\partial θ_j} J(θ)$$

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{h_θ(x^{(i)})} \frac{\partial}{\partial θ_j} h_θ(x^{(i)}) - (1 - y^{(i)}) \frac{1}{1 - h_θ(x^{(i)})} \frac{\partial}{\partial θ_j} h_θ(x^{(i)}) \right)$$

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{h_θ(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_θ(x^{(i)})} \right) \frac{\partial}{\partial θ_j} h_θ(x^{(i)})$$

For $g(z)$:

$$g'(z) = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)(1 - g(z))$$

$$\frac{\partial}{\partial θ_j} g(θ^T x^{(i)}) = g(θ^T x^{(i)})\left(1 - g(θ^T x^{(i)})\right) \frac{\partial}{\partial θ_j} θ^T x^{(i)}$$
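As a quick numerical check of the identity $g'(z) = g(z)(1 - g(z))$, here is a minimal sketch (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference derivative
analytic = sigmoid(z) * (1.0 - sigmoid(z))                     # g(z) * (1 - g(z))
print(np.allclose(numerical, analytic))  # expected: True
```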

Substituting back into the original expression:

$$\frac{\partial}{\partial θ_j} J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} \frac{1}{g(θ^T x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - g(θ^T x^{(i)})} \right) g(θ^T x^{(i)})\left(1 - g(θ^T x^{(i)})\right) \frac{\partial}{\partial θ_j} θ^T x^{(i)}$$

$$\qquad\quad = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\left(1 - g(θ^T x^{(i)})\right) - (1 - y^{(i)}) g(θ^T x^{(i)}) \right) x_j^{(i)}$$

$$\qquad\quad = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)} - g(θ^T x^{(i)}) \right) x_j^{(i)}$$

$$\qquad\quad = \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Finally, we obtain the parameter $θ$ update process:

$$θ_j := θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 0, 1, \ldots, n$$

This has a lot of similarities to the linear regression gradient descent process.
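As a sanity check on the derivation, here is a minimal sketch that computes the gradient $\frac{\partial}{\partial θ_j} J(θ) = \frac{1}{m}\sum_i (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}$ per parameter and compares it against a numerical finite-difference gradient of $J$ (all names and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    # per-parameter form: dJ/dtheta_j = 1/m * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    m, n = X.shape
    h = sigmoid(X @ theta)
    return np.array([np.sum((h - y) * X[:, j]) / m for j in range(n)])

# compare the analytic gradient with a finite-difference approximation
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 2.0, 2.5],
              [1.0, 3.0, 0.5]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2, 0.3])
eps = 1e-6
numerical = np.array([(cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(gradient(theta, X, y), numerical, atol=1e-6))  # expected: True
```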

5. Vectorization

To facilitate code implementation, the parameter update process needs to be vectorized further, and the following is the mathematical derivation of the vectorization process:

  1. The matrix form of the training data is given as follows, where each row of $x$ is a single training example and each column is a feature:

$$x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix} = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \qquad (1)$$

  2. The matrix form of the parameters to be solved is:
$$θ = \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} \qquad (2)$$

  3. Write $x \cdot θ$ as $A$:
$$A = x \cdot θ = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \cdot \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} = \begin{bmatrix} θ_0 x_0^{(1)} + θ_1 x_1^{(1)} + \cdots + θ_n x_n^{(1)} \\ θ_0 x_0^{(2)} + θ_1 x_1^{(2)} + \cdots + θ_n x_n^{(2)} \\ \vdots \\ θ_0 x_0^{(m)} + θ_1 x_1^{(m)} + \cdots + θ_n x_n^{(m)} \end{bmatrix} = \begin{bmatrix} A^{(1)} \\ A^{(2)} \\ \vdots \\ A^{(m)} \end{bmatrix} \qquad (3)$$

  4. Write $h_θ(x) - y$ as $E$:
$$E = h_θ(x) - y = \begin{bmatrix} g(A^{(1)}) - y^{(1)} \\ g(A^{(2)}) - y^{(2)} \\ \vdots \\ g(A^{(m)}) - y^{(m)} \end{bmatrix} = \begin{bmatrix} E^{(1)} \\ E^{(2)} \\ \vdots \\ E^{(m)} \end{bmatrix} = g(A) - y \qquad (4)$$

  5. Substitute the result of formula (4) into the parameter $θ$ update process:
$$θ_j := θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_θ(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\quad\;\; = θ_j - \alpha \frac{1}{m}\sum_{i=1}^{m} E^{(i)} x_j^{(i)} \qquad (5)$$

$$\quad\;\; = θ_j - \alpha \frac{1}{m}\left( x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(m)} \right) E$$

  6. Collecting (5) over all parameters:
$$θ = \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} := \begin{bmatrix} θ_0 \\ θ_1 \\ \vdots \\ θ_n \end{bmatrix} - \frac{1}{m}\alpha \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ \vdots & \vdots & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix} \cdot E$$

$$\quad\;\; = θ - \frac{1}{m}\alpha x^T E \qquad (6)$$

In summary, the vectorized parameter update proceeds in the following steps:
$$A = x \cdot θ \;\to\; E = g(A) - y \;\to\; θ := θ - \frac{1}{m}\alpha x^T E$$
The composite representation is:
$$θ := θ - \frac{1}{m}\alpha x^T \left( g(x \cdot θ) - y \right)$$
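Putting the pieces together, here is a minimal sketch of this vectorized update as a training loop in Python/NumPy (the learning rate, iteration count, and toy data below are illustrative, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=10000):
    # X: (m, n+1) design matrix with x_0 = 1 in the first column; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        A = X @ theta                              # A = x . theta
        E = sigmoid(A) - y                         # E = g(A) - y
        theta = theta - (alpha / m) * (X.T @ E)    # theta := theta - (alpha/m) x^T E
    return theta

# toy dataset: y = 1 roughly when x_1 + x_2 >= 3
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.5],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train(X, y)
print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # predicted labels for the training points
```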

Potential Applications

There are many application scenarios for the logistic regression algorithm. It is already widely used, for example, in application recommendation and in predicting a buyer’s choice, where as many features as possible are extracted to accurately judge and cater to user preferences.

Logistic regression also has potential application value in many fields, including finance, for instance in credit risk: modeling the probability of default (PD). As a measure of the default rate, PD modeling relies on historical data, that is, records of defaults within a financial institution’s credit portfolio over a past period. Default and non-default can be represented by 0 and 1 respectively, so default status becomes a binary label. With features and labels, a model can be built, and since the output is a binary variable, logistic regression is a natural choice; its model parameters are clearly defined and easily accepted by the industry and regulatory authorities.

Logistic regression can also play a role in the field of circuits. By extracting features from the inputs and outputs of an integrated circuit, a machine can automatically report damage information about the circuit without it having to be disassembled.
