Theoretical Exploration
The goal of binary logistic regression is to train a classifier that can make a binary decision about the class of a new input observation.
Consider a single input observation x, which we will represent by a vector of features
$$[x_1, x_2, \ldots, x_n]$$
The classifier output y can be 1 (meaning the observation is a member of the class) or 0 (meaning the observation is not a member of the class).
We want to know the probability
$$P(y=1 \mid x)$$
Logistic regression (LR) solves this task by learning, from a training set, a vector of weights and a bias term.
Once the weights are learned in training, we compute a single number z that expresses the weighted sum of the evidence for the class:
$$z = \left(\sum_{i=1}^{n} w_i x_i\right) + b$$
In the rest of the book we’ll represent such sums using the dot product notation from linear algebra. The dot product of two vectors a and b, written as a·b, is the sum of the products of the corresponding elements of each vector. Thus we have the following formulation:
$$z = w \cdot x + b$$
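As a minimal sketch of this step (the weights, bias, and feature values below are made up purely for illustration, not learned from any data), the weighted sum is a one-line NumPy dot product:

```python
import numpy as np

# Hypothetical learned parameters (for illustration only)
w = np.array([0.5, -1.2, 0.3])   # weight vector
b = 0.1                          # bias term

# Hypothetical feature vector for one input observation
x = np.array([2.0, 1.0, 4.0])

# z = w . x + b, the weighted sum of the evidence for the class
z = np.dot(w, x) + b
print(z)  # 1.1
```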
The value z ranges from −∞ to ∞, but a probability must lie between 0 and 1. To map z into that interval, we apply the sigmoid function:
$$y = \sigma(z) = \frac{1}{1+e^{-z}}$$

$$\lim_{z \to \infty} \frac{1}{1+e^{-z}} = 1, \qquad \lim_{z \to -\infty} \frac{1}{1+e^{-z}} = 0$$
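The sigmoid translates directly into code. Here is a straightforward sketch assuming NumPy (a production version might guard against overflow for very large negative z), along with a numerical check of the two limits above:

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))     # 0.5, the midpoint
print(sigmoid(100.0))   # 1.0 to float precision, approaching the z -> inf limit
print(sigmoid(-100.0))  # ~3.7e-44, approaching the z -> -inf limit of 0
```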
We’re almost there. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make sure that the two cases, P(y=1) and P(y=0), sum to 1. We can do this as follows:
$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1+e^{-(w \cdot x + b)}}$$

$$P(y=0) = 1 - \sigma(w \cdot x + b) = \frac{e^{-(w \cdot x + b)}}{1+e^{-(w \cdot x + b)}}$$
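Continuing the earlier sketch (same hypothetical parameters as before), the two probabilities are computed like this, and they sum to 1 by construction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same hypothetical parameters and input as in the earlier sketch
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([2.0, 1.0, 4.0])

z = np.dot(w, x) + b
p_y1 = sigmoid(z)    # P(y=1 | x)
p_y0 = 1.0 - p_y1    # P(y=0 | x)
print(p_y1, p_y0, p_y1 + p_y0)  # ~0.750, ~0.250, and their sum 1.0
```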
Now we have an algorithm that, given an instance x, computes the probability P(y=1|x). For a test instance x, we say yes if P(y=1|x) is greater than 0.5, and no otherwise. We call 0.5 the decision boundary:
$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1 \mid x) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
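Putting the pieces together, here is a minimal end-to-end sketch of the decision rule (the parameters are again hypothetical; in practice w and b would come from training):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x, threshold=0.5):
    """Return 1 if P(y=1 | x) exceeds the decision boundary, else 0."""
    p_y1 = sigmoid(np.dot(w, x) + b)
    return 1 if p_y1 > threshold else 0

# Hypothetical learned parameters and a test instance
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([2.0, 1.0, 4.0])

print(predict(w, b, x))  # 1, since sigmoid(1.1) ~ 0.75 > 0.5
```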