Binary Classification
Yes or No
$(x,y):\ x\in\mathbb R^{n_x},\ y\in\{0,1\}$
m training examples:
$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
In matrix form:
$X\in\mathbb R^{n_x\times m},\quad Y\in\mathbb R^{1\times m}$
$$X=\begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\ \vdots & \vdots & & \vdots\\ x_{n_x}^{(1)} & x_{n_x}^{(2)} & \dots & x_{n_x}^{(m)} \end{bmatrix}$$
$$Y=\begin{bmatrix} y^{(1)} & y^{(2)} & \dots & y^{(m)} \end{bmatrix}$$
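As a concrete illustration of this layout, here is a minimal NumPy sketch (the feature values and labels are made up) that stacks each training example as a column of $X$ and each label as an entry of $Y$:

```python
import numpy as np

# Hypothetical tiny training set: m = 3 examples, n_x = 2 features each.
x1 = np.array([0.5, 1.2])   # x^(1)
x2 = np.array([1.5, 0.3])   # x^(2)
x3 = np.array([0.7, 0.9])   # x^(3)
labels = [1, 0, 1]          # y^(1), y^(2), y^(3)

# Stack examples as columns so that X has shape (n_x, m) and Y has shape (1, m).
X = np.column_stack([x1, x2, x3])
Y = np.array(labels).reshape(1, -1)

print(X.shape)  # (2, 3)
print(Y.shape)  # (1, 3)
```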
Logistic Regression
A learning algorithm used when the output labels $y$ in a supervised learning problem are all either 0 or 1, i.e. for binary classification problems.
Given $x$, we want $\hat{y}=P(y=1\mid x)$, the probability that $y$ equals 1 given the input features $x$.
In Linear Regression:
Parameters:
$w\in \mathbb R^{n_x},\ b\in \mathbb R$
Output:
$\hat{y}=w^Tx+b$
But in Logistic Regression,
$0\leq \hat{y}\leq 1$
Output:
$\hat{y}=\sigma(w^Tx+b)$
Sigmoid function
$\sigma(z)=\frac{1}{1+e^{-z}}$
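A minimal NumPy version of the sigmoid (applied element-wise, so it works on scalars or whole arrays):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: maps any real value into (0, 1).
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                           # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # values squashed into (0, 1)
```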
Loss function (Error function)
Measures how good the output $\hat{y}$ is when the true label is $y$.
In Linear Regression:
$\mathcal L(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$
But in Logistic Regression (the squared-error loss makes the optimization problem non-convex here, so the cross-entropy loss is used instead):
$\mathcal L(\hat{y},y)=-(y\log \hat{y}+(1-y)\log(1-\hat{y}))$
If $y=1$, then $\mathcal L(\hat{y},y)=-y\log \hat{y}=-\log \hat{y}$; when $\hat{y}\rightarrow 1$, $\mathcal L(\hat{y},y)\rightarrow 0$.
If $y=0$, then $\mathcal L(\hat{y},y)=-\log(1-\hat{y})$; when $\hat{y}\rightarrow 0$, $\mathcal L(\hat{y},y)\rightarrow 0$.
Cost function
The loss function measures how well you’re doing on a single training example.
The cost function measures how well you’re doing on an entire training set.
$$J(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal L(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$
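As a sketch, the cost can be computed in NumPy from the predictions $\hat{y}^{(i)}$ (stored here in a hypothetical array `A` of shape (1, m)) and the labels `Y`; the variable names follow the notation above, not any particular library:

```python
import numpy as np

def cost(A, Y):
    # A: predictions \hat{y}^{(i)}, shape (1, m); Y: labels, shape (1, m).
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# Example: three predictions against their labels.
A = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1,   0,   1  ]])
print(cost(A, Y))  # small value, since the predictions mostly agree with the labels
```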
Gradient Descent
Computation Graph
Chain rule (Backward Calculation)
These derivatives come from the example graph $u=bc$, $V=a+u$, $J=3V$:
$\frac{dJ}{dV}=3$
$\frac{dV}{da}=1$
$\frac{dJ}{da}= \frac{dJ}{dV}\frac{dV}{da}=3$
$\frac{dJ}{du}= \frac{dJ}{dV}\frac{dV}{du}=3$
$\frac{dJ}{db}= \frac{dJ}{du}\frac{du}{db}=3c$
Variable naming convention in code: the derivative of the final output variable with respect to some variable var is stored in a variable named dvar.
dvar : $\frac{d\,\mathrm{FinalOutputVar}}{d\,\mathrm{var}}$
dV : $\frac{dJ}{dV}$
da : $\frac{dJ}{da}$
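A tiny numeric check of the chain rule on the example graph above ($u=bc$, $V=a+u$, $J=3V$), using the dvar naming convention; the input values are arbitrary:

```python
# Forward pass through the example graph: u = b*c, V = a + u, J = 3*V
a, b, c = 5.0, 3.0, 2.0
u = b * c
V = a + u
J = 3 * V

# Backward pass (chain rule), with code names dvar = dJ/dvar
dV = 3.0        # dJ/dV
da = dV * 1.0   # dJ/da = dJ/dV * dV/da
du = dV * 1.0   # dJ/du = dJ/dV * dV/du
db = du * c     # dJ/db = dJ/du * du/db = 3c
dc = du * b     # dJ/dc = dJ/du * du/dc = 3b

print(J, da, db, dc)  # 33.0 3.0 6.0 9.0
```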
Computation Graph of Logistic Regression
- Forward (left-to-right) calculation to compute the cost function
- Backward (right-to-left) calculation to compute the derivatives (see the sketch below)
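A minimal sketch of both passes for a single example (the feature values, weights, and label below are arbitrary): the forward pass computes $z$, $a$, and the loss, and the backward pass applies the chain rule to get $dz$, $dw$, and $db$:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary single example with n_x = 2 features.
x = np.array([1.0, 2.0])
y = 1
w = np.array([0.1, -0.2])
b = 0.0

# Forward (left to right): compute the loss.
z = np.dot(w, x) + b
a = sigmoid(z)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward (right to left): compute the derivatives.
dz = a - y      # dL/dz
dw = x * dz     # dL/dw_j = x_j * dz
db = dz         # dL/db

print(loss, dw, db)
```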
Gradient Descent on $m$ Examples
Initialize:
$J=0,\ dw_1=0,\ \dots,\ dw_n=0,\ db=0$
Loop over the training set:
For $i=1$ to $m$:
  $z^{(i)} = w^Tx^{(i)}+b$
  $a^{(i)} = \sigma(z^{(i)})$
  $J += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
  $dz^{(i)} = a^{(i)} -y^{(i)}$
  $db += dz^{(i)}$
  For $j=1$ to $n$:
    $dw_j += x_j^{(i)}dz^{(i)}$
Average over the $m$ examples:
$J = \frac{1}{m}J$
$db = \frac{1}{m}db$
For $j=1$ to $n$:
  $dw_j = \frac{1}{m} dw_j$
$J,dw_1,\dots,dw_n,db$ now reflect the entire training set.
Gradient descent update:
For $j=1$ to $n$:
  $w_j = w_j-\alpha\, dw_j$
$b = b- \alpha\, db$
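A direct (non-vectorized) Python translation of this procedure, assuming `X` has shape (n_x, m) with one example per column and `Y` has shape (1, m) as above; the function name and the learning rate value are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step_loops(w, b, X, Y, alpha=0.01):
    # One gradient descent step over the m examples, with explicit for-loops.
    n, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros(n)
    for i in range(m):
        z = np.dot(w, X[:, i]) + b
        a = sigmoid(z)
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]
        db += dz
        for j in range(n):
            dw[j] += X[j, i] * dz
    # Average over the training set.
    J, dw, db = J / m, dw / m, db / m
    # Gradient descent update.
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
```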
Vectorization
import numpy as np
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a,b)
toc = time.time()
print("Vectorized Version:"+ str(1000*(toc-tic)) + 'ms')
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print("For Loop:"+ str(1000*(toc-tic)) + 'ms')
The explicit for loop is roughly 300 times slower than the vectorized version.
Vectorization techniques allow you to get rid of these explicit for-loops in your code.
Vectorized over the whole training set:
$Z=w^TX+b$ (in code: np.dot(w.T, X) + b)
$A=\sigma(Z)$
$dZ=A-Y$
$dw=\frac{1}{m}X\,dZ^T$
$db=\frac{1}{m}\sum dZ$ (in code: np.sum(dZ) / m)
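Putting the vectorized formulas together, one gradient descent iteration over the whole training set becomes a few NumPy lines. This is a sketch assuming the same shapes as before: `X` is (n_x, m), `Y` is (1, m), `w` is (n_x, 1); the function name and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step_vectorized(w, b, X, Y, alpha=0.01):
    # One gradient descent step over all m examples, with no explicit for-loops.
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                                     # shape (1, m)
    A = sigmoid(Z)                                             # shape (1, m)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                                 # shape (1, m)
    dw = np.dot(X, dZ.T) / m                                   # shape (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
```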