Neural Networks and Deep Learning -- Week 1

Binary Classification

Output: yes or no (0 or 1)

$(x,y)$: $x\in\mathbb R^{n_x}$, $y\in\{0,1\}$
$m$ training examples: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$

In matrix form:
$X\in\mathbb R^{n_x\times m}$, $Y\in\mathbb R^{1\times m}$
$$X=\begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n_x}^{(1)} & x_{n_x}^{(2)} & \dots & x_{n_x}^{(m)} \end{bmatrix}$$
$$Y=\begin{bmatrix} y^{(1)} & y^{(2)} & \dots & y^{(m)} \end{bmatrix}$$
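A minimal sketch of this stacking convention, using made-up toy data (the values are illustrative, not from the course): each example $x^{(i)}$ becomes a column of $X$, and the labels form a $1\times m$ row vector $Y$.

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, each with n_x = 2 features.
examples = [(np.array([1.0, 2.0]), 1),
            (np.array([3.0, 4.0]), 0),
            (np.array([5.0, 6.0]), 1)]

# Stack each x^(i) as a COLUMN of X, giving X shape (n_x, m).
X = np.stack([x for x, _ in examples], axis=1)
# Y is a (1, m) row vector of labels.
Y = np.array([[y for _, y in examples]])

print(X.shape)  # (2, 3)
print(Y.shape)  # (1, 3)
```

Stacking examples as columns (rather than rows) is what makes the later vectorized formulas like $Z=w^TX+b$ come out with shape $(1, m)$.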

Logistic Regression

A learning algorithm used when the output labels $y$ in a supervised learning problem are all either 0 or 1, i.e. for binary classification problems.

Given $x$, we want $\hat{y}=P(y=1\mid x)$ (the probability that $y$ equals 1 given the input features $x$).

In linear regression:
Parameters: $w\in\mathbb R^{n_x}$, $b\in\mathbb R$
Output: $\hat{y}=w^Tx+b$

But in logistic regression we need $0\leq \hat{y}\leq 1$:
Output: $\hat{y}=\sigma(w^Tx+b)$

Sigmoid function

$\sigma(z)=\dfrac{1}{1+e^{-z}}$
(figure: the S-shaped curve of the sigmoid function)
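A minimal NumPy implementation of the sigmoid (the function name is mine; NumPy itself has no built-in `sigmoid`):

```python
import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1); works elementwise on arrays too.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
# Large positive z saturates toward 1, large negative z toward 0.
print(sigmoid(np.array([-10.0, 10.0])))
```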

Loss function (Error function)

To measure how good the output $\hat{y}$ is when the true label is $y$.

In linear regression: $\mathcal L(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$
But in logistic regression: $\mathcal L(\hat{y},y)=-\big(y\log \hat{y}+(1-y)\log(1-\hat{y})\big)$

If $y=1$, then $\mathcal L(\hat{y},y)=-\log \hat{y}$; as $\hat{y}\rightarrow 1$, $\mathcal L(\hat{y},y)\rightarrow 0$.
If $y=0$, then $\mathcal L(\hat{y},y)=-\log(1-\hat{y})$; as $\hat{y}\rightarrow 0$, $\mathcal L(\hat{y},y)\rightarrow 0$.
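These two limiting cases can be checked numerically; the helper name `logistic_loss` is mine, not from the notes:

```python
import numpy as np

def logistic_loss(y_hat, y):
    # Cross-entropy loss for a single example.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: loss shrinks as y_hat -> 1, blows up as y_hat -> 0.
print(logistic_loss(0.99, 1))  # small (about 0.01)
print(logistic_loss(0.01, 1))  # large (about 4.6)
```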

(figure: loss curves for $y=1$ and $y=0$)

Cost function

The loss function measures how well you’re doing on a single training example.
The cost function measures how well you’re doing on an entire training set.
$$J(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal L(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$
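A vectorized sketch of $J(w,b)$, assuming $X$ is $n_x\times m$ and $Y$ is $1\times m$ as defined above (the function name and toy values are mine):

```python
import numpy as np

def cost(w, b, X, Y):
    # Average cross-entropy loss over all m columns of X.
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))  # y_hat for every example
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])     # n_x = 2, m = 2
Y = np.array([[1.0, 0.0]])
# With w = 0, b = 0 every prediction is 0.5, so J = log 2.
print(cost(np.zeros((2, 1)), 0.0, X, Y))
```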

Gradient Descent

Computation Graph

(figure: computation graph example with nodes $u=bc$, $V=a+u$, $J=3V$)
Chain rule (backward calculation):
$\frac{dJ}{dV}=3$
$\frac{dV}{da}=1$
$\frac{dJ}{da}=\frac{dJ}{dV}\frac{dV}{da}=3$
$\frac{dJ}{du}=\frac{dJ}{dV}\frac{dV}{du}=3$
$\frac{dJ}{db}=\frac{dJ}{du}\frac{du}{db}=3c$
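Assuming the standard worked example $J=3(a+bc)$ with $u=bc$ and $V=a+u$ (the concrete values $a=5$, $b=3$, $c=2$ are illustrative), the forward and backward passes look like this:

```python
# Hypothetical values for the example J = 3(a + b*c).
a, b, c = 5.0, 3.0, 2.0

# Forward pass (left to right)
u = b * c      # 6
V = a + u      # 11
J = 3 * V      # 33

# Backward pass (right to left), applying the chain rule.
# Following the naming convention: dvar means dJ/dvar.
dV = 3.0       # dJ/dV
da = dV * 1.0  # dJ/da = dJ/dV * dV/da = 3
du = dV * 1.0  # dJ/du = dJ/dV * dV/du = 3
db = du * c    # dJ/db = dJ/du * du/db = 3c = 6
dc = du * b    # dJ/dc = dJ/du * du/dc = 3b = 9

print(J, da, db, dc)  # 33.0 3.0 6.0 9.0
```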

Variable naming convention in code:
`dvar` stands for $\frac{d\,\text{FinalOutputVar}}{d\,\text{var}}$, the derivative of the final output variable with respect to `var`. For example:
`dV` : $\frac{dJ}{dV}$
`da` : $\frac{dJ}{da}$

Computation Graph of Logistic Regression
(figure: computation graph of logistic regression: $z=w^Tx+b \rightarrow a=\sigma(z) \rightarrow \mathcal L(a,y)$)

  • Forward (left-to-right) calculation to compute the cost function
  • Backward (right-to-left) calculation to compute derivatives

Gradient Descent on m m m Examples

Initialize:

$J=0,\ dw_1=0,\ \dots,\ dw_n=0,\ db=0$

Training set:

For $i = 1$ to $m$:
  $z^{(i)} = w^Tx^{(i)}+b$
  $a^{(i)} = \sigma(z^{(i)})$
  $J \mathrel{+}= -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})]$
  $dz^{(i)} = a^{(i)} - y^{(i)}$
  $db \mathrel{+}= dz^{(i)}$
  For $j = 1$ to $n$:
    $dw_j \mathrel{+}= x_j^{(i)}dz^{(i)}$

$J = \frac{1}{m}J$
$db = \frac{1}{m}db$
For $j = 1$ to $n$:
  $dw_j = \frac{1}{m}dw_j$

$J, dw_1, \dots, dw_n, db$ are now averages over the entire training set.

Gradient descent update:
For $j = 1$ to $n$:
  $w_j = w_j-\alpha\, dw_j$
$b = b-\alpha\, db$
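The loop-based algorithm above can be sketched as a single function (the function name and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One pass of the explicit-loop algorithm.
    X is (n, m); w is (n,); Y is (m,)."""
    n, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros(n)
    for i in range(m):                       # loop over examples
        z = np.dot(w, X[:, i]) + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        db += dz
        for j in range(n):                   # loop over features
            dw[j] += X[j, i] * dz
    J, dw, db = J / m, dw / m, db / m        # averages over the set
    return w - alpha * dw, b - alpha * db, J
```

Repeated calls drive the cost $J$ down; this is the version the next section's vectorization removes the explicit for-loops from.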

Vectorization

```python
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Vectorized version: " + str(1000 * (toc - tic)) + " ms")

# Same computation with an explicit for loop
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print("For loop: " + str(1000 * (toc - tic)) + " ms")
```

On typical hardware, the explicit for loop runs roughly 300 times slower than the vectorized version.

Vectorization techniques let you get rid of these explicit for-loops in your code. Applied to the whole training set:

$Z=w^TX+b$ → `np.dot(w.T, X) + b`
$A=\sigma(Z)$
$dZ=A-Y$
$dw=\frac{1}{m}X\,dZ^T$
$db=\frac{1}{m}$ `np.sum(dZ)`
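Putting the vectorized formulas together into one training loop (a sketch, not the course's reference implementation; the function name and toy data are mine):

```python
import numpy as np

def train(X, Y, alpha=0.5, iters=200):
    """Vectorized logistic regression: no loop over examples or features.
    X is (n_x, m); Y is (1, m)."""
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    for _ in range(iters):
        Z = w.T @ X + b                   # Z = w^T X + b, shape (1, m)
        A = 1.0 / (1.0 + np.exp(-Z))      # A = sigma(Z)
        dZ = A - Y                        # dZ = A - Y
        dw = (X @ dZ.T) / m               # dw = (1/m) X dZ^T
        db = float(np.sum(dZ)) / m        # db = (1/m) sum(dZ)
        w -= alpha * dw                   # gradient descent update
        b -= alpha * db
    return w, b

# Toy 1-D data: negative x -> label 0, positive x -> label 1.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])
w, b = train(X, Y)
```

The only remaining loop is over gradient-descent iterations, which cannot be vectorized away.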
