Class1-Week2-Neural Networks Basics

Logistic Regression

Description

Logistic regression is a learning algorithm used in supervised learning problems where the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs No-cat

Given an image represented by a feature vector $x$, the algorithm will evaluate the probability of a cat being in that image.

$$\text{Given } x,\ \widehat{y} = P(y=1 \mid x),\ \text{where } \widehat{y} \in [0,1]$$

The parameters used in Logistic regression are:

  • The input features vector: $x \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
  • The training label: $y \in \{0,1\}$
  • The weight: $w \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
  • The threshold: $b \in \mathbb{R}$
  • The output: $\widehat{y} = \sigma(w^{T}x + b)$
  • Sigmoid function: $s = \sigma(w^{T}x + b)$, $\sigma(z) = \frac{1}{1 + e^{-z}}$
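As a minimal sketch of the definitions above, the prediction for a single example could be computed as follows. The feature values, weights, and threshold are made-up numbers for illustration, and `sigmoid` and `predict` are hypothetical helper names that do not come from the original notes.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^{-z})."""
    return 1 / (1 + np.exp(-z))

def predict(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for a single feature vector x."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only (n_x = 3); not taken from any real dataset
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05

y_hat = predict(w, b, x)
print(y_hat)  # a probability strictly between 0 and 1
```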

Logistic Function

import numpy as np 
import time
import matplotlib.pyplot as plt 

%matplotlib inline
x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))
plt.plot(x,y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
plt.grid(color='gray', linewidth=1, linestyle='--')

plt.show()

[Image: plot of the sigmoid function]

$(w^{T}x + b)$ is a linear function ($ax + b$), but since we are looking for a probability constrained to $[0,1]$, the sigmoid function is used. The function is bounded between 0 and 1, as shown in the graph above.

Some observations from the graph:

  • If $z$ is a large positive number, then $\sigma(z) \approx 1$
  • If $z$ is a large negative number, then $\sigma(z) \approx 0$
  • If $z = 0$, then $\sigma(z) = 0.5$
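These observations can be checked numerically. The snippet below is only a quick sanity check; the sample values of `z` are arbitrary choices, and `sigmoid` is a helper defined here for the check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary sample points: large negative, zero, large positive
for z in [-20.0, 0.0, 20.0]:
    print(z, sigmoid(z))
```

The output approaches 0 and 1 at the extremes but never reaches them exactly, which is why the sigmoid output can be read as a probability.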

Logistic Cost Function

To train the parameters $w$ and $b$, we need to define a cost function.

Recap:

$$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)}+b),\ \text{where } \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$$

$$\text{Given } \{(x^{(1)},y^{(1)}), \cdots, (x^{(m)},y^{(m)})\},\ \text{we want } \widehat{y}^{(i)} \approx y^{(i)}$$

  • $x^{(i)}$ is the i-th training example

Loss (error) Function

The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.

Squared error is one possible choice, but it is not used for logistic regression because it makes the optimization problem non-convex:

$$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\widehat{y}^{(i)} - y^{(i)})^{2}$$

Instead, logistic regression uses the cross-entropy loss:

$$L(\widehat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1-\widehat{y}^{(i)})\right)$$

  • If $y^{(i)} = 1$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 1.
  • If $y^{(i)} = 0$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 0.
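To make the two cases concrete, here is a small sketch that evaluates the cross-entropy loss for a few predictions. The function name `cross_entropy_loss` and the example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss for one example: -(y log(y_hat) + (1 - y) log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_loss(0.9, 1))  # label 1, prediction close to 1: small loss
print(cross_entropy_loss(0.1, 1))  # label 1, prediction close to 0: large loss
print(cross_entropy_loss(0.1, 0))  # label 0, prediction close to 0: small loss
```

The loss grows without bound as the prediction moves toward the wrong extreme, so confident mistakes are penalized heavily.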

Cost Function

The cost function is the average of the loss function of the entire training set. We are going to find the parameters w and b that minimize the overall cost function.

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)})+(1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$$
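The cost is simply the mean of the per-example losses. A minimal sketch, assuming the predictions are already available as an array (the labels and predictions below are made-up values, and `cost` is a hypothetical helper name):

```python
import numpy as np

def cost(Y_hat, Y):
    """Average cross-entropy loss over m examples."""
    m = Y.shape[0]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m

# Illustrative labels and predicted probabilities for m = 4 examples
Y = np.array([1, 0, 1, 0])
Y_hat = np.array([0.9, 0.2, 0.6, 0.4])

J = cost(Y_hat, Y)
print(J)
```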


Gradient Descent

One Example:

We have:

$$z = w^{T}x + b$$

$$\widehat{y} = a = \sigma(z)$$

$$L(a,y) = -\left(y\log(a) + (1 - y)\log(1 - a)\right)$$

Forward:
[Image: forward computation graph]
Backward:

$$\frac{\partial L(a,y)}{\partial a} = \frac{\partial}{\partial a}\left[-\left(y\log(a)+(1-y)\log(1-a)\right)\right] = -\frac{y}{a} + \frac{1-y}{1-a}$$

$$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y$$

$$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{1} = x_{1}(a-y)$$

$$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{2} = x_{2}(a-y)$$

$$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times 1 = (a-y)$$
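One way to gain confidence in these derivatives is to compare them with central finite differences. The sketch below does this for a single example with two features; the numeric values, the step size `eps`, and the helper names are all assumptions made for the check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss L(a, y) for one example, with a = sigmoid(w^T x + b)."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Illustrative example with two features
x = np.array([1.5, -0.5])
y = 1
w = np.array([0.3, -0.2])
b = 0.1

# Analytic gradients from the derivation: dw_j = x_j (a - y), db = a - y
a = sigmoid(np.dot(w, x) + b)
dw_analytic = x * (a - y)
db_analytic = a - y

# Numerical gradients via central differences
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dw_analytic, dw_numeric)
print(db_analytic, db_numeric)
```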

M Training Examples:

Recap:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)})+(1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$$

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$$

# Assume we have two features in this prediction
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logisitcRegressionGradientDescent(m, x, y, w, b, alpha):
    """One step of gradient descent: forward pass (FP) and backward pass (BP)
    
    Parameters:
    m (int) -- the number of training examples
    x (matrix: 2 * m) -- the features
    y (vector of length m) -- the labels
    w (vector of length 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate
    
    Returns the updated w and b, and the cost J before the update.
    """     
    J = 0; dw1 = 0; dw2 = 0; db = 0
    w1, w2 = w
    for i in range(m):
        # FP
        z = w1 * x[0, i] + w2 * x[1, i] + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
        
        # BP
        dz = a - y[i]
        dw1 += x[0, i] * dz
        dw2 += x[1, i] * dz
        db += dz
    
    # Average over the m examples
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m
    
    # Gradient descent update
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db
    
    return np.array([w1, w2]), b, J

Vectorization

Look at the Power of Vectorization:

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")
249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms

Vectorizing Logistic Regression

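def logisitcRegressionGradientDescent(m, X, Y, w, b, alpha):
    """One vectorized step of gradient descent: FP & BP
    
    Parameters:
    m (int) -- the number of training examples
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of different features
    b (float) -- the bias
    alpha (float) -- the learning rate
    
    Returns the updated w and b, and the cost J before the update.
    """    
    
    # FP
    Z = np.dot(w, X) + b            # shape (1, m)
    A = 1 / (1 + np.exp(-Z))        # sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    
    # BP
    dZ = A - Y
    dw = 1 / m * np.dot(dZ, X.T)    # shape (1, n)
    db = 1 / m * np.sum(dZ)
    
    # Gradient descent update
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J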
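A single gradient step is normally repeated many times. The following sketch shows one way such a loop might look on synthetic data. Everything in it is an assumption for illustration: the helper `gradient_step` (a self-contained vectorized step), the randomly generated noisy dataset, the learning rate of 0.5, and the 500 iterations.

```python
import numpy as np

def gradient_step(X, Y, w, b, alpha):
    """One vectorized gradient descent step; returns updated w, b and the cost."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w, X) + b)))                  # forward pass
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m   # cost
    dZ = A - Y                                                  # backward pass
    w = w - alpha * np.dot(dZ, X.T) / m
    b = b - alpha * np.sum(dZ) / m
    return w, b, J

# Synthetic data: 2 features, 200 examples, labels from a noisy linear rule
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
noise = rng.normal(size=(1, 200))
Y = (X[0:1, :] + X[1:2, :] + noise > 0).astype(float)          # shape (1, 200)

w = np.zeros((1, 2))
b = 0.0
costs = []
for _ in range(500):
    w, b, J = gradient_step(X, Y, w, b, alpha=0.5)
    costs.append(J)

print(costs[0], costs[-1])  # the cost should go down over the iterations
```

With all weights initialized to zero, every prediction starts at 0.5, so the first recorded cost equals log(2).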

Broadcasting in Python

a = np.array([[1,2], [3,4]])
print(a)

b = 1
# b = [1,1]
# b = [[1], [1]]

# broadcast b to [[1,1], [1,1]]
# b = np.array([[1,1], [1,1]])

print(a + b)
print(a - b)
[[1 2]
 [3 4]]
[[2 3]
 [4 5]]
[[0 1]
 [2 3]]
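Broadcasting also applies when the smaller operand is a row or a column rather than a scalar. A short sketch with made-up arrays:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

sum_row = A + row   # row is stretched down the 2 rows
sum_col = A + col   # col is stretched across the 3 columns

print(sum_row)
print(sum_col)
```

This is the same mechanism that lets the scalar bias `b` be added to every entry of `np.dot(w, X)` in the vectorized implementation.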

A Note on Python/Numpy vectors

a = np.random.randn(5) # Rank 1 array
print(a.shape, a)

print(a.T)
print(a * a.T)
(5,) [ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945  1.17883282 2.96784882]
a = np.random.randn(5, 1)
print(a.shape)
print(a)
print(a.T)
print(a.T.shape)
print(a * a.T)
(5, 1)
[[ 1.80545889]
 [ 2.31719407]
 [ 0.89081914]
 [-1.08760266]
 [ 0.24755189]]
[[ 1.80545889  2.31719407  0.89081914 -1.08760266  0.24755189]]
(1, 5)
[[ 3.25968179  4.18359863  1.60833733 -1.96362189  0.44694476]
 [ 4.18359863  5.36938835  2.06420082 -2.52018643  0.57362578]
 [ 1.60833733  2.06420082  0.79355874 -0.96885726  0.22052396]
 [-1.96362189 -2.52018643 -0.96885726  1.18287954 -0.2692381 ]
 [ 0.44694476  0.57362578  0.22052396 -0.2692381   0.06128194]]
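The outputs above show that a rank 1 array of shape `(5,)` is unchanged by `.T`, so `a * a.T` is just an element-wise square, while a `(5, 1)` column vector broadcasts against its `(1, 5)` transpose to give a 5 x 5 matrix. A common way to avoid surprises is to give vectors an explicit 2-D shape and check it; the sketch below is one possible pattern, not the only one.

```python
import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,)
a = a.reshape(5, 1)           # explicit column vector, shape (5, 1)
assert a.shape == (5, 1)      # cheap check that documents the intended shape

outer = np.dot(a, a.T)        # (5, 1) x (1, 5) -> (5, 5) outer product
inner = np.dot(a.T, a)        # (1, 5) x (5, 1) -> (1, 1) inner product
print(outer.shape, inner.shape)
```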

Explanation of Logistic Regression

Recap:

We know that $\widehat{y} = p(y=1|x)$, so:

  • If $y = 1$, $p(y|x) = \widehat{y}$
  • If $y = 0$, $p(y|x) = 1 - \widehat{y}$

Both cases can be combined into a single expression:
$$p(y|x) = \widehat{y}^{y}(1-\widehat{y})^{1-y}$$

$$\log(p(y|x)) = y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})$$

We want to maximize $p(y|x)$, which is equivalent to maximizing $\log(p(y|x))$, so we minimize $-\log(p(y|x))$:
$$\begin{aligned} -\log(p(y|x)) &= -\left(y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})\right) \\ &= L(\widehat{y}, y) \end{aligned}$$
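The combined expression can be verified by plugging in both label values. A small check, where the probability 0.8 and the helper names are arbitrary choices for illustration:

```python
import numpy as np

def likelihood(y_hat, y):
    """p(y|x) = y_hat^y * (1 - y_hat)^(1 - y)."""
    return y_hat ** y * (1 - y_hat) ** (1 - y)

def loss(y_hat, y):
    """Cross-entropy loss L(y_hat, y)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.8
print(likelihood(y_hat, 1))   # equals y_hat
print(likelihood(y_hat, 0))   # equals 1 - y_hat
print(-np.log(likelihood(y_hat, 1)), loss(y_hat, 1))  # the two values agree
```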

Remember that the loss function above is convex in $w$ and $b$, so gradient descent can find its global optimum.

Cost on m examples:
$$p(Y|X) = \prod_{i=1}^{m}p(y^{(i)}|x^{(i)})$$

$$\begin{aligned} \log(p(Y|X)) &= \sum_{i=1}^{m}\log(p(y^{(i)}|x^{(i)})) \\ &= -\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m\,J(w, b) \end{aligned}$$

The factor $m$ is a positive constant, so it does not change which $w$ and $b$ are optimal.

To summarize, by minimizing the cost function $J(w,b)$ we are really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples are IID (independent and identically distributed).
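The link between the likelihood and the cost can also be checked numerically. Because $J$ averages the per-example losses, the log-likelihood of the whole training set equals $-m \cdot J$; the positive factor $m$ does not affect which parameters are optimal. The labels and predictions below are made-up values.

```python
import numpy as np

# Illustrative labels and predicted probabilities for m = 4 examples
Y = np.array([1, 0, 1, 1])
Y_hat = np.array([0.9, 0.3, 0.7, 0.6])
m = Y.shape[0]

# Likelihood of the whole set under the IID assumption: product over examples
likelihood = np.prod(Y_hat ** Y * (1 - Y_hat) ** (1 - Y))

# Cost function: average cross-entropy loss
J = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

print(np.log(likelihood), -m * J)  # the two values agree
```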
