Class1-Week2-Neural Networks Basics

zcx_language

Category: Deep Learning

Article link: https://blog.csdn.net/language_zcx/article/details/97557770

Table of Contents

Logistic Regression
Description
Example: Cat vs No-cat
Logistic Function

Logistic Cost Function
Loss(error) Function
Cost Function

Gradient Descent
One Example:
M Training Examples:

Vectorization
Look at the Power of Vectorization:
Vectorizing Logistic Regression
Broadcasting in Python
A Note on Python/Numpy vectors

Explanation of Logistic Regression

Logistic Regression

Description

Logistic regression is a learning algorithm used in supervised learning problems where the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs No-cat

Given an image represented by a feature vector $x$, the algorithm will evaluate the probability of a cat being in that image.

$\widehat{y} = P(y=1|x),\ \text{where}\ \widehat{y} \in [0,1]$

The parameters used in logistic regression are:

The input feature vector: $x \in \mathbb{R}^{n_{x}}$ , where $n_{x}$ is the number of features
The training label: $y \in \{0,1\}$
The weights: $w \in \mathbb{R}^{n_{x}}$ , where $n_{x}$ is the number of features
The threshold (bias): $b \in \mathbb{R}$
The output: $\widehat{y} = \sigma(w^{T}x + b)$
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$ , with $z = w^{T}x + b$

Logistic Function

import numpy as np
import time
import matplotlib.pyplot as plt

# In a Jupyter notebook, uncomment the next line to render plots inline:
# %matplotlib inline

x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))  # sigmoid evaluated element-wise
plt.plot(x, y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
# Dashed gray grid lines
plt.grid(color='gray', linewidth=1, linestyle='--')

plt.show()

[Figure: plot of the sigmoid function $y=\frac{1}{1+e^{-x}}$ produced by the code above]

$(w^{T}x + b)$ is a linear function of the form $ax + b$, but since we are looking for a probability constrained to [0,1], the sigmoid function is applied to it. The sigmoid is bounded between 0 and 1, as shown in the graph above.

Some observations from the graph:

If $z$ is a large positive number, then $\sigma(z) \approx 1$
If $z$ is a large negative number, then $\sigma(z) \approx 0$
If $z = 0$ , then $\sigma(z) = 0.5$
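
The observations above can be checked numerically. The following is a minimal sketch; the helper name `sigmoid` and the sample points are chosen here for illustration and do not come from the original post.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# A large negative value, zero, and a large positive value
z_values = np.array([-10.0, 0.0, 10.0])
s_values = sigmoid(z_values)

for z, s in zip(z_values, s_values):
    print(f"sigma({z}) = {s:.6f}")
```

The first value is close to 0, the second is exactly 0.5, and the third is close to 1, but none of them reaches 0 or 1 exactly.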

Logistic Cost Function

To train the parameters $w$ and $b$, we need to define a cost function.

Recap:

$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)}+b),\ \text{where}\ \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

$\text{Given}\ \{(x^{(1)},y^{(1)}), \cdots,(x^{(m)},y^{(m)})\},\ \text{we want}\ \widehat{y}^{(i)}\approx y^{(i)}$

Here $x^{(i)}$ denotes the i-th training example.

Loss(error) Function

The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.

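One candidate is the squared error, but combined with the sigmoid it makes the optimization problem non-convex:

$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\widehat{y}^{(i)} - y^{(i)})^{2}$

Logistic regression therefore uses the cross-entropy loss instead:

$L(\widehat{y}^{(i)}, y^{(i)}) = -(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1-\widehat{y}^{(i)}))$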

If $y^{(i)} = 1$ : $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$ , so $\widehat{y}^{(i)}$ should be close to 1.
If $y^{(i)} = 0$ : $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$ , so $\widehat{y}^{(i)}$ should be close to 0.
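
The two cases above can be illustrated with a short sketch. The function name `cross_entropy_loss` and the prediction values 0.9 and 0.1 are assumptions made for this example only.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: the loss is small when the prediction is near 1, large when near 0
loss_good = cross_entropy_loss(0.9, 1)
loss_bad = cross_entropy_loss(0.1, 1)

# y = 0: the loss is small when the prediction is near 0
loss_good_0 = cross_entropy_loss(0.1, 0)

print(loss_good, loss_bad, loss_good_0)
```

Note that this sketch does not guard against predictions of exactly 0 or 1, where the logarithm is undefined.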

Cost Function

The cost function is the average of the loss function over the entire training set. We are going to find the parameters $w$ and $b$ that minimize the overall cost function.

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(\widehat{y}^{(i)})+(1-y^{(i)})\log(1-\widehat{y}^{(i)})]$
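
As a rough illustration of the cost as an average of per-example losses, here is a small sketch. The function name `cost`, the labels, and the two sets of predictions are hypothetical values invented for this example.

```python
import numpy as np

def cost(A, Y):
    """Average cross-entropy over m examples; A and Y have shape (1, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

Y = np.array([[1, 0, 1, 0]])                # hypothetical labels
A_good = np.array([[0.9, 0.1, 0.8, 0.2]])   # predictions close to the labels
A_poor = np.array([[0.5, 0.5, 0.5, 0.5]])   # uninformative predictions

J_good = cost(A_good, Y)
J_poor = cost(A_poor, Y)
print(J_good, J_poor)
```

Predicting 0.5 for every example gives a cost of $\log 2$, and predictions closer to the labels give a lower cost.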

Gradient Descent

One Example:

So far we have:

$z = w^{T}x + b$

$\widehat{y} = a = \sigma(z)$

$L(a, y) = -(y\log(a) + (1 - y)\log(1 - a))$

Forward: $x, w, b \rightarrow z = w^{T}x + b \rightarrow a = \sigma(z) \rightarrow L(a, y)$
Backward:

$\frac{\partial L(a,y)}{\partial a} = \frac{\partial}{\partial a}\left[-(y\log(a)+(1-y)\log(1-a))\right] = -\frac{y}{a} + \frac{1-y}{1-a}$

$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = (-\frac{y}{a} + \frac{1-y}{1-a}) \times a(1-a) = a - y$

$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = (-\frac{y}{a} + \frac{1-y}{1-a}) \times a(1-a) \times x_{1} = x_{1}(a-y)$

$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = (-\frac{y}{a} + \frac{1-y}{1-a}) \times a(1-a) \times x_{2} = x_{2}(a-y)$

$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = (-\frac{y}{a} + \frac{1-y}{1-a}) \times a(1-a) \times 1 = (a-y)$
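
One way to gain confidence in these derivatives is to compare them against finite differences. The sketch below does this for a single example; the values of `w`, `b`, `x`, `y` and the step size `eps` are arbitrary assumptions, and the helper names are introduced only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss L(a, y) for one example, with a = sigmoid(w.x + b)."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for a single example with two features
w = np.array([0.3, -0.2])
b = 0.1
x = np.array([1.5, 2.0])
y = 1.0

# Analytic gradients from the derivation: dw_j = x_j (a - y), db = a - y
a = sigmoid(np.dot(w, x) + b)
dw_analytic = x * (a - y)
db_analytic = a - y

# Central finite differences as an independent check
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dw_analytic, dw_numeric)
print(db_analytic, db_numeric)
```

The analytic and numeric values should agree to several decimal places.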

M Training Examples:

Recap:

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(\widehat{y}^{(i)})+(1-y^{(i)})\log(1-\widehat{y}^{(i)})]$

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$

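import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assume we have two features in this prediction
def logisitcRegressionGradientDescent(m, x, y, w, b, alpha):
    """One gradient descent step: forward pass (FP) & backward pass (BP)

    Parameters:
    m (int) -- the number of training examples
    x (matrix: 2 * m) -- the features
    y (vector: 1 * m) -- the labels
    w (vector: 1 * 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate

    Returns the updated weights, the updated bias, and the cost
    computed before the update.
    """
    x = np.asarray(x, dtype=float)
    y = np.ravel(y)
    w1, w2 = np.ravel(w)
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i in range(m):
        # FP
        z = w1 * x[0, i] + w2 * x[1, i] + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))

        # BP
        dz = a - y[i]
        dw1 += x[0, i] * dz
        dw2 += x[1, i] * dz
        db += dz

    # Average the accumulated cost and gradients over the m examples
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m

    # Gradient descent update
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db

    return np.array([w1, w2]), b, J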

Vectorization

Look at the Power of Vectorization:

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")

249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms

Vectorizing Logistic Regression

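def logisitcRegressionGradientDescent(m, X, Y, w, b, alpha):
    """Vectorized FP & BP for one gradient descent step

    Parameters:
    m (int) -- the number of training examples
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of the features
    b (float) -- the bias
    alpha (float) -- the learning rate

    Returns the updated w, the updated b, and the cost computed
    before the update.
    """

    # FP
    Z = np.dot(w, X) + b          # shape (1, m); b is broadcast
    A = 1 / (1 + np.exp(-Z))      # sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

    # BP
    dZ = A - Y
    dw = 1 / m * np.dot(dZ, X.T)  # shape (1, n)
    db = 1 / m * np.sum(dZ)

    # Gradient descent update
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J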
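
A vectorized step like the one above is normally repeated many times. The sketch below shows one possible training loop on synthetic data. The function name `train_step`, the data set, the learning rate 0.1 and the 200 iterations are all assumptions made for illustration, not part of the original post.

```python
import numpy as np

def train_step(X, Y, w, b, alpha):
    """One vectorized gradient descent step; returns updated w, b and the cost."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))            # forward pass
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                               # backward pass
    w = w - alpha * np.dot(dZ, X.T) / m
    b = b - alpha * np.sum(dZ) / m
    return w, b, J

# Synthetic data: the label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))                      # shape (n, m) with n = 2
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)    # shape (1, m)

w = np.zeros((1, 2))
b = 0.0
costs = []
for _ in range(200):
    w, b, J = train_step(X, Y, w, b, alpha=0.1)
    costs.append(J)

# Classify with a 0.5 threshold and measure training accuracy
A = 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))
predictions = (A > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
print(costs[0], costs[-1], accuracy)
```

With zero initial weights every prediction is 0.5, so the first cost equals $\log 2$, and the cost then decreases as the weights are updated.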

Broadcasting in Python

a = np.array([[1,2], [3,4]])
print(a)

b = 1
# The following values of b give the same results:
# b = [1, 1]
# b = [[1], [1]]

# In each case b is broadcast to [[1, 1], [1, 1]], as if we had written:
# b = np.array([[1, 1], [1, 1]])

print(a + b)
print(a - b)

[[1 2]
 [3 4]]
[[2 3]
 [4 5]]
[[0 1]
 [2 3]]
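
Broadcasting also applies when the two operands are arrays of different shapes. The following sketch, with array values chosen arbitrarily for illustration, shows a row vector and a column vector each being stretched to match a 2 x 2 matrix.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])     # shape (2, 2)
row = np.array([[10, 20]])         # shape (1, 2): repeated down the rows
col = np.array([[100], [200]])     # shape (2, 1): repeated across the columns

sum_row = a + row
sum_col = a + col
sum_both = row + col               # both operands are broadcast to (2, 2)

print(sum_row)
print(sum_col)
print(sum_both.shape)
```

This is the same mechanism that lets the scalar bias `b` be added to every entry of `np.dot(w, X)` in the vectorized code.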

A Note on Python/Numpy vectors

a = np.random.randn(5) # Rank 1 array
print(a.shape, a)

print(a.T)
print(a * a.T)

(5,) [ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945  1.17883282 2.96784882]

a = np.random.randn(5, 1)
print(a.shape)
print(a)
print(a.T)
print(a.T.shape)
print(a * a.T)

(5, 1)
[[ 1.80545889]
 [ 2.31719407]
 [ 0.89081914]
 [-1.08760266]
 [ 0.24755189]]
[[ 1.80545889  2.31719407  0.89081914 -1.08760266  0.24755189]]
(1, 5)
[[ 3.25968179  4.18359863  1.60833733 -1.96362189  0.44694476]
 [ 4.18359863  5.36938835  2.06420082 -2.52018643  0.57362578]
 [ 1.60833733  2.06420082  0.79355874 -0.96885726  0.22052396]
 [-1.96362189 -2.52018643 -0.96885726  1.18287954 -0.2692381 ]
 [ 0.44694476  0.57362578  0.22052396 -0.2692381   0.06128194]]
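
The outputs above show that a rank 1 array is unchanged by `.T`, while a `(5, 1)` array behaves like a proper column vector. One common way to avoid the ambiguity (an addition here, not stated in the original post) is to give vectors an explicit two-dimensional shape and to check shapes with `assert`. A minimal sketch with made-up values:

```python
import numpy as np

a = np.arange(5.0)             # rank 1 array, shape (5,)
col = a.reshape(5, 1)          # explicit column vector, shape (5, 1)

inner = np.dot(a, a)           # rank 1 with rank 1: a scalar
outer = np.dot(col, col.T)     # (5, 1) with (1, 5): a 5 x 5 matrix

# Cheap shape checks that document the intended dimensions
assert a.T.shape == (5,)       # transposing a rank 1 array does nothing
assert col.T.shape == (1, 5)

print(inner, outer.shape)
```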

Explanation of Logistic Regression

Recap:

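We know that $\widehat{y} = p(y=1|x)$ , so:

If $y = 1$ , $p(y|x) = \widehat{y}$
If $y = 0$ , $p(y|x) = 1 - \widehat{y}$

Both cases can be combined into a single expression:
$p(y|x) = \widehat{y}^{y}(1-\widehat{y})^{1-y}$

Taking the logarithm, which is monotonically increasing and so preserves the maximizer, gives:
$\log(p(y|x)) = y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})$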

Then, since we want to maximize $p(y|x)$, we minimize $-\log(p(y|x))$:
$\begin{aligned} -\log(p(y|x)) &= -(y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})) \\ &= L(\widehat{y}, y) \end{aligned}$

Note that the loss function above is convex in $w$ and $b$, so gradient descent can find its global optimum.

Cost on m examples (assuming the examples are independent):
$p(Y|X) = \prod_{i=1}^{m}p(y^{(i)}|x^{(i)})$

$\begin{aligned} \log(p(Y|X)) &= \sum_{i=1}^{m}\log(p(y^{(i)}|x^{(i)})) \\ &= -\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m\,J(w, b) \end{aligned}$

To summarize, under the assumption that our training examples are IID (independent and identically distributed), minimizing the cost function $J(w,b)$ amounts to maximum likelihood estimation with the logistic regression model. The constant factor $m$ does not change which $w$ and $b$ are optimal.
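
The link between the likelihood and the cost can be checked numerically. In the sketch below the labels `Y` and predictions `Y_hat` are hypothetical values; the code compares the log-likelihood of the labels with $-m$ times the average cross-entropy cost.

```python
import numpy as np

Y = np.array([1, 0, 1, 1, 0])                  # hypothetical labels
Y_hat = np.array([0.8, 0.3, 0.6, 0.9, 0.2])    # hypothetical predictions

# Probability of each observed label: y_hat^y * (1 - y_hat)^(1 - y)
p = Y_hat ** Y * (1 - Y_hat) ** (1 - Y)
log_likelihood = np.sum(np.log(p))

# Average cross-entropy cost J over the m examples
m = Y.size
J = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

print(log_likelihood, -m * J)
```

The two printed numbers agree up to floating-point rounding, so maximizing the log-likelihood and minimizing $J$ select the same parameters.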
