Deep Learning: MNIST Handwritten Digit Recognition with the Cross-Entropy Cost Function

Notes from Peng Liang's course 《深度学习进阶:算法与应用》 (Advanced Deep Learning: Algorithms and Applications)


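The Old Cost Function

The cost function used so far was the quadratic cost, and the biases and weights were initialized randomly from a normal distribution.
Ideally we want the neural network to learn faster, that is, to reach the learning target sooner.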

Consider a simple model (with the weight and bias already defined): one input, one neuron, one output.
The task for this model: when the input is 1, the output should be 0.

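Hypothesis 1: initially w = 0.6 and b = 0.9, so the initial predicted output is a = 0.82. The neuron has to learn, because the target output is 0.
Learning rate: 0.15

Ij = 1*0.6 + 0.9 = 1.5
Oj = 1/(1+e^-Ij) = 0.82

[Figure: learning curve for this starting point]

After 300 epochs of learning, output = 0.09, which is already close to the target 0.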

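Hypothesis 2: initially w = 2.0 and b = 2.0, so the initial predicted output is 0.98, very far from the desired output 0.
[Figure: learning curve for this starting point]

In this run, learning is extremely slow for roughly the first 100-plus epochs, and the final output is still far from the target 0.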
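The two runs above can be approximated with a minimal sketch. It assumes one gradient-descent step per epoch on the single training example x = 1, y = 0, using the quadratic cost C = (y - a)^2 / 2. The function name train_quadratic is made up for this illustration and is not part of the course code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_quadratic(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on C = (y - a)^2 / 2 for a single sigmoid neuron.

    Returns the neuron's output before each epoch, plus the final output.
    (Assumption: one update per epoch on the single example (x, y).)
    """
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        # dC/dz = (a - y) * sigmoid'(z), where sigmoid'(z) = a * (1 - a)
        delta = (a - y) * a * (1.0 - a)
        w -= eta * delta * x
        b -= eta * delta
    outputs.append(sigmoid(w * x + b))
    return outputs

run1 = train_quadratic(0.6, 0.9)  # hypothesis 1
run2 = train_quadratic(2.0, 2.0)  # hypothesis 2
for name, run in (("hypothesis 1", run1), ("hypothesis 2", run2)):
    print("%s: start %.2f, after 100 epochs %.2f, after 300 epochs %.2f"
          % (name, run[0], run[100], run[-1]))
```

Comparing the outputs after 100 epochs shows the slow start of hypothesis 2: its output has barely moved while hypothesis 1 is already well below 0.5.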

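A neural network's learning behavior here is very different from the human brain's: it learns slowly at first, exactly when its error is largest, and only gradually speeds up later.
Why?

Slow learning => the partial derivatives ∂C/∂w and ∂C/∂b are small.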

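Computing the partial derivatives:
Recall the cost function used before (the quadratic cost):
C = (y - a)^2 / 2

For one input x, one target y, and a single neuron:
a = σ(z)
z = w*x + b
Taking the partial derivatives with respect to w and b (and substituting the values assumed above, x = 1, y = 0):
∂C/∂w = (a - y) σ'(z) x = a σ'(z)
∂C/∂b = (a - y) σ'(z) = a σ'(z)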

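Recall the shape of the sigmoid function: an S-shaped curve that flattens out as its output approaches 0 or 1.

When the neuron's output is close to 1, the curve is very flat =>
σ'(z) is very small, so both partial derivatives are very small.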
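To make this concrete, here is a small sketch that evaluates σ'(z) and the quadratic-cost gradient a σ'(z) at the two starting points used above (x = 1, y = 0). The helper name quadratic_gradient is only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def quadratic_gradient(w, b, x=1.0, y=0.0):
    """dC/dw for C = (y - a)^2 / 2; equals dC/db when x = 1."""
    z = w * x + b
    return (sigmoid(z) - y) * sigmoid_prime(z) * x

g1 = quadratic_gradient(0.6, 0.9)  # z = 1.5, output about 0.82
g2 = quadratic_gradient(2.0, 2.0)  # z = 4.0, output about 0.98
print("sigmoid'(1.5) = %.4f, gradient = %.4f" % (sigmoid_prime(1.5), g1))
print("sigmoid'(4.0) = %.4f, gradient = %.4f" % (sigmoid_prime(4.0), g2))
```

Although the error is larger in the second case, its gradient is several times smaller, because σ'(4.0) is so close to zero.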

How can we speed up learning?
Instead of the cost function used so far, we define a new cost function that does not slow learning down when the neuron is badly wrong.


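The Cross-Entropy Cost Function

Consider a slightly more complex network: a single neuron with several inputs x_j, weights w_j, and a bias b.
Define the cross-entropy function:
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where a = σ(z), z = Σ_j w_j x_j + b, and n is the number of training inputs.
Why can it be used as a cost function?
1. Its value is always greater than or equal to 0.
2. When a = y (with y equal to 0 or 1), cost = 0.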
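The two properties can be checked numerically with a short sketch for a single output a in (0, 1) and a target y of 0 or 1. The function name cross_entropy is an illustrative choice.

```python
import math

def cross_entropy(a, y):
    """Cross-entropy cost for one output a in (0, 1) and one target y."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

# Target y = 0: the cost shrinks toward 0 as the output a approaches 0.
for a in (0.98, 0.82, 0.5, 0.09, 0.001):
    print("y = 0, a = %.3f -> cost %.4f" % (a, cross_entropy(a, 0.0)))

# Property 1: the cost is non-negative on a grid of outputs, for both targets.
grid = [i / 100.0 for i in range(1, 100)]
non_negative = all(cross_entropy(a, y) >= 0.0 for a in grid for y in (0.0, 1.0))
print("non-negative everywhere on the grid:", non_negative)
```

The cost also grows without bound as the output approaches the wrong extreme, which is what gives a badly wrong neuron a large gradient.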

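Partial derivative of the new cost function with respect to a weight:
∂C/∂w_j = -(1/n) Σ_x [ y/σ(z) - (1 - y)/(1 - σ(z)) ] σ'(z) x_j
        = (1/n) Σ_x [ σ'(z) x_j / (σ(z)(1 - σ(z))) ] (σ(z) - y)

Using the definition of the sigmoid function
σ(z) = 1/(1+e^-z)
we obtain
σ'(z) = σ(z)(1 - σ(z))

Substituting this into the partial derivative above gives:
∂C/∂w_j = (1/n) Σ_x x_j (σ(z) - y)
The speed of learning depends on:
σ(z) - y
that is, the error in the output.
Benefit: when the error is large, the update is large and learning is fast; when the error is small, learning is slow.

Similarly, the partial derivative with respect to the bias:
∂C/∂b = (1/n) Σ_x (σ(z) - y)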
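As a sanity check on the derivation, the closed-form gradients x(σ(z) - y) and σ(z) - y can be compared against central finite differences of the cost. This is a sketch for one training example and one input; the function names and the step size h are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x, y):
    """Cross-entropy cost of a single sigmoid neuron on one example."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def analytic_gradients(w, b, x, y):
    """(dC/dw, dC/db) from the derivation: x * (a - y) and (a - y)."""
    a = sigmoid(w * x + b)
    return x * (a - y), a - y

def numeric_gradients(w, b, x, y, h=1e-6):
    """Central finite-difference estimates of (dC/dw, dC/db)."""
    dw = (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2.0 * h)
    db = (cost(w, b + h, x, y) - cost(w, b - h, x, y)) / (2.0 * h)
    return dw, db

for w, b in ((0.6, 0.9), (2.0, 2.0)):
    print("w = %.1f, b = %.1f" % (w, b))
    print("  analytic:", analytic_gradients(w, b, 1.0, 0.0))
    print("  numeric: ", numeric_gradients(w, b, 1.0, 0.0))
```

No σ'(z) factor appears in the analytic form, so the gradient at w = b = 2.0 is larger than at w = 0.6, b = 0.9, the opposite of the quadratic-cost case.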

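Demonstration with cross-entropy
**Hypothesis 1:** w = 0.6, b = 0.9
[Figure: learning curve for this starting point]

**Hypothesis 2:** w = 2.0, b = 2.0
[Figure: learning curve for this starting point]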
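A rough sketch of this demonstration follows. The learning rate for the cross-entropy runs is not stated in these notes, so the code assumes the same value as before (0.15), again with one update per epoch on the single example x = 1, y = 0. The function name train_cross_entropy is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_cross_entropy(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on the cross-entropy cost for a single sigmoid neuron.

    Returns the neuron's output before each epoch, plus the final output.
    (Assumptions: eta = 0.15 and one update per epoch on one example.)
    """
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        delta = a - y  # dC/dz for cross-entropy: no sigmoid'(z) factor
        w -= eta * delta * x
        b -= eta * delta
    outputs.append(sigmoid(w * x + b))
    return outputs

ce1 = train_cross_entropy(0.6, 0.9)  # hypothesis 1
ce2 = train_cross_entropy(2.0, 2.0)  # hypothesis 2
for name, run in (("hypothesis 1", ce1), ("hypothesis 2", ce2)):
    print("%s: start %.2f, after 20 epochs %.2f, after 300 epochs %.3f"
          % (name, run[0], run[20], run[-1]))
```

Under these assumptions the badly wrong starting point w = b = 2.0 no longer stalls: its output drops below 0.5 within the first 20 epochs.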

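The above is the cost for a single neuron. For a multi-layer network with several output neurons:
C = -(1/n) Σ_x Σ_j [ y_j ln a_j^L + (1 - y_j) ln(1 - a_j^L) ]
This sums the cost over all neurons j in the output layer (layer L).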
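For one training example, the sum over output neurons can be sketched as below. The activations and targets are made-up values for a three-neuron output layer, and the function name is illustrative.

```python
import math

def cross_entropy_output_layer(activations, targets):
    """Sum the per-neuron cross-entropy over one example's output layer."""
    return sum(
        -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))
        for a, y in zip(activations, targets)
    )

# Hypothetical output activations and a one-hot target for class 1.
a_L = [0.1, 0.8, 0.1]
y = [0.0, 1.0, 0.0]
example_cost = cross_entropy_output_layer(a_L, y)
print("cost for this example: %.4f" % example_cost)
```

Averaging this quantity over all n training examples gives the full cost C.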


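Summary:

  • The cross-entropy cost is almost always a better choice than the quadratic cost when the output neurons are sigmoid neurons.
  • If the (output) neurons are linear, use the quadratic cost (it does not suffer from the learning slowdown).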
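The second point can be illustrated with a small sketch. For a linear neuron a = w*x + b with the quadratic cost, the gradient is (a - y)x with no σ'(z) factor, so it grows with the error. The function names are illustrative.

```python
def quadratic_cost_linear(w, b, x, y):
    """Quadratic cost C = (y - a)^2 / 2 for a linear neuron a = w*x + b."""
    a = w * x + b
    return 0.5 * (a - y) ** 2

def linear_gradients(w, b, x, y):
    """(dC/dw, dC/db) for the linear neuron: (a - y) * x and (a - y)."""
    a = w * x + b
    return (a - y) * x, a - y

def numeric_dw(w, b, x, y, h=1e-6):
    """Central finite-difference estimate of dC/dw, used as a check."""
    return (quadratic_cost_linear(w + h, b, x, y)
            - quadratic_cost_linear(w - h, b, x, y)) / (2.0 * h)

small_error = linear_gradients(0.6, 0.9, 1.0, 0.0)  # a = 1.5
large_error = linear_gradients(2.0, 2.0, 1.0, 0.0)  # a = 4.0
print("gradients with a = 1.5:", small_error)
print("gradients with a = 4.0:", large_error)
print("finite-difference check:", numeric_dw(2.0, 2.0, 1.0, 0.0))
```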

Code

cross-entropy.py

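import mnist_loader
import network2

# Load MNIST as (training, validation, test) data
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

# 784 inputs, 30 hidden neurons, 10 outputs, trained with the cross-entropy cost
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
# 30 epochs, mini-batch size 10, learning rate 0.5
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
        monitor_evaluation_accuracy=True)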

network2.py

#coding=utf-8
# @Author: yangenneng
# @Time: 2018-01-25 16:12
# @Abstract:

"""
network2.py
~~~~~~~~~~~~~~
An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.
"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.
        """
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.  Note that np.nan_to_num is used to ensure numerical
        stability.  In particular, if both ``a`` and ``y`` have a 1.0
        in the same slot, then the expression (1-y)*np.log(1-a)
        returns nan.  The np.nan_to_num ensures that that is converted
        to the correct value (0.0).
        """
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer.  Note that the
        parameter ``z`` is not used by the method.  It is included in
        the method's parameters in order to make the interface
        consistent with the delta method for other cost classes.
        """
        return (a-y)


#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        """The list ``sizes`` contains the number of neurons in the respective
        layers of the network.  For example, if the list was [2, 3, 1]
        then it would be a three-layer network, with the first layer
        containing 2 neurons, the second layer 3 neurons, and the
        third layer 1 neuron.  The biases and weights for the network
        are initialized randomly, using
        ``self.default_weight_initializer`` (see docstring for that
        method).
        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost=cost

    def default_weight_initializer(self):
        """Initialize each weight using a Gaussian distribution with mean 0
        and standard deviation 1 over the square root of the number of
        weights connecting to the same neuron.  Initialize the biases
        using a Gaussian distribution with mean 0 and standard
        deviation 1.
        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.
        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Initialize the weights using a Gaussian distribution with mean 0
        and standard deviation 1.  Initialize the biases using a
        Gaussian distribution with mean 0 and standard deviation 1.
        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.
        This weight and bias initializer uses the same approach as in
        Chapter 1, and is included for purposes of comparison.  It
        will usually be better to use the default weight initializer
        instead.
        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda = 0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch. Note that the lists
        are empty if the corresponding flag is not set.
        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print("Epoch %s training complete" % j)
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print("Cost on training data: {}".format(cost))
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print("Accuracy on training data: {} / {}".format(
                    accuracy, n))
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print("Cost on evaluation data: {}".format(cost))
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                # Reuse the value just computed instead of evaluating twice
                print("Accuracy on evaluation data: {} / {}".format(
                    accuracy, n_data))
            print()
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result. The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.
        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data. The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.
        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.
        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        with open(filename, "w") as f:
            json.dump(data, f)

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.
    """
    with open(filename, "r") as f:
        data = json.load(f)
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.
    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
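Two details of the cost classes above can be illustrated with a standalone sketch: why CrossEntropyCost.fn wraps its expression in np.nan_to_num, and how the two delta methods differ for a saturated output neuron. The snippet re-implements the relevant expressions rather than importing network2, and the input values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_terms(a, y):
    """Per-neuron cross-entropy terms, before any nan handling."""
    # Silence the warnings that log(0) and 0 * inf would otherwise emit.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Outputs that match their targets exactly: 0 * log(0) evaluates to nan.
a = np.array([[1.0], [0.0]])
y = np.array([[1.0], [0.0]])
raw = cross_entropy_terms(a, y)
safe_cost = np.sum(np.nan_to_num(raw))
print("raw terms:", raw.ravel(), "-> cost after nan_to_num:", safe_cost)

# Output-layer error for a saturated neuron (z = 4, target 0).
z = np.array([[4.0]])
target = np.array([[0.0]])
out = sigmoid(z)
quadratic_delta = (out - target) * out * (1 - out)  # as in QuadraticCost.delta
cross_entropy_delta = out - target                  # as in CrossEntropyCost.delta
print("quadratic delta: %.4f" % quadratic_delta[0, 0])
print("cross-entropy delta: %.4f" % cross_entropy_delta[0, 0])
```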


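Compared with the earlier run that used the quadratic cost, the demo program's results improved.
Cross-entropy is a concept from information theory. Here it measures how "surprised" we are, on average, by the gap between the value the network computes and the true value.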
