

1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs (multilayer perceptrons). From Hubel and Wiesel’s early work on the cat’s visual cortex, we know the visual cortex contains a complex arrangement of cells. These cells are sensitive to small sub-regions of the visual field, called receptive fields. The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images.

Additionally, two basic cell types have been identified: Simple cells (S cells) respond maximally to specific edge-like patterns within their receptive field. Complex cells (C cells) have larger receptive fields and are locally invariant to the exact position of the pattern.

The animal visual cortex being the most powerful visual processing system in existence, it seems natural to emulate its behavior. Hence, many neurally-inspired models can be found in the literature. To name a few: the NeoCognitron [Fukushima], HMAX [Serre07] and LeNet-5 [LeCun98], which will be the focus of this tutorial.


1.1 Local Receptive Fields





1.2 Parameter Sharing




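More intuitively, suppose we randomly pick a small patch, say 8×8, from a large image and learn some features from that patch. We can then use the features learned from the 8×8 sample as detectors and apply them anywhere in the image. In particular, we can convolve the features learned from the 8×8 sample with the original large image, which gives an activation value for each feature at every location of the large image.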
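The idea above can be sketched in NumPy. This is only an illustration under stated assumptions: the function name `apply_feature`, the random 96×96 image and the random 8×8 feature are all hypothetical, and the loop computes a plain sliding-window response (a cross-correlation, i.e. without flipping the kernel) rather than calling any library convolution routine.

```python
import numpy as np

def apply_feature(image, feature):
    """Slide `feature` over every valid position of `image` and
    return one activation per location."""
    ih, iw = image.shape
    fh, fw = feature.shape
    out = np.empty((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # the same weights (the feature) are reused at every location
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * feature)
    return out

rng = np.random.RandomState(0)
image = rng.rand(96, 96)     # hypothetical large image
feature = rng.rand(8, 8)     # hypothetical feature learned from an 8x8 patch
activations = apply_feature(image, feature)
print(activations.shape)     # (89, 89)
```

Because the feature is reused at every position, the number of weights stays at 8×8 no matter how large the image is.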


1.3 Multiple Convolution Kernels







The figure shows two layers of a CNN. Layer m-1 contains four feature maps. Hidden layer m contains two feature maps (h1 and h2). Pixels (neuron outputs) in h1 and h2 (outlined as blue and red squares) are computed from pixels of layer (m-1) which fall within their 2x2 receptive field in the layer below (shown as colored rectangles). Notice how the receptive field spans all four input feature maps. The weights W1 and W2 of h1 and h2 are thus 3D weight tensors. The leading dimension indexes the input feature maps, while the other two refer to the pixel coordinates.

Putting it all together, $W^{kl}_{ij}$ denotes the weight connecting each pixel of the k-th feature map at layer m with the pixel at coordinates (i,j) of the l-th feature map of layer (m-1).
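The indexing described above can be made concrete with a small NumPy sketch. Everything here is assumed for illustration: the 5×5 map size, the random values, the zero biases, the tanh non-linearity and the helper name `hidden_pixel`. The array `W[k, l, i, j]` plays the role of the weight from input map l at receptive-field offset (i, j) to output map k.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(4, 5, 5)       # layer m-1: 4 feature maps of 5x5 pixels (sizes assumed)
W = rng.rand(2, 4, 2, 2)    # W[k, l, i, j]: 2 output maps, 4 input maps, 2x2 field
b = np.zeros(2)             # one bias per output feature map (assumed zero)

def hidden_pixel(k, r, c):
    """Pre-activation of feature map k at position (r, c): a sum over
    all input maps l and all receptive-field offsets (i, j)."""
    patch = x[:, r:r + 2, c:c + 2]      # shape (4, 2, 2), spans all input maps
    return np.sum(W[k] * patch) + b[k]  # W[k] is the 3D weight tensor of map k

# layer m: 2 feature maps, each (5 - 2 + 1) x (5 - 2 + 1) = 4x4
h = np.array([[[np.tanh(hidden_pixel(k, r, c)) for c in range(4)]
               for r in range(4)]
              for k in range(2)])
print(h.shape)              # (2, 4, 4)
```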

1.4 Pooling (Downsampling)

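Once features have been obtained by convolution, the next step is to use them for classification. In principle one could train a classifier, such as a softmax classifier, on all the extracted features, but this is computationally challenging. For example, for a 96x96-pixel image, suppose we have learned 400 features defined over 8x8 inputs. Convolving each feature with the image yields a convolved feature of dimension (96−8+1)×(96−8+1)=7921, and since there are 400 features, each example ends up with a convolved feature vector of dimension 7921×400=3,168,400. Learning a classifier with more than 3 million input features is unwieldy and prone to over-fitting.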
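The arithmetic in this example can be checked with a few lines of Python. The helper name `conv_output_dim` is hypothetical; it simply encodes the "image size minus patch size plus one" rule used in the text.

```python
def conv_output_dim(image_size, patch_size):
    """Side length of the output when a patch is slid over every
    position where it fits entirely inside the image."""
    return image_size - patch_size + 1

side = conv_output_dim(96, 8)     # 89
per_feature = side * side         # 7921 values per learned feature
total = per_feature * 400         # 3168400 values for 400 features
print(side, per_feature, total)   # 89 7921 3168400
```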

  • Reduce the image resolution and the dimensionality of the training data

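To address this problem, first recall that we decided to use convolved features because images have a "stationarity" property: features that are useful in one region of an image are very likely to be useful in another region as well. A natural way to describe a large image is therefore to aggregate statistics of the features at different locations. For example, one can compute the mean (or maximum) value of a particular feature over a region of the image. These summary statistics are much lower in dimension than the full set of extracted features, and they can also improve results (less over-fitting). This aggregation operation is called pooling, or more specifically mean pooling or max pooling, depending on the pooling operation applied.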
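As a rough sketch of this aggregation step, the NumPy snippet below pools non-overlapping regions of a single feature map. The function name `pool_regions`, the 4×4 toy map and the choice to drop incomplete border regions are assumptions made for the example, not part of the original text.

```python
import numpy as np

def pool_regions(feature_map, size, mode="max"):
    """Aggregate non-overlapping size x size regions of a 2D feature map.
    Rows and columns that do not fill a complete region are dropped."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    # blocks[a, i, b, j] == feature_map[a * size + i, b * size + j]
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)   # toy feature map, values 0..15
max_pooled = pool_regions(fmap, 2, "max")     # maxima of the four 2x2 blocks: 5, 7, 13, 15
mean_pooled = pool_regions(fmap, 2, "mean")   # means of the same blocks: 2.5, 4.5, 10.5, 12.5

# an 89x89 convolved feature shrinks to 44x44 under 2x2 pooling
print(pool_regions(np.zeros((89, 89)), 2).shape)
```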



1.5 Multiple Convolutional Layers


1.6 Summary of Convolutional Neural Networks

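  • Built on artificial neural networks

  • Convolutional filters extract features in front of the artificial neural network

  • Convolution kernels serve as the feature extractors

  • The feature extractors (that is, the convolution kernels and their threshold parameters) are trained automatically

  • A kernel is trained once and reused many times, and online learning is possible

  • Local receptive fields + weight sharing + multiple kernels + downsampling greatly reduce the number of model parameters and features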
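To give a feel for the size of the reduction mentioned in the last point, here is a back-of-the-envelope count in Python. The scenario is assumed for illustration: a single-channel 96×96 input, compared between a fully connected layer with 400 hidden units and a convolutional layer with 400 shared 8×8 kernels. The helper names are hypothetical.

```python
def dense_params(n_inputs, n_hidden):
    """Fully connected layer: one weight per (input, unit) pair plus biases."""
    return n_inputs * n_hidden + n_hidden

def conv_params(n_kernels, n_input_maps, kh, kw):
    """Convolutional layer: each kernel is shared across all positions."""
    return n_kernels * n_input_maps * kh * kw + n_kernels

dense = dense_params(96 * 96, 400)   # 3686800
conv = conv_params(400, 1, 8, 8)     # 26000
print(dense, conv, dense / conv)     # ratio is about 141.8
```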

2 Theano Implementation

2.1 The Convolution Operator (ConvOp)

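ConvOp is the main implementation of a convolutional layer in Theano. It is exposed through the conv2d function (imported from theano.tensor.nnet in the code below), which takes two symbolic inputs: 
- a 4D tensor corresponding to a mini-batch of input images. The shape of the tensor is: [mini-batch size, number of input feature maps, image height, image width]. 
- a 4D tensor corresponding to the weights W. The shape of the tensor is: [number of feature maps at layer m, number of feature maps at layer m-1, filter height, filter width].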


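The code below implements a convolutional layer similar to the one in the figure above. The input consists of 3 feature maps (an RGB color image) of size 120x160. We use two convolutional filters with 9x9 receptive fields.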

import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d

import numpy

rng = numpy.random.RandomState(23455)

# instantiate 4D tensor for input
input = T.tensor4(name='input')

# initialize shared variable for weights.
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared( numpy.asarray(
            rng.uniform(
                low=-1.0 / w_bound,
                high=1.0 / w_bound,
                size=w_shp),
            dtype=input.dtype), name ='W')

# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)  # b is treated as a vector, so its shape is (2,)
b = theano.shared(numpy.asarray(
            rng.uniform(low=-.5, high=.5, size=b_shp),
            dtype=input.dtype), name ='b')

# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv2d(input, W)

# build symbolic expression to add bias and apply activation function, i.e. produce neural net layer output
# A few words on ``dimshuffle`` :
#   ``dimshuffle`` is a powerful tool in reshaping a tensor;
#   what it allows you to do is to shuffle dimension around
#   but also to insert new ones along which the tensor will be
#   broadcastable;
#   dimshuffle('x', 2, 'x', 0, 1)
#   This will work on 3d tensors with no broadcastable
#   dimensions. The first dimension will be broadcastable,
#   then we will have the third dimension of the input tensor as
#   the second of the resulting tensor, etc. If the tensor has
#   shape (20, 30, 40), the resulting tensor will have dimensions
#   (1, 40, 1, 20, 30). (AxBxC tensor is mapped to 1xCx1xAxB tensor)
#   More examples:
#    dimshuffle('x') -> make a 0d (scalar) into a 1d vector
#    dimshuffle(0, 1) -> identity
#    dimshuffle(1, 0) -> inverts the first and second dimensions
#    dimshuffle('x', 0) -> make a row out of a 1d vector (N to 1xN)
#    dimshuffle(0, 'x') -> make a column out of a 1d vector (N to Nx1)
#    dimshuffle(2, 0, 1) -> AxBxC to CxAxB
#    dimshuffle(0, 'x', 1) -> AxB to Ax1xB
#    dimshuffle(1, 'x', 0) -> AxB to Bx1xA
output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))

# create theano function to compute filtered images
f = theano.function([input], output)

dimshuffle( ): a tool for rearranging the dimensions of a tensor.
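For readers more familiar with NumPy, the reshaping done by dimshuffle in the code above has a close NumPy counterpart. The sketch below is an analogy rather than Theano code: the array shapes and bias values are made up, and `np.newaxis` stands in for the 'x' entries.

```python
import numpy as np

conv_out = np.zeros((1, 2, 112, 152))   # (batch, feature maps, height, width), assumed
b = np.array([0.1, -0.2])               # one bias per feature map, values assumed

# counterpart of b.dimshuffle('x', 0, 'x', 'x'): shape (2,) becomes (1, 2, 1, 1)
b4 = b[np.newaxis, :, np.newaxis, np.newaxis]
out = conv_out + b4                     # broadcast over batch, height and width

# counterpart of dimshuffle('x', 2, 'x', 0, 1) on a 3D tensor
t = np.zeros((20, 30, 40))
t5 = t.transpose(2, 0, 1)[np.newaxis, :, np.newaxis, :, :]

# counterpart of dimshuffle(1, 0) on a matrix: swap the two axes
m = np.arange(6).reshape(2, 3)
mt = m.transpose(1, 0)

print(b4.shape, out.shape, t5.shape, mt.shape)
```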

Let’s have a little bit of fun with this…

import numpy
import pylab
from PIL import Image

# open random image of dimensions 639x516
img = Image.open('doc/images/3wolfmoon.jpg')
# dimensions are (height, width, channel)
img = numpy.asarray(img, dtype='float64') / 256.

# put image in 4D tensor of shape (1, 3, height, width)
img_ = img.transpose(2, 0, 1).reshape(1, 3, 639, 516)
filtered_img = f(img_)

# plot original image and first and second components of output
pylab.subplot(1, 3, 1); pylab.axis('off'); pylab.imshow(img)
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
pylab.subplot(1, 3, 2); pylab.axis('off'); pylab.imshow(filtered_img[0, 0, :, :])
pylab.subplot(1, 3, 3); pylab.axis('off'); pylab.imshow(filtered_img[0, 1, :, :])
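
Also note that we initialize the weights with the same kind of formula as for the MLP. The weights are sampled from a uniform distribution over a range given in the MLP text as [-1/fan-in, 1/fan-in] (the code above uses the bound 1/sqrt(fan-in)), where fan-in is the number of inputs to a hidden unit. For an MLP, fan-in is the number of units in the layer below. For a CNN, however, we have to take into account the number of input feature maps and the size of the receptive fields.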
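As a small worked example of the fan-in rule for a convolutional layer, the snippet below recomputes the bound used for `w_shp = (2, 3, 9, 9)` in the code above. The helper name `conv_fan_in` is hypothetical, and the 1/sqrt(fan-in) bound mirrors the `w_bound` line of that code.

```python
import math

def conv_fan_in(filter_shape):
    """filter_shape = (n_output_maps, n_input_maps, height, width).
    Each hidden unit sees n_input_maps * height * width inputs."""
    _, n_in, fh, fw = filter_shape
    return n_in * fh * fw

w_shp = (2, 3, 9, 9)
fan_in = conv_fan_in(w_shp)          # 3 * 9 * 9 = 243
w_bound = math.sqrt(fan_in)
low, high = -1.0 / w_bound, 1.0 / w_bound
print(fan_in, low, high)             # bounds are roughly -0.064 and 0.064
```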

2.2 MaxPooling


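(2) Max-pooling provides a form of translation invariance. To see why, imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which the input image can be translated by a single pixel. If max-pooling is done over a 2x2 window, 3 of these 8 possible configurations produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this rises to 5/8. Since it provides additional robustness to position, max-pooling is a flexible way of reducing the dimensionality of intermediate representations.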



from theano.tensor.signal import pool

input = T.dtensor4('input')
maxpool_shape = (2, 2)
pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=True)
f = theano.function([input],pool_out)

invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
print('With ignore_border set to True:')
print('invals[0, 0, :, :] =\n', invals[0, 0, :, :])
print('output[0, 0, :, :] =\n', f(invals)[0, 0, :, :])

pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=False)
f = theano.function([input],pool_out)
print('With ignore_border set to False:')
print('invals[1, 0, :, :] =\n ', invals[1, 0, :, :])
print('output[1, 0, :, :] =\n ', f(invals)[1, 0, :, :])
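
Note that, unlike most Theano code, the pool_2d function (named max_pool_2d in older Theano releases) is a bit special: it requires the downscaling factor ds (a tuple of length 2 giving the scaling factors for the image width and height) to be known when the Theano graph is built. This may change in a future release.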
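The effect of the ignore_border flag on a 5x5 input with a 2x2 window can be imitated in plain Python. This is a hedged sketch of the behaviour, not Theano's implementation: the function names `pooled_size` and `max_pool_sketch` and the toy grid are assumptions.

```python
import math

def pooled_size(n, window, ignore_border):
    """Output length along one axis for non-overlapping windows."""
    if ignore_border:
        return n // window           # incomplete windows at the edge are dropped
    return math.ceil(n / window)     # partial windows at the edge are kept

def max_pool_sketch(rows, window, ignore_border=True):
    wh, ww = window
    out_h = pooled_size(len(rows), wh, ignore_border)
    out_w = pooled_size(len(rows[0]), ww, ignore_border)
    return [[max(v
                 for row in rows[r * wh:(r + 1) * wh]
                 for v in row[c * ww:(c + 1) * ww])
             for c in range(out_w)]
            for r in range(out_h)]

grid = [[r * 5 + c for c in range(5)] for r in range(5)]    # 5x5, values 0..24
dropped = max_pool_sketch(grid, (2, 2), ignore_border=True)   # 2x2 result
kept = max_pool_sketch(grid, (2, 2), ignore_border=False)     # 3x3 result
print(dropped)
print(kept)
```

With the border ignored, the fifth row and column never contribute to the output. With the border kept, they form smaller windows of their own.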

2.3 The Full Model: LeNet




Note that the term “convolution” can correspond to different mathematical operations: 
1. theano.tensor.nnet.conv2d, which is the most common one in almost all recently published convolutional models. In this operation, each output feature map is connected to each input feature map by a different 2D filter, and its value is the sum of the individual convolutions of all inputs through the corresponding filters. 
2. The convolution used in the original LeNet model: In this work, each output feature map is only connected to a subset of input feature maps. 
3. The convolution used in signal processing: theano.tensor.signal.conv.conv2d, which works only on single channel inputs.

Here, we use the first operation, so this model differs slightly from the original LeNet paper. One reason to use 2. would be to reduce the amount of computation needed, but modern hardware makes it as fast to have the full connection pattern. Another reason would be to slightly reduce the number of free parameters, but we have other regularization techniques at our disposal.

2.4 Putting it All Together

We now have all we need to implement a LeNet model in Theano. We start with the LeNetConvPoolLayer class, which implements a {convolution + max-pooling} layer.

class LeNetConvPoolLayer(object):
    """Pool Layer of a convolutional network """

    def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)):
        """
        Allocate a LeNetConvPoolLayer with shared variable internal parameters.

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dtensor4
        :param input: symbolic image tensor, of shape image_shape

        :type filter_shape: tuple or list of length 4
        :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)

        :type image_shape: tuple or list of length 4
        :param image_shape: (batch size, num input feature maps,
                             image height, image width)

        :type poolsize: tuple or list of length 2
        :param poolsize: the downsampling (pooling) factor (#rows, #cols)
        """

        assert image_shape[1] == filter_shape[1]
        self.input = input

        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        fan_in = numpy.prod(filter_shape[1:])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        #   pooling size
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) //
                   numpy.prod(poolsize))
        # initialize weights with random weights
        W_bound = numpy.sqrt(6. / (fan_in + fan_out))
        self.W = theano.shared(
            numpy.asarray(
                rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                dtype=theano.config.floatX
            ),
            borrow=True
        )

        # the bias is a 1D tensor -- one bias per output feature map
        b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
        self.b = theano.shared(value=b_values, borrow=True)

        # convolve input feature maps with filters
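        conv_out = conv2d(
            input=input,
            filters=self.W,
            filter_shape=filter_shape,
            input_shape=image_shape
        )

        # pool each feature map individually, using maxpooling
        pooled_out = pool.pool_2d(
            conv_out,
            poolsize,
            ignore_border=True
        )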

        # add the bias term. Since the bias is a vector (1D array), we first
        # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
        # thus be broadcasted across mini-batches and feature map
        # width & height
        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))

        # store parameters of this layer
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input
We leave out the code that performs the actual training and early-stopping, since it is exactly the same as with an MLP. The interested reader can nevertheless access the code in the ‘code’ folder of DeepLearningTutorials.
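
When stacking several such layers, it helps to track how the tensor shape changes from one layer to the next, because each layer's image_shape must match the previous layer's output. The plain-Python sketch below does that bookkeeping and also recomputes the W_bound formula from the class above. The helper names, the batch size of 500, the 28x28 input and the filter counts are all assumptions chosen for illustration; they are not given in this text.

```python
import math

def conv_pool_output_shape(image_shape, filter_shape, poolsize=(2, 2)):
    """Shape after a 'valid' convolution followed by non-overlapping
    pooling in which incomplete border windows are dropped."""
    batch, n_in, h, w = image_shape
    n_out, n_in_f, fh, fw = filter_shape
    assert n_in == n_in_f   # same check as in LeNetConvPoolLayer
    return (batch, n_out,
            (h - fh + 1) // poolsize[0],
            (w - fw + 1) // poolsize[1])

def weight_bound(filter_shape, poolsize=(2, 2)):
    """sqrt(6 / (fan_in + fan_out)), as in the class above."""
    n_out, n_in, fh, fw = filter_shape
    fan_in = n_in * fh * fw
    fan_out = n_out * fh * fw // (poolsize[0] * poolsize[1])
    return math.sqrt(6.0 / (fan_in + fan_out))

shape0 = (500, 1, 28, 28)                                  # hypothetical input batch
shape1 = conv_pool_output_shape(shape0, (20, 1, 5, 5))     # (500, 20, 12, 12)
shape2 = conv_pool_output_shape(shape1, (50, 20, 5, 5))    # (500, 50, 4, 4)
bound1 = weight_bound((20, 1, 5, 5))                       # sqrt(6 / (25 + 125))
print(shape1, shape2, bound1)
```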





