The MNIST Dataset
MNIST is a handwritten-digit dataset containing 60,000 training samples and 10,000 test samples (download link). The 60,000 training samples are conventionally split into 50,000 actual training samples and 10,000 cross-validation samples. Each sample is a 28*28 black-and-white image; some typical examples are shown below.
For convenience of processing, pixel values are normalized to the range [0, 1] (black: 0, white: 1), each sample is flattened into a 784-dimensional vector, and each sample carries a label (0~9). The following code reads in the dataset:
import cPickle, gzip, numpy

# Load the dataset (Python 2; under Python 3, use
# pickle.load(f, encoding='latin1') instead of cPickle.load)
f = gzip.open('mnist.pkl.gz', 'rb')
# each of the three sets is an (images, labels) pair
train_set, valid_set, test_set = cPickle.load(f)
f.close()
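As a quick sanity check, the shapes of the loaded arrays can be inspected. This is a minimal sketch, assuming mnist.pkl.gz is the version distributed with the deeplearning.net tutorial and that the loading code above has just been run:
# each set is an (images, labels) pair of numpy arrays
train_x, train_y = train_set
print train_x.shape        # (50000, 784): one flattened 28*28 image per row
print train_y.shape        # (50000,): one 0~9 label per sample
print valid_set[0].shape   # (10000, 784)
print test_set[0].shape    # (10000, 784)
print train_x.min(), train_x.max()   # pixel values lie in [0, 1]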
When the data is actually used, it is usually split into many small blocks, or minibatches (used by stochastic gradient descent, discussed later). To avoid fetching the data from CPU memory and copying it to the GPU for every computation (which would be slow), we put the data into shared variables, so that the GPU can read it directly from GPU memory each time it runs a computation. The following code shares the dataset and then extracts one minibatch.
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    it is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats,
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as indices, and if they are
    # floats it doesn't make sense), therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue.
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500  # size of the minibatch

# accessing the third minibatch of the training set
data = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
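In practice, the minibatch is usually not sliced by hand as above but selected symbolically inside a compiled Theano function via givens, so the slice is taken where the data lives (on the GPU when one is used). The following is a minimal sketch of that pattern; toy_cost is a placeholder expression standing in for a real model's cost, not something defined here:
index = T.lscalar('index')  # symbolic minibatch index
x = T.matrix('x')           # symbolic minibatch of images
y = T.ivector('y')          # symbolic minibatch of labels (int32, matching the cast above)

# placeholder cost: any Theano expression built from x and y would do
toy_cost = T.mean(x) + 0.0 * T.sum(y)

# ``givens`` substitutes a slice of the shared data for x and y at each call,
# so no host-to-GPU copy of the minibatch is needed
get_cost = theano.function(
    inputs=[index],
    outputs=toy_cost,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size],
    },
)

print get_cost(2)  # evaluates the placeholder cost on the third minibatch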
Common Notation
Training data