Tutorial: http://www.deeplearning.net/tutorial/gettingstarted.html
Datasets
(1) MNIST handwritten digits: each image is a 784-dimensional vector (28*28) of float pixel values in [0, 1], and each image represents a digit from 0 to 9. There are 50,000 images in the training set, 10,000 in the validation set (the validation set is used to choose hyperparameters such as the learning rate and model size), and 10,000 in the testing set.
For convenience we pickled the dataset to make it easier to use in python.
import cPickle, gzip, numpy
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
Note: the cPickle module has almost exactly the same interface as pickle, but it is implemented in C and is much faster.
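In Python 3, cPickle was folded into the standard pickle module, and a Python 2 pickle containing numpy arrays needs encoding='latin1' when loaded. A minimal sketch of the same round trip, using a tiny made-up array in place of the real mnist.pkl.gz:

```python
import gzip
import pickle
import numpy

# Tiny stand-ins for the real splits: (images, labels) pairs.
train_set = (numpy.zeros((4, 784), dtype='float64'), numpy.array([3, 1, 4, 1]))
valid_set = (numpy.zeros((2, 784), dtype='float64'), numpy.array([5, 9]))
test_set = (numpy.zeros((2, 784), dtype='float64'), numpy.array([2, 6]))

# Write the three splits as one gzipped pickle, mirroring the mnist.pkl.gz layout.
with gzip.open('tiny.pkl.gz', 'wb') as f:
    pickle.dump((train_set, valid_set, test_set), f)

# Load it back. For the real mnist.pkl.gz (a Python 2 pickle) you would
# additionally pass encoding='latin1' to pickle.load.
with gzip.open('tiny.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

print(train_set[0].shape)  # (4, 784)
```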
(2) We encourage you to store the dataset in shared variables and access it based on the minibatch index, given a fixed and known batch size (i.e., batch_size = 500 in the code below).
The reason: when using a GPU, repeatedly copying data to the GPU is inefficient, so keep the data in Theano shared variables for performance. It is recommended to create six shared variables: three for the data (training set, validation set, testing set) and three for the corresponding labels.
import numpy
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    # Function that loads the dataset into shared variables
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # Data on the GPU has to be stored as floats, but y is really a vector of
    # int labels, so cast it back to 'int32' when returning it.
    return shared_x, T.cast(shared_y, 'int32')
test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)
batch_size = 500 # size of the minibatch
# accessing the third minibatch of the training set
data = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
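The same index arithmetic can be checked in plain numpy, independent of Theano; the array contents below are made up for illustration, and n_train_batches is how the tutorial-style code typically derives the number of minibatches:

```python
import numpy

batch_size = 500
# Made-up stand-in for the 50,000-example training set (4 features per row).
train_x = numpy.arange(50000 * 4, dtype='float32').reshape(50000, 4)
train_y = numpy.arange(50000) % 10

# Number of complete minibatches in the training set.
n_train_batches = train_x.shape[0] // batch_size

index = 2  # the third minibatch
data = train_x[index * batch_size: (index + 1) * batch_size]
label = train_y[index * batch_size: (index + 1) * batch_size]

print(n_train_batches)  # 100
print(data.shape)       # (500, 4)
```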
If you run out of (GPU) memory:
you can store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you got through the chunk, update the values it stores.
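A minimal sketch of that chunking pattern, using a plain numpy array in place of the shared variable (chunk_size and the loop structure are illustrative assumptions; with Theano you would call shared_x.set_value(...) to update the stored chunk):

```python
import numpy

n_examples = 10000
batch_size = 500
chunk_size = 2000  # several minibatches per chunk (assumption)
full_x = numpy.random.rand(n_examples, 784).astype('float32')

# 'chunk' plays the role of the shared variable: load one chunk, train on
# its minibatches, then replace its values with the next chunk.
seen = 0
for start in range(0, n_examples, chunk_size):
    chunk = full_x[start: start + chunk_size]  # update the stored values
    for i in range(chunk.shape[0] // batch_size):
        minibatch = chunk[i * batch_size: (i + 1) * batch_size]
        seen += minibatch.shape[0]

print(seen)  # 10000: every example visited exactly once
```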
Learning a Classifier
Zero-One Loss
The loss on a correctly predicted example is 0 and on an incorrect one is 1; the total loss is the sum over all examples.
If $f: R^D \rightarrow \{0, \dots, L\}$ is the prediction function, then this loss can be written as:

$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$$

where $\mathcal{D}$ is either the training set (during training) or a set disjoint from it, $\mathcal{D} \cap \mathcal{D}_{train} = \emptyset$ (to avoid biasing the evaluation of validation or test error). $I$ is the indicator function defined as:

$$I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}$$
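The zero-one loss is easy to compute in numpy for a toy set of predictions (the arrays below are made up for illustration):

```python
import numpy

y_true = numpy.array([3, 1, 4, 1, 5, 9])
y_pred = numpy.array([3, 1, 4, 2, 5, 0])

# Sum of the indicator f(x^(i)) != y^(i) over the examples.
zero_one_loss = int(numpy.sum(y_pred != y_true))
print(zero_one_loss)  # 2: predictions at indices 3 and 5 are wrong
```

In practice the mean, zero_one_loss / len(y_true), i.e. the error rate, is usually what gets reported on the validation and test sets.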