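This assignment trains a logistic regression model to recognize letters in images. There are ten classes, 'A' through 'J', so the task is similar to MNIST.
First, install Jupyter and the other required packages. The detailed steps are in my answer on Zhihu:
https://www.zhihu.com/question/51996422/answer/150529315
After installing Jupyter, I started it with jupyter notebook, but on my desktop machine it reported an error: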
Native kernel (python2) is not available
After some searching, I found at https://github.com/jupyter/notebook/issues/1280 that installing ipykernel fixes it:
pip install ipykernel
Problem 1: Check that the sample data is valid
The assignment already gives one example. Here I pick one random image from each class and display it.
import os
import numpy as np
from IPython.display import display, Image

# Show one random image from each class folder
for i in range(num_classes):
    imagenames = os.listdir(train_folders[i])
    sample_idx = np.random.randint(len(imagenames))  # pick a random image index
    imagename = os.path.join(train_folders[i], imagenames[sample_idx])
    display(Image(filename=imagename))
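Problem 2: Check that the normalized images are still valid
The assignment code preprocesses the images so that the data has approximately zero mean and a standard deviation of about 0.5, so we need to check that the processed images still look right.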
plt.figure()
for i in range(num_classes):
    pickle_file = train_datasets[i]  # index 0 should be all As, 1 = all Bs, etc.
    with open(pickle_file, 'rb') as f:
        letter_set = pickle.load(f)  # unpickle
    sample_idx = np.random.randint(len(letter_set))  # pick a random image index
    sample_image = letter_set[sample_idx, :, :]  # extract a 2D slice
    plt.subplot(10, 1, i + 1)
    plt.imshow(sample_image)  # display it
As in the previous problem, one random image per class is displayed, this time using figure(), subplot(), and imshow().
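Besides looking at the images, the normalization can also be checked numerically. The following is only a minimal sketch: it assumes 8-bit grayscale pixels scaled as (pixel - 255/2) / 255, and the normalize helper and the random test image are made up for illustration, not taken from the assignment.

```python
import numpy as np

PIXEL_DEPTH = 255.0  # assumption: 8-bit grayscale images


def normalize(image):
    """Scale raw pixel values in [0, 255] to the range [-0.5, 0.5]."""
    return (image.astype(np.float32) - PIXEL_DEPTH / 2) / PIXEL_DEPTH


# Hypothetical stand-in for one raw 28x28 image
rng = np.random.RandomState(0)
raw = rng.randint(0, 256, size=(28, 28))

normalized = normalize(raw)
print("min:", normalized.min(), "max:", normalized.max())
print("mean:", normalized.mean(), "std:", normalized.std())
```

Running the same statistics on a real letter_set would show whether the values stay inside the expected range and whether the mean is close to zero.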
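Problem 3: Check that the data is balanced
The goal here is to check that every class has roughly the same number of samples. I simply use len() to get the number of samples in each class and print it.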
def get_num(data_set):
    # Print the number of samples stored in each class's pickle file
    for pickle_file in data_set:
        with open(pickle_file, 'rb') as f:
            data = pickle.load(f)
        print(len(data))

print("training data")
get_num(train_datasets)
print("test data")
get_num(test_datasets)
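Comparing ten printed numbers by eye works, but the check can also be reduced to a single figure. A small sketch, where balance_ratio is a hypothetical helper and the counts are invented for illustration (they are not the real dataset sizes):

```python
def balance_ratio(counts):
    """Return the smallest class size divided by the largest (1.0 = perfectly balanced)."""
    return min(counts) / float(max(counts))


# Made-up per-class sample counts, for illustration only
example_counts = [1000, 990, 1000, 1000, 995, 1000, 1000, 1000, 980, 1000]
ratio = balance_ratio(example_counts)
print("balance ratio: %.3f" % ratio)
```

A ratio close to 1.0 means the classes are about the same size. What counts as "close enough" is a judgment call.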
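Problem 4: Shuffle the samples and verify them
The task here is to shuffle the samples and check that they are still valid afterwards. The shuffling code is already provided, so once again I just display some images.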
# Show the first 50 images of a data set
def showImage(data_set):
    plt.figure()
    images = data_set[0:50, :, :]
    for index, image in enumerate(images):
        plt.subplot(10, 5, index + 1)
        plt.imshow(image)

print("training dataset")
showImage(train_dataset)
print("validation dataset")
showImage(valid_dataset)
print("test dataset")
showImage(test_dataset)
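Here I display 50 images from each of the training, validation, and test sets. These are the first 50 of each set, which amounts to a random sample because the sets have already been shuffled.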
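Looking at the images shows that they are intact, but not that each image still carries the right label. One way to check that could be sketched as follows, assuming the shuffle applies a single permutation to both arrays. The shuffle_in_unison helper and the toy data are made up for illustration.

```python
import numpy as np


def shuffle_in_unison(dataset, labels, seed=None):
    """Shuffle images and labels with the same permutation."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(labels))
    return dataset[perm], labels[perm]


# Toy data: every pixel of image i equals its label, so alignment is easy to check
toy_labels = np.arange(10) % 3
toy_images = np.ones((10, 2, 2)) * toy_labels[:, None, None]

shuffled_images, shuffled_labels = shuffle_in_unison(toy_images, toy_labels, seed=0)
still_aligned = bool(np.all(shuffled_images[:, 0, 0] == shuffled_labels))
print("labels still match images:", still_aligned)
```

On the real data the same idea would mean printing the label next to each displayed image and checking that the letter matches.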
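Problem 5: Find overlapping samples
I don't know how to do this one yet... I remember a data structures course covering how to compute the intersection of sets, so I still need to look into how exactly to do it.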
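One way the set-intersection idea could be sketched: hash the raw bytes of every image, put the hashes of one data set into a Python set, and count how many images of the other data set have a hash in it. This is only an illustration on toy data. The image_hashes and count_overlap helpers are hypothetical, and the approach only finds exact byte-for-byte duplicates, not near-duplicates.

```python
import hashlib

import numpy as np


def image_hashes(dataset):
    """Return one SHA-1 hex digest per image, computed from its raw bytes."""
    return [hashlib.sha1(np.ascontiguousarray(image).tobytes()).hexdigest()
            for image in dataset]


def count_overlap(dataset_a, dataset_b):
    """Count images in dataset_b that also appear byte-for-byte in dataset_a."""
    seen = set(image_hashes(dataset_a))
    return sum(1 for h in image_hashes(dataset_b) if h in seen)


# Toy data: b shares exactly two images with a
rng = np.random.RandomState(0)
a = rng.rand(5, 28, 28).astype(np.float32)
b = np.concatenate([a[:2], rng.rand(3, 28, 28).astype(np.float32)])

overlap = count_overlap(a, b)
print("overlapping images:", overlap)
```

The set lookup keeps the cost roughly linear in the number of images, instead of comparing every pair.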
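Problem 6: Train the model
This is the actual training step. I used TensorFlow here instead of sklearn. The code is similar to the MNIST tutorial, except that I had to write a next_batch() function that fetches a given number of samples for each optimization step. This is only a simple implementation.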
# using tensorflow
import tensorflow as tf

def next_batch(batch_size, data_x, data_y):
    # Return a random mini-batch of batch_size samples and their labels
    perm = np.random.permutation(len(data_x))[:batch_size]
    return data_x[perm], data_y[perm]

# Flatten the images and one-hot encode the labels
sample_size = 5000
x_train = train_dataset[:sample_size].reshape(sample_size, image_size * image_size)
y_train = np.zeros((sample_size, num_classes))
y_train[np.arange(sample_size), train_labels[:sample_size]] = 1
x_test = test_dataset.reshape(len(test_dataset), image_size * image_size)
y_test = np.zeros((len(test_labels), num_classes))
y_test[np.arange(len(test_labels)), test_labels] = 1

# Softmax regression model (TensorFlow 1.x graph API)
W = tf.Variable(tf.zeros([image_size * image_size, num_classes]))
b = tf.Variable(tf.zeros([num_classes]))
x = tf.placeholder(tf.float32, [None, image_size * image_size])
y = tf.placeholder(tf.float32, [None, num_classes])
y_ = tf.nn.softmax(tf.matmul(x, W) + b)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Train on one random mini-batch per step
batch_size = 50
for step in range(sample_size // 10):
    batch_x, batch_y = next_batch(batch_size, x_train, y_train)
    sess.run(train_step, feed_dict={x: batch_x, y: batch_y})

# evaluate
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: x_train, y: y_train}))
print(sess.run(accuracy, feed_dict={x: x_test, y: y_test}))
Accuracy on the training set: 0.8466
Accuracy on the test set: 0.8541
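To make it clearer what the model computes, here is a small NumPy sketch of the softmax and the cross-entropy loss. It is an illustration only, not the TensorFlow implementation. Subtracting the row maximum before exp() is an extra step for numerical stability that the code above does not have, and the logits and labels are toy values.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(one_hot_labels, probabilities):
    """Mean over samples of -sum(y * log(p))."""
    return np.mean(-np.sum(one_hot_labels * np.log(probabilities), axis=1))


# With all-zero weights (the initial W and b), every class gets probability 1/10
logits = np.zeros((4, 10))
labels = np.eye(10)[[0, 3, 5, 9]]  # one-hot rows for four toy samples

probs = softmax(logits)
loss = cross_entropy(labels, probs)
print("initial loss:", loss)  # log(10), about 2.3026
```

A loss near log(10) at the start of training is therefore expected for ten classes, and it should drop as the weights are updated.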