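Dataset Overview
The task is to decide whether a given image shows a cat or a dog; the training and test sets contain images of only these two classes.
Competition link: https://www.kaggle.com/c/dogs-vs-cats/overview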
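Data Processing
- Obtain the data and the annotation files (loading Kaggle data is rather roundabout; I took the lazy route and reused someone else's working code, so I won't paste it here);
- Analyze the data and visualize it where useful;
- Split the data into a training set and a validation set;
- Wrap the data, i.e. batch it.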
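The last step, batching, can be sketched in a few lines. This is only an illustration under assumptions: `iter_batches` and the example file names are hypothetical and not part of the original code, and in practice a framework data loader or image generator would normally do this work.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names standing in for the real training images.
example_fnames = ["cat.{}.jpg".format(i) for i in range(250)]
batches = list(iter_batches(example_fnames, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

The final batch is simply shorter when the number of items is not a multiple of the batch size.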
A look at what the dataset contains
import os

# tmp_dir is defined in the data-loading code that is not shown here
list_of_fnames = os.listdir(os.path.join(tmp_dir, 'train'))
print('Total number of images in tmp/train is {0}'.format(len(list_of_fnames)))
# The class is part of each file name, so a case-insensitive substring match separates the two
list_of_cats_fnames = [i for i in list_of_fnames if 'CAT' in i.upper()]
list_of_dogs_fnames = [i for i in list_of_fnames if 'DOG' in i.upper()]
TOTAL_CATS = len(list_of_cats_fnames)
TOTAL_DOGS = len(list_of_dogs_fnames)
print('{0} CATS images'.format(TOTAL_CATS))
print('{0} DOGS images'.format(TOTAL_DOGS))
Result
Total number of images in tmp/train is 25000
12500 CATS images
12500 DOGS images
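The substring test above would also match any file whose name merely happens to contain those letters. A slightly stricter variant, assuming the files follow a `cat.<id>.jpg` / `dog.<id>.jpg` naming pattern, reads the class from the part before the first dot. The helper `count_by_prefix` and the sample names are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Iterable


def count_by_prefix(fnames: Iterable[str]) -> Counter:
    """Count files per class, taking the class from the text before the first dot."""
    return Counter(name.split(".", 1)[0].lower() for name in fnames)


# Hypothetical file names; the real list comes from os.listdir on the train folder.
sample = ["cat.0.jpg", "cat.1.jpg", "dog.0.jpg", "Dog.7.jpg"]
counts = count_by_prefix(sample)
print(counts["cat"], counts["dog"])  # 2 2
```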
Splitting into training and validation sets
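TRAIN_TEST_SPLIT_AT = 0.9   # fraction of the images used per class that goes to training
BATCH_SIZE = 100            # images per batch
TARGET_SIZE = (128, 128)    # size every image is resized to
NO_OF_EPOCHS = 1            # training epochs
EXPERIMENT_SIZE = 10000     # images used per class (out of 12500)
NO_OF_FOLDS = 5             # number of folds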
import os
from shutil import copyfile

import numpy as np

# Shuffle each class in place so the split is random
np.random.shuffle(list_of_cats_fnames)
np.random.shuffle(list_of_dogs_fnames)
tmp_train_dir = os.path.join(tmp_dir, 'train')
# train_dir and test_dir are created elsewhere and must already exist.
# For each class, the first round(TRAIN_TEST_SPLIT_AT * EXPERIMENT_SIZE) files
# go to train_dir and the rest, up to EXPERIMENT_SIZE, go to test_dir.
c = 0
for i in list_of_cats_fnames:
    if c < (round(TRAIN_TEST_SPLIT_AT * EXPERIMENT_SIZE)):
        copyfile(os.path.join(tmp_train_dir, i), os.path.join(train_dir, i))
    else:
        copyfile(os.path.join(tmp_train_dir, i), os.path.join(test_dir, i))
    c += 1
    if c >= EXPERIMENT_SIZE:
        break
c = 0
for i in list_of_dogs_fnames:
    if c < (round(TRAIN_TEST_SPLIT_AT * EXPERIMENT_SIZE)):
        copyfile(os.path.join(tmp_train_dir, i), os.path.join(train_dir, i))
    else:
        copyfile(os.path.join(tmp_train_dir, i), os.path.join(test_dir, i))
    c += 1
    if c >= EXPERIMENT_SIZE:
        break
# Both directories hold cats and dogs, so these are totals across the two classes
print('Total training images :', len(os.listdir(train_dir)))
print('Total test images :', len(os.listdir(test_dir)))
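The per-class split above can also be written as a pure function that only decides which file names go where, which makes the 90/10 arithmetic easy to check before any files are copied. This is a sketch, not the original method: `split_class`, its `seed` parameter, and the generated file names are assumptions for illustration, and it uses the standard library's `random` rather than NumPy.

```python
import random
from typing import List, Sequence, Tuple


def split_class(fnames: Sequence[str], split_at: float, experiment_size: int,
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle one class's file names, keep `experiment_size` of them, and split."""
    shuffled = list(fnames)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    used = shuffled[:experiment_size]
    n_train = round(split_at * experiment_size)
    return used[:n_train], used[n_train:]


# Hypothetical file names standing in for the 12500 cat images.
cats = ["cat.{}.jpg".format(i) for i in range(12500)]
train_cats, test_cats = split_class(cats, split_at=0.9, experiment_size=10000)
print(len(train_cats), len(test_cats))  # 9000 1000
```

With the constants used here, each class contributes 9000 training and 1000 test images, so the two directories would end up with 18000 and 2000 files when they start out empty.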
# File names of all training images, as a NumPy array
train_X = np.array(os.listdir(train_dir))
#
train_labels