[Caffe Source Code Study] Chapter 2: Usage (1): Building a Dataset

License: CC 4.0 BY-SA


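This article walks through the data input formats that Caffe supports, including LMDB, LevelDB, ImageData, and HDF5, and covers data preprocessing, conversion, and configuration.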

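The input data types most commonly used with Caffe are:

Data type, backed by either LevelDB or LMDB. To produce LevelDB, pass the backend flag to convert_imageset (--backend=leveldb in recent Caffe versions).
Images (the ImageData format)
HDF5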

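I. LMDB and LevelDB formats

1. LMDB

LMDB is an embedded storage engine (a library linked into the host program) developed by the OpenLDAP project. Its main features are:

Memory-mapped file I/O (mmap)
A key-value interface built on a B+ tree
Transactions based on MVCC (Multi-Version Concurrency Control)
A BDB-like (Berkeley DB) API

2. LevelDB

LevelDB is a highly efficient key-value database implemented by Google. The current version, 1.2, can handle data sets on the order of billions of records and still performs very well at that scale, thanks mainly to its design, in particular the LSM (log-structured merge) algorithm.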

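Comparison

lmdb
- Maps the data file directly with mmap and keeps memory copies to a minimum (read-only access can return memory straight from the engine), which improves read performance
- Organizes data as a tree, in file pages the same size as the system's virtual memory pages
- Pro: specifically optimized for reads
- Con: pages are as large as system pages (4 KB), so a 1 KB record wastes a lot of space
leveldb
- Organizes data in leveled tables, which optimizes write speed
- Pro: optimized for writes, with compression
- Con: when writes arrive faster than data can be rewritten on disk, it breaks down (a common LSM problem). In the worst case a record is written to disk 7 times, which is unacceptable.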

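3. Data layout

Our data looks like this: trainData and testData each contain 10 folders named 0-9, one for each of the digits 0-9. Part of the directory structure is shown below.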

F:\CAFFE\DATA
│  list.txt
│  
├─testData
│  ├─0
│  │      0-3-033OJJ7KZA.jpg 
│  │      0-5-CV7UTRECKB.jpg
│  │      
│  ├─1
│  │      1-3-01VZAOCIPC.jpg
│  │      1-3-09GBY203S5.jpg
│          
└─trainData
    │  train.txt
    │  
    ├─0
    │      0-3-00DUJ0RVR9.jpg
    │      0-3-0AWLKVU51V.jpg
    │      
    ├─1
    │      1-7-E3Y0H6X1TR.jpg
    │      1-7-E5DLYZ289T.jpg

4. The data list (txt) file

First create a txt file that lists each image path with its label, in the following format:

trainData/0/0-3-00DUJ0RVR9.jpg 0
trainData/0/0-3-0AWLKVU51V.jpg 0
trainData/0/0-3-0DS9V90EJ6.jpg 0
trainData/0/0-3-0DUO09DFPD.jpg 0
trainData/0/0-3-0F1UTHN9O9.jpg 0
trainData/0/0-3-0KBIEMMCYC.jpg 0
trainData/0/0-3-0QPBZLGTF7.jpg 0
trainData/0/0-3-0R5LZ0FG2H.jpg 0
trainData/0/0-3-0T1RBO2IMH.jpg 0
trainData/0/0-3-0TTN1FAFZY.jpg 0

A simple Python script does this:

import os

rootPath = './'

def write_list(subdir, list_name):
    # One line per image: "<relative path> <label>"; folder i holds digit i.
    with open(rootPath + list_name, 'w') as f:
        for i in range(10):
            path = subdir + '/' + str(i)
            lists = os.listdir(rootPath + path)
            for listfile in lists:
                # Skip the thumbnail cache that Windows puts in image folders.
                if listfile != 'Thumbs.db':
                    f.write(path + '/' + listfile + ' ' + str(i) + '\n')

write_list('trainData', 'train.txt')
write_list('testData', 'test.txt')

This produces train.txt and test.txt.
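Before converting, it can help to sanity-check the generated list. The following is only a sketch and not part of the original workflow; the helper name count_labels is hypothetical. It splits each line on the last space, the same way the Caffe snippet quoted later in this article does, and counts how many images carry each label.

```python
from collections import Counter


def count_labels(lines):
    """Count images per label in 'path label' lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the LAST space so paths containing spaces still work.
        path, _, label = line.rpartition(" ")
        if not path:
            raise ValueError("missing label in line: %r" % line)
        counts[int(label)] += 1
    return counts
```

A file object can be passed directly, for example count_labels(open('train.txt')). With the layout above, one would expect ten keys, 0 through 9.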

5. Data conversion

Use the convert_imageset tool to do the conversion.

The shell script is as follows:

TOOLS=/home/users/fangjin/caffe/build/tools
RESIZE_HEIGHT=32
RESIZE_WIDTH=32
TRAIN_DATA_ROOT=/home/users/fangjin/test/number_data/

echo "Creating train lmdb..."
GLOG_logtostderr=1 $TOOLS/convert_imageset \
   --resize_height=$RESIZE_HEIGHT \
   --resize_width=$RESIZE_WIDTH \
   --shuffle \
   $TRAIN_DATA_ROOT \
   train.txt \
   number_train_lmdb

echo "Creating test lmdb..."
GLOG_logtostderr=1 $TOOLS/convert_imageset \
   --resize_height=$RESIZE_HEIGHT \
   --resize_width=$RESIZE_WIDTH \
   --shuffle \
   $TRAIN_DATA_ROOT \
   test.txt \
   number_test_lmdb  # output database

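Parameters
1. resize_height: optional; the image height after resizing.
2. resize_width: optional; the image width after resizing. Note that resize_height and resize_width must be set together, not just one of them.
3. shuffle: optional; randomly shuffles the order of the samples.
4. $TRAIN_DATA_ROOT: the root directory that the relative paths in the txt file are resolved against. In other words, $TRAIN_DATA_ROOT + the path in the txt file must form the full path.
5. backend: to produce LevelDB instead of LMDB, add the backend flag to convert_imageset (--backend=leveldb in recent Caffe versions).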

Errors are usually caused by wrong paths. Before every re-run, delete the previously generated lmdb data first.
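Since most failures come from paths, a quick pre-flight check can save a conversion run. The sketch below assumes the full path is the plain concatenation of the root and the path in the txt file, as described above; the helper name missing_images and its exists parameter are hypothetical.

```python
import os


def missing_images(root, lines, exists=os.path.exists):
    """Return the listed paths that do not resolve to a file under root."""
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rel_path = line.rpartition(" ")[0]  # drop the trailing label
        # Plain concatenation, mirroring "$TRAIN_DATA_ROOT + path in txt".
        if not exists(root + rel_path):
            missing.append(rel_path)
    return missing
```

Because the two parts are simply concatenated, a root without a trailing slash (for example /home/users/fangjin/test/number_data) would make every path appear missing, which is an easy mistake to spot this way.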

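II. ImageData

The ImageData format reads image files directly, without converting them to another format. Neither the official site nor the bundled samples provide an example.

According to the official documentation:

Images
- Layer type: ImageData
- Parameters
  - Required
    - source: name of a text file, with each line giving an image filename and label
    - batch_size: number of images to batch together
  - Optional
    - rand_skip
    - shuffle [default false]
    - new_height, new_width: if provided, resize all images to this size

You need to prepare a txt file containing the absolute path and label of each image. Just modify the earlier Python script to write absolute paths.
The configuration file also has to change:
first set the type to type: "ImageData",
then rename data_param to image_data_param.
An example follows: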

name: "LeNet"
layer {
  name: "mnist"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  image_data_param {
    source: "F:/caffe/data/trainData/train.txt"
    batch_size: 64
    new_height: 32
    new_width: 32
    shuffle: true
  }
}
layer {
  name: "mnist"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  image_data_param {
    source: "F:/caffe/data/testData/test.txt"
    batch_size: 100
    new_height: 32
    new_width: 32
    shuffle: true
  }
}
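The absolute-path list that ImageData needs can also be derived from an existing relative list instead of re-scanning the folders. This is a minimal sketch; the helper name absolutize is hypothetical, and it assumes forward-slash paths like the ones in the configuration above.

```python
def absolutize(lines, root):
    """Rewrite 'relative/path label' lines as 'root/relative/path label'."""
    root = root.rstrip("/")
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The label is whatever follows the last space.
        rel_path, _, label = line.rpartition(" ")
        out.append("%s/%s %s" % (root, rel_path, label))
    return out
```

For example, absolutize(open('train.txt'), 'F:/caffe/data') turns trainData/0/0-3-00DUJ0RVR9.jpg 0 into F:/caffe/data/trainData/0/0-3-00DUJ0RVR9.jpg 0.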

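III. HDF5

1. The HDF5 format

HDF is a self-describing, multi-object file format for storing and distributing scientific data. It was created by the National Center for Supercomputing Applications (NCSA) in the United States to meet the needs of scientists working across different fields and projects. HDF covers many of the requirements for storing and distributing scientific data. It is designed to be:

Self-describing: For every data object in an HDF file there is comprehensive information (metadata) about that data. An application can interpret the structure and contents of an HDF file without any outside information.
Versatile: Many data types can be embedded in a single HDF file. For example, with the appropriate HDF data structures, symbolic, numeric, and graphical data can be stored together in one HDF file.
Flexible: HDF lets users group related data objects together in a hierarchical structure and attach descriptions and tags to data objects. It also lets users spread scientific data across multiple HDF files.
Extensible: HDF readily accommodates new data models added in the future and is easy to make compatible with other standard formats.
Portable: HDF is a platform-independent file format. HDF files can be used on different platforms without any conversion.

One advantage of HDF5 is that it supports non-integer labels.
With LevelDB, LMDB, or ImageData, labels can only be integers. The relevant code is: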

int label;
while (std::getline(infile, line)) {
    pos = line.find_last_of(' ');
    label = atoi(line.substr(pos + 1).c_str());
    lines_.push_back(std::make_pair(line.substr(0, pos), label));                                                                            
}

label = atoi(line.substr(pos + 1).c_str());

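The label can only be an integer, and writing a non-integer label does not raise an error either, because of how atoi behaves:

atoi() scans the string argument nptr, skipping leading whitespace (spaces, tabs, and so on, as detected by isspace()), and starts converting at the first digit or sign character. Conversion stops at the first non-digit character or at the end of the string ('\0'), and the result is returned. If nptr cannot be converted to an int, or is an empty string, 0 is returned.

So the part after the decimal point is effectively truncated.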
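To make the truncation concrete, here is a rough Python emulation of the atoi behavior described above. The name atoi_like is hypothetical, and the emulation ignores integer overflow, which C handles differently.

```python
import re

# Optional leading whitespace, optional sign, then one or more digits.
_ATOI = re.compile(r"\s*([+-]?\d+)")


def atoi_like(text):
    """Mimic C atoi: parse a leading integer, return 0 if there is none."""
    match = _ATOI.match(text)
    return int(match.group(1)) if match else 0
```

With this, atoi_like("0.25") and atoi_like("0.75") both give 0, so two different regression targets written into a list file would silently collapse into the same integer label.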

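HDF5, in contrast, supports floating-point labels and vector labels. Floating-point labels are common in regression problems; vector labels are common in multi-label problems.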

From the official documentation:
HDF5 Input

Layer type: HDF5Data
Parameters:
Required:
    + source: the name of the file to read from
    + batch_size: the number of inputs to process at a time

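2. Single-channel data with floating-point labels

(1) Building the dataset

First build a dataset whose label is the degree of disorder of each image. Skip this step if you already have your own data.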

__author__ = 'frank'
import os
import sys
import datetime

from multiprocessing import Process

import numpy as np
from matplotlib import pyplot

LATTICE_SIZE = 100
SAMPLE_SIZE = 2200
STEP_ORDER_RANGE = [3, 7]
SAMPLE_FOLDER = 'samples'

#----------------------------------------------------------------------#
#   Apply periodic boundary conditions
#----------------------------------------------------------------------#
def bc(i):
    # Wrap the index onto the lattice: -1 -> LATTICE_SIZE-1, LATTICE_SIZE -> 0.
    return i % LATTICE_SIZE

#----------------------------------------------------------------------#
#   Calculate internal energy
#----------------------------------------------------------------------#
def energy(system, N, M):
    return -1 * system[N,M] * (system[bc(N-1), M] \
                               + system[bc(N+1), M] \
                               + system[N, bc(M-1)] \
                               + system[N, bc(M+1)])

#----------------------------------------------------------------------#
#   Build the system
#----------------------------------------------------------------------#
def build_system():
    # randint's upper bound is exclusive, so this draws 0 or 1.
    system = np.random.randint(0, 2, (LATTICE_SIZE, LATTICE_SIZE))
    system[system==0] = - 1

    return system

#----------------------------------------------------------------------#
#   The Main monte carlo loop
#----------------------------------------------------------------------#
def main(T, index):

    score = np.random.random()
    order = score*(STEP_ORDER_RANGE[1]-STEP_ORDER_RANGE[0]) + STEP_ORDER_RANGE[0]
    stop = int(np.round(np.power(10.0, order)))
    print('Running sample: {}, stop @ {}'.format(index, stop))
    sys.stdout.flush()

    system = build_system()

    for step in range(stop):
        M = np.random.randint(0, LATTICE_SIZE)
        N = np.random.randint(0, LATTICE_SIZE)

        E = -2. * energy(system, N, M)

        if E <= 0.:
            system[N,M] *= -1
        elif np.exp(-1./T*E) > np.random.rand():
            system[N,M] *= -1

        #if step % 100000 == 0:
        #    print('.'),
        #    sys.stdout.flush()

    filename = '{}/'.format(SAMPLE_FOLDER) + '{:0>5d}'.format(index) + '_{}.jpg'.format(score)
    pyplot.imsave(filename, system, cmap='gray')
    print('Saved to {}!\n'.format(filename))
    sys.stdout.flush()

#----------------------------------------------------------------------#
#   Run the menu for the monte carlo simulation
#----------------------------------------------------------------------#

def run_main(index, length):
    np.random.seed(datetime.datetime.now().microsecond)
    for i in range(index, index+length):
        main(0.1, i)

def run():

    # Create the output folder if it does not exist yet (portable mkdir -p).
    os.makedirs(SAMPLE_FOLDER, exist_ok=True)

    n_processes = 8
    length = int(SAMPLE_SIZE/n_processes)
    processes = [Process(target=run_main, args=(x, length)) for x in np.arange(n_processes)*length]

    for p in processes:
        p.start()

    for p in processes:
        p.join()

if __name__ == '__main__':
    run()

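This generates 2200 images. In each file name, the first part is the sample index and the part after the underscore is the image's degree of disorder, which is the label.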
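The label is the random score drawn in main(), and it decides how long the simulation runs. The sketch below restates that mapping on its own, using the same STEP_ORDER_RANGE constant as the script; the function name steps_for_score is hypothetical.

```python
STEP_ORDER_RANGE = [3, 7]  # same constant as in the script above


def steps_for_score(score):
    """Number of Monte Carlo steps run for a sample with the given score."""
    low, high = STEP_ORDER_RANGE
    # score in [0, 1) is mapped linearly onto the exponent range [3, 7).
    order = score * (high - low) + low
    return int(round(10.0 ** order))
```

So a score of 0 corresponds to 1,000 update steps and a score close to 1 to roughly 10,000,000 steps; the network is later asked to regress this score from the image alone.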

(2) Splitting the dataset

Use 1800 of the images as the training set, 200 as the validation set, and 200 as the test set.
Write the script makedataset.py:

__author__ = 'frank'
import os
import numpy

filename2score = lambda x: x[:x.rfind('.')].split('_')[-1]

img_files = sorted(os.listdir('samples'))

with open('train.txt', 'w') as train_txt:
    for f in img_files[:1800]:
        score = filename2score(f)
        line = 'samples/{} {}\n'.format(f, score)
        train_txt.write(line)

with open('val.txt', 'w') as val_txt:
    for f in img_files[1800:2000]:
        score = filename2score(f)
        line = 'samples/{} {}\n'.format(f, score)
        val_txt.write(line)

with open('test.txt', 'w') as test_txt:
    for f in img_files[2000:]:
        line = 'samples/{}\n'.format(f)
        test_txt.write(line)

Running it generates test.txt, train.txt, and val.txt.
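The filename2score lambda returns the label as a string. A variant that converts it to float right away makes a malformed file name fail at list-building time rather than later during conversion. This is only a sketch; the name filename_to_score is hypothetical.

```python
def filename_to_score(name):
    """Extract the float label from a name like '00012_0.3456.jpg'."""
    stem = name[:name.rfind('.')]      # drop the extension
    return float(stem.split('_')[-1])  # text after the last underscore
```

Because only the last dot is treated as the extension separator, scores printed in scientific notation (for example 00003_1.5e-05.jpg) are parsed correctly as well.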

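(3) Converting the data to HDF5

Run mass2hdf5.py to generate train.h5 and train_h5.txt. Change filename to val.txt and run it again to generate val.h5 and val_h5.txt.
train_h5.txt stores the names of the .h5 files. A single h5 file can get too large, so the data can be split across several h5 files, with each file name written to the txt file. In code, this means turning the conversion into a loop that puts only N samples into each h5 file.
Of course, you need h5py installed, for example via conda.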

__author__ = 'frank'
import sys
import numpy
from matplotlib import pyplot
import h5py

IMAGE_SIZE = (100, 100)
MEAN_VALUE = 128

filename = 'train.txt'
setname, ext = filename.split('.')

with open(filename, 'r') as f:
    lines = f.readlines()

numpy.random.shuffle(lines)

sample_size = len(lines)
imgs = numpy.zeros((sample_size, 1,) + IMAGE_SIZE, dtype=numpy.float32)
scores = numpy.zeros(sample_size, dtype=numpy.float32)

h5_filename = '{}.h5'.format(setname)
with h5py.File(h5_filename, 'w') as h:
    for i, line in enumerate(lines):
        image_name, score = line[:-1].split()
        img = pyplot.imread(image_name)[:, :, 0].astype(numpy.float32)
        img = img.reshape((1, )+img.shape)
        img -= MEAN_VALUE
        imgs[i] = img
        scores[i] = float(score)
        if (i+1) % 100 == 0:
            print('processed {} images!'.format(i+1))
    h.create_dataset('data', data=imgs)
    h.create_dataset('score', data=scores)

with open('{}_h5.txt'.format(setname), 'w') as f:
    f.write(h5_filename)

Looking at the code, the key part is

    h.create_dataset('data', data=imgs)
    h.create_dataset('score', data=scores)

These two lines determine what is stored in the h5 file.

(4) Configuration

The corresponding layer in the configuration file must also be changed to HDF5Data:

layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "score"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "train_h5.txt"
    batch_size: 64
  }
}

The top names data and score here correspond to the dataset names stored above.

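3. Multi-channel data with floating-point labels

Multi-channel data mainly means color images. The biggest difference from the single-channel case above is how img is handled. The TID2013 dataset is used as the example here.
The code is as follows: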

__author__ = 'frank'

import sys
import numpy
from matplotlib import pyplot
import h5py

IMAGE_SIZE = (384, 512)
MEAN_VALUE = 128

filename = 'E:/Paper_5_chapter/tid2013/train/train.txt'
setname, ext = filename.split('.')

with open(filename, 'r') as f:
    lines = f.readlines()

numpy.random.shuffle(lines)
# Write one h5 file per group of 100 samples; a final partial group is skipped.
for seg in range(len(lines) // 100):
    lines_seg = lines[100*seg:100*seg+100]
    sample_size = len(lines_seg)
    imgs = numpy.zeros((sample_size, 3,) + IMAGE_SIZE, dtype=numpy.float32)
    scores = numpy.zeros(sample_size, dtype=numpy.float32)
    h5_filename = '{}_{}.h5'.format(setname, seg)
    # 'w' overwrites an existing file so the script can be re-run.
    with h5py.File(h5_filename, 'w') as h:
        for i, line in enumerate(lines_seg):
            image_name, score = line[:-1].split()
            img = pyplot.imread(image_name).astype(numpy.float32)
            # Reorder H*W*C to C*H*W, one channel at a time.
            img2 = numpy.zeros((3,) + IMAGE_SIZE, dtype=numpy.float32)
            img2[0] = img[:, :, 0]
            img2[1] = img[:, :, 1]
            img2[2] = img[:, :, 2]
            img2 -= MEAN_VALUE
            imgs[i] = img2
            scores[i] = float(score)/10
            if (i+1) % 10 == 0:
                print('processed {} images!'.format(i+1))
        h.create_dataset('data', data=imgs)
        h.create_dataset('score', data=scores)
    # Append this segment's file name to the list file read by HDF5Data.
    with open('{}_h5.txt'.format(setname), 'a') as f:
        f.write(h5_filename)
        f.write('\n')

The main differences are

imgs = numpy.zeros((sample_size, 3,)+ IMAGE_SIZE, dtype=numpy.float32)

Three channels are used here.

img2[0]=img[:,:,0]
img2[1]=img[:,:,1]
img2[2]=img[:,:,2]

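This converts the H*W*C color image to the C*H*W layout.
The script also writes a separate h5 file for every 100 samples, which avoids running out of memory when generating one large file.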
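One detail of the segmenting loop: iterating over range(len(lines) // 100) only covers whole groups of 100, so any leftover samples at the end are not written. If every sample should be kept, the list can be cut into groups with a small generator such as the one below. This is a sketch; the name chunks is hypothetical.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items, remainder included."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

In the script this would be used as for seg, lines_seg in enumerate(chunks(lines, 100)):, with the loop body unchanged, so that 250 lines produce files of 100, 100, and 50 samples.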

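4. Vector labels

This subsection draws on the post "Caffe中HDF5Data例子" (an HDF5Data example in Caffe).

(1) Generating the HDF5 data

Suppose each sample's label is a vector: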

import os
import random
from PIL import Image
import numpy as np
import h5py

IMAGE_DIR = ['image_train', 'image_test']
HDF5_FILE = ['hdf5_train.h5', 'hdf5_test.h5']
LIST_FILE = ['list_train.txt', 'list_test.txt']

LABELS = dict(
    # (kind_1, kind_2)
    A_0 = (0, 0),
    B_0 = (1, 0),
    A_1 = (0, 1),
    B_1 = (1, 1),
    A_2 = (0, 2),
    B_2 = (1, 2),
)

print('\nplease wait...')

for kk, image_dir in enumerate(IMAGE_DIR):
    # Read the list of files into file_list
    file_list = ...
    # Shuffle the file list
    random.shuffle(file_list)

    # Label category
    kind_index = ...

    # Images are 96*32 (width*height), single channel
    datas = np.zeros((len(file_list), 1, 32, 96))
    # Each label is a 1*2 vector
    labels = np.zeros((len(file_list), 2))

    for ii, _file in enumerate(file_list):
        # HDF5 data must be float or double.
        # Caffe's HDF5DataLayer does not allow transform_param,
        # so divide by 256 by hand.
        datas[ii, :, :, :] = \
            np.array(Image.open(_file)).astype(np.float32) / 256
        labels[ii, :] = np.array(LABELS[kind_index]).astype(int)

    # Write the HDF5 file (the with block closes it automatically)
    with h5py.File(HDF5_FILE[kk], 'w') as f:
        f['data'] = datas
        f['labels'] = labels

    # Write the list file; it may name several HDF5 files
    with open(LIST_FILE[kk], 'w') as f:
        f.write(os.path.abspath(HDF5_FILE[kk]) + '\n')

print('\ndone...')

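Note:
Caffe requires a single HDF5 file to be no larger than 2 GB, so if you have a lot of data, generate multiple HDF5 files.
I used 50,000 images totaling a little over 30 MB, and the resulting HDF5 file was 1.8 GB.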
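The jump from tens of megabytes of compressed images to gigabytes of HDF5 follows from storing every pixel as an uncompressed floating-point number. The sketch below estimates the raw array size only; it ignores HDF5 metadata and is not meant to reproduce the exact figure quoted above. The function names are hypothetical, and treating 2 GB as 2 * 1024**3 bytes is an assumption.

```python
def dataset_bytes(num_samples, sample_shape, itemsize):
    """Raw size in bytes of an array shaped (num_samples,) + sample_shape."""
    count = num_samples
    for dim in sample_shape:
        count *= dim
    return count * itemsize


def max_samples_per_file(sample_shape, itemsize, limit_bytes=2 * 1024 ** 3):
    """Largest sample count whose raw data stays within limit_bytes."""
    return limit_bytes // dataset_bytes(1, sample_shape, itemsize)
```

For instance, 50,000 samples of shape (1, 32, 96) take 1,228,800,000 bytes as float64 (the default dtype of np.zeros, itemsize 8) and half of that as float32. For 3 x 384 x 512 float32 images, max_samples_per_file((3, 384, 512), 4) gives 910 samples per file at most.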

(2) Splitting the labels with Slice

The Slice layer splits one input layer into multiple output layers along a given dimension (currently only num or channel can be specified).

Layer type: Slice
Example:

layer {
  name: "slicer_label"
  type: "Slice"
  bottom: "label"
  ## assume label has dimensions N x 3 x 1 x 1
  top: "label1"
  top: "label2"
  top: "label3"
  slice_param {
    axis: 1                        # slice along the channel dimension
    slice_point: 1                 # label1 gets channel 0 (before the first slice point)
    slice_point: 2                 # label2 gets channel 1 (between the two slice points)
                                   # label3 gets channel 2 (after the last slice point)
  }
}

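axis selects the dimension to slice along, and each slice_point is an index along that dimension where a cut is made. The number of slice_point entries must be one less than the number of top blobs.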
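The relationship between slice points and outputs can be written out as index ranges. The sketch below is an illustration of the rule just stated, not Caffe's implementation; the name slice_ranges is hypothetical, and the requirement that slice points be strictly increasing is an assumption of this sketch.

```python
def slice_ranges(dim_size, slice_points):
    """Return the [start, end) index range that each top blob receives."""
    bounds = [0] + list(slice_points) + [dim_size]
    pairs = list(zip(bounds, bounds[1:]))
    for lo, hi in pairs:
        if lo >= hi:
            raise ValueError("slice points must be strictly increasing "
                             "and lie inside the dimension")
    return pairs
```

For a label of dimensions N x 3 x 1 x 1 with slice points 1 and 2, slice_ranges(3, [1, 2]) returns [(0, 1), (1, 2), (2, 3)]: two slice points, three tops, one channel each.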

(3) Configuration

A complete example:

name: "LeNet"

###for data and labels

layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "labels"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "list_train.txt"
    batch_size: 100
  }
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "labels"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "list_test.txt"
    batch_size: 100
  }
}
layer {
  name: "slicers"
  type: "Slice"
  bottom: "labels"
  top: "label_1"
  top: "label_2"
  slice_param {
    axis: 1
    slice_point: 1
  }
}

### for all

layer {
  name: "conv_all"
  type: "Convolution"
  bottom: "data"
  top: "conv_all"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu_all"
  type: "ReLU"
  bottom: "conv_all"
  top: "conv_all"
}
layer {
  name: "pool_all"
  type: "Pooling"
  bottom: "conv_all"
  top: "pool_all"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}

### for kind_1

layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool_all"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy1"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label_1"
  top: "accuracy1"
  include {
    phase: TEST
  }
}
layer {
  name: "loss_1"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label_1"
  top: "loss_1"
}

###for kind_2

layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "pool_all"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 3
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy2"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label_2"
  top: "accuracy2"
  include {
    phase: TEST
  }
}
layer {
  name: "loss_2"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label_2"
  top: "loss_2"
}

The network structure is shown below.
[Figure: network structure diagram]