neuraltalk2 code walkthrough (1)

MagicYangTwo

Category: Torch. Tags: machine learning, Torch, imgcaption

Article link: https://blog.csdn.net/qq_30133053/article/details/52328383

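I have lost count of the days since I committed myself to this project; it is probably ten or more by now. I have been working through the code lately, and since writing things down beats relying on memory, I am recording the results of the last few days here. Partly this is to keep myself on track, and partly to share the work in the hope that it will one day help beginners like me. I know I am being somewhat overambitious, but a challenge like this is also a kind of fun. Once this experiment is done I will settle down and study machine learning properly. (The open courses I watched earlier always felt too direct and one-sided. A note for beginners getting into machine learning: the online courses (Ng's original version; the Stanford one on NetEase Open Courses is OK) suit people who are not computer science or math majors. They are very direct and blunt, and self-study is how learning at university really works.) I have a dream of one day becoming an engineer and scientist who develops intelligent products. Keep working, keep going! Signed, someone suffering through those wretched major courses (!~!)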

In this first post I go through the utility files utils.lua and net_utils.lua from the neuraltalk2 code, plus the data loading file DataLoader.lua.

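A disclaimer first: the code is not my own work but is reproduced from the original project; the link to the code was given in my previous post.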

utils.lua
This utility file consists of six functions: utils.getopt(opt,key,default_value), utils.read_json(path), utils.write_json(path,j), utils.dict_average(dicts), utils.count_keys(t) and utils.average_values(t). Some of them are self-explanatory, so I will not go into every detail.

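First point
- It helps to know what the JSON format is; the Baidu Baike introduction is quite good. The code only uses a few functions from the cjson package (see the CJSON documentation): decode() converts JSON text into a table that Lua can work with, encode() does the reverse and serializes a Lua table into JSON text, and encode_sparse_array() configures how sparse arrays (arrays with missing elements) are handled during encoding.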
utils.getopt(opt, key, default_value)
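- In short, this function checks whether opt holds a value for key. If it does, it returns opt[key]; if not, it returns default_value. If the key is missing and no default_value was passed in, getopt raises the error "error: required key ' .. key .. ' was not provided in an opt."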
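The behaviour described above can be sketched in plain Lua. This is a restatement of the description rather than a guaranteed copy of the source, and the `opt` table and key names below are made-up examples.

```lua
-- Sketch of the lookup-with-default behaviour described above.
local function getopt(opt, key, default_value)
  -- A missing key with no default is treated as a required option.
  if default_value == nil and (opt == nil or opt[key] == nil) then
    error('error: required key ' .. key .. ' was not provided in an opt.')
  end
  if opt == nil then return default_value end
  local v = opt[key]
  if v == nil then v = default_value end
  return v
end

local opt = { batch_size = 16 }               -- hypothetical options table
local a = getopt(opt, 'batch_size', 5)        -- key present: the stored value wins
local b = getopt(opt, 'seq_per_img', 5)       -- key absent: the default is returned
local ok, err = pcall(getopt, opt, 'split')   -- absent and no default: an error is raised
print(a, b, ok)                               -- 16  5  false
```

Calling getopt without a default is therefore a simple way to mark an option as required, which is how getBatch treats 'split' later in this post.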
utils.read_json(path)
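- This function reads the JSON file at path and returns it as a Lua table.
utils.write_json(path,j)
- This function first configures cjson's sparse-array encoding, then encodes j as JSON and writes the result to the file at path.
utils.dict_average(dicts)
- First, the structure this function works on: dicts is a list whose elements are tables of k:v pairs, and the function assumes that every dicts[n] has the same keys. It returns a table dict in which each key maps to the average of the values stored under that key across all the tables in dicts.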

function utils.dict_average(dicts)
  local dict = {}
  local n = 0
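  -- iterate over every element of the list (each one is a table)
  for i,d in pairs(dicts) do
    -- iterate over the element's pairs and add each value into dict, creating the entry at 0 if key k is not there yet
    for k,v in pairs(d) do
      if dict[k] == nil then dict[k] = 0 end
      dict[k] = dict[k] + v
    end
    -- count the number of elements in the list
    n=n+1
  end
  -- iterate over the new table and divide each accumulated value by n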
  for k,v in pairs(dict) do
    dict[k] = dict[k] / n -- produce the average
  end
  return dict
end
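
A small worked example may make the averaging concrete. The function is restated compactly so the snippet runs on its own, and the metric names and numbers are invented for illustration.

```lua
-- Compact restatement of dict_average so this snippet is self-contained.
local function dict_average(dicts)
  local sums, n = {}, 0
  for _, d in pairs(dicts) do
    for k, v in pairs(d) do
      sums[k] = (sums[k] or 0) + v
    end
    n = n + 1
  end
  for k in pairs(sums) do sums[k] = sums[k] / n end
  return sums
end

-- Two hypothetical evaluation results with the same keys.
local avg = dict_average({
  { loss = 2, accuracy = 0.5 },
  { loss = 4, accuracy = 1.0 },
})
-- avg.loss is 3 and avg.accuracy is 0.75

-- If a key is missing from some tables, its sum is still divided by n.
local uneven = dict_average({ { a = 1 }, { a = 3, b = 4 } })
-- uneven.a is 2 and uneven.b is 2 (4 divided by 2 tables, not by 1)
```

The second call shows why the "same keys everywhere" assumption matters: a key that appears in only some tables gets an average that is too small.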

utils.count_keys(t)
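- Takes a table t and returns the number of k:v pairs in it. At first I did not see what this was for (@.@); one could joke that Karpathy dislikes the # operator. The practical reason is that # only counts array-style integer keys, while tables such as ix_to_word are keyed by strings (decode_sequence looks words up with tostring(ix)), so # would not count their entries.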
utils.average_values(t)
- Takes a table t and returns the average of all its values.
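A plain-Lua sketch of these two helpers, written from the descriptions above rather than copied from the source. The vocabulary table is a made-up example of what a decoded JSON object looks like.

```lua
local function count_keys(t)
  local n = 0
  for _ in pairs(t) do n = n + 1 end
  return n
end

local function average_values(t)
  local n, total = 0, 0
  for _, v in pairs(t) do
    total = total + v
    n = n + 1
  end
  return total / n
end

-- JSON object keys are strings, so a decoded vocabulary looks like this:
local ix_to_word = { ['1'] = 'a', ['2'] = 'dog', ['3'] = 'runs' }
print(#ix_to_word)             -- 0: the length operator ignores string keys
print(count_keys(ix_to_word))  -- 3
local avg = average_values({ x = 1, y = 2, z = 6 })  -- (1 + 2 + 6) / 3
```

This is the situation in which a key-counting helper is needed instead of the # operator.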

net_utils.lua
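- This file has roughly two parts: the net_utils helper functions and the nn.FeatExpander class. The class inherits from nn.Module; readers unfamiliar with nn.Module can look at my other posts, although that one may not be published yet. I plan to add an analysis of nn.Module with worked examples later.
nn.FeatExpander
- This class inherits from nn.Module and so has its usual properties. Every class derived from Module exposes an input and an output. FeatExpander takes the input tensor and repeats it n times to form the output tensor, so it has no trainable network parameters.
- layer:__init(n) initializes the parent class and stores n, the number of times the input is repeated.
- layer:updateOutput(input) is one of the methods a subclass of nn.Module is expected to override. It defines how the output is computed, so its return value is the output tensor. The details are in the annotated code below.
- layer:updateGradInput(input, gradOutput) also overrides a method of nn.Module. It defines how the error arriving at the output is propagated back to the input, so it returns the gradient with respect to the input. The details are in the annotated code below.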

function layer:updateOutput(input)
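  -- if n is 1 no expansion is needed, so this acts as a no-op
  if self.n == 1 then self.output = input; return self.output end -- act as a noop for efficiency
  -- simply expands out the features. Performs a copy information
  -- check that the input tensor is 2-dimensional
  assert(input:nDimension() == 2)
  -- size of the second dimension
  local d = input:size(2)
  -- resize the output: the first dimension is the input's first dimension times n, the second dimension is unchanged
  self.output:resize(input:size(1)*self.n, d)
  -- copy the rows one at a time, grouped along the first dimension
  for k=1,input:size(1) do
    -- k is the row of the input; j is the first row in output that holds the copies of input row k
    local j = (k-1)*self.n+1
    -- the data is laid out in blocks: rows j through j+self.n-1 all hold the same data. expand() itself allocates no new memory (the expanded view does not really exist); the assignment then copies it into output. See the official Torch documentation for details.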
    self.output[{ {j,j+self.n-1} }] = input[{ {k,k}, {} }]:expand(self.n, d) -- copy over
  end
  return self.output
end

function layer:updateGradInput(input, gradOutput)
  -- n is 1: no-op
  if self.n == 1 then self.gradInput = gradOutput; return self.gradInput end -- act as noop for efficiency
  -- add up the gradients for each block of expanded features
  -- resize self.gradInput to the size of input
  self.gradInput:resizeAs(input)
  -- size of the input's second dimension (the number of features per row)
  local d = input:size(2)
  -- loop up to input:size(1), the number of distinct rows of data, which makes indexing straightforward
  for k=1,input:size(1) do
    -- j is the first row in output that corresponds to input row k; gradOutput has the same size as output
    local j = (k-1)*self.n+1
    -- sum the block's rows column by column, i.e. add up the gradients of the copies of the same data
    self.gradInput[k] = torch.sum(gradOutput[{ {j,j+self.n-1} }], 1)
  end
  return self.gradInput
end
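
The forward and backward passes above can be mimicked without Torch. The sketch below uses nested Lua tables in place of tensors; the helper names expand_rows and sum_blocks are mine, not part of neuraltalk2.

```lua
-- Forward: repeat every row n times, keeping the copies of one row adjacent.
local function expand_rows(input, n)
  local output = {}
  for k = 1, #input do
    local j = (k - 1) * n + 1            -- first output row belonging to input row k
    for r = j, j + n - 1 do
      local row = {}
      for c = 1, #input[k] do row[c] = input[k][c] end
      output[r] = row
    end
  end
  return output
end

-- Backward: the gradient of input row k is the sum over its n copies.
local function sum_blocks(grad_output, n)
  local grad_input = {}
  for k = 1, math.floor(#grad_output / n) do
    local j = (k - 1) * n + 1
    local acc = {}
    for r = j, j + n - 1 do
      for c = 1, #grad_output[r] do
        acc[c] = (acc[c] or 0) + grad_output[r][c]
      end
    end
    grad_input[k] = acc
  end
  return grad_input
end

local out = expand_rows({ {1, 2}, {3, 4} }, 3)   -- 6 rows: three copies of each input row
local grad = sum_blocks({ {1, 1}, {2, 2}, {3, 3}, {10, 0}, {20, 0}, {30, 0} }, 3)
print(#out, out[4][1], grad[1][1], grad[2][1])   -- 6  3  6  60
```

Summing is the right backward rule here because each input row feeds n output rows, so its gradient is the total of the gradients flowing back from all n copies.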

net_utils.build_cnn(cnn,opt)
- This function builds the CNN from the options in opt and a cnn model loaded from Caffe. The model is VGG-16; the annotated code follows.

function net_utils.build_cnn(cnn, opt)
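  -- utils.getopt(a,b,c): c is the default value
  -- layer_num is the number of layers to take from the Caffe CNN
  local layer_num = utils.getopt(opt, 'layer_num', 38)
  -- backend selects which library provides the layers; the default is cudnn (GPU)
  local backend = utils.getopt(opt, 'backend', 'cudnn')
  -- encoding_size is the length of the vector the CNN should finally output
  local encoding_size = utils.getopt(opt, 'encoding_size', 512)
  -- backend setup: 'cudnn' loads the cudnn package (GPU), 'nn' loads the nn package (CPU)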
  if backend == 'cudnn' then
    require 'cudnn'
    backend = cudnn
  elseif backend == 'nn' then
    require 'nn'
    backend = nn
  else
    error(string.format('Unrecognized backend "%s"', backend))
  end

  -- copy over the first layer_num layers of the CNN
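  -- nn.Sequential() is a container that chains modules in order; readers unfamiliar with containers can look at my other posts, though I may not have written that one yet (^_^)! I will add it when I get the chance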
  local cnn_part = nn.Sequential()
  for i = 1, layer_num do
    -- get the module for each layer
    local layer = cnn:get(i)
    if i == 1 then
      -- convert kernels in first conv layer into RGB format instead of BGR,
      -- which is the order in which it was trained in Caffe
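      -- convert the BGR-ordered weights to RGB, because images used for training in Caffe have their color channels in BGR order.
      -- clone() the weights. Note that clone() allocates new memory and copies the values, so w is independent of layer.weight; that is why the swap below can still read the original values after layer.weight has been overwritten.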
      local w = layer.weight:clone()
      -- swap weights to R and B channels
      print('converting first layer conv filters from BGR to RGB...')
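      -- a note on the layout of the conv weights: they form a 4D tensor whose first dimension is the number of filters (output planes), whose second dimension is the input color channels (size 3 in the first layer), and whose third and fourth dimensions are the kernel height and width.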
      layer.weight[{ {}, 1, {}, {} }]:copy(w[{ {}, 3, {}, {} }])
      layer.weight[{ {}, 3, {}, {} }]:copy(w[{ {}, 1, {}, {} }])
    end
    -- add the layer to the network
    cnn_part:add(layer)
  end
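  -- at this point cnn_part ends at the last layer copied from the CNN
  -- append a linear map to encoding_size dimensions; this shows that the last copied VGG-16 layer outputs 4096 values, which matches the paper.
  cnn_part:add(nn.Linear(4096,encoding_size))
  -- add a nonlinearity, here ReLU
  -- backend here is the cudnn or nn package table selected above, not a string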
  cnn_part:add(backend.ReLU(true))
  return cnn_part
end
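
The channel swap in the first layer depends on reading from an independent copy of the weights. The plain-Lua sketch below illustrates the idea with strings standing in for the per-channel weight slices; it is only an analogy for the tensor code above, and the placeholder names are invented.

```lua
-- One hypothetical first-layer filter, stored per input channel in Caffe's BGR order.
local filter = { 'weights_for_B', 'weights_for_G', 'weights_for_R' }

-- Independent copy of the original values (the role played by weight:clone()).
local saved = {}
for c = 1, #filter do saved[c] = filter[c] end

-- Swap channels 1 and 3, always reading from the untouched copy.
filter[1] = saved[3]
filter[3] = saved[1]
-- filter is now { 'weights_for_R', 'weights_for_G', 'weights_for_B' }

-- Without a real copy, the second assignment reads a value that was just overwritten.
local broken = { 'weights_for_B', 'weights_for_G', 'weights_for_R' }
local alias = broken              -- same table, not a copy
broken[1] = alias[3]
broken[3] = alias[1]              -- alias[1] is already 'weights_for_R'
-- broken is now { 'weights_for_R', 'weights_for_G', 'weights_for_R' }
```

The second half shows what would go wrong if the "copy" shared storage with the original: the blue-channel weights would be lost.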

net_utils.prepro(imgs,data_augment,on_gpu)
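- This function preprocesses the input images. The VGG-16 network is hardcoded and only applies to images whose width and height are 224, so larger input images have to be cropped first. The annotated code follows.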

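-- takes a batch of batch_size images and preprocesses them. A note on the data layout: imgs is a 4D tensor; the first dimension is batch_size, the second has size 3 for the three color channels, the third is the height and the fourth is the width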
-- takes a batch of images and preprocesses them
-- VGG-16 network is hardcoded, as is 224 as size to forward
-- the VGG-16 network is hardcoded, and 224 is the fixed input size of its first layer
function net_utils.prepro(imgs, data_augment, on_gpu)
  -- check that data_augment and on_gpu were actually passed in
  assert(data_augment ~= nil, 'pass this in. careful here.')
  assert(on_gpu ~= nil, 'pass this in. careful here.')
  -- get the image height and width
  local h,w = imgs:size(3), imgs:size(4)
  local cnn_input_size = 224
  -- cropping data augmentation, if needed
  -- crop if the images are larger than the CNN input size (random crops when augmenting)
  if h > cnn_input_size or w > cnn_input_size then
    local xoff, yoff
    if data_augment then
      -- when augmenting, take a random 224-sized region of the image, so I expect the same data to go through this function more than once. torch.random(a,b) returns a random integer between a and b inclusive; a defaults to 1.
      xoff, yoff = torch.random(w-cnn_input_size), torch.random(h-cnn_input_size)
    else
      -- sample the center
      -- without augmentation, take the central block of pixels
      xoff, yoff = math.ceil((w-cnn_input_size)/2), math.ceil((h-cnn_input_size)/2)
    end
    -- crop.
    imgs = imgs[{ {}, {}, {yoff,yoff+cnn_input_size-1}, {xoff,xoff+cnn_input_size-1} }]
  end
  -- ship to gpu or convert from byte to float
  -- convert the data type
  if on_gpu then imgs = imgs:cuda() else imgs = imgs:float() end
  -- lazily instantiate vgg_mean
  -- to be honest, as of 2016-8-26 I am not really familiar with the VGG-16 network
  if not net_utils.vgg_mean then
    net_utils.vgg_mean = torch.FloatTensor{123.68, 116.779, 103.939}:view(1,3,1,1) -- in RGB order
  end
  -- typeAs() returns the tensor converted to the same type as imgs
  net_utils.vgg_mean = net_utils.vgg_mean:typeAs(imgs) -- a noop if the types match
  -- center the data using vgg_mean
  -- subtract vgg mean
  imgs:add(-1, net_utils.vgg_mean:expandAs(imgs))
  -- this preprocessing step is VGG-16's mean subtraction
  return imgs
  -- returns the processed data
end
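
To make the crop arithmetic concrete, here is a small plain-Lua calculation for the 256x256 images that the data loader produces later in this post. Only the center-crop branch is shown, and the pixel values are invented.

```lua
local cnn_input_size = 224
local h, w = 256, 256                       -- image size stored by the data loader

-- Center crop (the data_augment == false branch).
local xoff = math.ceil((w - cnn_input_size) / 2)
local yoff = math.ceil((h - cnn_input_size) / 2)
local x_last = xoff + cnn_input_size - 1    -- last column kept by the crop
local crop_width = x_last - xoff + 1
print(xoff, yoff, x_last, crop_width)       -- 16  16  239  224

-- Mean subtraction for one hypothetical RGB pixel.
local vgg_mean = { 123.68, 116.779, 103.939 }   -- values from the code above, RGB order
local pixel = { 130, 120, 100 }
local centered = {}
for c = 1, 3 do centered[c] = pixel[c] - vgg_mean[c] end
```

The crop keeps columns 16 through 239, which is exactly 224 pixels, and the same holds for the rows. After centering, channel values can be negative, which is expected.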

net_utils.list_nngraph_modules(g)
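- I will not go into this one: g is a gModule, and the function returns the list of modules in the nngraph model
net_utils.listModules(net)
- No details here either: it returns the modules of net as a list
net_utils.sanitize_gradients(net),net_utils.unsanitize_gradients(net)
- These two functions are a matched pair: one clears the gradients and the other restores them
net_utils.decode_sequence(ix_to_word,seq)
- This function does the decoding. Its two arguments, ix_to_word and seq, are the mapping from index to string (English words) and the sequences to decode. Annotated code: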

--[[
take a LongTensor of size DxN with elements 1..vocab_size+1
(where last dimension is END token), and decode it into table of raw text sentences.
each column is a sequence. ix_to_word gives the mapping to strings, as a table
--]]
function net_utils.decode_sequence(ix_to_word, seq)
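  -- what D and N stand for: N is the number of sequences, usually batch_size, and D is seq_length
  local D,N = seq:size(1), seq:size(2)
  -- the table of sentences to return
  local out = {}
  for i=1,N do
    local txt = ''
    -- walk through one sequence
    for j=1,D do
      -- take the j-th index of sequence i
      local ix = seq[{j,i}]
      -- convert ix to a string and look it up in ix_to_word to get the actual English word
      local word = ix_to_word[tostring(ix)]
      -- if there is no such word the sequence has ended, so stop
      if not word then break end -- END token, likely. Or null token
      -- .. is string concatenation
      -- Karpathy is really particular about formatting; could he be a Virgo !_!
      -- separate consecutive words with a space
      if j >= 2 then txt = txt .. ' ' end
      txt = txt .. word
    end
    -- insert the sentence into the output table
    table.insert(out, txt)
  end
  -- out holds all the sentences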
  return out
end
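
The decoding loop can be tried without Torch by replacing the LongTensor with a nested table. In this sketch seq[j][i] plays the role of seq[{j,i}], table.concat replaces the manual string concatenation, and the tiny vocabulary is invented.

```lua
-- seq[j][i] is the j-th token index of the i-th sequence (D rows, N columns).
local function decode_sequence(ix_to_word, seq)
  local D, N = #seq, #seq[1]
  local out = {}
  for i = 1, N do
    local words = {}
    for j = 1, D do
      local word = ix_to_word[tostring(seq[j][i])]
      if not word then break end         -- END or padding index: stop this sentence
      words[#words + 1] = word
    end
    out[i] = table.concat(words, ' ')
  end
  return out
end

local ix_to_word = { ['1'] = 'a', ['2'] = 'dog', ['3'] = 'runs', ['4'] = 'cat' }
local seq = {
  { 1, 1 },
  { 2, 4 },
  { 3, 5 },   -- 5 = vocab_size + 1 has no entry, so column 2 stops here
  { 5, 3 },   -- never reached for column 2
}
local sentences = decode_sequence(ix_to_word, seq)
-- sentences[1] is 'a dog runs' and sentences[2] is 'a cat'
```

Each column is one sentence, and decoding of a column stops at the first index that has no entry in ix_to_word.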

net_utils.clone_list(list)
- Copies a list; note that this is a deep copy.
net_utils.language_eval(predictions,id)
- This function evaluates the predictions; the code is written in an amusing way

DataLoader.lua
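- This file consists of the DataLoader class, which loads the data. It imports the hdf5 package. Beginners like me may ask what the hdf5 package is: it is another data-handling package for Torch (not funny at all), used to read files in HDF5 format.
DataLoader:__init(opt)
- No more talk, straight to the code.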

function DataLoader:__init(opt)

  -- load the json file which contains additional information about the dataset
  print('DataLoader loading json file: ', opt.json_file)
  self.info = utils.read_json(opt.json_file)
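  -- ix_to_word maps a word index to the word itself
  self.ix_to_word = self.info.ix_to_word
  -- vocab_size is the number of words; the special END token is not in this table and is represented by index vocab_size+1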
  self.vocab_size = utils.count_keys(self.ix_to_word)
  print('vocab size is ' .. self.vocab_size)
  -- open the hdf5 file
  print('DataLoader loading h5 file: ', opt.h5_file)
  self.h5_file = hdf5.open(opt.h5_file, 'r')
  -- extract image size from dataset
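  -- returns the size of each dimension of images; anyone who wants to dig deeper can study (https://github.com/deepmind/torch-hdf5/blob/master/luasrc/dataset.lua), which I have not read yet
  -- images_size[1] is the number of images, images_size[2] the number of channels, and images_size[3], images_size[4] the image height and width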
  local images_size = self.h5_file:read('/images'):dataspaceSize()
  assert(#images_size == 4, '/images should be a 4D tensor')
  assert(images_size[3] == images_size[4], 'width and height must match')
  self.num_images = images_size[1]
  self.num_channels = images_size[2]
  self.max_image_size = images_size[3]
  print(string.format('read %d images of size %dx%dx%d', self.num_images,
            self.num_channels, self.max_image_size, self.max_image_size))

  -- load in the sequence data
  local seq_size = self.h5_file:read('/labels'):dataspaceSize()
  -- seq_size[1] should be the number of sequences and seq_size[2] the sequence length, i.e. seq_length
  self.seq_length = seq_size[2]
  print('max sequence length in data is ' .. self.seq_length)
  -- load the pointers in full to RAM (should be small enough)
  -- note that these give, for every image, the positions of its first and last caption in /labels
  self.label_start_ix = self.h5_file:read('/label_start_ix'):all()
  self.label_end_ix = self.h5_file:read('/label_end_ix'):all()
  -- separate out indexes for each of the provided splits
  self.split_ix = {}
  -- the iterators: one read position per split
  self.iterators = {}
  -- self.info.images is the image information from the JSON file
  for i,img in pairs(self.info.images) do
    -- img.split is the image's split label, one of "train", "val" or "test"
    local split = img.split
    if not self.split_ix[split] then
      -- initialize new split
      self.split_ix[split] = {}
      self.iterators[split] = 1
    end
    -- add the image's index to the table for its split
    table.insert(self.split_ix[split], i)
  end
  -- print how many images each split received
  for k,v in pairs(self.split_ix) do
    print(string.format('assigned %d images to split %s', #v, k))
  end
end
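
The split bookkeeping at the end of the constructor is easy to try on its own. The images table below is a made-up stand-in for self.info.images, and ipairs is used instead of pairs so that the order is deterministic.

```lua
-- Hypothetical stand-in for self.info.images (real entries also carry id, file_path, ...).
local images = {
  { split = 'train' }, { split = 'val' }, { split = 'train' },
  { split = 'test' },  { split = 'train' },
}

local split_ix, iterators = {}, {}
for i, img in ipairs(images) do
  local split = img.split
  if not split_ix[split] then
    split_ix[split] = {}        -- first image seen for this split
    iterators[split] = 1        -- each split starts reading at its first image
  end
  table.insert(split_ix[split], i)
end
-- split_ix.train is {1, 3, 5}, split_ix.val is {2}, split_ix.test is {4}
```

Each split ends up with the list of image indices that belong to it, plus its own read position, which is what getBatch walks through.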

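resetIterator(split),getVocabSize(),getVocab(),getSeqLength()
- These functions need no explanation
DataLoader:getBatch(opt)
- This function returns one batch of batch_size images together with their captions; see the annotated code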

--[[
  Split is a string identifier (e.g. train|val|test)
  Returns a batch of data:
  - X (N,3,H,W) containing the images
  - y (L,M) containing the captions as columns (which is better for contiguous memory during training)
  - info table of length N, containing additional information
  The data is iterated linearly in order. Iterators for any split can be reset manually with resetIterator()
--]]
function DataLoader:getBatch(opt)
  -- split selects which data to fetch: train|val|test
  local split = utils.getopt(opt, 'split') -- lets require that user passes this in, for safety
  -- get batch_size
  local batch_size = utils.getopt(opt, 'batch_size', 5) -- how many images get returned at one time (to go through CNN)
  local seq_per_img = utils.getopt(opt, 'seq_per_img', 5) -- number of sequences to return per image

  -- split_ix holds the indices of the images in this split
  local split_ix = self.split_ix[split]
  assert(split_ix, 'split ' .. split .. ' not found.')
  -- allocate the tensor for the image batch
  -- pick an index of the datapoint to load next
  local img_batch_raw = torch.ByteTensor(batch_size, 3, 256, 256)
  -- allocate the tensor for the label batch
  local label_batch = torch.LongTensor(batch_size * seq_per_img, self.seq_length)
  -- the largest valid position is the length of split_ix; the # operator returns that length
  local max_index = #split_ix
  local wrapped = false
  local infos = {}
  for i=1,batch_size do

    local ri = self.iterators[split] -- get next index from iterator
    local ri_next = ri + 1 -- increment iterator
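    -- going past the last position means one full pass over the split is complete
    if ri_next > max_index then ri_next = 1; wrapped = true end -- wrap back around
    -- store the new position in self.iterators[split] for the next fetch
    self.iterators[split] = ri_next
    -- get the index of the image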
    ix = split_ix[ri]
    assert(ix ~= nil, 'bug: split ' .. split .. ' was accessed out of bounds with ' .. ri)

    -- fetch the image from h5
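    -- img is a 4D tensor: the first dimension is 1 (the range is {ix,ix}, a single image), the second is {1,self.num_channels} covering the three channels, and the remaining two are the image size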
    local img = self.h5_file:read('/images'):partial({ix,ix},{1,self.num_channels},
                            {1,self.max_image_size},{1,self.max_image_size})
    -- store the image in the batch
    img_batch_raw[i] = img
    -- fetch the sequence labels
    -- first get the start and end positions of the captions for ix, namely ix1 and ix2
    local ix1 = self.label_start_ix[ix]
    local ix2 = self.label_end_ix[ix]
    -- number of captions describing this image
    local ncap = ix2 - ix1 + 1 -- number of captions available for this image
    assert(ncap > 0, 'an image does not have any label. this can be handled but right now isn\'t')
    local seq
    -- check whether the number of captions reaches the configured seq_per_img
    if ncap < seq_per_img then
      -- we need to subsample (with replacement)
      -- too few captions: fill up by sampling with replacement
      seq = torch.LongTensor(seq_per_img, self.seq_length)
      for q=1, seq_per_img do
        local ixl = torch.random(ix1,ix2)
        -- the picks are random, so the same caption may well be drawn more than once
        seq[{ {q,q} }] = self.h5_file:read('/labels'):partial({ixl, ixl}, {1,self.seq_length})
      end
    else
      -- there is enough data to read a contiguous chunk, but subsample the chunk position
      -- enough captions: take a contiguous run, but with a random first caption
      local ixl = torch.random(ix1, ix2 - seq_per_img + 1) -- generates integer in the range
      seq = self.h5_file:read('/labels'):partial({ixl, ixl+seq_per_img-1}, {1,self.seq_length})
    end
    -- il is the first row in label_batch for the i-th image
    local il = (i-1)*seq_per_img+1
    -- store seq in label_batch
    label_batch[{ {il,il+seq_per_img-1} }] = seq
    -- and record associated info as well
    local info_struct = {}
    info_struct.id = self.info.images[ix].id
    info_struct.file_path = self.info.images[ix].file_path
    table.insert(infos, info_struct)
  end
  local data = {}
  data.images = img_batch_raw
  -- swap dimensions 1 and 2
  data.labels = label_batch:transpose(1,2):contiguous() -- note: make label sequences go down as columns
  data.bounds = {it_pos_now = self.iterators[split], it_max = #split_ix, wrapped = wrapped}
  data.infos = infos
  return data
end
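
Two pieces of index arithmetic in getBatch are worth tracing by hand: the iterator that wraps around at the end of a split, and the row at which each image's captions start in label_batch. The sketch below reproduces only that bookkeeping in plain Lua; the image indices and starting position are invented.

```lua
local split_ix = { 11, 12, 13 }    -- hypothetical image indices for one split
local iterator = 3                 -- position carried over from the previous call
local batch_size, seq_per_img = 2, 5

local picked, first_rows, wrapped = {}, {}, false
for i = 1, batch_size do
  local ri = iterator
  local ri_next = ri + 1
  if ri_next > #split_ix then ri_next = 1; wrapped = true end  -- start over
  iterator = ri_next
  picked[i] = split_ix[ri]
  first_rows[i] = (i - 1) * seq_per_img + 1   -- first caption row for image i
end
-- picked is {13, 11}, first_rows is {1, 6}, wrapped is true, iterator is 2
```

With 5 captions per image, image 1 occupies rows 1 to 5 of label_batch and image 2 occupies rows 6 to 10. The wrapped flag lets the caller detect that a full pass over the split has finished, for example to stop an evaluation loop.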