TensorFlow GPU memory, model loading, and optimizers (personal notes)


In the short time I have spent experimenting with TensorFlow I have run into quite a few problems. I am writing down the ones I still remember here, so I can look them up later.

1. After running sess = tf.Session() or sess = tf.InteractiveSession(), the memory of every GPU is fully occupied

A: This is normal: by default TensorFlow grabs all the memory on every visible GPU. To be more economical with resources, do the following:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # which GPU(s) this program may see (and use)
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # cap this process at 30% of the GPU's memory. With only this line and not the next, an OOM error is raised as soon as the cap cannot satisfy the program's needs.
config.gpu_options.allow_growth = True  # allocate memory on demand instead of all at once; with only this line, usage starts low and grows as the program runs. (This does not lift the fraction cap above; it only makes allocation incremental.)
sess = tf.Session(config=config)
# sess = tf.InteractiveSession(config=config)

Note, however, that memory once occupied is never released back automatically, even if the process is no longer using that much.
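
If memory really has to be handed back between experiments, the only reliable way I know of is to run each experiment in its own process, since TensorFlow releases GPU memory when the process exits. A minimal sketch of that pattern (my own addition, not from the original workflow):

import multiprocessing

def run_experiment():
    import tensorflow as tf  # import inside the child so TF initializes there
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        pass  # build and run the graph here

p = multiprocessing.Process(target=run_experiment)
p.start()
p.join()  # the GPU memory is freed when the child process exits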

2. I wrote a basic network, trained it, and saved it as a model. Then I added several layers on top of the basic network, intending to continue training from the trained base, but restoring the model raised an error

A: Expected behavior: tf.train.Saver() notices that the saved model does not contain the Variables you just added, so the model does not match and it raises an error. Finding the fix took me a whole day; perhaps my keywords were off, since Baidu turned up nothing and only an English Google search solved it. Look at the constructor of the tf.train.Saver() class:

__init__(
    var_list=None,
    reshape=False,
    sharded=False,
    max_to_keep=5,
    keep_checkpoint_every_n_hours=10000.0,
    name=None,
    restore_sequentially=False,
    saver_def=None,
    builder=None,
    defer_build=False,
    allow_empty=False,
    write_version=tf.train.SaverDef.V2,
    pad_step_number=False,
    save_relative_paths=False,
    filename=None
)

The official explanation of the first argument, var_list, is:

specifies the variables that will be saved and restored. It can be
passed as a dict or a list

So we can explicitly choose which Variables get saved and restored, passing them as a dict or a list. The first step, then, is to find out which Variables exist.

all_variables = tf.contrib.framework.get_variables_to_restore()  # returns a list with the info of every Variable in the graph
variables_to_restore = []   # as the name suggests
variables_not_restore = []
for v in all_variables:
    if v.name.split('/')[0] != 'New_layer':
        variables_to_restore.append(v)
    else:
        variables_not_restore.append(v)
saver = tf.train.Saver(var_list=variables_to_restore, max_to_keep=1, write_version=1)
saver.restore(sess, './model-1')
sess.run(tf.variables_initializer(var_list=variables_not_restore))  # Variables that were not restored still need to be initialized
# tf.variables_initializer(var_list=variables_not_restore).run()

With all the Variables of the current graph in hand, I need to separate the ones that exist in the pretrained model (variables_to_restore) from the newly added ones that must not be restored (variables_not_restore). Because all my new variables were defined under with tf.variable_scope('New_layer'), the first component of their names is 'New_layer', which makes old and new variables easy to tell apart. The pretrained model then loads cleanly, as the code shows. Note that Variables that were not restored still need to be initialized; tf.variables_initializer(var_list) initializes exactly the specified Variables. Incidentally, tf.global_variables_initializer() is just shorthand for tf.variables_initializer(tf.global_variables()), as the official docs say.
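
For completeness, var_list can also be a dict that maps names as they appear in the checkpoint to Variable objects in the current graph, which is handy when a scope was renamed between saving and restoring. A hypothetical sketch ('old_scope' and 'new_scope' are made-up names for illustration):

# checkpoint keys are the variables' op names (v.op.name, without the ':0' suffix)
name_map = {v.op.name.replace('new_scope', 'old_scope'): v
            for v in variables_to_restore}
saver = tf.train.Saver(var_list=name_map)
saver.restore(sess, './model-1')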

3. tf.nn.rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None) implements an LSTM cell whose inputs are the current data x_t and the previous hidden state h_{t-1}, but I need a cell that takes two current inputs, x_t and y_t, together with h_{t-1}

A: I suspect TensorFlow provides such an LSTM class, but being unfamiliar with the API I could not find it, so I had no choice but to write my own special-purpose LSTM class modeled on the implementation of tf.nn.rnn_cell.BasicLSTMCell(), mainly by rewriting the _linear() and __call__() functions. Since I had kept state_is_tuple=False from the start, the implementation simply enforces state_is_tuple=False (not recommended). The code follows; a short usage sketch comes after the class:

class MyLSTM(tf.contrib.rnn.RNNCell):
    def __init__(self, num_units, forget_bias=1.0, state_is_tuple=False, activation=None, reuse=None):
        super(MyLSTM, self).__init__(_reuse=reuse)
        assert not state_is_tuple, "state_is_tuple must be False in this implementation"
        self._num_units = num_units
        self._forget_bias = forget_bias
        self._state_is_tuple = state_is_tuple
        self._activation = activation or tf.tanh
    def _linear(self, args, output_size, bias, bias_initializer=None, kernel_initializer=None):
        # args: a list of 2-D Tensors, each of shape [batch, n]
        total_arg_size = 0
        shapes = [ a.get_shape() for a in args]
        for shape in shapes:
            if shape.ndims != 2:
                raise ValueError("linear is expecting 2D arguments: %s" % shapes)
            if shape[1].value is None:
                raise ValueError("linear expects shape[1] to be provided for shape %s, "
                                 "but saw %s" % (shape, shape[1]))
            else:
                total_arg_size += shape[1].value
        dtype = [a.dtype for a in args][0]

        scope = tf.get_variable_scope()
        with tf.variable_scope(scope) as outer_scope:
            weights = tf.get_variable( "kernel",shape=[total_arg_size,output_size],dtype=dtype,initializer=kernel_initializer)
            if len(args) == 1:
                res = tf.matmul( args[0],weights)
            else:
                res = tf.matmul( tf.concat(args,axis=1), weights )
            if not bias:
                return res
            with tf.variable_scope(outer_scope) as inner_scope:
                inner_scope.set_partitioner(None)
                if bias_initializer is None:
                    bias_initializer = tf.constant_initializer(value=0.0, dtype=dtype)
                biases = tf.get_variable("bias",[output_size],dtype=dtype, initializer=bias_initializer)
            return tf.nn.bias_add(res, biases)

    @property
    def state_size(self):
        return 2*self._num_units

    @property
    def output_size(self):
        return self._num_units
    def __call__(self, input1, input2, state):
        """
        :param input1: `2-D` tensor with shape `[batch_size x input_size]`
        :param input2: same shape as input1
        :param state: a `Tensor` shaped `[batch_size x self.state_size]` (i.e. `[batch_size x 2*num_units]`, since state_is_tuple=False)
        :return: new_h, concat([new_c, new_h], 1)
        """
        sigmoid = tf.sigmoid
        c, h = tf.split( value=state, num_or_size_splits=2, axis=1)

        concat = self._linear( [input1,input2,h], 4*self._num_units,True) # (batch_size, 4*self._num_units)
        i, j, f, o = tf.split(value=concat, num_or_size_splits=4, axis=1)
        new_c = ( c*sigmoid(f+self._forget_bias) + sigmoid(i)*self._activation(j))
        new_h = self._activation(new_c) * sigmoid(o)

        if self._state_is_tuple:
            new_state = tf.nn.rnn_cell.LSTMStateTuple(new_c,new_h)
        else:
            new_state = tf.concat(values=[new_c,new_h],axis=1 )
        return new_h, new_state
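
Because __call__ takes two inputs, tf.nn.dynamic_rnn cannot drive this cell directly, so the time loop has to be unrolled by hand. A rough usage sketch (the shapes and scope name are my own, for illustration):

num_steps, batch_size, input_size, num_units = 10, 32, 128, 256
xs = tf.placeholder(tf.float32, [num_steps, batch_size, input_size])
ys = tf.placeholder(tf.float32, [num_steps, batch_size, input_size])

cell = MyLSTM(num_units, state_is_tuple=False)
state = tf.zeros([batch_size, cell.state_size])  # [batch, 2*num_units]
outputs = []
with tf.variable_scope('my_lstm') as scope:
    for t in range(num_steps):
        if t > 0:
            scope.reuse_variables()  # share kernel/bias across time steps
        output, state = cell(xs[t], ys[t], state)
        outputs.append(output)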

4. The network is not large, yet GPU memory usage explodes when execution reaches train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss) (allow_growth is already set)

A: Originally I wrote it like this:

train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss)

After Googling, I changed it to:

train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss,aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)

or:

train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)

and that solved it. It is presumably because I compute tf_loss inside a for loop; I never pinned down the exact cause, but it works.
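
For context, the pattern that triggers this is roughly the sketch below (build_step, the shapes, and the names are hypothetical). My best guess at the cause: the default aggregation sums all the per-iteration gradient terms with a single AddN, which keeps every term alive in memory at once, whereas EXPERIMENTAL_ACCUMULATE_N adds them up as they are produced.

tf_loss = 0.0
for t in range(num_steps):
    pred = build_step(inputs[t])  # hypothetical per-step subgraph with shared weights
    tf_loss += tf.reduce_mean(tf.square(pred - targets[t]))

train_op = tf.train.AdamOptimizer(learning_rate).minimize(
    tf_loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)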

References:

Problem 2: https://stackoverflow.com/questions/42217320/restore-variables-that-are-a-subset-of-new-model-in-tensorflow
           https://github.com/DrSleep/tensorflow-deeplab-resnet/issues/11
Problem 3: https://stackoverflow.com/questions/45439045/tensorflow-rnn-input-of-two-different-types
Problem 4: https://stackoverflow.com/questions/36194394/how-i-reduce-memory-consumption-in-a-loop-in-tensorflow
