What 99% of Algorithm Engineers Don't Know: The Right Way to Use cuDNN-Accelerated LSTM in TensorFlow

tylunas

Last modified: 2023-03-09 22:13:29

First published: 2023-03-06 16:39:26

Article link: https://blog.csdn.net/tylunas/article/details/129359669

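RNNs are the units commonly used in neural networks for sequential data, and LSTM is the most widely used of them. Even though Transformers now dominate almost everywhere, RNNs still have their uses on sequence data.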

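An LSTM has to run step by step in time order, and the gaps between its kernel launches are large, so it is often criticized as inefficient. That leaves room for optimization, and different LSTM implementations differ in efficiency (see the Chinese version here).
I had long heard that Nvidia's cuDNN library provides custom acceleration for RNN cells such as LSTM and GRU, several times faster than the plain LSTMCell. The article "LSTM优化之路" (The Road to LSTM Optimization), for example, has an introduction and sample code; a quick trial run does give a multi-fold speedup.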

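However, actually shipping this with TensorFlow, training on GPU while still being able to run inference on CPU, raises quite a few problems.

After a long exploration, it seems that only two routes currently work. The rest of this post uses LSTM as the example to show how to use CudnnRNN well.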

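First, note that the acceleration has a cost. The CudnnRNN family is a layer, not an RNN cell. One layer can contain several stacked cell layers, and a dropout rate can be set between them. But you cannot stack the cells yourself, so something like Tree-LSTM is hard to implement.

Some bloggers test CudnnCompatibleLSTMCell directly and then conclude "there is no speedup at all". That is because they are using it the wrong way. CudnnCompatibleLSTMCell is just an LSTMBlockCell wrapped for exporting models; to accelerate that, you would have to rely on TensorRT.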

The problem

A model saved directly contains a node named cudnn_lstm/opaque_kernel_saveable:0, and this node needs a GPU to execute. Without one, the following error is reported:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' Registered devices: [CPU], …

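In short, a model saved directly, without changing its structure, cannot run inference on CPU.

To make cross-device inference possible, the correct way to use the CudnnRNN layer is: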

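1. Build the model graph, using CudnnRNN for the RNN layer;
2. Train on an Nvidia GPU;
3. Save checkpoint1;
4. Build a compatible model graph, using the compatible implementation for the RNN layer;
5. Load checkpoint1 into the compatible model graph;
6. If you want a *.pb file for inference, export the compatible model graph as a *.pb file;
7. If you need a checkpoint for cross-device inference (as with ELMo), save the compatible model graph as checkpoint2.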

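Method 1: use tf.keras.layers.CuDNNLSTM

Internally this layer calls gen_cudnn_rnn_ops.cudnn_rnn() and hands its own data to the cuDNN-optimized operator.
When saved, the model is independent of cuDNN, so on CPU you can simply create a tf.keras.layers.LSTM and load the weights into it.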

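from tensorflow.keras.layers import Bidirectional, CuDNNLSTM, LSTM

    def rnn_layer(self, rnn_inputs, hidden_size, sequence_lengths, name=None):
        # sequence_lengths and name are unused here; they keep the signature
        # identical to the other rnn_layer variants in this post.
        if self.use_gpu:
            # cuDNN-backed layer, runs on GPU only
            layer = Bidirectional(CuDNNLSTM(hidden_size, return_sequences=True))
        else:
            # CPU layer; tanh / sigmoid activations match what CuDNNLSTM computes
            layer = Bidirectional(LSTM(hidden_size, activation='tanh', recurrent_activation='sigmoid', return_sequences=True))
        output = layer(rnn_inputs)
        return output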

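Method 2: use CudnnCompatibleLSTMCell correctly

An older TensorFlow version may not have complete tf.keras support. What then?

In fact, early versions already shipped the CudnnRNN operators together with CPU-compatible modules; CudnnLSTM, for example, pairs with CudnnCompatibleLSTMCell. In my tests these modules are present as far back as tf1.4.

However, complete documentation and sample code are hard to find online.

(So, to stress it twice more: do not use CudnnCompatibleLSTMCell directly! Do not use CudnnCompatibleLSTMCell directly!)

For the actual usage you have to read the docstring of the CudnnRNN class. It contains some usage notes, but they are not very clear.
For example, by tf1.10 the compatible block was already named CudnnCompatibleLSTMCell, yet the documentation still said CudnnCompatibleLSTM. tf1.15 corrected the name but still added no further explanation.

A complete, runnable example is the test code in the TensorFlow source, tensorflow\contrib\cudnn_rnn\python\kernel_tests\cudnn_rnn_test.py. The file itself is very long, though.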

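After working through it, the key points are:

Unidirectional RNNs must first be wrapped in MultiRNNCell and then in tf.nn.dynamic_rnn;
Bidirectional RNNs must first be turned into a forward list and a backward list, one cell per layer, and then wrapped in tf.contrib.rnn.stack_bidirectional_dynamic_rnn;
The GPU graph and the CPU graph need separate Graph, Session and Saver objects, which must not be mixed;
The Saver that saves the model must not be restricted to any subset of variables;
The CudnnLSTM layer must not be placed under any scope.

Although the documentation does not stress it, this point is just as important:

CudnnRNN expects the time step as the first dimension of its input, so stack_bidirectional_dynamic_rnn must correspondingly be set to time_major, and the input tensor must be transposed to [num_steps, batch_size, input_size].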
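
The axis swap can be illustrated without TensorFlow. The sketch below uses nested Python lists and a hypothetical helper, to_time_major (not part of any library), to mirror what tf.transpose(x, perm=[1, 0, 2]) does to a [batch_size, num_steps, input_size] tensor.

```python
def to_time_major(batch_major):
    """Swap the first two axes of a nested list.

    [batch_size, num_steps, input_size] -> [num_steps, batch_size, input_size]
    Assumes every example has the same number of steps (already padded).
    """
    # zip(*...) walks all examples in lockstep, one time step at a time.
    return [list(step) for step in zip(*batch_major)]


# Toy input: batch_size=2, num_steps=3, input_size=1.
batch_major = [
    [[1], [2], [3]],  # example 0
    [[4], [5], [6]],  # example 1
]
time_major = to_time_major(batch_major)

print(len(batch_major), len(batch_major[0]))  # batch first: 2 examples, 3 steps
print(len(time_major), len(time_major[0]))    # time first: 3 steps, 2 examples
```

Applying the same swap to the RNN output brings it back to batch-major, which is why the full code transposes twice.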

But! Even after doing every one of the steps above, I still got an error:

NotFoundError (see above for traceback): Restoring from checkpoint failed. Original error:
Key stack_bidirectional_rnn/cell_0/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias not found in checkpoint [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"] (arg save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices) ]]

In the end it was this StackOverflow answer that gave the solution:

The compatible layer must be placed inside the "cudnn_lstm" scope!

With that added, it finally worked…
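
Why the scope matters can be shown with plain string matching. This is a minimal sketch: the variable names are hypothetical, modeled on the key quoted in the error message, and it assumes the GPU checkpoint stores the weights under a "cudnn_lstm/" prefix, as the fix implies. The helpers with_scope and missing_keys are illustrative only.

```python
def with_scope(scope, names):
    """Prefix every variable name with a scope, the way tf.variable_scope does."""
    return ["{}/{}".format(scope, name) for name in names]


def missing_keys(graph_variable_names, checkpoint_keys):
    """Return the graph variables that have no matching key in the checkpoint."""
    available = set(checkpoint_keys)
    return sorted(name for name in graph_variable_names if name not in available)


# Hypothetical names, modeled on the key in the NotFoundError message.
PREFIX = "stack_bidirectional_rnn/cell_0/bidirectional_rnn"
cell_variables = [
    "{}/{}/cudnn_compatible_lstm_cell/{}".format(PREFIX, direction, weight)
    for direction in ("fw", "bw")
    for weight in ("kernel", "bias")
]

# Assumption: the checkpoint written by the GPU graph uses the "cudnn_lstm/" prefix.
checkpoint_keys = with_scope("cudnn_lstm", cell_variables)

# Compatible cells built without the scope: none of the names match.
unscoped_missing = missing_keys(cell_variables, checkpoint_keys)
# Compatible cells built inside variable_scope("cudnn_lstm"): every name matches.
scoped_missing = missing_keys(with_scope("cudnn_lstm", cell_variables), checkpoint_keys)

print(unscoped_missing)
print(scoped_missing)
```

In a real TF 1.x environment, the keys a checkpoint actually contains can be listed with tf.train.list_variables and compared against the graph's variable names in the same way.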

So the final code looks like this:

import tensorflow as tf
from tensorflow.contrib import rnn, cudnn_rnn

model

    def rnn_layer(self, rnn_inputs, hidden_size, sequence_lengths, name=None):
        """
        :param rnn_inputs: [batch_size, num_steps, embed_size]
        :param hidden_size: int, dim of the rnn cell.
        :param sequence_lengths: vector [batch_size], step of each instance
        :return: [batch_size, num_steps, 2*hidden_size]
        """
        # with tf.variable_scope("char_bi_lstm" if not name else name):
        # Need time-major
        transposed = tf.transpose(rnn_inputs, perm=[1, 0, 2])  # [num_steps, batch_size, embed_size]
        if self.use_gpu:
            layer = cudnn_rnn.CudnnLSTM(num_layers=1, num_units=hidden_size, 
                                        direction=cudnn_rnn.CUDNN_RNN_BIDIRECTION,
                                        kernel_initializer=self.initializer,
                                        bias_initializer=self.initializer)
            # output `[num_steps, batch_size, num_dirs * hidden_size]`
            output, final_states = layer(inputs=transposed, training=self.is_training)
        else:
            lstm_cell = {}
            with tf.variable_scope("cudnn_lstm"):
                for direction in ["forward", "backward"]:
                    with tf.variable_scope(direction):
                        lstm_cell[direction] = [cudnn_rnn.CudnnCompatibleLSTMCell(hidden_size)]
                (output, output_state_fw,
                 output_state_bw) = rnn.stack_bidirectional_dynamic_rnn(cells_fw=lstm_cell["forward"],
                                                                        cells_bw=lstm_cell["backward"],
                                                                        dtype=tf.float32,
                                                                        sequence_length=sequence_lengths,
                                                                        inputs=transposed, time_major=True)
        output = tf.transpose(output, perm=[1, 0, 2])  # [batch_size, num_steps, num_dirs * hidden_size]
        return output

trainer

    def train(self, x_train, y_train, x_test, y_test, embeddings, num_classes, config, fold=-1):
        sess_config = self.init(config.gpu_id)  # initialize the session config, etc.
        # CPU-compatible graph, with its own Session and Saver
        self.cpu_graph = tf.Graph()
        with self.cpu_graph.as_default():
            self.cpu_model = CudnnLstmModel(num_classes, embeddings.shape[0], config, use_gpu=False)
            self.cpu_sess = tf.Session(graph=self.cpu_graph, config=sess_config)
            self.cpu_saver = tf.train.Saver()
        # GPU training graph, again with its own Session and Saver
        graph = tf.Graph()
        with graph.as_default():
            self.model = CudnnLstmModel(num_classes, embeddings.shape[0], config, use_gpu=True)
            self.sess = tf.Session(graph=graph, config=sess_config)
            self.saver = tf.train.Saver(max_to_keep=1)
        ...


    def save_best_model(self, path: str, global_step=1):
        """ save and export models
        """
        if not path.endswith('/'):
            path = path + '/'  # exactly one trailing '/': an extra one breaks the low-level file check and restore fails
        saved_path = self.saver.save(self.sess, path + 'gpu_model.ckpt', global_step)
        self.module_logger.info("Saved Best {}_{} model to {}\n".format(self.type_id, self.model_id, saved_path))
        # Restore the GPU checkpoint into the CPU-compatible graph, then freeze it
        self.cpu_saver.restore(self.cpu_sess, saved_path)
        output_graph_def = tf.graph_util.convert_variables_to_constants(self.cpu_sess, self.cpu_sess.graph_def,
                                                                        output_node_names=self.model.output_nodes())
        constant_model_path = path + "best_model_export.pb"  # already includes the directory
        with tf.gfile.FastGFile(constant_model_path, mode='wb') as f:
            f.write(output_graph_def.SerializeToString())
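
save_best_model depends on the directory ending in exactly one '/'. A sturdier alternative might look like the sketch below. It uses only the standard library, and the helper name checkpoint_prefix is hypothetical, not part of the original code.

```python
import os


def checkpoint_prefix(directory, filename):
    """Join a directory and a file name with exactly one separator."""
    # normpath drops trailing and repeated separators before joining.
    return os.path.join(os.path.normpath(directory), filename)


# All three spellings of the directory give the same checkpoint prefix.
for directory in ("models", "models/", "models//"):
    print(checkpoint_prefix(directory, "gpu_model.ckpt"))
```

The same helper could build the *.pb path, so the directory is only ever prepended once.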

Finally, the Saver used for loading the model can in fact be restricted to a subset of variables.

Ablation Studies (the wrong ways)

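Specifying tf.global_variables() for the CudnnLSTM Saver also produces the same error.
Using the CudnnLSTM Saver to do the restore produces the following error:

TypeError: Cannot interpret feed_dict key as Tensor:
The name 'save_1/Const:0' refers to a Tensor which does not exist. The operation, 'save_1/Const', does not exist in the graph.

Feeding stack_bidirectional_dynamic_rnn an untransposed input, or leaving time_major unset, raises no error when the model is saved, but inference is bound to fail.
One more wrong way concerns installing the right cuDNN version. I hit this once in my tf1.5 environment: cuDNN 7.1 had been installed instead of 7.0.5. CNN and LSTM training seemed perfectly normal, but as soon as CudnnLSTM was used a low-level CUDA error was reported.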

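Afterword

Those are the pitfalls I ran into while using CudnnLSTM. Set against a 6x to 8x speedup, the experiments were worth it.

If you do not want to save the model twice and only want some extra performance, tensorflow.contrib.rnn.LSTMBlockFusedCell is also a good choice. This layer can cut training time to 40% to 30% of the original, a speedup of roughly 2.5x to 3.3x.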
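
Converting "training time cut to 40% to 30%" into a speedup factor is just a reciprocal. A small sketch of the arithmetic, where the helper name speedup is hypothetical:

```python
def speedup(remaining_time_fraction):
    """Speedup factor when a job takes this fraction of its original time."""
    if not 0 < remaining_time_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / remaining_time_fraction


print(round(speedup(0.40), 2))  # time cut to 40% of the original
print(round(speedup(0.30), 2))  # time cut to 30% of the original
```

By the same arithmetic, the 6x to 8x speedup quoted for CudnnLSTM corresponds to roughly 17% to 12.5% of the original training time.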

Note that although LSTMBlockFusedCell has "Cell" in its name, it is a layer, just like CudnnRNN. See my usage example:

    def rnn_layer(self, rnn_inputs, hidden_size, sequence_lengths, name=None):
        """
        :param rnn_inputs: [batch_size, num_steps, embed_size]
        :param hidden_size: int, dim of the rnn cell.
        :param sequence_lengths: vector [batch_size], step of each instance
        :return: [batch_size, num_steps, 2*hidden_size] 
        """
        with tf.variable_scope("bi-lstm" if not name else name):
            transposed = tf.transpose(rnn_inputs, perm=[1, 0, 2])  # Need time-major [num_steps, batch_size, embed_size]
            forward_rnn = rnn.LSTMBlockFusedCell(hidden_size)
            backward_rnn = rnn.LSTMBlockFusedCell(hidden_size)
            backward_rnn = rnn.TimeReversedFusedRNN(backward_rnn)
            output_fw_seq, _ = forward_rnn(inputs=transposed, sequence_length=sequence_lengths, dtype=tf.float32, scope="lstm-fw")
            # output: [num_step, batch_size, hidden_size]
            output_bw_seq, _ = backward_rnn(inputs=transposed, sequence_length=sequence_lengths, dtype=tf.float32, scope="lstm-bw")
            # output: [num_step, batch_size, hidden_size]
            output = tf.concat([output_fw_seq, output_bw_seq], axis=-1)
            output = tf.transpose(output, perm=[1, 0, 2])  # [batch_size, sequence_length, 2*hidden_size]
        return output