Deep Learning: Recurrent Neural Networks (RNN) and a Python Implementation

An RNN (Recurrent Neural Network) is a class of neural networks designed for processing sequential data.

Main reference for this article:
https://blog.csdn.net/zhaojc1995/article/details/80572098

1. RNN

The figure below illustrates the forward pass of a standard RNN.

(Figure: computation graph of a standard RNN unrolled over time.)
Here $x$ is the input, $h$ is the hidden unit, $o$ is the output, $L$ is the loss function, $y$ is the training label, and $t$ indexes the time step. $U$ holds the input-to-hidden weights, $W$ the weights from the previous hidden state to the current one, and $V$ the hidden-to-output weights. $U$, $W$, and $V$ are the model parameters, i.e. the quantities that are updated iteratively during training.

1.1 RNN forward propagation

The forward pass at time step $t$ is as follows.
The hidden-layer pre-activation:

$$a^{(t)} = Ux^{(t)} + Wh^{(t-1)} + b$$

The hidden-layer post-activation:

$$h^{(t)} = g(a^{(t)}) = g(Ux^{(t)} + Wh^{(t-1)} + b)$$

where $g(\cdot)$ is the activation function and $b$ is the bias (effectively $b = b_a + b_h$, the sum of the two bias terms).

The output pre-activation at time $t$:

$$o^{(t)} = Vh^{(t)} + c$$

The output post-activation at time $t$:

$$\hat y^{(t)} = \sigma(o^{(t)})$$
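
To make the notation concrete, here is a minimal numpy sketch of a single forward step, assuming $g = \tanh$ and $\sigma$ = softmax (common choices, though the formulas above do not mandate them). All sizes and values are illustrative only.

import numpy as np

n_in, n_h, n_y = 3, 5, 2                      # toy sizes
x_t = np.random.randn(n_in, 1)                # input x^(t) as a column vector
h_prev = np.zeros((n_h, 1))                   # previous hidden state h^(t-1)
U = np.random.randn(n_h, n_in)                # input-to-hidden weights
W = np.random.randn(n_h, n_h)                 # hidden-to-hidden weights
V = np.random.randn(n_y, n_h)                 # hidden-to-output weights
b = np.zeros((n_h, 1))
c = np.zeros((n_y, 1))

a_t = U @ x_t + W @ h_prev + b                # hidden pre-activation a^(t)
h_t = np.tanh(a_t)                            # hidden post-activation h^(t) = g(a^(t))
o_t = V @ h_t + c                             # output pre-activation o^(t)
y_hat = np.exp(o_t) / np.exp(o_t).sum()       # y_hat^(t) = sigma(o^(t)), softmax here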

1.2 RNN backpropagation: BPTT

BPTT (back-propagation through time) is the standard algorithm for training RNNs; it is essentially ordinary backpropagation unrolled over the time dimension.

There are three weight matrices to optimize: $V$, $W$, and $U$.

Since $V$ is not involved in the propagation through time, its gradient is straightforward:

$$\begin{aligned} \frac {\partial L^{(t)}} {\partial V} &= \frac {\partial L^{(t)}} {\partial \hat y^{(t)}} \frac {\partial \hat y^{(t)}} {\partial o^{(t)}} \frac {\partial o^{(t)}} {\partial V} \\ &= \frac {\partial L^{(t)}} {\partial \hat y^{(t)}} \frac {\partial \hat y^{(t)}} {\partial o^{(t)}} \frac {\partial [Vh^{(t)} + c]} {\partial V} \\ &= L'^{(t)}(\hat y^{(t)}) \, \sigma'(o^{(t)}) \, h^{(t)} \end{aligned}$$

Similarly:

$$\frac {\partial L^{(t)}} {\partial c} = L'^{(t)}(\hat y^{(t)}) \, \sigma'(o^{(t)})$$

Next, the derivative of $L^{(t)}$ with respect to $h^{(t)}$ (needed for the recurrent terms below):
$$\begin{aligned} \frac {\partial L^{(t)}} {\partial h^{(t)}} &= \frac {\partial L^{(t)}} {\partial \hat y^{(t)}} \frac {\partial \hat y^{(t)}} {\partial o^{(t)}} \frac {\partial o^{(t)}} {\partial h^{(t)}} \\ &= \frac {\partial L^{(t)}} {\partial \hat y^{(t)}} \frac {\partial \hat y^{(t)}} {\partial o^{(t)}} \frac {\partial [Vh^{(t)} + c]} {\partial h^{(t)}} \\ &= L'^{(t)}(\hat y^{(t)}) \, \sigma'(o^{(t)}) \, V \end{aligned}$$
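
The output-side gradients are easy to check in code. The sketch below assumes a squared-error loss $L = \tfrac{1}{2}\lVert \hat y - y \rVert^2$ and an identity output activation $\sigma$ (so $\sigma'(o)=1$); the transposes are what the matrix form of the scalar-style formulas above requires.

import numpy as np

n_h, n_y = 4, 3
h_t = np.random.randn(n_h, 1)       # hidden state h^(t)
V = np.random.randn(n_y, n_h)
c = np.zeros((n_y, 1))
y = np.random.randn(n_y, 1)         # training target

o_t = V @ h_t + c                   # output pre-activation
y_hat = o_t                         # identity sigma

dL_do = y_hat - y                   # L'(y_hat) * sigma'(o), with sigma'(o) = 1
dL_dV = dL_do @ h_t.T               # dL/dV = L' * sigma'(o) * h^(t)  (outer product)
dL_dc = dL_do                       # dL/dc = L' * sigma'(o)
dL_dh = V.T @ dL_do                 # dL/dh^(t) = L' * sigma'(o) * V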

Because $W$ and $U$ participate in the recurrence through time, their gradients involve the entire history. To keep things concrete, start with the partial derivative of $L^{(3)}$ with respect to $W$:

$$\begin{aligned} \frac {\partial L^{(3)}} {\partial W} &= \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial a^{(3)}} \frac {\partial a^{(3)}} {\partial W} + \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial h^{(2)}} \frac {\partial h^{(2)}} {\partial a^{(2)}} \frac {\partial a^{(2)}} {\partial W} + \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial h^{(2)}} \frac{\partial h^{(2)}} {\partial h^{(1)}} \frac {\partial h^{(1)}} {\partial a^{(1)}} \frac {\partial a^{(1)}} {\partial W} \\ &= \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial a^{(3)}} \frac {\partial a^{(3)}} {\partial W} + \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial a^{(3)}} \frac{\partial a^{(3)}} {\partial h^{(2)}} \frac {\partial h^{(2)}} {\partial a^{(2)}} \frac {\partial a^{(2)}} {\partial W} \\ &\quad + \frac {\partial L^{(3)}} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial a^{(3)}} \frac{\partial a^{(3)}} {\partial h^{(2)}} \frac {\partial h^{(2)}} {\partial a^{(2)}} \frac{\partial a^{(2)}} {\partial h^{(1)}} \frac {\partial h^{(1)}} {\partial a^{(1)}} \frac {\partial a^{(1)}} {\partial W} \\ &= \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)})h^{(2)} + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) h^{(1)} + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)}) h^{(0)} \end{aligned}$$

Similarly, we obtain:

$$\frac {\partial L^{(2)}} {\partial W} = \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) h^{(1)} + \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)}) h^{(0)}$$

$$\frac {\partial L^{(1)}} {\partial W} = \frac {\partial L^{(1)}} {\partial h^{(1)}} g'(a^{(1)}) h^{(0)}$$
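
The three cases above follow one pattern. For a general time step $t$ it can be summarized as (with the empty product for $k=t$ taken to be $1$):

$$\frac{\partial L^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial h^{(t)}} \left( \prod_{j=k+1}^{t} g'(a^{(j)})\, W \right) g'(a^{(k)})\, h^{(k-1)}$$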

Summing over all time steps gives:

$$\frac {\partial L} {\partial W} = \sum_{t=1}^{T} \frac {\partial L^{(t)}} {\partial W}$$

The same derivation gives the derivatives with respect to $U$:

$$\frac {\partial L^{(3)}} {\partial U} = \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) x^{(3)} + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) x^{(2)} + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)}) x^{(1)}$$

$$\frac {\partial L^{(2)}} {\partial U} = \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) x^{(2)} + \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)}) x^{(1)}$$

$$\frac {\partial L^{(1)}} {\partial U} = \frac {\partial L^{(1)}} {\partial h^{(1)}} g'(a^{(1)}) x^{(1)}$$

$$\frac {\partial L} {\partial U} = \sum_{t=1}^{T} \frac {\partial L^{(t)}} {\partial U}$$

And likewise for the bias $b$:

$$\frac {\partial L^{(3)}} {\partial b} = \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) + \frac {\partial L^{(3)}} {\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)})$$

$$\frac {\partial L^{(2)}} {\partial b} = \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) + \frac {\partial L^{(2)}} {\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)})$$

$$\frac {\partial L^{(1)}} {\partial b} = \frac {\partial L^{(1)}} {\partial h^{(1)}} g'(a^{(1)})$$

$$\frac {\partial L} {\partial b} = \sum_{t=1}^{T} \frac {\partial L^{(t)}} {\partial b}$$
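
The following scalar sketch checks the $t \le 3$ formulas above numerically: every quantity is treated as a scalar, $g = \tanh$, and the per-step loss is simply $L^{(t)} = h^{(t)}$, so $\partial L^{(t)}/\partial h^{(t)} = 1$. The helper names (forward, gp, etc.) are illustrative only.

import numpy as np

U, W, b = 0.4, 0.7, 0.1                  # toy scalar parameters
x = [0.5, -1.2, 0.8]                     # inputs x^(1), x^(2), x^(3)
g = np.tanh
gp = lambda a: 1.0 - np.tanh(a) ** 2     # g'(a) for g = tanh

def forward(W):
    """Run h^(t) = g(U x^(t) + W h^(t-1) + b) for t = 1..3."""
    h = 0.3                              # nonzero h^(0) so every term below contributes
    hs, a_s = [h], []
    for xt in x:
        a = U * xt + W * h + b
        h = g(a)
        a_s.append(a)
        hs.append(h)
    return hs, a_s

(h0, h1, h2, h3), (a1, a2, a3) = forward(W)

# dL^(t)/dW written exactly as in the expansions above, with dL^(t)/dh^(t) = 1
dL3_dW = gp(a3)*h2 + gp(a3)*W*gp(a2)*h1 + gp(a3)*W*gp(a2)*W*gp(a1)*h0
dL2_dW = gp(a2)*h1 + gp(a2)*W*gp(a1)*h0
dL1_dW = gp(a1)*h0
dL_dW = dL1_dW + dL2_dW + dL3_dW         # dL/dW = sum over t

# Central-difference check on L = h^(1) + h^(2) + h^(3)
eps = 1e-6
L = lambda W_: sum(forward(W_)[0][1:])
numeric = (L(W + eps) - L(W - eps)) / (2 * eps)
print(dL_dW, numeric)                    # the two values should agree closely
# The gradients w.r.t. U and b follow the same pattern, with x^(k) (resp. 1) in place of h^(k-1).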

2. Implementing an RNN in Python

The code below is adapted from: https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/neural_nets/layers

import numpy as np

# Tanh and SGD are assumed to be the activation and optimizer classes from the
# numpy-ml project linked above (numpy_ml.neural_nets.activations / .optimizers).
from numpy_ml.neural_nets.activations import Tanh
from numpy_ml.neural_nets.optimizers import SGD


class RNNCell(object):
    """RNN cell: runs a single time step of the recurrence."""
    def __init__(self, n_out, act_fn="Tanh", optimizer=None):
        self.n_in = None
        self.n_out = n_out
        self.n_timesteps = None
        self.params = {
            "U": None,
            "W": None,
            "b": None
        }

        # For simplicity the activation and optimizer are hardcoded to Tanh and
        # SGD, regardless of the act_fn / optimizer arguments.
        self.act_fn = Tanh()
        self.optimizer = SGD()

        self.is_initialized = False

    def __str__(self):
        return 'RNNCell(n_in={}, n_out={})'.format(str(self.n_in), str(self.n_out))

    def __call__(self, X):
        return self.forward(X)

    def __init_params(self):
        """
        Lazily initialize the parameters once the input dimension n_in is known.
        """
        self.X = []
        U = np.random.randn(self.n_in, self.n_out)
        W = np.random.randn(self.n_out, self.n_out)
        b = np.zeros((self.n_out, 1))
        self.params = {
            "U": U,
            "W": W,
            "b": b
        }
        # Gradient buffers, accumulated in backward() and consumed by update()
        self.gradients = {
            "U": np.zeros_like(U),
            "W": np.zeros_like(W),
            "b": np.zeros_like(b)
        }
        # Cached intermediate results from the forward pass, needed by backward()
        self.derived_variables = {
            "h": [],
            "a": [],
            "n_timesteps": 0,
            "current_step": 0,
            "dLdh_accumulator": None
        }

        self.is_initialized = True

    def forward(self, X):
        """
        Forward pass for a single time step.
        :param X: numpy array of shape (n_ex, n_in)
        :return: hidden state h_t, numpy array of shape (n_ex, n_out)
        """
        if not self.is_initialized:
            self.n_in = X.shape[1]
            self.__init_params()
           
        self.derived_variables["n_timesteps"] += 1
        self.derived_variables["current_step"] += 1

        b = self.params['b']
        U = self.params['U']
        W = self.params['W']

        # Hidden-state history; seed it with h^(0) = 0 on the first time step
        h_s = self.derived_variables['h']
        if len(h_s) == 0:
            n_ex, n_in = X.shape
            h_s.append(np.zeros((n_ex, self.n_out)))

        a_t = h_s[-1] @ W + X @ U + b.T   # a^(t) = U x^(t) + W h^(t-1) + b
        h_t = self.act_fn(a_t)            # h^(t) = g(a^(t))

        self.derived_variables['a'].append(a_t)
        self.derived_variables['h'].append(h_t)

        self.X.append(X)

        return h_t

    def backward(self, dLdh):
        """
        Backward pass for a single time step; accumulates the parameter gradients.
        :param dLdh: gradient of the loss w.r.t. this step's hidden state,
                     numpy array of shape (n_ex, n_out)
        :return: gradient of the loss w.r.t. this step's input X
        """
        self.derived_variables['current_step'] -= 1

        a_s = self.derived_variables['a']
        h_s = self.derived_variables['h']
        t = self.derived_variables['current_step']
        dh_acc = self.derived_variables['dLdh_accumulator']

        if dh_acc is None:
            dh_acc = np.zeros_like(h_s[0])

        # Fetch the current parameter values
        U = self.params['U']
        W = self.params['W']
        b = self.params['b']


        dh = dLdh + dh_acc                     # add the gradient flowing back from step t+1
        da = self.act_fn.grad(a_s[t]) * dh     # dL/da^(t) = g'(a^(t)) * dL/dh^(t)
        dX_t = da @ U.T                        # gradient w.r.t. this step's input

        self.gradients['W'] += h_s[t].T @ da   # accumulate dL/dW across time steps
        self.gradients['U'] += self.X[t].T @ da
        self.gradients['b'] += da.sum(axis=0, keepdims=True).T

        # gradient to pass on to h^(t-1) at the next backward call
        self.derived_variables['dLdh_accumulator'] = da @ W.T

        return dX_t


    def flush_gradients(self):
        """Reset accumulated gradients and cached intermediate results between iterations."""
        self.X = []

        self.gradients = {
            "U": np.zeros_like(self.params['U']),
            "W": np.zeros_like(self.params['W']),
            "b": np.zeros_like(self.params['b'])
        }

        self.derived_variables = {
            "h": [],
            "a": [],
            "n_timesteps": 0,
            "current_step": 0,
            "dLdh_accumulator": None
        }

    def update(self):
        """Apply one gradient update to the parameters."""
        if self.optimizer is None:
            # Fallback: plain gradient descent with a fixed learning rate
            self.params['W'] -= 0.01 * self.gradients['W']
            self.params['U'] -= 0.01 * self.gradients['U']
            self.params['b'] -= 0.01 * self.gradients['b']
        else:
            self.params['W'] = self.optimizer(self.params['W'], self.gradients['W'], 'W')
            self.params['U'] = self.optimizer(self.params['U'], self.gradients['U'], 'U')
            self.params['b'] = self.optimizer(self.params['b'], self.gradients['b'], 'b')
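
Before wiring the cell into a full RNN, a quick finite-difference check of RNNCell.backward can catch bookkeeping mistakes. This is only a sketch: the "loss" is taken to be the sum of every hidden state, so the incoming gradient dLdh is a matrix of ones at each step.

def total_loss(cell, X):
    """Re-run the forward pass from a clean state; the loss is the sum of all hidden states."""
    if cell.is_initialized:
        cell.flush_gradients()                 # clear cached states from any previous run
    return sum(cell(X[:, :, t]).sum() for t in range(X.shape[2]))

np.random.seed(0)
n_ex, n_in, n_out, n_t = 4, 3, 6, 5
X = np.random.randn(n_ex, n_in, n_t)

cell = RNNCell(n_out=n_out)
total_loss(cell, X)                            # forward pass, caches the a's and h's
for _ in range(n_t):                           # backprop through time, latest step first
    cell.backward(np.ones((n_ex, n_out)))
analytic = cell.gradients['W'][0, 0]

eps = 1e-5                                     # central difference for the same entry of W
cell.params['W'][0, 0] += eps
loss_plus = total_loss(cell, X)
cell.params['W'][0, 0] -= 2 * eps
loss_minus = total_loss(cell, X)
cell.params['W'][0, 0] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
print(analytic, numeric)                       # should agree to several decimal places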


class RNN(object):
    """Chains an RNNCell across the time steps of a sequence."""
    def __init__(self, n_out, act_fn="Tanh", optimizer=None):
        self.n_in = None
        self.n_out = n_out

        self.cell = RNNCell(
            n_out = self.n_out,
            act_fn = act_fn,
            optimizer=optimizer
        )

    def __str__(self):
        return "RNN"

    def __call__(self, X):
        return self.forward(X)

    def forward(self, X):
        """
        Forward pass over a whole sequence.
        :param X: numpy array of shape (n_ex, n_in, n_t), where n_ex is the number of
                  examples, n_in the number of input features, and n_t the number of
                  time steps
        :return: hidden states, numpy array of shape (n_ex, n_out, n_t)
        """
        Y = []
        n_ex, n_in, n_t = X.shape
        for t in range(n_t):
            yt = self.cell(X[:, :, t])
            Y.append(yt)
        return np.dstack(Y)

    def backward(self, dLdh):
        """
        Backward pass over a whole sequence.
        :param dLdh: gradient of the loss w.r.t. the hidden states,
                     numpy array of shape (n_ex, n_out, n_t)
        :return: gradient of the loss w.r.t. the inputs, shape (n_ex, n_in, n_t)
        """
        dLdX = []
        n_ex, n_out, n_t = dLdh.shape
        for t in reversed(range(n_t)):
            dLdXt = self.cell.backward(dLdh[:, :, t])
            dLdX.insert(0, dLdXt)
        return np.dstack(dLdX)

    def flush_gradients(self):
        self.cell.flush_gradients()

    def update(self):
        """
        Apply one parameter update, then clear the cached computations and gradients.
        """
        self.cell.update()
        self.cell.flush_gradients()
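
Finally, a minimal usage sketch of the RNN layer above, with random inputs and a random stand-in for the gradient coming from a downstream loss layer:

np.random.seed(1)
n_ex, n_in, n_t = 8, 4, 10               # batch size, input features, time steps

rnn = RNN(n_out=16)
X = np.random.randn(n_ex, n_in, n_t)
H = rnn(X)                               # hidden states for every step, shape (n_ex, 16, n_t)

dLdH = np.random.randn(*H.shape)         # stand-in for the gradient from a loss layer
dLdX = rnn.backward(dLdH)                # gradients w.r.t. the inputs, shape (n_ex, 4, n_t)
rnn.update()                             # one SGD step; gradients and caches are then flushed

print(H.shape, dLdX.shape)               # (8, 16, 10) (8, 4, 10)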