CSE599W Lecture 4: Automatic Differentiation

Backward gradient propagation is the basic principle behind training neural networks. This lecture introduces the basic ways automatic differentiation (autodiff) is implemented and closes with a small exercise. There are various notes on this lecture online, but most of them are not very clear, so here I try to write down my own way of understanding it.

Symbolic differentiation

Symbolic differentiation computes an explicit expression for the gradient directly from the formula. Its drawback: for a complicated function the derivative expression can become enormous, and keeping around all of those intermediate symbolic expressions is unnecessary when what we want in the end is just a numeric gradient value.
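As a quick illustration (my own example using SymPy, which the lecture itself does not use): even for a small composed function, the closed-form derivative is already noticeably bulkier than the function, and every term repeats the same subexpressions.

import sympy as sp

x = sp.Symbol("x")
f = sp.exp(sp.sin(x**2)) * sp.cos(x)   # a small composed function
df = sp.diff(f, x)                     # closed-form derivative expression
print(df)
# prints a sum of terms, each repeating exp(sin(x**2));
# for deeper compositions this blow-up gets much worse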

Numerical differentiation

Numerical differentiation is easy to understand: it uses the definition of the derivative to compute an approximate gradient value. Its problems are rounding error and slow computation (every place where a derivative is needed requires running the forward pass again), so in practice it is mainly used to check that gradients are correct.
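A minimal sketch of such a correctness check (my own helper, not part of the assignment): estimate the gradient by central differences and compare it against the analytic one.

import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # central-difference estimate of df/dx for a scalar-valued f and an array x
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        # two extra forward passes per entry of x, hence the slowness
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

x = np.random.randn(3)
f = lambda v: np.sum(v ** 2)
print(np.allclose(numerical_grad(f, x), 2 * x, atol=1e-6))  # True, up to rounding error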

Backpropagation

The basic principle of backpropagation is the chain rule of differentiation.
A neural network is built from a number of operators, and the chain rule lets the backward pass be decomposed so that each op does its own local computation. For the op on the slide, there are two inputs and one output. Because gradients flow backwards, by the time this op does its backward step we already know $\frac{\partial J}{\partial z}$. We then evaluate the op's local derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ according to its differentiation rule, multiply each of them by $\frac{\partial J}{\partial z}$, and the products are exactly the gradients passed back to the upstream ops.

Note that an op's output may feed more than one downstream path. Before starting this op's backward step, the gradients from all of its output paths must already be available; only their sum is $\frac{\partial J}{\partial z}$.
Propagating gradients this way requires one forward (inference) pass first, during which every op's input values, i.e. the network's intermediate results, must be kept, because each op's backward step needs $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$, i.e. the op's derivative $f'(x)$ evaluated at those inputs. Once the forward pass finishes, gradients are propagated back op by op as on the slide.
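For example, a hand-rolled forward/backward step for a single multiply op $z = x \cdot y$ might look like the sketch below (my own illustration in NumPy, not code from the assignment): the forward step saves its inputs, and the backward step turns the incoming gradient into one outgoing gradient per input.

import numpy as np

def mul_forward(x, y):
    z = x * y
    cache = (x, y)          # the forward inputs must be kept for the backward step
    return z, cache

def mul_backward(dJ_dz, cache):
    x, y = cache
    dJ_dx = dJ_dz * y       # dz/dx = y, chained with the incoming gradient dJ/dz
    dJ_dy = dJ_dz * x       # dz/dy = x
    return dJ_dx, dJ_dy

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
z, cache = mul_forward(x, y)
print(mul_backward(np.ones_like(z), cache))   # (array([3., 4.]), array([1., 2.]))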

This scheme is the most basic implementation of gradient backpropagation and matches our naive understanding of the algorithm.

So what is wrong with this approach?

  1. We have to keep every op's input, which can take a considerable amount of memory;
  2. It lacks flexibility; for example, it cannot compute gradients of gradients (as needed by WGAN-GP).

Automatic differentiation

So what is automatic differentiation? First of all, I think its underlying principle is still backpropagation; it is essentially the same as the method above. Autodiff first builds the forward computation graph, and then, for whichever gradients are requested, it adds new branches to this graph; the new output nodes produced by those branches are exactly the gradients we want. Only after all these gradient subgraphs have been built do we feed in data and run the actual computation.

By "computation graph" I simply mean a data structure: it describes how the nodes (intermediate results) are connected, and the edges are the ops.
The slide shows the computation graph built by autodiff. The black part is the original neural network; the red part is what gets added for the gradient computation. Notice the several edges running between the black and red parts: these are exactly the forward intermediate results that have to be kept for the backward computation. Compared with plain backpropagation above, which keeps every op's input, this approach saves a lot of memory.

Building the gradients this way turns the gradient computation into an ordinary graph as well, which makes further graph optimization possible (though I do not know exactly how that optimization is done).

So the remaining work is: given an existing neural network computation graph, how do we add the nodes and edges needed for the gradient computation?

The slides give the general procedure. First, the notation: an overline means the gradient (adjoint) received by that node, and a superscript next to the overline marks one part of that node's gradient. As mentioned above, the gradients from all of a node's output edges have to be summed to get that node's gradient. Here $\overline{x_2}$ has two parts, $\overline{x_2}^{\,1}$ and $\overline{x_2}^{\,2}$, and $\overline{x_2}=\overline{x_2}^{\,1}+\overline{x_2}^{\,2}$.
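This sum is just the multivariable chain rule. If $x_2$ reaches the output $y$ through two intermediate nodes (call them $z_1$ and $z_2$; the names are only for illustration), then

$\overline{x_2}=\frac{\partial y}{\partial x_2}=\frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_2}+\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_2}=\overline{x_2}^{\,1}+\overline{x_2}^{\,2}.$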

The concrete steps of the algorithm are in my comments below; it is worth going through them a few times alongside the slide.

# node: a node; each op's output corresponds to one node
# out: the final output node of the forward computation (a single output is assumed here)
# node_to_grad: the gradients of all nodes are collected here
def gradient(out):
    # the gradient of the output node with respect to itself is 1
    node_to_grad[out] = 1
    # starting from out, walk backwards to find all nodes whose gradients are needed
    nodes = get_node_list(out)
    # traverse the nodes in reverse topological order
    for node in reverse_topo_order(nodes):
        # sum all the partial gradients flowing back into this node; because we traverse
        # in reverse topological order, every partial gradient of this node is already available here
        grad = sum partial adjoints from output edges
        # by the chain rule, compute the partial gradient each input node should receive
        # from this node; the rule itself lives in node.op.gradient
        input_grads = node.op.gradient(input, grad) for input in node.inputs
        # add these partial gradients onto the corresponding nodes' accumulated gradients;
        # they are partial gradients, so we add rather than assign
        add input_grads to node_to_grad
    return node_to_grad
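As a small hand-worked trace of this procedure (my own example, not from the slides): take $y = x_1 x_2 + x_1$. We start with $\overline{y}=1$. The add node passes its gradient unchanged to both inputs, so the multiply node receives $1$ and $x_1$ receives a first part $\overline{x_1}^{\,1}=1$. The multiply node then contributes $\overline{x_2}=x_1$ and a second part $\overline{x_1}^{\,2}=x_2$. In the end node_to_grad holds $x_1 \mapsto 1+x_2$ and $x_2 \mapsto x_1$, which matches $\frac{\partial y}{\partial x_1}=x_2+1$ and $\frac{\partial y}{\partial x_2}=x_1$.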

Finally, the slide compares backpropagation and automatic differentiation side by side. To summarize:
Backpropagation: run the forward pass, then the backward pass, keeping every intermediate forward result for the gradient computation.
Autodiff: first build the network's computation graph, then build the gradient computation graph, and only then feed data and compute. Only part of the intermediate results needs to be kept.

Assignment 1

This lecture comes with an assignment (https://github.com/dlsys-course/assignment1): implement the autodiff procedure described above in Python. My implementation is given directly below.

I think actually doing it once is well worth it; reading only the slides leaves you with a vague, half-formed understanding.

The best order is to write the forward computation first and then the backward gradient computation; the gradients function is the implementation of the pseudocode above.

The result can be checked with nosetests -v autodiff_test.py; nosetests can be installed with pip install nose.

import numpy as np

class Node(object):
    """Node in a computation graph."""
    def __init__(self):
        """Constructor, new node is indirectly created by Op object __call__ method.
            
            Instance variables
            ------------------
            self.inputs: the list of input nodes.
            self.op: the associated op object, 
                e.g. add_op object if this node is created by adding two other nodes.
            self.const_attr: the add or multiply constant,
                e.g. self.const_attr=5 if this node is created by x+5.
            self.name: node name for debugging purposes.
        """
        self.inputs = []
        self.op = None
        self.const_attr = None
        self.name = ""

    def __add__(self, other):
        """Adding two nodes return a new node."""
        if isinstance(other, Node):
            new_node = add_op(self, other)
        else:
            # Add by a constant stores the constant in the new node's const_attr field.
            # 'other' argument is a constant
            new_node = add_byconst_op(self, other)
        return new_node

    def __mul__(self, other):
        """DONE: Your code here"""
        if isinstance(other, Node):
            new_node = mul_op(self, other)
        else:
            new_node = mul_byconst_op(self, other)
        return new_node

    # Allow left-hand-side add and multiply.
    __radd__ = __add__
    __rmul__ = __mul__

    def __str__(self):
        """Allow print to display node name.""" 
        return self.name

    __repr__ = __str__

def Variable(name):
    """User defined variables in an expression.  
        e.g. x = Variable(name = "x")
    """
    placeholder_node = placeholder_op()
    placeholder_node.name = name
    return placeholder_node

class Op(object):
    """Op represents operations performed on nodes."""
    def __call__(self):
        """Create a new node and associate the op object with the node.
        
        Returns
        -------
        The new node object.
        """
        new_node = Node()
        new_node.op = self
        return new_node

    def compute(self, node, input_vals):
        """Given values of input nodes, compute the output value.

        Parameters
        ----------
        node: node that performs the compute.
        input_vals: values of input nodes.

        Returns
        -------
        An output value of the node.
        """
        raise NotImplementedError

    def gradient(self, node, output_grad):
        """Given value of output gradient, compute gradient contributions to each input node.

        Parameters
        ----------
        node: node that performs the gradient.
        output_grad: value of output gradient summed from children nodes' contributions

        Returns
        -------
        A list of gradient contributions to each input node respectively.
        """
        raise NotImplementedError

class AddOp(Op):
    """Op to element-wise add two nodes."""
    def __call__(self, node_A, node_B):
        new_node = Op.__call__(self)
        new_node.inputs = [node_A, node_B]
        new_node.name = "(%s+%s)" % (node_A.name, node_B.name)
        return new_node

    def compute(self, node, input_vals):
        """Given values of two input nodes, return result of element-wise addition."""
        assert len(input_vals) == 2
        return input_vals[0] + input_vals[1]

    def gradient(self, node, output_grad):
        """Given gradient of add node, return gradient contributions to each input."""
        return [output_grad, output_grad]

class AddByConstOp(Op):
    """Op to element-wise add a nodes by a constant."""
    def __call__(self, node_A, const_val):
        new_node = Op.__call__(self)
        new_node.const_attr = const_val
        new_node.inputs = [node_A]
        new_node.name = "(%s+%s)" % (node_A.name, str(const_val))
        return new_node

    def compute(self, node, input_vals):
        """Given values of input node, return result of element-wise addition."""
        assert len(input_vals) == 1
        return input_vals[0] + node.const_attr

    def gradient(self, node, output_grad):
        """Given gradient of add node, return gradient contribution to input."""
        return [output_grad]

class MulOp(Op):
    """Op to element-wise multiply two nodes."""
    def __call__(self, node_A, node_B):
        new_node = Op.__call__(self)
        new_node.inputs = [node_A, node_B]
        new_node.name = "(%s*%s)" % (node_A.name, node_B.name)
        return new_node

    def compute(self, node, input_vals):
        """Given values of two input nodes, return result of element-wise multiplication."""
        """DONE: Your code here"""
        assert len(input_vals) == 2
        return input_vals[0] * input_vals[1]

    def gradient(self, node, output_grad):
        """Given gradient of multiply node, return gradient contributions to each input."""
        """DONE: Your code here"""
        return [node.inputs[1] * output_grad, node.inputs[0] * output_grad]


class MulByConstOp(Op):
    """Op to element-wise multiply a nodes by a constant."""
    def __call__(self, node_A, const_val):
        new_node = Op.__call__(self)
        new_node.const_attr = const_val
        new_node.inputs = [node_A]
        new_node.name = "(%s*%s)" % (node_A.name, str(const_val))
        return new_node

    def compute(self, node, input_vals):
        """Given values of input node, return result of element-wise multiplication."""
        """DONE: Your code here"""
        assert len(input_vals) == 1
        return input_vals[0] * node.const_attr

    def gradient(self, node, output_grad):
        """Given gradient of multiplication node, return gradient contribution to input."""
        """DONE: Your code here"""
        return [node.const_attr * output_grad]

class MatMulOp(Op):
    """Op to matrix multiply two nodes."""
    def __call__(self, node_A, node_B, trans_A=False, trans_B=False):
        """Create a new node that is the result a matrix multiple of two input nodes.

        Parameters
        ----------
        node_A: lhs of matrix multiply
        node_B: rhs of matrix multiply
        trans_A: whether to transpose node_A
        trans_B: whether to transpose node_B

        Returns
        -------
        Returns a node that is the result of a matrix multiply of two input nodes.
        """
        new_node = Op.__call__(self)
        new_node.matmul_attr_trans_A = trans_A
        new_node.matmul_attr_trans_B = trans_B
        new_node.inputs = [node_A, node_B]
        new_node.name = "MatMul(%s,%s,%s,%s)" % (node_A.name, node_B.name, str(trans_A), str(trans_B))
        return new_node

    def compute(self, node, input_vals):
        """Given values of input nodes, return result of matrix multiplication."""
        """DONE: Your code here"""
        assert len(input_vals) == 2
        return np.matmul(np.transpose(input_vals[0]) if node.matmul_attr_trans_A else input_vals[0], 
                         np.transpose(input_vals[1]) if node.matmul_attr_trans_B else input_vals[1])

    def gradient(self, node, output_grad):
        """Given gradient of multiply node, return gradient contributions to each input.
            
        Useful formula: if Y=AB, then dA=dY B^T, dB=A^T dY
        """
        """DONE: Your code here"""
        return [matmul_op(output_grad, node.inputs[1], False, True),
                matmul_op(node.inputs[0], output_grad, True, False)]

class PlaceholderOp(Op):
    """Op to feed value to a nodes."""
    def __call__(self):
        """Creates a variable node."""
        new_node = Op.__call__(self)
        return new_node

    def compute(self, node, input_vals):
        """No compute function since node value is fed directly in Executor."""
        assert False, "placeholder values provided by feed_dict"

    def gradient(self, node, output_grad):
        """No gradient function since node has no inputs."""
        return None

class ZerosLikeOp(Op):
    """Op that represents a constant np.zeros_like."""
    def __call__(self, node_A):
        """Creates a node that represents a np.zeros array of same shape as node_A."""
        new_node = Op.__call__(self)
        new_node.inputs = [node_A]
        new_node.name = "Zeroslike(%s)" % node_A.name
        return new_node

    def compute(self, node, input_vals):
        """Returns zeros_like of the same shape as input."""
        assert(isinstance(input_vals[0], np.ndarray))
        return np.zeros(input_vals[0].shape)

    def gradient(self, node, output_grad):
        return [zeroslike_op(node.inputs[0])]

class OnesLikeOp(Op):
    """Op that represents a constant np.ones_like."""
    def __call__(self, node_A):
        """Creates a node that represents a np.ones array of same shape as node_A."""
        new_node = Op.__call__(self)
        new_node.inputs = [node_A]
        new_node.name = "Oneslike(%s)" % node_A.name
        return new_node

    def compute(self, node, input_vals):
        """Returns ones_like of the same shape as input."""
        assert(isinstance(input_vals[0], np.ndarray))
        return np.ones(input_vals[0].shape)

    def gradient(self, node, output_grad):
        return [zeroslike_op(node.inputs[0])]

# Create global singletons of operators.
add_op = AddOp()
mul_op = MulOp()
add_byconst_op = AddByConstOp()
mul_byconst_op = MulByConstOp()
matmul_op = MatMulOp()
placeholder_op = PlaceholderOp()
oneslike_op = OnesLikeOp()
zeroslike_op = ZerosLikeOp()

class Executor:
    """Executor computes values for a given subset of nodes in a computation graph.""" 
    def __init__(self, eval_node_list):
        """
        Parameters
        ----------
        eval_node_list: list of nodes whose values need to be computed.
        """
        self.eval_node_list = eval_node_list

    def run(self, feed_dict):
        """Computes values of nodes in eval_node_list given computation graph.
        Parameters
        ----------
        feed_dict: dict mapping variable nodes to the values supplied by the user.

        Returns
        -------
        A list of values for nodes in eval_node_list. 
        """
        node_to_val_map = dict(feed_dict)
        # Traverse graph in topological sort order and compute values for all nodes.
        topo_order = find_topo_sort(self.eval_node_list)
        """DONE: Your code here"""
        for node in topo_order:
            # values of placeholder nodes are already supplied via feed_dict
            if node in node_to_val_map:
                continue
            input_vals = [node_to_val_map[input_node] for input_node in node.inputs]
            node_to_val_map[node] = node.op.compute(node, input_vals)

        # Collect node values.
        node_val_results = [node_to_val_map[node] for node in self.eval_node_list]
        return node_val_results

def gradients(output_node, node_list):
    """Take gradient of output node with respect to each node in node_list.

    Parameters
    ----------
    output_node: output node that we are taking derivative of.
    node_list: list of nodes that we are taking derivative wrt.
 Returns
        -------
    A list of gradient values, one for each node in node_list respectively.

    """

    # a map from node to a list of gradient contributions from each output node
    node_to_output_grads_list = {}
    # Special note on initializing gradient of output_node as oneslike_op(output_node):
    # We are really taking a derivative of the scalar reduce_sum(output_node)
    # instead of the vector output_node. But this is the common case for loss function.
    node_to_output_grads_list[output_node] = [oneslike_op(output_node)]
    # a map from node to the gradient of that node
    node_to_output_grad = {}
    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
    reverse_topo_order = reversed(find_topo_sort([output_node]))

    """DONE: Your code here"""
    for node in reverse_topo_order:
        assert len(node_to_output_grads_list[node]) > 0
        node_to_output_grad[node] = sum_node_list(node_to_output_grads_list[node])
        grad_back = node.op.gradient(node, node_to_output_grad[node])
        if grad_back == None:
            continue
        for node_input in node.inputs:
            if node_input not in node_to_output_grads_list.keys():
                node_to_output_grads_list[node_input] = []
        node_to_output_grads_list[node.inputs[0]].append(grad_back[0])
        if len(grad_back) == 2:
            node_to_output_grads_list[node.inputs[1]].append(grad_back[1])

    # Collect results for gradients requested.
    grad_node_list = [node_to_output_grad[node] for node in node_list]
    return grad_node_list

##############################
####### Helper Methods ####### 
##############################

def find_topo_sort(node_list):
    """Given a list of nodes, return a topological sort list of nodes ending in them.
    
    A simple algorithm is to do a post-order DFS traversal on the given nodes, 
    going backwards based on input edges. Since a node is added to the ordering
    after all its predecessors are traversed due to post-order DFS, we get a topological
    sort.

    """
    visited = set()
    topo_order = []
    for node in node_list:
        topo_sort_dfs(node, visited, topo_order)
    return topo_order

def topo_sort_dfs(node, visited, topo_order):
    """Post-order DFS"""
    if node in visited:
        return
    visited.add(node)
    for n in node.inputs:
        topo_sort_dfs(n, visited, topo_order)
    topo_order.append(node)

def sum_node_list(node_list):
    """Custom sum function in order to avoid create redundant nodes in Python sum implementation."""
    from operator import add
    from functools import reduce
    return reduce(add, node_list)
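Finally, a minimal usage sketch of the module above (my own example, assuming the code is saved as autodiff.py; the real checks live in autodiff_test.py and may use different names):

import numpy as np
import autodiff as ad

x1 = ad.Variable(name="x1")
x2 = ad.Variable(name="x2")
y = x1 * x2 + x1                               # build the forward graph symbolically

grad_x1, grad_x2 = ad.gradients(y, [x1, x2])   # extend the graph with gradient nodes

executor = ad.Executor([y, grad_x1, grad_x2])
x1_val = 2 * np.ones(3)
x2_val = 3 * np.ones(3)
y_val, g1_val, g2_val = executor.run(feed_dict={x1: x1_val, x2: x2_val})

print(y_val)    # x1*x2 + x1      -> [8. 8. 8.]
print(g1_val)   # dy/dx1 = x2 + 1 -> [4. 4. 4.]
print(g2_val)   # dy/dx2 = x1     -> [2. 2. 2.]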
