[TensorFlow] Reading the BytePS and Horovod Source Code (Part 1)

Latest recommended article published on 2024-01-17 17:42:07

weixin_39937312

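This article walks through the source code of the BytePS and Horovod projects, focusing on how the ctypes module is used to call C++ code and how custom operations are registered within the TensorFlow framework. It also discusses how the file extension of a Python extension library is determined.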


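I plan to write a series of posts reading through the source code of BytePS^[1] and Horovod^[2].

I will start with the file /byteps/tensorflow/ops.py. This file is the Python entry point into the relevant C++ source code and contains operations such as _push_pull and broadcast.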

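[1] - The ctypes module

ctypes^[3] is a foreign function library for Python. It provides C-compatible data types and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.

There are several ways to load a shared library into the Python process. One of them is to instantiate the following class: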

class ctypes.CDLL(name, mode=DEFAULT_MODE, handle=None, use_errno=False, use_last_error=False)

An instance of this class represents a loaded shared library. The BytePS project does exactly this at byteps/tensorflow/ops.py#L63:

TF_LIB_CTYPES = ctypes.CDLL(dll_path, ctypes.RTLD_GLOBAL)

ctypes also defines a number of primitive C-compatible data types. The project uses ctypes.c_char_p, see byteps/tensorflow/ops.py#L91:

TF_LIB_CTYPES.byteps_tensorflow_declare_tensor(ctypes.c_char_p(full_name))

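The related type ctypes.c_wchar_p behaves as follows. Assigning a new value makes the pointer refer to a new memory location rather than modifying the original string (the addresses shown will differ between runs):

>>> from ctypes import c_wchar_p
>>> s = "Hello, World"
>>> c_s = c_wchar_p(s)
>>> print(c_s)
c_wchar_p(139966785747344)
>>> print(c_s.value)
Hello, World
>>> c_s.value = "Hi, there"
>>> print(c_s)              # the memory location has changed
c_wchar_p(139966783348904)
>>> print(c_s.value)
Hi, there
>>> print(s)                # first object is unchanged
Hello, World
>>>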
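
To make the CDLL and c_char_p calls quoted from ops.py more concrete, here is a minimal sketch of the same pattern against the C standard library instead of the BytePS shared library. It assumes a POSIX system (Linux or macOS) on which strlen can be resolved; the helper names load_libc and c_strlen are hypothetical and are not part of BytePS. Note that in Python 3 c_char_p wraps bytes, so a str has to be encoded first.

```python
import ctypes
import ctypes.util


def load_libc():
    """Load the C standard library (assumes a POSIX system)."""
    # find_library returns a name such as "libc.so.6", or None if nothing is found.
    path = ctypes.util.find_library("c")
    # With path=None, CDLL exposes the symbols already loaded into this process.
    return ctypes.CDLL(path, ctypes.RTLD_GLOBAL)


libc = load_libc()
# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the byte length of `text` as computed by C's strlen."""
    # c_char_p wraps bytes, so encode the Python str first.
    return libc.strlen(ctypes.c_char_p(text.encode("utf-8")))


print(c_strlen("hello"))
```

Declaring argtypes and restype is optional for a call like this, but it lets ctypes check the arguments and convert the return value instead of assuming a C int.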

[2] - tensorflow.python: framework and platform

from tensorflow.python.framework import load_library
from tensorflow.python.framework import ops
from tensorflow.python.platform import resource_loader

These lines import modules from the framework and platform packages of TensorFlow^[4]. The functions actually used are load_library.load_op_library and resource_loader.get_path_to_datafile.

First, the load_library file, /framework/load_library.py#L39. The main job of load_op_library is to "load a TensorFlow plugin containing custom ops and kernels", so that these custom ops can then be called just like TensorFlow's built-in ops.

@tf_export('load_op_library')
def load_op_library(library_filename):
  """Loads a TensorFlow plugin, containing custom ops and kernels.
  Pass "library_filename" to a platform-specific mechanism for dynamically
  loading a library. The rules for determining the exact location of the
  library are platform-specific and are not documented here. When the
  library is loaded, ops and kernels registered in the library via the
  `REGISTER_*` macros are made available in the TensorFlow process. Note
  that ops with the same name as an existing op are rejected and not
  registered with the process.
  Args:
    library_filename: Path to the plugin.
      Relative or absolute filesystem path to a dynamic library file.
  Returns:
    A python module containing the Python wrappers for Ops defined in
    the plugin.
  Raises:
    RuntimeError: when unable to load the library or get the python wrappers.
  """

Next, the resource_loader file, /platform/resource_loader.py#L101. ops.py uses its function get_path_to_datafile, which according to the TensorFlow docstring "gets the path to the specified file in the data dependencies".

@tf_export(v1=['resource_loader.get_path_to_datafile'])
def get_path_to_datafile(path):
  """Get the path to the specified file in the data dependencies.
  The path is relative to tensorflow/
  Args:
    path: a string resource path relative to tensorflow/
  Returns:
    The path to the specified file present in the data attribute of py_test
    or py_binary.
  Raises:
    IOError: If the path is not found, or the resource can't be opened.
  """

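Back in the BytePS project, the function _load_library at ops.py#L37 combines resource_loader.get_path_to_datafile and load_library.load_op_library: it resolves the path of the local shared library and then loads it, which makes the custom ops available.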

def _load_library(name):
    """Loads a .so file containing the specified operators.
    Args:
      name: The name of the .so file to load.
    Raises:
      NotFoundError if were not able to load .so file.
    """
    filename = resource_loader.get_path_to_datafile(name)
    library = load_library.load_op_library(filename)
    return library
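
The path-resolution step in _load_library can be approximated without TensorFlow. The sketch below assumes that the file name is resolved relative to the directory of the calling module; this is an assumption for illustration and may not match what a given TensorFlow release does. The function path_next_to_caller is a hypothetical stand-in, not a TensorFlow API.

```python
import inspect
import os
import sys


def path_next_to_caller(relative_path):
    """Resolve `relative_path` against the directory of the calling file."""
    # Frame 1 is the caller of this function; its code object records the file name.
    caller_file = inspect.getfile(sys._getframe(1))
    caller_dir = os.path.dirname(os.path.abspath(caller_file))
    return os.path.join(caller_dir, relative_path)


# Hypothetical library name, used only to show the shape of the result.
candidate = path_next_to_caller("c_lib.so")
print(candidate)
```

Resolving relative to the caller rather than the current working directory is what lets a package find a shared library shipped next to its own .py files, no matter where the program is started from.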

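In ops.py, the ops module imported via from tensorflow.python.framework import ops is mainly used for registration on behalf of the custom ops, here their gradient functions, as shown below. For how function decorators^[5] work, see the official PEP 318.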

# ...
@ops.RegisterGradient('BytePSPushPull')
def _push_pull_grad(op, grad):
    """Gradient for push_pull op.
    Args:
      op: An operation.
      grad: `Tensor` gradient with respect to the output of the op.
    Returns:
      The gradient with respect to the input of the op.
    """
# ...
@ops.RegisterGradient('BytePSBroadcast')
def _broadcast_grad(op, grad):
    """Gradient for broadcast op.
    Args:
      op: An operation.
      grad: `Tensor` gradient with respect to the output of the op.
    Returns:
      The gradient with respect to the input of the op.
    """
# ...

[3] - C_LIB

At ops.py#L49, C_LIB is the Python module returned by the _load_library function.

C_LIB = _load_library('c_lib' + get_ext_suffix())

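As byteps/common/__init__.py#L24 shows, the function get_ext_suffix is used to "determine library extension for various versions of Python". "Suffix" here means the ending of the file name, and the function returns a string such as .pyd or .so.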

def get_ext_suffix():
    """Determine library extension for various versions of Python."""

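The body of get_ext_suffix is not shown above, so the following is only a sketch of one way such a function could work, using the standard sysconfig module. The name get_ext_suffix_sketch and the ".so" fallback are assumptions for illustration, not the BytePS code.

```python
import sysconfig


def get_ext_suffix_sketch():
    """Return the file-name suffix this interpreter uses for extension modules."""
    # EXT_SUFFIX is set by CPython 3 and may include an ABI tag before the extension.
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    if suffix:
        return suffix
    # Assumed fallback for interpreters that do not define EXT_SUFFIX.
    return ".so"


# The library name is built the same way as in the C_LIB line quoted above.
lib_name = "c_lib" + get_ext_suffix_sketch()
print(lib_name)
```

On CPython 3 the returned suffix is usually longer than a bare .so or .pyd because it carries the interpreter and platform tag, which is why the suffix is queried at run time instead of being hard-coded.
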
I did run into one question, though: the project contains the following call, but it is not clear to me where the function byteps_push_pull is implemented.

C_LIB.byteps_push_pull(tensor, name=name)
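
A likely answer, going by TensorFlow's guide on custom ops: load_op_library builds a Python wrapper for every op registered through REGISTER_OP and exposes it under a snake_case name. Under that reading, byteps_push_pull has no hand-written Python definition; it would be generated from the op registered as "BytepsPushPull" in ops.cc. The sketch below only illustrates the naming convention with a simplified rule; camel_to_snake is a hypothetical helper, not TensorFlow's actual conversion code.

```python
import re


def camel_to_snake(op_name):
    """Convert a CamelCase op name to snake_case (simplified rule)."""
    # Insert "_" before every capital letter except the first character,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", op_name).lower()


print(camel_to_snake("BytepsPushPull"))
```

With this rule every capital letter starts a new word, which would explain why the op is registered as "BytepsPushPull" rather than "BytePSPushPull".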

[4] - TF_LIB_CTYPES

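As described earlier, TF_LIB_CTYPES is the object returned by ctypes.CDLL. In ops.py, TF_LIB_CTYPES calls the function byteps_tensorflow_declare_tensor, which is declared in ops.h and defined in ops.cc, both in the same directory.

The contents of ops.h are as follows: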

#ifndef BYTEPS_TENSORFLOW_OPS_H
#define BYTEPS_TENSORFLOW_OPS_H

// ...

namespace byteps {
namespace tensorflow {

class TFReadyEvent : public common::ReadyEvent {
 // ...
};

class TFTensor : public common::Tensor {
 // ...
};

extern "C" void byteps_tensorflow_declare_tensor(char* name);

}  // namespace tensorflow
}  // namespace byteps

#endif  // BYTEPS_TENSORFLOW_OPS_H

The contents of ops.cc are as follows:

// ...

#include "ops.h"

using namespace byteps;

namespace byteps {
namespace tensorflow {

// ...

extern "C" void byteps_tensorflow_declare_tensor(char* name) {
  std::string tensor_name(name);
  common::IsTensorDeclared(tensor_name);
  return;
}

// ...

class BytePSPushPullOp : public ::tensorflow::AsyncOpKernel {
 // ...
};

REGISTER_KERNEL_BUILDER(Name("BytepsPushPull").Device(::tensorflow::DEVICE_CPU),
                        BytePSPushPullOp);
REGISTER_KERNEL_BUILDER(Name("BytepsPushPull").Device(::tensorflow::DEVICE_GPU),
                        BytePSPushPullOp);

REGISTER_OP("BytepsPushPull")
    .Attr("T: {int32, int64, float16, float32, float64}")
    .Input("tensor: T")
    .Output("sum: T")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return ::tensorflow::Status::OK();
    })
    .Doc(R"doc(
Perform an PushPull on a tensor. All other processes that do a reduction
on a tensor with the same name must have the same dimension for that tensor.
Tensors are reduced with other tensors that have the same node name for the
push_pull.
Arguments
    tensor:     A tensor to reduce.
Output
    sum:    A tensor with the same shape as `tensor`, summed across all processes.
)doc");

}  // namespace tensorflow
}  // namespace byteps

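The title says "BytePS and Horovod" because the code structure of the two projects is very similar: some files in the BytePS project carry the header shown below, and the Uber copyright line points to Horovod, which originated at Uber. Reading the two projects side by side makes it possible to see what they share and where they differ.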

# Copyright 2019 Bytedance Inc. All Rights Reserved.
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
# Modifications copyright (C) 2019 Uber Technologies, Inc.

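Building a third-party project on top of an existing large project such as TensorFlow requires a deep understanding of the original project as soon as fine-grained modifications are involved. It is a real test of one's coding skills.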

Project address:

byteps (github.com)