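I'm starting a series that reads through the source code of BytePS[1] and Horovod[2].
I'll begin with /byteps/tensorflow/ops.py. This file is the Python side that calls into the underlying C++ code, and it contains operations such as _push_pull and broadcast.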
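[1] - The ctypes module
ctypes[3] is a foreign function library for Python. It provides C-compatible data types and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.
There are several ways to load a shared library into the Python process; one of them is to instantiate the following class: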
class ctypes.CDLL(name, mode=DEFAULT_MODE, handle=None, use_errno=False, use_last_error=False)
An instance of this class represents a loaded shared library. BytePS does exactly this at byteps/tensorflow/ops.py#L63:
TF_LIB_CTYPES = ctypes.CDLL(dll_path, ctypes.RTLD_GLOBAL)
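In addition, ctypes defines a number of primitive C-compatible data types. The project uses ctypes.c_char_p, which corresponds to the C type char *; see byteps/tensorflow/ops.py#L91: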
TF_LIB_CTYPES.byteps_tensorflow_declare_tensor(ctypes.c_char_p(full_name))
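The ctypes documentation illustrates how these pointer types behave using c_wchar_p, the wide-character counterpart of c_char_p:
>>> from ctypes import c_wchar_p
>>> s = "Hello, World"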
>>> c_s = c_wchar_p(s)
>>> print(c_s)
c_wchar_p(139966785747344)
>>> print(c_s.value)
Hello, World
>>> c_s.value = "Hi, there"
>>> print(c_s) # the memory location has changed
c_wchar_p(139966783348904)
>>> print(c_s.value)
Hi, there
>>> print(s) # first object is unchanged
Hello, World
>>>
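To tie CDLL and c_char_p together, here is a minimal sketch that loads the C standard library instead of the BytePS library and calls strlen on a byte string. It assumes a POSIX system, and the tensor name used is hypothetical. The main point is that in Python 3 c_char_p wraps a bytes object, so a str name has to be encoded before it is handed to C.

```python
import ctypes
import ctypes.util

# Load the C standard library (assumes a POSIX system). If find_library("c")
# returns None, CDLL(None) exposes the symbols already loaded into the
# process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# c_char_p wraps bytes, so a Python str must be encoded first.
# The name below is a hypothetical tensor name used only for illustration.
name = "BytePSPushPull_dense".encode("ascii")
length = libc.strlen(ctypes.c_char_p(name))
print(length)
```

Setting argtypes and restype is optional for simple calls, but it lets ctypes check the arguments and convert the result instead of assuming a C int.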
[2] - framework and platform in tensorflow.python
from tensorflow.python.framework import load_library
from tensorflow.python.framework import ops
from tensorflow.python.platform import resource_loader
These imports pull in modules from the framework and platform packages of TensorFlow[4]. The functions that matter here are load_library.load_op_library and resource_loader.get_path_to_datafile.
First, look at load_library, at /framework/load_library.py#L39. The main purpose of load_op_library is to "load a TensorFlow plugin containing custom ops and kernels", so that these custom ops can then be called just like TensorFlow's built-in ops.
@tf_export('load_op_library')
def load_op_library(library_filename):
  """Loads a TensorFlow plugin, containing custom ops and kernels.

  Pass "library_filename" to a platform-specific mechanism for dynamically
  loading a library. The rules for determining the exact location of the
  library are platform-specific and are not documented here. When the
  library is loaded, ops and kernels registered in the library via the
  `REGISTER_*` macros are made available in the TensorFlow process. Note
  that ops with the same name as an existing op are rejected and not
  registered with the process.

  Args:
    library_filename: Path to the plugin.
      Relative or absolute filesystem path to a dynamic library file.

  Returns:
    A python module containing the Python wrappers for Ops defined in
    the plugin.

  Raises:
    RuntimeError: when unable to load the library or get the python wrappers.
  """
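Next, look at resource_loader, at /platform/resource_loader.py#L101. ops.py uses its function get_path_to_datafile, whose purpose, according to the official TensorFlow docstring, is to "get the path to the specified file in the data dependencies".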
@tf_export(v1=['resource_loader.get_path_to_datafile'])
def get_path_to_datafile(path):
  """Get the path to the specified file in the data dependencies.

  The path is relative to tensorflow/

  Args:
    path: a string resource path relative to tensorflow/

  Returns:
    The path to the specified file present in the data attribute of py_test
    or py_binary.

  Raises:
    IOError: If the path is not found, or the resource can't be opened.
  """
Back in BytePS, the function _load_library at ops.py#L37 combines resource_loader.get_path_to_datafile and load_library.load_op_library to load the native library and make its custom ops available.
def _load_library(name):
    """Loads a .so file containing the specified operators.

    Args:
      name: The name of the .so file to load.

    Raises:
      NotFoundError if were not able to load .so file.
    """
    filename = resource_loader.get_path_to_datafile(name)
    library = load_library.load_op_library(filename)
    return library
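The first step of _load_library turns a bare file name into a full path. As far as I can tell, get_path_to_datafile resolves the name relative to the directory of the calling module, which is why the .so file is expected to sit next to ops.py. The sketch below approximates that idea with the standard library only. The helper path_next_to and the example paths are hypothetical, and this is not TensorFlow's implementation.

```python
import os


def path_next_to(module_file, name):
    """Return the absolute path of `name` in the directory of `module_file`."""
    base_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(base_dir, name)


# Hypothetical layout: a module at /pkg/byteps/tensorflow/ops.py that looks
# for a shared library stored in the same directory.
lib_path = path_next_to("/pkg/byteps/tensorflow/ops.py", "c_lib.so")
print(lib_path)
```

Inside a real module one would pass `__file__` as the first argument, so the lookup keeps working wherever the package is installed.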
In ops.py, the ops module imported with from tensorflow.python.framework import ops is mainly used to register gradient functions for the custom ops, as shown below. For how decorators[5] work, see the official PEP 318.
# ...
@ops.RegisterGradient('BytePSPushPull')
def _push_pull_grad(op, grad):
    """Gradient for push_pull op.

    Args:
      op: An operation.
      grad: `Tensor` gradient with respect to the output of the op.

    Returns:
      The gradient with respect to the input of the op.
    """
    # ...

@ops.RegisterGradient('BytePSBroadcast')
def _broadcast_grad(op, grad):
    """Gradient for broadcast op.

    Args:
      op: An operation.
      grad: `Tensor` gradient with respect to the output of the op.

    Returns:
      The gradient with respect to the input of the op.
    """
    # ...
[3] - C_LIB
At ops.py#L49, C_LIB is the Python module returned by _load_library.
C_LIB = _load_library('c_lib' + get_ext_suffix())
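According to byteps/common/__init__.py#L24, get_ext_suffix is used to "determine library extension for various versions of Python". The function returns the file name suffix of compiled extension modules as a string, which ends in .pyd or .so.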
def get_ext_suffix():
    """Determine library extension for various versions of Python."""
However, I did run into a question. The project uses the following code, but it is not obvious where the function byteps_push_pull is implemented.
C_LIB.byteps_push_pull(tensor, name=name)
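One plausible answer follows from the load_op_library docstring quoted earlier. The returned module contains Python wrappers generated for each registered op, and TensorFlow names those wrappers by converting the CamelCase op name to snake_case. Under that reading, C_LIB.byteps_push_pull is the generated wrapper for the op registered as "BytepsPushPull" with REGISTER_OP in ops.cc, and the work itself is done by the BytePSPushPullOp kernel. The helper camel_to_snake below is hypothetical. It only illustrates the naming convention and is not TensorFlow's generator code.

```python
import re


def camel_to_snake(op_name):
    """Insert "_" before every non-leading capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", op_name).lower()


wrapper_name = camel_to_snake("BytepsPushPull")
print(wrapper_name)
```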
[4] - TF_LIB_CTYPES
As described earlier, TF_LIB_CTYPES is the object returned by ctypes.CDLL. In ops.py, TF_LIB_CTYPES calls the function byteps_tensorflow_declare_tensor, which is declared in ops.h and defined in ops.cc, both in the same directory.
The contents of ops.h are as follows:
#ifndef BYTEPS_TENSORFLOW_OPS_H
#define BYTEPS_TENSORFLOW_OPS_H
// ...
namespace byteps {
namespace tensorflow {
class TFReadyEvent : public common::ReadyEvent {
// ...
};
class TFTensor : public common::Tensor {
// ...
};
extern "C" void byteps_tensorflow_declare_tensor(char* name);
} // namespace tensorflow
} // namespace byteps
#endif // BYTEPS_TENSORFLOW_OPS_H
The contents of ops.cc are as follows:
// ...
#include "ops.h"
using namespace byteps;
namespace byteps {
namespace tensorflow {
// ...
extern "C" void byteps_tensorflow_declare_tensor(char* name) {
  std::string tensor_name(name);
  common::IsTensorDeclared(tensor_name);
  return;
}
// ...
class BytePSPushPullOp : public ::tensorflow::AsyncOpKernel {
// ...
};
REGISTER_KERNEL_BUILDER(Name("BytepsPushPull").Device(::tensorflow::DEVICE_CPU),
                        BytePSPushPullOp);
REGISTER_KERNEL_BUILDER(Name("BytepsPushPull").Device(::tensorflow::DEVICE_GPU),
                        BytePSPushPullOp);
REGISTER_OP("BytepsPushPull")
    .Attr("T: {int32, int64, float16, float32, float64}")
    .Input("tensor: T")
    .Output("sum: T")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return ::tensorflow::Status::OK();
    })
    .Doc(R"doc(
Perform an PushPull on a tensor. All other processes that do a reduction
on a tensor with the same name must have the same dimension for that tensor.
Tensors are reduced with other tensors that have the same node name for the
push_pull.
Arguments
tensor: A tensor to reduce.
Output
sum: A tensor with the same shape as `tensor`, summed across all processes.
)doc");
} // namespace tensorflow
} // namespace byteps
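The title mentions both BytePS and Horovod because the two projects have very similar code structures: files in BytePS sometimes carry a header like the one below. Reading the two side by side makes it easier to see what they share and where they differ.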
# Copyright 2019 Bytedance Inc. All Rights Reserved.
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
# Modifications copyright (C) 2019 Uber Technologies, Inc.
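Building a third-party project on top of a large existing project such as TensorFlow requires a deep understanding of the original whenever fine-grained changes are involved. It is a real test of coding skill.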
Project repository: byteps (github.com)
References
[1] https://github.com/bytedance/byteps
[2] https://github.com/horovod/horovod
[3] https://docs.python.org/3/library/ctypes.html
[4] https://github.com/tensorflow/tensorflow
[5] https://www.python.org/dev/peps/pep-0318/