Tensor
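As the name suggests, tensorflow is made up of tensor + flow. This article looks at what a Tensor is and how it is implemented.
Abstractly, a tensor in tensorflow is a triple of <n-dimensional array, element type (dtype), size of each dimension (shape)>, together with a set of operations on that triple: create, delete, copy, reshape, slice, and so on. A simple definition in C++ would be: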
struct Tensor {
std::vector<int> shape; // size of each dimension, e.g. a 3-D array: shape={2,3,4}
int dtype; // element type; determines which array type data is cast to
void *data; // contiguous memory holding every element of the array
};
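The shape can be changed: a 2x3 array can become 3x2, as long as the number of elements stays the same.
data is a contiguous block of memory, just like the C++ array T[2][3]. If dtype is an integer type, it is equivalent to int data[2][3].
data is a pointer. If it is cast to int *data, then data, data+1, data+2, ..., data+5 are the addresses of the individual elements.
A tensor can also be sliced. For example, taking index 1 along the first dimension gives data[1]. Because the array is 2x3, data[0] and data[1] each hold 3 elements. After slicing, the new tensor still references the memory of the original tensor, and reference counting ensures that the slice stays usable even after the original tensor has been released.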
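The reshape and slice behaviour described above can be sketched in a few lines. This is a minimal illustration, not TensorFlow's implementation: `SimpleTensor`, `MakeTensor`, `Reshape`, and `Slice` are hypothetical names, the element type is fixed to `int`, and `std::shared_ptr` stands in for the reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical simplified tensor of int elements (not TensorFlow's class).
struct SimpleTensor {
  std::vector<int64_t> shape;                // size of each dimension
  std::shared_ptr<std::vector<int>> buffer;  // shared, reference-counted storage
  std::size_t offset = 0;                    // first element of this view

  int64_t NumElements() const {
    int64_t n = 1;
    for (int64_t d : shape) n *= d;
    return n;
  }
  int* data() const { return buffer->data() + offset; }
};

// Allocates zero-initialized storage for the given shape.
SimpleTensor MakeTensor(std::vector<int64_t> shape) {
  SimpleTensor t;
  t.shape = std::move(shape);
  t.buffer = std::make_shared<std::vector<int>>(
      static_cast<std::size_t>(t.NumElements()), 0);
  return t;
}

// Returns a view with a new shape; the element count must not change.
SimpleTensor Reshape(const SimpleTensor& t, std::vector<int64_t> new_shape) {
  SimpleTensor r = t;  // shares the buffer, copies no elements
  r.shape = std::move(new_shape);
  assert(r.NumElements() == t.NumElements());
  return r;
}

// Returns rows [start, limit) of the first dimension without copying.
SimpleTensor Slice(const SimpleTensor& t, int64_t start, int64_t limit) {
  assert(!t.shape.empty());
  assert(0 <= start && start <= limit && limit <= t.shape[0]);
  const int64_t row = t.shape[0] == 0 ? 0 : t.NumElements() / t.shape[0];
  SimpleTensor s = t;  // shares the buffer
  s.shape[0] = limit - start;
  s.offset = t.offset + static_cast<std::size_t>(start * row);
  return s;
}
```

Slicing a 2x3 tensor with `Slice(t, 1, 2)` yields a view whose `data()` points at the fourth element of the shared buffer, and that view keeps the buffer alive even if `t` is destroyed first.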
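Tensor implementation
In an engineering implementation, however, there is more to consider. The data must be aligned (for example, to 8 bytes), and the way memory for data is allocated matters as well: tensorflow defines an Allocator interface so that different allocation strategies can be implemented. Because model parameters and checkpoints have to be saved, a tensor must also support serialization; tensorflow uses protobuf to serialize tensors.
The Tensor implementation lives in: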
ls tensorflow/core/framework/tensor.*
tensor.cc tensor.h tensor.proto
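Basic operations
Tensor members
TensorShape shape_; // shape
TensorBuffer* buf_; // data
Construction
- Default constructor: the result is not a scalar; it has shape {0} and NumElements() == 0.
- Type + shape constructor, which allocates memory: Tensor(DataType type, const TensorShape& shape); it uses the CPUAllocator by default.
- Allocator + type + shape constructor: Tensor(Allocator* a, DataType type, const TensorShape& shape);
- Constructor from an existing buffer: Tensor(DataType type, const TensorShape& shape, TensorBuffer* buf);
- Constructors from a constant (scalar) value, with many overloads, e.g. explicit Tensor(float scalar_value)
Slicing
Slicing works along the first dimension and does not copy data. The result is not guaranteed to be aligned; check it with IsAligned.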
Tensor Slice(int64_t dim0_start, int64_t dim0_limit) const;
Tensor SubSlice(int64_t index) const;
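To see why a slice may lose alignment, consider the byte offset at which it starts. The sketch below is an illustration under stated assumptions, not TensorFlow code: `kAlignment`, `IsAlignedPtr`, and `SliceByteOffset` are hypothetical names, and the 64-byte value mirrors `Allocator::kAllocatorAlignment` shown later in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed alignment boundary, mirroring Allocator::kAllocatorAlignment.
constexpr std::size_t kAlignment = 64;

// True if ptr is a multiple of kAlignment.
bool IsAlignedPtr(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kAlignment == 0;
}

// Byte offset of row `start` in a row-major tensor whose rows each hold
// `row_elems` elements of `elem_size` bytes.
std::size_t SliceByteOffset(std::size_t start, std::size_t row_elems,
                            std::size_t elem_size) {
  assert(elem_size > 0);
  return start * row_elems * elem_size;
}
```

For a 2x3 float tensor whose buffer starts on a 64-byte boundary, the slice beginning at row 1 starts `1 * 3 * sizeof(float)` bytes into the buffer, which is generally not a multiple of 64, so the slice shares memory but is not aligned.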
Serialization
bool FromProto(const TensorProto& other) TF_MUST_USE_RESULT;
bool FromProto(Allocator* a, const TensorProto& other) TF_MUST_USE_RESULT;
/// \brief Fills in `proto` with `*this` tensor's content.
///
/// `AsProtoField()` fills in the repeated field for `proto.dtype()`, while
/// `AsProtoTensorContent()` encodes the content in `proto.tensor_content()`
/// in a compact form.
void AsProtoField(TensorProto* proto) const;
void AsProtoTensorContent(TensorProto* proto) const;
Copying
- Both copy construction and move construction are supported
- operator= supports both copy and move assignment
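In TensorFlow, copying a Tensor shares the underlying buffer and bumps its reference count instead of duplicating the elements. The sketch below shows that pattern with a hand-written intrusive reference count. `Buffer` and `Handle` are hypothetical names, and the counter here is not thread-safe, unlike a production implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reference-counted buffer.
class Buffer {
 public:
  explicit Buffer(std::size_t n) : data_(n) {}
  void Ref() { ++refs_; }
  void Unref() {
    assert(refs_ > 0);
    if (--refs_ == 0) delete this;  // the last owner frees the storage
  }
  int refs() const { return refs_; }
  float* data() { return data_.data(); }

 private:
  ~Buffer() = default;  // destroyed only through Unref()
  int refs_ = 1;        // not atomic: single-threaded sketch only
  std::vector<float> data_;
};

// Hypothetical tensor-like handle: copy shares, move steals.
class Handle {
 public:
  explicit Handle(std::size_t n) : buf_(new Buffer(n)) {}
  Handle(const Handle& o) : buf_(o.buf_) {
    if (buf_) buf_->Ref();
  }
  Handle(Handle&& o) noexcept : buf_(o.buf_) { o.buf_ = nullptr; }
  // Taking the argument by value covers copy and move assignment.
  Handle& operator=(Handle o) noexcept {
    std::swap(buf_, o.buf_);
    return *this;
  }
  ~Handle() {
    if (buf_) buf_->Unref();
  }
  Buffer* buffer() const { return buf_; }

 private:
  Buffer* buf_;
};
```

After `Handle b = a;` both handles point at the same `Buffer` with a count of 2. A move leaves the source handle empty and the count unchanged.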
Access
/// Returns the data type.
DataType dtype() const { return shape_.data_type(); }
/// Returns the shape of the tensor.
const TensorShape& shape() const { return shape_; }
/// \brief Convenience accessor for the tensor shape.
///
/// For all shape accessors, see comments for relevant methods of
/// `TensorShape` in `tensor_shape.h`.
int dims() const { return shape().dims(); }
/// Convenience accessor for the tensor shape.
int64_t dim_size(int d) const { return shape().dim_size(d); }
/// Convenience accessor for the tensor shape.
int64_t NumElements() const { return shape().num_elements(); }
size_t AllocatedBytes() const;
bool IsAligned() const;
bool CopyFrom(const Tensor& other,
const TensorShape& shape);
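Tensor t;  // each accessor below requires t to have the matching rank and dtype
auto s = t.scalar<float>();  // scalar access: s()
auto v = t.vec<float>();     // access as a 1-D array: v(0)
auto m = t.matrix<float>();  // access as a matrix: m(2, 3)
// Element-by-element access
auto flat = t.flat<float>();
float* d = flat.data();
for (int64_t i = 0; i < t.NumElements(); i++) { /* use d[i] */ }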
template <typename T>
typename TTypes<T>::Flat flat() {
return shaped<T, 1>({NumElements()});
}
template <typename T>
typename TTypes<T>::UnalignedFlat unaligned_flat() {
return unaligned_shaped<T, 1>({NumElements()});
}
// For use with memcpy
/// REQUIRES: `DataTypeCanUseMemcpy(dtype())`.
StringPiece tensor_data() const;
void* data() const;
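The accessors above are different views over the same contiguous, row-major memory, so `matrix(i, j)` and the flat view address the same elements. A minimal sketch of that index mapping, where `FlatIndex` is a hypothetical helper and not part of the TensorFlow API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts a multi-dimensional index into the
// offset of the same element in the flat, row-major buffer.
std::size_t FlatIndex(const std::vector<std::size_t>& shape,
                      const std::vector<std::size_t>& index) {
  assert(shape.size() == index.size());
  std::size_t flat = 0;
  for (std::size_t d = 0; d < shape.size(); ++d) {
    assert(index[d] < shape[d]);
    // Each earlier dimension is scaled by the sizes of all later ones.
    flat = flat * shape[d] + index[d];
  }
  return flat;
}
```

For a 2x3 matrix, element (1, 2) lands at flat offset `1 * 3 + 2 = 5`, the last element of the buffer.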
Debug information
std::string SummarizeValue(int64_t max_entries, bool print_v2 = false) const;
std::string DebugString(int num_values) const;
std::string DebugString() const { return DebugString(3); }
std::string DeviceSafeDebugString() const;
void FillDescription(TensorDescription* description) const;
Tensor, shape, and type are implemented in the following files
tensor.h: TensorBuffer refers to the underlying data memory.
$ ls tensorflow/core/framework/tensor*
tensorflow/core/framework/tensor.cc tensorflow/core/framework/tensor_shape.proto tensorflow/core/framework/tensor_testutil.h
tensorflow/core/framework/tensor.h tensorflow/core/framework/tensor_shape_test.cc tensorflow/core/framework/tensor_testutil_test.cc
tensorflow/core/framework/tensor.proto tensorflow/core/framework/tensor_slice.cc tensorflow/core/framework/tensor_types.h
tensorflow/core/framework/tensor_description.proto tensorflow/core/framework/tensor_slice.h tensorflow/core/framework/tensor_util.cc
tensorflow/core/framework/tensor_key.h tensorflow/core/framework/tensor_slice.proto tensorflow/core/framework/tensor_util.h
tensorflow/core/framework/tensor_reference.h tensorflow/core/framework/tensor_slice_test.cc tensorflow/core/framework/tensor_util_test.cc
tensorflow/core/framework/tensor_shape.cc tensorflow/core/framework/tensor_test.cc
tensorflow/core/framework/tensor_shape.h tensorflow/core/framework/tensor_testutil.cc
$ ls tensorflow/core/framework/shape*
tensorflow/core/framework/shape_inference.cc tensorflow/core/framework/shape_inference_test.cc tensorflow/core/framework/shape_inference_testutil.h
tensorflow/core/framework/shape_inference.h tensorflow/core/framework/shape_inference_testutil.cc tensorflow/core/framework/shape_inference_testutil_test.cc
$ ls tensorflow/core/framework/type*
tensorflow/core/framework/type_index.h tensorflow/core/framework/typed_allocator.cc tensorflow/core/framework/types.cc tensorflow/core/framework/types.proto
tensorflow/core/framework/type_traits.h tensorflow/core/framework/typed_allocator.h tensorflow/core/framework/types.h tensorflow/core/framework/types_test.cc
Data types supported by Tensor
They are defined in tensorflow/core/framework/types.proto
enum DataType {
// Not a legal value for DataType. Used to indicate a DataType field
// has not been set.
DT_INVALID = 0;
// Data types that all computation devices are expected to be
// capable to support.
DT_FLOAT = 1;
DT_DOUBLE = 2;
DT_INT32 = 3;
DT_UINT8 = 4;
DT_INT16 = 5;
DT_INT8 = 6;
DT_STRING = 7;
DT_COMPLEX64 = 8; // Single-precision complex
DT_INT64 = 9;
DT_BOOL = 10;
DT_QINT8 = 11; // Quantized int8
DT_QUINT8 = 12; // Quantized uint8
DT_QINT32 = 13; // Quantized int32
DT_BFLOAT16 = 14; // Float32 truncated to 16 bits. Only for cast ops.
DT_QINT16 = 15; // Quantized int16
DT_QUINT16 = 16; // Quantized uint16
DT_UINT16 = 17;
DT_COMPLEX128 = 18; // Double-precision complex
DT_HALF = 19;
DT_RESOURCE = 20;
DT_VARIANT = 21; // Arbitrary C++ data types
DT_UINT32 = 22;
DT_UINT64 = 23;
}
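The dtype decides how the raw data pointer is interpreted and how many bytes each element occupies. The sketch below covers a subset of the enum above with the same numbering. `ElementSize` and `BufferBytes` are hypothetical helpers written for illustration, and returning 0 for types without a fixed element size is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Subset of the DataType values listed above, with the same numbering.
enum DataType {
  DT_INVALID = 0,
  DT_FLOAT = 1,
  DT_DOUBLE = 2,
  DT_INT32 = 3,
  DT_UINT8 = 4,
  DT_INT16 = 5,
  DT_INT8 = 6,
  DT_STRING = 7,
  DT_INT64 = 9,
  DT_BOOL = 10
};

// Hypothetical helper: bytes per element for fixed-size types, 0 otherwise.
std::size_t ElementSize(DataType dtype) {
  switch (dtype) {
    case DT_FLOAT:  return sizeof(float);
    case DT_DOUBLE: return sizeof(double);
    case DT_INT32:  return sizeof(std::int32_t);
    case DT_UINT8:  return sizeof(std::uint8_t);
    case DT_INT16:  return sizeof(std::int16_t);
    case DT_INT8:   return sizeof(std::int8_t);
    case DT_INT64:  return sizeof(std::int64_t);
    case DT_BOOL:   return sizeof(bool);
    default:        return 0;  // DT_INVALID, DT_STRING: no fixed size
  }
}

// Bytes needed to store num_elements values of a fixed-size dtype.
std::size_t BufferBytes(DataType dtype, std::size_t num_elements) {
  assert(ElementSize(dtype) != 0);
  return ElementSize(dtype) * num_elements;
}
```

A 2x3 DT_FLOAT tensor therefore needs `6 * sizeof(float)` bytes, while a DT_STRING tensor cannot be sized this way because its elements have variable length.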
Serialization: tensor.proto
// Protocol buffer representing a tensor.
message TensorProto {
DataType dtype = 1;
// Shape of the tensor. TODO(touts): sort out the 0-rank issues.
TensorShapeProto tensor_shape = 2;
// Only one of the representations below is set, one of "tensor_contents" and
// the "xxx_val" attributes. We are not using oneof because as oneofs cannot
// contain repeated fields it would require another extra set of messages.
// Version number.
//
// In version 0, if the "repeated xxx" representations contain only one
// element, that element is repeated to fill the shape. This makes it easy
// to represent a constant Tensor with a single value.
int32 version_number = 3;
// Serialized raw tensor content from either Tensor::AsProtoTensorContent or
// memcpy in tensorflow::grpc::EncodeTensorToByteBuffer. This representation
// can be used for all tensor types. The purpose of this representation is to
// reduce serialization overhead during RPC call by avoiding serialization of
// many repeated small items.
bytes tensor_content = 4;
// Type specific representations that make it easy to create tensor protos in
// all languages. Only the representation corresponding to "dtype" can
// be set. The values hold the flattened representation of the tensor in
// row major order.
// DT_HALF, DT_BFLOAT16. Note that since protobuf has no int16 type, we'll
// have some pointless zero padding for each value here.
repeated int32 half_val = 13 [packed = true];
// DT_FLOAT.
repeated float float_val = 5 [packed = true];
// DT_DOUBLE.
repeated double double_val = 6 [packed = true];
// DT_INT32, DT_INT16, DT_UINT16, DT_INT8, DT_UINT8.
repeated int32 int_val = 7 [packed = true];
// DT_STRING
repeated bytes string_val = 8;
// DT_COMPLEX64. scomplex_val(2*i) and scomplex_val(2*i+1) are real
// and imaginary parts of i-th single precision complex.
repeated float scomplex_val = 9 [packed = true];
// DT_INT64
repeated int64 int64_val = 10 [packed = true];
// DT_BOOL
repeated bool bool_val = 11 [packed = true];
// DT_COMPLEX128. dcomplex_val(2*i) and dcomplex_val(2*i+1) are real
// and imaginary parts of i-th double precision complex.
repeated double dcomplex_val = 12 [packed = true];
// DT_RESOURCE
repeated ResourceHandleProto resource_handle_val = 14;
// DT_VARIANT
repeated VariantTensorDataProto variant_val = 15;
// DT_UINT32
repeated uint32 uint32_val = 16 [packed = true];
// DT_UINT64
repeated uint64 uint64_val = 17 [packed = true];
}
// Protocol buffer representing the serialization format of DT_VARIANT tensors.
message VariantTensorDataProto {
// Name of the type of objects being serialized.
string type_name = 1;
// Portions of the object that are not Tensors.
bytes metadata = 2;
// Tensors contained within objects being serialized.
repeated TensorProto tensors = 3;
}
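Two rules stated in the comments above are easy to miss: a repeated field holding a single element is expanded to fill the whole shape, and complex values are stored as interleaved (real, imaginary) pairs. The sketch below illustrates both with plain `std::vector`s in place of the generated protobuf classes. `ExpandFloatVal` and `DecodeComplex` are hypothetical names and do not exist in TensorFlow.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Applies the single-element rule: one value fills every element.
// Otherwise the field must already hold exactly num_elements values.
std::vector<float> ExpandFloatVal(const std::vector<float>& float_val,
                                  std::size_t num_elements) {
  if (float_val.size() == 1) {
    return std::vector<float>(num_elements, float_val[0]);
  }
  assert(float_val.size() == num_elements);
  return float_val;
}

// Decodes scomplex_val: entries 2*i and 2*i+1 are the real and
// imaginary parts of the i-th complex number.
std::vector<std::complex<float>> DecodeComplex(
    const std::vector<float>& scomplex_val) {
  assert(scomplex_val.size() % 2 == 0);
  std::vector<std::complex<float>> out;
  out.reserve(scomplex_val.size() / 2);
  for (std::size_t i = 0; i + 1 < scomplex_val.size(); i += 2) {
    out.emplace_back(scomplex_val[i], scomplex_val[i + 1]);
  }
  return out;
}
```

With this rule a constant 2x3 tensor can be represented by a single `float_val` entry rather than six.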
tensor_util.h and tensor_util_test.cc contain examples of how to use tensors.
They provide the following functionality:
- Deep copy of a tensor
- Deep copy of a slice
- Concat: concatenation
- Split: splitting
- ConcatSplitStrings: string concatenation
- CreatesStringTensorProto: deserializes a tensor with dtype=DT_STRING from a protobuf in a file
- CreatesInt32TensorProto
- CreatesInt64TensorProto
- CreatesUInt32TensorProto
- CreatesUInt64TensorProto
- ... every type has a corresponding case that deserializes from a file
- CompressTensorProtoInPlaceTooSmall: the various tensor proto compression cases
- CompressTensorProtoInPlaceAllEqual
- CompressTensorProtoConstantTail
- CompressTensorProtoNegatizeZero
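Concatenating and splitting along the first dimension is simple for row-major data, because each input occupies a contiguous run of rows. The sketch below illustrates the idea on a hypothetical `Matrix` struct. `ConcatRows` and `SplitRows` are illustrative names and do not reproduce the signatures in tensor_util.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major 2-D tensor of floats.
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> values;  // rows * cols elements
};

// Joins matrices with equal column counts along the first dimension.
Matrix ConcatRows(const std::vector<Matrix>& parts) {
  assert(!parts.empty());
  Matrix out{0, parts[0].cols, {}};
  for (const Matrix& m : parts) {
    assert(m.cols == out.cols);
    assert(m.values.size() == m.rows * m.cols);
    out.rows += m.rows;
    out.values.insert(out.values.end(), m.values.begin(), m.values.end());
  }
  return out;
}

// Splits m into pieces of sizes[i] rows each; the sizes must sum to m.rows.
std::vector<Matrix> SplitRows(const Matrix& m,
                              const std::vector<std::size_t>& sizes) {
  std::vector<Matrix> out;
  std::size_t row = 0;
  for (std::size_t n : sizes) {
    assert(row + n <= m.rows);
    auto first = m.values.begin() + static_cast<std::ptrdiff_t>(row * m.cols);
    auto last = first + static_cast<std::ptrdiff_t>(n * m.cols);
    out.push_back(Matrix{n, m.cols, std::vector<float>(first, last)});
    row += n;
  }
  assert(row == m.rows);
  return out;
}
```

Unlike slicing, both helpers copy elements into new storage, so the results do not alias their inputs.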
Tensor memory allocation: Allocator
Interface
tensorflow/core/framework/allocator.h
// Allocator is an abstract interface for allocating and deallocating
// device memory.
class Allocator {
public:
// Align to 64 byte boundary.
static constexpr size_t kAllocatorAlignment = 64;
virtual ~Allocator();
// Return a string identifying this allocator
virtual std::string Name() = 0;
// Return an uninitialized block of memory that is "num_bytes" bytes
// in size. The returned pointer is guaranteed to be aligned to a
// multiple of "alignment" bytes.
// REQUIRES: "alignment" is a power of 2.
virtual void* AllocateRaw(size_t alignment, size_t num_bytes) = 0;
// Return an uninitialized block of memory that is "num_bytes" bytes
// in size with specified allocation attributes. The returned pointer is
// guaranteed to be aligned to a multiple of "alignment" bytes.
// REQUIRES: "alignment" is a power of 2.
virtual void* AllocateRaw(size_t alignment, size_t num_bytes,
const AllocationAttributes& allocation_attr) {
// The default behavior is to use the implementation without any allocation
// attributes.
return AllocateRaw(alignment, num_bytes);
}
// Deallocate a block of memory pointer to by "ptr"
// REQUIRES: "ptr" was previously returned by a call to AllocateRaw
virtual void DeallocateRaw(void* ptr) = 0;
// Returns true if this allocator tracks the sizes of allocations.
// RequestedSize and AllocatedSize must be overridden if
// TracksAllocationSizes is overridden to return true.
virtual bool TracksAllocationSizes() const { return false; }
// Returns true if this allocator allocates an opaque handle rather than the
// requested number of bytes.
//
// This method returns false for most allocators, but may be used by
// special-case allocators that track tensor usage. If this method returns
// true, AllocateRaw() should be invoked for all values of `num_bytes`,
// including 0.
//
// NOTE: It is the caller's responsibility to track whether an allocated
// object is a buffer or an opaque handle. In particular, when this method
// returns `true`, users of this allocator must not run any constructors or
// destructors for complex objects, since there is no backing store for the
// tensor in which to place their outputs.
virtual bool AllocatesOpaqueHandle() const { return false; }
// Returns the user-requested size of the data allocated at
// 'ptr'. Note that the actual buffer allocated might be larger
// than requested, but this function returns the size requested by
// the user.
//
// REQUIRES: TracksAllocationSizes() is true.
//
// REQUIRES: 'ptr!=nullptr' and points to a buffer previously
// allocated by this allocator.
virtual size_t RequestedSize(const void* ptr) const {
CHECK(false) << "allocator doesn't track sizes";
return size_t(0);
}
// Returns the allocated size of the buffer at 'ptr' if known,
// otherwise returns RequestedSize(ptr). AllocatedSize(ptr) is
// guaranteed to be >= RequestedSize(ptr).
//
// REQUIRES: TracksAllocationSizes() is true.
//
// REQUIRES: 'ptr!=nullptr' and points to a buffer previously
// allocated by this allocator.
virtual size_t AllocatedSize(const void* ptr) const {
return RequestedSize(ptr);
}
// Returns either 0 or an identifier assigned to the buffer at 'ptr'
// when the buffer was returned by AllocateRaw. If non-zero, the
// identifier differs from every other ID assigned by this
// allocator.
//
// REQUIRES: TracksAllocationSizes() is true.
//
// REQUIRES: 'ptr!=nullptr' and points to a buffer previously
// allocated by this allocator.
virtual int64_t AllocationId(const void* ptr) const { return 0; }
// Returns the allocated size of the buffer at 'ptr' if known,
// otherwise returns 0. This method can be called when
// TracksAllocationSizes() is false, but can be extremely slow.
//
// REQUIRES: 'ptr!=nullptr' and points to a buffer previously
// allocated by this allocator.
virtual size_t AllocatedSizeSlow(const void* ptr) const {
if (TracksAllocationSizes()) {
return AllocatedSize(ptr);
}
return 0;
}
virtual absl::optional<AllocatorStats> GetStats() { return absl::nullopt; }
virtual bool ClearStats() TF_MUST_USE_RESULT { return false; }
virtual void SetSafeFrontier(uint64 count) {}
// Returns the type of the memory allocated by this allocator.
virtual AllocatorMemoryType GetMemoryType() const {
return AllocatorMemoryType::kUnknown;
}
};
You can derive from this class to implement your own allocator.
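As a rough illustration of what such a subclass looks like, the sketch below defines a reduced interface with only the three pure-virtual methods and a custom allocator on top of it. `MiniAllocator` and `CountingAllocator` are hypothetical names; a real implementation would derive from `tensorflow::Allocator` instead. The sketch assumes C++17 for `std::aligned_alloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical, reduced interface modeled on the Allocator class above.
class MiniAllocator {
 public:
  virtual ~MiniAllocator() = default;
  virtual std::string Name() = 0;
  virtual void* AllocateRaw(std::size_t alignment, std::size_t num_bytes) = 0;
  virtual void DeallocateRaw(void* ptr) = 0;
};

// Custom allocator that counts how many blocks are currently live.
class CountingAllocator : public MiniAllocator {
 public:
  std::string Name() override { return "counting"; }

  void* AllocateRaw(std::size_t alignment, std::size_t num_bytes) override {
    // REQUIRES: alignment is a power of 2.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::aligned_alloc expects the size to be a multiple of alignment.
    std::size_t rounded = (num_bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0) rounded = alignment;
    void* p = std::aligned_alloc(alignment, rounded);
    assert(p == nullptr ||
           reinterpret_cast<std::uintptr_t>(p) % alignment == 0);
    if (p != nullptr) ++live_;
    return p;
  }

  void DeallocateRaw(void* ptr) override {
    if (ptr == nullptr) return;
    std::free(ptr);
    --live_;
  }

  int live() const { return live_; }

 private:
  int live_ = 0;  // not thread-safe: single-threaded sketch only
};
```

Passing 64 as the alignment matches `kAllocatorAlignment` in the interface above.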
CPUAllocator
tensorflow/core/framework/cpu_allocator_impl.h
class CPUAllocator : public Allocator {
public:
CPUAllocator()
: single_allocation_warning_count_(0),
total_allocation_warning_count_(0) {}
~CPUAllocator() override {}
string Name() override { return "cpu"; }
void* AllocateRaw(size_t alignment, size_t num_bytes) override {
if (num_bytes > static_cast<size_t>(LargeAllocationWarningBytes()) &&
single_allocation_warning_count_ < kMaxSingleAllocationWarnings) {
++single_allocation_warning_count_;
LOG(WARNING) << "Allocation of " << num_bytes << " exceeds "
<< 100 * kLargeAllocationWarningThreshold
<< "% of free system memory.";
}
void* p = port::AlignedMalloc(num_bytes, alignment);
if (cpu_allocator_collect_stats) {
const std::size_t alloc_size = port::MallocExtension_GetAllocatedSize(p);
mutex_lock l(mu_);
++stats_.num_allocs;
stats_.bytes_in_use += alloc_size;
stats_.peak_bytes_in_use =
std::max<int64_t>(stats_.peak_bytes_in_use, stats_.bytes_in_use);
stats_.largest_alloc_size =
std::max<int64_t>(stats_.largest_alloc_size, alloc_size);
if (stats_.bytes_in_use > TotalAllocationWarningBytes() &&
total_allocation_warning_count_ < kMaxTotalAllocationWarnings) {
++total_allocation_warning_count_;
LOG(WARNING) << "Total allocated memory " << stats_.bytes_in_use
<< "exceeds " << 100 * kTotalAllocationWarningThreshold
<< "% of free system memory";
}
if (p != nullptr) {
AddTraceMe("MemoryAllocation", p, num_bytes, alloc_size);
}
}
return p;
}
void DeallocateRaw(void* ptr) override {
if (cpu_allocator_collect_stats) {
const std::size_t alloc_size =
port::MallocExtension_GetAllocatedSize(ptr);
mutex_lock l(mu_);
stats_.bytes_in_use -= alloc_size;
AddTraceMe("MemoryDeallocation", ptr, 0, alloc_size);
}
port::AlignedFree(ptr);
}
void AddTraceMe(absl::string_view traceme_name, const void* chunk_ptr,
std::size_t req_bytes, std::size_t alloc_bytes) {
tensorflow::profiler::TraceMe::InstantActivity(
[this, traceme_name, chunk_ptr, req_bytes,
alloc_bytes]() TF_NO_THREAD_SAFETY_ANALYSIS {
const auto& annotation =
profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation();
return tensorflow::profiler::TraceMeEncode(
traceme_name, {{"allocator_name", Name()},
{"bytes_reserved", stats_.bytes_reserved},
{"bytes_allocated", stats_.bytes_in_use},
{"peak_bytes_in_use", stats_.peak_bytes_in_use},
{"requested_bytes", req_bytes},
{"allocation_bytes", alloc_bytes},
{"addr", reinterpret_cast<uint64>(chunk_ptr)},
{"tf_op", annotation.pending_op_name},
{"id", annotation.pending_step_id},
{"region_type", annotation.pending_region_type},
{"data_type", annotation.pending_data_type},
{"shape", annotation.pending_shape_func()}});
},
/*level=*/profiler::TraceMeLevel::kInfo);
}
absl::optional<AllocatorStats> GetStats() override {
if (!cpu_allocator_collect_stats) return absl::nullopt;
mutex_lock l(mu_);
return stats_;
}
bool ClearStats() override {
if (!cpu_allocator_collect_stats) return false;
mutex_lock l(mu_);
stats_.num_allocs = 0;
stats_.peak_bytes_in_use = stats_.bytes_in_use;
stats_.largest_alloc_size = 0;
return true;
}
size_t AllocatedSizeSlow(const void* ptr) const override {
return port::MallocExtension_GetAllocatedSize(ptr);
}
AllocatorMemoryType GetMemoryType() const override {
return AllocatorMemoryType::kHostPageable;
}
private:
mutex mu_;
AllocatorStats stats_ TF_GUARDED_BY(mu_);
// Use <atomic> for single allocations to avoid mutex contention when
// statistics are disabled.
std::atomic<int> single_allocation_warning_count_;
int total_allocation_warning_count_ TF_GUARDED_BY(mu_);
TF_DISALLOW_COPY_AND_ASSIGN(CPUAllocator);
};
// Register the CPU allocator
REGISTER_MEM_ALLOCATOR("DefaultCPUAllocator", 100, CPUAllocatorFactory);
Allocator registration
tensorflow/core/framework/allocator_registry.h
class AllocatorFactoryRegistry {
public:
AllocatorFactoryRegistry() {}
~AllocatorFactoryRegistry() {}
void Register(const char* source_file, int source_line, const string& name,
int priority, AllocatorFactory* factory);
// Returns 'best fit' Allocator. Find the factory with the highest priority
// and return an allocator constructed by it. If multiple factories have
// been registered with the same priority, picks one by unspecified criteria.
Allocator* GetAllocator();
// Returns 'best fit' SubAllocator. First look for the highest priority
// factory that is NUMA-enabled. If none is registered, fall back to the
// highest priority non-NUMA-enabled factory. If NUMA-enabled, return a
// SubAllocator specific to numa_node, otherwise return a NUMA-insensitive
// SubAllocator.
SubAllocator* GetSubAllocator(int numa_node);
// Returns the singleton value.
static AllocatorFactoryRegistry* singleton();
ProcessStateInterface* process_state() const { return process_state_; }
protected:
friend class ProcessState;
ProcessStateInterface* process_state_ = nullptr;
private:
mutex mu_;
bool first_alloc_made_ = false;
struct FactoryEntry {
const char* source_file;
int source_line;
string name;
int priority;
std::unique_ptr<AllocatorFactory> factory;
std::unique_ptr<Allocator> allocator;
// Index 0 corresponds to kNUMANoAffinity, other indices are (numa_node +
// 1).
std::vector<std::unique_ptr<SubAllocator>> sub_allocators;
};
std::vector<FactoryEntry> factories_ TF_GUARDED_BY(mu_);
// Returns any FactoryEntry registered under 'name' and 'priority',
// or 'nullptr' if none found.
const FactoryEntry* FindEntry(const string& name, int priority) const
TF_EXCLUSIVE_LOCKS_REQUIRED(mu_);
TF_DISALLOW_COPY_AND_ASSIGN(AllocatorFactoryRegistry);
};
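The selection rule described in the comments on GetAllocator (the factory with the highest priority wins) can be sketched as follows. This is a simplified, single-threaded illustration: `MiniRegistry` and `NamedAllocator` are hypothetical names, and the real registry additionally handles locking, NUMA-aware sub-allocators, and duplicate registrations.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical allocator stand-in that only carries a name.
struct NamedAllocator {
  std::string name;
};

class MiniRegistry {
 public:
  using Factory = std::function<std::unique_ptr<NamedAllocator>()>;

  void Register(std::string name, int priority, Factory factory) {
    assert(factory != nullptr);
    entries_.push_back(
        Entry{std::move(name), priority, std::move(factory), nullptr});
  }

  // Returns the allocator built by the highest-priority factory,
  // or nullptr if nothing has been registered.
  NamedAllocator* GetAllocator() {
    Entry* best = nullptr;
    for (Entry& e : entries_) {
      if (best == nullptr || e.priority > best->priority) best = &e;
    }
    if (best == nullptr) return nullptr;
    // Build lazily and cache, so repeated calls return the same object.
    if (!best->allocator) best->allocator = best->factory();
    return best->allocator.get();
  }

 private:
  struct Entry {
    std::string name;
    int priority;
    Factory factory;
    std::unique_ptr<NamedAllocator> allocator;
  };
  std::vector<Entry> entries_;
};
```

Under this scheme, a factory registered with priority 100, as in the REGISTER_MEM_ALLOCATOR call for the CPU allocator, is chosen over any factory registered with a lower priority.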