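Preface
This series tries to lay out the NVDLA kernel-mode driver clearly. If any part of the analysis is wrong, please point it out.
A good deal of code has already been covered; the earlier posts are:
Part 1: NVDLA Kernel-Mode Driver Code Notes 1
Part 2: NVDLA Kernel-Mode Driver Code Notes 2
Part 3: NVDLA Kernel-Mode Driver Code Notes 3
Part 4: NVDLA Kernel-Mode Driver Code Notes 4
Part 5: NVDLA Kernel-Mode Driver Code Notes 5
Part 6: NVDLA Kernel-Mode Driver Code Notes 6
See also the series on hardware signals and architecture:
Architecture introduction: NVDLA Kernel-Mode Driver Code Notes 3
Part 1: NVDLA Hardware Signals and Architecture Design Notes 1
Part 2: NVDLA Hardware Signals and Architecture Design Notes 2
Part 3: NVDLA Hardware Signals and Architecture Design Notes 3
This post analyzes the code in conv.c and the registers it touches. The details of how convolution is implemented in the architecture were already covered in NVDLA Hardware Signals and Architecture Design Notes 2. Tip: read the register material in Part 1, NVDLA Hardware Signals and Architecture Design Notes 1, before this post.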
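I. Structures and functions
1.1 A few interesting lookup tables: the link to the user-mode driver code
Once you have read the code and are familiar with the Convolution Core hardware architecture, it becomes fairly easy to see why the following working modes need to be configured.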
static const uint8_t map_precision[] = {
FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, INT8),
FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, INT16),
FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, FP16),
};
static const uint8_t map_conv[] = {
FIELD_ENUM(CACC_D_MISC_CFG_0, CONV_MODE, DIRECT),
FIELD_ENUM(CACC_D_MISC_CFG_0, CONV_MODE, WINOGRAD),
};
static const uint8_t map_weight_fmt[] = {
FIELD_ENUM(CSC_D_WEIGHT_FORMAT_0, WEIGHT_FORMAT, UNCOMPRESSED),
FIELD_ENUM(CSC_D_WEIGHT_FORMAT_0, WEIGHT_FORMAT, COMPRESSED),
};
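Data precision has to be fixed at the CDMA stage. Note that the CMAC stage only carries out the four levels of operation (Stripe Operation and so on) and does not decide whether Direct Convolution or Winograd Convolution is used, so the mode is determined at the CACC stage. The relationship between CSC and CBUF looks more like that between a first-level and a second-level cache: CSC fetches data from CBUF and sends it to CMAC, decides whether to send more depending on whether CMAC is empty or full, and has dedicated counters for this. A few questions remain, though:
1. The `CDMA` stage decides how data is laid out in `CBUF_RAM` for the next stage depending on whether `Direct Convolution`, `Winograd Convolution` or `Direct Convolution For Input Image` is used.
2. If weight sparsity is taken into account, the `CDMA` stage also has to bring in the WMB tag data.
3. Given the two points above, whether weight sparsity is considered should be configured early, not at the `CSC` stage. We set this aside for now and see how later code answers it.
Let us first look at the syntax, tracing FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, INT8) back as an example: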
FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, INT8)
||
||
\/
#define FIELD_ENUM(r, f, e) (r##_##f##_##e)
#define CDMA_D_MISC_CFG_0 _MK_ADDR_CONST(0x3014)
||
||
\/
#ifndef _MK_ADDR_CONST
#define _MK_ADDR_CONST(_constant_) _constant_
#endif
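Step by step: #define FIELD_ENUM(r, f, e) (r##_##f##_##e) takes three arguments r, f and e and joins them with the ## operator. FIELD_ENUM builds the name of a constant from a register, a field and an enum value. For example, FIELD_ENUM(GLB, STATUS, ENABLE) expands to GLB_STATUS_ENABLE. _MK_ADDR_CONST simply yields the constant it is given, so _MK_ADDR_CONST(0x3014) is 0x3014. Because operands of ## are not macro-expanded before pasting, the address macro CDMA_D_MISC_CFG_0 never takes part in this expansion: FIELD_ENUM(CDMA_D_MISC_CFG_0, IN_PRECISION, INT8) expands to the identifier CDMA_D_MISC_CFG_0_IN_PRECISION_INT8, which is itself a macro defined in the register header.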
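To make the token pasting concrete, here is a minimal sketch. The `DEMO_*` names and their values are invented for illustration and are not taken from the NVDLA headers; only the shape of `FIELD_ENUM` matches the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the driver macro quoted above. */
#define FIELD_ENUM(r, f, e) (r##_##f##_##e)

/* Hypothetical register and field enums (values are assumptions). */
#define DEMO_REG_0            0x3014 /* register address, unused by FIELD_ENUM */
#define DEMO_REG_0_MODE_FAST  0x0    /* enum value of field MODE */
#define DEMO_REG_0_MODE_SLOW  0x1

/* Software index -> hardware enum value, like map_precision[] and friends. */
const uint8_t demo_map_mode[] = {
    FIELD_ENUM(DEMO_REG_0, MODE, FAST), /* pastes to DEMO_REG_0_MODE_FAST */
    FIELD_ENUM(DEMO_REG_0, MODE, SLOW), /* pastes to DEMO_REG_0_MODE_SLOW */
};

/* Look up the hardware value for a software-side index. */
uint8_t demo_mode_to_hw(unsigned index)
{
    assert(index < sizeof(demo_map_mode) / sizeof(demo_map_mode[0]));
    return demo_map_mode[index];
}
```

Note that `DEMO_REG_0` is defined as an address, yet the pasted name still starts with the literal text `DEMO_REG_0`, because the argument is not expanded before `##` is applied.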
Reading on, we find one very long definition:
static const uint8_t map_img_fmt[][2] = {
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R8), 1},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R10), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R12), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R16), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R16_I), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R16_F), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A16B16G16R16), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_X16B16G16R16), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A16B16G16R16_F), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A16Y16U16V16), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_V16U16Y16A16), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A16Y16U16V16_F), 8},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A8B8G8R8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A8R8G8B8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_B8G8R8A8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R8G8B8A8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_X8B8G8R8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_X8R8G8B8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_B8G8R8X8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R8G8B8X8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A2B10G10R10), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A2R10G10B10), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_B10G10R10A2), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_R10G10B10A2), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A2Y10U10V10), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_V10U10Y10A2), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_A8Y8U8V8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_V8U8Y8A8), 4},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y8___U8V8_N444), 1},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y8___V8U8_N444), 1},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y10___U10V10_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y10___V10U10_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y12___U12V12_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y12___V12U12_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y16___U16V16_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
PIXEL_FORMAT, T_Y16___V16U16_N444), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, FEATURE), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, PIXEL), 1},
};
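First, look at the second column, which contains only the values 1, 2, 4 and 8. Set against the pixel formats, these most likely give the number of bytes per pixel rather than a bit depth: T_R8 takes 1 byte, the 10-, 12- and 16-bit single-channel formats take 2, the packed 32-bit formats take 4 and the packed 64-bit formats take 8. Next, look at T_R8, T_R10 and so on: the same names also appear in the user-mode driver code. /umd/apps/runtime/DImage.h contains: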
class NvDlaImage
{
public:
typedef enum _PixelFormat
{
T_R8 = 0,
T_R8_I = 1,
T_R10 = 2,
T_R12 = 3,
T_R16 = 4,
T_R16_I = 5,
T_R16_F = 6,
T_A16B16G16R16 = 7,
T_A16B16G16R16_F = 8,
T_X16B16G16R16 = 9,
T_A16Y16U16V16 = 10,
T_A16Y16U16V16_F = 11,
T_V16U16Y16A16 = 12,
T_A8B8G8R8 = 13,
T_A8R8G8B8 = 14,
T_B8G8R8A8 = 15,
T_R8G8B8A8 = 16,
T_X8B8G8R8 = 17,
T_X8R8G8B8 = 18,
T_B8G8R8X8 = 19,
T_R8G8B8X8 = 20,
T_A2B10G10R10 = 21,
T_A2R10G10B10 = 22,
T_B10G10R10A2 = 23,
T_R10G10B10A2 = 24,
T_A2Y10U10V10 = 25,
T_V10U10Y10A2 = 26,
T_A8Y8U8V8 = 27,
T_V8U8Y8A8 = 28,
T_Y8___U8V8_N444 = 29,
T_Y8___V8U8_N444 = 30,
T_Y10___U10V10_N444 = 31,
T_Y10___V10U10_N444 = 32,
T_Y12___U12V12_N444 = 33,
T_Y12___V12U12_N444 = 34,
T_Y16___U16V16_N444 = 35,
T_Y16___V16U16_N444 = 36,
D_F8_CHW_I = 37,
D_F16_CHW_I = 38,
D_F16_CHW_F = 39,
D_F8_CxHWx_x32_I = 40,
D_F8_CxHWx_x8_I = 41,
D_F16_CxHWx_x16_I = 42,
D_F16_CxHWx_x16_F = 43,
D_F32_CHW_F = 44,
D_F32_CxHWx_x8_F = 45,
T_R8G8B8 = 46,
T_B8G8R8 = 47,
} PixelFormat;
......
A careful reader will notice that the last two rows of the table do not appear in the enum above:
......
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, FEATURE), 2},
{FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, PIXEL), 1},
......
Similar names show up in /umd/core/include/nvdla/IType.h. FEATURE is listed there; PIXEL is not, and the closest match appears to be IMAGE:
class DataCategory
{
public:
typedef NvU8 UnderlyingType;
enum Enum {
IMAGE = NVDLA_DATA_CATEGORY_IMAGE,
WEIGHT = NVDLA_DATA_CATEGORY_WEIGHT,
FEATURE = NVDLA_DATA_CATEGORY_FEATURE,
PLANAR = NVDLA_DATA_CATEGORY_PLANAR,
BIAS = NVDLA_DATA_CATEGORY_BIAS,
};
static inline UnderlyingType max() { return 4U; }
const char* c_str() const {
const char * names[5] = { "IMAGE", "WEIGHT", "FEATURE", "PLANAR", "BIAS" };
return names[m_v];
}
ENUM_CLASS_MEMBERS(DataCategory);
};
template<> inline int EnumMax<DataCategory>() { return DataCategory::max() + 1; } // used by checkers
||
||
\/
#define NVDLA_DATA_CATEGORY_IMAGE 0U
#define NVDLA_DATA_CATEGORY_WEIGHT 1U
#define NVDLA_DATA_CATEGORY_FEATURE 2U
#define NVDLA_DATA_CATEGORY_PLANAR 3U
#define NVDLA_DATA_CATEGORY_BIAS 4U
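Appearing in the user-mode driver does not prove anything by itself, because FIELD_ENUM is only a token-pasting operation. Still, the correspondence with the user-mode driver is a pleasant surprise; the two may well be related, but that is a topic for later. A search for PIXEL_FORMAT (no kernel definition of DATAIN_FORMAT turned up) found this definition in the kernel: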
/* Primary surface pixel format select */
typedef enum _PIXEL_FORMAT {
_8BPP = 0, _15BPP, _16BPP, _24BPP, _32BPP
} PIXEL_FORMAT;
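The names _8BPP, _15BPP, _16BPP, _24BPP and _32BPP are interesting. The description referenced here explains them as follows:
24BPP layout: the first 8 bits are red, the middle 8 bits green and the last 8 bits blue:
RGB888 R7 R6 R5 R4 R3 R2 R1 R0 G7 G6 G5 G4 G3 G2 G1 G0 B7 B6 B5 B4 B3 B2 B1 B0
16BPP layout: the first 5 bits are red, the middle 6 bits green and the last 5 bits blue:
16bit RGB565 R4 R3 R2 R1 R0 G5 G4 G3 G2 G1 G0 B4 B3 B2 B1 B0
How are the two converted into each other?
24BPP -> 16BPP
24BPP R7 R6 R5 R4 R3 R2 R1 R0 G7 G6 G5 G4 G3 G2 G1 G0 B7 B6 B5 B4 B3 B2 B1 B0
16BPP R7 R6 R5 R4 R3 G7 G6 G5 G4 G3 G2 B7 B6 B5 B4 B3
That is, the high bits of each 24BPP channel are copied into the 16BPP value. This loses some color precision, but the effect is small.
16BPP -> 24BPP
16BPP R4 R3 R2 R1 R0 G5 G4 G3 G2 G1 G0 B4 B3 B2 B1 B0
24BPP R4 R3 R2 R1 R0 0 0 0 G5 G4 G3 G2 G1 G0 0 0 B4 B3 B2 B1 B0 0 0 0
That is, each 16BPP channel value is placed in the high bits of the corresponding 24BPP channel.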
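The two conversions described above can be sketched in C as follows. This is an illustrative helper, not code from the NVDLA driver; the function names are made up, and the 24-bit value is assumed to be packed as 0xRRGGBB.

```c
#include <assert.h>
#include <stdint.h>

/* 24BPP -> 16BPP: keep the high 5/6/5 bits of R, G and B. */
uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* 16BPP -> 24BPP: put each channel in the high bits, zero-fill the rest. */
uint32_t rgb565_to_rgb888(uint16_t p)
{
    uint32_t r = (uint32_t)((p >> 11) & 0x1F) << 3;
    uint32_t g = (uint32_t)((p >> 5) & 0x3F) << 2;
    uint32_t b = (uint32_t)(p & 0x1F) << 3;

    return (r << 16) | (g << 8) | b; /* packed as 0xRRGGBB */
}

/* Round trip of pure white: the low 3/2/3 bits are lost. */
void rgb_roundtrip_demo(void)
{
    uint16_t p = rgb888_to_rgb565(0xFF, 0xFF, 0xFF);

    assert(p == 0xFFFF);
    assert(rgb565_to_rgb888(p) == 0xF8FCF8);
}
```

The round trip shows the precision loss mentioned in the text: white comes back as 0xF8FCF8 instead of 0xFFFFFF.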
Reading on, here is another group of tables:
static const uint8_t map_pixel[] = {
FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0, PIXEL_MAPPING, PITCH_LINEAR),
};
static const uint8_t map_ram[] = {
FIELD_ENUM(CDMA_D_DAIN_RAM_TYPE_0, DATAIN_RAM_TYPE, MCIF),
FIELD_ENUM(CDMA_D_DAIN_RAM_TYPE_0, DATAIN_RAM_TYPE, CVIF),
};
static const uint8_t map_mean[] = {
FIELD_ENUM(CDMA_D_MEAN_FORMAT_0, MEAN_FORMAT, DISABLE),
FIELD_ENUM(CDMA_D_MEAN_FORMAT_0, MEAN_FORMAT, ENABLE),
};
All of these parameters configure CDMA.
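1.2 dla_conv_stat_data: the link to the register configuration in the hardware RTL
Reading on, dla_conv_stat_data gathers the statistics that are reported later: it reads the CDMA stall, latency, NaN and Inf counters and the CACC saturation counter, and records the run time of the group: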
void
dla_conv_stat_data(struct dla_processor *processor,
struct dla_processor_group *group)
{
uint64_t end_time = 0;
struct dla_conv_stat_desc *conv_stat;
conv_stat = &processor->stat_data_desc->conv_stat;
end_time = dla_get_time_us();
conv_stat->data_read_stall = cdma_reg_read(D_PERF_DAT_READ_STALL);
conv_stat->weight_read_stall = cdma_reg_read(D_PERF_WT_READ_STALL);
conv_stat->data_read_latency = cdma_reg_read(D_PERF_DAT_READ_LATENCY);
conv_stat->weight_read_latency = cdma_reg_read(D_PERF_WT_READ_LATENCY);
conv_stat->nan_data_num = cdma_reg_read(D_NAN_INPUT_DATA_NUM);
conv_stat->nan_weight_num = cdma_reg_read(D_NAN_INPUT_WEIGHT_NUM);
conv_stat->inf_data_num = cdma_reg_read(D_INF_INPUT_DATA_NUM);
conv_stat->inf_weight_num = cdma_reg_read(D_INF_INPUT_WEIGHT_NUM);
conv_stat->saturation_count = cacc_reg_read(D_OUT_SATURATION);
conv_stat->runtime = (uint32_t)(end_time - group->start_time);
}
Tracing the structures involved:
struct dla_processor {
const char *name;
uint8_t op_type;
uint8_t consumer_ptr;
uint8_t roi_index;
uint8_t group_status;
uint8_t rdma_status;
uint8_t last_group;
struct dla_common_op_desc *tail_op;
struct dla_processor_group groups[DLA_NUM_GROUPS];
union dla_stat_container *stat_data_desc;
int32_t (*is_ready)(struct dla_processor *processor,
struct dla_processor_group *group);
int32_t (*enable)(struct dla_processor_group *group);
int32_t (*program)(struct dla_processor_group *group);
void (*set_producer)(int32_t group_id, int32_t rdma_id);
void (*dump_config)(struct dla_processor_group *group);
void (*rdma_check)(struct dla_processor_group *group);
void (*get_stat_data)(struct dla_processor *processor,
struct dla_processor_group *group);
void (*dump_stat)(struct dla_processor *processor);
};
||
||
\/
union dla_stat_container { // stat_data_desc: every sub-module on the pipeline has its own statistics structure
struct dla_bdma_stat_desc bdma_stat;
struct dla_conv_stat_desc conv_stat;
struct dla_sdp_stat_desc sdp_stat;
struct dla_pdp_stat_desc pdp_stat;
struct dla_cdp_stat_desc cdp_stat;
struct dla_rubik_stat_desc rubik_stat;
};
||
||
\/
struct dla_conv_stat_desc {
uint32_t data_read_stall;
uint32_t weight_read_stall;
uint32_t data_read_latency;
uint32_t weight_read_latency;
uint32_t saturation_count;
uint32_t nan_data_num;
uint32_t nan_weight_num;
uint32_t inf_data_num;
uint32_t inf_weight_num;
uint32_t runtime;
} __packed __aligned(4);
There are also a few interesting helper macros:
conv_stat->data_read_stall = cdma_reg_read(D_PERF_DAT_READ_STALL);
||
||
\/
#define cdma_reg_read(reg) reg_read(CDMA_REG(reg))
||
||
\/
#define CDMA_REG(name) CDMA_##name##_0
Searching next for CDMA_D_PERF_DAT_READ_STALL_0 gives the following register definitions:
// Register CDMA_D_PERF_DAT_READ_STALL_0
#define CDMA_D_PERF_DAT_READ_STALL_0 (_MK_ADDR_CONST(0x50d8))
#define CDMA_D_PERF_DAT_READ_STALL_0_SECURE (0x0)
#define CDMA_D_PERF_DAT_READ_STALL_0_DUAL (0x0)
#define CDMA_D_PERF_DAT_READ_STALL_0_SCR (0)
#define CDMA_D_PERF_DAT_READ_STALL_0_WORD_COUNT (0x1)
#define CDMA_D_PERF_DAT_READ_STALL_0_RESET_VAL (_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_RESET_MASK (_MK_MASK_CONST(0xffffffff))
#define CDMA_D_PERF_DAT_READ_STALL_0_SW_DEFAULT_VAL (_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_SW_DEFAULT_MASK (_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_READ_MASK (_MK_MASK_CONST(0xffffffff))
#define CDMA_D_PERF_DAT_READ_STALL_0_WRITE_MASK (_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_SHIFT (_MK_SHIFT_CONST(0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_FIELD \
(_MK_FIELD_CONST(0xffffffff, \
CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_SHIFT))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_RANGE (31:0)
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_WOFFSET (0x0)
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_DEFAULT \
(_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_DEFAULT_MASK \
(_MK_MASK_CONST(0xffffffff))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_SW_DEFAULT \
(_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_SW_DEFAULT_MASK \
(_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_PARITY_PROTECTION \
(_MK_MASK_CONST(0x0))
#define CDMA_D_PERF_DAT_READ_STALL_0_DAT_RD_STALL_PLATFORM_DEPENDENT \
(_MK_MASK_CONST(0x1))
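The macro chain above can be tried out against a simulated register file. In this sketch `reg_read` and `reg_write` are stand-ins backed by an array (an assumption; the real driver goes through its own register access layer), while the three macros are copied from the listings above.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 64 KiB register space (assumption, for illustration only). */
#define DEMO_REG_SPACE_BYTES 0x10000u
static uint32_t demo_regs[DEMO_REG_SPACE_BYTES / 4];

uint32_t reg_read(uint32_t addr)
{
    assert(addr < DEMO_REG_SPACE_BYTES && (addr & 3u) == 0);
    return demo_regs[addr / 4];
}

void reg_write(uint32_t addr, uint32_t val)
{
    assert(addr < DEMO_REG_SPACE_BYTES && (addr & 3u) == 0);
    demo_regs[addr / 4] = val;
}

/* Copied from the driver listings above. */
#define _MK_ADDR_CONST(_constant_) _constant_
#define CDMA_D_PERF_DAT_READ_STALL_0 (_MK_ADDR_CONST(0x50d8))
#define CDMA_REG(name) CDMA_##name##_0
#define cdma_reg_read(reg) reg_read(CDMA_REG(reg))

/* Expands to reg_read(CDMA_D_PERF_DAT_READ_STALL_0), i.e. reg_read((0x50d8)). */
uint32_t demo_read_data_stall(void)
{
    return cdma_reg_read(D_PERF_DAT_READ_STALL);
}
```

Writing a value to address 0x50d8 with `reg_write` and then calling `demo_read_data_stall()` returns that value, which confirms that the short name `D_PERF_DAT_READ_STALL` ends up as the full register address.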
Compare this with the register addresses defined on the hardware side:
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_NAN_INPUT_DATA_NUM`` | ``0x50c4`` | Count NaN number in input data cube, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_NAN_INPUT_WEIGHT_NUM`` | ``0x50c8`` | Count NaN number in weight kernels, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_INF_INPUT_DATA_NUM`` | ``0x50cc`` | Count infinity number in input data cube, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_INF_INPUT_WEIGHT_NUM`` | ``0x50d0`` | Count infinity number in weight kernels, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_PERF_ENABLE`` | ``0x50d4`` | Enable/disable performance counter |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_PERF_DAT_READ_STALL`` | ``0x50d8`` | Count blocking cycles of read request of input data, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_PERF_WT_READ_STALL`` | ``0x50dc`` | Count blocking cycles of read request of weight data, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_PERF_DAT_READ_LATENCY`` | ``0x50e0`` | Count total latency cycles of read response of input data, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
| ``D_PERF_WT_READ_LATENCY`` | ``0x50e4`` | Count total latency cycles of read request of weight data, update per layer |
+-------------------------------+----------------+--------------------------------------------------------------------------------------------------------------------+
And with the corresponding registers in NV_NVDLA_CDMA_dual_reg.v:
wire nvdla_cdma_d_inf_input_data_num_0_wren = (reg_offset_wr == (32'h50cc & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_inf_input_weight_num_0_wren = (reg_offset_wr == (32'h50d0 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_nan_flush_to_zero_0_wren = (reg_offset_wr == (32'h50c0 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_nan_input_data_num_0_wren = (reg_offset_wr == (32'h50c4 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_nan_input_weight_num_0_wren = (reg_offset_wr == (32'h50c8 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_perf_dat_read_latency_0_wren = (reg_offset_wr == (32'h50e0 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_perf_dat_read_stall_0_wren = (reg_offset_wr == (32'h50d8 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_perf_enable_0_wren = (reg_offset_wr == (32'h50d4 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_perf_wt_read_latency_0_wren = (reg_offset_wr == (32'h50e4 & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
wire nvdla_cdma_d_perf_wt_read_stall_0_wren = (reg_offset_wr == (32'h50dc & 32'h00000fff)) & reg_wr_en ; //spyglass disable UnloadedNet-ML //(W528)
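The 32'h0000_0fff in the .v code is there because the address space the hardware reserves for CDMA runs from 32'h0000_5000 to 32'h0000_5fff, so the low 12 bits are enough to select a register inside the block. By now the picture is fairly clear: the driver does its work by accessing registers, that is, addresses. That covers the middle of the process; how a driver run begins and ends is still unknown. Let us read on!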
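The address decoding can be checked with a few lines of C. This is a sketch of the arithmetic only; `CDMA_BASE`, `BLOCK_MASK` and the function names are illustrative and not taken from the RTL or the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CDMA_BASE  0x5000u /* start of the CDMA register block */
#define BLOCK_MASK 0x0fffu /* low 12 bits: offset inside a 4 KiB block */

/* Offset of a register inside its block, e.g. 0x50d8 -> 0x0d8. */
uint32_t block_offset(uint32_t addr)
{
    return addr & BLOCK_MASK;
}

/* True if the address falls in 0x5000..0x5fff. */
bool in_cdma_block(uint32_t addr)
{
    return (addr & ~BLOCK_MASK) == CDMA_BASE;
}

/* Same shape as the *_wren expressions in the Verilog listing. */
bool demo_wren(uint32_t reg_offset_wr, uint32_t reg_addr, bool reg_wr_en)
{
    assert(reg_offset_wr <= BLOCK_MASK); /* the offset is only 12 bits wide */
    return (reg_offset_wr == (reg_addr & BLOCK_MASK)) && reg_wr_en;
}
```

For D_PERF_DAT_READ_STALL at 0x50d8, the block-internal offset is 0x0d8, which is what the `32'h50d8 & 32'h00000fff` term in the RTL evaluates to.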
1.3 get_in_format
Reading on, get_in_format determines whether the input format is Feature or Pixel:
static uint32_t
get_in_format(uint8_t format)
{
uint32_t in_format = 0;
if (format >= FORMAT_T_R8 && format < FORMAT_FEATURE) {
in_format = FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, PIXEL);
} else if (format == FORMAT_FEATURE) {
in_format = FIELD_ENUM(CDMA_D_DATAIN_FORMAT_0,
DATAIN_FORMAT, FEATURE);
} else {
assert(0);
}
return in_format;
}
The related macros:
/**
* @ingroup Processors
* @name Data formats
* @brief Data formats supported by DLA engine
* @{
*/
#define FORMAT_T_R8 0
#define FORMAT_T_R10 1
#define FORMAT_T_R12 2
#define FORMAT_T_R16 3
#define FORMAT_T_R16_I 4
#define FORMAT_T_R16_F 5
#define FORMAT_T_A16B16G16R16 6
#define FORMAT_T_X16B16G16R16 7
#define FORMAT_T_A16B16G16R16_F 8
#define FORMAT_T_A16Y16U16V16 9
#define FORMAT_T_V16U16Y16A16 10
#define FORMAT_T_A16Y16U16V16_F 11
#define FORMAT_T_A8B8G8R8 12
#define FORMAT_T_A8R8G8B8 13
#define FORMAT_T_B8G8R8A8 14
#define FORMAT_T_R8G8B8A8 15
#define FORMAT_T_X8B8G8R8 16
#define FORMAT_T_X8R8G8B8 17
#define FORMAT_T_B8G8R8X8 18
#define FORMAT_T_R8G8B8X8 19
#define FORMAT_T_A2B10G10R10 20
#define FORMAT_T_A2R10G10B10 21
#define FORMAT_T_B10G10R10A2 22
#define FORMAT_T_R10G10B10A2 23
#define FORMAT_T_A2Y10U10V10 24
#define FORMAT_T_V10U10Y10A2 25
#define FORMAT_T_A8Y8U8V8 26
#define FORMAT_T_V8U8Y8A8 27
#define FORMAT_T_Y8___U8V8_N444 28
#define FORMAT_T_Y8___V8U8_N444 29
#define FORMAT_T_Y10___U10V10_N444 30
#define FORMAT_T_Y10___V10U10_N444 31
#define FORMAT_T_Y12___U12V12_N444 32
#define FORMAT_T_Y12___V12U12_N444 33
#define FORMAT_T_Y16___U16V16_N444 34
#define FORMAT_T_Y16___V16U16_N444 35
#define FORMAT_FEATURE 36
Values from 0 to 35 are all PIXEL, and 36 is FEATURE. Following how FIELD_ENUM works, we find these three macros:
#define CDMA_D_DATAIN_FORMAT_0 _MK_ADDR_CONST(0x3018)
#define CDMA_D_DATAIN_FORMAT_0_DATAIN_FORMAT_FEATURE _MK_ENUM_CONST(0x0)
#define CDMA_D_DATAIN_FORMAT_0_DATAIN_FORMAT_PIXEL _MK_ENUM_CONST(0x1)
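The classification can be restated as a small stand-alone sketch. The `demo_*` functions and the shortened `DATAIN_FORMAT_*` names are illustrative; the numeric values are the ones quoted above. Unlike the driver, the first function returns -1 for an unknown format instead of asserting.

```c
#include <assert.h>
#include <stdint.h>

#define FORMAT_T_R8    0
#define FORMAT_FEATURE 36

/* Hardware enum values as quoted above (names shortened for the sketch). */
#define DATAIN_FORMAT_FEATURE 0x0
#define DATAIN_FORMAT_PIXEL   0x1

/* Returns the hardware enum, or -1 if the format code is unknown. */
int demo_in_format(uint8_t format)
{
    /* format >= FORMAT_T_R8 always holds for an unsigned value and 0. */
    if (format < FORMAT_FEATURE)
        return DATAIN_FORMAT_PIXEL;   /* codes 0..35 */
    if (format == FORMAT_FEATURE)
        return DATAIN_FORMAT_FEATURE; /* code 36 */
    return -1;
}

/* Strict variant that mirrors the driver's assert(0) on unknown formats. */
uint32_t demo_in_format_strict(uint8_t format)
{
    int v = demo_in_format(format);

    assert(v >= 0);
    return (uint32_t)v;
}
```

One detail worth noticing: since FORMAT_T_R8 is 0 and the argument is a uint8_t, the lower-bound check in the original function can never fail, so only the upper bound actually separates the cases.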
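1.4 dla_conv_set_producer: writing registers
Reading on, dla_conv_set_producer takes the number of the selected ping-pong register group and writes it to the S_POINTER register of the Convolution Core sub-modules cacc, cmac (A and B), csc and cdma. S_POINTER is the pointer that the CSB Master and the data path use to access the register Groups: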
void
dla_conv_set_producer(int32_t group_id, int32_t rdma_group_id)
{
uint32_t reg;
/* set producer pointer for all sub-modules */
reg = group_id << SHIFT(CACC_S_POINTER_0, PRODUCER);
cacc_reg_write(S_POINTER, reg);
cmac_a_reg_write(S_POINTER, reg);
cmac_b_reg_write(S_POINTER, reg);
csc_reg_write(S_POINTER, reg);
cdma_reg_write(S_POINTER, reg);
}
Tracing the macros involved:
reg = group_id << SHIFT(CACC_S_POINTER_0, PRODUCER);
||
||
\/
#define SHIFT(reg, field) (reg##_##field##_SHIFT)
#define CACC_S_POINTER_0 _MK_ADDR_CONST(0x7004)
||
||
\/
#define CACC_S_POINTER_0_PRODUCER_SHIFT _MK_SHIFT_CONST(0)
#define CACC_S_POINTER_0_PRODUCER_FIELD _MK_FIELD_CONST(0x1, CACC_S_POINTER_0_PRODUCER_SHIFT)
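So the value written is group_id << 0. Very likely group_id is the number of the ping-pong register Group being selected, since the code above writes it to cacc, cmac, csc and cdma, the sub-modules of the Convolution Core. Next, trace the register-write macros: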
cacc_reg_write(S_POINTER, reg);
=> #define cacc_reg_write(reg, val) reg_write(CACC_REG(reg), val)
=> #define CACC_REG(name) CACC_##name##_0
=> the actual expression is reg_write(CACC_S_POINTER_0, reg), i.e. the value of reg is written to the `CACC_S_POINTER_0` register
cmac_a_reg_write(S_POINTER, reg);
=> #define cmac_a_reg_write(reg, val) reg_write(CMAC_A_REG(reg), val)
=> #define CMAC_A_REG(name) CMAC_A_##name##_0
=> the actual expression is reg_write(CMAC_A_S_POINTER_0, reg), i.e. the value of reg is written to the `CMAC_A_S_POINTER_0` register
cmac_b_reg_write(S_POINTER, reg);
=> #define cmac_b_reg_write(reg, val) reg_write(CMAC_B_REG(reg), val)
=> #define CMAC_B_REG(name) CMAC_B_##name##_0
=> the actual expression is reg_write(CMAC_B_S_POINTER_0, reg), i.e. the value of reg is written to the `CMAC_B_S_POINTER_0` register
csc_reg_write(S_POINTER, reg);
=> #define csc_reg_write(reg, val) reg_write(CSC_REG(reg), val)
=> #define CSC_REG(name) CSC_##name##_0
=> the actual expression is reg_write(CSC_S_POINTER_0, reg), i.e. the value of reg is written to the `CSC_S_POINTER_0` register
cdma_reg_write(S_POINTER, reg);
=> #define cdma_reg_write(reg, val) reg_write(CDMA_REG(reg), val)
=> #define CDMA_REG(name) CDMA_##name##_0
=> the actual expression is reg_write(CDMA_S_POINTER_0, reg), i.e. the value of reg is written to the `CDMA_S_POINTER_0` register
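CMAC has two independent units, hence the A and B suffixes. Next, trace the addresses of the registers mentioned above. Note that the earlier listing gave CACC_S_POINTER_0 as 0x7004 while this one gives 0x9004; the two values presumably come from register headers for different hardware configurations, so check which header your build actually includes: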
#define CACC_S_POINTER_0 (_MK_ADDR_CONST(0x9004))
#define CMAC_A_S_POINTER_0 (_MK_ADDR_CONST(0x7004))
#define CMAC_B_S_POINTER_0 (_MK_ADDR_CONST(0x8004))
#define CSC_S_POINTER_0 (_MK_ADDR_CONST(0x6004))
#define CDMA_S_POINTER_0 (_MK_ADDR_CONST(0x5004))
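As a rough model of what dla_conv_set_producer does, the sketch below writes the producer value to the five S_POINTER addresses listed above, using a simulated register file. `demo_reg_write`, `demo_reg_read`, `demo_set_producer` and the shortened shift macro are illustrative names, and the array-backed register space is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register file standing in for the real register access. */
#define DEMO_REG_SPACE_BYTES 0x10000u
static uint32_t demo_regs[DEMO_REG_SPACE_BYTES / 4];

void demo_reg_write(uint32_t addr, uint32_t val)
{
    assert(addr < DEMO_REG_SPACE_BYTES && (addr & 3u) == 0);
    demo_regs[addr / 4] = val;
}

uint32_t demo_reg_read(uint32_t addr)
{
    assert(addr < DEMO_REG_SPACE_BYTES && (addr & 3u) == 0);
    return demo_regs[addr / 4];
}

/* S_POINTER addresses as listed above. */
#define CACC_S_POINTER_0   0x9004u
#define CMAC_A_S_POINTER_0 0x7004u
#define CMAC_B_S_POINTER_0 0x8004u
#define CSC_S_POINTER_0    0x6004u
#define CDMA_S_POINTER_0   0x5004u
#define S_POINTER_PRODUCER_SHIFT 0

/* Point all five sub-modules at the same register group (0 or 1). */
void demo_set_producer(uint32_t group_id)
{
    static const uint32_t pointer_regs[] = {
        CACC_S_POINTER_0, CMAC_A_S_POINTER_0, CMAC_B_S_POINTER_0,
        CSC_S_POINTER_0, CDMA_S_POINTER_0,
    };
    uint32_t reg;
    size_t i;

    assert(group_id < 2); /* PRODUCER is a 1-bit field (mask 0x1) */
    reg = group_id << S_POINTER_PRODUCER_SHIFT;
    for (i = 0; i < sizeof(pointer_regs) / sizeof(pointer_regs[0]); i++)
        demo_reg_write(pointer_regs[i], reg);
}
```

After `demo_set_producer(1)`, reading back any of the five addresses returns 1. The key point is that all sub-modules must agree on the group, because one convolution operation spans all of them.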
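1.5 dla_conv_enable: reading and writing registers
Reading on, dla_conv_enable first waits until the CBUF flush is reported as done, then starts the CDMA performance counter if statistics are enabled, and finally enables all sub-modules. Enabling means writing the macro constant CACC_D_OP_ENABLE_0_OP_EN_ENABLE to the D_OP_ENABLE register of cacc, cmac, csc and cdma, after which the operation starts. The code: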
int
dla_conv_enable(struct dla_processor_group *group)
{
uint32_t reg;
struct dla_engine *engine = dla_get_engine();
dla_trace("Enter: %s", __func__);
do {
reg = cdma_reg_read(S_CBUF_FLUSH_STATUS);
} while (!(reg & MASK(CDMA_S_CBUF_FLUSH_STATUS_0, FLUSH_DONE)));
if (engine->stat_enable == (uint32_t)1) {
cdma_reg_write(D_PERF_ENABLE, 1);
group->start_time = dla_get_time_us();
}
/* enable all sub-modules */
reg = FIELD_ENUM(CACC_D_OP_ENABLE_0, OP_EN, ENABLE);
cacc_reg_write(D_OP_ENABLE, reg);
cmac_a_reg_write(D_OP_ENABLE, reg);
cmac_b_reg_write(D_OP_ENABLE, reg);
csc_reg_write(D_OP_ENABLE, reg);
cdma_reg_write(D_OP_ENABLE, reg);
dla_trace("Exit: %s", __func__);
RETURN(0);
}
Let us keep tracing:
do {
reg = cdma_reg_read(S_CBUF_FLUSH_STATUS);
} while (!(reg & MASK(CDMA_S_CBUF_FLUSH_STATUS_0, FLUSH_DONE)));
Here, reg = cdma_reg_read(S_CBUF_FLUSH_STATUS) traces as follows:
=> #define cdma_reg_read(reg) reg_read(CDMA_REG(reg))
=> #define CDMA_REG(name) CDMA_##name##_0
=> so reg = cdma_reg_read(S_CBUF_FLUSH_STATUS) is equivalent to reg = reg_read(CDMA_S_CBUF_FLUSH_STATUS_0)
=> that is, it reads the value of the `CDMA_S_CBUF_FLUSH_STATUS_0` register.
And MASK(CDMA_S_CBUF_FLUSH_STATUS_0, FLUSH_DONE) traces as follows:
=> #define MASK(reg, field) (reg##_##field##_FIELD)
=> so MASK(CDMA_S_CBUF_FLUSH_STATUS_0, FLUSH_DONE) is equivalent to CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_FIELD
=> #define CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_SHIFT (_MK_SHIFT_CONST(0))
#define CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_FIELD \
(_MK_FIELD_CONST(0x1, CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_SHIFT))
=> #define _MK_FIELD_CONST(_mask_, _shift_) \
((_MK_MASK_CONST(_mask_) << _MK_SHIFT_CONST(_shift_)))
=> #define _MK_SHIFT_CONST(_constant_) (_constant_)
=> so, going one step further, CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_FIELD means 0x1 << CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_SHIFT; the shift is 0, so it is simply 0x1 << 0
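So this loop does one thing: it keeps reading the CDMA_S_CBUF_FLUSH_STATUS_0 register and ANDs the value with CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_FIELD (that is, 0x1 << CDMA_S_CBUF_FLUSH_STATUS_0_FLUSH_DONE_SHIFT, bit 0). While the result is 0 the loop continues; it exits as soon as the FLUSH_DONE bit reads 1. The likely meaning is that the CBUF flush handled by CDMA has finished, and only then may the Convolution Core be started.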
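The polling pattern can be modeled with a fake status register. Everything named `demo_*` below is hypothetical, and the bounded retry count is an addition of this sketch: the driver loop quoted above has no upper bound and spins until the bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLUSH_DONE_SHIFT 0
#define FLUSH_DONE_FIELD (0x1u << FLUSH_DONE_SHIFT)

/* Fake hardware: FLUSH_DONE becomes 1 after this many reads. */
static unsigned demo_reads_until_done;

void demo_arm_status(unsigned reads_until_done)
{
    demo_reads_until_done = reads_until_done;
}

uint32_t demo_read_flush_status(void)
{
    if (demo_reads_until_done > 0) {
        demo_reads_until_done--;
        return 0;            /* flush still in progress */
    }
    return FLUSH_DONE_FIELD; /* bit 0 set: flush done */
}

/* Poll until FLUSH_DONE is set; give up after max_polls reads. */
bool demo_wait_flush_done(unsigned max_polls)
{
    assert(max_polls > 0);
    while (max_polls-- > 0) {
        uint32_t reg = demo_read_flush_status();

        if (reg & FLUSH_DONE_FIELD)
            return true;     /* same exit condition as the driver loop */
    }
    return false;            /* timed out */
}
```

With `demo_arm_status(3)`, a call to `demo_wait_flush_done(10)` succeeds on the fourth read, while `demo_wait_flush_done(3)` gives up. Whether a timeout is appropriate in the real driver is a design question this sketch does not settle.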
Next, trace the registers:
if (engine->stat_enable == (uint32_t)1) {
cdma_reg_write(D_PERF_ENABLE, 1);
group->start_time = dla_get_time_us();
}
=> #define cdma_reg_write(reg, val) reg_write(CDMA_REG(reg), val)
=> #define CDMA_REG(name) CDMA_##name##_0
=> so cdma_reg_write(D_PERF_ENABLE, 1) is equivalent to reg_write(CDMA_D_PERF_ENABLE_0, 1)
=> its address is #define CDMA_D_PERF_ENABLE_0 (_MK_ADDR_CONST(0x50d4))
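The condition here is stat_enable equal to 1, which presumably means that statistics collection is enabled for the engine. In that case cdma_reg_write(D_PERF_ENABLE, 1) turns on the Performance Counter of the CDMA sub-module and the start time of the group is recorded. The code continues: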
/* enable all sub-modules */
reg = FIELD_ENUM(CACC_D_OP_ENABLE_0, OP_EN, ENABLE);
cacc_reg_write(D_OP_ENABLE, reg);
cmac_a_reg_write(D_OP_ENABLE, reg);
cmac_b_reg_write(D_OP_ENABLE, reg);
csc_reg_write(D_OP_ENABLE, reg);
cdma_reg_write(D_OP_ENABLE, reg);
Here, reg = FIELD_ENUM(CACC_D_OP_ENABLE_0, OP_EN, ENABLE)
=> #define FIELD_ENUM(r, f, e) (r##_##f##_##e)
=> reg = CACC_D_OP_ENABLE_0_OP_EN_ENABLE
=> and that constant is #define CACC_D_OP_ENABLE_0_OP_EN_ENABLE _MK_ENUM_CONST(0x1)
Here, cacc_reg_write(D_OP_ENABLE, reg)
=> #define cacc_reg_write(reg, val) reg_write(CACC_REG(reg), val)
=> #define CACC_REG(name) CACC_##name##_0
=> so cacc_reg_write(D_OP_ENABLE, reg) is equivalent to reg_write(CACC_D_OP_ENABLE_0, reg)
Here, cmac_a_reg_write(D_OP_ENABLE, reg)
=> #define cmac_a_reg_write(reg, val) reg_write(CMAC_A_REG(reg), val)
=> #define CMAC_A_REG(name) CMAC_A_##name##_0
=> so cmac_a_reg_write(D_OP_ENABLE, reg) is equivalent to reg_write(CMAC_A_D_OP_ENABLE_0, reg)
Here, cmac_b_reg_write(D_OP_ENABLE, reg)
=> #define cmac_b_reg_write(reg, val) reg_write(CMAC_B_REG(reg), val)
=> #define CMAC_B_REG(name) CMAC_B_##name##_0
=> so cmac_b_reg_write(D_OP_ENABLE, reg) is equivalent to reg_write(CMAC_B_D_OP_ENABLE_0, reg)
Here, csc_reg_write(D_OP_ENABLE, reg)
=> #define csc_reg_write(reg, val) reg_write(CSC_REG(reg), val)
=> #define CSC_REG(name) CSC_##name##_0
=> so csc_reg_write(D_OP_ENABLE, reg) is equivalent to reg_write(CSC_D_OP_ENABLE_0, reg)
Here, cdma_reg_write(D_OP_ENABLE, reg)
=> #define cdma_reg_write(reg, val) reg_write(CDMA_REG(reg), val)
=> #define CDMA_REG(name) CDMA_##name##_0
=> so cdma_reg_write(D_OP_ENABLE, reg) is equivalent to reg_write(CDMA_D_OP_ENABLE_0, reg)
Each D_OP_ENABLE write above sets the register to 0x1 in order to start the operation of the current Register Group.
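The order of the enable writes is worth a second look: in the code above CACC is enabled first and CDMA last, so the downstream stages are armed before the stage that fetches data. The sketch below records that order with a fake write function. The `demo_*` names are hypothetical, and symbolic block IDs are used instead of D_OP_ENABLE addresses, since those addresses are not quoted in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_block {
    DEMO_CACC, DEMO_CMAC_A, DEMO_CMAC_B, DEMO_CSC, DEMO_CDMA,
    DEMO_BLOCK_COUNT
};

#define OP_EN_ENABLE 0x1u /* value of CACC_D_OP_ENABLE_0_OP_EN_ENABLE */

static uint32_t demo_op_enable[DEMO_BLOCK_COUNT];       /* fake D_OP_ENABLE */
static enum demo_block demo_order[DEMO_BLOCK_COUNT];    /* write log */
static size_t demo_writes;

static void demo_write_op_enable(enum demo_block blk, uint32_t val)
{
    assert(demo_writes < DEMO_BLOCK_COUNT);
    demo_op_enable[blk] = val;
    demo_order[demo_writes++] = blk;
}

/* Same write order as dla_conv_enable: CACC first, CDMA last. */
void demo_conv_enable(void)
{
    demo_writes = 0;
    demo_write_op_enable(DEMO_CACC, OP_EN_ENABLE);
    demo_write_op_enable(DEMO_CMAC_A, OP_EN_ENABLE);
    demo_write_op_enable(DEMO_CMAC_B, OP_EN_ENABLE);
    demo_write_op_enable(DEMO_CSC, OP_EN_ENABLE);
    demo_write_op_enable(DEMO_CDMA, OP_EN_ENABLE);
}

int demo_is_enabled(enum demo_block blk)
{
    return demo_op_enable[blk] == OP_EN_ENABLE;
}

enum demo_block demo_last_enabled(void)
{
    assert(demo_writes > 0);
    return demo_order[demo_writes - 1];
}
```

After `demo_conv_enable()`, every block reports enabled and `demo_last_enabled()` returns `DEMO_CDMA`. Why the driver uses this order is not stated in the code; arming consumers before the producer is one plausible reading.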
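1.6 dla_conv_rdma_check
Reading on, dla_conv_rdma_check reports whether an RDMA is needed for the operation. In NVDLA naming, RDMA appears to refer to a separate read-DMA sub-unit rather than to remote DMA. For convolution the flag is always cleared: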
void
dla_conv_rdma_check(struct dla_processor_group *group)
{
group->is_rdma_needed = 0;
}
The next function, processor_conv_program, is too long for this post and is covered in the next one.
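II. Summary of conv.c functions, part 1
Function | Purpose |
---|---|
dla_conv_stat_data | Collects statistics: reads the CDMA stall, latency, NaN and Inf counters and the CACC saturation counter, and records the run time |
get_in_format | Determines whether the input format is Feature or Pixel |
dla_conv_set_producer | Writes the selected ping-pong register group number to the S_POINTER register of the Convolution Core sub-modules cacc, cmac (A and B), csc and cdma; S_POINTER is the pointer that the CSB Master and the data path use to access the register Groups |
dla_conv_enable | Waits until the CBUF flush is reported as done, starts the CDMA performance counter if statistics are enabled, then enables all sub-modules by writing the macro constant CACC_D_OP_ENABLE_0_OP_EN_ENABLE to the D_OP_ENABLE register of cacc, cmac, csc and cdma |
dla_conv_rdma_check | Reports whether an RDMA is needed; for convolution it always clears is_rdma_needed |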
Summary
This post is the first half of the walkthrough of conv.c.