NVDLA Hardware Signals and Architecture Design Notes, Part 1

DentionY

Article link: https://blog.csdn.net/weixin_41029027/article/details/134298203

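This article describes the functions of NVDLA's four main external interfaces (CSB, IRQ, SRAMIF, DBBIF) and the structure of the internal register interface, in particular the ping-pong buffering of register groups that improves hardware efficiency.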


Preface

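This series aims to sort out the hardware module design approach and the related signals, to make later modification easier. "NVDLA Kernel-Mode Driver Code Notes, Part 3" (《NVDLA内核态驱动代码整理三》) opened up the topic of the four external interface modules IRQ, CSB, SRAMIF, and DBBIF. While reading the NVDLA Hardware Architectural Specification I found that a large part of it describes signals, along with the many register schemes inside the NVDLA IP core, so I decided to write up this part first.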

1. External Interfaces

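We start from the architecture of the small NVDLA IP core and use it to introduce the external interfaces:
[Figure: architecture of the small NVDLA IP core]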

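| External interface | Function |
| --- | --- |
| `External interrupt (IRQ)` | Certain states in `NVDLA` must be reported asynchronously to the processor that is commanding `NVDLA`, including operation completion and error conditions. The external interrupt interface provides a single output pin that complements the `CSB` interface. |
| `Configuration space bus (CSB)` | The host system accesses and configures the `NVDLA` register set through a very simple address/data interface. Some systems can connect the host CPU directly to the `CSB` interface through a suitable bus bridge; other, potentially larger systems connect a small microcontroller to the `CSB` interface and offload some of the work of managing `NVDLA` to that external core. |
| `SRAM connection (SRAMIF)` | Some systems need more throughput and lower latency than system `DRAM` can provide, and may want to use a small `SRAM` as a cache to improve `NVDLA` performance. An `AXI4`-compliant secondary interface is provided for an optional `SRAM` attached to `NVDLA`. |
| `Data backbone (DBBIF)` | `NVDLA` contains its own `DMA` engine for loading and storing values (including parameters and datasets) from and to the rest of the system. The `data backbone` is an `AMBA AXI4`-compliant interface intended for accessing large amounts of relatively high-latency memory, such as system DRAM. |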

1.1 Configuration Space Bus (CSB) Interface

Here is how the official NVDLA documentation describes the CSB bus:

 1. The CPU uses the CSB (Configuration Space Bus) interface to access NVDLA registers.

 2. The CSB interface is intentionally extremely simple, and low-performance.
 
 3. The CSB bus consists of three channels: the request channel, the read data channel, and the write response channel.

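So the CSB module is used to access NVDLA's registers, and the interface is deliberately very simple and low-performance. Besides clock and reset, it has three channels: the request channel, the read data channel, and the write response channel.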

1.1.1 Request Channel

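A data transaction occurs on the request channel if and only if the valid signal (from the host) and the ready signal (from NVDLA) are both asserted in the same clock cycle. Every CSB request has a fixed 32-bit data size and a fixed 16-bit address size. CSB does not support burst requests.
[Figure: request channel signals]
The signal definitions can be found in NV_NVDLA_apb2csb.v. Here we look more closely at csb2nvdla_nposted: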
1、Posted write transactions are writes where the requester does not expect to and will not receive a write completion from receiver on write ack channel. The requester will not know if the write encounters an error.
In other words, a posted write is a write for which the requester neither expects nor receives a write completion from the receiver on the write ack channel, so it cannot tell whether the write hit an error.
2、Non-posted write transactions are writes where the requester expects to receive a write completion or write error on write ack channel from receiver.
In other words, a non-posted write is a write for which the requester expects the receiver to return either a write completion or a write error on the write ack channel.
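
The handshake rule and the posted/non-posted distinction can be sketched as a small behavioral model. This is only an illustration under stated assumptions: `CsbRequest` and `CsbTarget` are hypothetical names, the register map is a plain dictionary, and the address stored in `wr_complete` is kept purely for readability (the model does not claim the real completion signal carries an address).

```python
from dataclasses import dataclass, field


@dataclass
class CsbRequest:
    addr: int              # 16-bit register address
    write: bool            # True for a write, False for a read
    wdata: int = 0         # 32-bit write data (ignored for reads)
    nposted: bool = False  # True: requester expects a write completion


@dataclass
class CsbTarget:
    """Toy CSB target (hypothetical): a register map plus two response channels."""
    regs: dict = field(default_factory=dict)
    read_data: list = field(default_factory=list)    # read data channel
    wr_complete: list = field(default_factory=list)  # write response channel

    def cycle(self, valid: bool, ready: bool, req: CsbRequest) -> bool:
        """Advance one clock; return True if the request was accepted."""
        if not (valid and ready):
            return False              # no transfer unless both are asserted
        if not 0 <= req.addr < (1 << 16):
            raise ValueError("CSB addresses are 16 bits wide")
        if req.write:
            self.regs[req.addr] = req.wdata & 0xFFFFFFFF
            if req.nposted:           # only non-posted writes are acknowledged
                self.wr_complete.append(req.addr)
        else:
            self.read_data.append(self.regs.get(req.addr, 0))
        return True
```

A posted write updates the register but leaves `wr_complete` empty, while the same write with `nposted=True` also produces one completion entry.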

1.1.2 Read Data Channel and Write Response Channel

[Figure: read data channel and write response channel signals]

1.1.3 Interface Timing

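The official NVDLA documentation gives a timing diagram:
[Figure: CSB interface timing diagram]
Looking at csb2nvdla_pd.nposted: if this signal is high on a write request, nvdla2csb_wr_complete goes high two cycles later as the response. A read also takes two cycles to respond.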
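
The two-cycle response described above can be modeled as a fixed-depth delay line. This is a minimal sketch, assuming the latency is exactly two cycles as read off the timing diagram; the request labels (`"read"`, `"nposted_write"`, `"posted_write"`) and the `"rd_valid"` response label are hypothetical names chosen for the illustration.

```python
from collections import deque

RESPONSE_LATENCY = 2  # cycles, as read off the timing diagram discussed above


def simulate(requests, cycles):
    """requests maps an issue cycle to 'read', 'nposted_write' or 'posted_write'.

    Returns a dict mapping each cycle to the response seen in that cycle:
    'rd_valid' for reads, 'wr_complete' for non-posted writes.
    Posted writes produce no response at all.
    """
    pipe = deque([None] * RESPONSE_LATENCY)  # one slot per cycle of delay
    responses = {}
    for cycle in range(cycles):
        done = pipe.popleft()                # request issued LATENCY cycles ago
        if done == "read":
            responses[cycle] = "rd_valid"
        elif done == "nposted_write":
            responses[cycle] = "wr_complete"
        pipe.append(requests.get(cycle))     # accept this cycle's request, if any
    return responses
```

For example, a non-posted write issued in cycle 0 and a read issued in cycle 3 produce responses in cycles 2 and 5 respectively.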

1.2 External Interrupt (IRQ) Interface

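NVDLA provides an asynchronous (interrupt-driven) return channel for delivering event notifications to the CPU. The interrupt signal is level-driven: it is asserted high for as long as the NVDLA core has a pending interrupt. The interrupt signal is in the same clock domain as the CSB interface.
[Figure: IRQ interface signal]

For more background, see section "2. NVDLA Architecture Overview" of NVDLA Kernel-Mode Driver Code Notes, Part 3, which covers how the interrupt number is assigned on the ZCU102 and checks that it is consistent with the device tree.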
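
To make "level-driven" concrete, here is a minimal sketch of an interrupt line that stays high while any event is pending. The class name `IrqLine` and the per-event bitmask are assumptions made for the illustration; they are not taken from the NVDLA register definitions.

```python
class IrqLine:
    """Level-driven interrupt: the output is high while any event is pending."""

    def __init__(self):
        self.pending = 0  # hypothetical bitmask, one bit per event source

    def raise_event(self, bit: int) -> None:
        """Hardware side: mark an event (for example, a finished layer) pending."""
        self.pending |= 1 << bit

    def clear_event(self, bit: int) -> None:
        """CPU side: acknowledge one event."""
        self.pending &= ~(1 << bit)

    @property
    def level(self) -> bool:
        """The pin level: high until every pending event has been cleared."""
        return self.pending != 0
```

Unlike an edge-triggered line, clearing one of two pending events leaves `level` high, so the handler is re-entered until nothing is pending.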

1.3 System Interconnect: Data Backbone (DBBIF) Interface

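NVDLA has two main interfaces to the memory system: DBBIF (named core2dbb) and SRAMIF (named core2sram). DBBIF is intended to connect to an on-chip network that leads to system memory, while SRAMIF is intended to connect to an optional on-chip SRAM with lower memory latency and potentially higher throughput. Both DBBIF and SRAMIF are AXI4-compliant.
[Figure: DBBIF interface signals]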

1.4 On-Chip SRAM Interface (SRAMIF)

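As with DBBIF, SRAMIF is AXI4-compliant; the difference is that SRAMIF targets an optional on-chip SRAM with lower memory latency and potentially higher throughput, whereas DBBIF targets the on-chip network that leads to system memory.
The SRAMIF protocol is identical to the DBBIF protocol, but the signals are renamed with the prefix nvdla_core2sram_{aw,ar,w,b,r}_ for the aw, ar, w, b, and r channels respectively. There is no constraint on the return order of write acknowledgments between SRAMIF and DBBIF. For example, given two BDMA layers, layer0 and layer1, layer0 may write to DBBIF while layer1 writes to SRAMIF, and their write acknowledgments may come back in either order.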
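
The naming pattern and the unordered write acknowledgments can be sketched as follows. The helper names and the latency figures are hypothetical; in particular, applying the same pattern to `core2dbb` is an assumption made by analogy with the SRAMIF prefix given above.

```python
AXI_CHANNELS = ("aw", "ar", "w", "b", "r")


def channel_prefixes(interface):
    """Expand nvdla_<interface>_{aw,ar,w,b,r}_ into the five channel prefixes."""
    return [f"nvdla_{interface}_{ch}_" for ch in AXI_CHANNELS]


def ack_order(writes, latency):
    """Return (layer, interface) pairs in the order their write acks arrive.

    writes:  list of (issue_cycle, layer, interface) tuples
    latency: hypothetical write-ack latency in cycles for each interface
    """
    done = [(issue + latency[iface], layer, iface) for issue, layer, iface in writes]
    return [(layer, iface) for _, layer, iface in sorted(done)]
```

With illustrative latencies of 40 cycles for DBBIF and 5 cycles for SRAMIF, a layer1 write issued one cycle after a layer0 write is still acknowledged first, which is why software must not assume any ordering between the two interfaces.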

2. Register Interface Signals

This part covers the status registers, configuration registers, command registers, and profiling registers.

2.1 Duplicated (Ping-Pong) Registers for Launching Layers

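The traditional flow for programming hardware is: the CPU configures the registers, sets the "enable" bit, waits for the hardware to raise a "done" interrupt, and then configures a new set of registers to start the process again. With this approach the hardware sits idle between two consecutive hardware layers, which lowers system efficiency.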

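To hide the CPU's reprogramming latency, NVDLA introduces ping-pong register programming. Most NVDLA sub-units have two register groups: while the sub-unit executes using the configuration in the first group, the CPU can program the second group in the background and set the second group's "enable" bit when it is done. When the hardware finishes the layer described by the first group, it clears the first group's "enable" bit and, if the second group's "enable" bit is already set, switches to the second group. The second group then becomes the active group and the first group becomes the "shadow" group that the CPU writes in the background, and the roles keep alternating. This lets the hardware move from one active layer to the next smoothly, without wasting cycles waiting for CPU configuration.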
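
The benefit can be estimated with a back-of-the-envelope timing model. This is a sketch under simplifying assumptions: programming and execution times are given per layer in arbitrary time units, there are exactly two register groups, and interrupt handling costs nothing. The function names are hypothetical.

```python
def serial_time(prog, run):
    """Traditional flow: program, run, wait for done, then program again."""
    return sum(p + r for p, r in zip(prog, run))


def pingpong_time(prog, run):
    """Ping-pong flow: layer i+1 is programmed while layer i runs."""
    prog_done = 0        # time at which the CPU finishes programming a layer
    hw_free = 0          # time at which the hardware finishes the previous layer
    group_free = [0, 0]  # time at which each register group becomes idle again
    for i, (p, r) in enumerate(zip(prog, run)):
        g = i % 2                                   # groups alternate 0, 1, 0, ...
        start_prog = max(prog_done, group_free[g])  # group must be idle first
        prog_done = start_prog + p
        start_run = max(prog_done, hw_free)         # needs config and free hardware
        hw_free = start_run + r
        group_free[g] = hw_free
    return hw_free
```

With three layers that each take 2 units to program and 10 units to run, the serial flow needs 36 units while the ping-pong flow needs 32: only the first layer's programming time is exposed.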

Which units have duplicated registers? First, look at the stages of the NVDLA pipeline:

CDMA (convolution DMA)

CBUF (convolution buffer)

CSC (convolution sequence controller)

CMAC (convolution MAC array)

CACC (convolution accumulator)

SDP (single data processor)

SDP_RDMA (single data processor, read DMA)

PDP (planar data processor)

PDP_RDMA (planar data processor, read DMA)

CDP (channel data processor)

CDP_RDMA (channel data processor, read DMA)

BDMA (bridge DMA)

RUBIK (reshape engine)

The first five pipeline stages are part of the convolution core pipeline. All of the pipeline stages listed above, except CBUF and CMAC, use ping-pong buffering of duplicated registers.

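The implementation looks like this:
[Figure: ping-pong register file with two duplicated register groups and a single register group]
Two important concepts in this figure, producer and consumer, appear frequently in the code.
Quoting the official documentation: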

 In detail, each register file implementation has three register groups; the two ping-pong groups (duplicated register
 group 0, and group 1) share the same addresses, and the third register group is a dedicated non-shadowed group 
(shown above as the “single register group”). The PRODUCER register field in the POINTER register is used to 
select which of the ping-pong groups is to be accessed from the CSB interface; the CONSUMER register field 
indicates which register the datapath is sourcing from. By default, both pointers select group 0. Registers are 
named according to which register set they belong to; a register is in a duplicated register group if its name starts 
with D_, and otherwise, it is in the single register group.

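In detail, each register file consists of two ping-pong register groups that share the same addresses (the duplicated register groups) and one dedicated register group (the single register group); the two kinds are told apart by whether the register name starts with D_. The PRODUCER field selects which ping-pong group is accessed from the CSB interface, and the CONSUMER field indicates which group the datapath is sourcing from. The CONSUMER pointer is read-only: the CPU can check it to find out which ping-pong group the datapath has selected. The PRODUCER pointer is fully controlled by the CPU, which should set it to the group it intends to use before programming a hardware layer.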
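
The routing rule (D_ names go to the group chosen by PRODUCER, everything else goes to the single group, and the datapath reads from the CONSUMER group) can be sketched as a toy register file. `PingPongRegFile` is a hypothetical name, and for simplicity PRODUCER and CONSUMER are modeled as separate entries rather than as fields of one POINTER register.

```python
class PingPongRegFile:
    """Toy register file: two duplicated groups plus one single group."""

    def __init__(self):
        self.single = {"PRODUCER": 0, "CONSUMER": 0}  # both default to group 0
        self.groups = [{}, {}]                        # duplicated groups 0 and 1

    def _bank(self, name: str) -> dict:
        # D_-prefixed names live in the duplicated group chosen by PRODUCER;
        # every other name lives in the single register group.
        if name.startswith("D_"):
            return self.groups[self.single["PRODUCER"]]
        return self.single

    def csb_write(self, name: str, value: int) -> None:
        if name == "CONSUMER":
            raise PermissionError("CONSUMER is read-only from the CPU side")
        if name == "PRODUCER" and value not in (0, 1):
            raise ValueError("PRODUCER must select group 0 or group 1")
        self._bank(name)[name] = value

    def csb_read(self, name: str) -> int:
        return self._bank(name).get(name, 0)

    def datapath_read(self, name: str) -> int:
        """What the hardware datapath sees: always the CONSUMER group."""
        return self.groups[self.single["CONSUMER"]].get(name, 0)
```

After the CPU switches PRODUCER to 1, a CSB read of the same D_ register returns group 1's value while `datapath_read` still returns group 0's value, which is exactly what lets the CPU program the next layer without disturbing the running one.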

2.2 An Example of Using the Ping-Pong Registers

The following walks through programming an NVDLA sub-unit's registers in ping-pong mode, using CDMA as the example.


 1. After reset, both group 0 and group 1 are in an idle state. The CPU should read the CDMA_POINTER register, and set PRODUCER to the value of CONSUMER. (After reset, CONSUMER is expected to be 0.)
 # After reset both register groups are idle; the CPU reads CDMA_POINTER and sets PRODUCER to the value of CONSUMER.
 
 2. The CPU programs the parameters for the first hardware layer into register group 0. After configuration completes, the CPU sets the enable bit in the D_OP_ENABLE register.
 # The CPU configures register group 0 and, once done, sets the enable bit in D_OP_ENABLE.
 
 3. Hardware begins processing the first hardware layer.
 # The hardware processes the first layer.

 4. The CPU reads the S_STATUS register to ensure that register group 1 is idle.
 # The CPU reads the S_STATUS register to confirm that register group 1 is idle.

 5. The CPU sets PRODUCER to 1 and begins programming the parameters for the second hardware layer into group 1. After those registers are programmed, it sets the enable bit in group 1’s D_OP_ENABLE.
 # The CPU sets PRODUCER to 1, configures register group 1 and, once done, sets the enable bit in that group's D_OP_ENABLE.

 6. The CPU checks the status of the register group 0 by reading S_STATUS; if it is still executing, the CPU waits for an interrupt.
 # The CPU then reads S_STATUS to check register group 0; if it is still executing, the CPU waits for the interrupt raised when group 0's layer completes.

 7. Hardware finishes the processing of the current hardware layer. Upon doing so, it sets the status of the previously active group to idle in the S_STATUS register, and clears the enable bit of the D_OP_ENABLE register.
 # The hardware finishes the current layer, marks the previously active group idle in S_STATUS, and clears that group's D_OP_ENABLE bit.

 8. Hardware advances the CONSUMER field to the next register group (in this case, group 1). After advancing the CONSUMER field, it determines whether the enable bit is set on the new group. If so, it begins processing the next hardware layer immediately; if not, hardware waits until the enable bit is set.
 # The hardware advances CONSUMER to the next group (group 1 here) and checks that group's enable bit: if set, the next layer starts immediately; otherwise the hardware waits until it is set.

 9. Hardware asserts the “done” interrupt for the previous hardware layer. If the CPU was blocked waiting for a “done” interrupt, it now proceeds programming, as above.
 # The hardware asserts the "done" interrupt for the previous layer; a CPU that was blocked waiting for it now continues programming.

 10. Repeat, as needed.
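
The steps above can be replayed on a small behavioral model. This is a sketch under stated assumptions, not driver code: `SubUnit` is a hypothetical class, a group counts as idle when its enable bit is clear (a stand-in for reading S_STATUS), and the "done" interrupt is represented by a log entry.

```python
class SubUnit:
    """Toy model of one NVDLA sub-unit with ping-pong register groups."""

    def __init__(self):
        self.producer = 0           # group the CPU writes through CSB
        self.consumer = 0           # group the datapath reads from
        self.enable = [0, 0]        # D_OP_ENABLE bit of each group
        self.params = [None, None]  # layer parameters held by each group
        self.running = False
        self.log = []

    # CPU side
    def group_idle(self, g):
        """Stand-in for checking a group's state in S_STATUS."""
        return self.enable[g] == 0

    def program(self, layer):
        """Program `layer` into the PRODUCER group, then set its enable bit."""
        g = self.producer
        if not self.group_idle(g):
            raise RuntimeError("target register group is still in use")
        self.params[g] = layer
        self.enable[g] = 1
        self._try_start()

    # Hardware side
    def _try_start(self):
        if not self.running and self.enable[self.consumer]:
            self.running = True
            self.log.append(("start", self.params[self.consumer], self.consumer))

    def finish_layer(self):
        """Hardware completes the active layer (steps 7 to 9)."""
        g = self.consumer
        done = self.params[g]
        self.enable[g] = 0                      # step 7: group idle, enable cleared
        self.running = False
        self.consumer ^= 1                      # step 8: advance CONSUMER
        self._try_start()                       # start at once if already enabled
        self.log.append(("irq_done", done, g))  # step 9: "done" interrupt
        return done


def run_two_layers():
    """Replay steps 1 to 9 for two layers and return the event log."""
    u = SubUnit()
    u.producer = u.consumer      # step 1
    u.program("layer0")          # steps 2 and 3
    assert u.group_idle(1)       # step 4
    u.producer = 1               # step 5
    u.program("layer1")
    assert not u.group_idle(0)   # step 6: still executing, wait for the interrupt
    u.finish_layer()             # steps 7 to 9
    return u.log
```

In the returned log, the start of layer1 on group 1 appears before the "done" entry for layer0, mirroring the order of steps 8 and 9: the hardware moves on to the pre-programmed group before notifying the CPU.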