NCHW - NHWC - CHWN Data Layout

In deep learning, data formats such as NCHW, NHWC, and CHWN are used to improve data transfer bandwidth and compute performance. They are logical layouts for data such as images and feature maps, and can be understood simply as the order in which the data is stored in memory.

Baidu's inference device EdgeBoard uses the NHWC data format. EdgeBoard is an embedded AI solution developed by Baidu based on FPGA chips, and is available both as an FPGA soft core and as a compute-card module.

1. Logical and physical representation of data formats

Deep learning frequently uses the NCHW and NHWC data formats to represent data, where N, H, W, and C are defined as follows:

N: number of images in a batch, i.e., the number of images processed at a time.
H: number of pixels in the vertical (height) direction, the height of the image.
W: number of pixels in the horizontal (width) direction, the width of the image.
C: number of channels, e.g., 1 for a grayscale image and 3 for a color RGB image.

When N = 1, NHWC reduces to HWC, i.e., HW * C.
When N = 1, NCHW reduces to CHW, i.e., C * HW.

The figure below shows the layout for N = 2, C = 16, H = 5, W = 4, where the left side is the logical representation and the right side the physical representation.
[Figure: logical (left) and physical (right) layouts for N = 2, C = 16, H = 5, W = 4]

1.1 NCHW

Take NCHW as an example; its logical representation is shown on the left of the figure. For n = 0, the three axes mark the C, H, and W directions. The first element is 000. The second element follows the W direction, i.e., 001, then 002, 003; traversal then moves along the H direction, i.e., 004, 005, 006, 007, and so on up to 019; then along the C direction, 020, 021, 022, … up to 319; and finally along the N direction, i.e., n = 1, where the W, H, and C traversal repeats.

Given this NCHW traversal, the physical address representation is defined as follows (shown on the right of the figure):

[a:0] W direction: left to right within a row.

[a:1] H direction: row by row, from top to bottom.

[a:2] C direction: from one channel to the next.

[a:3] N direction: from n = 0 to n = 1.

The resulting physical layout of the NCHW format (its one-dimensional order in memory) is 000 001 002 003 004 … 018 019 020 … 318 319 320 … 637 638 639. In other words, all pixels of one channel are laid out row by row, then the next channel follows, and n = 1 is laid out after all of n = 0.
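This ordering is easy to verify with a minimal NumPy sketch (an illustration added here, not part of the original post): assign the element ids 000..639 in NCHW logical order, as in the figure, and flatten.

```python
import numpy as np

N, C, H, W = 2, 16, 5, 4
# Element ids 000..639 assigned in NCHW logical order, as in the figure.
x = np.arange(N * C * H * W).reshape(N, C, H, W)

# NCHW is already the memory order, so flattening yields the identity sequence.
print(x.flatten()[:5])    # [0 1 2 3 4]
print(x.flatten()[-3:])   # [637 638 639]
```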

1.2 NHWC

Similarly, NHWC traverses first along the C direction, then the W direction, then the H direction, and finally the N direction. The memory order is therefore: the first element is 000; the second follows the C direction, i.e., 020, 040, 060, … up to 300; then it switches to the W direction: 001, 021, 041, 061, …, 301, and so on until 303; then to the H direction: 004, 024, …, 304, eventually reaching 319; then to the N direction: 320, 340, … up to 639.

[b:0] C direction: the first pixel steps from one channel to the next.

[b:1] W direction: from the first pixel of the last channel back to the second pixel of the first channel.

[b:2] H direction: from the last pixel in the first row of the last channel back to the first pixel in the second row of the first channel.

[b:3] N direction: from n = 0 to n = 1.

The physical layout of NHWC is 000 020 … 300 001 021 … 283 303 004 … 319 320 340 … 339 359 … 639. In other words, all channels of one pixel are laid out first, then the next pixel, and n = 1 follows after all of n = 0.
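The same NumPy sketch (again, an added illustration) reproduces this sequence by reordering the axes to NHWC before flattening:

```python
import numpy as np

N, C, H, W = 2, 16, 5, 4
x = np.arange(N * C * H * W).reshape(N, C, H, W)  # ids in NCHW logical order

# Reorder the axes to NHWC; flattening then gives the sequence above.
nhwc = x.transpose(0, 2, 3, 1)
print(nhwc.flatten()[:5])   # [ 0 20 40 60 80]
print(nhwc.flatten()[-3:])  # [599 619 639]
```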

1.3 CHWN

Similarly, the logical representation of CHWN traverses first along the N direction, then the W direction, then the H direction, and finally the C direction.

[c:0] N direction: from the first pixel of n = 0 to the first pixel of n = 1.

[c:1] W direction: from the first pixel of n = 1 back to the second pixel of n = 0.

[c:2] H direction: from the last pixel in the first row of n = 1 back to the first pixel in the second row of n = 0.

[c:3] C direction: from the last pixel of the first channel of n = 1 back to the first pixel of the second channel of n = 0.

The physical layout of CHWN is 000 320 001 321 … 003 323 004 324 … 019 339 020 … . The first pixel of the first channel of each of the N images in the batch is laid out first, then the second pixel, and so on; then the second channel, the third channel, and so forth.
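And the CHWN order (again, an added illustration) follows from putting the batch axis innermost:

```python
import numpy as np

N, C, H, W = 2, 16, 5, 4
x = np.arange(N * C * H * W).reshape(N, C, H, W)  # ids in NCHW logical order

# In CHWN the batch index varies fastest, matching 000 320 001 321 ...
chwn = x.transpose(1, 2, 3, 0)
print(chwn.flatten()[:6])   # [  0 320   1 321   2 322]
```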

2. Offset address of data in memory

[Figure: locating element 341 at (n = 1, c = 1, h = 0, w = 1) in the NCHW layout]
Deep learning involves a large amount of computation, and the data to compute on must be fetched from memory, so the offset address of each element has to be calculated. With the logical and physical representations above, we can derive the formulas that map a 4-dimensional logical position (N, C, H, W) onto a one-dimensional memory offset.

Let position (n, c, h, w) denote column w of row h of channel c of batch n. The memory offset of this position under each data format is computed as follows:

NCHW: offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w 
NHWC: offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c 
CHWN: offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n

Here N, C, H, and W are constants, while n, c, h, and w are variables.

In NCHW, CHW = C * H * W is one batch element, which can be pictured as one BGR 3-channel image: a cube. HW = H * W is one plane, i.e., one channel of the BGR image (a grayscale image is a single-channel image). W is one row within a channel.

Suppose we want to locate the green circle, element 341 at position (n = 1, c = 1, h = 0, w = 1). We first skip all of the n = 0 data (CHW elements), the blue boxed region pointed to by arrow 1 in the figure; then skip the first channel of n = 1 (HW elements), the blue boxed region pointed to by arrow 2. Now inside the second channel of n = 1, we skip h = 0 rows (0 * W) and finally skip w elements to arrive at the offset.
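The three formulas translate directly into code; a minimal sketch (added here for illustration) checks the worked example:

```python
N, C, H, W = 2, 16, 5, 4

def offset_nchw(n, c, h, w):
    return n * C * H * W + c * H * W + h * W + w

def offset_nhwc(n, c, h, w):
    return n * H * W * C + h * W * C + w * C + c

def offset_chwn(n, c, h, w):
    return c * H * W * N + h * W * N + w * N + n

# Element 341 at (n = 1, c = 1, h = 0, w = 1): 320 + 20 + 0 + 1 = 341.
assert offset_nchw(1, 1, 0, 1) == 341
```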

3. NCHW - NHWC

input_batch = 1, input_height = 5, input_width = 5, input_channel = 3
kernel_batch = output_channel = 2, kernel_height = 3, kernel_width = 3, kernel_channel = input_channel = 3
output_batch = 1, output_height = 3, output_width = 3, output_channel = 2

[Figure: convolving a 5 x 5 x 3 input with two 3 x 3 x 3 kernels to produce a 3 x 3 x 2 output]

The figure above shows the convolution computation. A convolution multiplies all channels within the same window position by the kernel parameters and accumulates the products, which can be evaluated in either of the following two orders:

3.1 Channel before pixel

Channel before pixel: multiply all channels of one pixel by the kernel parameters and accumulate, then move on to the next pixel, until the multiply-accumulate over the whole kernel window is complete. For example, the computation for the first sliding window is:

Filter * Input Volume
(w 0,0,0) * (x 0,0,0) + (w 1,0,0) * (x 1,0,0) + (w 2,0,0) * (x 2,0,0) + 
(w 0,0,1) * (x 0,0,1) + (w 1,0,1) * (x 1,0,1) + (w 2,0,1) * (x 2,0,1) + 
(w 0,0,2) * (x 0,0,2) + (w 1,0,2) * (x 1,0,2) + (w 2,0,2) * (x 2,0,2) + 

(w 0,1,0) * (x 0,1,0) + (w 1,1,0) * (x 1,1,0) + (w 2,1,0) * (x 2,1,0) + 
(w 0,1,1) * (x 0,1,1) + (w 1,1,1) * (x 1,1,1) + (w 2,1,1) * (x 2,1,1) +
(w 0,1,2) * (x 0,1,2) + (w 1,1,2) * (x 1,1,2) + (w 2,1,2) * (x 2,1,2) +

(w 0,2,0) * (x 0,2,0) + (w 1,2,0) * (x 1,2,0) + (w 2,2,0) * (x 2,2,0) + 
(w 0,2,1) * (x 0,2,1) + (w 1,2,1) * (x 1,2,1) + (w 2,2,1) * (x 2,2,1) + 
(w 0,2,2) * (x 0,2,2) + (w 1,2,2) * (x 1,2,2) + (w 2,2,2) * (x 2,2,2) 

= 

Input Volume * Filter
0 * -1 + 0 * -1 + 0 * 0 +  => (0)
0 * 1 + 0 * -1 + 0 * 0 +  => (0)
0 * 0 + 0 * 0 + 0 * -1 +  => (0)

0 * 0 + 0 * 0 + 0 * 0 +  => (0)
0 * 1 + 1 * 0 + 2 * 1 +  => (2)
1 * 0 + 0 * 0 + 1 * 0 +  => (0)

0 * 0 + 0 * 0 + 0 * 1 +  => (0)
2 * 1 + 0 * -1 + 1 * -1 +  => (1)
2 * 1 + 0 * 0 + 0 * -1 +  => (2)

= 5

5 + 1 (bias) = 6

3.2 Pixel before channel

Pixel before channel: multiply one channel's sliding window by the kernel parameters and accumulate, then move on to the next channel, until all channels have been multiplied and accumulated. For example, the computation for the first sliding window is:

Filter * Input Volume
(w 0,0,0) * (x 0,0,0) + (w 0,0,1) * (x 0,0,1) + (w 0,0,2) * (x 0,0,2) + 
(w 0,1,0) * (x 0,1,0) + (w 0,1,1) * (x 0,1,1) + (w 0,1,2) * (x 0,1,2) + 
(w 0,2,0) * (x 0,2,0) + (w 0,2,1) * (x 0,2,1) + (w 0,2,2) * (x 0,2,2) + 

(w 1,0,0) * (x 1,0,0) + (w 1,0,1) * (x 1,0,1) + (w 1,0,2) * (x 1,0,2) + 
(w 1,1,0) * (x 1,1,0) + (w 1,1,1) * (x 1,1,1) + (w 1,1,2) * (x 1,1,2) + 
(w 1,2,0) * (x 1,2,0) + (w 1,2,1) * (x 1,2,1) + (w 1,2,2) * (x 1,2,2) + 

(w 2,0,0) * (x 2,0,0) + (w 2,0,1) * (x 2,0,1) + (w 2,0,2) * (x 2,0,2) + 
(w 2,1,0) * (x 2,1,0) + (w 2,1,1) * (x 2,1,1) + (w 2,1,2) * (x 2,1,2) + 
(w 2,2,0) * (x 2,2,0) + (w 2,2,1) * (x 2,2,1) + (w 2,2,2) * (x 2,2,2) 

= 

Input Volume * Filter
0 * -1 + 0 * 1 + 0 * 0 +  => (0)
0 * 0 + 0 * 1 + 1 * 0 +  => (0)
0 * 0 + 2 * 1 + 2 * 1 +  => (4)

0 * -1 + 0 * -1 + 0 * 0 +  => (0)
0 * 0 + 1 * 0 + 0 * 0 +  => (0)
0 * 0 + 0 * -1 + 0 * 0 +  => (0)

0 * 0 + 0 * 0 + 0 * -1 +  => (0)
0 * 0 + 2 * 1 + 1 * 0 +  => (2)
0 * 1 + 1 * -1 + 0 * -1  => (-1)

= 5

5 + 1 (bias) = 6

Both orders produce the same result.
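This is just a reordering of the same multiply-accumulate terms, as a small sketch with random data (not the figure's values) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 3, 3
window = rng.integers(0, 3, size=(C, K, K))   # one 3 x 3 input window, 3 channels
kernel = rng.integers(-1, 2, size=(C, K, K))  # one 3 x 3 x 3 filter

# Channel before pixel: innermost loop over channels.
channel_first = sum(window[c, h, w] * kernel[c, h, w]
                    for h in range(K) for w in range(K) for c in range(C))
# Pixel before channel: innermost loops over the spatial window.
pixel_first = sum(window[c, h, w] * kernel[c, h, w]
                  for c in range(C) for h in range(K) for w in range(K))
assert channel_first == pixel_first
```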

For the NHWC format, i.e., channel before pixel, all channel data of one pixel is stored together. The 3 channel values of the first pixel, the 3 of the second pixel, and the 3 of the third pixel in the figure are all contiguous in memory, so the feature-map data needed by the first row of the kernel (3 x 3 values) can be fetched in a single read; the feature-map data for a 1 x 3 x 3 x 3 kernel takes 3 reads.

1 x 3 x 3 x 3
kernel_batch = output_channel = 1, kernel_height = 3, kernel_width = 3, kernel_channel = input_channel = 3

For the NCHW format, i.e., pixel before channel, all pixels of one channel are stored consecutively. For a 3 x 3 kernel, after reading 3 values you must skip input_width + 2 * pad_width values before reading the next 3. One channel takes 3 reads, so 3 channels take 9 reads.

In real networks the number of input channels is usually far larger than the kernel width (not just the 3 channels of the figure, but typically dozens or hundreds), so the NHWC format needs far fewer reads than NCHW.
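A small sketch (an added illustration; unpadded addresses assumed) counts the contiguous runs, i.e., burst fetches, touched by one 3 x 3 window in each layout:

```python
H, W, C, K = 5, 5, 3, 3  # input height/width/channels, kernel size

def contiguous_runs(offsets):
    # Each maximal run of consecutive addresses can be fetched in one burst.
    runs = 1
    for a, b in zip(offsets, offsets[1:]):
        if b != a + 1:
            runs += 1
    return runs

# Memory offsets read by the 3 x 3 window at (h = 0, w = 0), all channels.
nhwc = [h * W * C + w * C + c for h in range(K) for w in range(K) for c in range(C)]
nchw = [c * H * W + h * W + w for c in range(C) for h in range(K) for w in range(K)]

print(contiguous_runs(nhwc))  # 3 -> one fetch per kernel row
print(contiguous_runs(nchw))  # 9 -> one fetch per row per channel
```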

To broaden the range of networks the hardware supports and loosen the limits on large input sizes and weight-heavy networks, the NHWC format allows feature-map and weight data to be read into the FPGA on-chip cache in batches. For a 3 x 3 kernel, for example, reading only three rows (3 x kernel_width x kernel_channel) of feature-map data into the FPGA is enough to compute one row of output, which can then be written out to the large off-chip DDR. Each batch's input and output transfers thus complete without depending on the next 3 x kernel_width x kernel_channel of feature-map input.
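A minimal software model of that dataflow (an added sketch with assumed shapes, not EdgeBoard's actual pipeline) streams an NHWC feature map through a 3-row window, producing one output row per step:

```python
import numpy as np

H, W, C, K = 5, 5, 3, 3                  # input height/width/channels, kernel size
fmap = np.random.rand(H, W, C)           # NHWC feature map with N = 1
weights = np.random.rand(K, K, C)        # one 3 x 3 x 3 filter (one output channel)

out_rows = []
for top in range(H - K + 1):
    rows = fmap[top:top + K]             # only 3 x W x C values resident on chip
    row_out = [np.sum(rows[:, left:left + K] * weights)
               for left in range(W - K + 1)]
    out_rows.append(row_out)             # one finished output row, ready for DDR
output = np.array(out_rows)              # shape (3, 3) for these sizes
```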

