Next, let's implement the computation of a convolution (Conv) layer.
To keep things simple, I use an input with N=1, C=2, H=W=5, and a convolution kernel with KH=3, KW=3, IC=OC=2.
const int N = 1, H = 5, W = 5, C = 2;
const int IC = C, OC = IC, KH = 3, KW = 3;
With such a small problem we can compute the convolution by hand and compare against mkldnn's output to check that our code runs correctly.
First, create the input data buffer plus the weights and bias arrays:
// Allocate a buffer for the image
std::vector<float> image(image_size);
std::vector<float> weights(weights_size);
std::vector<float> bias(bias_size);
The image array is filled with each element's own memory offset. The weights are all 1 for OC=0 and all 2 for OC=1; the bias is 0.01 for OC=0 and 0.02 for OC=1. Since convolution computes Y = W*X + B, we only need to look at the fractional part of each output value to tell which numbers belong to output channel C=0 and which to C=1.
// Initialize the image with its own physical offsets
for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
        for (int w = 0; w < W; ++w)
            for (int c = 0; c < C; ++c) {
                int off = offset(n, h, w, c); // physical offset of this pixel
                image[off] = off;
            }
// Initialize the weights: all 1s for output channel 0, all 2s for output channel 1
for (int n = 0; n < OC; ++n)
    for (int c = 0; c < IC; ++c)
        for (int h = 0; h < KH; ++h)
            for (int w = 0; w < KW; ++w) {
                int off = offset_ws(n, c, h, w); // physical offset of this weight
                weights[off] = (n == 0) ? 1 : 2;
            }
// Initialize the bias: 0.01 for output channel 0, 0.02 for output channel 1
for (int n = 0; n < OC; ++n)
    bias[n] = 0.01f + 0.01f * n;
Now for the implementation and flow of the convolution layer itself:
memory::dims conv3_src_tz = { N, C, H, W };
memory::dims conv3_weights_tz = { OC, IC, KH, KW };
memory::dims conv3_bias_tz = { OC };
memory::dims conv3_dst_tz = { N, OC, H, W };
memory::dims conv3_strides = { 1, 1 };
memory::dims conv3_padding = { 1, 1 };
// Source memory descriptor: NHWC layout
auto src3_md = memory::desc(
        conv3_src_tz,            // logical dims, the order is defined by a primitive
        memory::data_type::f32,  // tensor's data type
        memory::format_tag::nhwc // memory format, NHWC in this case
);
// Weights memory descriptor: OIHW layout
auto conv3_weights_md = memory::desc(
        conv3_weights_tz, memory::data_type::f32,
        memory::format_tag::oihw
);
auto conv3_bias_md = memory::desc(conv3_bias_tz, memory::data_type::f32, memory::format_tag::x);
// Destination memory descriptor: NHWC layout
auto dst3_md = memory::desc(
        conv3_dst_tz,            // logical dims, the order is defined by a primitive
        memory::data_type::f32,  // tensor's data type
        memory::format_tag::nhwc // memory format, NHWC in this case
);
auto user_src_mem = memory(src3_md, cpu_engine, image.data());
auto user_conv3_weights_mem = memory(conv3_weights_md, cpu_engine, weights.data());
auto user_conv3_bias_mem = memory(conv3_bias_md, cpu_engine, bias.data());
// For dst_mem the library allocates buffer
auto user_conv3_dst_mem = memory(dst3_md, cpu_engine); //for conv output
auto conv3_d = convolution_forward::desc(prop_kind::forward_inference,
algorithm::convolution_direct, src3_md, conv3_weights_md,
conv3_bias_md,
dst3_md, conv3_strides, conv3_padding,
conv3_padding);
auto conv3_pd = convolution_forward::primitive_desc(conv3_d, cpu_engine);
// create convolution primitive and add it to net
auto conv3 = convolution_forward(conv3_pd);
conv3.execute(cpu_stream, {
        { MKLDNN_ARG_SRC, user_src_mem },
        { MKLDNN_ARG_WEIGHTS, user_conv3_weights_mem },
        { MKLDNN_ARG_BIAS, user_conv3_bias_mem },
        { MKLDNN_ARG_DST, user_conv3_dst_mem }
});
// Wait for the stream to finish executing
cpu_stream.wait();
Finally, let's inspect the contents of the arrays.
1. Input array (each pair before a "|" is one pixel: channel C=0 then C=1)
0 1 | 2 3 | 4 5 | 6 7 | 8 9 |
10 11 | 12 13 | 14 15 | 16 17 | 18 19 |
20 21 | 22 23 | 24 25 | 26 27 | 28 29 |
30 31 | 32 33 | 34 35 | 36 37 | 38 39 |
40 41 | 42 43 | 44 45 | 46 47 | 48 49 |
2. Weights array (the first six rows are OC=0, the last six are OC=1)
1 1 1 |
1 1 1 |
1 1 1 |
1 1 1 |
1 1 1 |
1 1 1 |
2 2 2 |
2 2 2 |
2 2 2 |
2 2 2 |
2 2 2 |
2 2 2 |
3. Output array
52.01 104.02 |90.01 180.02 |114.01 228.02 |138.01 276.02 |100.01 200.02 |
138.01 276.02 |225.01 450.02 |261.01 522.02 |297.01 594.02 |210.01 420.02 |
258.01 516.02 |405.01 810.02 |441.01 882.02 |477.01 954.02 |330.01 660.02 |
378.01 756.02 |585.01 1170.02 |621.01 1242.02 |657.01 1314.02 |450.01 900.02 |
292.01 584.02 |450.01 900.02 |474.01 948.02 |498.01 996.02 |340.01 680.02 |
As you can see, the memory described by src_md and dst_md is laid out as NHWC, and the fractional parts of the final output confirm that the two channels are interleaved per pixel. Let's verify one value by hand.
Given the contents of src_mem, the top-left (0,0) element of dst_mem is computed as follows. With padding 1, the 3x3 window centered at (0,0) covers only the four valid pixels (0,0), (0,1), (1,0) and (1,1), i.e. the eight NHWC values 0, 1, 2, 3, 10, 11, 12, 13:
For C=0: 0 + 2 + 10 + 12 + 1 + 3 + 11 + 13 + 0.01 = 52.01
For C=1: (0 + 2 + 10 + 12 + 1 + 3 + 11 + 13) * 2 + 0.02 = 104.02
This matches the printed output.
Looking at the mkldnn library's verbose output in the console, this Conv computation took 0.526 ms.
Done. That's a wrap!
The full code is here, for reference only:
https://github.com/tisandman555/mkldnn_study/blob/master/conv.cpp