Today we are talking about YOLOv3, a great way to detect objects by extracting their features at several scales. It has a lot of complex parts, and each of them has plenty of details.
Let's begin with the Feature Pyramid Network.
The core idea of YOLOv3 is to treat object detection as a regression problem solved by a single CNN, which makes the model run faster.
At first sight the network may look uncomfortable because of its many convolutional and residual layers. In fact, it just does the same job over and over again: it divides your picture into a grid of cells, and each cell looks after the objects whose centers fall inside it and predicts the boxes that fit them best.
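To make the grid idea concrete, here is a minimal sketch of how an object's center could be mapped to the cell that is responsible for it. The function name `responsible_cell`, the 416-pixel input size and the 13-cell grid are illustrative assumptions, not code from the original network.

```python
def responsible_cell(cx, cy, image_size=416, grid_size=13):
    """Return (row, col) of the grid cell containing the box center (cx, cy).

    cx and cy are pixel coordinates; image_size and grid_size are assumed values.
    """
    cell = image_size / grid_size               # width of one cell in pixels
    col = min(int(cx // cell), grid_size - 1)   # clamp centers on the right edge
    row = min(int(cy // cell), grid_size - 1)   # clamp centers on the bottom edge
    return row, col


# A box centered at pixel (200, 100) in a 416*416 image
print(responsible_cell(200, 100))
```

With a finer grid (for example `grid_size=52`) the same center lands in a much smaller cell, which is why finer grids suit smaller objects.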
Here is some Python code for the residual block used to build this CNN. The network simply repeats it several times.
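from keras.layers import Add, ZeroPadding2D

# DarknetConv2D_BN_Leaky is assumed to be a helper defined elsewhere
# (convolution + batch normalization + LeakyReLU, as its name suggests).
def resblock_body(x, num_filters, num_blocks):
    # Pad the top and left edges only, then halve the size with a stride-2 convolution.
    x = ZeroPadding2D(((1,0),(1,0)))(x)
    x = DarknetConv2D_BN_Leaky(num_filters, (3,3), strides=(2,2))(x)
    for i in range(num_blocks):
        # Residual unit: 1x1 bottleneck, 3x3 convolution, then the skip connection.
        y = DarknetConv2D_BN_Leaky(num_filters//2, (1,1))(x)
        y = DarknetConv2D_BN_Leaky(num_filters, (3,3))(y)
        x = Add()([x,y])
    return x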
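The downsampling step at the top of the block can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `downsampled_size`, and assumes the strided convolution applies no padding of its own ("valid" mode) after the explicit one-pixel top and left zero padding. Under that assumption, each block halves an even input size.

```python
def downsampled_size(n, kernel=3, stride=2, pad_before=1, pad_after=0):
    """Spatial size after zero padding followed by an unpadded ('valid') convolution."""
    padded = n + pad_before + pad_after
    return (padded - kernel) // stride + 1


for n in (416, 208, 104, 52, 26):
    print(n, "->", downsampled_size(n))
```

Starting from 416 this gives 208, 104, 52, 26 and 13, which matches the sizes used for the three feature layers.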
Most importantly, YOLOv3 stays fast while handling objects of very different sizes, thanks to a structure based on the feature pyramid network (FPN). For instance, imagine a picture of a forest. It is easy to find a tree in it, because the tree occupies a large area. But finding one specific leaf, which occupies only a few pixels, is much harder. As the network gets deeper, repeated convolution and downsampling (stride-2 convolutions in this network) shrink the feature maps, so larger objects stand out more and more compared with smaller ones.
To solve this problem, "upsampling" is coming! The levels of the pyramid have different sizes, but once a smaller feature map is upsampled, it becomes as large as the previous level and can be concatenated with it, so that features from both levels are combined.
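The upsample-then-concatenate step can be sketched without any deep learning library. The example below is a toy illustration on nested lists, with made-up channel counts and hypothetical helper names (`upsample2x`, `concat_channels`, `shape`); in Keras the same idea is usually expressed with the `UpSampling2D` and `Concatenate` layers.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of an H x W x C nested list."""
    out = []
    for row in fmap:
        wide = [list(pixel) for pixel in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append([list(pixel) for pixel in wide])              # repeat each row
    return out


def concat_channels(a, b):
    """Concatenate two feature maps of equal H x W along the channel axis."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "spatial sizes must match"
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def shape(fmap):
    """Return (height, width, channels) of a nested-list feature map."""
    return len(fmap), len(fmap[0]), len(fmap[0][0])


deep = [[[1.0, 2.0] for _ in range(13)] for _ in range(13)]  # 13*13*2, toy channels
shallow = [[[0.5] for _ in range(26)] for _ in range(26)]    # 26*26*1, toy channels

merged = concat_channels(upsample2x(deep), shallow)
print(shape(merged))
```

Only the spatial size has to match before concatenation; the channel counts simply add up.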
In addition, YOLOv3 uses only three feature layers, which divide the picture into 13*13, 26*26 and 52*52 grid cells. To make full use of these three layers, "upsampling" is the best way to build a bridge between layers of different sizes. Based on that, it is easy to build your own pyramid and reuse each layer efficiently.
Well, it is not hard to build this feature pyramid, because only a few lines of code are needed. Let's pay more attention to the important ones!
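# The function name darknet_body is an assumption; the original snippet omits the def line.
def darknet_body(x):
    x = DarknetConv2D_BN_Leaky(32, (3,3))(x)  # first convolution block: 416*416*3 -> 416*416*32 (spatial size unchanged)
    x = resblock_body(x, 64, 1)     # 208*208*64, repeated 1 time
    x = resblock_body(x, 128, 2)    # 104*104*128, repeated 2 times
    x = resblock_body(x, 256, 8)    # 52*52*256, repeated 8 times
    feat1 = x
    x = resblock_body(x, 512, 8)    # 26*26*512, repeated 8 times
    feat2 = x
    x = resblock_body(x, 1024, 4)   # 13*13*1024, repeated 4 times
    feat3 = x
    return feat1,feat2,feat3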
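The shapes in the comments above can be reproduced with a short trace. This is only a bookkeeping sketch, assuming a 416*416 input and that every `resblock_body` call halves the spatial size. `STAGES` and `trace_shapes` are hypothetical names introduced for the illustration.

```python
# (filters, num_blocks) for each resblock_body call, taken from the code above
STAGES = [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]


def trace_shapes(input_size=416):
    """Return (height, width, channels) after the first conv and after each stage."""
    size = input_size
    shapes = [(size, size, 32)]      # the first 3x3 convolution keeps the spatial size
    for filters, _num_blocks in STAGES:
        size //= 2                   # each stage starts with a stride-2 convolution
        shapes.append((size, size, filters))
    return shapes


shapes = trace_shapes()
feat1, feat2, feat3 = shapes[-3:]    # the three feature layers returned by the backbone
print(feat1, feat2, feat3)
```

The number of residual units per stage does not change the output shape; it only makes that stage deeper.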
Haha, these three feature layers help us find the most important points in the picture at every scale!
In conclusion, Darknet-53 is well known for its simple structure as well as its strong results in object detection. Based on this, it is easy for anyone to build their own CNN in less than 30 minutes.
That’s all, thank you for watching~