FCN summary

Jonathan Long, Evan Shelhamer, and Trevor Darrell's "Fully Convolutional Networks for Semantic Segmentation" is a groundbreaking paper in the field of computer vision, introducing a new method for semantic segmentation using fully convolutional networks (FCNs).

The authors first point out that traditional convolutional neural networks (CNNs) are designed for image-level classification and regression tasks, where the goal is to assign a single label to the entire image. Such networks typically attach several fully connected layers after the convolutional layers, mapping the feature maps produced by the convolutions into a fixed-length one-dimensional feature vector. For example, AlexNet trained on ImageNet outputs a 1000-dimensional vector representing the probability that the input image belongs to each class. In semantic segmentation, however, the goal is to assign a label to every pixel in the image, which requires a different approach that accounts for the spatial relationships between pixels.

To address this challenge, the authors propose the FCN, a CNN modified to produce dense output maps rather than scalar outputs. As the name suggests, every layer of an FCN is convolutional: the fully connected layers at the end of the CNN are replaced with convolutional layers that preserve spatial information. A fully convolutional network can therefore accept input images of any size, and it uses deconvolution (transposed convolution) layers to upsample the feature map of the last convolutional layer back to the size of the input image. This yields a prediction for each pixel while preserving the spatial layout of the original input. Finally, pixel-by-pixel classification is performed on the upsampled feature map.
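The idea above can be sketched in a few lines of PyTorch. This is a toy stand-in, not the paper's VGG-16-based architecture: a small stride-2 conv encoder, a 1x1 "score" convolution in place of a fully connected classifier, and a transposed convolution that upsamples back to the input resolution. Because no layer is fully connected, the same model accepts inputs of different sizes.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # PASCAL VOC: 20 classes + background

class TinyFCN(nn.Module):
    """Minimal fully convolutional sketch (hypothetical, not the paper's net)."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        # Encoder: two stride-2 convs downsample the input by a factor of 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 1x1 conv replaces the fully connected classifier: one score per class.
        self.score = nn.Conv2d(32, num_classes, kernel_size=1)
        # Transposed conv ("deconvolution") upsamples the score map 4x.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=4, stride=4)

    def forward(self, x):
        return self.upsample(self.score(self.encoder(x)))

# No fully connected layers, so any input size (divisible by 4) works,
# and the output spatially matches the input.
model = TinyFCN()
for h, w in [(64, 64), (96, 128)]:
    out = model(torch.randn(1, 3, h, w))
    print(out.shape)  # (1, 21, h, w)
```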

From this figure, we can see the forward propagation and backward learning process of the FCN-32s model. A series of convolutions and downsampling steps produces a feature map with 21 channels (the 20 PASCAL VOC classes plus background). The subsequent upsampling yields a feature map the same size as the original image, where each pixel has 21 values along the channel dimension. After applying softmax to these 21 values, the class with the highest probability is taken as the prediction for that pixel.
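The per-pixel classification step can be sketched in NumPy. The score map here is random toy data standing in for the upsampled 21-channel output: softmax is applied along the channel axis, and each pixel's label is the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(21, 4, 4))  # (classes, H, W); toy 4x4 "image"

# Numerically stable softmax over the class (channel) dimension.
exp = np.exp(scores - scores.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)

# Each pixel is assigned the class with the highest probability.
pred = probs.argmax(axis=0)  # (H, W) label map
print(pred.shape)            # (4, 4)
```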

The process of converting a fully connected layer into a convolutional layer is straightforward. In a classification network, the feature matrix is first flattened and then passed through fully connected layers, which discards the height and width information. Replacing the fully connected operation with an equivalent convolution preserves the height and width dimensions while keeping the same number of parameters.
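The equivalence can be checked directly. Using VGG-16-style numbers as an assumption (a 512x7x7 feature map feeding a 4096-unit layer), an FC layer has exactly the same weights as a 7x7 convolution with 512 input and 4096 output channels, just reshaped:

```python
import torch
import torch.nn as nn

# FC layer on a flattened 512x7x7 feature map vs. an equivalent 7x7 conv.
fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)

# Copy the FC weights into the conv kernel: identical parameter count,
# just viewed as (out_channels, in_channels, kH, kW) instead of a matrix.
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))   # (1, 4096): spatial layout lost
out_conv = conv(x)          # (1, 4096, 1, 1): spatial layout kept
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-4))  # True
```

On larger inputs the convolutional version simply produces a larger grid of 4096-dimensional outputs, which is what lets the network slide over images of any size.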

Three models are presented in the paper: FCN-32s, FCN-16s, and FCN-8s.

The backbone of FCN is VGG-16, as shown in the following figure:

Depending on the upsampling rate, FCN comes in three versions:

1. FCN-32s directly upsamples the Conv7 features 32x to obtain the prediction.

2. FCN-16s first upsamples the Conv7 features 2x, fuses them with the pool4 features, and then upsamples 16x to obtain the prediction.

3. FCN-8s first upsamples the Conv7 features 2x and fuses them with the pool4 features; it then upsamples the fused features 2x again and fuses them with the pool3 features. Finally, an 8x upsampling produces the prediction.
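The FCN-8s fusion path above can be sketched with toy tensors standing in for the real VGG-16 features (assuming a 224x224 input, so pool3 is at 1/8 resolution, pool4 at 1/16, and Conv7 at 1/32, each already reduced to 21 score channels):

```python
import torch
import torch.nn as nn

C = 21  # class scores per location
conv7 = torch.randn(1, C, 7, 7)    # 224 / 32
pool4 = torch.randn(1, C, 14, 14)  # 224 / 16 (scored with a 1x1 conv)
pool3 = torch.randn(1, C, 28, 28)  # 224 / 8  (scored with a 1x1 conv)

up2_a = nn.ConvTranspose2d(C, C, kernel_size=2, stride=2)
up2_b = nn.ConvTranspose2d(C, C, kernel_size=2, stride=2)
up8 = nn.ConvTranspose2d(C, C, kernel_size=8, stride=8)

fused16 = up2_a(conv7) + pool4   # 2x upsample conv7, fuse with pool4
fused8 = up2_b(fused16) + pool3  # 2x upsample again, fuse with pool3
out = up8(fused8)                # final 8x upsample to input resolution
print(out.shape)  # torch.Size([1, 21, 224, 224])
```

Fusing the coarser, more semantic Conv7 scores with the finer pool4 and pool3 scores is what recovers the boundary detail that FCN-32s loses.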

The authors demonstrate the effectiveness of FCN on several benchmark datasets for semantic segmentation, including PASCAL VOC 2011 and 2012, as well as SIFT Flow. They show that FCN outperforms previous state-of-the-art methods on these datasets, achieving a roughly 20% relative improvement in mean IU on PASCAL VOC 2012 while also reducing inference time.
