FCN summary

Jonathan Long, Evan Shelhamer, and Trevor Darrell's "Fully Convolutional Networks for Semantic Segmentation" is a groundbreaking paper in the field of computer vision, introducing a new method for semantic segmentation using fully convolutional networks (FCNs).

The authors first point out that traditional convolutional neural networks (CNNs) are designed for image-level classification and regression tasks, where the goal is to assign a single label to the entire image. Such networks typically attach several fully connected layers after the convolutional layers, mapping the feature maps produced by the convolutions into a fixed-length one-dimensional feature vector. For example, AlexNet trained on ImageNet outputs a 1000-dimensional vector representing the probability that the input image belongs to each class. In semantic segmentation, however, the goal is to assign a label to every pixel in the image, which requires a different approach that accounts for the spatial relationships between pixels.

To address this challenge, the authors propose the FCN, a CNN modified to produce dense output maps rather than scalar outputs. As the name suggests, every layer of an FCN is convolutional: the fully connected layers at the end of the CNN are replaced with convolutional layers that preserve spatial information. A fully convolutional network can therefore accept input images of any size, and it uses deconvolution (transposed convolution) layers to upsample the feature map of the last convolutional layer back to the size of the input image. This yields a prediction for each pixel while preserving the spatial layout of the original input. Finally, pixel-by-pixel classification is performed on the upsampled feature map.
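The idea above can be sketched in a few lines of PyTorch. This is a toy stand-in, not the paper's VGG-16-based architecture: a small stride-2 conv encoder, a 1x1 "score" convolution in place of a fully connected classifier, and a transposed convolution that upsamples back to the input resolution. Because no layer is fully connected, the same model accepts inputs of different sizes.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # PASCAL VOC: 20 classes + background

class TinyFCN(nn.Module):
    """Minimal fully convolutional sketch (hypothetical, not the paper's net)."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        # Encoder: two stride-2 convs downsample the input by a factor of 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 1x1 conv replaces the fully connected classifier: one score per class.
        self.score = nn.Conv2d(32, num_classes, kernel_size=1)
        # Transposed conv ("deconvolution") upsamples the score map 4x.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=4, stride=4)

    def forward(self, x):
        return self.upsample(self.score(self.encoder(x)))

# No fully connected layers, so any input size (divisible by 4) works,
# and the output spatially matches the input.
model = TinyFCN()
for h, w in [(64, 64), (96, 128)]:
    out = model(torch.randn(1, 3, h, w))
    print(out.shape)  # (1, 21, h, w)
```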

From this figure, we can see the forward propagation and backward learning process of the FCN-32s model. A series of convolutions and downsampling steps produces a feature map with 21 channels (the 20 PASCAL VOC classes plus background). The subsequent upsampling yields a feature map the same size as the original image, where each pixel has 21 values along the channel dimension. After applying softmax to these 21 values, the class with the highest probability is taken as the prediction for that pixel.
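The per-pixel classification step can be sketched in NumPy. The score map here is random toy data standing in for the upsampled 21-channel output: softmax is applied along the channel axis, and each pixel's label is the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(21, 4, 4))  # (classes, H, W); toy 4x4 "image"

# Numerically stable softmax over the class (channel) dimension.
exp = np.exp(scores - scores.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)

# Each pixel is assigned the class with the highest probability.
pred = probs.argmax(axis=0)  # (H, W) label map
print(pred.shape)            # (4, 4)
```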

The process of converting a fully connected layer into a convolutional layer is straightforward. In a classification network, the feature matrix is first flattened and then passed through fully connected layers, which discards the height and width information. Replacing the fully connected operation with an equivalent convolution preserves the height and width dimensions while keeping the same number of parameters.
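The equivalence can be checked directly. Using VGG-16-style numbers as an assumption (a 512x7x7 feature map feeding a 4096-unit layer), an FC layer has exactly the same weights as a 7x7 convolution with 512 input and 4096 output channels, just reshaped:

```python
import torch
import torch.nn as nn

# FC layer on a flattened 512x7x7 feature map vs. an equivalent 7x7 conv.
fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)

# Copy the FC weights into the conv kernel: identical parameter count,
# just viewed as (out_channels, in_channels, kH, kW) instead of a matrix.
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))   # (1, 4096): spatial layout lost
out_conv = conv(x)          # (1, 4096, 1, 1): spatial layout kept
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-4))  # True
```

On larger inputs the convolutional version simply produces a larger grid of 4096-dimensional outputs, which is what lets the network slide over images of any size.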

Three models are presented in the paper: FCN-32s, FCN-16s, and FCN-8s.

The backbone of FCN is VGG-16, as shown in the following figure:

Depending on the upsampling rate, FCN comes in three versions:

1. FCN-32s directly upsamples the Conv7 features 32x to obtain the prediction.

2. FCN-16s first upsamples the Conv7 features 2x, fuses them with the pool4 features, and then upsamples 16x to obtain the prediction.

3. FCN-8s first upsamples the Conv7 features 2x and fuses them with the pool4 features; it then upsamples the fused features 2x again and fuses them with the pool3 features. Finally, an 8x upsampling produces the prediction.
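The FCN-8s fusion path above can be sketched with toy tensors standing in for the real VGG-16 features (assuming a 224x224 input, so pool3 is at 1/8 resolution, pool4 at 1/16, and Conv7 at 1/32, each already reduced to 21 score channels):

```python
import torch
import torch.nn as nn

C = 21  # class scores per location
conv7 = torch.randn(1, C, 7, 7)    # 224 / 32
pool4 = torch.randn(1, C, 14, 14)  # 224 / 16 (scored with a 1x1 conv)
pool3 = torch.randn(1, C, 28, 28)  # 224 / 8  (scored with a 1x1 conv)

up2_a = nn.ConvTranspose2d(C, C, kernel_size=2, stride=2)
up2_b = nn.ConvTranspose2d(C, C, kernel_size=2, stride=2)
up8 = nn.ConvTranspose2d(C, C, kernel_size=8, stride=8)

fused16 = up2_a(conv7) + pool4   # 2x upsample conv7, fuse with pool4
fused8 = up2_b(fused16) + pool3  # 2x upsample again, fuse with pool3
out = up8(fused8)                # final 8x upsample to input resolution
print(out.shape)  # torch.Size([1, 21, 224, 224])
```

Fusing the coarser, more semantic Conv7 scores with the finer pool4 and pool3 scores is what recovers the boundary detail that FCN-32s loses.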

The authors demonstrate the effectiveness of FCN on several benchmark datasets for semantic segmentation, including PASCAL VOC 2011 and 2012, as well as SIFT Flow. They show that FCN outperforms previous state-of-the-art methods on these datasets, achieving a roughly 20% relative improvement in mean IU on PASCAL VOC 2012 while also reducing inference time.
