TITLE: Rethinking the Inception Architecture for Computer Vision
AUTHOR: Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna
ASSOCIATION: Google Inc., University College London
- Several general and specific design principles are discussed
General Design Principles
- Avoid representational bottlenecks, especially early in the network. One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand.
- Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.
- Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power.
- Balance the width and depth of the network.
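The third principle above can be illustrated with a quick cost calculation. This is a minimal sketch of the spatial-aggregation idea — reduce the channel dimension with a 1x1 convolution before the spatially expensive 3x3 — and the channel counts are hypothetical, not taken from the paper:

```python
# Sketch: 1x1 channel reduction before 3x3 spatial aggregation.
# The channel count C is a hypothetical placeholder.
def conv_mults(k, in_ch, out_ch):
    """Multiplications per output position of a k x k convolution."""
    return k * k * in_ch * out_ch

C = 256
direct = conv_mults(3, C, C)  # 3x3 directly on C channels
# 1x1 reduce to C/4, then 3x3 back up to C channels
reduced = conv_mults(1, C, C // 4) + conv_mults(3, C // 4, C)
print(reduced / direct)  # well under half the cost of the direct path
```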
Specific Design Strategies
- Factorizing Convolutions with Large Filter Size includes Factorization into Smaller Convolutions and Spatial Factorization into Asymmetric Convolutions. Both reduce computational cost while preserving the expressiveness of the learned function.
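The savings from both factorizations can be sketched with a simple per-position multiply count; the channel count below is a hypothetical placeholder, not a figure from the paper:

```python
# Sketch of the two factorizations via per-position multiply counts.
def conv_mults(kh, kw, in_ch, out_ch):
    """Multiplications per output position of a kh x kw convolution."""
    return kh * kw * in_ch * out_ch

C = 64  # hypothetical channel count

# Factorization into smaller convolutions: one 5x5 vs. two stacked 3x3
# (same receptive field, plus an extra nonlinearity in between).
full_5x5 = conv_mults(5, 5, C, C)
two_3x3 = 2 * conv_mults(3, 3, C, C)
print(two_3x3 / full_5x5)  # 0.72 -> ~28% fewer multiplications

# Spatial factorization into asymmetric convolutions: 3x3 vs. 3x1 then 1x3.
full_3x3 = conv_mults(3, 3, C, C)
asym = conv_mults(3, 1, C, C) + conv_mults(1, 3, C, C)
print(asym / full_3x3)  # ~0.67 -> ~33% fewer multiplications
```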
- Auxiliary Classifiers act as a regularizer rather than helping the low-level features evolve. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.
- Efficient Grid Size Reduction reduces the computational cost while avoiding a representational bottleneck.
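The trade-off behind the grid reduction can be sketched numerically. The grid and channel sizes below are illustrative placeholders, not the paper's exact layer dimensions, and 1x1 convolutions stand in for the expansion step:

```python
# Illustrative cost comparison for halving a d x d grid while expanding
# channels C -> 2C.
def conv_cost(grid, k, in_ch, out_ch):
    """Approximate multiplications of a k x k conv over a grid x grid map."""
    return grid * grid * k * k * in_ch * out_ch

d, C = 34, 320  # hypothetical grid size and channel count

# (a) Expand first, then pool: no bottleneck, but the convolution runs at
#     full resolution and is expensive.
expand_then_pool = conv_cost(d, 1, C, 2 * C)
# (b) Pool first, then expand: cheap, but pooling before expansion squeezes
#     the representation down to C channels -- a representational bottleneck.
pool_then_expand = conv_cost(d // 2, 1, C, 2 * C)
# (c) The paper's trick: a stride-2 conv branch and a stride-2 pool branch
#     in parallel, each emitting C channels, concatenated to 2C. The conv
#     runs at half resolution with half the output channels, and no branch
#     compresses the representation before the concatenation.
parallel_branch_conv = conv_cost(d // 2, 1, C, C)

print(expand_then_pool, pool_then_expand, parallel_branch_conv)
```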
Some Other Ideas
In this paper, a very interesting experiment is worth noting: networks with different receptive field sizes can achieve similar results as long as their computational cost is held roughly constant.
In my own trials with SSD, I found that networks of similar computational cost but different receptive field sizes give very different results on the detection task. For example, Network A has a receptive field of 112x112, while Network B's is 170x170. Network B performs slightly better than Network A on the classification task. On the contrary, after the two networks are fine-tuned on 200x200 images for detection, Network A is better. Thus, how about training a network with a receptive field of, say, 56x56 and fine-tuning it on 100x100 images? Would it achieve a comparable result?
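For experiments like the one above, the receptive field of a candidate stack can be checked with the standard receptive-field recurrence. This is a hypothetical helper, not code from the paper:

```python
# Receptive field of a stack of conv/pool layers, each given as
# (kernel_size, stride), using rf += (k - 1) * jump; jump *= stride.
def receptive_field(layers):
    rf, jump = 1, 1  # current receptive field and cumulative stride
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Sanity check: two stacked 3x3 convs cover the same field as one 5x5.
print(receptive_field([(3, 1), (3, 1)]), receptive_field([(5, 1)]))  # 5 5
```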