Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” Advances in Neural Information Processing Systems. 2015. (Citations: 116).
1 Motivation
Show, Attend and Tell only allows attention constrained to a fixed grid. We want the model to be able to attend to an arbitrary part of the image.
The pooling operation allows a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling, this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps in a CNN are not actually invariant to large transformations of the input data.
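The limited invariance of a single pooling layer can be illustrated with a small sketch. The code below is not from the paper: it is a pure-Python toy that assumes non-overlapping 2x2 max-pooling with stride 2, and the helper names `max_pool_2x2` and `single_pixel_image` are hypothetical.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling (stride 2) over a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def single_pixel_image(size, row, col):
    """Return a size x size image of zeros with a single 1 at (row, col)."""
    image = [[0] * size for _ in range(size)]
    image[row][col] = 1
    return image


# The feature starts at (2, 2), which falls in the pooling window covering columns 2-3.
base = max_pool_2x2(single_pixel_image(8, 2, 2))
# Shift by 1 pixel: still inside the same 2x2 window, so the pooled map is unchanged.
small_shift = max_pool_2x2(single_pixel_image(8, 2, 3))
# Shift by 4 pixels: lands in a different window, so the pooled map changes.
large_shift = max_pool_2x2(single_pixel_image(8, 2, 6))

print(base == small_shift)  # True
print(base == large_shift)  # False
```

Even a one-pixel shift changes the pooled output when it crosses a window boundary (for example, from column 3 to column 4), which is why the invariance is only partial at any single layer.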