Key Problems
- Limited structure design space.
- Learning bias
- Because both the loss functions and the category distributions differ between classification and detection tasks, we argue that the two tasks have different searching/optimization spaces. Learning may therefore be biased towards a local minimum that is not the best for the detection task.
- Domain mismatch
- State-of-the-art object detectors rely heavily on off-the-shelf networks pre-trained on large-scale classification datasets like ImageNet.
- Transferring pre-trained models from classification to detection is even more difficult when the two domains are discrepant.
Architecture
Principles
- Training a detection network from scratch requires a proposal-free framework.
- Deep Supervision
- Transition w/o Pooling Layer. We introduce this layer in order to increase the number of dense blocks without reducing the final feature map resolution.
- Stem Block
- The stem block can reduce the information loss from raw input images.
- Dense Prediction Structure
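The effect of the transition layer without pooling on feature map resolution can be checked with simple shape arithmetic. The sketch below is a minimal illustration, not the paper's implementation: the 300x300 input size, the stem layout (three 3x3 convolutions, the first with stride 2, followed by a 2x2 max pool), and all names such as `Layer`, `trace`, and `STEM` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    """A conv or pooling layer described only by its geometry (hypothetical helper)."""
    name: str
    kernel: int
    stride: int
    padding: int = 0


def out_size(size: int, layer: Layer) -> int:
    """Standard output-size formula: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * layer.padding - layer.kernel) // layer.stride + 1


def trace(size: int, layers: Sequence[Layer]) -> int:
    """Push a square feature map of side `size` through a stack of layers."""
    for layer in layers:
        size = out_size(size, layer)
    return size


# Assumed stem layout: three 3x3 convs (first with stride 2), then a 2x2 max pool.
STEM = [
    Layer("conv3x3/s2", kernel=3, stride=2, padding=1),
    Layer("conv3x3/s1", kernel=3, stride=1, padding=1),
    Layer("conv3x3/s1", kernel=3, stride=1, padding=1),
    Layer("maxpool2x2/s2", kernel=2, stride=2),
]

# A transition layer with pooling halves the resolution; without pooling it
# is only a 1x1 conv, so more dense blocks can follow at the same resolution.
TRANSITION_WITH_POOL = [
    Layer("conv1x1", kernel=1, stride=1),
    Layer("maxpool2x2/s2", kernel=2, stride=2),
]
TRANSITION_WITHOUT_POOL = [Layer("conv1x1", kernel=1, stride=1)]

after_stem = trace(300, STEM)  # assumed 300x300 input
with_pool = trace(after_stem, TRANSITION_WITH_POOL)
without_pool = trace(after_stem, TRANSITION_WITHOUT_POOL)

print(after_stem, with_pool, without_pool)
```

Under these assumptions, the pooled transition shrinks the map while the pooling-free transition leaves its side length unchanged, which is what allows extra dense blocks to be stacked without reducing the final feature map resolution.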
Contributions
- DSOD is a simple yet efficient framework that can learn object detectors from scratch.
- DSOD is fairly flexible, so that we can tailor various network structures for different computing platforms such as server, desktop, mobile and even embedded devices.
- We present DSOD, to the best of our knowledge the world's first framework that can train object detection networks from scratch with state-of-the-art performance.
- We introduce and validate a set of principles to design efficient object detection networks from scratch through step-by-step ablation studies.
- We show that DSOD can achieve state-of-the-art performance on three standard benchmarks (PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO) with real-time processing speed and more compact models.
Experiments
Others
- A well-designed network structure can outperform state-of-the-art solutions without using pre-trained models.
- Only the proposal-free method (the 3rd category) can converge successfully without pre-trained models.
- RoI pooling generates features for each region proposal, which hinders gradients from being smoothly back-propagated from the region level to the convolutional feature maps.
- The proposal-based methods work well with pre-trained network models because the parameter initialization is good for the layers before RoI pooling, while this is not true when training from scratch.