Optimal Brain Damage
— LeCun, Denker and Solla, 1989, Advances in Neural Information Processing Systems
- We introduce OBD for reducing the size of a learning network by selectively deleting weights based on second-derivative information (saliency; a sketch of the formula follows this list).
- We show that OBD can be used both a) as an automatic network minimization procedure and b) as an interactive tool to suggest better architectures, in contrast with the view of weight deletion as a more-or-less automatic procedure.
- It is possible to take a perfectly reasonable network, delete half or more of the weights, and wind up with a network that works just as well, or better (seems like a precursor to the Lottery Ticket Hypothesis). But we emphasize that the starting point was a state-of-the-art network.
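For orientation, the second-derivative quantity OBD uses is a per-parameter saliency built from a diagonal approximation of the Hessian of the objective $E$ (notation as in the paper, with parameters $u_k$):

$$
s_k = \frac{1}{2} h_{kk} u_k^2,
\qquad
h_{kk} = \frac{\partial^2 E}{\partial u_k^2}
$$

Parameters with the smallest saliency $s_k$ are the ones deleted first.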
Motivation
The traditional way to trade off complexity against accuracy is to add a regularization cost based on some measure of complexity (e.g. VC-dimension, description length, number of non-zero free parameters, etc.).
A simple strategy consists of deleting parameters with "small saliency". Other things being equal, small-magnitude parameters will have the least saliency, so a reasonable initial strategy is to train the network, delete small-magnitude parameters in order, then retrain the network. This procedure can be iterated; in the limit, it reduces to continuous weight decay during training using disproportionately rapid decay of small-magnitude parameters. Two drawbacks are a) it requires fine-tuning of the pruning/decay coefficients to avoid catastrophic effects, and b) the learning process is significantly slowed down.
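As a concrete reference for this magnitude-based baseline (not OBD itself, since it uses no second-derivative information), here is a minimal numpy sketch; `magnitude_prune`, `train`, and `retrain` are illustrative names assumed for this example, not anything from the paper.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, fraction: float):
    """Zero out the given fraction of weights with the smallest magnitude.

    Returns the pruned weights and a boolean mask of surviving weights,
    so a retraining loop can keep the deleted weights at zero.
    """
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Iterated prune/retrain loop; `train` and `retrain` are placeholders for
# whatever gradient-based training procedure is used:
#
#   w = train(w_init)
#   for _ in range(rounds):
#       w, mask = magnitude_prune(w, fraction=0.2)
#       w = retrain(w, mask)  # mask gradients so pruned weights stay zero
```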