Paper notes: "A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding"
Pruning
- prune by learning only the important connections:
all connections with weights below a threshold are removed from the network,
then the network is retrained to learn the final weights for the remaining sparse connections (a minimal pruning sketch follows after this list).
- store the pruned layers in compressed sparse row (CSR) or compressed sparse column (CSC) format,
which requires 2·nnz + n + 1 numbers, where nnz is the number of non-zero elements and n is the number of rows or columns.
To compress further, store the index difference between consecutive non-zero elements instead of the absolute position (see the second sketch below).
Pruning reduces the number of connections by 9× for AlexNet and 13× for the VGG-16 model.
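
A minimal sketch of the magnitude-based pruning step described above. This is not the authors' code; `threshold` is assumed to be a tunable per-layer hyperparameter, and the mask is kept so pruned weights stay at zero during retraining.

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, threshold: float):
    """Zero out connections whose absolute weight is below `threshold`;
    return the sparse weights plus the binary mask used during retraining."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# During retraining, gradients of pruned connections are masked so the removed
# weights stay at zero while the surviving connections are fine-tuned:
#   weights -= learning_rate * (grad * mask)
```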
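And a sketch of the relative-index storage for one sparse row. The `bits` parameter and the zero-valued filler entry used to bridge gaps larger than the representable span are assumptions about how overflow is handled, not details taken from the notes above.

```python
import numpy as np

def encode_relative_indices(dense_row: np.ndarray, bits: int = 8):
    """Encode one row of a sparse weight matrix as (values, index diffs).
    Gaps larger than 2**bits are bridged with zero-valued filler entries
    (an assumption about overflow handling)."""
    max_span = 2 ** bits
    values, diffs = [], []
    last = -1
    for idx in np.flatnonzero(dense_row):
        gap = idx - last
        while gap > max_span:          # insert a filler zero and keep going
            values.append(0.0)
            diffs.append(max_span)
            gap -= max_span
        values.append(float(dense_row[idx]))
        diffs.append(gap)
        last = idx
    return values, diffs

# Example: [0, 0, 0.7, 0, 0, -0.3] -> values [0.7, -0.3], diffs [3, 3]
```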
Quantization
- quantize the weights to enforce weight sharing:
network quantization further compresses the pruned network by reducing the number of bits required to represent each weight.
- Weight Sharing
- k-means clustering is used to identify the shared weights for each layer, so that all weights falling into the same cluster share the same value (see the k-means sketch after this list).
- Initialization of Shared Weights
- Forgy (random): randomly chooses k observations from the data set as centroids. Since there are two peaks in the bimodal weight distribution, the Forgy method tends to concentrate the centroids around those two peaks.
- Density-based: spaces the centroids along the CDF of the weights, which makes them denser around the two peaks but more scattered than the Forgy method.
- Linear initialization: linearly spaces the centroids over the [min, max] range of the original weights.
- Feed-forward and Back-propagation: during feed-forward each connection uses its shared centroid value, and during back-propagation the gradients of all weights in the same cluster are summed to update that shared weight (see the second sketch below).
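
A weight-sharing sketch, assuming 1-D k-means over the surviving (non-zero) weights of one layer with 2**bits clusters; `init_centroids` illustrates the three initialization schemes listed above. Function names and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centroids(weights: np.ndarray, n_clusters: int, method: str):
    if method == "forgy":     # random sample of existing weights
        return np.random.choice(weights, n_clusters, replace=False)
    if method == "density":   # equally spaced points along the weight CDF
        qs = np.linspace(0, 1, n_clusters + 2)[1:-1]
        return np.quantile(weights, qs)
    if method == "linear":    # equally spaced over [min, max]
        return np.linspace(weights.min(), weights.max(), n_clusters)
    raise ValueError(method)

def share_weights(weights: np.ndarray, bits: int = 5, method: str = "linear"):
    n_clusters = 2 ** bits
    w = weights[weights != 0].reshape(-1, 1)          # cluster non-zeros only
    init = init_centroids(w.ravel(), n_clusters, method).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(w)
    # each surviving weight is replaced by its cluster centroid, so only the
    # centroid table plus per-weight cluster indices need to be stored
    return km.cluster_centers_.ravel(), km.labels_
```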
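And a sketch of training with shared weights: the forward pass reconstructs the dense weights from the centroid table, and back-propagation sums the gradients of all weights in the same cluster into one gradient per shared weight. Names are hypothetical; for simplicity `labels` here assigns a cluster index to every position of the layer, with pruned positions assumed to be masked separately.

```python
import numpy as np

def forward_weights(centroids: np.ndarray, labels: np.ndarray, shape):
    """Reconstruct the dense weight matrix from the shared-weight table."""
    return centroids[labels].reshape(shape)

def centroid_gradients(weight_grad: np.ndarray, labels: np.ndarray,
                       n_clusters: int):
    """Accumulate dL/dW into one gradient per shared weight:
       dL/dc_k = sum of dL/dw_ij over all weights assigned to cluster k."""
    grads = np.zeros(n_clusters)
    np.add.at(grads, labels.ravel(), weight_grad.ravel())
    return grads

# Update step for the shared-weight table:
#   centroids -= lr * centroid_gradients(weight_grad, labels, len(centroids))
```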
Huffman coding
A Huffman code is an optimal prefix code commonly used for lossless data compression. Here it encodes the quantized weight indices and the sparse index differences, whose distributions are non-uniform, so the more frequent symbols get shorter codes (a small sketch follows).
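
A standard Huffman-coding sketch (not the paper's implementation), applied here to a stream of symbols such as the quantized cluster indices of a pruned layer.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from a list of symbols."""
    freq = Counter(symbols)
    # heap entries: (frequency, tie-breaker, partial code table)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol input
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # merge the two least frequent
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Frequent symbols end up with short codes, so the total bit count shrinks
# when the distribution of quantized weights / index diffs is non-uniform.
```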
Summary
The idea of this paper is good, but pruning part of the weights makes the filter matrices sparse, so a dedicated sparse-matrix computation library is required to support the operations above.