Neural Network Compression

Revision as of 22:15, 3 February 2021

A brief survey of neural network compression techniques, based on popular papers collected from existing surveys.

Pruning

Sensitivity Methods

The idea here is to measure how sensitive the network is to each weight (i.e. connection) or neuron.
That is, if you remove the weight or neuron, how much does the output change?
Typically, weights are pruned by zeroing them out and freezing them.

In general, the procedure is

Train the network with a lot of parameters.
Compute sensitivity for each parameter.
Delete low-saliency parameters.
Continue training and repeat pruning until the number of parameters is low enough or error is too high.
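
The four steps above can be sketched on a toy linear model. This is only an illustration under stated assumptions: the model, the brute-force "zero one weight and measure the loss increase" saliency, and the names `train`, `ablation_saliency`, `iterative_prune`, `min_params` and `max_error` are all hypothetical choices, not taken from any of the papers below.

```python
import numpy as np

def train(w, mask, X, y, steps=500, lr=0.1):
    """Gradient descent on mean squared error; pruned weights stay frozen at zero."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def ablation_saliency(w, mask, X, y):
    """Saliency of weight k = increase in loss when only weight k is zeroed."""
    base = mse(w, X, y)
    sal = np.full(w.shape, np.inf)  # already-pruned weights are never picked again
    for k in np.flatnonzero(mask):
        w_k = w.copy()
        w_k[k] = 0.0
        sal[k] = mse(w_k, X, y) - base
    return sal

def iterative_prune(X, y, min_params=2, max_error=1e-2):
    w = np.zeros(X.shape[1])
    mask = np.ones_like(w)
    w = train(w, mask, X, y)                        # 1. train the full model
    while mask.sum() > min_params:
        sal = ablation_saliency(w, mask, X, y)      # 2. compute sensitivity
        trial_mask = mask.copy()
        trial_mask[np.argmin(sal)] = 0.0            # 3. delete the least salient weight
        trial_w = train(w * trial_mask, trial_mask, X, y)  # 4. continue training
        if mse(trial_w, X, y) > max_error:          # stop once the error is too high
            break
        w, mask = trial_w, trial_mask
    return w, mask

# Toy data: only three of the six inputs actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([2.0, -3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ true_w
w, mask = iterative_prune(X, y)
```

The mask is multiplied back in after every gradient step, which is one simple way to keep pruned weights zeroed and frozen while the remaining weights continue to train.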

Sometimes, pruning can also increase accuracy and improve generalization.

Mozer and Smolensky (1988)^[1] use a gate for each neuron. The sensitivity can then be estimated with the derivative w.r.t. the gate.
Karnin^[2] estimates the sensitivity by monitoring the change in weight during training.
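LeCun et al. present Optimal Brain Damage^[3], which uses the second derivatives of the objective with respect to the parameters to estimate each parameter's saliency.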
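
The gate idea can be sketched as follows. This is a minimal illustration, not the exact procedure from the paper: the one-hidden-layer tanh network, the squared-error loss, and the names `forward`, `loss` and `gate_gradient` are assumptions. Each hidden unit i is multiplied by a gate α_i that equals 1 in the normal network; a first-order Taylor expansion says that moving α_i from 1 to 0 (removing the unit) changes the error by roughly -∂E/∂α_i.

```python
import numpy as np

def forward(W1, W2, X, alpha):
    H = np.tanh(X @ W1.T)            # hidden activations, shape (n, hidden)
    return (H * alpha) @ W2.T, H     # each hidden unit is scaled by its gate

def loss(W1, W2, X, T, alpha):
    Y, _ = forward(W1, W2, X, alpha)
    return 0.5 * float(np.mean(np.sum((Y - T) ** 2, axis=1)))

def gate_gradient(W1, W2, X, T):
    """dE/d(alpha_i) evaluated at alpha = 1 (all units fully present)."""
    alpha = np.ones(W1.shape[0])
    Y, H = forward(W1, W2, X, alpha)
    # dE/dalpha_i = mean over samples of sum_k (Y - T)_k * W2[k, i] * H_i
    return np.mean(((Y - T) @ W2) * H, axis=0)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
X, T = rng.normal(size=(50, 3)), rng.normal(size=(50, 2))

# First-order estimate of the error increase caused by removing each unit.
estimated_increase = -gate_gradient(W1, W2, X, T)
```

Units with the smallest estimated increase are the candidates for pruning. The estimate needs only one backward pass, rather than one forward pass per removed unit.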

Redundancy Methods

Srinivas and Babu^[4] propose a pairwise similarity between neurons: \(\displaystyle s = \Vert a_j^2 \Vert_1 \Vert W_i - W_j \Vert^2_{2}\) where \(\displaystyle a_j\) is the vector of weights on neuron j at the layer above and \(\displaystyle W_i\), \(\displaystyle W_j\) are the weight vectors of neurons i and j. This combines a weight metric and a similarity metric into one sensitivity metric. When a neuron is pruned, the weight matrices of the current and next layers need to be updated.
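
A rough sketch of this metric for one fully connected layer is below. The function names and the matrix layout are hypothetical, biases are ignored, and the update rule (adding the removed neuron's outgoing weights to the surviving neuron's) is an assumption motivated by the case where two neurons have identical incoming weights and therefore identical activations.

```python
import numpy as np

def saliency_matrix(W, A):
    """s[i, j] = ||a_j^2||_1 * ||W_i - W_j||_2^2 for every pair of neurons.

    W: (hidden, inputs) incoming weights, one row per neuron.
    A: (outputs, hidden) next-layer weights; column j is a_j.
    """
    diff = W[:, None, :] - W[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)      # ||W_i - W_j||^2
    a_norm = np.sum(A ** 2, axis=0)        # ||a_j^2||_1 = sum of squared entries
    s = dist2 * a_norm[None, :]
    np.fill_diagonal(s, np.inf)            # a neuron is never paired with itself
    return s

def prune_most_redundant(W, A):
    """Remove neuron j of the lowest-saliency pair (i, j) and update both layers."""
    s = saliency_matrix(W, A)
    i, j = np.unravel_index(np.argmin(s), s.shape)
    A = A.copy()
    A[:, i] += A[:, j]                     # assumed update: fold a_j into a_i
    keep = np.arange(W.shape[0]) != j
    return W[keep], A[:, keep]
```

When two neurons are exact duplicates, the saliency of the pair is zero and this merge leaves the layer's output unchanged. For near-duplicates it introduces an error that grows with the saliency value.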

Structured Pruning

Structured pruning removes whole structures, such as channels or filters, so that the remaining weights stay dense and the pruned network can still use standard dense matrix multiplication operations.

Wen et al. (2016) ^[5] propose Structured Sparsity Learning (SSL) on CNNs. Given filters of size (N, C, M, K), i.e. (out-channels, in-channels, height, width), they use a group lasso loss/regularization to penalize usage of extra input and output channels. They also learn filter shapes using this regularization.
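
As a rough illustration of the regularizer, a group lasso penalty sums the L2 norms of groups of weights, which pushes entire groups to zero together. The sketch below assumes one group per output filter and one group per input channel. The function name and the coefficients `lam_filter` and `lam_channel` are hypothetical, and the filter-shape groups are omitted.

```python
import numpy as np

def group_lasso_penalty(W, lam_filter=1e-4, lam_channel=1e-4):
    """W has shape (N, C, M, K) = (out-channels, in-channels, height, width)."""
    # One group per output filter W[n, :, :, :]
    filter_norms = np.sqrt(np.sum(W ** 2, axis=(1, 2, 3)))
    # One group per input channel W[:, c, :, :]
    channel_norms = np.sqrt(np.sum(W ** 2, axis=(0, 2, 3)))
    return lam_filter * filter_norms.sum() + lam_channel * channel_norms.sum()
```

This term would be added to the task loss during training. A filter or channel whose norm reaches zero can then be removed outright, which shrinks the dense tensor instead of leaving scattered zeros.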

Quantization

There are many codebases which use 8-bit or 16-bit representations instead of the standard 32-bit floats.
Work on quantization typically focuses on different representations and mixed-precision training, though quantization can also be used to speed up inference.
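
To make the 8-bit case concrete, here is a minimal sketch of one common scheme, symmetric per-tensor quantization. The text above does not commit to a particular scheme, so the scheme and the names `quantize_int8` and `dequantize` are assumptions.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ≈ scale * q with q in [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Each value is stored in one byte plus a single shared `scale`, and the round-trip error per element is at most about half of `scale`.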

Google uses bfloat16 for training on TPUs.
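
bfloat16 keeps the sign bit and the 8 exponent bits of a 32-bit float but only 7 of its mantissa bits, so it is bit-for-bit the upper half of a float32. The sketch below shows that relationship by truncation. The helper names are hypothetical, and real hardware and libraries typically round instead of truncating.

```python
import numpy as np

def to_bfloat16_bits(x):
    """Keep the top 16 bits of each float32 (truncation, no rounding)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def from_bfloat16_bits(b):
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)
```

Because the exponent width is unchanged, the representable range matches float32, while the relative precision drops to roughly 2 to 3 decimal digits.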

Factorization

Also known as tensor decomposition.

Denil et al. (2013)^[6] propose a low-rank factorization: \(\displaystyle W=UV\) where \(\displaystyle U\) is \(\displaystyle n_v \times n_\alpha\) and \(\displaystyle V\) is \(\displaystyle n_\alpha \times n_h\). Here, vectors are left-multiplied against \(\displaystyle W\). They compare several scenarios: training both U and V, randomly setting U with identity basis vectors, randomly setting U with iid Gaussian entries, and more.
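
To show the shapes and the parameter savings, the sketch below factorizes an existing matrix with a truncated SVD. Note that this is a stand-in technique chosen for illustration, not one of the training scenarios compared in the paper, and `low_rank_factors` is a hypothetical name.

```python
import numpy as np

def low_rank_factors(W, n_alpha):
    """Best rank-n_alpha approximation W ≈ U @ V via truncated SVD."""
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    U = P[:, :n_alpha] * s[:n_alpha]   # shape (n_v, n_alpha)
    V = Qt[:n_alpha, :]                # shape (n_alpha, n_h)
    return U, V

n_v, n_h, n_alpha = 64, 32, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(n_v, n_alpha)) @ rng.normal(size=(n_alpha, n_h))  # exactly rank 4
U, V = low_rank_factors(W, n_alpha)

x = rng.normal(size=(1, n_v))          # row vector, left-multiplied against W
full_out = x @ W
low_rank_out = (x @ U) @ V             # two thin products replace one large one

params_full = n_v * n_h                # 2048
params_low = n_alpha * (n_v + n_h)     # 384
```

The factorized layer stores n_α(n_v + n_h) numbers instead of n_v n_h, so it only pays off when n_α is well below both n_v and n_h.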

Libraries

Both TensorFlow and PyTorch have built-in libraries for pruning.

These support magnitude-based pruning, which zeros out small weights.
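
The underlying idea of magnitude-based pruning can be sketched in plain NumPy. This is a framework-independent illustration, not the TensorFlow or PyTorch API, and `magnitude_prune` and `sparsity` are hypothetical names.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy(), np.ones(W.shape, dtype=bool)
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    mask = np.abs(W) > threshold       # ties at the threshold are pruned as well
    return W * mask, mask
```

The returned mask can be reapplied after each optimizer step so that pruned weights stay at zero during further training.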

Resources

Surveys

Pruning Algorithms: A Survey (1993) by Russell Reed
A Survey of Model Compression and Acceleration for Deep Neural Networks (2017) by Cheng et al.
An Overview of Neural Network Compression (2020) by James O'Neill

References

1. Mozer, M. C., & Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. NeurIPS 1988.
2. Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE TNNLS 1990.
3. LeCun, Y., Denker, J. S., Solla, S. A., Howard, R. E., & Jackel, L. D. (1989). Optimal brain damage. NeurIPS 1989.
4. Srinivas, S., & Babu, R. V. (2015). Data-free parameter pruning for deep neural networks.
5. Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665.
6. Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543.
