Neural Network Compression

A brief survey of neural network compression techniques, sampled from existing surveys.
==Pruning==
===Sensitivity Methods===
The idea here is to measure how sensitive each weight/connection or neuron is, i.e., if you remove it, how much will the output change?
Typically, weights are pruned by zeroing them out, freezing them, and fine-tuning the unfrozen weights.
In general, the procedure is (a sketch of this loop appears below the list):
# Train the network with a large number of parameters.
# Compute the sensitivity (saliency) of each parameter.
# Delete low-saliency parameters.
# Continue training to fine-tune the remaining parameters.
# Repeat pruning until the number of parameters is low enough or the error grows too high.
Sometimes, pruning can also increase accuracy and improve generalization.
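Below is a minimal sketch of this loop in PyTorch, using weight magnitude as a stand-in saliency measure; the model, data loader, pruning fraction, and training settings are placeholder assumptions rather than values from any of the cited papers.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, fraction: float) -> dict:
    """Zero out the lowest-magnitude fraction of each weight matrix; return keep-masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases / norm parameters
            continue
        k = max(1, int(fraction * param.numel()))
        threshold = param.detach().abs().flatten().kthvalue(k).values
        mask = param.detach().abs() > threshold
        param.data.mul_(mask)                    # delete (zero) low-saliency parameters
        masks[name] = mask
    return masks

def fine_tune(model: nn.Module, loader, masks: dict, epochs: int = 1, lr: float = 1e-3):
    """Continue training while keeping pruned weights frozen at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            with torch.no_grad():                # re-apply masks so pruned weights stay zero
                for name, param in model.named_parameters():
                    if name in masks:
                        param.mul_(masks[name])

# Hypothetical usage: repeat prune -> fine-tune until sparse enough or the error grows too much.
# for _ in range(5):
#     masks = magnitude_prune_(model, fraction=0.2)
#     fine_tune(model, train_loader, masks)
</syntaxhighlight>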
* Mozer and Smolensky (1988)<ref name="mozer1988skeletonization"></ref> use a gate on each neuron. The sensitivity can then be estimated with the derivative w.r.t. the gate (a sketch follows this list).
* Karnin<ref name="karnin1990simple"></ref> estimates the sensitivity of each weight by monitoring its change during training.
* LeCun ''et al.'' present ''Optimal Brain Damage''<ref name="lecun1989optimal"></ref>, which uses the second derivative of the loss with respect to each weight.
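To make the gate idea concrete, here is a minimal sketch assuming a small fully connected network: each hidden unit is multiplied by a gate fixed at 1, so the forward pass is unchanged, and the gradient of the loss with respect to the gate approximates the unit's sensitivity. This is an illustrative approximation of the skeletonization idea, not the authors' exact relevance measure.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Two-layer MLP with a per-hidden-unit gate used only to measure sensitivity."""
    def __init__(self, d_in=20, d_hidden=64, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        # Gates are fixed at 1 so they do not change the forward pass,
        # but their gradient measures how much each unit affects the loss.
        self.gates = nn.Parameter(torch.ones(d_hidden))

    def forward(self, x):
        h = torch.relu(self.fc1(x)) * self.gates
        return self.fc2(h)

# Toy example with random data (placeholder for a real dataset).
model = GatedMLP()
x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Sensitivity of unit i ~ |dL/d gate_i|; low values suggest prunable units.
sensitivity = model.gates.grad.abs()
print(sensitivity.topk(5, largest=False).indices)  # candidate units to prune
</syntaxhighlight>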
===Redundancy Methods===
* Srinivas and Babu<ref name="srinivas2015data"></ref> propose a pairwise saliency between neurons: <math>s_{ij} = \Vert a_j^2 \Vert_1 \Vert W_i - W_j \Vert^2_{2}</math>, where <math>a_j</math> is the vector of weights attached to neuron <math>j</math> in the layer above and <math>W_i, W_j</math> are the neurons' incoming weight vectors. This combines a weight-magnitude term and a similarity term into one saliency measure. When a neuron is pruned, the weight matrices of both the current and the next layer need to be updated (a rough sketch follows).
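A rough sketch for two adjacent fully connected layers is below: compute the pairwise saliency, then remove one neuron of the least salient pair and fold its outgoing weights into the similar surviving neuron. The layer sizes and the single-pair removal are illustrative assumptions and not the exact procedure from the paper.
<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
W1 = torch.randn(64, 20)    # incoming weights: 64 hidden neurons, 20 inputs
W2 = torch.randn(10, 64)    # outgoing weights: 10 outputs, 64 hidden neurons

# Pairwise saliency s[i, j] ~ ||a_j^2||_1 * ||W_i - W_j||_2^2, with a_j = W2[:, j].
a_sq = (W2 ** 2).sum(dim=0)                                   # ||a_j^2||_1 per neuron j
diff = ((W1[:, None, :] - W1[None, :, :]) ** 2).sum(dim=-1)   # ||W_i - W_j||_2^2
saliency = a_sq[None, :] * diff
saliency.fill_diagonal_(float("inf"))                         # ignore i == j

# Remove neuron j of the least salient pair; fold its outgoing weights into neuron i.
i, j = divmod(int(saliency.argmin()), saliency.shape[1])
W2[:, i] += W2[:, j]
keep = [k for k in range(W1.shape[0]) if k != j]
W1, W2 = W1[keep], W2[:, keep]      # both the current and the next layer's matrices shrink
print(W1.shape, W2.shape)           # torch.Size([63, 20]), torch.Size([10, 63])
</syntaxhighlight>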
===Structured Pruning===
Structured pruning focuses on keeping a dense structure in the network so that the pruned network can still benefit from standard dense matrix multiplication operations.<br>
This is in contrast to unstructured pruning, which zeros out values in the weight matrix but may not necessarily run faster.
* Wen ''et al.'' (2016)<ref name="wen2016learning"></ref> propose Structured Sparsity Learning (SSL) for CNNs. Given filters of shape (N, C, M, K), i.e. (out-channels, in-channels, height, width), they use a group-lasso loss/regularization to penalize the use of extra input and output channels. They also learn filter shapes using this regularization (a sketch of the filter-wise penalty follows).
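Here is a minimal sketch of a group-lasso penalty over the output channels (whole filters) of a convolution, which is the flavor of structured regularizer SSL applies; the layer shape and penalty weight are placeholder assumptions.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def group_lasso_filters(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of L2 norms over output-channel groups: encourages whole filters to go to zero."""
    w = conv.weight                          # shape (N, C, K, K) = (out, in, height, width)
    return w.flatten(start_dim=1).norm(dim=1).sum()

conv = nn.Conv2d(16, 32, kernel_size=3)
x = torch.randn(8, 16, 28, 28)
task_loss = conv(x).pow(2).mean()            # stand-in for a real task loss
loss = task_loss + 1e-3 * group_lasso_filters(conv)   # add the structured-sparsity penalty
loss.backward()
</syntaxhighlight>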
==Quantization==
Many codebases use 8-bit or 16-bit representations instead of the standard 32-bit floats.
Work on quantization typically focuses on alternative representations and mixed-precision training, though quantization can also be used to speed up inference.
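Before the specific examples below, here is a minimal sketch of the simplest flavor, symmetric per-tensor int8 quantization of a weight matrix; this is a generic illustration and not any particular library's scheme.
<syntaxhighlight lang="python">
import torch

def quantize_int8(w: torch.Tensor):
    """Map floats to int8 with a single per-tensor scale (symmetric quantization)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # worst-case quantization error
</syntaxhighlight>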
* Google uses [https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus bfloat16] for training on TPUs.
* Gupta ''et al.''<ref name="gupta2015limited"></ref> train using a custom 16-bit fixed-point representation with ''stochastic rounding''. They observe little to no degradation in MNIST MLP and CIFAR10 CNN classification accuracy. Stochastic rounding rounds up or down to a neighboring representable value with probability proportional to proximity (sketch below).
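A small sketch of stochastic rounding onto a fixed grid is shown below; the grid spacing is an arbitrary choice and not the exact 16-bit fixed-point format from the paper.
<syntaxhighlight lang="python">
import torch

def stochastic_round(x: torch.Tensor, step: float = 2 ** -8) -> torch.Tensor:
    """Round to a grid of spacing `step`, choosing the nearer grid point with higher probability."""
    scaled = x / step
    floor = torch.floor(scaled)
    frac = scaled - floor                        # in [0, 1): distance above the lower grid point
    round_up = torch.rand_like(frac) < frac      # round up with probability = frac
    return (floor + round_up.float()) * step

x = torch.full((100000,), 0.3)
print(stochastic_round(x).mean())   # close to 0.3: stochastic rounding is unbiased in expectation
</syntaxhighlight>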
==Factorization==
Also known as tensor decomposition.
* Denil ''et al.'' (2013)<ref name="denil2013predicting"></ref> propose a low-rank factorization <math>W=UV</math> where <math>U</math> is <math>n_v \times n_\alpha</math> and <math>V</math> is <math>n_\alpha \times n_h</math>. Here, vectors are left-multiplied against <math>W</math>. They compare several scenarios: training both U and V, setting U to random identity basis vectors, setting U to iid Gaussian entries, and more (a sketch of a factorized layer follows).
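Below is a minimal sketch of a linear layer whose weight is stored as the low-rank product <math>W = UV</math>; the dimensions and rank are placeholder assumptions, and the convention (PyTorch-style right multiplication by the transpose) differs slightly from the left multiplication above.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer whose weight is factored as W = U @ V with inner rank r."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.01)

    def forward(self, x):                   # x: (batch, d_in)
        return x @ self.V.t() @ self.U.t()  # equivalent to x @ (U @ V).t(), but cheaper

layer = LowRankLinear(d_in=1024, d_out=1024, rank=64)
print(1024 * 1024, 2 * 1024 * 64)           # 1,048,576 vs 131,072 parameters
x = torch.randn(8, 1024)
print(layer(x).shape)                       # torch.Size([8, 1024])
</syntaxhighlight>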
==Libraries==
Both TensorFlow and PyTorch have built-in libraries for pruning:
* [https://pytorch.org/tutorials/intermediate/pruning_tutorial.html PyTorch pruning tutorial]
* [https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#overview TF: Pruning with Keras]
These support magnitude-based pruning, which zeroes out small weights (example below).
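For example, PyTorch's pruning utilities apply exactly this kind of magnitude-based pruning; the module and the 30% amount below are arbitrary choices.
<syntaxhighlight lang="python">
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(float((layer.weight == 0).float().mean()))  # ~0.3 sparsity

# Make the pruning permanent (removes the mask and re-parametrization).
prune.remove(layer, "weight")
</syntaxhighlight>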
==Resources==
===Surveys===
* [https://axon.cs.byu.edu/~martinez/classes/678/Papers/Reed_PruningSurvey.pdf Pruning algorithms: a survey] (1993) by Russell Reed
* [https://arxiv.org/pdf/1710.09282.pdf A Survey of Model Compression and Acceleration for Deep Neural Networks] (2017) by Cheng et al.
* [https://arxiv.org/abs/2006.03669 An Overview of Neural Network Compression] (2020) by James O'Neill
==References==
{{reflist|refs=
<ref name="mozer1988skeletonization">Mozer, M. C., & Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. (NeurIPS 1988). [https://proceedings.neurips.cc/paper/1988/file/07e1cd7dca89a1678042477183b7ac3f-Paper.pdf PDF]</ref>
<ref name="karnin1990simple">Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. (IEEE Transactions on Neural Networks, 1990). [https://ieeexplore.ieee.org/document/80236 IEEE Xplore]</ref>
<ref name="lecun1989optimal">LeCun, Y., Denker, J. S., Solla, S. A., Howard, R. E., & Jackel, L. D. (1989). Optimal brain damage. (NeurIPS 1989). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.7223&rep=rep1&type=pdf PDF]</ref>
<ref name="srinivas2015data">Srinivas, S., & Babu, R. V. (2015). Data-free parameter pruning for deep neural networks. [https://arxiv.org/abs/1507.06149 arXiv]</ref>
<ref name="denil2013predicting">Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning. [https://arxiv.org/abs/1306.0543 arXiv]</ref>
<ref name="wen2016learning">Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. [https://arxiv.org/abs/1608.03665 arXiv]</ref>
<ref name="gupta2015limited">Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. (ICML 2015). [http://proceedings.mlr.press/v37/gupta15.html Link]</ref>
}}