Neural Network Compression: Difference between revisions

(7 intermediate revisions by the same user not shown)

Line 1:

Brief survey on neural network compression techniques ~~consisting of popular papers I pull~~ from existing surveys.

Brief survey on neural network compression techniques sampled from existing surveys.

==Pruning==

===Sensitivity Methods===

The idea here is to measure how sensitive each weight ~~(i.e.~~ connection) or neuron is.

The idea here is to measure how sensitive the network's output is to each weight/connection or neuron.

I.e., if you remove the neuron, how will it change the output?

I.e. if you remove the neuron, how will it change the output?

Typically, weights are pruned by zeroing them out ~~and~~ freezing them.

Typically, weights are pruned by zeroing them out, freezing them, and fine-tuning the unfrozen weights.

In general, the procedure is

Line 11:

# Compute sensitivity for each parameter.

# Delete low-saliency parameters.

# Continue training ~~and repeat~~ pruning until the number of parameters is low enough or error is too high.

# Continue training to fine-tune remaining parameters.

# Repeat pruning until the number of parameters is low enough or the error is too high.

Sometimes, pruning can also increase accuracy and improve generalization.
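The procedure above can be sketched on a toy linear model. This is only an illustration under stated assumptions: weight magnitude is used as the sensitivity measure (a common stand-in, not one of the methods cited here), and the names <code>train</code>, <code>iterative_prune</code>, <code>min_weights</code>, and <code>max_error</code> are hypothetical.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def train(w, mask, data, steps=500, lr=0.1):
    """Gradient descent on the MSE; pruned (masked-out) weights stay frozen at zero."""
    w = list(w)
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj if keep else 0.0 for wj, gj, keep in zip(w, grad, mask)]
    return w

def iterative_prune(data, n_weights, min_weights=1, max_error=0.01):
    mask = [True] * n_weights
    w = train([0.0] * n_weights, mask, data)
    while sum(mask) > min_weights:
        # 1. Compute sensitivity for each remaining parameter (assumption: |w|).
        alive = [j for j in range(n_weights) if mask[j]]
        victim = min(alive, key=lambda j: abs(w[j]))
        # 2. Delete the lowest-saliency parameter by zeroing and freezing it.
        trial_mask = list(mask)
        trial_mask[victim] = False
        trial_w = [wj if keep else 0.0 for wj, keep in zip(w, trial_mask)]
        # 3. Continue training to fine-tune the remaining parameters.
        trial_w = train(trial_w, trial_mask, data)
        # 4. Stop once the error becomes too high; otherwise accept and repeat.
        if mse(trial_w, data) > max_error:
            break
        mask, w = trial_mask, trial_w
    return w, mask

# Toy data: only weights 0 and 2 actually matter.
random.seed(0)
true_w = [2.0, 0.0, -3.0, 0.0]
data = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in true_w]
    data.append((x, predict(true_w, x)))

w, mask = iterative_prune(data, len(true_w))
```

On this toy problem the loop should prune the two irrelevant weights and stop when removing a useful weight pushes the error past <code>max_error</code>.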

Line 17:

Line 18:

* Mozer and Smolensky (1988)<ref name="mozer1988skeletonization"></ref> use a gate for each neuron. Then the sensitivity can be estimated with the derivative w.r.t. the gate.

* Karnin<ref name="karnin1990simple"></ref> estimates the sensitivity by monitoring the change in weight during training.

* LeCun ''e al.'' present ''Optimal Brain Damage'' <ref name="lecun1989optimal"></ref> which uses the

* LeCun ''et al.'' present ''Optimal Brain Damage''<ref name="lecun1989optimal"></ref>, which estimates sensitivity using the second derivative of the loss with respect to each weight.
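As a rough illustration of a second-derivative saliency, the sketch below uses the Optimal Brain Damage form <code>s_j = 0.5 * h_jj * w_j^2</code>, where <code>h_jj</code> is the diagonal of the Hessian of the loss. It assumes a linear model with mean squared error, for which <code>h_jj = (2/n) * sum(x_j^2)</code> can be written in closed form; the function name <code>obd_saliencies</code> and the toy inputs are hypothetical.

```python
def obd_saliencies(w, inputs):
    """Second-derivative saliency for a linear model y = w . x trained with MSE.

    Assumption: for the loss (1/n) * sum((w . x - y)^2) the Hessian diagonal is
    h_jj = (2/n) * sum(x_j^2), and off-diagonal terms are ignored.
    """
    n = len(inputs)
    saliencies = []
    for j, wj in enumerate(w):
        h_jj = 2.0 * sum(x[j] ** 2 for x in inputs) / n
        saliencies.append(0.5 * h_jj * wj ** 2)
    return saliencies

# Feature 0 has a large scale, feature 1 a small one.
inputs = [[1.0, 0.1], [-1.0, 0.1], [1.0, -0.1], [-1.0, -0.1]]
w = [0.5, 2.0]
saliencies = obd_saliencies(w, inputs)
```

Here the second weight has the larger magnitude but the smaller saliency (roughly 0.04 versus 0.25), so a second-derivative criterion would prune it first, whereas a pure magnitude criterion would not.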

===Redundancy Methods===

Line 23:

Line 24:

===Structured Pruning===

Structured pruning focuses on keeping the dense structure of the network such that the pruned ~~weights~~ can benefit using standard dense matrix multiplication operations.

Structured pruning focuses on keeping the dense structure of the network such that the pruned network can benefit from standard dense matrix multiplication operations.<br>

This is in contrast to unstructured pruning, which zeros out individual values in the weight matrix but may not necessarily run faster.

* Wen ''et al.'' (2016) <ref name="wen2016learning"></ref> propose Structured Sparsity Learning (SSL) on CNNs. Given filters of size (N, C, M, K), i.e. (out-channels, in-channels, height, width), they use a group lasso loss/regularization to penalize usage of extra input and output channels. They also learn filter shapes using this regularization.
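A minimal sketch of a group lasso penalty over a filter bank of shape (N, C, M, K) follows. It treats each output filter and each input channel as one group and sums the L2 norms of the groups, which pushes whole groups toward zero. The function name and the nested-list layout are assumptions made for illustration, and any weighting coefficients for the two terms are left out.

```python
import math

def group_lasso_penalty(weights):
    """weights: nested list indexed as [N][C][M][K] (out, in, height, width).

    Returns (filter_term, channel_term):
      filter_term  = sum over output filters n of ||W[n, :, :, :]||_2
      channel_term = sum over input channels c of ||W[:, c, :, :]||_2
    """
    n_out = len(weights)
    n_in = len(weights[0])

    filter_term = 0.0
    for n in range(n_out):
        sq = sum(v * v for channel in weights[n] for row in channel for v in row)
        filter_term += math.sqrt(sq)

    channel_term = 0.0
    for c in range(n_in):
        sq = sum(v * v for n in range(n_out) for row in weights[n][c] for v in row)
        channel_term += math.sqrt(sq)

    return filter_term, channel_term

# Two output filters, two input channels, 1x1 kernels; the second filter is all zeros.
weights = [[[[3.0]], [[4.0]]],
           [[[0.0]], [[0.0]]]]
filter_term, channel_term = group_lasso_penalty(weights)
```

In training, these terms would be scaled and added to the task loss; a filter or channel whose group norm reaches zero can then be removed while keeping the remaining tensor dense.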

Line 32:

Line 34:

* Google uses [https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus bfloat16] for training on TPUs.

* Gupta ''et al.''<ref name="gupta2015limited"></ref> train using a custom 16-bit representation with ''stochastic rounding''. They observe little to no degradation on MNIST MLP and CIFAR-10 CNN classification accuracy. Stochastic rounding rounds a value up or down to an adjacent representable value, with probability proportional to how close the value is to it.
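Stochastic rounding can be sketched as follows, assuming the representable values lie on a uniform grid with spacing <code>step</code> (a hypothetical parameter name). The value is rounded up with probability equal to its fractional position between the two neighboring grid points, so the result is unbiased in expectation.

```python
import math
import random

def stochastic_round(x, step, rng=random):
    """Round x to a multiple of step, choosing up or down at random.

    The probability of rounding up equals the fractional distance of x
    above the lower grid point, so E[stochastic_round(x)] == x.
    """
    lower = math.floor(x / step)
    frac = x / step - lower  # in [0, 1)
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step

rng = random.Random(0)
samples = [stochastic_round(0.3, 0.25, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

With <code>x = 0.3</code> and <code>step = 0.25</code>, every sample is either 0.25 or 0.5, and the sample mean should land close to 0.3, whereas round-to-nearest would always return 0.25.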

==Factorization==

Line 58:

Line 60:

<ref name="lecun1989optimal">LeCun, Y., Denker, J. S., Solla, S. A., Howard, R. E., & Jackel, L. D. (1989, November). Optimal brain damage. (NeurIPS 1989). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.7223&rep=rep1&type=pdf PDF]</ref>

<ref name="srinivas2015data">Srinivas, S., & Babu, R. V. (2015). Data-free parameter pruning for deep neural networks. [https://arxiv.org/abs/1507.06149 PDF]</ref>

<ref name="denil2013predicting">Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning~~. arXiv preprint arXiv:1306.0543~~. [https://arxiv.org/abs/1306.0543 Arxiv]</ref>

<ref name="denil2013predicting">Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning. [https://arxiv.org/abs/1306.0543 Arxiv]</ref>

<ref name="wen2016learning">Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks~~. arXiv preprint arXiv:1608.03665~~. [https://arxiv.org/abs/1608.03665 Arxiv]</ref>

<ref name="wen2016learning">Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. [https://arxiv.org/abs/1608.03665 Arxiv]</ref>

<ref name="gupta2015limited">Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep Learning with Limited Numerical Precision. (ICML 2015) [http://proceedings.mlr.press/v37/gupta15.html Link]</ref>

}}

@@ Line 1: / Line 1: @@
-Brief survey on neural network compression techniques consisting of popular papers I pull from existing surveys.
+Brief survey on neural network compression techniques sampled from existing surveys.
 ==Pruning==
 ===Sensitivity Methods===
-The idea here is to measure how sensitive each weight (i.e. connection) or neuron is.
+The idea here is to measure how sensitive the network's output is to each weight/connection or neuron.
-I.e., if you remove the neuron, how will it change the output?
+I.e. if you remove the neuron, how will it change the output?
-Typically, weights are pruned by zeroing them out and freezing them.
+Typically, weights are pruned by zeroing them out, freezing them, and fine-tuning the unfrozen weights.
 In general, the procedure is
@@ Line 11: / Line 11: @@
 # Compute sensitivity for each parameter.
 # Delete low-saliency parameters.
-# Continue training and repeat pruning until the number of parameters is low enough or error is too high.
+# Continue training to fine-tune remaining parameters.
+# Repeat pruning until the number of parameters is low enough or the error is too high.
 Sometimes, pruning can also increase accuracy and improve generalization.
@@ Line 17: / Line 18: @@
 * Mozer and Smolensky (1988)<ref name="mozer1988skeletonization"></ref> use a gate for each neuron. Then the sensitivity can be estimated with the derivative w.r.t. the gate.
 * Karnin<ref name="karnin1990simple"></ref> estimates the sensitivity by monitoring the change in weight during training.
-* LeCun ''e al.'' present ''Optimal Brain Damage'' <ref name="lecun1989optimal"></ref> which uses the
+* LeCun ''et al.'' present ''Optimal Brain Damage''<ref name="lecun1989optimal"></ref>, which estimates sensitivity using the second derivative of the loss with respect to each weight.
 ===Redundancy Methods===
@@ Line 23: / Line 24: @@
 ===Structured Pruning===
-Structured pruning focuses on keeping the dense structure of the network such that the pruned weights can benefit using standard dense matrix multiplication operations.
+Structured pruning focuses on keeping the dense structure of the network such that the pruned network can benefit from standard dense matrix multiplication operations.<br>
+This is in contrast to unstructured pruning, which zeros out individual values in the weight matrix but may not necessarily run faster.
 * Wen ''et al.'' (2016) <ref name="wen2016learning"></ref> propose Structured Sparsity Learning (SSL) on CNNs. Given filters of size (N, C, M, K), i.e. (out-channels, in-channels, height, width), they use a group lasso loss/regularization to penalize usage of extra input and output channels. They also learn filter shapes using this regularization.
@@ Line 32: / Line 34: @@
 * Google uses [https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus bfloat16] for training on TPUs.
+* Gupta ''et al.''<ref name="gupta2015limited"></ref> train using a custom 16-bit representation with ''stochastic rounding''. They observe little to no degradation on MNIST MLP and CIFAR-10 CNN classification accuracy. Stochastic rounding rounds a value up or down to an adjacent representable value, with probability proportional to how close the value is to it.
 ==Factorization==
@@ Line 58: / Line 60: @@
 <ref name="lecun1989optimal">LeCun, Y., Denker, J. S., Solla, S. A., Howard, R. E., & Jackel, L. D. (1989, November). Optimal brain damage. (NeurIPS 1989). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.7223&rep=rep1&type=pdf PDF]</ref>
 <ref name="srinivas2015data">Srinivas, S., & Babu, R. V. (2015). Data-free parameter pruning for deep neural networks. [https://arxiv.org/abs/1507.06149 PDF]</ref>
-<ref name="denil2013predicting">Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543. [https://arxiv.org/abs/1306.0543 Arxiv]</ref>
+<ref name="denil2013predicting">Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & De Freitas, N. (2013). Predicting parameters in deep learning. [https://arxiv.org/abs/1306.0543 Arxiv]</ref>
-<ref name="wen2016learning">Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665. [https://arxiv.org/abs/1608.03665 Arxiv]</ref>
+<ref name="wen2016learning">Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. [https://arxiv.org/abs/1608.03665 Arxiv]</ref>
+<ref name="gupta2015limited">Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep Learning with Limited Numerical Precision. (ICML 2015) [http://proceedings.mlr.press/v37/gupta15.html Link]</ref>
 }}