Hyperparameters
Hyperparameters are parameters of your network and training procedure that you have to tune yourself.
These include the learning rate, the number of nodes per layer, kernel size, batch size, and choice of optimizer.
Below are some things I've found online or learned through trial and error.
Optimization
Learning Rate
You typically want the largest learning rate that still gives stable training. A learning rate that is too high can cause the loss to diverge or plateau. Some people use schedulers to change the learning rate over time.
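For example, a minimal PyTorch sketch of a step decay schedule (the model here is just a placeholder, and the decay factor and interval are arbitrary):

import torch
import torch.nn as nn

# Placeholder model; substitute your own network.
model = nn.Linear(10, 1)

# Start from the largest learning rate that still trains stably.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... run the usual training loop for one epoch here ...
    scheduler.step()  # advance the schedule once per epoch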
Batch Size
A small batch size leads to noisy gradients. This noise can help escape saddle points but may mean your model takes longer to converge. Typically people pick batch sizes that are powers of 2, such as 8, 16, or 32. For some models, a batch size of 1 will also work. However, bigger models like BERT-large seem to require larger batch sizes. You can also scale up the batch size by training data-parallel across more GPUs, as long as the model fits on a single GPU.
As Yann LeCun put it: "Training with large minibatches is bad for your health. More importantly, it's bad for your test error. Friends dont let friends use minibatches larger than 32." (https://arxiv.org/abs/1804.07612)
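If the batch that fits in memory is smaller than what the model seems to need, gradient accumulation is one common way to simulate a larger batch. A minimal PyTorch sketch with toy stand-ins for the model and data (the names and sizes are made up for illustration):

import torch
import torch.nn as nn

# Toy stand-ins; substitute your real model and data loader.
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(32)]

accum_steps = 4  # four micro-batches of 8 act like one batch of 32

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps  # average over the accumulated micro-batches
    loss.backward()                            # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()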
MLPs
Layers
Parameters
CNNs
Kernel Size
Historically, people used larger kernels such as 11x11 (e.g., the first layer of AlexNet).
Since VGGNet, people tend to use smaller 3x3 kernels stacked over more layers: two stacked 3x3 convolutions cover the same receptive field as a single 5x5 convolution, with fewer parameters and an extra nonlinearity.
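A quick way to see the parameter savings (the channel count is arbitrary): two stacked 3x3 convolutions match the 5x5 receptive field of a single 5x5 convolution with fewer weights.

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

channels = 64  # arbitrary channel count for illustration

single_5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
)

print(n_params(single_5x5))   # 64 * 64 * 5 * 5 = 102400
print(n_params(stacked_3x3))  # 2 * 64 * 64 * 3 * 3 = 73728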
Stride
Strided convolution reduces the spatial resolution and increases the receptive field.
Typically, you can choose between a strided convolution and pooling to downsample feature maps.
An alternative is dilated (atrous) convolution, which enlarges the receptive field without reducing resolution, as used in Atrous Spatial Pyramid Pooling (ASPP).
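A short PyTorch sketch of the difference (input and channel sizes are arbitrary): the strided convolution halves the spatial resolution, while the dilated convolution keeps it and enlarges the receptive field instead.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # (batch, channels, height, width)

# Strided convolution: downsamples by 2x.
strided = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)

# Dilated (atrous) convolution: same resolution, 5x5 effective footprint.
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)

print(strided(x).shape)  # torch.Size([1, 16, 32, 32])
print(dilated(x).shape)  # torch.Size([1, 16, 64, 64])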
Activation functions
Historically, people used sigmoid or tanh activations because they are smooth and constrain the range of the output.
However, since AlexNet, people have found that ReLU works better and is cheaper to compute in both the forward and backward passes.
Typically, you should not include an activation or normalization at the output layer. Treat the outputs directly as logits.
If necessary, you can add a sigmoid to constrain the range of the output.
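In PyTorch, for example, the standard classification losses already apply the squashing internally, so the network can output raw logits (the tensors below are random placeholders):

import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw outputs of the last layer, no softmax
targets = torch.tensor([0, 2, 1, 1])

# CrossEntropyLoss applies log-softmax internally, so pass logits directly.
print(nn.CrossEntropyLoss()(logits, targets))

# For binary or multi-label targets, BCEWithLogitsLoss applies the sigmoid internally.
binary_logits = torch.randn(4, 1)
binary_targets = torch.randint(0, 2, (4, 1)).float()
print(nn.BCEWithLogitsLoss()(binary_logits, binary_targets))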