
Hyperparameters are settings of your network and training procedure which are not learned from the data, so you need to tune them yourself.

These include the learning rate, the number of nodes, the kernel size, the batch size, and the choice of optimizer.

Below are some things I've found online or learned through trial and error.


Optimization

Learning Rate

You typically want the largest learning rate which leads to stable training. A learning rate which is too high can cause the loss to increase or plateau instead of decreasing. Some people use schedulers to change the learning rate over time.
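As a rough illustration of the point above, here is a minimal sketch that runs gradient descent on the toy loss f(x) = x**2. The function names (`final_loss`, `constant`, `step_decay`) and the specific learning rates are hypothetical choices for this example, not taken from any library. On this toy problem, a learning rate of 1.0 leaves the loss stuck, 1.1 makes it grow, and a simple step-decay schedule recovers from the same high starting rate.

```python
def final_loss(lr_schedule, steps=20, x0=1.0):
    """Run gradient descent on f(x) = x**2 and return the final loss.

    lr_schedule maps the step index to the learning rate for that step.
    """
    x = x0
    for step in range(steps):
        grad = 2.0 * x                  # derivative of x**2
        x -= lr_schedule(step) * grad
    return x * x


def constant(lr):
    """Schedule that always returns the same learning rate."""
    return lambda step: lr


def step_decay(base_lr, step_size=5, gamma=0.5):
    """Hypothetical schedule: multiply the rate by gamma every step_size steps."""
    return lambda step: base_lr * gamma ** (step // step_size)


print("lr=0.1        ", final_loss(constant(0.1)))     # stable but slow
print("lr=0.45       ", final_loss(constant(0.45)))    # stable and fast
print("lr=1.0        ", final_loss(constant(1.0)))     # loss stays steady
print("lr=1.1        ", final_loss(constant(1.1)))     # loss increases
print("lr=1.1, decay ", final_loss(step_decay(1.1)))   # schedule recovers
```

Real networks are not quadratics, so the exact thresholds here do not transfer; the sketch only shows the qualitative behavior.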

Batch Size

A small batch size leads to noisy gradients. This noise is good for escaping saddle points but may mean your model takes longer to converge. Typically people pick batch sizes which are powers of 2 such as 8, 16, 32. For some models, a batch size of 1 will also work. However, bigger models like BERT-large seem to require larger batch sizes. You can increase the total batch size by adding more GPUs, as long as the model fits on each GPU.
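To make the link between batch size and gradient noise concrete, here is a small simulation sketch. It assumes, purely for illustration, that per-example gradients are independent draws from a normal distribution with mean 1 and standard deviation 1; `minibatch_gradient_std` is a hypothetical helper name. Under that assumption the spread of the minibatch average shrinks roughly as 1 / sqrt(batch size).

```python
import random
import statistics


def minibatch_gradient_std(batch_size, trials=2000, seed=0):
    """Estimate the standard deviation of a minibatch-averaged gradient.

    Toy assumption: each per-example gradient is drawn from N(1, 1).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)


for batch_size in (1, 8, 32, 128):
    noise = minibatch_gradient_std(batch_size)
    expected = 1.0 / batch_size ** 0.5   # theoretical value under the toy assumption
    print(batch_size, round(noise, 3), round(expected, 3))
```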

Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends dont let friends use minibatches larger than 32
arxiv.org/abs/1804.07612

Yann LeCun Tweet


MLPs

Layers

Parameters

CNNs

Kernel Size

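Historically, people used larger kernels, such as the 11x11 filters in the first layer of AlexNet.
Since VGGNet, people have mostly used smaller kernels like 3x3 or 4x4 along with more layers. A stack of small kernels covers the same receptive field as one large kernel with fewer parameters and more nonlinearities in between.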
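The trade-off between one large kernel and several stacked small kernels can be checked with simple arithmetic. The sketch below compares a single 7x7 convolution with three stacked 3x3 convolutions; the channel width of 64 and the helper names are assumptions chosen for illustration, and the receptive-field formula assumes stride 1 with no dilation.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in one 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


channels = 64  # hypothetical channel width, kept the same for every layer

one_7x7 = conv_params(7, channels, channels)
three_3x3 = 3 * conv_params(3, channels, channels)

print("one 7x7:   field", receptive_field([7]), "params", one_7x7)
print("three 3x3: field", receptive_field([3, 3, 3]), "params", three_3x3)
```

Both options see a 7x7 input window, but the stacked 3x3 layers use roughly half as many parameters at this channel width.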

Stride

Activation functions

Historically, people used tanh or sigmoid because they are smooth and constrain the range of the output.
However, since AlexNet, people have found that ReLU works better and is faster to compute for both forward and backward passes.
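One commonly cited reason for this is that sigmoid saturates: its gradient shrinks toward zero for inputs far from zero, while the ReLU gradient stays at 1 for any positive input and costs only a comparison to compute. The following is a minimal sketch comparing the two; the function names are illustrative and the choice of 0 for the ReLU gradient at exactly x = 0 is a convention, not a requirement.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of sigmoid; never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU, using 0 at x == 0 by convention."""
    return 1.0 if x > 0 else 0.0


for x in (-5.0, 0.5, 5.0):
    print(x, "sigmoid grad:", sigmoid_grad(x), "relu grad:", relu_grad(x))
```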

Typically, you should not include an activation or normalization at the output layer. Treat the outputs directly as logits.
If necessary, you can add a sigmoid to constrain the range of the output.
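As a sketch of what treating outputs as logits can look like, the example below computes a cross-entropy loss directly from raw output values using the log-sum-exp trick, and separately shows a sigmoid rescaled to a target range. The helper names, the example logits, and the range of 0 to 10 are all hypothetical choices for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy computed directly from logits via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target_index]


def scaled_sigmoid(x, low, high):
    """Squash an unbounded output into the range between low and high."""
    if x >= 0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return low + (high - low) * s


logits = [2.0, -1.0, 0.5]   # raw output-layer values, no activation applied
print(softmax(logits))
print(cross_entropy_from_logits(logits, target_index=0))
print(scaled_sigmoid(3.0, low=0.0, high=10.0))
```

Computing the loss from logits avoids taking the log of a probability that has already been rounded to 0 or 1, which is one reason to leave the activation out of the output layer itself.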