From David's Wiki

Hyperparameters are settings of your network and training procedure that are not learned from the data, so you need to tune them yourself.

These include the learning rate, the number of nodes, the kernel size, the batch size, and the choice of optimizer.

Below are some things I've found online or learned through trial and error.


Learning Rate

You typically want the largest learning rate that still gives stable training. A learning rate that is too high can cause the loss to increase or plateau instead of decreasing. Some people use schedulers to change the learning rate over time, for example by decaying it as training progresses.
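As a rough illustration of the behavior described above, here is a minimal sketch that runs plain gradient descent on the toy loss f(x) = x^2. The function names `final_loss` and `step_decay`, the toy loss, and the decay settings are all assumptions made for this example, not something from a particular library.

```python
def step_decay(base_lr, step, drop=0.5, every=5):
    """Hypothetical scheduler: multiply the learning rate by `drop` every `every` steps."""
    return base_lr * drop ** (step // every)


def final_loss(lr, steps=20, x0=1.0, schedule=None):
    """Run gradient descent on the toy loss f(x) = x**2 and return the final loss."""
    x = x0
    for step in range(steps):
        # Use a fixed learning rate unless a schedule function is supplied.
        step_lr = lr if schedule is None else schedule(lr, step)
        grad = 2.0 * x  # derivative of x**2
        x -= step_lr * grad
    return x * x


# Small rate: loss shrinks. Rate of 1.0: x flips sign each step, so the loss
# stays steady. Rate of 1.1: every step overshoots further and the loss grows.
for lr in (0.1, 1.0, 1.1):
    print(f"lr={lr}: final loss = {final_loss(lr):.6g}")

# The same too-high starting rate becomes stable once it is decayed over time.
print("lr=1.1 with step decay:", final_loss(1.1, schedule=step_decay))
```

On this toy problem, each update multiplies x by (1 - 2 * lr), which is why rates above 1.0 diverge. Real networks have no such clean threshold, so the stable range is usually found by trial and error.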

Batch Size

A small batch size leads to noisy gradients. This noise can help escape saddle points, but it may also mean your model takes longer to converge. People typically pick batch sizes that are powers of 2, such as 8, 16, or 32. For some models, a batch size of 1 will also work. However, bigger models like BERT-large seem to require larger batch sizes. You can increase the batch size by adding more GPUs, as long as the model fits on each GPU.
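To make the link between batch size and gradient noise concrete, here is a minimal sketch that simulates minibatch gradient estimates. It assumes each per-example gradient is the true gradient (1.0) plus unit Gaussian noise. That model and the function name `minibatch_grad_std` are assumptions made for illustration only.

```python
import random
import statistics


def minibatch_grad_std(batch_size, n_trials=2000, seed=0):
    """Estimate the standard deviation of a minibatch gradient estimate.

    Assumed toy model: every per-example gradient is 1.0 plus N(0, 1) noise,
    and the minibatch gradient is the mean over the batch.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)


# Under this model the noise shrinks roughly like 1 / sqrt(batch_size).
for batch_size in (1, 8, 32):
    print(batch_size, round(minibatch_grad_std(batch_size), 3))
```

Under these assumptions, going from a batch size of 1 to 32 cuts the noise by a factor of about 5.7 (the square root of 32), which gives a feel for how quickly the noise falls off as the batch grows.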

Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends dont let friends use minibatches larger than 32

Yann LeCun Tweet





Kernel Size