====[https://arxiv.org/pdf/1502.03167.pdf Batch Normalization]====
; Definitions:
* Internal Covariate Shift - "the change in distribution of network activations as network parameters change"
* Whitening - "linearly transformed to have zero means and unit variances, and decorrelated"
The idea is to do an element-wise normalization per minibatch to improve training speed.
<pre>
B <- get minibatch
u_B <- calculate minibatch mean
sigma2_B <- calculate minibatch variance
x_hat <- (x - u_B) / sqrt(sigma2_B + eps)  // normalize
y <- gamma * x_hat + beta                  // scale and shift
</pre>
* <math>\epsilon</math> is added for numerical stability (e.g. when the minibatch variance is small)
* In the algorithm, <math>\gamma</math> and <math>\beta</math> are learned parameters.
* The final scale and shift allow the network to recover the identity transformation if that is optimal: <math>\gamma = \sqrt{\sigma_B^2 + \epsilon}</math> and <math>\beta = \mu_B</math> undo the normalization (see the sketch below).
** The BN paper gives the sigmoid activation as an example, which is approximately linear around 0.
:: Without the scale and shift, normalizing the inputs to a sigmoid would confine them to this linear regime, almost eliminating the non-linearity.
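A minimal NumPy sketch of the training-time forward pass (the function name <code>batch_norm_forward</code> and the (N, D) input shape are assumptions for illustration; at inference the paper replaces the minibatch statistics with running averages):
<pre>
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: minibatch of shape (N, D); gamma, beta: learned per-feature
    # parameters of shape (D,)
    mu = x.mean(axis=0)                    # minibatch mean
    var = x.var(axis=0)                    # minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize; eps guards tiny variance
    return gamma * x_hat + beta            # scale and shift

# With gamma = 1, beta = 0 the output has ~zero mean and ~unit variance;
# gamma = sqrt(var + eps), beta = mu would instead recover the identity.
x = 3.0 * np.random.randn(32, 4) + 1.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))
</pre>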
====Leaky Relu====