===Recurrent Neural Networks (RNNs)===
Hidden state: <math>h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)</math>

Prediction at t: <math>y_t = W_{hy} h_t</math>
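
A minimal NumPy sketch of one forward step. The sizes and the small random initialization are illustrative; the names <code>W_hh</code>, <code>W_xh</code>, <code>W_hy</code> mirror the equations above.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 4, 8, 3                  # illustrative sizes
W_hh = 0.1 * rng.normal(size=(d_h, d_h))
W_xh = 0.1 * rng.normal(size=(d_h, d_x))
W_hy = 0.1 * rng.normal(size=(d_y, d_h))

def rnn_step(h_prev, x):
    """One vanilla-RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h                         # prediction at time t
    return h, y

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):     # unroll over a length-5 sequence
    h, y = rnn_step(h, x)
</syntaxhighlight>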
;Backpropagation through time
If the largest singular value of <math>W_{hh}</math> is less than 1, the gradient through many time steps vanishes; if it is greater than 1, the gradient can explode. In practice the gradient typically vanishes, because <math>W_{hh}</math> is usually initialized with small weights and the <math>\tanh</math> derivative is at most 1.
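
A rough numerical illustration of this claim. Backpropagating through one step multiplies the gradient by <math>W_{hh}^\top \operatorname{diag}(1 - h_t^2)</math>, so over <math>T</math> steps its norm tends to shrink or grow with the singular values of <math>W_{hh}</math>. For simplicity this sketch uses a scaled orthogonal <math>W_{hh}</math> (every singular value equals <code>scale</code>); the hidden states are generic stand-ins, not a real forward pass.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 8, 50

def final_grad_norm(scale):
    """Norm of dL/dh_0 after backpropagating through T tanh-RNN steps."""
    # Scaled orthogonal matrix: all singular values of W_hh equal `scale`.
    W_hh, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))
    W_hh *= scale
    g = np.ones(d_h)                      # stand-in for dL/dh_T
    for _ in range(T):
        h = np.tanh(rng.normal(size=d_h)) # generic hidden state h_t
        # dL/dh_{t-1} = W_hh^T (dL/dh_t * tanh'(a_t)), tanh' = 1 - h_t^2
        g = W_hh.T @ (g * (1.0 - h**2))
    return np.linalg.norm(g)

print(final_grad_norm(0.9))  # tiny: gradient vanishes (tanh' <= 1 helps it along)
print(final_grad_norm(2.0))  # huge: gradient explodes
</syntaxhighlight>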
===Long Short-Term Memory (LSTM)===
An LSTM maintains a cell state alongside the hidden state, updated through several gates (a minimal sketch follows the list):
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Output gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Candidate cell state: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
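
A minimal NumPy sketch of one LSTM step implementing the equations above. Weight shapes and initialization are illustrative; the candidate weights follow the same naming pattern as the gates.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 8                          # illustrative sizes
# One (W_x, W_h, b) triple per gate/candidate: input, forget, output, cell.
params = {k: (0.1 * rng.normal(size=(d_h, d_x)),
              0.1 * rng.normal(size=(d_h, d_h)),
              np.zeros(d_h)) for k in "ifoc"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x):
    """One LSTM step following the gate equations above."""
    def lin(k):
        W_x, W_h, b = params[k]
        return W_x @ x + W_h @ h_prev + b
    i = sigmoid(lin("i"))                # input gate
    f = sigmoid(lin("f"))                # forget gate
    o = sigmoid(lin("o"))                # output gate
    c_tilde = np.tanh(lin("c"))          # candidate cell state
    c = f * c_prev + i * c_tilde         # cell state
    h = o * np.tanh(c)                   # hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):      # unroll over a length-5 sequence
    h, c = lstm_step(h, c, x)
</syntaxhighlight>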
==Misc==