===Long Short Term Memory (LSTMs)===
The goal is to mitigate the vanishing and exploding gradient problems of plain RNNs.
An LSTM cell has several gates (a minimal code sketch follows the list):
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Candidate cell state: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Output gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
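
A minimal NumPy sketch of a single LSTM time step implementing the gate equations above; parameter names such as <code>W_xi</code> are illustrative and mirror the notation in the list:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    h_t = o_t * np.tanh(c_t)                                            # new hidden state
    return h_t, c_t

# Toy usage with random weights (dimensions are illustrative).
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
p = {w: rng.normal(size=(d_hid, d_in if w.startswith("W_x") else d_hid))
     for w in ("W_xi", "W_hi", "W_xf", "W_hf", "W_xo", "W_ho", "W_xc", "W_hc")}
p.update({b: np.zeros(d_hid) for b in ("b_i", "b_f", "b_o", "b_c")})
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), p)
</syntaxhighlight>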
 
 
;Bidirectional RNNs
The first LSTM processes the input sequence in forward order. 
The second LSTM processes the input in reverse order. 
The outputs of both LSTMs are concatenated at each time step (see the sketch below).
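
A minimal sketch of the bidirectional pattern, reusing the <code>lstm_step</code> helper from the LSTM sketch above (all names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def bidirectional_lstm(xs, p_fwd, p_bwd, d_hid):
    """Run one LSTM forward over the sequence and a second LSTM backward,
    then concatenate the two hidden states at each time step."""
    h_f = c_f = np.zeros(d_hid)
    h_b = c_b = np.zeros(d_hid)
    fwd, bwd = [], []
    for x in xs:                                   # forward direction
        h_f, c_f = lstm_step(x, h_f, c_f, p_fwd)
        fwd.append(h_f)
    for x in reversed(xs):                         # reverse direction
        h_b, c_b = lstm_step(x, h_b, c_b, p_bwd)
        bwd.append(h_b)
    bwd.reverse()                                  # re-align with forward time steps
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
</syntaxhighlight>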
 
===Attention===
Goal: to help the decoder handle long source sentences in machine translation by attending over all encoder hidden states instead of a single fixed-length context vector.
 
;Encoder-decoder attention
 
;Self-Attention
 
===Transformer===
;Positional encoding
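Self-attention alone is order-invariant, so position information is added to the token embeddings. A minimal NumPy sketch of the sinusoidal encoding used in the original Transformer (function name is illustrative, <code>d_model</code> assumed even):

<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even indices get sine
    pe[:, 1::2] = np.cos(angles)                   # odd indices get cosine
    return pe                                      # added to the token embeddings

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
</syntaxhighlight>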
 
;Self-Attention
Each token is projected to queries, keys, and values. 
Multiply the queries with the keys, scale, and pass through a softmax; then multiply by the values. 
This yields the attention of every word with respect to every other word (see the sketch below). 
The original Transformer used 8 heads, each with its own set of queries, keys, and values.
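
A minimal NumPy sketch of a single attention head, including the <math>1/\sqrt{d_k}</math> scaling used in the original Transformer (all weight names and sizes are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V.
    Row i of the weight matrix is the attention of token i over all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (seq_len, seq_len) attention weights
    return weights @ V                             # weighted sum of values

# Toy single-head example.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))            # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)             # (seq_len, d_k)
</syntaxhighlight>

With multiple heads, each head applies its own projections <code>W_q</code>, <code>W_k</code>, <code>W_v</code>; the head outputs are concatenated and linearly projected back to the model dimension.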
 
;Architecture
The encoder is a stack of identical encoder blocks (6 in the original Transformer).


==Misc==