===Long Short Term Memory (LSTMs)===
The goal is to mitigate the vanishing and exploding gradient problems of plain RNNs.
An LSTM cell has several gates (a minimal code sketch follows the list):
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Candidate cell state: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Output gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
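
A minimal NumPy sketch of a single LSTM time step implementing the gate equations above; parameter names such as <code>W_xi</code> are illustrative and mirror the notation in the list:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    h_t = o_t * np.tanh(c_t)                                            # new hidden state
    return h_t, c_t

# Toy usage with random weights (dimensions are illustrative).
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
p = {w: rng.normal(size=(d_hid, d_in if w.startswith("W_x") else d_hid))
     for w in ("W_xi", "W_hi", "W_xf", "W_hf", "W_xo", "W_ho", "W_xc", "W_hc")}
p.update({b: np.zeros(d_hid) for b in ("b_i", "b_f", "b_o", "b_c")})
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), p)
</syntaxhighlight>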
 
 
;Bidirectional RNNs
The first LSTM processes the input sequence in forward order. 
The second LSTM processes the input in reverse order. 
The outputs of both LSTMs are concatenated at each time step (see the sketch below).
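
A minimal sketch of the bidirectional pattern, reusing the <code>lstm_step</code> helper from the LSTM sketch above (all names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def bidirectional_lstm(xs, p_fwd, p_bwd, d_hid):
    """Run one LSTM forward over the sequence and a second LSTM backward,
    then concatenate the two hidden states at each time step."""
    h_f = c_f = np.zeros(d_hid)
    h_b = c_b = np.zeros(d_hid)
    fwd, bwd = [], []
    for x in xs:                                   # forward direction
        h_f, c_f = lstm_step(x, h_f, c_f, p_fwd)
        fwd.append(h_f)
    for x in reversed(xs):                         # reverse direction
        h_b, c_b = lstm_step(x, h_b, c_b, p_bwd)
        bwd.append(h_b)
    bwd.reverse()                                  # re-align with forward time steps
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
</syntaxhighlight>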
 
===Attention===
Goal: to help the decoder handle long source sentences in machine translation by attending over all encoder hidden states instead of a single fixed-length context vector.
 
;Encoder-decoder attention
 
;Self-Attention
 
===Transformer===
;Positional encoding
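Self-attention alone is order-invariant, so position information is added to the token embeddings. A minimal NumPy sketch of the sinusoidal encoding used in the original Transformer (function name is illustrative, <code>d_model</code> assumed even):

<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even indices get sine
    pe[:, 1::2] = np.cos(angles)                   # odd indices get cosine
    return pe                                      # added to the token embeddings

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
</syntaxhighlight>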
 
;Self-Attention
Each token is projected to queries, keys, and values. 
Multiply the queries with the keys, scale, and pass through a softmax; then multiply by the values. 
This yields the attention of every word with respect to every other word (see the sketch below). 
The original Transformer used 8 heads, each with its own set of queries, keys, and values.
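
A minimal NumPy sketch of a single attention head, including the <math>1/\sqrt{d_k}</math> scaling used in the original Transformer (all weight names and sizes are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V.
    Row i of the weight matrix is the attention of token i over all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (seq_len, seq_len) attention weights
    return weights @ V                             # weighted sum of values

# Toy single-head example.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))            # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)             # (seq_len, d_k)
</syntaxhighlight>

With multiple heads, each head applies its own projections <code>W_q</code>, <code>W_k</code>, <code>W_v</code>; the head outputs are concatenated and linearly projected back to the model dimension.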
 
;Architecture
The encoder is a stack of identical encoder blocks (6 in the original Transformer).


==Misc==