* <math>V</math> represents values
The attention block can be represented as the following equation:
* <math>Z = \operatorname{SoftMax}(\frac{QK^T}{\sqrt{d_k}})V</math>
 
Each word embedding gets its own key, query, and value vector; these are stacked alongside those of the other words to form the <math>K</math>, <math>Q</math>, and <math>V</math> matrices.
These matrices are computed by multiplying the embedding matrix <math>X</math> by the weight matrices <math>W_{K}, W_{Q}, W_{V}</math>.
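A minimal NumPy sketch of this computation; the function names and matrix shapes are illustrative assumptions, not taken from any particular implementation:
<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q and K have shape (sequence length, d_k); V has shape (sequence length, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # Z: weighted sum of the value vectors

def project_qkv(X, W_Q, W_K, W_V):
    # The Q, K, V matrices come from the embedding matrix X via learned projections.
    return X @ W_Q, X @ W_K, X @ W_V
</syntaxhighlight>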
 
In multi-headed attention, there are several sets of weight matrices <math>W_{K}, W_{Q}, W_{V}</math>, and each attention head produces its own output <math>Z_i</math>.
These outputs are concatenated and multiplied by a further weight matrix to form the output <math>Z</math>, which is the input to the next encoder block.
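The multi-head combination can be sketched in the same style; the head count, dimensions, and names below are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) tuples, one per attention head.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        Z_i = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # per-head output Z_i
        outputs.append(Z_i)
    # Concatenate the per-head outputs and project with W_O to form Z.
    return np.concatenate(outputs, axis=-1) @ W_O

# Illustrative shapes: 4 tokens, embedding width 8, two heads of width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
Z = multi_head_attention(X, heads, W_O)   # shape (4, 8); the input to the next encoder block
</syntaxhighlight>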


===Encoder===