In multi-headed attention, there are multiple weight matrices <math>W_{K}, W_{Q}, W_{V}</math>, and each attention head produces its own output <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
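
The following is a minimal NumPy sketch of this computation (illustrative, not part of the article; the per-head matrices follow the <math>W_Q, W_K, W_V</math> notation above, while the name <code>W_O</code> for the final projection matrix and the helper-function names are assumptions):

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O):
    # One (W_Q, W_K, W_V) triple per head; each head yields its own Z_i.
    Z_heads = [attention(X_q @ wq, X_kv @ wk, X_kv @ wv)
               for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the Z_i and multiply by the output weight matrix to form Z.
    return np.concatenate(Z_heads, axis=-1) @ W_O
</syntaxhighlight>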
;Self-attention
The encoder and parts of the decoder use self-attention, in which the keys, values, and queries are all generated from the same embedding.
;Encoder-decoder attention
The decoder also uses encoder-decoder attention, where the keys and values come from the encoder (i.e. the input sentence) but the queries come from the decoder input (i.e. the previously generated output); the sketch below contrasts the two forms.
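
Using the <code>multi_head_attention</code> sketch above, the two forms differ only in where the queries, keys, and values are drawn from (illustrative; the <math>d_{model} = 512</math>, 8-head, <math>d_k = 64</math> sizes are those of the original Transformer paper, and the random arrays stand in for real embeddings):

<syntaxhighlight lang="python">
rng = np.random.default_rng(0)
n_heads, d_model, d_head = 8, 512, 64   # sizes used in the original paper
W_Q = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_K = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_V = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_head, d_model))

enc = rng.standard_normal((10, d_model))  # encoder states (input sentence)
dec = rng.standard_normal((7, d_model))   # decoder states (output so far)

# Self-attention: queries, keys, and values all come from the same sequence.
Z_self = multi_head_attention(enc, enc, W_Q, W_K, W_V, W_O)

# Encoder-decoder attention: queries come from the decoder,
# while keys and values come from the encoder.
Z_cross = multi_head_attention(dec, enc, W_Q, W_K, W_V, W_O)
</syntaxhighlight>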


===Encoder===