The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder layers rather than having each layer generate its own.
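A minimal single-head sketch in Python with NumPy may make this concrete. The names (<code>cross_attention</code>, <code>w_q</code>, <code>w_k</code>, <code>w_v</code>) are illustrative rather than from the article, and residual connections, layer normalization, and multi-head projections are omitted; the point is only that the keys and values are computed once from the last encoder block's output and reused by every decoder layer:

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, k, v, w_q):
    """Single-head encoder-decoder attention (illustrative sketch)."""
    q = decoder_state @ w_q                     # queries come from the decoder
    scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot-product attention
    return softmax(scores) @ v                  # (t_dec, d_model)

rng = np.random.default_rng(0)
d_model, t_enc, t_dec, n_layers = 16, 5, 3, 2
encoder_output = rng.normal(size=(t_enc, d_model))  # output of the last encoder block

# Keys and values are projected once from the final encoder output
# and shared by all encoder-decoder attention layers.
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))
k, v = encoder_output @ w_k, encoder_output @ w_v

x = rng.normal(size=(t_dec, d_model))
for _ in range(n_layers):
    w_q = rng.normal(size=(d_model, d_model))  # each layer has its own query projection
    x = cross_attention(x, k, v, w_q)          # but reuses the same k and v
</syntaxhighlight>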
In the decoder, the self-attention must be masked so that the matrix of attention weights is lower-triangular: each position can attend only to itself and to earlier positions, which preserves the autoregressive property.
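A common way to implement this, sketched below with illustrative names, is to set the scores of future positions to negative infinity before the softmax, so those positions receive exactly zero attention weight:

<syntaxhighlight lang="python">
import numpy as np

def causal_mask(t):
    # True where attention is allowed: position i may attend to positions <= i,
    # i.e. a lower-triangular boolean matrix.
    return np.tril(np.ones((t, t), dtype=bool))

def masked_scores(scores):
    # Replace disallowed (future) positions with -inf so the subsequent
    # softmax assigns them zero weight.
    return np.where(causal_mask(scores.shape[-1]), scores, -np.inf)
</syntaxhighlight>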


==Code==