Transformer (machine learning model): Difference between revisions
Line 48:
The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder layers rather than having each layer generate its own.
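A minimal NumPy sketch of this sharing (the names and dimensions here are illustrative, not taken from the article's own code): the keys and values are projected once from the final encoder output, and every decoder layer's cross-attention reuses them, while each layer supplies its own queries.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, src_len, tgt_len, n_layers = 8, 5, 4, 2

# Output of the last encoder block for a source sequence of length src_len.
encoder_output = rng.standard_normal((src_len, d_model))

# Keys and values are computed once from the final encoder output ...
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
K = encoder_output @ W_k  # (src_len, d_model)
V = encoder_output @ W_v  # (src_len, d_model)

# ... and the same K, V are passed to the cross-attention in every decoder
# layer; only the query projection is per-layer. (Only the cross-attention
# step is sketched here, not the full decoder layer.)
x = rng.standard_normal((tgt_len, d_model))  # decoder hidden states
for _ in range(n_layers):
    W_q = rng.standard_normal((d_model, d_model))  # per-layer queries
    Q = x @ W_q                          # (tgt_len, d_model)
    scores = Q @ K.T / np.sqrt(d_model)  # (tgt_len, src_len)
    x = softmax(scores) @ V              # output feeds the next layer
</syntaxhighlight>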
In the decoder, the self-attention weights must be masked to be lower-triangular, so that each position can attend only to positions that precede it in the output sequence.
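A minimal NumPy sketch of this causal mask (again with illustrative names and dimensions): entries above the diagonal of the score matrix are set to negative infinity before the softmax, so they receive zero attention weight.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tgt_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((tgt_len, tgt_len))  # raw Q @ K.T / sqrt(d) scores

# Lower-triangular (causal) mask: position i may only attend to positions <= i.
mask = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)  # masked entries get zero weight

weights = softmax(scores)            # each row sums to 1 over allowed positions
print(np.triu(weights, k=1).max())   # 0.0: no attention to future positions
</syntaxhighlight>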
==Code==