The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder layers rather than having each layer generate its own.
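A minimal single-head sketch in Python with NumPy may make this concrete. The names (<code>cross_attention</code>, <code>w_q</code>, <code>w_k</code>, <code>w_v</code>) are illustrative rather than from the article, and residual connections, layer normalization, and multi-head projections are omitted; the point is only that the keys and values are computed once from the last encoder block's output and reused by every decoder layer:

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, k, v, w_q):
    """Single-head encoder-decoder attention (illustrative sketch)."""
    q = decoder_state @ w_q                     # queries come from the decoder
    scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot-product attention
    return softmax(scores) @ v                  # (t_dec, d_model)

rng = np.random.default_rng(0)
d_model, t_enc, t_dec, n_layers = 16, 5, 3, 2
encoder_output = rng.normal(size=(t_enc, d_model))  # output of the last encoder block

# Keys and values are projected once from the final encoder output
# and shared by all encoder-decoder attention layers.
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))
k, v = encoder_output @ w_k, encoder_output @ w_v

x = rng.normal(size=(t_dec, d_model))
for _ in range(n_layers):
    w_q = rng.normal(size=(d_model, d_model))  # each layer has its own query projection
    x = cross_attention(x, k, v, w_q)          # but reuses the same k and v
</syntaxhighlight>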
In the decoder, the self-attention must be masked so that the matrix of attention weights is lower-triangular: each position can attend only to itself and to earlier positions, which preserves the autoregressive property.
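A common way to implement this, sketched below with illustrative names, is to set the scores of future positions to negative infinity before the softmax, so those positions receive exactly zero attention weight:

<syntaxhighlight lang="python">
import numpy as np

def causal_mask(t):
    # True where attention is allowed: position i may attend to positions <= i,
    # i.e. a lower-triangular boolean matrix.
    return np.tril(np.ones((t, t), dtype=bool))

def masked_scores(scores):
    # Replace disallowed (future) positions with -inf so the subsequent
    # softmax assigns them zero weight.
    return np.where(causal_mask(scores.shape[-1]), scores, -np.inf)
</syntaxhighlight>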


==Code==