Transformer (machine learning model)

 
Each decoder consists of a self-attention, an encoder-decoder attention, and a feed-forward layer.
As with the encoder, each layer is followed by an add-and-normalize residual connection.
The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder attention layers rather than having each layer generate its own.
 
Note that in the decoder, the self-attention must be masked to be lower-triangular (causal), so that each position can only attend to itself and earlier positions.
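
A minimal single-head NumPy sketch of one decoder block, for illustration only: it assumes no dropout, no multi-head splitting, and layer normalization without the learned scale and shift; the parameter names in p and the enc_out argument (the last encoder block's output) are hypothetical, not taken from any particular implementation.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention(Q, K, V, mask=None):
    # scaled dot-product attention
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked-out positions
    return softmax(scores) @ V

def decoder_block(x, enc_out, p):
    # 1) self-attention with a lower-triangular (causal) mask:
    #    position i may only attend to positions <= i
    n = x.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    a = attention(x @ p["Wq1"], x @ p["Wk1"], x @ p["Wv1"], mask=causal)
    x = layer_norm(x + a)                                  # add & normalize

    # 2) encoder-decoder attention: queries come from the decoder,
    #    keys and values from the last encoder block's output
    a = attention(x @ p["Wq2"], enc_out @ p["Wk2"], enc_out @ p["Wv2"])
    x = layer_norm(x + a)                                  # add & normalize

    # 3) position-wise feed-forward layer
    h = np.maximum(0, x @ p["W1"] + p["b1"])               # ReLU
    x = layer_norm(x + h @ p["W2"] + p["b2"])              # add & normalize
    return x
</syntaxhighlight>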


==Code==
* [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer]
* [https://www.youtube.com/watch?v=iDulhoQ2pro Youtube Video by Yannic Kilcher]
* [https://arxiv.org/abs/2207.09238 Formal Algorithms for Transformers (Arxiv 2022)]
==Follow-up work==
* [https://arxiv.org/abs/2112.05682 Memory-efficient attention] reduces the memory overhead of an attention layer to a constant amount (specifically, a scalar and a vector the size of one output feature).
** It processes queries sequentially, which makes it well suited to weaker GPUs, where memory is limited and computation is less parallel because there are fewer cores (a sketch of the idea is shown below).
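
The following is a rough NumPy sketch of the idea rather than the paper's exact algorithm (which also processes keys and values in chunks for speed): each query is handled one at a time, and the keys and values are folded in with a running maximum, a running normalizer, and a single output-sized accumulator, so the extra memory per query stays constant. The result should match ordinary softmax attention computed with the full score matrix.

<syntaxhighlight lang="python">
import numpy as np

def memory_efficient_attention(Q, K, V):
    # Per query, only a running max (scalar), a running normalizer (scalar)
    # and one accumulator the size of an output feature are kept in memory.
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    for i in range(n):                  # queries processed sequentially
        m = -np.inf                     # running max of the scores (numerical stability)
        s = 0.0                         # running softmax normalizer
        acc = np.zeros(V.shape[-1])     # running weighted sum of values
        for j in range(K.shape[0]):     # fold in one key/value pair at a time
            score = Q[i] @ K[j] / np.sqrt(d)
            m_new = max(m, score)
            # rescale the previous partial sums to the new max before adding the new term
            scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
            w = np.exp(score - m_new)
            s = s * scale + w
            acc = acc * scale + w * V[j]
            m = m_new
        out[i] = acc / s
    return out
</syntaxhighlight>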


==References==