Transformer (machine learning model)
The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder attention layers rather than each layer generating its own.
Note that in the decoder, the self-attention blocks must be masked to be lower-triangular (causal), so that each position can only attend to itself and earlier positions.
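A minimal single-head sketch of that masking (plain scaled dot-product attention in PyTorch, leaving out the learned projections, multi-head splitting, and padding masks of the full model; the function and variable names here are illustrative only):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    """Scaled dot-product attention. q: (..., L_q, d); k, v: (..., L_k, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (..., L_q, L_k)
    if causal:
        # Lower-triangular mask: position i may only attend to positions j <= i.
        mask = torch.ones(q.size(-2), k.size(-2), dtype=torch.bool).tril()
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 5, 64)       # (batch, target length, model dim): decoder input
memory = torch.randn(2, 7, 64)  # (batch, source length, model dim): last encoder block's output

# Decoder self-attention uses the lower-triangular (causal) mask.
self_attn_out = attention(x, x, x, causal=True)

# Encoder-decoder attention: queries from the decoder, keys and values from the
# encoder output; no causal mask is needed since the whole source is already known.
cross_attn_out = attention(x, memory, memory)
</syntaxhighlight>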
==Code==
* [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer]
* [https://www.youtube.com/watch?v=iDulhoQ2pro YouTube video by Yannic Kilcher]
* [https://arxiv.org/abs/2207.09238 Formal Algorithms for Transformers (arXiv 2022)]
==Follow-up work==
* [https://arxiv.org/abs/2112.05682 Memory-efficient attention] reduces the memory overhead of an attention layer to a constant amount (specifically, a scalar and a vector the size of one output feature).
** It processes queries sequentially, which makes it a good fit for weaker GPUs where memory is limited and computation is less parallel due to fewer cores; see the sketch after this list.
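A toy single-head sketch of that idea in PyTorch (the paper's actual algorithm works on chunks of queries and keys for efficiency; this loop version only illustrates the running-softmax bookkeeping, and every name in it is made up for the example):

<syntaxhighlight lang="python">
import torch

def memory_efficient_attention(q, k, v):
    """Attention without materializing the full L_q x L_k score matrix.

    Each query keeps only a running max (scalar), a running softmax
    normalizer (scalar), and a running weighted sum of values (one
    output-sized vector), streaming over the keys/values.
    """
    d = q.size(-1)
    outputs = []
    for qi in q:                                   # queries processed sequentially
        m = torch.tensor(float("-inf"))            # running max of the scores
        z = torch.tensor(0.0)                      # running softmax normalizer
        acc = torch.zeros(v.size(-1))              # running weighted sum of values
        for kj, vj in zip(k, v):                   # stream over keys/values
            s = (qi @ kj) / d ** 0.5               # one attention score
            m_new = torch.maximum(m, s)
            scale = torch.exp(m - m_new)           # rescale the old accumulators
            w = torch.exp(s - m_new)
            acc = acc * scale + w * vj
            z = z * scale + w
            m = m_new
        outputs.append(acc / z)
    return torch.stack(outputs)

# Matches ordinary attention on small random inputs (illustrative shapes).
q, k, v = torch.randn(5, 16), torch.randn(7, 16), torch.randn(7, 16)
reference = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1) @ v
assert torch.allclose(memory_efficient_attention(q, k, v), reference, atol=1e-5)
</syntaxhighlight>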
==References==