Transformer (machine learning model)
;Encoder-decoder attention
The decoder also uses encoder-decoder attention, where the keys and values are from the output embedding of the encoder (i.e. the input sentence) but the queries are generated from the decoder input (i.e. the previously-generated output).
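For illustration, the sketch below shows a single-head version of this attention in PyTorch; the projection names and shapes are assumptions for the sketch, and multi-head splitting, dropout, and masking are omitted. The queries come from the decoder states, while the keys and values come from the encoder output.

<syntaxhighlight lang="python">
# Illustrative single-head encoder-decoder attention (names and shapes are assumptions).
import torch
import torch.nn.functional as F

d_model = 512
W_q = torch.nn.Linear(d_model, d_model)  # applied to the decoder states (queries)
W_k = torch.nn.Linear(d_model, d_model)  # applied to the encoder output (keys)
W_v = torch.nn.Linear(d_model, d_model)  # applied to the encoder output (values)

def encoder_decoder_attention(decoder_states, encoder_output):
    # decoder_states: (batch, tgt_len, d_model); encoder_output: (batch, src_len, d_model)
    q = W_q(decoder_states)
    k = W_k(encoder_output)
    v = W_v(encoder_output)
    scores = q @ k.transpose(-2, -1) / (d_model ** 0.5)  # (batch, tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)                   # attention over source positions
    return weights @ v                                    # (batch, tgt_len, d_model)
</syntaxhighlight>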
===Encoder===
Each decoder consists of a self-attention, an encoder-decoder attention, and a feed-forward layer.
As with the encoder, each layer is followed by an add-and-normalize residual connection.
The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder attention layers rather than having each layer generate its own.
Note that in the decoder, the self-attention must be masked so that the attention matrix is lower-triangular: each position can only attend to itself and to earlier positions.
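A minimal sketch of this masking (PyTorch, single head, shapes assumed for illustration): scores above the diagonal are set to negative infinity before the softmax, so future positions receive zero attention weight.

<syntaxhighlight lang="python">
# Illustrative causal (lower-triangular) masking for decoder self-attention.
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    # q, k, v: (batch, seq_len, d), all derived from the decoder input
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)          # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))    # future positions get zero weight
    weights = F.softmax(scores, dim=-1)
    return weights @ v
</syntaxhighlight>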
==Code==
See [[Hugging Face]], which provides many pretrained transformers such as BERT.
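For example, a pretrained BERT checkpoint can be loaded in a few lines with the Hugging Face transformers library (assuming the transformers and torch packages are installed; the checkpoint name below is one of many available):

<syntaxhighlight lang="python">
# Minimal example: load a pretrained BERT model with Hugging Face transformers.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The transformer is an attention-based architecture.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of tokens, 768)
</syntaxhighlight>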
==Resources==
* [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer]
* [https://www.youtube.com/watch?v=iDulhoQ2pro YouTube video by Yannic Kilcher]
* [https://arxiv.org/abs/2207.09238 Formal Algorithms for Transformers (arXiv 2022)]
==Follow-up work==
* [https://arxiv.org/abs/2112.05682 Memory-efficient attention] reduces the memory overhead of an attention layer to a constant amount (specifically, a scalar and a vector the size of one output feature).
** It processes queries sequentially, which makes it a good fit for weaker GPUs where memory is limited and computation is less parallel due to fewer cores; a sketch of the idea follows this list.
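The sketch below is a rough illustration of that idea rather than the paper's implementation: each query is handled one at a time against the streamed keys and values, keeping only a running maximum, a running softmax denominator, and one output-sized accumulator.

<syntaxhighlight lang="python">
# Rough sketch of memory-efficient attention (not the paper's implementation):
# per query, only a running max (scalar), a running softmax denominator (scalar),
# and one output-sized accumulator vector are kept.
import torch

def memory_efficient_attention(q, k, v):
    # q: (num_queries, d), k: (num_keys, d), v: (num_keys, d_v)
    d = q.size(-1)
    outputs = []
    for qi in q:                                    # one query at a time
        running_max = torch.tensor(float("-inf"))
        denom = torch.tensor(0.0)
        acc = torch.zeros(v.size(-1))
        for kj, vj in zip(k, v):                    # stream over keys/values
            score = (qi @ kj) / (d ** 0.5)
            new_max = torch.maximum(running_max, score)
            scale = torch.exp(running_max - new_max)  # rescale old accumulators
            weight = torch.exp(score - new_max)
            denom = denom * scale + weight
            acc = acc * scale + weight * vj
            running_max = new_max
        outputs.append(acc / denom)
    return torch.stack(outputs)
</syntaxhighlight>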
==References==