Transformer (machine learning model): Difference between revisions
(8 intermediate revisions by the same user not shown) | |||
Line 29: | Line 29: | ||
In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> and get an output for each attention head <math>Z_i</math>. | In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> and get an output for each attention head <math>Z_i</math>. | ||
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block. | These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block. | ||
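The multi-head step described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `softmax`, `attention`, `multi_head_attention`), the random weights, and the dimensions are all assumptions made for the example, and the per-head attention is assumed to be the scaled dot-product form from the original Transformer paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_Q, W_K, W_V) triples, one per attention head."""
    # One output Z_i per head.
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for (w_q, w_k, w_v) in heads]
    # Concatenate the Z_i along the feature axis, one row per position.
    concat = [sum((z[i] for z in outputs), []) for i in range(len(x))]
    # Multiply by the output weight matrix to get Z.
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4  # illustrative sizes only
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_head, d_model, rng)
z = multi_head_attention(x, heads, w_o)
print(len(z), len(z[0]))  # 4 8
```

The output has the same shape as the input, which is what lets it feed the next layer of the block.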
;Self attention | |||
The encoder and parts of the decoder use self-attention, which means the keys, values, and queries are all generated from the same input embedding. | |||
;Encoder-decoder attention | |||
The decoder also uses encoder-decoder attention, where the keys and values come from the output embedding of the encoder (i.e. the encoded input sentence) but the queries are generated from the decoder input (i.e. the previously generated output). | |||
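The only difference between the two attention types is where the queries, keys, and values come from. The following sketch makes that explicit; the function `attend`, the random weights, and the sequence lengths are hypothetical, and the same weight matrices are reused for both calls purely to keep the example short (a real model would have separate weights per layer).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_source, key_value_source, w_q, w_k, w_v):
    """Queries come from one sequence; keys and values from another (or the same)."""
    q = matmul(query_source, w_q)
    k = matmul(key_value_source, w_k)
    v = matmul(key_value_source, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

rng = random.Random(0)

def rand(rows, cols):
    """Hypothetical stand-in for embeddings or learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d = 4
encoder_output = rand(5, d)  # 5 input tokens
decoder_input = rand(3, d)   # 3 previously generated tokens
w_q, w_k, w_v = rand(d, d), rand(d, d), rand(d, d)

# Self-attention: queries, keys, and values all from the same sequence.
self_out, self_w = attend(decoder_input, decoder_input, w_q, w_k, w_v)
# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
cross_out, cross_w = attend(decoder_input, encoder_output, w_q, w_k, w_v)

print(len(self_w), len(self_w[0]))    # 3 3
print(len(cross_w), len(cross_w[0]))  # 3 5
```

In the encoder-decoder case the attention weights form a 3-by-5 matrix: one row per decoder position, one column per encoder position.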
===Encoder=== | ===Encoder=== | ||
Line 38: | Line 44: | ||
===Decoder=== | ===Decoder=== | ||
Each decoder consists of a self-attention, an encoder-decoder attention, and a feed-forward layer. | Each decoder consists of a self-attention, an encoder-decoder attention, and a feed-forward layer. | ||
As with the encoder, each layer is followed by an add-and-normalize residual connection. | |||
The encoder-decoder attention gets its keys and values from the output of the last encoder block. | |||
The same keys and values are passed to all encoder-decoder layers rather than having each layer generate its own. | |||
Note that in the decoder, the self-attention weights must be masked to be lower-triangular so that each position attends only to itself and earlier positions. | |||
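A small sketch of the lower-triangular mask, assuming the common approach of setting masked scores to negative infinity before the softmax. The function names `causal_mask` and `masked_softmax` are made up for this example, and the all-zero scores are chosen only so the resulting weights are easy to read.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out entries receive zero weight."""
    out = []
    for row, mask_row in zip(scores, mask):
        # Masked scores become -inf, so exp() maps them to exactly 0.
        masked = [s if keep else float("-inf") for s, keep in zip(row, mask_row)]
        m = max(masked)  # the diagonal is always kept, so m is finite
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Equal scores everywhere, so each row spreads weight evenly over allowed positions.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

The first row puts all of its weight on position 0, the second splits it evenly between positions 0 and 1, and every entry above the diagonal is zero.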
==Code== | ==Code== | ||
See [[Hugging Face]] | See [[Hugging Face]], which contains many pretrained transformers such as BERT. | ||
==Resources== | ==Resources== | ||
Line 48: | Line 59: | ||
* [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer] | * [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer] | ||
* [https://www.youtube.com/watch?v=iDulhoQ2pro YouTube Video by Yannic Kilcher] | * [https://www.youtube.com/watch?v=iDulhoQ2pro YouTube Video by Yannic Kilcher] | ||
* [https://arxiv.org/abs/2207.09238 Formal Algorithms for Transformers (arXiv 2022)] | |||
==Followup work== | |||
* [https://arxiv.org/abs/2112.05682 Memory-efficient attention] reduces the memory overhead of an attention layer to a constant amount (specifically, a scalar and a vector the size of one output feature). | |||
** This processes queries sequentially, which suits weaker GPUs where memory is limited and computation is less parallel due to fewer cores. | |||
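The constant-memory idea can be sketched for a single query as follows. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, and a running maximum is tracked as one extra scalar for numerical stability alongside the scalar and vector mentioned above.

```python
import math

def attention_one_query(q, keys, values):
    """Attention output for one query, visiting keys/values one at a time.

    Only a running normalizer (scalar), a running maximum (scalar), and a
    running weighted sum (vector the size of one value) are kept in memory.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        w = math.exp(score - new_max)
        denom = denom * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def attention_reference(q, keys, values):
    """Standard softmax attention for one query, storing all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.7], [-1.5, 2.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
streamed = attention_one_query(q, keys, values)
reference = attention_reference(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(streamed, reference)))
```

Looping this function over the queries one at a time gives the sequential processing described in the bullet above, trading parallelism for a memory footprint that does not grow with the number of keys.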
==References== | ==References== |