In multi-headed attention, there are multiple weight matrices <math>W_{K}, W_{Q}, W_{V}</math>, and each attention head produces its own output <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
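
The following is a minimal NumPy sketch of this computation (illustrative, not part of the article; the per-head matrices follow the <math>W_Q, W_K, W_V</math> notation above, while the name <code>W_O</code> for the final projection matrix and the helper-function names are assumptions):

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O):
    # One (W_Q, W_K, W_V) triple per head; each head yields its own Z_i.
    Z_heads = [attention(X_q @ wq, X_kv @ wk, X_kv @ wv)
               for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the Z_i and multiply by the output weight matrix to form Z.
    return np.concatenate(Z_heads, axis=-1) @ W_O
</syntaxhighlight>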
;Self-attention
The encoder and parts of the decoder use self-attention, in which the keys, values, and queries are all generated from the same embedding.
;Encoder-decoder attention
The decoder also uses encoder-decoder attention, where the keys and values come from the encoder (i.e. the input sentence) but the queries come from the decoder input (i.e. the previously generated output); the sketch below contrasts the two forms.
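
Using the <code>multi_head_attention</code> sketch above, the two forms differ only in where the queries, keys, and values are drawn from (illustrative; the <math>d_{model} = 512</math>, 8-head, <math>d_k = 64</math> sizes are those of the original Transformer paper, and the random arrays stand in for real embeddings):

<syntaxhighlight lang="python">
rng = np.random.default_rng(0)
n_heads, d_model, d_head = 8, 512, 64   # sizes used in the original paper
W_Q = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_K = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_V = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_head, d_model))

enc = rng.standard_normal((10, d_model))  # encoder states (input sentence)
dec = rng.standard_normal((7, d_model))   # decoder states (output so far)

# Self-attention: queries, keys, and values all come from the same sequence.
Z_self = multi_head_attention(enc, enc, W_Q, W_K, W_V, W_O)

# Encoder-decoder attention: queries come from the decoder,
# while keys and values come from the encoder.
Z_cross = multi_head_attention(dec, enc, W_Q, W_K, W_V, W_O)
</syntaxhighlight>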


===Encoder===