In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> and get an output for each attention head <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
;Self attention
The encoder, and parts of the decoder, use self-attention, in which the keys, values, and queries are all generated from the same embedding.
;Encoder-decoder attention
The decoder also uses encoder-decoder attention, in which the keys and values come from the encoder (i.e. the input sentence) but the queries come from the decoder input (i.e. the previously-generated output).
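The per-head computation described above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function names, dimensions, and random weights are all assumptions, and details such as masking and bias terms are omitted. Passing the same matrix for the query input and the key/value input gives self-attention; passing the encoder output as the key/value input gives encoder-decoder attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X_q, X_kv, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) tuples, one per attention head
    Zs = []
    for W_Q, W_K, W_V in heads:
        Q = X_q @ W_Q    # queries come from the query input
        K = X_kv @ W_K   # keys and values come from the key/value input
        V = X_kv @ W_V
        d_k = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention weights
        Zs.append(A @ V)                     # per-head output Z_i
    # concatenate the Z_i and multiply by W_O to form the output Z
    return np.concatenate(Zs, axis=-1) @ W_O

# Illustrative dimensions (hypothetical, much smaller than real models)
rng = np.random.default_rng(0)
d_model, d_k, n_heads, seq_len = 8, 4, 2, 5
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(seq_len, d_model))

# Self-attention: queries, keys, and values all from the same embedding X
Z = multi_head_attention(X, X, heads, W_O)
print(Z.shape)  # one output vector per input position, back at d_model
```

For encoder-decoder attention, the call would instead be `multi_head_attention(decoder_X, encoder_output, heads, W_O)`, so the decoder queries attend over the encoder's keys and values.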
===Encoder===