
Transformer (machine learning model)

Each decoder consists of a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer.
As with the encoder, each layer is followed by an add-and-normalize residual connection.   
The encoder-decoder attention takes its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder attention layers rather than having each layer compute its own.
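
The following is a minimal sketch of one such decoder layer in PyTorch. The class name <code>DecoderLayer</code>, the layer dimensions, and the use of <code>torch.nn.MultiheadAttention</code> are illustrative assumptions, not details taken from the architecture description above.

<syntaxhighlight lang="python">
import torch
from torch import nn

class DecoderLayer(nn.Module):
    """One decoder layer: self-attention, encoder-decoder attention, and a
    feed-forward network, each followed by an add-and-normalize step."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask=None):
        # Self-attention over the decoder's own inputs.
        attn, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn)            # add and normalize
        # Encoder-decoder attention: queries come from the decoder,
        # keys and values from the output of the last encoder block.
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)            # add and normalize
        # Position-wise feed-forward network.
        x = self.norm3(x + self.ff(x))      # add and normalize
        return x

# The same encoder output (and therefore the same keys and values)
# is passed to every decoder layer.
layers = nn.ModuleList(DecoderLayer() for _ in range(6))
x = torch.rand(1, 10, 512)        # decoder input embeddings
enc_out = torch.rand(1, 20, 512)  # output of the last encoder block
for layer in layers:
    x = layer(x, enc_out)
</syntaxhighlight>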


==Code==