Each decoder consists of a self-attention, an encoder-decoder attention, and a feed-forward layer.
As with the encoder, each layer is followed by an add-and-normalize residual connection.
The encoder-decoder attention gets its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder layers rather than having each layer generate its own.
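To make this data flow concrete, here is a minimal PyTorch sketch of a decoder stack (the class name, dimensions, and six-block depth are illustrative assumptions, not taken from this article). Note how every block receives the same encoder output as its keys and values:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # One decoder block: self-attention -> encoder-decoder attention ->
    # feed-forward, each followed by an add-and-normalize residual connection.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None):
        # Self-attention over the decoder's own inputs.
        attn, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + attn)
        # Encoder-decoder attention: queries come from the decoder,
        # keys and values from the final encoder block's output.
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)
        # Position-wise feed-forward, again with add-and-normalize.
        return self.norm3(x + self.ff(x))

# The same encoder output is passed to every block, so all
# encoder-decoder attention layers share the same keys and values.
blocks = nn.ModuleList(DecoderBlock() for _ in range(6))
x = torch.randn(1, 10, 512)        # (batch, target length, d_model)
enc_out = torch.randn(1, 12, 512)  # output of the last encoder block
for block in blocks:
    x = block(x, enc_out)
</syntaxhighlight>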
==Code==