
Transformer (machine learning model)

Each decoder consists of a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer.
As with the encoder, each layer is followed by an add-and-normalize residual connection.   
The encoder-decoder attention takes its keys and values from the output of the last encoder block.
The same keys and values are passed to all encoder-decoder attention layers rather than having each layer compute its own.
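
The following is a minimal sketch of one such decoder layer in PyTorch. The class name <code>DecoderLayer</code>, the layer dimensions, and the use of <code>torch.nn.MultiheadAttention</code> are illustrative assumptions, not details taken from the architecture description above.

<syntaxhighlight lang="python">
import torch
from torch import nn

class DecoderLayer(nn.Module):
    """One decoder layer: self-attention, encoder-decoder attention, and a
    feed-forward network, each followed by an add-and-normalize step."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask=None):
        # Self-attention over the decoder's own inputs.
        attn, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn)            # add and normalize
        # Encoder-decoder attention: queries come from the decoder,
        # keys and values from the output of the last encoder block.
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)            # add and normalize
        # Position-wise feed-forward network.
        x = self.norm3(x + self.ff(x))      # add and normalize
        return x

# The same encoder output (and therefore the same keys and values)
# is passed to every decoder layer.
layers = nn.ModuleList(DecoderLayer() for _ in range(6))
x = torch.rand(1, 10, 512)        # decoder input embeddings
enc_out = torch.rand(1, 20, 512)  # output of the last encoder block
for layer in layers:
    x = layer(x, enc_out)
</syntaxhighlight>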


==Code==