attention and feedforward sublayers.<br>
[[File:Transformer architecture.png|500px]]
===Positional Encoding===
The positional encoding is a set of sine and cosine waves of different frequencies which is added (not concatenated) to the word embeddings. See the original paper for details.
Today, learned positional embeddings are commonly used instead.
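A minimal NumPy sketch of the sinusoidal encoding from the original paper (function name and dimensions are illustrative, not from this page):
<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimensions 0, 2, ...
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even indices: sine
    pe[:, 1::2] = np.cos(angles)                      # odd indices: cosine
    return pe

# The encoding is added (not concatenated) to the word embeddings:
# x = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
</syntaxhighlight>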
===Attention===
Attention is the main contribution of the transformer architecture.<br>
In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> and get an output for each attention head <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
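A rough NumPy sketch of this computation (single sequence, no masking; names are illustrative, not from this page):
<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # (seq_len, d_v)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists with one projection matrix per head; W_O recombines."""
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    Z = np.concatenate(heads, axis=-1)                # concatenate the Z_i of all heads
    return Z @ W_O                                    # output Z, fed to the next layer
</syntaxhighlight>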
===Encoder===
The encoder consists of N=6 blocks, each with 2 layers.<br>
Each block contains a multi-headed attention layer followed by a feed-forward layer.<br>
The feed-forward layer is applied to each word individually.<br>
Each of the two layers within a block is a residual layer with an add-and-normalize residual connection.
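A schematic of one encoder block in the same style, reusing the multi_head_attention sketch above; layer-norm gain/bias parameters are omitted and all names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Feed-forward applied to each word (position) independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2       # ReLU between two linear maps

def encoder_block(x, attn_weights, ffn_weights):
    # Layer 1: multi-headed self-attention with add-and-normalize.
    x = layer_norm(x + multi_head_attention(x, *attn_weights))
    # Layer 2: position-wise feed-forward with add-and-normalize.
    x = layer_norm(x + position_wise_ffn(x, *ffn_weights))
    return x
</syntaxhighlight>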
===Decoder===
Each decoder block consists of a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer.
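Sketched with the helpers from the sections above (the causal mask on the decoder self-attention is omitted for brevity; names are illustrative):
<syntaxhighlight lang="python">
def cross_attention(Y, M, W_Q, W_K, W_V, W_O):
    """Encoder-decoder attention: queries from the decoder states Y, keys/values from the encoder output M."""
    heads = [scaled_dot_product_attention(Y @ wq, M @ wk, M @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

def decoder_block(y, enc_out, self_attn_w, cross_attn_w, ffn_w):
    # Self-attention over the decoder's own inputs (mask omitted in this sketch).
    y = layer_norm(y + multi_head_attention(y, *self_attn_w))
    # Encoder-decoder attention over the encoder output.
    y = layer_norm(y + cross_attention(y, enc_out, *cross_attn_w))
    # Position-wise feed-forward.
    y = layer_norm(y + position_wise_ffn(y, *ffn_w))
    return y
</syntaxhighlight>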
==Code==