attention and feedforward sublayers.<br>
[[File:Transformer architecture.png|500px]]
===Positional Encoding===
The positional encoding is a set of sine and cosine waves of different frequencies which is added (not concatenated) to the word embeddings. See the original paper for details.
Today, learned positional embeddings are commonly used instead.
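A minimal NumPy sketch of the sinusoidal encoding from the original paper (function name and dimensions are illustrative, not from this page):
<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimensions 0, 2, ...
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even indices: sine
    pe[:, 1::2] = np.cos(angles)                      # odd indices: cosine
    return pe

# The encoding is added (not concatenated) to the word embeddings:
# x = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
</syntaxhighlight>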
===Attention===
Attention is the main contribution of the transformer architecture.<br>
In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> and get an output for each attention head <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
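A rough NumPy sketch of this computation (single sequence, no masking; names are illustrative, not from this page):
<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # (seq_len, d_v)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists with one projection matrix per head; W_O recombines."""
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    Z = np.concatenate(heads, axis=-1)                # concatenate the Z_i of all heads
    return Z @ W_O                                    # output Z, fed to the next layer
</syntaxhighlight>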
===Encoder===
The encoder consists of N=6 blocks, each with 2 layers.<br>
Each block contains a multi-headed attention layer followed by a feed-forward layer.<br>
The feed-forward layer is applied to each word individually.<br>
Each of the two layers within a block is a residual layer with an add-and-normalize residual connection.
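A schematic of one encoder block in the same style, reusing the multi_head_attention sketch above; layer-norm gain/bias parameters are omitted and all names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Feed-forward applied to each word (position) independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2       # ReLU between two linear maps

def encoder_block(x, attn_weights, ffn_weights):
    # Layer 1: multi-headed self-attention with add-and-normalize.
    x = layer_norm(x + multi_head_attention(x, *attn_weights))
    # Layer 2: position-wise feed-forward with add-and-normalize.
    x = layer_norm(x + position_wise_ffn(x, *ffn_weights))
    return x
</syntaxhighlight>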
===Decoder===
Each decoder block consists of a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer.
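Sketched with the helpers from the sections above (the causal mask on the decoder self-attention is omitted for brevity; names are illustrative):
<syntaxhighlight lang="python">
def cross_attention(Y, M, W_Q, W_K, W_V, W_O):
    """Encoder-decoder attention: queries from the decoder states Y, keys/values from the encoder output M."""
    heads = [scaled_dot_product_attention(Y @ wq, M @ wk, M @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

def decoder_block(y, enc_out, self_attn_w, cross_attn_w, ffn_w):
    # Self-attention over the decoder's own inputs (mask omitted in this sketch).
    y = layer_norm(y + multi_head_attention(y, *self_attn_w))
    # Encoder-decoder attention over the encoder output.
    y = layer_norm(y + cross_attention(y, enc_out, *cross_attn_w))
    # Position-wise feed-forward.
    y = layer_norm(y + position_wise_ffn(y, *ffn_w))
    return y
</syntaxhighlight>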
==Code==