
Transformer (machine learning model)

attention and feedforward sublayers.<br>
[[File:Transformer architecture.png|500px]]
===Positional Encoding===
The positional encoding is a set of sine and cosine waves of different frequencies which is added (not concatenated) to the word embeddings. See the original paper for details.
Today, learned positional embeddings are often used instead.
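A minimal NumPy sketch of the sinusoidal encoding described above; the function name and exact vectorization are illustrative, not taken from any reference implementation.
<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares the frequency 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

# The encoding is added (not concatenated) to the word embeddings:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
</syntaxhighlight>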
===Attention===
Attention is the main contribution of the transformer architecture.<br>


In multi-headed attention, you have multiple <math>W_{K}, W_{Q}, W_{V}</math> matrices and get an output <math>Z_i</math> for each attention head.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next layer of the block.
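Below is a minimal PyTorch sketch of multi-headed attention as just described. For simplicity it uses a single <math>d_{model} \times d_{model}</math> matrix for each of <math>W_{Q}, W_{K}, W_{V}</math> and splits it across heads (equivalent to giving each head its own smaller projection); the batch dimension is omitted, and all function and variable names are illustrative.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    """x: (seq_len, d_model); each W_*: (d_model, d_model).
    d_model must be divisible by num_heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(0, 1)

    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

    # Scaled dot-product attention, computed independently in each head
    scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # (num_heads, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    Z_heads = weights @ V                              # Z_i for each head

    # Concatenate the Z_i and multiply by another weight matrix W_O to get Z
    Z = Z_heads.transpose(0, 1).reshape(seq_len, d_model)
    return Z @ W_O
</syntaxhighlight>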


===Encoder===
The encoder consists of N=6 blocks, each with 2 layers.<br>
Each block contains a multi-headed attention layer followed by a feed-forward layer.<br>
The feed-forward layer is applied to each word individually.<br>
Each of the two layers in a block is wrapped in a residual connection with an add-and-normalize step.
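A rough PyTorch sketch of one encoder block matching this description; it uses <code>nn.MultiheadAttention</code> as a stand-in for the attention layer, takes the original paper's base sizes (d_model=512, 8 heads, d_ff=2048) as defaults, and all class and argument names are illustrative.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One of the N=6 encoder blocks: multi-headed attention, then a
    feed-forward layer applied to each position, each wrapped in an
    add-and-normalize residual connection."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention over the sequence
        x = self.norm1(x + attn_out)         # add & normalize (residual 1)
        x = self.norm2(x + self.ff(x))       # per-position feed-forward (residual 2)
        return x
</syntaxhighlight>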


===Decoder===
 
Each decoder block consists of a self-attention layer (masked so that each position can only attend to earlier positions), an encoder-decoder attention layer, and a feed-forward layer.
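A matching PyTorch sketch of one decoder block, again with <code>nn.MultiheadAttention</code> standing in for both attention layers; sizes and names are illustrative, as above.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, encoder-decoder attention
    (queries from the decoder, keys/values from the encoder output), and a
    feed-forward layer, each with an add-and-normalize residual connection."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):           # x: (batch, tgt_len, d_model)
        t = x.size(1)
        # Causal mask: position i may not attend to positions j > i (True = blocked)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device),
                            diagonal=1)
        y, _ = self.self_attn(x, x, x, attn_mask=causal)   # masked self-attention
        x = self.norm1(x + y)
        y, _ = self.cross_attn(x, enc_out, enc_out)        # encoder-decoder attention
        x = self.norm2(x + y)
        x = self.norm3(x + self.ff(x))                     # per-position feed-forward
        return x
</syntaxhighlight>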


==Code==