===Attention===
Attention is the main contribution of the transformer architecture.<br>
[[File:Transformer attention.png|500px]]<br>
The attention block outputs a weighted average of values in a dictionary of key-value pairs.<br>
In the image above:<br>
The attention block can be represented as the following equation:
* <math>\operatorname{SoftMax}(\frac{QK^T}{\sqrt{d_k}})V</math>
===Encoder===
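The equation above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention, not a full multi-head implementation; the shapes assumed here (queries <math>Q</math> of shape <code>(n_q, d_k)</code>, keys <math>K</math> of shape <code>(n_k, d_k)</code>, values <math>V</math> of shape <code>(n_k, d_v)</code>) follow the equation's matrix dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: SoftMax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_q, n_k) similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                # weighted average of the value vectors
```

Each output row is a convex combination of the rows of <math>V</math>, with weights given by how well the corresponding query matches each key; the <math>\sqrt{d_k}</math> scaling keeps the dot products from saturating the softmax as the key dimension grows.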
The encoder receives as input the input embedding added to a positional encoding.<br>
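As a concrete sketch of that addition, the following assumes the sinusoidal positional encoding used in the original transformer paper (learned encodings are also common); the function name and the <code>10000</code> base are taken from that formulation, not from this article.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: even columns get sin, odd columns get cos,
    # with wavelengths forming a geometric progression from 2*pi to 10000*2*pi.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoder input is then the element-wise sum:
#   x = embedding + positional_encoding(seq_len, d_model)
# where `embedding` has shape (seq_len, d_model).
```

Because attention itself is permutation-invariant, this added signal is what lets the encoder distinguish token positions.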