Transformer (machine learning model)
{{main | Wikipedia: Transformer (machine learning model)}}
[https://arxiv.org/abs/1706.03762 Attention is all you need paper]<br>
A neural network architecture introduced by Google.<br>
It is currently the state of the art on NLP tasks and has largely replaced RNNs for them.
==Architecture==
The Transformer uses an encoder-decoder architecture.
Both the encoder and the decoder consist of multiple identical layers, each containing
attention and feed-forward sublayers.<br>
[[File:Transformer architecture.png|500px]]
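The encoder-decoder wiring described above can be sketched in Python. This is only a schematic of the data flow, not the paper's implementation; the placeholder layers below stand in for the real attention and feed-forward sublayers.

```python
def transformer(src, tgt, encoder_layers, decoder_layers):
    # the encoder maps the source sequence to a "memory" representation
    memory = src
    for enc in encoder_layers:       # each: self-attention + feed-forward
        memory = enc(memory)
    # the decoder attends to its own outputs and to the encoder memory
    out = tgt
    for dec in decoder_layers:       # each: self-attention + cross-attention + feed-forward
        out = dec(out, memory)
    return out

# toy placeholder layers, just to show the wiring
enc_layers = [lambda x: x + 1 for _ in range(6)]
dec_layers = [lambda x, m: x + m for _ in range(6)]
result = transformer(0, 0, enc_layers, dec_layers)
```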
===Attention===
Attention is the main contribution of the Transformer architecture.<br>
[[File:Transformer attention.png|500px]]
The attention block outputs a weighted average of values in a dictionary of key-value pairs.<br>
In the image above:<br>
* <math>Q</math> represents queries (each query is a vector)
* <math>K</math> represents keys
* <math>V</math> represents values
The attention block can be represented as the following equation:
* <math>\operatorname{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V</math>
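The equation above can be sketched in a few lines of NumPy. This is a minimal illustrative version (the function name and the toy shapes are our own, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute SoftMax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys)
    # numerically stable softmax over the key dimension
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted average of the values

# toy example: 2 queries attending over 3 key-value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4)
```

Each output row is a convex combination of the rows of <math>V</math>, with weights determined by how well the query matches each key; the <math>\sqrt{d_k}</math> scaling keeps the dot products from saturating the softmax for large key dimensions.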
===Encoder===
The encoder receives as input the sum of the input embedding and a positional encoding.<br>
The encoder consists of N=6 identical layers, each with 2 sublayers.<br>
Each layer contains a multi-headed attention sublayer followed by a feed-forward sublayer.<br>
Both sublayers are residual blocks.<br>
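A minimal single-head NumPy sketch of the encoder stack, assuming sinusoidal positional encodings and post-sublayer layer normalization as in the original paper (the dimensions and parameter initialization here are toy choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention (the paper uses multiple heads)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def feed_forward(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2   # position-wise ReLU MLP

def encoder_layer(x, p):
    # both sublayers are residual: LayerNorm(x + Sublayer(x))
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    x = layer_norm(x + feed_forward(x, p["W1"], p["W2"]))
    return x

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# input embeddings plus positional encoding, then N=6 encoder layers
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
params = [
    {k: rng.normal(size=s) * 0.1
     for k, s in [("Wq", (d_model, d_model)), ("Wk", (d_model, d_model)),
                  ("Wv", (d_model, d_model)), ("W1", (d_model, d_ff)),
                  ("W2", (d_ff, d_model))]}
    for _ in range(6)
]
for p in params:
    x = encoder_layer(x, p)
print(x.shape)  # (5, 8)
```

The residual connections let each sublayer learn a correction to its input rather than a full transformation, which makes the 6-layer stack easier to train.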
===Decoder===
==Resources==
;Guides and explanations
* [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer]
* [https://www.youtube.com/watch?v=iDulhoQ2pro Youtube Video]
Revision as of 19:32, 5 December 2019