Transformer (machine learning model)

Revision as of 15:20, 23 November 2020 by David (talk | contribs) (→‎Attention)

The Transformer is a neural network architecture introduced by Google in the paper "Attention Is All You Need" (Vaswani et al., 2017).
As of this revision, it achieves state-of-the-art results on NLP tasks and has largely replaced RNNs for them.

Architecture

The Transformer uses an encoder-decoder architecture. Both the encoder and the decoder consist of a stack of identical layers, each with an attention sublayer and a feedforward sublayer.

Attention

Attention is the main contribution of the Transformer architecture.
The attention block outputs a weighted average of the values in a dictionary of key-value pairs.
In the equation below:

  • \(\displaystyle Q\) represents queries (each query is a vector)
  • \(\displaystyle K\) represents keys
  • \(\displaystyle V\) represents values

The attention block can be represented as the following equation:

  • \(\displaystyle Z = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)

Each word embedding gets its own query, key, and value vectors, which are stacked alongside those of the other words to form the \(\displaystyle Q\), \(\displaystyle K\), and \(\displaystyle V\) matrices.
These matrices are computed by multiplying the embedding matrix \(\displaystyle X\) by learned weight matrices \(\displaystyle W_{Q}, W_{K}, W_{V}\).
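The computation above can be sketched in NumPy. The matrix names follow the article; the dimensions (4 tokens, model width 8) are illustrative choices, not values from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention over a sequence of embeddings X."""
    Q = X @ W_Q                          # queries, shape (seq_len, d_k)
    K = X @ W_K                          # keys,    shape (seq_len, d_k)
    V = X @ W_V                          # values,  shape (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the values

# Toy example: 4 tokens, embedding dimension 8, d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
Z = attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (4, 8)
```

Each row of the softmax output is a probability distribution over positions, so each output row is a convex combination of the value vectors.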

In multi-headed attention, there are multiple sets of weights \(\displaystyle W_{K}, W_{Q}, W_{V}\), and each attention head produces its own output \(\displaystyle Z_i\).
These outputs are concatenated and multiplied by another weight matrix to form the final output \(\displaystyle Z\), which is the input to the next encoder block.
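A minimal NumPy sketch of the concatenate-and-project step, using the same notation. The head count (2), per-head dimension (4), and the name of the output projection matrix `W_O` are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per attention head."""
    # Each head attends independently with its own projections ...
    Z_heads = [attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    # ... then the head outputs are concatenated and projected by W_O.
    return np.concatenate(Z_heads, axis=-1) @ W_O

# Toy example: 4 tokens, model width 8, 2 heads of dimension 4 each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
Z = multi_head_attention(X, heads, W_O)
print(Z.shape)  # (4, 8) — same shape as the input, ready for the next block
```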

Encoder

The encoder receives as input the sum of the input embedding and a positional encoding.
It consists of N = 6 identical blocks, each containing a multi-headed attention layer followed by a feed-forward layer.
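The sinusoidal positional encoding from the paper can be sketched as follows; the sequence length and model width here are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]       # positions,  shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # freq index, shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return pe

pe = positional_encoding(10, 16)
# The encoder input is the token embedding plus this encoding:
#   X_input = X_embed + pe
```

Because each dimension oscillates at a different frequency, every position gets a distinct vector, and the encoding can be computed for any sequence length without learned parameters.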

Decoder

Code

See the Hugging Face Transformers library.

Resources

Guides and explanations

References