* <math>V</math> represents values
The attention block can be represented as the following equation:
* <math>Z = \operatorname{SoftMax}(\frac{QK^T}{\sqrt{d_k}})V</math>
 
Each word embedding gets its own key, query, and value vector; these are stacked alongside those of the other words to form the <math>K</math>, <math>Q</math>, and <math>V</math> matrices.
These matrices are computed by multiplying the embedding matrix <math>X</math> by the weight matrices <math>W_{K}, W_{Q}, W_{V}</math>.
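A minimal NumPy sketch of this computation; the function names and matrix shapes are illustrative assumptions, not taken from any particular implementation:
<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q and K have shape (sequence length, d_k); V has shape (sequence length, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # Z: weighted sum of the value vectors

def project_qkv(X, W_Q, W_K, W_V):
    # The Q, K, V matrices come from the embedding matrix X via learned projections.
    return X @ W_Q, X @ W_K, X @ W_V
</syntaxhighlight>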
 
In multi-headed attention, there are several sets of weight matrices <math>W_{K}, W_{Q}, W_{V}</math>, and each attention head produces its own output <math>Z_i</math>.
These outputs are concatenated and multiplied by a further weight matrix to form the output <math>Z</math>, which is the input to the next encoder block.
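The multi-head combination can be sketched in the same style; the head count, dimensions, and names below are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) tuples, one per attention head.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        Z_i = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # per-head output Z_i
        outputs.append(Z_i)
    # Concatenate the per-head outputs and project with W_O to form Z.
    return np.concatenate(outputs, axis=-1) @ W_O

# Illustrative shapes: 4 tokens, embedding width 8, two heads of width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
Z = multi_head_attention(X, heads, W_O)   # shape (4, 8); the input to the next encoder block
</syntaxhighlight>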


===Encoder===