* <math>V</math> represents values
The attention block can be represented as the following equation:
* <math>Z = \operatorname{SoftMax}(\frac{QK^T}{\sqrt{d_k}})V</math>
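The equation above can be sketched in a few lines of numpy. This is an illustrative implementation, not from the article; the function and variable names are chosen for clarity, and the matrices are random stand-ins for real query, key, and value vectors.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; subtracting the row max avoids overflow
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: SoftMax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) similarity scores
    return softmax(scores) @ V       # each row: weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
Z = attention(Q, K, V)
print(Z.shape)  # (4, 8)
```

Dividing by <math>\sqrt{d_k}</math> keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with near-zero gradients.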
Each embedded word gets its own query, key, and value vectors, which are stacked alongside those of the other words to form the Q, K, and V matrices.
These K, Q, and V matrices are computed by multiplying the embedding matrix <math>X</math> with weights <math>W_{K}, W_{Q}, W_{V}</math>. | |||
In multi-headed attention, there are multiple sets of weights <math>W_{K}, W_{Q}, W_{V}</math>, and each attention head produces its own output <math>Z_i</math>.
These are concatenated and multiplied by another weight matrix to form the output <math>Z</math>, the input to the next encoder block. | |||
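The multi-head procedure described above can be sketched as follows. Again this is an illustrative numpy sketch rather than the article's own code; the dimensions and the name `W_O` for the output projection matrix are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V); Q, K, V come from
    # multiplying the embedding matrix X by those weights.
    Z_parts = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        d_k = K.shape[-1]
        Z_parts.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    # Concatenate the per-head outputs Z_i and project with W_O
    # to form Z, the input to the next encoder block.
    return np.concatenate(Z_parts, axis=-1) @ W_O

# Toy dimensions: 5 tokens, model width 16, 4 heads of width 4
rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq = 16, 4, 4, 5
X = rng.standard_normal((seq, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_head, d_model))
Z = multi_head_attention(X, heads, W_O)
print(Z.shape)  # (5, 16)
```

Note that the output projection restores the model width, so the block's output <math>Z</math> has the same shape as its input <math>X</math> and can feed the next encoder block directly.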
===Encoder===