==A==
* Activation function - A nonlinear function applied after each linear layer in a neural network; typically ReLU, though tanh or sine are also used.
* Adam optimizer - A popular gradient descent optimizer that includes momentum and per-parameter learning rates.
* Attention - A component of [[Transformer_(machine_learning_model)|transformers]] that computes the product of query and key embeddings to measure the interaction between elements (a minimal sketch follows this list).
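The sketch below illustrates the attention entry above with scaled dot-product attention in plain NumPy. The softmax normalization and the 1/sqrt(d) scaling are standard transformer practice rather than part of the one-sentence definition, and all names and shapes are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative scaled dot-product attention for one sequence.

    Q, K, V: arrays of shape (seq_len, d) holding query, key, and value
    embeddings. Returns an array of shape (seq_len, d).
    """
    d = Q.shape[-1]
    # Interaction between elements: dot products of queries and keys,
    # scaled by sqrt(d) as is conventional in transformers.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value embeddings.
    return weights @ V

# Example: 4 sequence elements with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
</syntaxhighlight>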
==B==