Transformer (machine learning model)

* [https://www.youtube.com/watch?v=iDulhoQ2pro YouTube video by Yannic Kilcher]
* [https://arxiv.org/abs/2207.09238 Formal Algorithms for Transformers (arXiv 2022)]
==Follow-up work==
* [https://arxiv.org/abs/2112.05682 Memory-efficient attention] reduces the working memory of an attention layer to a constant amount per query (a running scalar and a vector the size of one output feature).
** It processes queries sequentially, which suits weaker GPUs where memory is limited and computation is less parallel due to fewer cores; a minimal sketch follows below.
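The constant-memory bound comes from a streaming ("online") softmax: for each query, a running logit maximum, a running softmax denominator, and a running weighted sum of values are updated one key at a time, so nothing proportional to the sequence length is ever stored. Below is a minimal NumPy sketch of this idea for a single query; the function name is illustrative and the usual 1/&radic;d logit scaling is omitted for brevity, so this is a sketch of the technique rather than the paper's JAX implementation.

<syntaxhighlight lang="python">
import numpy as np

def memory_efficient_attention_single_query(q, K, V):
    """Attention output for one query, streaming over keys/values.

    Keeps only a running logit maximum (for numerical stability),
    a running softmax denominator (a scalar), and a running weighted
    sum of values (one output-sized vector) -- constant memory in the
    sequence length, in the spirit of arXiv:2112.05682.
    """
    m = -np.inf                    # running maximum of the logits
    s = 0.0                        # running softmax denominator
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    for k, v in zip(K, V):
        logit = float(q @ k)
        m_new = max(m, logit)
        scale = np.exp(m - m_new)  # rescales old accumulators; exp(-inf) = 0 on the first step
        w = np.exp(logit - m_new)
        s = s * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / s

# Quick check against the naive quadratic-memory computation:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(10, 4)), rng.normal(size=(10, 8))
logits = q @ K.T
weights = np.exp(logits - np.max(logits))
naive = (weights / weights.sum()) @ V
assert np.allclose(memory_efficient_attention_single_query(q, K, V), naive)
</syntaxhighlight>

In practice the paper processes keys and queries in chunks rather than strictly one at a time, trading a small amount of memory for parallelism; the single-key loop above is the simplest form of the recurrence.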


==References==